Apache Kylin Advanced: Configuration

In day-to-day Apache Kylin operations, configuration parameters are typically tuned according to the logs produced at runtime, in order to improve performance and stability. The official Kylin site does not document or explain these settings, so this article introduces Kylin's configuration.

There are four configuration files under ${KYLIN_HOME}/conf:

  • kylin_hive_conf.xml

  • kylin_job_conf_inmem.xml

  • kylin_job_conf.xml

  • kylin.properties

kylin_hive_conf.xml contains the settings Kylin applies when it submits tasks to Hive, while kylin_job_conf_inmem.xml and kylin_job_conf.xml contain the settings for the jobs Kylin submits to YARN. Users may tune all three to suit their environment, as in the sketch below.
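These files use the standard Hadoop configuration XML format. As a minimal sketch of what an override in kylin_job_conf.xml might look like (the property shown is an ordinary Hadoop setting, chosen purely for illustration):

    <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
    </property>

The remainder of this article walks through the important entries in kylin.properties: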

kylin.server.mode=all

The run mode of the Kylin server: all, job, or query. For details see: zhuanlan.zhihu.com/p/22

kylin.rest.servers=hostname1:7070,hostname2:7070,hostname3:7070

The list of Kylin instance servers. Note: this does not include instances running in job mode!
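As a hedged illustration (all hostnames hypothetical), a three-node deployment might dedicate one instance to building cubes and two to answering queries, with only the query instances appearing in the list:

    # on hostname1 and hostname2 (query instances)
    kylin.server.mode=query
    # on hostname3 (build instance)
    kylin.server.mode=job
    # on every node: list only the non-job instances
    kylin.rest.servers=hostname1:7070,hostname2:7070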

kylin.metadata.url=kylin_metadata@hbase

The Kylin metadata storage setting. For details see: zhuanlan.zhihu.com/p/22

kylin.job.retry=0

The number of retries for a Kylin job. Note: "job" here means the job generated by a cube build or refresh, not the MapReduce job of each individual step.

kylin.job.mapreduce.default.reduce.input.mb=500

The maximum input per reducer when Kylin submits a job to Hadoop. This parameter determines the number of reducers of the MapReduce job, as the following code shows:

public double getDefaultHadoopJobReducerInputMB() {
    return Double.parseDouble(getOptional("kylin.job.mapreduce.default.reduce.input.mb", "500"));
}

protected void setReduceTaskNum(Job job, KylinConfig config, String cubeName, int level) throws ClassNotFoundException, IOException, InterruptedException, JobException {
    Configuration jobConf = job.getConfiguration();
    KylinConfig kylinConfig = KylinConfig.getInstanceFromEnv();

    CubeDesc cubeDesc = CubeManager.getInstance(config).getCube(cubeName).getDescriptor();
    kylinConfig = cubeDesc.getConfig();

    double perReduceInputMB = kylinConfig.getDefaultHadoopJobReducerInputMB();
    double reduceCountRatio = kylinConfig.getDefaultHadoopJobReducerCountRatio();

    // total map input MB
    double totalMapInputMB = this.getTotalMapInputMB();

    // output / input ratio
    int preLevelCuboids, thisLevelCuboids;
    if (level == 0) { // base cuboid
        preLevelCuboids = thisLevelCuboids = 1;
    } else { // n-cuboid
        int[] allLevelCount = CuboidCLI.calculateAllLevelCount(cubeDesc);
        preLevelCuboids = allLevelCount[level - 1];
        thisLevelCuboids = allLevelCount[level];
    }

    // total reduce input MB
    double totalReduceInputMB = totalMapInputMB * thisLevelCuboids / preLevelCuboids;

    // number of reduce tasks
    int numReduceTasks = (int) Math.round(totalReduceInputMB / perReduceInputMB * reduceCountRatio);

    // adjust reducer number for cube which has DISTINCT_COUNT measures for better performance
    if (cubeDesc.hasMemoryHungryMeasures()) {
        numReduceTasks = numReduceTasks * 4;
    }

    // at least 1 reducer
    numReduceTasks = Math.max(1, numReduceTasks);
    // no more than 5000 reducer by default
    numReduceTasks = Math.min(kylinConfig.getHadoopJobMaxReducerNumber(), numReduceTasks);

    jobConf.setInt(MAPRED_REDUCE_TASKS, numReduceTasks);

    logger.info("Having total map input MB " + Math.round(totalMapInputMB));
    logger.info("Having level " + level + ", pre-level cuboids " + preLevelCuboids + ", this level cuboids " + thisLevelCuboids);
    logger.info("Having per reduce MB " + perReduceInputMB + ", reduce count ratio " + reduceCountRatio);
    logger.info("Setting " + MAPRED_REDUCE_TASKS + "=" + numReduceTasks);
}

Users can adjust this value according to their data volume, their performance requirements, and the mapred-site.xml settings of the Hadoop cluster.
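As a worked illustration of the formula above, here is a self-contained sketch with hypothetical numbers; reduceCountRatio is the value returned by getDefaultHadoopJobReducerCountRatio and is assumed here to default to 1.0:

    // Hypothetical walk-through of the reducer-count formula shown above.
    public class ReducerCountExample {
        public static void main(String[] args) {
            double totalMapInputMB = 10240;  // assume ~10 GB of map input
            double perReduceInputMB = 500;   // kylin.job.mapreduce.default.reduce.input.mb
            double reduceCountRatio = 1.0;   // assumed default of the count ratio
            int preLevelCuboids = 8;         // hypothetical cuboid counts
            int thisLevelCuboids = 6;

            // 10240 * 6 / 8 = 7680 MB of reduce input
            double totalReduceInputMB = totalMapInputMB * thisLevelCuboids / preLevelCuboids;
            // 7680 / 500 * 1.0 ≈ 15 reducers
            int numReduceTasks = (int) Math.round(totalReduceInputMB / perReduceInputMB * reduceCountRatio);
            numReduceTasks = Math.max(1, numReduceTasks);

            System.out.println("reducers = " + numReduceTasks); // prints: reducers = 15
        }
    }

Halving kylin.job.mapreduce.default.reduce.input.mb roughly doubles the reducer count, so this knob trades per-reducer memory pressure against task-scheduling overhead.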

kylin.job.run.as.remote.cmd=false

This setting controls whether Kylin issues its CLI commands to Hadoop, HBase, Hive, and so on over SSH. Kylin is normally deployed on a client node of the Hadoop cluster, in which case the value is false. If the Kylin service is not deployed on a Hadoop client node, set it to true; Kylin then needs the following settings in order to reach the Hadoop cluster:

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.hostname=

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.username=

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.password=
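Filled in, the remote-CLI settings might look like this (all values hypothetical):

    kylin.job.run.as.remote.cmd=true
    kylin.job.remote.cli.hostname=hadoop-client.example.com
    kylin.job.remote.cli.username=kylin
    kylin.job.remote.cli.password=secret

Kylin will then SSH to hadoop-client.example.com with these credentials to run its hadoop, hive, and hbase commands.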


The following entry is the maximum number of jobs Kylin will execute concurrently:

kylin.job.concurrent.max.limit=10

The interval, in seconds, at which Kylin checks the status of the MapReduce tasks it has submitted to YARN:

kylin.job.yarn.app.rest.check.interval.seconds=10

The relevant code is as follows:

while (!isDiscarded()) {
    JobStepStatusEnum newStatus = statusChecker.checkStatus();
    if (status == JobStepStatusEnum.KILLED) {
        executableManager.updateJobOutput(getId(), ExecutableState.ERROR, Collections.<String, String> emptyMap(), "killed by admin");
        return new ExecuteResult(ExecuteResult.State.FAILED, "killed by admin");
    }
    if (status == JobStepStatusEnum.WAITING && (newStatus == JobStepStatusEnum.FINISHED || newStatus == JobStepStatusEnum.ERROR || newStatus == JobStepStatusEnum.RUNNING)) {
        final long waitTime = System.currentTimeMillis() - getStartTime();
        setMapReduceWaitTime(waitTime);
    }
    status = newStatus;
    executableManager.addJobInfo(getId(), hadoopCmdOutput.getInfo());
    if (status.isComplete()) {
        final Map<String, String> info = hadoopCmdOutput.getInfo();
        readCounters(hadoopCmdOutput, info);
        executableManager.addJobInfo(getId(), info);

        if (status == JobStepStatusEnum.FINISHED) {
            return new ExecuteResult(ExecuteResult.State.SUCCEED, output.toString());
        } else {
            return new ExecuteResult(ExecuteResult.State.FAILED, output.toString());
        }
    }
    Thread.sleep(context.getConfig().getYarnStatusCheckIntervalSeconds() * 1000);
}

The following entry sets the Hive database in which the intermediate flat table is created during the first step of a cube build:

kylin.job.hive.database.for.intermediatetable=default
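For example, to keep the intermediate flat tables out of Hive's default database, one might create a dedicated database and point the property at it (the database name below is hypothetical):

    -- in Hive
    CREATE DATABASE IF NOT EXISTS kylin_flat_db;

    # in kylin.properties
    kylin.job.hive.database.for.intermediatetable=kylin_flat_db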

The following is the compression codec for the data stored in the HBase table that Kylin creates when building a cube:

kylin.hbase.default.compression.codec=snappy

Note: before setting this value, check whether the Hadoop cluster that HBase points at supports the codec. The check command is:

hadoop checknative -a

In this case, the check output showed that the Hadoop cluster did not support the snappy codec, so the default value had to be changed.
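A hedged sketch of the fallback, assuming the cluster supports gzip (verify this against the codec list that hadoop checknative prints, and note that the accepted values may vary by Kylin version):

    # snappy is unavailable on this cluster; fall back to a supported codec
    kylin.hbase.default.compression.codec=gzip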

