Getting Started with Hadoop: The WordCount Example

The WordCount process is shown in the figure. Below I record my first steps with Hadoop, though in many places my understanding is still only skin-deep.

Installing Hadoop

Installation is straightforward; once installed, configure Hadoop for a single-node setup.
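On macOS, the /usr/local/Cellar paths used throughout this post indicate a Homebrew install, so installation presumably amounts to:

brew install hadoop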

hadoop-env.sh: specify JAVA_HOME.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME="$(/usr/libexec/java_home)"

core-site.xml: set the temporary directory Hadoop uses and the NameNode address.

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml: with only one node, set the replication factor to 1.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml: specify the JobTracker address. (mapred.job.tracker is a Hadoop 1.x setting; since mapreduce.framework.name is not set to yarn here, jobs fall back to the local runner, as the job log later confirms.)

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>

Start all the Hadoop daemons.

➜ sbin git:(master) ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
16/12/03 19:32:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
Password:
localhost: starting namenode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-vonzhou-namenode-vonzhoudeMacBook-Pro.local.out
Password:
localhost: starting datanode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-vonzhou-datanode-vonzhoudeMacBook-Pro.local.out
Starting secondary namenodes [0.0.0.0]
Password:
0.0.0.0: starting secondarynamenode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-vonzhou-secondarynamenode-vonzhoudeMacBook-Pro.local.out
16/12/03 19:33:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/yarn-vonzhou-resourcemanager-vonzhoudeMacBook-Pro.local.out
Password:
localhost: starting nodemanager, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/yarn-vonzhou-nodemanager-vonzhoudeMacBook-Pro.local.out

(You can configure passwordless SSH login; otherwise Hadoop prompts for a password every time it starts, as seen above.)
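A minimal sketch of the usual passwordless-SSH setup (assuming OpenSSH; on macOS, Remote Login must also be enabled in System Preferences):

# Generate a key pair with an empty passphrase and authorize it for localhost.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify: this should log in without prompting for a password.
ssh localhost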

Check which components have started.

➜ sbin git:(master) jps -l
5713 org.apache.hadoop.hdfs.server.namenode.NameNode
6145 org.apache.hadoop.yarn.server.nodemanager.NodeManager
6044 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
5806 org.apache.hadoop.hdfs.server.datanode.DataNode
5918 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode

Visiting localhost:50070/ shows some status information for the DFS.

WordCount

WordCount is the "hello world" of Hadoop. The code is as follows:

package com.vonzhou.learnhadoop.simple;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1) pairs.
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            // Sum all the counts emitted for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job(Configuration, String) is deprecated in Hadoop 2.x;
        // Job.getInstance(conf, "wordcount") is the preferred form.
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        /**
         * Use the reducer as a local combiner: it collapses duplicate words
         * on each node before the shuffle, greatly reducing network transfer.
         */
        job.setCombinerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

Construct two text files as input.
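A sketch of how the input files might be created (the contents here are hypothetical placeholders; the actual files used below contained the words visible in the final output):

mkdir -p wordcount-input
echo "hello hadoop" > wordcount-input/file1
echo "hello world" > wordcount-input/file2

Then copy the local files into HDFS (a hadoop symlink to the installed binary is created first for convenience):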

➜ hadoop-examples git:(master) ✗ ln /usr/local/Cellar/hadoop/2.7.1/bin/hadoop hadoop
➜ hadoop-examples git:(master) ✗ ./hadoop dfs -put wordcount-input/file* input
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/12/03 23:17:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
➜ hadoop-examples git:(master) ✗ ./hadoop dfs -ls input/
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/12/03 23:21:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 vonzhou supergroup         42 2016-12-03 23:17 input/file1
-rw-r--r--   1 vonzhou supergroup         43 2016-12-03 23:17 input/file2

Build the program into a jar:

mvn clean package
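The pom.xml is not shown in this post; at minimum the project would need the Hadoop client dependency matching the installed version, along the lines of this sketch:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.7.1</version>
</dependency>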

Run the program (the main class must be given as its fully qualified name, package included). Note in the log below that the job id is job_local524341653_0001 and tasks run through mapred.LocalJobRunner: with the configuration above, the job executes in local mode rather than on YARN:

➜ hadoop-examples git:(master) ✗ ./hadoop jar target/hadoop-examples-1.0-SNAPSHOT.jar com.vonzhou.learnhadoop.simple.WordCount input output
16/12/03 23:31:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/03 23:31:20 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/12/03 23:31:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/12/03 23:33:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/03 23:33:21 INFO input.FileInputFormat: Total input paths to process : 2
16/12/03 23:33:21 INFO mapreduce.JobSubmitter: number of splits:2
16/12/03 23:33:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local524341653_0001
16/12/03 23:33:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/12/03 23:33:22 INFO mapreduce.Job: Running job: job_local524341653_0001
16/12/03 23:33:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/12/03 23:33:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/12/03 23:33:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/12/03 23:33:22 INFO mapred.LocalJobRunner: Waiting for map tasks
16/12/03 23:33:22 INFO mapred.LocalJobRunner: Starting task: attempt_local524341653_0001_m_000000_0
16/12/03 23:33:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/12/03 23:33:22 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
16/12/03 23:33:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
16/12/03 23:33:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/vonzhou/input/file2:0+43
16/12/03 23:33:22 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/12/03 23:33:22 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/12/03 23:33:22 INFO mapred.MapTask: soft limit at 83886080
16/12/03 23:33:22 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/12/03 23:33:22 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/12/03 23:33:22 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/12/03 23:33:22 INFO mapred.LocalJobRunner:
16/12/03 23:33:22 INFO mapred.MapTask: Starting flush of map output
16/12/03 23:33:22 INFO mapred.MapTask: Spilling map output
16/12/03 23:33:22 INFO mapred.MapTask: bufstart = 0; bufend = 71; bufvoid = 104857600
16/12/03 23:33:22 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214372(104857488); length = 25/6553600
16/12/03 23:33:22 INFO mapred.MapTask: Finished spill 0
16/12/03 23:33:22 INFO mapred.Task: Task:attempt_local524341653_0001_m_000000_0 is done. And is in the process of committing
16/12/03 23:33:22 INFO mapred.LocalJobRunner: map
16/12/03 23:33:22 INFO mapred.Task: Task attempt_local524341653_0001_m_000000_0 done.
16/12/03 23:33:22 INFO mapred.LocalJobRunner: Finishing task: attempt_local524341653_0001_m_000000_0
16/12/03 23:33:22 INFO mapred.LocalJobRunner: Starting task: attempt_local524341653_0001_m_000001_0
16/12/03 23:33:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/12/03 23:33:22 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
16/12/03 23:33:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
16/12/03 23:33:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/vonzhou/input/file1:0+42
16/12/03 23:33:22 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/12/03 23:33:22 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/12/03 23:33:22 INFO mapred.MapTask: soft limit at 83886080
16/12/03 23:33:22 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/12/03 23:33:22 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/12/03 23:33:22 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/12/03 23:33:22 INFO mapred.LocalJobRunner:
16/12/03 23:33:22 INFO mapred.MapTask: Starting flush of map output
16/12/03 23:33:22 INFO mapred.MapTask: Spilling map output
16/12/03 23:33:22 INFO mapred.MapTask: bufstart = 0; bufend = 70; bufvoid = 104857600
16/12/03 23:33:22 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214372(104857488); length = 25/6553600
16/12/03 23:33:22 INFO mapred.MapTask: Finished spill 0
16/12/03 23:33:22 INFO mapred.Task: Task:attempt_local524341653_0001_m_000001_0 is done. And is in the process of committing
16/12/03 23:33:22 INFO mapred.LocalJobRunner: map
16/12/03 23:33:22 INFO mapred.Task: Task attempt_local524341653_0001_m_000001_0 done.
16/12/03 23:33:22 INFO mapred.LocalJobRunner: Finishing task: attempt_local524341653_0001_m_000001_0
16/12/03 23:33:22 INFO mapred.LocalJobRunner: map task executor complete.
16/12/03 23:33:22 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/12/03 23:33:22 INFO mapred.LocalJobRunner: Starting task: attempt_local524341653_0001_r_000000_0
16/12/03 23:33:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/12/03 23:33:22 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
16/12/03 23:33:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
16/12/03 23:33:22 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@64accbd9
16/12/03 23:33:23 INFO mapreduce.Job: Job job_local524341653_0001 running in uber mode : false
16/12/03 23:33:23 INFO mapreduce.Job: map 100% reduce 0%
16/12/03 23:33:53 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/12/03 23:33:53 INFO reduce.EventFetcher: attempt_local524341653_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/12/03 23:33:53 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local524341653_0001_m_000001_0 decomp: 86 len: 90 to MEMORY
16/12/03 23:33:53 INFO reduce.InMemoryMapOutput: Read 86 bytes from map-output for attempt_local524341653_0001_m_000001_0
16/12/03 23:33:53 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 86, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->86
16/12/03 23:33:53 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local524341653_0001_m_000000_0 decomp: 87 len: 91 to MEMORY
16/12/03 23:33:53 INFO reduce.InMemoryMapOutput: Read 87 bytes from map-output for attempt_local524341653_0001_m_000000_0
16/12/03 23:33:53 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 87, inMemoryMapOutputs.size() -> 2, commitMemory -> 86, usedMemory ->173
16/12/03 23:33:53 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/12/03 23:33:53 INFO mapred.LocalJobRunner: 2 / 2 copied.
16/12/03 23:33:53 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
16/12/03 23:33:53 INFO mapred.Merger: Merging 2 sorted segments
16/12/03 23:33:53 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 162 bytes
16/12/03 23:33:53 INFO reduce.MergeManagerImpl: Merged 2 segments, 173 bytes to disk to satisfy reduce memory limit
16/12/03 23:33:53 INFO reduce.MergeManagerImpl: Merging 1 files, 175 bytes from disk
16/12/03 23:33:53 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/12/03 23:33:53 INFO mapred.Merger: Merging 1 sorted segments
16/12/03 23:33:53 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 165 bytes
16/12/03 23:33:53 INFO mapred.LocalJobRunner: 2 / 2 copied.
16/12/03 23:33:53 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/12/03 23:33:53 INFO mapred.Task: Task:attempt_local524341653_0001_r_000000_0 is done. And is in the process of committing
16/12/03 23:33:53 INFO mapred.LocalJobRunner: 2 / 2 copied.
16/12/03 23:33:53 INFO mapred.Task: Task attempt_local524341653_0001_r_000000_0 is allowed to commit now
16/12/03 23:33:53 INFO output.FileOutputCommitter: Saved output of task attempt_local524341653_0001_r_000000_0 to hdfs://localhost:9000/user/vonzhou/output/_temporary/0/task_local524341653_0001_r_000000
16/12/03 23:33:53 INFO mapred.LocalJobRunner: reduce > reduce
16/12/03 23:33:53 INFO mapred.Task: Task attempt_local524341653_0001_r_000000_0 done.
16/12/03 23:33:53 INFO mapred.LocalJobRunner: Finishing task: attempt_local524341653_0001_r_000000_0
16/12/03 23:33:53 INFO mapred.LocalJobRunner: reduce task executor complete.
16/12/03 23:33:54 INFO mapreduce.Job: map 100% reduce 100%
16/12/03 23:33:54 INFO mapreduce.Job: Job job_local524341653_0001 completed successfully
16/12/03 23:33:54 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=54188
        FILE: Number of bytes written=917564
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=213
        HDFS: Number of bytes written=89
        HDFS: Number of read operations=22
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=5
    Map-Reduce Framework
        Map input records=5
        Map output records=14
        Map output bytes=141
        Map output materialized bytes=181
        Input split bytes=222
        Combine input records=0
        Combine output records=0
        Reduce input groups=11
        Reduce shuffle bytes=181
        Reduce input records=14
        Reduce output records=11
        Spilled Records=28
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=7
        Total committed heap usage (bytes)=946864128
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=85
    File Output Format Counters
        Bytes Written=89
➜ hadoop-examples git:(master) ✗

Check the results of the run:

➜ hadoop-examples git:(master) ✗ ./hadoop dfs -ls output
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/12/03 23:36:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 vonzhou supergroup          0 2016-12-03 23:33 output/_SUCCESS
-rw-r--r--   1 vonzhou supergroup         89 2016-12-03 23:33 output/part-r-00000
➜ hadoop-examples git:(master) ✗ ./hadoop dfs -cat output/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/12/03 23:37:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
big	1
by	1
data	1
google	1
hadoop	2
hello	2
learning	1
papers	1
step	2
vonzhou	1
world	1
