【Hadoop 5】Analyzing the Word Count Example Results

Below is the output of running Word Count. The input consisted of two small files, each only a few KB in size.

hadoop@hadoop-Inspiron-3521:~/hadoop-2.5.2/bin$ hadoop jar WordCountMapReduce.jar /users/hadoop/hello/world /users/hadoop/output5
--->/users/hadoop/hello/world
--->/users/hadoop/output5
14/12/15 22:35:40 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 22:35:41 INFO input.FileInputFormat: Total input paths to process : 2 // two input files to process
14/12/15 22:35:41 INFO mapreduce.JobSubmitter: number of splits:2  // two input splits; each split corresponds to one Map Task
14/12/15 22:35:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418652929537_0001
14/12/15 22:35:43 INFO impl.YarnClientImpl: Submitted application application_1418652929537_0001
14/12/15 22:35:43 INFO mapreduce.Job: The url to track the job: http://hadoop-Inspiron-3521:8088/proxy/application_1418652929537_0001/
14/12/15 22:35:43 INFO mapreduce.Job: Running job: job_1418652929537_0001
14/12/15 22:35:54 INFO mapreduce.Job: Job job_1418652929537_0001 running in uber mode : false
14/12/15 22:35:54 INFO mapreduce.Job:  map 0% reduce 0%
14/12/15 22:36:04 INFO mapreduce.Job:  map 50% reduce 0%
14/12/15 22:36:05 INFO mapreduce.Job:  map 100% reduce 0%
14/12/15 22:36:16 INFO mapreduce.Job:  map 100% reduce 100%
14/12/15 22:36:17 INFO mapreduce.Job: Job job_1418652929537_0001 completed successfully
14/12/15 22:36:17 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=3448
		FILE: Number of bytes written=299665
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2574
		HDFS: Number of bytes written=1478
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2 // one Map Task per input file
		Launched reduce tasks=1
		Data-local map tasks=2 // both Map Tasks read their data from the local node
		Total time spent by all maps in occupied slots (ms)=17425
		Total time spent by all reduces in occupied slots (ms)=8472
		Total time spent by all map tasks (ms)=17425
		Total time spent by all reduce tasks (ms)=8472
		Total vcore-seconds taken by all map tasks=17425
		Total vcore-seconds taken by all reduce tasks=8472
		Total megabyte-seconds taken by all map tasks=17843200
		Total megabyte-seconds taken by all reduce tasks=8675328
	Map-Reduce Framework
		Map input records=90 // the two input files contain 90 lines in total
		Map output records=251 // the Map phase emitted 251 records, i.e. close to 3 words per line (251/90)
		Map output bytes=2940
		Map output materialized bytes=3454
		Input split bytes=263
		Combine input records=0
		Combine output records=0
		Reduce input groups=138
		Reduce shuffle bytes=3454
		Reduce input records=251
		Reduce output records=138
		Spilled Records=502
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=274
		CPU time spent (ms)=3740
		Physical memory (bytes) snapshot=694566912
		Virtual memory (bytes) snapshot=3079643136
		Total committed heap usage (bytes)=513277952
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=2311   // total size of the two input files
	File Output Format Counters 
		Bytes Written=1478 // size of the output file part-r-00000
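The counters above are printed by the client after the job finishes, but they can also be read programmatically from the Job object. A minimal sketch, assuming a job object that has already completed (for example, after job.waitForCompletion(true)):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
    // Prints a few of the framework counters shown in the log above.
    // Assumes 'job' has already completed.
    static void print(Job job) throws Exception {
        Counters counters = job.getCounters();
        long mapIn  = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();     // 90 in this run
        long mapOut = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();    // 251 in this run
        long redOut = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue(); // 138 in this run
        System.out.println("map input records  = " + mapIn);
        System.out.println("map output records = " + mapOut);
        System.out.println("reduce output recs = " + redOut);
        System.out.println("avg words per line = " + (double) mapOut / mapIn);
    }
}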
Mapper only, no Reducer

If the Combiner is set to the Reducer implementation and the number of reduce tasks is set to 0, only the Mapper produces output, and the output file is named part-m-00000. The result shows:

1. The output is neither sorted nor merged for identical keys; in other words, the Combiner had no effect.

Five Reducers

With five Reducers configured, the output directory contains five files, part-r-00000 through part-r-00004. (A driver-configuration sketch illustrating both of these settings follows the hdfs-site.xml snippet below.)

Setting the block size

When setting the block size, two parameters must be taken into account; their constraints limit which block sizes can be configured:

1. The block size cannot be smaller than the value of dfs.namenode.fs-limits.min-block-size (default 1048576).
2. The block size must be an integer multiple of io.bytes.per.checksum, whose default is 256 bytes.

To change the block size to 512 bytes, add the following to hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <!--<value>67108864</value>-->
  <value>512</value>
  <description>The default block size for new files.</description>
</property>

<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <!--<value>67108864</value>-->
  <value>256</value>
  <description>The minimum block size for new files.</description>
</property>
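For reference, the driver-side settings behind the two experiments above (Combiner set to the Reducer class with zero reduce tasks, and alternatively five reduce tasks) would look roughly like this. This is only a sketch; WordCountDriver, WordCountMapper, and WordCountReducer are placeholder names, not necessarily the classes inside WordCountMapReduce.jar:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Experiment 1: Combiner set to the Reducer implementation, zero reduce tasks.
        // Only the Mapper writes output (part-m-00000); the Combiner never runs,
        // so the output is neither sorted nor merged.
        job.setCombinerClass(WordCountReducer.class);
        job.setNumReduceTasks(0);

        // Experiment 2 (instead of the two lines above): five reduce tasks produce
        // five output files, part-r-00000 through part-r-00004.
        // job.setReducerClass(WordCountReducer.class);
        // job.setNumReduceTasks(5);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}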
Pro Apache Hadoop (p13): "A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output of all the maps will be partitioned, and each partition will be sorted. There will be one partition for each reduce task. Each partition's sorted keys and the values associated with the keys are then processed by the reduce task. There can be multiple reduce tasks running in parallel on the cluster. The key to receiving a list of values for a key in the reduce phase is a phase known as the sort/shuffle phase in MapReduce. All the key/value pairs emitted by the Mapper are sorted by the key in the Reducer. If multiple Reducers are allocated, a subset of keys will be allocated to each Reducer. The key/value pairs for a given Reducer are sorted by key, which ensures that all the values associated with one key are received by the Reducer together."

[Figure: Word Count MapReduce process]
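As a concrete illustration of the quoted passage, a minimal word-count Mapper and Reducer might look like the following. This is a sketch, not necessarily the exact code in WordCountMapReduce.jar:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The Mapper turns each input line into (word, 1) pairs; the framework's
// sort/shuffle phase then groups all values for one key before the Reducer sees them.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // one Map output record per word (251 in this run)
            }
        }
    }
}

// The Reducer receives one sorted key together with all of its values and emits
// the total count: one output record per distinct word (138 in this run).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}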

Pro Apache Hadoop (p24): Some of the metadata stored by the NameNode includes these:

• File/directory name and its location relative to the parent directory.
• File and directory ownership and permissions.
• File name of individual blocks. Each block is stored as a file in the local file system of the DataNode in the directory that can be configured by the Hadoop system administrator.

How to view the data blocks on HDFS

If a file is smaller than one block, it still occupies one block (whose metadata is recorded in the NameNode), but that block's actual size on disk is just the file's size.

Pro Apache Hadoop (p19): The NameNode file that contains the metadata is fsimage. Any changes to the metadata during the system operation are stored in memory and persisted to another file called edits. Periodically, the edits file is merged with the fsimage file by the Secondary NameNode.

The state of HDFS can be inspected with the following command:

hdfs fsck / -files -blocks -locations | grep /users/hadoop/wordcount -A 30
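Besides fsck, block placement can also be inspected from code through the FileSystem API. A small sketch, assuming the word-count input directory used earlier in this post (any HDFS directory would do):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists the blocks of every file under a directory together with the DataNodes
// that hold each block replica.
public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/users/hadoop/hello"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("  offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }
}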

Reposted from bit1129.iteye.com/blog/2166552