Hadoop MapReduce Case 1: WordCount


Case description: count the number of times each word appears in a file.

Code:

package com.jeff.mr.wordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Case 1:
 *   A large file contains lines of words, with the words on each line separated by spaces;
 *   count the number of occurrences of each word.
 * @author jeffSheng
 * 2018-09-18
 */
public class RunJob {

	public static void main(String[] args) {
		// Initialize Configuration; it automatically loads the configuration files found under src or the class directory
		Configuration config = new Configuration();
		try {
			FileSystem fs = FileSystem.get(config);
			// Create the job to run via the static factory method, passing in config
			Job job = Job.getInstance(config);
			// Set the entry (driver) class, i.e. the current class
			job.setJarByClass(RunJob.class);
			// Set the job name
			job.setJobName("wordCount");
			
			// Mapper class executed by the map tasks at runtime
			job.setMapperClass(WordCountMapper.class);
			// Reducer class executed by the reduce tasks at runtime
			job.setReducerClass(WordCountReducer.class);
			
			// Type of the map task's output key: the word
			job.setMapOutputKeyClass(Text.class);
			// Type of the map task's output value: the number of occurrences of the word
			job.setMapOutputValueClass(IntWritable.class);
			
			// Input directory for the map tasks: /usr/input/
			FileInputFormat.addInputPath(job, new Path("/usr/input/"));
			// Output directory: /usr/output/wc
			Path outpath = new Path("/usr/output/wc");
			// If the output directory already exists, delete it recursively
			if (fs.exists(outpath)) {
				fs.delete(outpath, true);
			}
			// Set the output directory
			FileOutputFormat.setOutputPath(job, outpath);
			// Submit the job and wait for it to complete
			boolean f = job.waitForCompletion(true);
			if (f) {
				System.out.println("Job completed successfully");
			}
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
		
	}
	
}
package com.jeff.mr.wordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

/**
 * Map task definition:
 *   counts word occurrences in the file; the first (map-side) shuffle phase then runs with its default behavior.
 * @author jeffSheng
 * 2018-09-18
 * 	Extends the Mapper class with the generic parameters <map input key, map input value, map output key, map output value>:
 *  Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 *  
 *  KEYIN, VALUEIN
 *      The map task's input comes from a split of the file and is passed in line by line. By default the key
 *      is the offset of the line (LongWritable) and the value is the content of the line (Text).
 *      
 *  KEYOUT, VALUEOUT
 *      The map task's output uses the word as the key (Text) and the word count as the value (IntWritable).
 * Mapper<LongWritable, Text, Text, IntWritable>
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{

	/**
	 * The map method is called repeatedly, once for each line read from the file split:
	 * the key is the line's offset and the value is the line's content.
	 */
	protected void map(LongWritable key, Text value,Context context) throws IOException, InterruptedException {
		// value is one line of the split; in this example it is a space-separated string
		String[] words = StringUtils.split(value.toString(), ' ');
		for (String word : words) {
			// Emit a key-value pair with the word as key and 1 as value.
			// The map task only emits data; the actual counting happens in the reduce task.
			/**
			 * The emitted data goes through the first shuffle phase (partition, sort, combine, spill),
			 * which the map task performs with its default behavior.
			 */
			context.write(new Text(word), new IntWritable(1));
		}
	}
}
package com.jeff.mr.wordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reduce task definition:
 *   After the map-side shuffle finishes, the map output is transferred to the reduce task, which performs the
 *   second shuffle phase (copy, merge-sort, group) before the data becomes the reduce input; the data types stay the same.
 *   Tips: after grouping, each group is passed to the reduce task, i.e. reduce is called once per group;
 *   within a group the keys are equal and there may be multiple values.
 * @author jeffSheng
 * 2018-09-18
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

	// This method is called once per group; within a group the keys are identical and there may be several values.
	/**
	 * Text arg0: the group's key, i.e. a particular word.
	 * Iterable<IntWritable> arg1: iterator over all the values in the group.
	 */
	protected void reduce(Text arg0, Iterable<IntWritable> arg1, Context arg2)
													throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable i : arg1) {
			sum = sum + i.get();
		}
		// Emit a key-value pair with the word as key and the total count as value
		arg2.write(arg0, new IntWritable(sum));
	}
}

The currently active NameNode is node1, listening on port 8020.
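
If you are not sure which NameNode is currently active, the HA admin tool can report it; the NameNode ID nn1 below is an assumption about this cluster's HA configuration:

# Prints "active" or "standby" for the given NameNode ID (nn1 is assumed)
hdfs haadmin -getServiceState nn1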

Set node1 and 8020 in Map/Reduce Locations by creating a new location.

In the host and port fields of the dialog, fill in node1 and 8020; the Hadoop user is root.

Create the input path /usr/input for the input file.

At first, creating this directory did not succeed; the fix is to add the following to hdfs-site.xml:

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

Restart HDFS and the problem is resolved.

# Stop all nodes

stop-dfs.sh

Then start the cluster again from node1:

# start-dfs.sh

Upload the input file wc.txt, i.e. the input data, to /usr/input in HDFS (a sketch of the commands follows the file contents):

hadooo hello world

hello hadoop

hbase zookeeper
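
A minimal sketch of that upload from the shell, assuming wc.txt sits in the current local directory:

# Create the input directory in HDFS and upload the file
hdfs dfs -mkdir -p /usr/input
hdfs dfs -put wc.txt /usr/input
# Verify the upload
hdfs dfs -ls /usr/input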

In a real production environment the job is submitted to the ResourceManager first, so we start by packaging the program into a jar:

Then click Next and Finish in the export wizard.
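
If you prefer the command line to the Eclipse export wizard, a plain jar command produces an equivalent archive; the bin directory of compiled classes is an assumption about the project layout:

# Package the compiled classes (assumed to be under ./bin) into wc.jar
jar cf wc.jar -C bin .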

Then we upload wc.jar to node4. Note that node4 is not the ResourceManager master node, but node4's configuration files tell it where the ResourceManager master, node1, is located.
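
One way to copy the jar over, assuming SSH access to node4 as root (the destination path is illustrative):

# Copy wc.jar to node4 (destination path is an assumption)
scp wc.jar root@node4:/root/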

The entry class of our program:

com.jeff.mr.wordCount.RunJob

# hadoop jar wc.jar com.jeff.mr.wordCount.RunJob

The computation progress can be watched in the monitoring UI.

Execution log:

[root@node1 ~]# hadoop jar wc.jar com.jeff.mr.wordCount.RunJob

18/09/21 00:28:10 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

18/09/21 00:28:10 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

18/09/21 00:28:11 INFO input.FileInputFormat: Total input paths to process : 1

18/09/21 00:28:11 INFO mapreduce.JobSubmitter: number of splits:1

18/09/21 00:28:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1537198202075_0002

18/09/21 00:28:12 INFO impl.YarnClientImpl: Submitted application application_1537198202075_0002

18/09/21 00:28:12 INFO mapreduce.Job: The url to track the job: http://node4:18088/proxy/application_1537198202075_0002/

18/09/21 00:28:12 INFO mapreduce.Job: Running job: job_1537198202075_0002

18/09/21 00:28:38 INFO mapreduce.Job: Job job_1537198202075_0002 running in uber mode : false

18/09/21 00:28:38 INFO mapreduce.Job:  map 0% reduce 0%

18/09/21 00:28:51 INFO mapreduce.Job:  map 100% reduce 0%

18/09/21 00:29:04 INFO mapreduce.Job:  map 100% reduce 100%

18/09/21 00:29:05 INFO mapreduce.Job: Job job_1537198202075_0002 completed successfully

18/09/21 00:29:05 INFO mapreduce.Job: Counters: 49

         File System Counters

                   FILE: Number of bytes read=96

                   FILE: Number of bytes written=198601

                   FILE: Number of read operations=0

                   FILE: Number of large read operations=0

                   FILE: Number of write operations=0

                   HDFS: Number of bytes read=146

                   HDFS: Number of bytes written=54

                   HDFS: Number of read operations=6

                   HDFS: Number of large read operations=0

                   HDFS: Number of write operations=2

         Job Counters

                   Launched map tasks=1

                   Launched reduce tasks=1

                   Data-local map tasks=1

                   Total time spent by all maps in occupied slots (ms)=11040

                   Total time spent by all reduces in occupied slots (ms)=9092

                   Total time spent by all map tasks (ms)=11040

                   Total time spent by all reduce tasks (ms)=9092

                   Total vcore-seconds taken by all map tasks=11040

                   Total vcore-seconds taken by all reduce tasks=9092

                   Total megabyte-seconds taken by all map tasks=11304960

                   Total megabyte-seconds taken by all reduce tasks=9310208

         Map-Reduce Framework

                   Map input records=3

                   Map output records=7

                   Map output bytes=76

                   Map output materialized bytes=96

                   Input split bytes=97

                   Combine input records=0

                   Combine output records=0

                   Reduce input groups=6

                   Reduce shuffle bytes=96

                   Reduce input records=7

                   Reduce output records=6

                   Spilled Records=14

                   Shuffled Maps =1

                   Failed Shuffles=0

                   Merged Map outputs=1

                   GC time elapsed (ms)=214

                   CPU time spent (ms)=3550

                   Physical memory (bytes) snapshot=322617344

                   Virtual memory (bytes) snapshot=1724956672

                   Total committed heap usage (bytes)=136253440

         Shuffle Errors

                   BAD_ID=0

                   CONNECTION=0

                   IO_ERROR=0

                   WRONG_LENGTH=0

                   WRONG_MAP=0

                   WRONG_REDUCE=0

         File Input Format Counters

                   Bytes Read=49

         File Output Format Counters

                   Bytes Written=54

Job completed successfully

The result directory /usr/output/wc contains two files: _SUCCESS, a marker indicating the job succeeded, and the actual output file.

Input file and computed result:
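
Both can be viewed from the shell; part-r-00000 is the default output file name when there is a single reducer:

hdfs dfs -cat /usr/input/wc.txt
hdfs dfs -ls /usr/output/wc
hdfs dfs -cat /usr/output/wc/part-r-00000

Given the input above, the expected counts are hadooo 1, hadoop 1, hbase 1, hello 2, world 1 and zookeeper 1, which is consistent with the six reduce output records reported in the counters.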
