Experiment 3 MapReduce practice


Reproduced with authorization from the blogger "curly mini pig" and from our teacher's PPT; I am just writing up my own operation process.

Experiment 3 MapReduce practice

  • 1. Purpose of the experiment
  • 2. Experimental principle
  • 3. Experimental Preparation
  • 4. Experimental content

1. Purpose of the experiment

  • Further master the basic principles of MapReduce by writing MapReduce programs
  • Understand the idea of divide-and-conquer computing by writing MapReduce programs for different cases
  • Master the standard structure of MapReduce programs, and be able to decompose requirements and implement them proficiently

2. Experimental principle

  • Map computation is performed one record at a time; reduce computation is performed one group of records at a time;
  • The combination of map + reduce can effectively decompose the data-analysis (i.e. computing) requirements of many big-data scenarios;
  • Since MR computation runs in a distributed big-data environment, it is necessary to understand serialization and deserialization during network transmission in depth;
  • To improve the efficiency of data fetching and IO, the MR computing framework itself implements "partitioning" and "secondary sorting", which together are called shuffle;
  • MR exposes partition and sort (compare) interfaces so that users can customize these steps;
  • Other MR parameter settings and interfaces include the merge operation (Combiner), table joins (Join), and so on.
    [Note] The MR computing framework is no longer mainstream, but it still has applicable scenarios; what we should build is the ability to learn and apply it quickly. MR's divide-and-conquer computational thinking will be used frequently in the follow-up course "Big Data In-Memory Computing".

3. Experimental Preparation

  • Complete experiment 1 and build a pseudo-distributed environment
  • After completing experiment 2, the network access environment has been configured
  • Complete experiment 2 and have experience in building Maven projects

4. Experimental content

[Experimental project] Project 1, project 2, and project 3 must be done; project 4 and subsequent projects are optional

  • Project 1: Analyzing and Writing a WordCount Program
  • Project 2: Count the sum of salaries of employees in each department (serialization + department partition + Combiner)
  • Project 3: Statistics on the salary level of all employees (salary division)
  • Project 4: Advanced understanding of the principles behind WordCount
  • Project 5: TopN case - counting the two days with the highest temperature in each month
  • Project 6: Friend recommendation case (to be completed in 2023)

[Note] Instructions for the adjustment of the selected project content:
• The original "Mobile Phone Sales Statistics" project, after I worked through it, did not feel in line with big-data distributed-computing thinking, so I discarded it; a thorough grasp of project 5 plus the in-class explanation of projects 2 and 3 basically meets the standard;
• Another sales case, limited by class hours (in fact, lost to the holiday and week xx), will be put into the third homework for everyone to practice;
• The friend-recommendation case, given the current ability level, will be used next year for the next class of students, because the follow-up course will change to 48+16 hours. The lab sessions themselves are for discussing technology, discussing problems and taking tests; hands-on step-by-step teaching is only done for freshman and sophomore students.

4. Experimental content [approximate steps]

Project 1: Analyzing and Writing a WordCount Program

It is recommended to follow along and rebuild the project directly, which saves a lot of trouble; use the reference below together with this article.

Reference link

[Reference link] (Teacher Liang's blog):
https://blog.csdn.net/qq_42881421/article/details/83353640
Reference notes

  • a. Configure the main class: add the following content before </project> (in my pom it needed to go before </build>).
  • The pom contains the comment line <!-- the class where main() is located, pay attention to modify -->; the line below it is <mainClass>com.MyWordCount.WordCountMain</mainClass>, where com.MyWordCount is your Maven project's package and WordCountMain is the main class used later; you can simply keep these names and follow along.
  • Run the packaged jar file.
  • Running Ubuntu: how to solve the problem that WinSCP cannot connect? Click the link (personally tested, it works).

Following the tutorial, the steps above have been completed (screenshot of the result omitted here).

Steps

• Configure Maven for Eclipse (download it in advance, it takes time; pay attention to the mirror configuration and the network)
• Write the code and explain it

  • WordCountMain.java (calling class, where the main function is located)
	package com.MyWordCount;
	
	import org.apache.hadoop.conf.Configuration;
	import org.apache.hadoop.fs.Path;
	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.Text;
	import org.apache.hadoop.mapreduce.Job;
	import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
	import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
	
	public class WordCountMain {
	
		public static void main(String[] args) throws Exception {
			// 1. create a job and the task entry point
			Job job = Job.getInstance(new Configuration());
			job.setJarByClass(WordCountMain.class);  // the class where main() is located
	
			// 2. specify the job's mapper and its output types <k2, v2>
			job.setMapperClass(WordCountMapper.class);      // the Mapper class
			job.setMapOutputKeyClass(Text.class);           // type of k2
			job.setMapOutputValueClass(IntWritable.class);  // type of v2
	
			// 3. specify the job's reducer and its output types <k4, v4>
			job.setReducerClass(WordCountReducer.class);    // the Reducer class
			job.setOutputKeyClass(Text.class);              // type of k4
			job.setOutputValueClass(IntWritable.class);     // type of v4
	
			// 4. specify the job's input and output paths
			FileInputFormat.setInputPaths(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));
	
			// 5. run the job
			job.waitForCompletion(true);
		}
	}
  • WordCountMapper.java (map class)
   package com.MyWordCount;
   
   import java.io.IOException;
   
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;
   
   //                                   generics:  k1          v1    k2     v2
   public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   
   	@Override
   	protected void map(LongWritable key1, Text value1, Context context)
   			throws IOException, InterruptedException {
   		// example input record: I like MapReduce
   		String data = value1.toString();
   
   		// tokenize: split on spaces
   		String[] words = data.split(" ");
   
   		// emit <k2, v2>
   		for (String w : words) {
   			context.write(new Text(w), new IntWritable(1));
   		}
   	}
   }

The map() method receives three parameters:

  1. key: the key of the input record; with the default TextInputFormat this is the byte offset of the line in the input file, represented by the LongWritable type.
  2. value: the value of the input record, usually one line of text, represented by the Text type.
  3. context: the context object of the Mapper task; it provides the interface for emitting results, reporting progress and similar operations, represented by the Context type.
  4. Besides map(), the Mapper class also provides methods for initializing, cleaning up and configuring the Mapper task, such as setup() and cleanup(); a sketch of where they fit is shown below.
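
For reference, a minimal sketch (my own, with an assumed class name; not part of the reference code) of where setup() and cleanup() sit around map() in the Mapper lifecycle:

	package com.MyWordCount;
	
	import java.io.IOException;
	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.LongWritable;
	import org.apache.hadoop.io.Text;
	import org.apache.hadoop.mapreduce.Mapper;
	
	// Illustrative only: shows the Mapper lifecycle hooks around map().
	public class LifecycleWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	
		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// called once per map task before the first map() call,
			// e.g. to read configuration values or open a side resource
		}
	
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			for (String w : value.toString().split(" ")) {
				context.write(new Text(w), new IntWritable(1));
			}
		}
	
		@Override
		protected void cleanup(Context context) throws IOException, InterruptedException {
			// called once per map task after the last map() call,
			// e.g. to flush buffered results or close resources
		}
	}

Both setup() and cleanup() are optional overrides; their default implementations do nothing.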

The role of throws IOException, InterruptedException
In Java, the throws keyword is used to declare the exceptions that a method may throw, so that the code that calls the method can catch and handle these exceptions. In the Hadoop MapReduce framework, the map() method of the Mapper class may throw IOException and InterruptedException, so the throws keyword needs to be used in the method declaration to declare these exceptions.

  • WordCountReducer.java (reduce class)
	package com.MyWordCount;
	
	import java.io.IOException;
	
	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.Text;
	import org.apache.hadoop.mapreduce.Reducer;
	
	//                                            k3      v3          k4     v4
	public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	
		@Override
		protected void reduce(Text k3, Iterable<IntWritable> v3, Context context)
				throws IOException, InterruptedException {
			// sum the values in v3
			int total = 0;
			for (IntWritable v : v3) {
				total += v.get();
			}
			// emit k4 (the word) and v4 (its frequency)
			context.write(k3, new IntWritable(total));
		}
	}

• Package with the Maven command (jar package)
• Upload the successfully built jar package to Linux
• Start the Hadoop services in Linux and upload the original data to HDFS (note the path)
• Run the MapReduce task in Linux and view the output
[Note] This project is explained in the theory class before project 4 and is relatively simple, so after completing the experiment please think carefully about the knowledge behind each line of code.

Code analysis:

  • The role of context.write

    • The context.write() method is used in both the Mapper phase and the Reducer phase.
      In the Mapper phase, context.write() emits the results of map processing to the framework: each line of the input file is converted into a set of key-value pairs, where the key is the word and the value is the count being recorded (here, 1 per occurrence). These pairs become the intermediate data for the subsequent Reducer phase.
      In the Reducer phase, context.write() emits the final results: the Reducer groups the key-value pairs output by the Mapper phase by key, aggregates the values within each group, and writes the aggregated result to the output file for downstream programs.
      In both phases, the type of the emitted key-value pair must match the output types configured for that phase. In this Mapper the output type is <Text, IntWritable>, where Text is the word and IntWritable is the count 1 for each occurrence; in this Reducer the output type is also <Text, IntWritable>, where IntWritable is the total number of times the word appears in the input.
  • MapReduce program

    • In a MapReduce program, the output of the Mapper is passed to the Reducer for further processing.
    • Specifically, the Mapper output is partitioned, sorted, merged and grouped by key, and the Reducer's reduce() method is then called once for each group.
    • In this example, the processing flow of the MapReduce program is as follows:
      1. Read the input file and pass each line to the Mapper.
      2. The Mapper splits each line into words and emits each word with a count of 1 as a key-value pair to the framework.
      3. The framework partitions, sorts, merges and groups these pairs, so that all pairs with the same key go to the same Reducer task.
      4. For each group of pairs sharing a key, the framework calls the Reducer's reduce() method once.
      5. The Reducer sums all values for that key and outputs the word together with the total number of times it occurs in the input.
      Therefore, the Mapper output is handed to the Reducer by the MapReduce framework, and the actual aggregation is completed in the Reducer's reduce() method.
  • Reducer program

    • The Reducer is the MapReduce component that further processes the Mapper output. Its role is to combine the key-value pairs that share the same key, merge their values into one result, and output a single key-value pair per key. The Reducer side consists of three main phases: shuffle, sort, and reduce.

    • Shuffle
      In a MapReduce program, the output of the Map tasks is passed to the Reducer for further processing; before that, it must be shuffled. The main task of shuffle is to route the Mapper output by key so that all records with the same key reach the same Reducer. The shuffle process includes the following steps:

      • Partitioning: the Mapper output is partitioned by key; each partition corresponds to one Reducer task.
      • Sort: within each partition, records are sorted by key so that the values of the same key are adjacent.
      • Merge: within each partition, the sorted spill files are merged, bringing records with identical keys together.
      • Copy: each Reducer fetches (copies) the data of its partition from the map-side nodes.
    • Sorting
      On the Reducer node, the framework merge-sorts all fetched key-value pairs by key. The purpose of sorting is to put the pairs of the same key together so that the Reducer can merge them.

    • The reduce() method of the Reducer
      For each key, the Reducer receives an iterator over all values with that key, merges them, and finally outputs one key-value pair. The parameters of the reduce() method are:

      • key: the common key of the group.
      • Iterable of values: an iterator over the values of all key-value pairs with this key.
      • Context object: used to write the Reducer output, e.g. to HDFS.
    • In this process, the Reducer combines all values of the same key to obtain the total count for that key, forms a key-value pair of the key and the total count, and writes it to the context. The Reducer output is then handled by the MapReduce framework, for example written to HDFS or used as the input of a subsequent job.

  • WordCountMain.java program

    • Job object
      In Hadoop, the Job object is the main object representing a MapReduce job. The Job object is mainly used to configure the input, output, Mapper, Reducer and other attributes of the MapReduce job, start the MapReduce job and monitor its running status.
    • .class
      In Java, .class is the class-literal syntax used to obtain the Class object of a type.
    • job.getInstance()
      Job.getInstance() is a static method of the Job class used to create a new Job instance. In MapReduce, a Job represents a complete job, including all of its configuration information and runtime state; each Job instance is associated with one specific job, which is configured and controlled through that instance.
      Job.getInstance() creates a new Job instance and returns it. When creating the instance, a Configuration object is passed in as a parameter to configure the various parameters of the job.
    • job.setJarByClass():
      job.setJarByClass() is a configuration method of a MapReduce job that specifies the jar package used when the job runs. In MapReduce, all code must be packaged into a jar, and that jar is shipped to the nodes of the Hadoop cluster when the job runs; therefore the job must know which jar contains the classes it depends on.
      job.setJarByClass() automatically locates the jar containing the specified class and sets it as the jar to use when the job runs.
    • Map stage parameters (the Reduce stage is similar)
      These lines of code configure the parameters of the Map stage.

      job.setMapperClass(WordCountMapper.class)
      specifies WordCountMapper.class as the Mapper class, i.e. the custom Mapper used to process the input data. The Mapper class is one of the core components of a MapReduce job and converts input data into key-value pairs.

      job.setMapOutputKeyClass(Text.class)
      sets the data type of the key output by the Mapper to Text.class. In the WordCount example the key output by the Mapper is the word, so its data type is Text.class.

      job.setMapOutputValueClass(IntWritable.class)
      sets the data type of the value output by the Mapper to IntWritable.class. In the WordCount example the value output by the Mapper is the count of a word, so its data type is IntWritable.class.

      These configuration parameters are required because they tell the MapReduce framework the input and output data types and define the processing flow of the MapReduce job.

4. Experimental content [approximate steps]

  • Project 1: Analyze and write the WordCount program ==> Thinking questions 1~4 must be answered in the experiment report; thinking questions 5 and 6 are optional.

[Thinking question 1] In the mapper, a new IntWritable object is created every time a word is counted; is this appropriate? Why?
Creating a new IntWritable for every count is functionally correct: context.write() serializes the key and value at the moment it is called, and each Mapper and Reducer task runs in its own JVM process with no data shared between tasks, so the new objects do not affect the correctness of the program.
The cost is extra object creation and garbage collection inside a hot loop, so a common optimization is to reuse a single Text and IntWritable instance, as sketched below.
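
A minimal sketch of that reuse pattern (my own illustration; the class name is assumed and it is not part of the reference code):

	package com.MyWordCount;
	
	import java.io.IOException;
	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.LongWritable;
	import org.apache.hadoop.io.Text;
	import org.apache.hadoop.mapreduce.Mapper;
	
	// Same logic as WordCountMapper, but reusing one Text and one IntWritable.
	public class ReusingWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	
		private final Text word = new Text();
		private final IntWritable one = new IntWritable(1);
	
		@Override
		protected void map(LongWritable key1, Text value1, Context context)
				throws IOException, InterruptedException {
			for (String w : value1.toString().split(" ")) {
				word.set(w);
				// context.write() copies (serializes) the current contents into the
				// output buffer, so reusing the same objects for the next record is safe
				context.write(word, one);
			}
		}
	}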

[Thinking question 2] The current data naturally has a space (" "), which naturally separates the calculation elements; if there is no similar separator, what should we do?
If there is no such delimiter, regular expressions can be used to split the records.
In Hadoop, custom delimiters or regular expressions can be used to define custom splitting rules.
For example, Java's regular-expression library can be used in the Mapper to split the value string:

    // fragment inside a Mapper class; Pattern is java.util.regex.Pattern
    private final Pattern pattern = Pattern.compile(",");  // comma-delimiter regular expression

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = pattern.split(line);  // split the line with the regular expression
        // process each resulting field
        for (String field : fields) {
            // processing logic
            context.write(new Text(field), new IntWritable(1));
        }
    }

Are there utility classes in Hadoop that can handle this? Do they even allow users to customize the behaviour to their needs, or can users only split manually?
Hadoop provides the TextInputFormat class,
which automatically reads the input file line by line and presents each line as a Text object for subsequent processing.
If more flexibility is needed, users can also implement a custom InputFormat class and RecordReader class.

I haven't tried this yet.
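
For what it's worth, the WordCount example bundled with Hadoop tokenizes with java.util.StringTokenizer instead of String.split(); a small sketch of that variant (class name assumed, my own illustration):

	package com.MyWordCount;
	
	import java.io.IOException;
	import java.util.StringTokenizer;
	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.LongWritable;
	import org.apache.hadoop.io.Text;
	import org.apache.hadoop.mapreduce.Mapper;
	
	public class TokenizerWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// StringTokenizer splits on whitespace (space, tab, newline) by default
			StringTokenizer tokens = new StringTokenizer(value.toString());
			while (tokens.hasMoreTokens()) {
				context.write(new Text(tokens.nextToken()), new IntWritable(1));
			}
		}
	}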

[Thinking question 3] In the reducer, why is v3 in the input <k3, v3> an Iterable (iterator)? What are iterators? What is their role in MR calculations?
v3 is an Iterable because it must hold all of the values for the same key that were output by the Map stage.
An iterator lets us traverse the elements of a collection one by one and perform the corresponding operation on each element during the traversal.
In MR calculations, a single key may have a very large number of values, and because of memory and other limits they cannot all be loaded at once.
The Iterable lets the reduce() method read and process the values one by one,
providing a mechanism for accessing the data one element at a time,
which is what allows the reduce function to handle large-scale data sets efficiently (see the note and sketch below).
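
One practical note (my own, hedged): as far as I understand, the framework streams the grouped values and reuses the same Writable instance while iterating, so a reducer that needs to keep values beyond the current loop iteration should copy them first, e.g.:

	package com.MyWordCount;
	
	import java.io.IOException;
	import java.util.ArrayList;
	import java.util.List;
	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.Text;
	import org.apache.hadoop.mapreduce.Reducer;
	
	// Illustrative only: copying values out of the reduce-side iterator.
	public class CopyingWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	
		@Override
		protected void reduce(Text k3, Iterable<IntWritable> v3, Context context)
				throws IOException, InterruptedException {
			List<Integer> kept = new ArrayList<Integer>();
			int total = 0;
			for (IntWritable v : v3) {
				total += v.get();
				kept.add(v.get()); // copy the int value, not the reused Writable reference
			}
			context.write(k3, new IntWritable(total));
		}
	}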

4. Experimental content [approximate steps]

  • Project 1: Analyze and write the WordCount program ==> Thinking questions 1~4 must be answered in the experiment report; thinking questions 5 and 6 are optional.

[Thinking question 4] Observe the key-value pair types of the two sets of input and output:
<LongWritable key1, Text value1> ==> <Text key2, IntWritable value2> (map())
<Text k3, Iterable<IntWritable> v3> ==> <Text key4, IntWritable value4> (reduce())
Why does MR encapsulate these types? Isn't it fine to use Java's types directly?
MR wraps keys and values in its own Writable types (LongWritable, Text, IntWritable, ...) because they provide a compact, efficient serialization for the network transfer and disk IO between the map and reduce stages, and they let the framework compare and sort keys efficiently. This gives the framework flexibility, scalability and generality; using plain Java types directly would rely on Java's heavier native serialization and lack this flexibility.

[Thinking question 5] Whether it is map or reduce, how does the output data interact with the framework in the end?

In the Map phase, every <k2, v2> pair emitted by the Map function is handed to the framework's collector (Collector); the collector partitions the pairs, and the framework then sorts and groups them by key to form the intermediate <k3, v3> data, which is passed to the Reduce stage.
In the Reduce phase, the Reducer receives <k3, Iterable<v3>>, where the Iterable is an iterator over the v3 values aggregated from the multiple v2 values for that key. The Reducer's task is to aggregate the v3 values of the same key and generate the final <k4, v4> key-value pair.
Finally, the output of all reduce functions is collected by the framework and stored in the specified output file.

[Thinking question 6] Carefully observe the information printed on the console during runtime. How many maps and reduces are involved in the calculation in your environment? How many times are the map() method and reduce() method called? What is the basis?

4. Experimental content [approximate steps]

Project 2: Count the sum of salaries of employees in each department (serialization + department partition + Combiner)

Reference link

[Reference link] (Teacher Liang's blog):
https://blog.csdn.net/qq_42881421/article/details/84133800
Reference notes
The complete code package at the end of the reference article can be imported directly.

  1. Prepare the original csv data file
  2. Write the code:
    (1) Employee.java (serialization model class)    (4) SalaryTotalPartitioner (partitioner class)
    (2) SalaryTotalMapper.java (mapper class)        (5) SalaryTotalMain (main driver class)
    (3) SalaryTotalReducer.java (reducer class)      (6) add the Combiner usage in SalaryTotalMain (see the sketch after these steps)
  3. Package with the Maven command (jar package)
  4. Upload the successfully built jar package to Linux (learn a file-transfer tool on your own)
  5. Start the Hadoop services in Linux and upload the original data file to HDFS
  6. Run the MapReduce task in Linux and view the output.
    Screenshot of the result: (omitted here)
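
For step (6), a hedged sketch of how the Combiner could be registered in SalaryTotalMain. The class names follow the reference blog; whether the existing SalaryTotalReducer can double as the combiner depends on its input/output types matching the map output types <k2, v2>, which is an assumption here:

	// In SalaryTotalMain, after configuring the Mapper and Reducer:
	job.setPartitionerClass(SalaryTotalPartitioner.class); // route records by department
	job.setNumReduceTasks(3);                               // assumed: one reduce task per department partition
	
	// A Combiner runs a reduce-style local aggregation on the map side,
	// shrinking the data shuffled across the network:
	job.setCombinerClass(SalaryTotalReducer.class);         // assumption: reducer types are compatible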

4. Experimental content [approximate steps]

  • Project 2: Count the sum of salaries of employees in each department (serialization + department partition + Combiner) ==> Thinking questions 7 and 8 must both be answered in the experiment report.
    [Thinking question 7] Does Hadoop use Java's native serialization? If yes, explain why; if not, explain why not and point out which serialization mechanism Hadoop uses.

Hadoop does not use Java's native serialization.
Hadoop uses its own serialization framework, Writable.
Java's native serialization has performance and portability limitations
and is not suitable for large-scale data-processing environments. A sketch of the Writable pattern follows.
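
A minimal sketch of the Writable pattern (my own illustration with made-up fields; the real Employee.java in the reference blog has more fields):

	import java.io.DataInput;
	import java.io.DataOutput;
	import java.io.IOException;
	import org.apache.hadoop.io.Writable;
	
	// A value type Hadoop can ship between map and reduce:
	// write() and readFields() must handle the fields in the same order.
	public class EmployeeLike implements Writable {
	
		private int deptNo;
		private int salary;
	
		@Override
		public void write(DataOutput out) throws IOException {
			out.writeInt(deptNo);
			out.writeInt(salary);
		}
	
		@Override
		public void readFields(DataInput in) throws IOException {
			deptNo = in.readInt();
			salary = in.readInt();
		}
	
		public int getDeptNo() { return deptNo; }
		public void setDeptNo(int deptNo) { this.deptNo = deptNo; }
		public int getSalary() { return salary; }
		public void setSalary(int salary) { this.salary = salary; }
	}

The framework creates instances reflectively, so the class must keep a no-argument constructor (the implicit default one is enough here).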

[Thinking question 8] For shuffle, the MR framework sorts by the partition value (P) and by the key. Can we control these two processes more precisely, i.e. flexibly define our own sorting rules?

In the MapReduce framework, partitioning by P and sorting by key ensure that records with the same key are sent to the same Reducer and arrive grouped together.
Partitioning (the P value) can be controlled by implementing a custom Partitioner class.
Key ordering can be controlled by implementing a custom comparator class.
The MapReduce framework provides default rules, but it also allows user-defined ones:
the custom Partitioner and comparator
are registered on the Job by calling setPartitionerClass() and setSortComparatorClass(),
as sketched below.
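
A hedged sketch of both hooks (my own illustration; the key/value types are assumed and may differ from the reference code):

	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.WritableComparable;
	import org.apache.hadoop.io.WritableComparator;
	import org.apache.hadoop.mapreduce.Partitioner;
	
	// Custom partitioning: route records by department number (the "P value").
	class DeptPartitioner extends Partitioner<IntWritable, IntWritable> {
		@Override
		public int getPartition(IntWritable deptNo, IntWritable salary, int numPartitions) {
			return Math.abs(deptNo.get()) % numPartitions;
		}
	}
	
	// Custom key ordering: sort IntWritable keys in descending order.
	class DescendingIntComparator extends WritableComparator {
		protected DescendingIntComparator() {
			super(IntWritable.class, true); // true: create key instances for deserialization
		}
	
		@Override
		public int compare(WritableComparable a, WritableComparable b) {
			return -((IntWritable) a).compareTo((IntWritable) b);
		}
	}

They are then registered in the driver with job.setPartitionerClass(DeptPartitioner.class) and job.setSortComparatorClass(DescendingIntComparator.class).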

Project 3: Statistics on the salary level of all employees (salary partitioning)

[Reference link] (Teacher Liang's blog):
https://blog.csdn.net/qq_42881421/article/details/84328787
Reference notes
Simply import the package; the only class that needs special attention is the Partitioner.

  1. Prepare the original csv data file
  2. Write the code:
    (1) Employee.java (serialization model class)    (4) SalaryTotalPartitioner (partitioner class; the partition rules are rewritten relative to project 2, see the sketch after these steps)
    (2) SalaryTotalMapper.java (mapper class)        (5) SalaryTotalMain (main driver class)
    (3) SalaryTotalReducer.java (reducer class)
  3. Package with the Maven command (jar package)
  4. Upload the successfully built jar package to Linux (learn a file-transfer tool on your own)
  5. Start the Hadoop services in Linux and upload the original data file to HDFS
  6. Run the MapReduce task in Linux and view the output of project 3.
    Screenshot of the result: (omitted here)
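
To illustrate the idea behind the rewritten partition rules, a hedged sketch (my own; it assumes the map output key is the salary as an IntWritable, and the salary thresholds are only for illustration; the actual types and bands in the reference code may differ):

	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.mapreduce.Partitioner;
	
	// Project 3 idea: partition by salary band instead of department, so each
	// reduce task (and therefore each output file) holds one salary level.
	// The driver must call job.setNumReduceTasks(3) so the three bands map
	// to three reduce tasks.
	public class SalaryLevelPartitioner extends Partitioner<IntWritable, IntWritable> {
	
		@Override
		public int getPartition(IntWritable salary, IntWritable value, int numPartitions) {
			int s = salary.get();
			if (s < 1500) {
				return 0;       // band 0: low salary (threshold assumed)
			} else if (s <= 3000) {
				return 1;       // band 1: middle salary (threshold assumed)
			} else {
				return 2;       // band 2: high salary
			}
		}
	}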

Origin: blog.csdn.net/L2489754250/article/details/130053835