Hadoop --- MapReduce

MapReduce definition:

MapReduce can be decomposed into Map (mapping) + Reduce (reduction). The overall process:

  1. Map: the input data set is divided into multiple small pieces (splits) that are assigned to different computing nodes for processing.
  2. Shuffle and Sort: after the Map phase ends, the key-value pairs generated by each Mapper are sorted by key, the values of the same key are grouped together, and records with the same key are sent to the same downstream Reducer.
  3. Reduce: reduction computation. Each computing node independently processes the key-value pairs assigned to it and generates the final output.

        MapReduce is a programming framework for distributed computing programs and the core framework for developing "Hadoop-based data analysis applications". Its core function is to integrate the business logic code written by the user with its own default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

Advantages:

  1. Easy to program: users only need to care about the business logic and implement the framework's interfaces.
  2. Good scalability: servers can be added dynamically to solve the problem of insufficient computing resources.
  3. High fault tolerance: If any machine fails, tasks can be transferred to other nodes.
  4. Parallel processing: It can effectively use multiple computing nodes in the cluster to perform parallel computing and improve processing speed.
  5. Suitable for massive data computation (TB/PB scale) with thousands of servers computing jointly.

Shortcomings:

  1. Not good at real-time computation (millisecond-level queries are better served by a database such as MySQL).
  2. Not good at streaming computation (use Spark Streaming or Flink instead).
  3. Not good at DAG (directed acyclic graph) computation (use Spark instead).

MapReduce architecture: 

        In MapReduce, there are two machine roles for executing MapReduce tasks: JobTracker and TaskTracker, where JobTracker is used for task scheduling and TaskTracker is used for task execution. In a Hadoop cluster, there is only one JobTracker.

        When the Client submits a job to the JobTracker, the JobTracker splits the job into multiple tasks and distributes them to TaskTrackers for execution. Each TaskTracker sends a heartbeat message to the JobTracker at regular intervals; if a TaskTracker stops reporting, its tasks are reassigned to other TaskTrackers.

MapReduce execution process: 

  1.  The client starts a job
  2. The client requests a JobID from the JobTracker
  3. JobClient copies the resources required to run the job to HDFS, including the jar file, configuration files, and the input split information computed by the client, and stores them in a folder named after the JobID.
  4. JobClient submits tasks to JobTracker.
  5. JobTracker schedules the job, creates a map task for each input split according to the split information, and assigns the map tasks to TaskTrackers for execution.
  6. Each TaskTracker sends a heartbeat to the JobTracker at regular intervals to report that it is still running; the heartbeat also carries information such as the progress of its map tasks. When the JobTracker receives the completion message of the last task of the job, it marks the job as "successful", and the JobClient then relays this to the user.

MapReduce parallelism and tuning

        MapReduce parallelism refers to the number of tasks that process data blocks simultaneously when a MapReduce job executes. It is closely related to resource utilization, task execution efficiency, and job completion time. Setting the degree of parallelism correctly improves task parallelism and overall performance.

        Parallelism tuning means optimizing task execution efficiency and resource utilization by setting the parallelism parameters of a MapReduce job reasonably. Parallelism tuning strategies:

  1. Map task parallelism: The computing resources in the cluster can be fully utilized by appropriately increasing the parallelism. mapreduce.job.mapsThe number of Map tasks can be set by adjusting parameters . Normally, the number of Map tasks can be set to the number of input shards or the number of computing slots available in the cluster to make full use of cluster resources. for example:

    1. If the hardware configuration is 2*12core +  64G, the appropriate mapdegree of parallelism is about 20-100 per node map, preferably with mapan execution time of at least one minute each.
    2. If the running time of each map or reduce task of the job is only 30-40 seconds, then reduce the number of map or reduce tasks of the job, and add the setup and reduction of each task (map|reduce) to the scheduler for scheduling. This intermediate process may take a few seconds, so if each task runs very quickly, too much time will be wasted at the beginning and end of the task
  2. Reduce task parallelism: The execution efficiency of tasks can be improved by adjusting the parallelism. You can manually set job.setNumReduceTasks(num), or you can mapreduce.job.reducesset the number of Reduce tasks by adjusting parameters

    1. If the data distribution is uneven, data skew may occur in the reduce stage.
    2. The number of ReduceTasks cannot be set arbitrarily; business logic requirements must also be considered. In some cases a global summary result is needed, so there can be only one ReduceTask.
    3. Try not to run too many ReduceTasks. For most jobs, the best number of reduces is at most equal to, or slightly smaller than, the number of reduce slots in the cluster. This is especially important for small clusters.
  3. Use of Combiner: a Combiner is a function that merges intermediate results after the Map phase and before the data is passed to the Reduce tasks. Setting up a Combiner properly reduces the overhead of data transmission and disk IO, thereby improving task execution efficiency. The rule for using a Combiner is that its presence or absence must not affect the business logic.

    1. Reduced data transmission: The Combiner can send the results of the local merging of Map tasks to the Reduce task in the next stage, reducing the overhead of network transmission.
    2. Reduce disk IO: Combiner results can be merged locally in the Map task, reducing the number of disk reads and writes and the amount of data.
    3. Improve task execution efficiency: By reducing the overhead of data transmission and disk IO, Combiner can speed up task execution and improve overall performance.
  4. Data locality optimization: When setting the degree of parallelism, the locality of the data can be considered, and the task of processing the same data block should be scheduled to the node where the data block is located, so as to reduce the network transmission overhead of data.
  5. Resource tuning: The parallelism tuning of MapReduce tasks also involves the tuning of cluster resources. The degree of parallelism and the overall performance of tasks can be improved by increasing the number of nodes, increasing computing and storage resources, or adjusting memory allocation during task execution.
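
As referenced above, here is a minimal sketch of how these knobs are typically set from the driver (the split-size and reducer values are illustrative only, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

public class ParallelismTuningSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "parallelism-tuning-sketch");

        // Map parallelism is driven mainly by the number of input splits;
        // lowering the maximum split size produces more (smaller) map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // 128 MB, illustrative

        // Reduce parallelism: equivalent to setting mapreduce.job.reduces.
        // The value should reflect cluster capacity and the business logic
        // (a single global summary still needs exactly one reducer).
        job.setNumReduceTasks(8);
    }
}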

   Notice:      

        The Combiner function is not a mandatory part of the MapReduce framework; whether to use it depends on the specific MapReduce program. The premise of using a Combiner is that its input and output key-value types must be the same (they must match the Reducer's input types). A merge operation can be implemented with a custom Combiner class, and for simple aggregations such as sum, max, or min the Reducer class itself can often be reused as the Combiner.
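
For the WordCount example later in this article, summation is associative and commutative, so the Reducer can double as the Combiner. A minimal sketch of wiring it up in the driver (assuming the WordCountReducer class shown below):

// In the driver, after setMapperClass/setReducerClass:
// reuse the Reducer as the Combiner so partial sums are computed on the map side
// and less data is shuffled over the network. This is safe only because addition
// is associative and commutative, so the final result is unchanged.
job.setCombinerClass(WordCountReducer.class);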

Shuffle mechanism:

        The Shuffle mechanism refers to the process of partitioning, sorting, and grouping the intermediate key-value pairs generated by the Mappers by key after the Map stage, and distributing the values of the same key to the corresponding Reducer task for the final aggregation operation.

The shuffle process is mainly divided into three steps: 

  1. Partition: after the Map stage output, the intermediate key-value pairs are divided according to the specified partitions to determine which Reducer task should process which keys. The default partition function partitions by the hash of the key, ensuring that data with the same key is assigned to the same Reducer (see the Partitioner sketch after this list).
  2. Sort: after partitioning, the key-value pairs within the same partition are sorted by key so that identical keys are contiguous after sorting, which is convenient for the subsequent aggregation. Sorting can be done with an external (on-disk) sort or an internal (in-memory) sort, depending on the available memory and the data volume.
  3. Combine and Transfer: after sorting, the values of the same key can be merged and sent to the corresponding Reducer task for aggregation. Network transmission is involved here: data is sent from the Map nodes to the corresponding Reducer nodes. To improve performance, a Combiner function can be used for local merging to reduce the data transmission overhead.
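
As mentioned in step 1, here is a sketch of a custom Partitioner that mirrors the default hash-based behavior (the class name WordPartitioner is made up for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key always land in the same reduce partition.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class); without it, Hadoop's built-in HashPartitioner does the same thing.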

   The size of the Shuffle buffer affects the execution efficiency of the MapReduce program: in principle, the larger the buffer, the fewer disk IO operations and the faster the execution.
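
A minimal sketch of the related knobs (these are the standard Hadoop 2.x/3.x property names; the values shown are illustrative, and a larger sort buffer takes heap away from the map task itself):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

public class ShuffleBufferSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // In-memory sort buffer on the map side, in MB
        // (mapreduce.task.io.sort.mb, default 100). A larger buffer means
        // fewer spill files and less disk IO.
        conf.setInt("mapreduce.task.io.sort.mb", 200);

        // Fraction of the buffer at which a background spill to disk starts
        // (mapreduce.map.sort.spill.percent, default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);

        // The configuration must be in place before the Job is created.
        Job job = Job.getInstance(conf, "shuffle-buffer-sketch");
    }
}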

MapReduce WordCount example: 

Hadoop data types
Java type    Hadoop Writable type
boolean      BooleanWritable
byte         ByteWritable
int          IntWritable
float        FloatWritable
long         LongWritable
double       DoubleWritable
String       Text
Map          MapWritable
Array        ArrayWritable
null         NullWritable
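
A quick illustration of wrapping plain Java values into Writable types and back (a trivial sketch; the class name is made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableConversionSketch {
    public static void main(String[] args) {
        // Java value -> Writable wrapper (the form used for MapReduce keys/values)
        IntWritable count = new IntWritable(1);
        Text word = new Text("hello");

        // Writable wrapper -> plain Java value
        int n = count.get();          // 1
        String s = word.toString();   // "hello"

        System.out.println(s + " -> " + n);
    }
}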

Count the frequency of words in a document

1. Add the pom dependency: 

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency> 

2. Serialization class Writable

[The WordCount case here does not use a custom serialization class; the FlowBeanWritable class below only illustrates the steps.]

Hadoop has its own serialization mechanism - Writable. Compared with Java's serialization, Hadoop's serialization is more compact, fast, and supports multiple languages.

Hadoop serialization steps: 

  1. Implement the Writable interface
  2. The no-argument constructor is called during deserialization, so the serializable class must have a no-argument constructor
  3. Override the serialization method write()
  4. Override the deserialization method readFields()
  5. The field order in deserialization must be exactly the same as in serialization
  6. Override toString() so the result can be displayed readably in the output file
  7. If the custom serializable object is transferred as a key, also implement the Comparable interface, because keys are sorted during the shuffle
import lombok.Data;
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// 1 Implement the Writable interface
@Data
public class FlowBeanWritable implements Writable, Comparable<FlowBeanWritable> {
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    // 2 Provide a no-argument constructor
    public FlowBeanWritable() { }

    // 4 Implement the serialization and deserialization methods;
    //   note that the field order must be exactly the same in both
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }

    // 5 Override toString
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    // 6 If the object is transferred as a key, compareTo must also be implemented
    @Override
    public int compareTo(FlowBeanWritable o) {
        // Descending order by sumFlow, from largest to smallest
        return Long.compare(o.getSumFlow(), this.sumFlow);
    }
}

3. Write the Mapper class, extending the Mapper base class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // 1 Get one line and convert it to a String for processing
        String line = value.toString();
        // 2 Split the String on spaces into a String array
        String[] words = line.split(" ");
        // 3 Take the words one by one, wrap each word and its count into a key-value pair,
        //   and write it to the context for downstream processing
        for (String word : words) {
            // Convert the String to Text first, then write the key-value pair
            outK.set(word);
            context.write(outK, outV);
        }
    }
}

The Mapper<LongWritable, Text, Text, IntWritable> generic type has four type parameters, which actually form two key-value pairs:

  1. LongWritable, Text: the input key-value pair. LongWritable is the offset of the line within the file (roughly a line index); Text is the content of the line read from the file. The framework's default input key-value types are generally used here.
  2. Text, IntWritable: the output key-value pair. Text is the word and IntWritable is the count of the word. This pair is determined by the business requirements.

4. Write the Reducer class, extending the Reducer base class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Accumulate the counts for the same word (key)
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);
        // Write out the result
        context.write(key, outV);
    }
}

The Reducer<Text, IntWritable, Text, IntWritable> generic type also has four type parameters, forming two key-value pairs:

  • Text, IntWritable: the first (input) key-value pair, which must be consistent with the Mapper's output types.
  • Text, IntWritable: the second (output) key-value pair, representing the final result; since the output is the word and its total count, the types are still Text and IntWritable.

reduce() is executed once per group: records with the same key are assigned to the same group, so here we only need to accumulate the counts for each key.

5. Write Driver class 


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration and the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Associate the jar of the current Driver program
        job.setJarByClass(WordCountDriver.class);

        // Specify the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set the key/value types of the map output and the final output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to YARN and wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }

}
