Big data for getting into a big tech company: MapReduce knowledge points (1)


01 Let's learn big data together

What Lao Liu is sharing today is the MapReduce distributed computing module of the Hadoop big data framework. There are a lot of MapReduce knowledge points, so be patient and take your time remembering them; this post covers the first part of MapReduce. Lao Liu put these points together from self-study materials, partly to help students who are interested in big data, and partly to get criticism and guidance from the experts.

02 MapReduce knowledge points

Point 1: The concept of MapReduce

MapReduce is a distributed computing framework that adopts the divide-and-conquer idea. As the name suggests, MapReduce consists of two stages:

Map stage (split the big task into smaller tasks)

Reduce stage (summarize the results of the small tasks)

So what is divide and conquer?

For example, take a complex, computation-heavy, time-consuming task and call it a "big task" for now. If it cannot be computed on a single server, or cannot produce a result within a reasonable time, the big task can be split into many small tasks. The small tasks can then be executed in parallel on different servers, and finally the results of the small tasks are summarized.

Point 2: Introduction to the Map phase

The map stage has a key map() function. The input of this map() function is a (k, v) key-value pair, and its output is also a series of (k, v) key-value pairs; the output is eventually written to local disk.

Point 3: Introduction to the Reduce phase

The reduce stage has a key reduce() function. The input of this reduce() function is also key-value pairs (namely the key-value pairs output by the map stage), and its output is again a series of (k, v) key-value pairs; the final result is written to HDFS.
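For reference, here is a minimal sketch of what these two functions look like in Hadoop's org.apache.hadoop.mapreduce API; the class names and the concrete generic types (LongWritable/Text in, Text/IntWritable out) are just assumptions matching the word-count example that follows:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// (k, v) in, (k, v) out on the map side; intermediate output later lands on local disk
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // context.write(new Text(...), new IntWritable(...));
    }
}

// (k, v) in (grouped by key), (k, v) out on the reduce side; final output lands in HDFS
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // context.write(key, new IntWritable(...));
    }
}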

The introductions of points 2 and 3 can be summarized in one picture.
(figure omitted: map() writes intermediate (k, v) pairs to local disk, reduce() writes final results to HDFS)
Point 4: MapReduce programming

After reading the concepts above, Lao Liu was at the time only half understanding them, so he looked at the programming example given in the material; after working through the principle of the example and every line of code, he suddenly saw the light and was delighted!

Let's take MapReduce word-frequency counting as an example: count the total number of occurrences of each word in a batch of English articles.

Schematic diagram of MapReduce word-frequency counting:
(figure omitted: word-count flow through 3 map tasks and 4 reduce tasks)
The picture is a bit blurry, so just take a rough look. First, here is the key code on the map side and on the reduce side; the complete code is shared at the end.

On the map side:

String line = value.toString();
// split on \t to get all the words in the current line
String[] words = line.split("\t");

for (String word : words) {
    // turn each word into a key-value pair (word, 1) and emit it
    // before emitting, kout and vout must be wrapped in the corresponding serializable types,
    // e.g. String corresponds to Text, int corresponds to IntWritable
    context.write(new Text(word), new IntWritable(1));
}
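
For context, here is a minimal sketch of the full Mapper class that snippet could live in, assuming Hadoop's org.apache.hadoop.mapreduce API; the class name WordCountMapper is made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the current line, value is the line itself
        String line = value.toString();
        String[] words = line.split("\t");
        for (String word : words) {
            // reuse the Text/IntWritable objects instead of allocating new ones each time
            outKey.set(word);
            context.write(outKey, ONE);
        }
    }
}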

On the reduce side:

// variable used to accumulate the number of occurrences of the current word
int sum = 0;

for (IntWritable count : values) {
    // take the value out of count and add it to sum
    sum += count.get();
}
// emit the word and its count as a key-value pair
context.write(key, new IntWritable(sum)); // output the final result
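
Likewise, a minimal sketch of the Reducer class around that snippet, again assuming the org.apache.hadoop.mapreduce API; WordCountReducer is an assumed name:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values holds all the 1s emitted for this word across all map tasks
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // emit (word, total count); the reduce output is written to HDFS
        context.write(key, new IntWritable(sum));
    }
}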

Regarding the code, what Lao Liu wants to say is that every parameter in MapReduce has to be figured out! The above is just the core code; although it looks simple, there are many details and all kinds of parameters that must be worked through carefully.

Next, let’s talk about this process.
In the Map phase:

Assume that the MR input file has three blocks: block1, block2, and block3. Each block corresponds to a split, and each split corresponds to a map task.

Looking at the picture above, there are 3 map tasks (map1, map2, map3). The logic of these 3 tasks is roughly the same, so we will only walk through the first one.

map1 reads the data of block1, one line at a time. The byte offset of the start of the current line is used as the key (0 for the first line), and the content of the current line is used as the value.

This key-value pair is then passed into map() as its parameters, and map() is called.

Inside map(), you write whatever code the job needs. For word counting, the line held in value is split to get the three words Dear, Bear, and River.

Each word is turned into a key-value pair and emitted, giving (Dear, 1) | (Bear, 1) | (River, 1).

The result is finally written to the local disk of the node where the map task runs (there are many details here that involve shuffle, which is expanded on later in this article).

After the first line of the block is processed, the second line is handled in the same way. When the map task has processed all the data in the current block, the map task ends.

In the Reduce phase:

The number of reduce tasks is specified in your own program: calling job.setNumReduceTasks(4) in main() specifies 4 reduce tasks (reduce1, reduce2, reduce3, reduce4).
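
To make the setNumReduceTasks(4) call concrete, here is a hedged sketch of a driver main() for the word-count job; it reuses the hypothetical WordCountMapper and WordCountReducer class names from the sketches above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(4);              // 4 reduce tasks, as in this walkthrough
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}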

The logic of each reduce task is similar, so take the first reduce task for analysis.

After the map1 task finishes, reduce1 connects to map1 over HTTP and pulls, across the network, the data in map1's output that belongs to reduce1's partition (the default hash partitioning is used). It connects to map2 and map3 in the same way and fetches their results too.

In the end, the reduce1 side obtains 4 (Dear, 1) key-value pairs. Because their keys are the same, they are grouped together: the 4 (Dear, 1) pairs become (Dear, Iterable(1, 1, 1, 1)), which is passed into reduce() as its two parameters.

In reduce(), the total count for Dear is computed as 4, and (Dear, 4) is output as a key-value pair. The final output file of each reduce task is written to HDFS. Shuffle is involved here as well.

That is roughly the whole process. After the code is written, it is packaged into a jar and run on the Hadoop cluster.
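
Submitting the packaged job typically looks something like the line below; the jar name, main class, and HDFS paths are just placeholders:

hadoop jar wordcount.jar WordCountDriver /input/articles /output/wordcount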

After finishing this example, Lao Liu wants to say that there are many MapReduce examples, and this is just one of them; you can look up more on your own. Lao Liu will keep sharing MapReduce programming; the code for two more examples, data cleaning and user search, is placed at the end for everyone to look at.

Point 5: shuffle

This is an important point, a very important one. Shuffle mainly refers to the process by which the output of the map side becomes the input of the reduce side.

First, the detailed picture of it:
(figure omitted: detailed shuffle diagram covering map-side spill, partition, sort, and merge, then reduce-side fetch and merge)
Partitioning is done with a partitioner; the default partitioner is HashPartitioner, whose code is as follows:

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

  public void configure(JobConf job) {}

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
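
The class above is the older org.apache.hadoop.mapred version (note the JobConf-based configure()). If you want to plug your own partitioning logic into the newer org.apache.hadoop.mapreduce API, it would look roughly like this hedged sketch; the "first letter decides the partition" rule is just an invented example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: route words to reducers by their first character
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return first % numPartitions;  // char is non-negative, so the result stays in range
    }
}

In the driver it would be enabled with job.setPartitionerClass(FirstLetterPartitioner.class).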

Next, let's go through the specific process.

On the map side:

1. Each map task has a corresponding circular (ring) memory buffer. The map output is kv pairs, which are written into the ring buffer (default size 100 MB). When the contents occupy 80% of the buffer space, a background thread spills the data in the buffer to a disk file.

2. During the spill, the map task keeps writing data into the ring buffer. However, if data is written faster than it is spilled and the 100 MB eventually fill up, the map task pauses writing into the ring buffer and only the spill continues; once all the data in the ring buffer has been spilled to disk, writing into the buffer resumes.

3. The background thread's spill to disk goes through several steps:

① Each spilled kv pair is first partitioned. The number of partitions is determined by the number of reduce tasks of the MR program; by default HashPartitioner is used to compute which partition the current kv pair belongs to, with the formula (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.

② Within each partition, the kv pairs are sorted in memory by key;

③ If a map-side combiner (local aggregation) is set, the sorted data in each partition is run through the combiner;

④ If map output compression is enabled, the spilled data is compressed (see the configuration sketch at the end of this map-side section);

Note: Lao Liu read a lot of material that never mentioned the sorting in step ②, so he dug up some material himself. This sort is Hadoop's default behavior: the data in any application gets sorted, whether or not it is logically needed.

Sorting comes in 5 flavors: partial sort, total sort, total sort using a partitioner, auxiliary sort, and secondary sort. The specific concepts are not covered here; look them up yourself!

4. As data keeps being written into the ring buffer, the spill is triggered multiple times (each time the spill threshold is reached), so the local disk ends up with multiple spill files.

5. Before the map task finishes, all the spill files are merged into one big spill file, which is a partitioned and sorted output file.

Here are a few small details worth remembering:

When merging spill files, if there are at least 3 spill files and a map-side combine is set, the combine operation is triggered during the merge;

However, if there are only 1 or 2 spill files, the combine is not triggered (because the combine is essentially a reduce, running it carries a certain overhead).
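
To make the knobs above concrete, here is a hedged sketch of how they are typically set in the driver's main(), assuming the Hadoop 2.x/3.x property names; the Snappy codec and reusing WordCountReducer as the combiner are just example choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// ring buffer size (default 100 MB) and spill threshold (default 0.80)
conf.setInt("mapreduce.task.io.sort.mb", 100);
conf.set("mapreduce.map.sort.spill.percent", "0.80");
// compress the map output before it is spilled and shuffled to the reducers
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

Job job = Job.getInstance(conf, "word count");
// for word count, the reduce logic can double as the map-side combiner
job.setCombinerClass(WordCountReducer.class);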

On the reduce side:

1. After each map task finishes running, the reduce task fetches, over HTTP, the data (many kv pairs) belonging to its own partition from that map task's output.

2. If the fetched map output is relatively small, it is first kept in the reduce task's JVM memory; otherwise it is written directly to the reduce side's disk.

3. Once the memory buffer reaches its threshold (0.66 by default), or the threshold on the number of map outputs held in memory is reached (1000 by default), a merge is triggered and the result is written to local disk. If a combine is specified in the MR program, it is executed during the merge.

4. As the spilled files accumulate, a background thread merges them into larger, sorted files.

5. After the reduce task has copied the output of all the map tasks, it merges all the spill files on disk, by default 10 at a time (see the configuration sketch after this list). In the last round of merging, part of the data comes from memory and part from files on disk.

6. Then comes the merging, sorting, and grouping stage.

7. reduce() is called once for each group of data.

8. The final results are written to HDFS.
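
The reduce-side thresholds and the merge factor mentioned above map onto configuration properties roughly like this (a hedged sketch; conf is the same Configuration object used in the driver sketch earlier, and the values shown are the usual defaults):

conf.set("mapreduce.reduce.shuffle.merge.percent", "0.66");   // in-memory merge threshold
conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);  // max number of map outputs held in memory
conf.setInt("mapreduce.task.io.sort.factor", 10);             // how many files are merged in one round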

03 Summary of knowledge points

To be honest, there are a lot of knowledge points today. When you first start learning them, some things really are hard to understand, but once you have them down they are genuinely useful for learning Spark and Flink later; you will find that many of the principles in those frameworks are roughly the same. So please remember what was shared today.

The last thing to say is that the MapReduce programming examples are all posted on Lao Liu's official account; anyone interested can go take a look.

If you need something, contact the official account: Lao Liu who works hard; if not, just follow Lao Liu and learn big data.


Origin blog.csdn.net/qq_36780184/article/details/109882098