Hard-core MapReduce principle analysis: this one article is all you need

Introduction

Ever since Mr. Meow told Xiaobai the story of HDFS, Xiaobai has been hungry for more, pestering Mr. Meow all day to keep going and explain the other important piece of the Hadoop framework: the MapReduce computing framework.

Table of Contents

  1. What is MapReduce
  2. Principles of MapReduce

1 What is MapReduce

Mr. Meow said: "MapReduce is a calculation framework of hadoop. To put it bluntly, hdfs is responsible for storage. Then things like other statistics and calculations will be handed over to MapReduce, which is divided into map process and reduce process.
Map process It is dismantling. For example, there is a red car and a group of workers disassembled it into parts. This is Map"

[Figure: Map, workers disassembling a car into parts]

The reduce process is assembly. We have a pile of auto parts, plus many other parts from various devices; assemble them into a Transformer. That is Reduce.

[Figure: Reduce, assembling parts into a Transformer]

Xiaobai said: "That sounds vivid, but what exactly are the map process and the reduce process?"
"Let me explain it to you slowly," Mr. Meow said, swallowing.

2 Principles of MapReduce

Mr. Meow said: "First of all, let's look at the following data-the student's information record sheet"

[Figure: student information record table]

  1. We can filter the records to keep only those whose gender is 1;
  2. We can convert 1 to male and 0 to female in the gender field;
  3. We can also expand the record with an address field;

The process above is map: mapping (filtering/converting/expanding) one record at a time
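
A minimal sketch of the idea in plain Python (not the Hadoop API), with hypothetical student records; each record is filtered, converted, and expanded independently of the others:

    # map as a per-record operation: filter, convert, expand (one record at a time)
    students = [
        {"name": "Tom", "sex": 1, "major": "java"},
        {"name": "Lucy", "sex": 0, "major": "python"},
        {"name": "Jack", "sex": 1, "major": "c"},
    ]

    def map_record(record):
        if record["sex"] != 1:         # filtering: keep only records with sex == 1
            return None
        out = dict(record)
        out["sex"] = "male"            # converting: 1 -> male
        out["address"] = "unknown"     # expanding: add an address field (hypothetical)
        return out

    mapped = [r for r in (map_record(s) for s in students) if r is not None]
    print(mapped)                      # each record was processed independently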

Xiaobai said: "I feel that the principle of map is similar to the grammar of mysql, select * from student where sex=1, both of which process data one by one."
Mr. Meow: "Well, we can teach you how to Let's continue to look at the reduce process: when we want to count the total number of students in each major, we need to group python, java, and c, and use this group as a unit for statistical calculations."

The process above is reduce: the calculation is performed group by group

Xiaobai said: "Isn't this the principle of group by in mysql, the statistics are based on the group"
Mr. Meow added: "The idea is similar to that of mysql's group by"
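
A matching sketch of reduce in plain Python, again with hypothetical records: bucket the records by major, then compute one result per group, just like GROUP BY:

    # reduce as a per-group calculation: bucket records by key, then aggregate
    from collections import defaultdict

    students = [
        {"name": "Tom", "major": "java"},
        {"name": "Lucy", "major": "python"},
        {"name": "Jack", "major": "java"},
        {"name": "Mary", "major": "c"},
    ]

    groups = defaultdict(list)
    for s in students:                 # group records by the major field
        groups[s["major"]].append(s)

    counts = {major: len(members) for major, members in groups.items()}
    print(counts)                      # {'java': 2, 'python': 1, 'c': 1}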

[Figure: grouping records by major and counting each group]

Finally, Mr. Meow concluded: "The input data is mapped record by record (the map method), which outputs kv key-value pairs; the pairs are then grouped, each group is fed as a unit into reduce for calculation, and the final result is output."
Xiaobai continued to ask: "Well, I understand the general process of MapReduce now, but how does it fetch data from HDFS, and how do the two sides interact?"
Mr. Meow: "Good question, Xiaobai; it seems you are getting quite advanced. Let's take a look at the MapReduce interaction diagram. MapReduce is divided into 4 steps."

[Figure: MapReduce interaction diagram]

The 4 steps:

  1. The map task uses a split to fetch data from HDFS; one split corresponds to one map method, which outputs data in key, value, partition format.
  2. The map task puts the emitted data in memory, then partitions and sorts it.
  3. The reduce task now knows which partition each key is in, and pulls the data from the corresponding file partition (dfs).
  4. The reduce task performs the calculation and outputs the final data (see the sketch after this list).
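
These 4 steps can be simulated end to end in a few lines of plain Python (a toy sketch, not the Hadoop API; the partitioner below is a made-up stand-in):

    # toy pipeline: split -> map emits (partition, key, value) -> sort -> reduce
    from itertools import groupby

    NUM_REDUCERS = 3
    lines = ["java python", "mysql java", "python java"]  # stand-in for HDFS data

    splits = [lines[i:i + 1] for i in range(len(lines))]  # step 1: one split per map

    kvp = []                                              # step 2: map emits (p, k, v)
    for split in splits:
        for line in split:
            for word in line.split():
                p = sum(map(ord, word)) % NUM_REDUCERS    # toy partitioner: key -> reducer
                kvp.append((p, word, 1))

    kvp.sort(key=lambda t: (t[0], t[1]))                  # sort by (partition, key) in memory

    for p in range(NUM_REDUCERS):                         # steps 3-4: each reduce task pulls
        part = [t for t in kvp if t[0] == p]              # its own partition and aggregates it
        for word, group in groupby(part, key=lambda t: t[1]):
            print("reducer", p, "->", word, sum(v for _, _, v in group))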

"Why map doesn't get the data directly from hdfs, do you have to use split to get it in the middle?" Xiaobai scratched his head and looked at Mr.
Meow. Mr. Meow nodded and said to Xiaobai: "This question is very good. The default size of split is equal to hdfs. One of the block blocks is about 64M, but the size of split can be adjusted to deal with different calculation types.
When we run CPU-bound (computation-intensive), we can set the split to be smaller, and multiple splits correspond to 1 block. This can increase the calculation speed

[Figure: CPU-bound, multiple splits corresponding to one block]

When we run an IO-bound (IO-intensive) job, the split can be set a bit larger, so that 1 split corresponds to N blocks, which improves the efficiency of IO reads and writes.

[Figure: IO-bound, one split corresponding to N blocks]

CPU-bound (computation-intensive):
Imagine a math problem whose statement is only one line.
Reading it takes 1 second, but solving it takes a month.
That is CPU-bound (CPU utilization is almost 100%).
IO-bound (IO-intensive):
Imagine a math problem whose statement is as thick as the Records of the Grand Historian.
Reading it takes 2 months, but the question is only 1+1=?
That is IO-bound (the CPU sits in the IDLE state).
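
For reference, Hadoop's FileInputFormat derives the split size as max(minSize, min(maxSize, blockSize)), with the bounds adjustable via mapreduce.input.fileinputformat.split.minsize and .maxsize; here is a small sketch of the effect on map parallelism (the file and block sizes below are illustrative):

    # how split size sets map parallelism: split = max(min, min(max, block))
    import math

    def compute_split_size(block_size, min_size=1, max_size=float("inf")):
        return max(min_size, min(max_size, block_size))

    block = 64 * 1024 * 1024                  # a 64M block, as in the text
    file_size = 10 * block                    # a hypothetical 640M file

    cpu_split = compute_split_size(block, max_size=block // 4)  # CPU-bound: smaller splits
    io_split = compute_split_size(block, min_size=block * 4)    # IO-bound: larger splits

    for name, size in [("cpu-bound", cpu_split), ("io-bound", io_split)]:
        print(name, "split =", size // (1024 * 1024), "M -> map tasks:",
              math.ceil(file_size / size))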

Xiaobai concluded: "So the split controls the parallelism of map: it determines how many map tasks are started, one split per map method, each outputting k, v, p key-value-partition triples."
"Why put the kv pairs output here into memory? Although memory is about 100,000 times faster than the hard disk, the data will eventually be written to disk anyway. Isn't that like taking off your pants to fart?" Xiaobai asked anxiously.
"Well, crude words, but the idea is not crude. The kv pairs output by map are put into a 100M memory buffer, where one more important thing is done: the k, v, p data is sorted so that records under the same partition p are placed together, and within the same partition the keys are sorted, which lets the following reduce do a merge sort," Mr. Meow explained.
"You slow down, I'm so confused, let me give you an example"
"Let's take a look at the following example, count the number of occurrences of java\python\mysql" Mr. Meow quickly drew a picture

[Figure: word-count example for java\python\mysql across the input/split/map/shuffle/reduce stages]

Input stage: the files containing the java, python, and mysql information are stored in blocks on HDFS.
Split stage: split cuts up the files on HDFS; among them, file partitions 0, 2, 3, 15, 16, 17, and 205 store the java\python\mysql information.
Map stage: for the data containing java\python\mysql information on each partition, output kvp key-value pairs; for example, (java, 1, 0) means one piece of java information stored on partition 0.
Shuffle stage: sort the same group of data in memory; for example, java appears on partitions 0, 3, 15, and 205.
Reduce stage: finally, the reduce task goes to the specified file partitions, according to the sorted output of the shuffle stage, and fetches the corresponding files.
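
A sketch of the reduce side of this example, assuming each map task left behind a run already sorted by key (the runs below are illustrative): because the runs are pre-sorted, reduce only needs a cheap merge sort before totaling each key.

    # merge pre-sorted runs from the map side, then total the values per key
    import heapq
    from itertools import groupby

    runs = [                                   # e.g. pulled from partitions 0, 3, 15, 205
        [("java", 1), ("python", 1)],
        [("java", 1)],
        [("java", 1), ("mysql", 1)],
        [("java", 1), ("python", 1)],
    ]

    merged = heapq.merge(*runs)                # cheap because every run is sorted
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        print(key, sum(v for _, v in group))   # java 4, mysql 1, python 2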
"It's amazing. It seems that sorting in the memory is really important. It effectively reduces the number of file reads. Reading multiple times at a time will speed up the corresponding processing speed." Xiaobai suddenly realized.

"There is another question. In the above example, the number of keys is 3 (java\python\mysql) and the number of reduce tasks is also 3. Is the number of keys equal to the number of reduce?" Xiaobai asked Tao,

"The observation is very careful. The number of reduce is controlled by the programmer's code, but the number of keys is not completely equal to the number of reduce. What if there are 100,000 keys? Then the number of reduce needs 100,000. Is it? There are definitely not so many resources, so it is generally determined by the number of reduce executors in the specific server resources." Mr. Meow added.

"In addition, it should be noted that if the data volume of the key is not uniformly distributed, the problem of data skew may occur. If there are 2 keys-1 male and 1 female, the male data volume is 10T, and the female data volume is 10T. The data is only 1G. In this case, the system will follow reduce to process the same key, and the same key will be assigned to the same reduce executor, then one reduce executor will process 10T of data, and another reduce executes The processor processes 1G data, which becomes the data tilt.” Mr. Meow continued to add.

"How to solve it, the data is skewed", Xiaobai asked,
"Then let me tell you about the next order~"

More useful content on the WeChat public account [Data Ape Wen Da]

Follow to get the official Hadoop: The Definitive Guide



Origin: blog.51cto.com/14974545/2543122