Hadoop (3) – MapReduce

MR

The MR framework: at the macro level there are MapTasks and ReduceTasks, with a dependency between them: map runs before reduce. The map side performs the intermediate-level KV mapping and produces a normalized data set, which reduce then processes.
Map: the computation framework of the MapTask. So how is the number of maps determined?

There are as many maps as there are splits, and each map runs where its block resides. In fact, there is a one-to-one correspondence between map and split objects: the number of map tasks a job needs depends on the number of splits, not directly on the blocks.

At the start, the file is divided into blocks and distributed across different nodes. The next step is to move the map computation to the servers where those blocks reside. The number of blocks can be smaller than the number of maps, because a map corresponds to a split, not a block. The block is a physical chunk that is actually cut from the file; the split is a logical slice. Nothing is ever physically cut for a split: it is only a logical plan describing how much data a region covers, and by default the size of that region is determined by the block.

Why not just cut the file into, say, 10 blocks and run 10 programs over them?

Suppose a block holds so much data that processing that single block would take a year. What causes the long runtime? The block is too big. The concept of the "split" was introduced for flexibility: on top of the physical blocks, the data is further divided into several logical slices.

To know how many maps a job needs, first determine the number of splits; the number of splits in turn depends on the number and locations of the blocks. From these existing facts the number of splits is known, and the number of maps follows.
The map completes the intermediate-level mapping of KV. Suppose the task is to compute the average house price in each major Chinese city.

To compute an average, all the map needs to do is the intermediate KV mapping: k is the city name, v is the house price. Reduce then does the actual calculation: merge records with the same key into one group, compute the total, and derive the average. This is essentially grouping, a group-by.
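A minimal sketch of this in Hadoop's Java API (the class names, the comma-separated input format, and the field layout are assumptions for illustration, not from the original post):

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map side: only the intermediate KV mapping. For an input line like
// "Beijing,52000", emit (city, price) and nothing more.
public class CityPriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Text city = new Text();
    private final DoubleWritable price = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length == 2) {
            city.set(fields[0]);
            price.set(Double.parseDouble(fields[1]));
            context.write(city, price);
        }
    }
}

// Reduce side: receives one city's prices as a group (the group-by),
// sums them, and outputs the average.
class AvgPriceReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text city, Iterable<DoubleWritable> prices, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable p : prices) {
            sum += p.get();
            count++;
        }
        context.write(city, new DoubleWritable(sum / count));
    }
}
```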

Two questions arise: 1. records with the same key must be computed within one group; 2. how many reduces are needed?

What determines the number of reduces? The conventional way of thinking is one key per reduce; but if different groups are assigned to the same reduce, one reduce can complete several groups of keys. The one hard rule of the MR framework, its data distribution strategy, is that the same group of keys is sent to a single reducer. One reduce may handle multiple groups of keys, but one group of keys can only be completed by one reducer.

The number of reduces therefore depends on the business and on the data.

The MR framework is divided into 4 stages: split (slicing); map (once the logical slices are known, map tasks can be created and start computing); shuffle (sorting and redistributing the data); and reduce.


The shuffle process is about how the data gets sorted and moved from the map side to the reduce side.

Process: the input is split first, and by default a split is the size of a block. The map-side buffer is sorted by partition, and there are as many partitions as there are reducers. After sorting, each partition represents one city (in the house-price example).

Before the reducers pull the data, a lot of work is completed in the map-side buffer:

  1. Sort by partition first; partition here means: there are as many partitions as there are reduces;
  2. Within each partition, records with the same key are brought together into a group;

Besides sorting, the in-memory side can also run a combiner and compress the data.

When a map generates a key-value pair, it also generates a partition number, which records which reducer the record belongs to. The first sort is by partition; partition numbers correspond to reducers, so each reducer knows which partition's data it should pull. The second sort happens inside each partition. The reason is that a reduce may process not just one group but several groups of data, so sorting by key within the partition makes each group's records contiguous; the data the reduce side receives is thus regular data, already sorted twice.
Each time the buffer fills with data (128 MB in the post's example), the overflow is spilled to a small file. By the time the entire batch of data has been processed, a pile of small spill files exists. These small files are then merge-sorted into one sorted file, which is finally handed to reduce.
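The partition number is typically computed by hashing the key modulo the number of reduces. As a sketch, this mirrors the logic of Hadoop's built-in HashPartitioner (the class name CityPartitioner is illustrative):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Same logic as Hadoop's default HashPartitioner: equal keys always get the
// same partition number, and the number of partitions equals the number of
// reduce tasks, so each reducer knows exactly which partition to pull.
public class CityPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```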

To summarize the MapReduce computation process:

  1. First, split into slices. The principle of slicing: the default slice size equals the block size; the number of slices determines the number of maps;
  2. Perform the intermediate KV mapping, generating each KV pair together with the partition it belongs to. Each line's record is written into the buffer along with its partition number; when the buffer fills, the overflow is spilled to a small file;
  3. In the buffer, sort twice: by partition, and then by key inside each partition;
  4. Once the small files are written, an overall merge is completed through quicksort or merge sort, forming one ordered large file;
  5. Because multiple maps run in parallel, the shuffle phase ends up handing many such files to reduce, and reduce cannot process them one file at a time. These files must therefore be merged into 1 to n fully ordered files, with a final merge step feeding reduce; the only guarantee needed is that the data reduce receives is completely ordered. A driver wiring these stages together is sketched below.
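As a sketch, a minimal driver for the house-price job above might look like this (AvgPriceJob and the reduce count of 2 are assumptions for illustration; the input/output paths come from command-line arguments):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgPriceJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "avg house price");
        job.setJarByClass(AvgPriceJob.class);

        job.setMapperClass(CityPriceMapper.class);   // stage 2: intermediate KV mapping
        job.setReducerClass(AvgPriceReducer.class);  // stage 5: reduce on merged, ordered data
        job.setNumReduceTasks(2);                    // as many partitions as reduces

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Stage 1: the input format computes the splits (by default, one per block).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```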

Word count

Four stages: splitting, intermediate-level mapping, data shuffling, data iteration.

  1. Splitting: cut by row, reading one row at a time. Because the map side does the KV mapping, the most convenient unit is one row of data; the split hands over one line at a time;
  2. Complete the intermediate KV mapping: form the KV pair and generate its partition number. The partition number is generated from the number of reduces; the advantage of hashing is that the same k always lands in the same partition;
  3. Shuffle: bring all occurrences of the same word together;
  4. In reduce, the counts for each word are accumulated and the result is output, as in the sketch below;
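A minimal word-count sketch of these four stages (the classic example; class names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Stage 2: the split hands over one line at a time; emit (word, 1) per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // the partition number derives from the word's hash
        }
    }
}

// Stage 4: the shuffle has already grouped identical words; just add them up.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```

Because addition is associative, the same reducer class could also be registered as a combiner (job.setCombinerClass(WordCountReducer.class)) to pre-aggregate inside the map-side buffer, the combiner mentioned earlier.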

The framework

K is the extracted common feature of the data, while the value depends on the business requirement. On the reduce side, the full data of one group or of several groups can be processed. The rule for invoking reduce is: each group triggers exactly one call; when a record's k no longer belongs to the current group, that reduce call automatically ends. The data reduce receives is already regular, so it does less work than map.

KV uses custom data types. Primitive Java types are not supported as-is: k and v must be Hadoop classes, self-defined classes, or wrapper-style classes, expressed through Java type parameters (generics). Because KV pairs are serialized and deserialized during transfer, the transferred KV types must implement the serialization interface. Besides serialization, they must implement one more interface, a comparator (Comparable); the purpose of the comparator is sorting.
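In Hadoop's Java API these two requirements are usually met by implementing WritableComparable. A sketch of such a custom key type (the CityKey class is a hypothetical example, not from the original post):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A custom key type: write/readFields handle serialization for the transfer
// between map and reduce, and compareTo drives the sorting in the shuffle.
public class CityKey implements WritableComparable<CityKey> {
    private String city = "";

    public CityKey() {}                          // no-arg constructor required by the framework
    public CityKey(String city) { this.city = city; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(city);                      // serialize
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        city = in.readUTF();                     // deserialize in the same field order
    }

    @Override
    public int compareTo(CityKey other) {        // used when sorting keys
        return city.compareTo(other.city);
    }

    @Override
    public int hashCode() {                      // used by the partitioner
        return city.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CityKey && city.equals(((CityKey) o).city);
    }
}
```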

MapReduce in Hadoop 1.x

(Figure: the MapReduce architecture in Hadoop 1.x.)

Origin: blog.csdn.net/qq_29027865/article/details/111831431