MapReduce Distributed Computing (2)

MapReduce workflow

Raw Data (File)

A large input file (for example, 1 TB of data) is divided into blocks and stored on HDFS; each block is 128 MB by default.

Data Block (Block)

A block is the unit of data storage on HDFS, and all blocks of the same file have the same size.
Because data stored on HDFS is immutable, the number of blocks may not match the computing capacity of the cluster, so we need a separate unit whose size we can adjust dynamically to control how many nodes participate in the computation.

Slice (Split)

A split is a purely logical concept.
Without changing how the data is stored, the split size controls how many nodes participate in the computation. There are as many MapTasks as there are splits.

Generally the split size is kept at an integer multiple or fraction of the block size (for example 2x a block, or 1/2 of one) to avoid creating extra splits and extra data connections.
If Split > Block, fewer computing nodes (MapTasks) are used.
If Split < Block, more computing nodes (MapTasks) are used.
By default the split size equals the block size, i.e. 128 MB, and one split corresponds to one MapTask.
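
As a hedged illustration, the split size can be influenced from the job driver. The calls below are from Hadoop's standard FileInputFormat, which computes splitSize = max(minSize, min(maxSize, blockSize)); the wrapper class SplitSizeExample is just for this sketch.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void configure(Job job) {
        // Raise the minimum split size above the 128 MB block size so one split
        // spans two blocks: Split > Block, fewer MapTasks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // Or cap the maximum split size below the block size so one block is cut
        // into several splits: Split < Block, more MapTasks.
        // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```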

MapTask

A MapTask reads data from its split, by default one line at a time (the default record reader), into memory.
We can then write our own word-splitting logic (separating on spaces) and count how often each word occurs, producing temporary data (conceptually a Map<String,Integer>) that is held in memory.
Memory, however, is limited: if several tasks run at the same time there can be an out-of-memory error (OOM), while writing the data straight to disk is far too slow. We need an effective compromise between OOM on one side and low efficiency on the other:
keep part of the data in memory for now, then write it out to disk.
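
A minimal word-count mapper along these lines, using Hadoop's standard Mapper API (the class name WordCountMapper is just for this sketch, not from the original post):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Called once per line: the key is the byte offset of the line in the split,
// the value is the line itself (produced by the default line record reader).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on spaces and emit (word, 1) for each token;
        // the emitted pairs go into the in-memory ring buffer described next.
        for (String token : value.toString().split(" ")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```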

Ring Data Buffer

This memory area is reused in a circular fashion, which shortens the time the map has to pause when data is spilled.
Each MapTask has a memory area of its own.
A ring data buffer (kvBuffer) is built in memory, 100 MB by default, with a spill threshold of 80%: once the data in the buffer reaches 80 MB, it starts to overflow-write (spill) to disk.

While the spill is in progress, the remaining 20 MB of the buffer can still be used, so efficiency does not drop; because data is written to disk in this circular fashion, there is no need to worry about OOM.
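
The buffer size and spill threshold correspond to standard Hadoop configuration keys; a minimal sketch of setting them explicitly (the values shown are the defaults, and the SpillTuning class is just a wrapper for this example):

```java
import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 100);                // kvBuffer size in MB
        conf.setFloat("mapreduce.task.io.sort.spill.percent", 0.80f); // start spilling at 80%
        return conf;
    }
}
```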

Partition

The partition number is calculated directly from the key, and the number of partitions equals the number of Reduce tasks:
hash(key) % numReduceTasks = partition number
The default partitioning algorithm is to hash the key and then take the remainder.
Recall the Object hashCode() / equals() contract:
if two objects are equals(), their hash codes must be equal;
if two objects have equal hash codes, the objects are not necessarily equals().
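
A partitioner that mirrors this default hash-and-modulo behaviour (Hadoop's built-in HashPartitioner works the same way; the class name WordPartitioner is just illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask with Integer.MAX_VALUE so a negative hashCode() still yields a
        // partition number in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key, the same key always lands in the same partition and therefore reaches the same reduce.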

Sort

The data about to be spilled is sorted (QuickSort), first by partition and then by key --> records of the same partition end up together, and records with the same key end up together.
The small files spilled later are therefore already ordered.

Spill

The data in memory is written to disk in this circular fashion, so there is no need to worry about OOM.
Each spill produces a file of roughly 80 MB.
If a Map generates a lot of data, several files may be spilled.

Merge

Spilling produces many small files that are ordered (by partition and key), and their number is not known in advance, which would make transferring data to reduce later very troublesome.
The small files are therefore merged into one large file, and later pulls read directly from that large file.
While the small files are being merged they are sorted again (merge sort), so the final result is one ordered large file.
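
How many spill files are merged in a single pass is governed by a standard Hadoop key (default 10); a small hedged sketch of setting it (MergeTuning is just a wrapper for this example):

```java
import org.apache.hadoop.conf.Configuration;

public class MergeTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Number of spill files merged per round; a larger value means fewer
        // merge rounds at the cost of more files open at once.
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        return conf;
    }
}
```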

Combiner

The bandwidth of the cluster limits the number of MapReduce jobs, so data transfer between the map and reduce tasks should be kept as small as possible. Hadoop allows the user to process the map output locally: the user can define a combiner function (written like the map and reduce functions), and its logic is usually the same as that of the reduce function. The combiner's input is the map output, and the combiner's output is used as the reduce input. In many cases the reduce function can be used directly as the combiner function
(job.setCombinerClass(FlowCountReducer.class);).

The combiner is an optimization, so it is impossible to say how many times the combiner function will be called: it may be called when the ring buffer spills a file, and again when the spilled small files are merged into a large file. It must therefore be guaranteed that the final result is the same no matter how many times the combiner runs, which means not all processing logic can use the combiner component. Some logic would change the final reduce output if a combiner were used (for example, when computing the average of several numbers you cannot use a combiner to average the output of each map and then average those averages; that gives a wrong result).

The point of the combiner is to aggregate the output of each MapTask locally, reducing the amount of data sent over the network.
The data originally passed to reduce would be (a,1) (a,1) (a,1) (a,1) (a,1) ...
After the first combiner pass it becomes a {1,1,1,1,...};
after the second combiner pass the data passed to reduce becomes a {4,2,3,5,...}.
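
A hedged sketch of registering the reduce class as the combiner in the driver, assuming the WordCountReducer sketched in the Reduce section below (summing counts is associative and commutative, so partial sums per MapTask do not change the final result):

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    public static void configure(Job job) {
        // Safe: word-count sums can be combined locally and summed again at the reducer.
        job.setCombinerClass(WordCountReducer.class);
        // Not safe for averages: an average of per-map averages is generally not
        // the overall average, so such a reducer must not be used as a combiner.
    }
}
```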

Fetch (Pull)

The temporary results of each Map need to be pulled to the Reduce nodes.
Principles:
The same key must be pulled to the same Reduce node, but one Reduce node can hold several keys.
Data can only be pulled after the Map has produced its final merged file. If that data were unsorted, every reduce would have to traverse the whole file; because the large files produced by the maps are ordered, each reduce only needs to read the portion it needs from the file.
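
The pulling itself is done by parallel copier threads in each reduce; their count corresponds to a standard Hadoop key (default 5), sketched here with a hypothetical FetchTuning wrapper:

```java
import org.apache.hadoop.conf.Configuration;

public class FetchTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // How many map outputs one reduce fetches in parallel during the copy phase.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
        return conf;
    }
}
```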

Merge

When a reduce pulls data, it pulls from several maps, and each map contributes a small file. These small files are ordered internally but unordered with respect to one another. To make the computation convenient (so that N small files do not have to be read separately), the files are merged (merge sort) into larger ordered files in which records with the same key sit together.

Reduce

The reduce reads the merged data into memory, taking in all values of the same key at one time, and computes the result for that key directly --> the final result.
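
A minimal word-count reducer along these lines (the class name WordCountReducer is the one assumed by the combiner sketch above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per key with all of that key's values grouped together;
// summing them gives the final count for the word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        sum.set(total);
        context.write(key, sum);   // e.g. ("a", {4,2,3,5}) -> ("a", 14)
    }
}
```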

Write (Output)

Each reduce stores the final result of its own computation on HDFS.
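
A hedged end-to-end driver that wires the sketches above together (all class names are the hypothetical ones used in this article, not from the original post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(2);                       // number of partitions = number of reduces

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // raw file on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // each reduce writes its own part-r-NNNNN file here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```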

Source: blog.csdn.net/qq_61162288/article/details/131296498