Hadoop's MapReduce framework principle


Table of contents

The simple operating mechanism of the MapReduce framework:
Mapper stage:
InputFormat data input:
    Slice and MapTask parallelism determination mechanism:
Source code analysis of the job submission process:
Slicing logic:
1) FileInputFormat implementation class
Virtual storage
(1) The virtual storage process:
Shuffle stage:
Sorting:
Combiner merge:
ReduceTask stage:
Reduce Join:
Map Join:


The simple operating mechanism of the MapReduce framework:

MapReduce is divided into two phases: the MapTask phase and the ReduceTask phase, with a Shuffle stage in between.

In the Mapper stage, the data is read through an InputFormat (the choice of key and value types depends on the implementation chosen). The records are then handed to the Mapper, where the user-written business logic runs. The Shuffle stage pulls the Mapper output into the Reduce stage, and the result is finally written out through an OutputFormat (the destination can be ES, MySQL, HBase, or files).
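To make this flow concrete, here is a minimal driver sketch in the usual WordCount style; WordCountMapper and WordCountReducer stand in for the user-written Mapper and Reducer and are not defined in this article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountDriver.class);

        // Mapper and Reducer hold the user-written business logic
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key/value types produced by the Mapper and by the final output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The InputFormat reads the data in, the OutputFormat writes the result out
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}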

Mapper stage: 

InputFormat data input:

Slice and MapTask parallelism determination mechanism:

The number of MapTasks determines the degree of map-side parallelism (roughly, how many workers process the input at the same time). ** Note that more is not always better: when the data volume is small, the overhead of starting many MapTasks can exceed the time a single MapTask would need to finish the computation.

Data block: a Block is the unit in which HDFS physically divides and stores data.

Data slicing: a slice divides the input only logically; the data is not physically split into separate pieces on disk. A slice is the unit of input for a MapReduce computation, and each slice starts one MapTask.

 

Source code analysis of the job submission process:

Since we are interested in how the job is submitted, set a breakpoint on the job submission call and step into it.

ensureState(JobState.DEFINE); makes sure the job is in the correct state (if the state is wrong, e.g. already RUNNING, an exception is thrown)
setUseNewAPI(); handles API compatibility between different Hadoop versions
connect(); establishes the connection (the client connects to the cluster or to the local runner)
checkSpecs(job); checks the output specification (whether the output path is set and whether it already exists)

return submitter.submitJobInternal(Job.this, cluster); is the core code; to step into it you need to step in twice,

the first step-in enters the Job.this parameter, the second enters the method itself.

This method submits the job. In cluster mode the submission includes uploading the jar package to the cluster through the client; when running locally there is no need to submit the jar, since it already exists on the local machine.

Slicing is also performed here, and the slice information is generated (as many slices, as many MapTasks).

A job configuration XML file is also generated.

To sum up, job submission hands over three things: the jar (cluster mode only), the XML configuration file, and the slice information.

Finally, all of these submission files are deleted.
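Putting the calls above together, the submission flow can be summarized roughly like this (a simplified paraphrase for orientation, not the verbatim Hadoop source):

// Simplified sketch of Job.submit(), based on the calls walked through above.
// Method bodies and signatures are abbreviated; consult the real Hadoop source for details.
public void submit() throws Exception {
    ensureState(JobState.DEFINE);      // the job must still be in the DEFINE state
    setUseNewAPI();                    // bridge old-API and new-API configuration keys
    connect();                         // build the Cluster object (local runner or YARN)
    final JobSubmitter submitter =
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    // submitJobInternal: checks the output spec, computes the input splits,
    // writes job.xml, and (in cluster mode) uploads the job jar to the staging dir
    status = submitter.submitJobInternal(Job.this, cluster);
    state = JobState.RUNNING;          // the job is now submitted and running
}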

Slicing logic:

** (Slicing is done separately for each input file.)

When running locally the default block size is 32 MB. As mentioned earlier, one block corresponds to one slice by default, but with a condition: after cutting off a 32 MB slice, if the remaining data is more than 1.1 times the slice size, another full slice is cut; if it is not more than 1.1 times, the remainder stays as the last slice and no new slice is created.

Example 1:

A 32.1 MB file occupies two physical blocks (32 MB + 0.1 MB) but gets only 1 slice, because 32.1 / 32 = 1.003125 < 1.1, so the whole file fits into a single slice.

Example 2:

A 100 MB file:

100 - 32 - 32 = 36 > 32, and 36 / 32 = 1.125 > 1.1, so the remaining 36 MB is cut into two slices (32 MB + 4 MB), giving 4 slices in total.

** The block size cannot be changed, but the slice size can be adjusted: lowering maxSize makes slices smaller, and raising minSize makes slices larger.
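For reference, a simplified sketch of how FileInputFormat derives the slice size from blockSize, minSize, and maxSize, and of the 1.1 rule discussed above (it mirrors the described logic rather than quoting the Hadoop source verbatim):

// Simplified sketch of the slicing rules described above.
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // splitSize defaults to blockSize; lowering maxSize shrinks it, raising minSize grows it
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long fileLength, long splitSize) {
        final double SPLIT_SLOP = 1.1;              // the 1.1 factor mentioned above
        int splits = 0;
        long bytesRemaining = fileLength;
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            splits++;                               // cut a full split of splitSize bytes
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++;                               // the remainder (at most 1.1 * splitSize) is the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(32L * 1024 * 1024, 1, Long.MAX_VALUE);
        // 100 MB with a 32 MB split size -> 4 splits (32 + 32 + 32 + 4), matching Example 2 above
        System.out.println(countSplits(100L * 1024 * 1024, splitSize));
    }
}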

Slice summary:

(By default, each MapTask occupies 1 GB of memory and 1 CPU core.)

  

1) FileInputFormat implementation class

Thinking: when running a MapReduce program, the input files may be line-based log files, binary files, database tables, and so on. How does MapReduce read these different types of data?

Common FileInputFormat implementation classes include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat implementations. (Different application scenarios call for different implementation classes.)

TextInputFormat is the default FileInputFormat implementation class. It reads records line by line. The key is the starting byte offset of the line within the file, of type LongWritable; the value is the content of the line, excluding any line terminators (newline and carriage return), of type Text.
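Accordingly, a Mapper that consumes TextInputFormat input declares LongWritable (the offset) and Text (the line) as its input types. A minimal sketch; the output types Text/IntWritable are just one possible choice:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the line, input value = the line itself
public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value holds one line of text, without its line terminator
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);   // emit (word, 1) for the Shuffle stage
            }
        }
    }
}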

CombineTextInputFormat is used in scenarios where there are too many small files. It can logically plan multiple small files into one slice, so that multiple small files can be handed over to one MapTask for processing.

Virtual storage

(1) The virtual storage process:

Each file in the input directory is compared, in turn, against the value of setMaxInputSplitSize (the virtual split size). If a file is not larger than this maximum, it forms one virtual block. If a file is larger than the maximum and more than twice as large, a block of the maximum size is cut off; when the remaining size is larger than the maximum but not more than twice the maximum, the remainder is divided into two virtual blocks (this prevents slices that are too small).
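The rule can be written down as a small helper. This is only an illustration of the splitting rule described above, not code taken from CombineTextInputFormat:

import java.util.ArrayList;
import java.util.List;

// Illustration only: compute the virtual block sizes for one file,
// following the splitting rule described above (maxSize = the setMaxInputSplitSize value).
public class VirtualStorageDemo {
    static List<Long> virtualBlocks(long fileSize, long maxSize) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 2 * maxSize) {      // more than twice the max: cut a full-size block
            blocks.add(maxSize);
            remaining -= maxSize;
        }
        if (remaining > maxSize) {             // between 1x and 2x the max: split into two halves
            blocks.add(remaining / 2);
            blocks.add(remaining - remaining / 2);
        } else if (remaining > 0) {            // not larger than the max: one virtual block
            blocks.add(remaining);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // e.g. a hypothetical 7 MB file with a 4 MB virtual split size -> two ~3.5 MB blocks
        System.out.println(virtualBlocks(7L * 1024 * 1024, 4L * 1024 * 1024));
    }
}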

Test:

Without using CombineTextInputFormat (the default TextInputFormat is used):

You can see that there are 4 slices.

Add the following code to set the implementation class to CombineTextInputFormat and set the virtual storage split size:

// If no InputFormat is set, TextInputFormat.class is used by default
job.setInputFormatClass(CombineTextInputFormat.class);

// Set the maximum virtual storage split size to 4 MB (4194304 bytes)
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);

  

As you can see, there are now 3 slices.

We can change the number of slices by changing the virtual split size.

To sum up, the factors that affect the number of slices are: (1) the size of the data; (2) the slice size (usually adjusted automatically); (3) the file format (some file formats cannot be split).

Factors that affect the slice size: the HDFS block size, together with maxSize and minSize (the slice size is determined by comparing these values with the block size).

Shuffle stage:

The Shuffle stage follows the Mapper stage. The Mapper's (k, v) output is written into a circular (ring) buffer; the buffer holds indexes in one half and data in the other and is 100 MB by default. When it is 80% full, the contents are spilled to disk and writing continues in the reverse direction (this reduces waiting and improves efficiency, because new records can be written while the spill is in progress). Before a spill file is written, the records are partitioned (the number of partitions matches the number of ReduceTasks) and sorted by key; only the index entries are moved during sorting, not the records themselves, because moving the data would consume far more resources. After the spill files are written, they are merge-sorted into a single ordered file (merging is efficient precisely because each spill file is already sorted).
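Because the partition count is tied to the number of ReduceTasks, a custom Partitioner is the usual way to control which ReduceTask a key goes to. The sketch below is a hypothetical example (the letter-based rule and the class name LetterPartitioner are made up for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: keys starting with a-m go to partition 0, all others to partition 1.
// The returned partition number decides which ReduceTask receives the record.
public class LetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(k.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

// In the driver, the ReduceTask count must match the partitions produced above:
//   job.setPartitionerClass(LetterPartitioner.class);
//   job.setNumReduceTasks(2);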

Sorting:

Sorting can be customized; for example, full (total) sorting:

A custom Bean class is defined and its objects are used as the map output key. To be sortable, the class must implement the WritableComparable interface and override the compareTo method.
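A minimal sketch of such a bean, assuming a single numeric field (here called totalFlow, an invented name) to sort by in descending order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Bean used as the map output key so that the shuffle sorts records by totalFlow.
public class FlowBean implements WritableComparable<FlowBean> {
    private long totalFlow;

    public FlowBean() { }                       // empty constructor required for reflection

    public long getTotalFlow() { return totalFlow; }
    public void setTotalFlow(long totalFlow) { this.totalFlow = totalFlow; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(totalFlow);               // serialization
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        totalFlow = in.readLong();              // deserialization, same order as write()
    }

    @Override
    public int compareTo(FlowBean other) {
        // descending order by totalFlow; this return value drives the shuffle sort
        return Long.compare(other.totalFlow, this.totalFlow);
    }

    @Override
    public String toString() {
        return String.valueOf(totalFlow);
    }
}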

Combiner merge:

A Combiner is not suitable for every production scenario; it can only be used when it does not affect the final business logic (summation is fine, averaging is not).

The difference between a Combiner and a Reducer: the Combiner runs on the node of each MapTask and merges only that task's local output, while the Reducer receives and processes the output of all Mappers across the job.
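For a sum-style job, the Reducer class itself can often be reused as the Combiner. A sketch, assuming a WordCountReducer that sums IntWritable values and whose output types match its input types:

// Local merging on the map side: safe here because summing partial sums
// gives the same final result. (Reusing a Reducer as the Combiner only works
// when its output key/value types match its input key/value types.)
job.setCombinerClass(WordCountReducer.class);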

ReduceTask stage:

(1) Copy stage: the ReduceTask remotely copies its share of data from each MapTask; if a piece of data exceeds a certain threshold it is written to disk, otherwise it is kept directly in memory.

(2) Sort stage: while copying data remotely, ReduceTask starts two background threads to merge files in memory and disk to prevent excessive memory usage or too many files on disk. According to the semantics of MapReduce, the input data of the reduce() function written by the user is a set of data aggregated by key. In order to gather data with the same key together, Hadoop adopts a sorting-based strategy. Since each MapTask has implemented partial sorting of its own processing results, the ReduceTask only needs to merge and sort all the data once.

(3) Reduce phase: the reduce() function writes the calculation results to HDFS.

The number of ReduceTasks can be set manually; the job produces one output file per ReduceTask (the partitioning rule is the same as described above).

Reduce Join:

A brief description of the process:

(1) Define a custom bean object (it implements Writable so that it can be serialized and deserialized).

(2) Write the Mapper class, overriding the setup() method first. This case reads two files, so the file name is needed before processing the records (one file corresponds to one slice); fetching the file name once in setup() rather than in every map() call is an optimization.

(3) Write the Reducer class (the business logic). First create a collection of bean objects (for the order records) and a single bean object (for the product record).

Use a for loop to traverse the values (records with the same key enter the same reduce() call).

Check which file each record came from and apply the corresponding logic:

"order" table:

Create a new bean object to hold the data so that it can be added to the collection.

Use BeanUtils.copyProperties(tmpOrderBean, value); to copy the original data.

Then add it to the collection created above: orderBeans.add(tmpOrderBean);

"pd" table:
BeanUtils.copyProperties(pdBean, value); copies the original data directly.

Once everything is stored, the combining phase begins:

Use an enhanced for loop over the collection,

orderbean.setPname(pdBean.getPname());

use the setter to fill in the pname on each order bean,

and then write it out:

context.write(orderbean, NullWritable.get());

End of the business logic.
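Putting the steps above together, the reduce() method might look roughly like this. TableBean and its accessors (getFlag, setPname, getPname) are assumed names for the custom bean described above, not code quoted from the article:

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a Reduce Join: records with the same pid (the key) arrive together;
// TableBean is an assumed custom Writable with a "flag" field naming its source table.
public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context)
            throws IOException, InterruptedException {
        List<TableBean> orderBeans = new ArrayList<>();  // all "order" records for this key
        TableBean pdBean = new TableBean();              // the single "pd" record for this key
        try {
            for (TableBean value : values) {
                if ("order".equals(value.getFlag())) {
                    TableBean tmpOrderBean = new TableBean();
                    BeanUtils.copyProperties(tmpOrderBean, value);  // copy: the iterator reuses objects
                    orderBeans.add(tmpOrderBean);
                } else {
                    BeanUtils.copyProperties(pdBean, value);
                }
            }
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IOException(e);
        }
        // Combining phase: fill in the product name and write each joined record out
        for (TableBean orderBean : orderBeans) {
            orderBean.setPname(pdBean.getPname());
            context.write(orderBean, NullWritable.get());
        }
    }
}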

Disadvantages of Reduce Join: with this approach the merge is completed entirely in the Reduce stage, so the Reduce side bears too much processing pressure while the computing load on the Map nodes is very low; resource utilization is poor and data skew easily arises in the Reduce stage.

Map Join:

Usage scenario:

Map Join is suitable for scenarios where one table is very small and one table is large.

Merging the data on the Map side avoids the shortcomings of Reduce Join (in particular, data skew).

A brief description of the process:

In the Mapper class:

setup() method: read the smaller file (distributed to each MapTask via the cache) and store all of its records in an in-memory Map collection.

In the overridden map() method:

Convert the input line to a string and split it, then look up the pname in the Map collection using the split field.

Finally, format the output record and write it out.
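A sketch of such a Mapper, assuming the small "pd" file has lines of the form pid\tpname and was registered with job.addCacheFile(...), and that the order file is tab-separated as well (all field layouts here are assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map Join sketch: the small "pd" table is loaded into memory in setup(),
// then each order line is joined in map() without any Reduce stage.
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> pdMap = new HashMap<>();
    private final Text outKey = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the small file was registered with job.addCacheFile(new URI("..."))
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (FSDataInputStream in = fs.open(new Path(cacheFiles[0]));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) continue;
                String[] fields = line.split("\t");   // assumed format: pid \t pname
                pdMap.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // assumed order format: orderId \t pid \t amount
        String[] fields = value.toString().split("\t");
        String pname = pdMap.getOrDefault(fields[1], "NULL");
        outKey.set(fields[0] + "\t" + pname + "\t" + fields[2]);  // rebuild the output line
        context.write(outKey, NullWritable.get());
    }
}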

(And with that, MapReduce is done!!!)


Origin blog.csdn.net/m0_61469860/article/details/129675247