CombineTextInputFormat for small-file processing scenarios

Problems

HDFS is designed for storing massive amounts of data, especially at the TB and PB scale. Over time, however, HDFS may accumulate a large number of small files, where a "small file" means a file much smaller than an HDFS block (128 MB by default). A large number of small files on HDFS has at least the following effects:

It consumes a large amount of NameNode memory, because the metadata of every file and block is kept in the NameNode's memory regardless of file size.

It increases the total running time of MapReduce jobs.

This is because the MapReduce framework's default TextInputFormat slicing mechanism plans splits file by file: if there are a large number of small files, it will generate a large number of MapTasks, which makes processing small files very inefficient.

Solution

Hadoop provides a built-in CombineTextInputFormat class specifically for handling small files. Its core idea: according to certain rules, merge multiple small files on HDFS into one InputSplit, then start a single MapTask to process the files inside that split, thereby reducing the total running time of the MR job.

CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

Slicing mechanism

The CombineTextInputFormat slicing mechanism consists of two parts: the virtual storage process and the slicing process.

Suppose setMaxInputSplitSize is set to 4M and there are the following four files:

a.txt 1.7M

b.txt 5.1M

c.txt 3.4M

d.txt 6.8M

(1) Virtual storage process

(1.1) Compare the size of each file in the input directory, in order, with the setMaxInputSplitSize value. If a file is not larger than the configured maximum, it is logically divided into one block.

(1.2) If an input file is larger than the configured maximum and more than twice as large, a block equal to the maximum value is cut off; once the remaining data is larger than the maximum but not more than twice the maximum, the remainder is divided evenly into 2 virtual storage blocks (to prevent overly small splits).

1.7M < 4M: divided into one block (1.7M)

5.1M > 4M but less than 2 × 4M: divided into two blocks, block 1 = 2.55M and block 2 = 2.55M

3.4M < 4M: divided into one block (3.4M)

6.8M > 4M but less than 2 × 4M: divided into two blocks, block 1 = 3.4M and block 2 = 3.4M

The resulting virtual storage blocks are:

1.7M

2.55M, 2.55M

3.4M

3.4M, 3.4M
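
To make the rule concrete, here is a minimal standalone Java sketch (not Hadoop source code, just a simulation under the assumed 4 MB maximum) that applies rules (1.1) and (1.2) to the four example files:

import java.util.ArrayList;
import java.util.List;

public class VirtualStorageDemo {
    // Simulate the virtual storage rule for one file; sizes are in MB.
    static List<Double> virtualBlocks(double fileSizeMB, double maxMB) {
        List<Double> blocks = new ArrayList<>();
        double remaining = fileSizeMB;
        // Rule (1.2): while more than twice the maximum remains, cut off one maximum-sized block.
        while (remaining > 2 * maxMB) {
            blocks.add(maxMB);
            remaining -= maxMB;
        }
        if (remaining > maxMB) {
            // Remainder is between 1x and 2x the maximum: split it evenly into two blocks.
            blocks.add(remaining / 2);
            blocks.add(remaining / 2);
        } else {
            // Rule (1.1): not larger than the maximum, so it forms one block by itself.
            blocks.add(remaining);
        }
        return blocks;
    }

    public static void main(String[] args) {
        double[] files = {1.7, 5.1, 3.4, 6.8}; // a.txt, b.txt, c.txt, d.txt
        for (double f : files) {
            System.out.println(f + "M -> " + virtualBlocks(f, 4.0));
        }
        // Output: 1.7M -> [1.7]   5.1M -> [2.55, 2.55]   3.4M -> [3.4]   6.8M -> [3.4, 3.4]
    }
}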

 

(2) Slicing process

(2.1) Check whether a virtual storage block is greater than or equal to the setMaxInputSplitSize value; if so, it forms a split on its own.

(2.2) If it is not, it is merged with the next virtual storage block, and together they form one split.

This ultimately produces 3 splits:

(1.7 + 2.55)M, (2.55 + 3.4)M, (3.4 + 3.4)M
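
The split planning step can be simulated the same way. Here is a small sketch, assuming the six virtual block sizes produced above, that packs them into splits by accumulating blocks until the configured maximum is reached:

import java.util.Arrays;
import java.util.List;

public class SlicePlanningDemo {
    public static void main(String[] args) {
        // Virtual storage blocks from the previous step, in MB.
        List<Double> blocks = Arrays.asList(1.7, 2.55, 2.55, 3.4, 3.4, 3.4);
        double maxMB = 4.0;

        double current = 0;                      // accumulated size of the split being built
        StringBuilder parts = new StringBuilder();
        for (double b : blocks) {
            current += b;
            parts.append(parts.length() == 0 ? "" : " + ").append(b);
            // Once the accumulated size reaches the maximum, close the current split.
            if (current >= maxMB) {
                System.out.println("split: (" + parts + ")M");
                current = 0;
                parts.setLength(0);
            }
        }
        if (parts.length() > 0) {
            System.out.println("split: (" + parts + ")M"); // leftover blocks form the last split
        }
        // Output:
        // split: (1.7 + 2.55)M
        // split: (2.55 + 3.4)M
        // split: (3.4 + 3.4)M
    }
}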

 

Code Example:

In the driver's main method:

        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
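
For context, here is a minimal driver sketch showing where those two lines fit. The mapper/reducer classes and the argument paths are placeholders assumed for illustration; only the two CombineTextInputFormat lines are the part that matters here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine small files");
        job.setJarByClass(CombineDriver.class);

        // Your own mapper/reducer and output key/value classes go here, e.g.:
        // job.setMapperClass(WordCountMapper.class);
        // job.setReducerClass(WordCountReducer.class);

        // Replace the default TextInputFormat and cap each combined split at 4 MB.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}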
 
 
 

Summary:

CombineTextInputFormat for small-file processing scenarios

  • The drawback of the default reading mechanism:
    TextInputFormat plans splits file by file, so every file, no matter how small, becomes at least one split;
    too many small files therefore produce many splits and launch many MapTasks, wasting resources.
  • CombineTextInputFormat
    • Set a maximum split size:
      CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
    • First, the virtual storage process:
      each small file is compared with the configured maximum;
      if it is not larger, it forms one block by itself;
      if it is larger but not more than twice the maximum, it is divided evenly into two blocks;
      if it is more than twice the maximum, it is cut by the maximum value.
    • Then, the split planning process:
      the virtual storage results are examined one by one; a block that does not reach the configured value is merged with the next one, and if the result still does not reach it, merging continues until the maximum is exceeded, forming one split.
    • For small files, the best approach is still to merge them before uploading to HDFS, aiming at the block size (see the sketch after this list).
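
As a sketch of that last point (merging small files before they ever reach MapReduce), the following illustrative snippet concatenates a directory of local small files into one HDFS file using the Hadoop FileSystem API; the input and output paths are placeholders assumed for the example:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf); // local file system
        FileSystem hdfs = FileSystem.get(conf);       // HDFS, from fs.defaultFS

        // Placeholder paths, assumed for illustration.
        Path inputDir = new Path("file:///data/small-files");
        Path target = new Path("/input/merged.txt");

        try (FSDataOutputStream out = hdfs.create(target)) {
            // Append each local small file to the single HDFS target file.
            for (FileStatus status : local.listStatus(inputDir)) {
                try (InputStream in = local.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        }
    }
}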

 


