Problems
HDFS is designed for mass data storage, especially TB- and PB-scale data. Over time, however, a large number of small files may accumulate on HDFS, where a small file is one much smaller than an HDFS block (128 MB by default). A large number of small files on HDFS has at least the following consequences:
Consuming a large amount of NameNode memory
Extending the total running time of MapReduce jobs
Because the MapReduce framework's default TextInputFormat slicing mechanism plans tasks file by file, a large number of small files produces an equally large number of MapTasks, and processing small files this way is very inefficient.
Solution
Hadoop provides a built-in CombineTextInputFormat class specifically for dealing with small files. The core idea: according to certain rules, merge multiple small files on HDFS into one InputSplit, then start a single MapTask to process the files inside it, thereby reducing the total runtime of the whole MR job.
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
Slicing mechanism
The CombineTextInputFormat slicing mechanism consists of two parts: a virtual storage process and a slicing process.
Suppose setMaxInputSplitSize is set to 4 MB and there are the following four files:
a.txt 1.7M
b.txt 5.1M
c.txt 3.4M
d.txt 6.8M
(1) Virtual storage process
(1.1) Compare each file in the input directory, in order, against the setMaxInputSplitSize value. If the file is no larger than the maximum, it forms a single logical block.
(1.2) If the file is larger than the maximum, cut off blocks equal to the maximum; once the remaining data is larger than the maximum but not more than twice it, divide the remainder evenly into 2 virtual storage blocks (this prevents overly small slices).
1.7M < 4M, so it forms one block
5.1M > 4M but less than 2 × 4M, so it is divided into two: block 1 = 2.55M, block 2 = 2.55M
3.4M < 4M, so it forms one block
6.8M > 4M but less than 2 × 4M, so it is divided into two: block 1 = 3.4M, block 2 = 3.4M
The resulting virtual storage blocks are:
1.7M
2.55M,2.55M
3.4M
3.4M, 3.4M
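The virtual storage rules above can be sketched in plain Java. This simulates the splitting rules only, not Hadoop's actual implementation; sizes are expressed in hundredths of a MB so the arithmetic stays exact, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class VirtualStorage {
    // Maximum split size in hundredths of a MB (400 = 4 MB).
    static final long MAX = 400;

    // Split one file into virtual storage blocks per rules (1.1) and (1.2).
    static List<Long> virtualBlocks(long size) {
        List<Long> blocks = new ArrayList<>();
        long remaining = size;
        // While more than twice the maximum remains, cut off a full-size block.
        while (remaining > 2 * MAX) {
            blocks.add(MAX);
            remaining -= MAX;
        }
        if (remaining > MAX) {
            // Between 1x and 2x the maximum: split evenly to avoid a tiny block.
            blocks.add(remaining / 2 + remaining % 2);
            blocks.add(remaining / 2);
        } else if (remaining > 0) {
            // No larger than the maximum: one logical block.
            blocks.add(remaining);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // a.txt 1.7M, b.txt 5.1M, c.txt 3.4M, d.txt 6.8M
        long[] files = {170, 510, 340, 680};
        List<Long> all = new ArrayList<>();
        for (long f : files) all.addAll(virtualBlocks(f));
        System.out.println(all); // [170, 255, 255, 340, 340, 340]
    }
}
```

Running this reproduces the block list from the example: 1.7M, 2.55M, 2.55M, 3.4M, 3.4M, 3.4M.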
(2) Slicing process
(2.1) Determine whether a virtual storage block is greater than or equal to the setMaxInputSplitSize value; if so, it forms a slice on its own.
(2.2) If not, it is merged with the next virtual storage block, and together they form a slice.
This ultimately yields 3 slices:
(1.7+2.55)M, (2.55+3.4)M, (3.4+3.4)M, i.e. 4.25M, 5.95M, and 6.8M
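The slicing process can be sketched the same way, again as a plain-Java simulation of the rules rather than Hadoop's actual code, with sizes in hundredths of a MB:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SlicePlanner {
    // Maximum split size in hundredths of a MB (400 = 4 MB).
    static final long MAX = 400;

    // Merge virtual storage blocks into slices: a running total that reaches
    // MAX closes a slice; smaller blocks keep merging with the next one.
    static List<Long> planSlices(List<Long> blocks) {
        List<Long> slices = new ArrayList<>();
        long acc = 0;
        for (long b : blocks) {
            acc += b;
            if (acc >= MAX) {
                slices.add(acc); // this group forms one slice
                acc = 0;
            }
        }
        if (acc > 0) slices.add(acc); // trailing blocks form a final slice
        return slices;
    }

    public static void main(String[] args) {
        // Virtual storage blocks from the example above.
        List<Long> blocks = Arrays.asList(170L, 255L, 255L, 340L, 340L, 340L);
        System.out.println(planSlices(blocks)); // [425, 595, 680]
    }
}
```

The output corresponds to the three slices of 4.25M, 5.95M, and 6.8M in the example.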
Code Example:
In the driver's main method:
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
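For context, those two lines sit inside an ordinary MapReduce driver. The sketch below shows where they go; the class name, placeholder mapper/reducer, and input/output paths are illustrative, not from the original text.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(SmallFileDriver.class);
        // Plug in your own mapper/reducer here:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Use CombineTextInputFormat so many small files share one split.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```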
To sum up:
- Drawback of the default read mechanism: TextInputFormat reads input file by file, so each file forms at least one slice no matter how small it is. Too many small files inevitably produce many slices and launch many MapTasks, wasting resources.
- CombineTextInputFormat
- Set a maximum slice size: CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
- First, the virtual storage process: compare each small file with the configured maximum. If a file is no larger than the maximum, it forms one block by itself; if it is larger but not more than twice the maximum, it is split evenly into two blocks; if it is more than twice the maximum, blocks of the maximum size are cut off first.
- Then, the slice planning process: go through the virtual storage blocks one by one. If a block does not reach the configured maximum, merge it with the following ones, continuing until the running total exceeds the maximum; that group forms one slice.
- For small files, the best approach is still to merge them before uploading to HDFS, aiming for merged files close to the block size.