[Big Data] MapReduce component InputFormat

Thinking: When running a MapReduce program, the input files come in many formats: line-based log files, binary files, database tables, and so on. So, for these different data types, how does MapReduce read the data?


Common interface implementation classes of FileInputFormat include:

TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, custom InputFormat, etc.

1. TextInputFormat

TextInputFormat is the default FileInputFormat implementation class. It reads each record line by line. The key is the starting byte offset of the line within the whole file, of type LongWritable. The value is the content of the line, excluding any line terminators (newline and carriage return), of type Text.

Content:
Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise
(k,v):
(0, Rich learning form)
(19, Intelligent learning engine)
(47, Learning more convenient)
(72, From the real demand for more close to the enterprise)
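
As a minimal sketch (assuming the new org.apache.hadoop.mapreduce API; the class name OffsetMapper is illustrative), a Mapper consuming TextInputFormat's records must declare exactly these input types:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input types match what TextInputFormat delivers:
// key = byte offset of the line (LongWritable), value = the line itself (Text).
public class OffsetMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass the (offset, line) pair through unchanged.
        context.write(key, value);
    }
}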

2. KeyValueTextInputFormat

Each line is one record, split into key and value by a separator. The separator can be set in the driver class via conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");. The default separator is the tab character (\t).

Below is an example. The input is a split containing 4 records, where ---> represents a (horizontal) tab character.
line1--->Rich learning form
line2--->Intelligent learning engine
line3--->Learning more convenient
line4--->From the real demand for more close to the enterprise
(k,v):
(line1, Rich learning form)
(line2, Intelligent learning engine)
(line3, Learning more convenient)
(line4, From the real demand for more close to the enterprise)
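
A minimal driver sketch for this setup (the class name KVDriver is illustrative; mapper/reducer and path setup are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KVDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set the separator before creating the Job from this conf.
        // Tab is already the default; the call is shown for completeness.
        conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");

        Job job = Job.getInstance(conf);
        // Switch from the default TextInputFormat to KeyValueTextInputFormat.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... set mapper/reducer, input/output paths, then submit as usual.
    }
}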

 

3. NLineInputFormat

If NLineInputFormat is used, the InputSplits processed by the MapTasks are no longer divided by Block, but by the number of lines N specified for NLineInputFormat. That is, the number of splits = the total number of lines in the input file / N; if the division is not even, the number of splits = quotient + 1.

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

For example, if N is 2, each input split contains two lines, and 2 MapTasks are started. One mapper receives the first two lines:
(0, Rich learning form)
(19, Intelligent learning engine)

The other mapper receives the last two lines:
(47, Learning more convenient)
(72, From the real demand for more close to the enterprise)
The keys and values here are the same as those generated by TextInputFormat.
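
A driver sketch for this setting (the class name NLineDriver is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // Put every 2 lines of input into one split, so the 4-line file
        // above is processed by 2 MapTasks.
        NLineInputFormat.setNumLinesPerSplit(job, 2);
        job.setInputFormatClass(NLineInputFormat.class);
        // ... remaining driver setup ...
    }
}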

4. CombineTextInputFormat

The framework's default TextInputFormat slicing mechanism plans splits per file: no matter how small a file is, it becomes its own split and is handed to its own MapTask. If there are a large number of small files, a large number of MapTasks are generated, and processing efficiency is extremely low.

1) Application scenario:

CombineTextInputFormat is used in scenarios with many small files. It can logically plan multiple small files into one split, so that multiple small files are handed to a single MapTask for processing.

2) Setting the maximum virtual storage split size

CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

Note: it is best to set the maximum virtual storage split size according to the actual sizes of the small files.

3) Slicing mechanism

The slicing process includes two parts: the virtual storage process and the slicing process.
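
A driver sketch putting both settings together (the class name CombineDriver is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // Pack many small files into shared splits instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Maximum virtual storage split size: 4 MB (tune to the actual file sizes).
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
        // ... remaining driver setup ...
    }
}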


 

Introduction to custom InputFormat

An InputFormat converts a file split into <key, value> pairs and hands them to the Mapper for processing.

So we see that there are only two methods in the InputFormat class: one is responsible for splitting the input, and the other returns an object that can convert a split into the corresponding key-value pairs.
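
For reference, these are the two abstract methods of org.apache.hadoop.mapreduce.InputFormat in the new API (shown here without the surrounding Javadoc):

import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {
    // Logically splits the input; one InputSplit is handed to each MapTask.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Returns a RecordReader that turns one split into (key, value) pairs.
    public abstract RecordReader<K, V> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}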

We have learned that the Mapper has input data types, namely key/value: the key is of type LongWritable and the value is of type Text. So where do these key/value pairs come from? Here InputFormat is introduced by writing a custom one. For data reasons, the output types here are still LongWritable/Text; the main goal is to understand how the Mapper gets its values and to be able to write a custom MyInputFormat in the future.

 

Whether in HDFS or MapReduce, processing small files is very inefficient, yet the scenario of processing a large number of small files is often unavoidable. A corresponding solution is needed: you can customize an InputFormat to merge the small files.

 

Input data

                1.txt   2.txt  3.txt

Expected output file format

                 part-r-00000
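
A minimal sketch of such a custom InputFormat, assuming the key is the file path (Text) and the value is the entire file content (BytesWritable); the class name WholeFileInputFormat is illustrative. Pairing it in the driver with SequenceFileOutputFormat merges all the small files into a single part-r-00000 SequenceFile.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits one record per file: key = file path, value = whole file content.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // each small file becomes exactly one record
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<Text, BytesWritable>() {
            private final Text key = new Text();
            private final BytesWritable value = new BytesWritable();
            private FileSplit fileSplit;
            private TaskAttemptContext ctx;
            private boolean processed = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) {
                fileSplit = (FileSplit) s;
                ctx = c;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                // Read the whole file into memory in one shot.
                Path path = fileSplit.getPath();
                FileSystem fs = path.getFileSystem(ctx.getConfiguration());
                byte[] contents = new byte[(int) fileSplit.getLength()];
                try (FSDataInputStream in = fs.open(path)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                key.set(path.toString());
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            @Override
            public Text getCurrentKey() { return key; }

            @Override
            public BytesWritable getCurrentValue() { return value; }

            @Override
            public float getProgress() { return processed ? 1.0f : 0.0f; }

            @Override
            public void close() { }
        };
    }
}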


Origin blog.csdn.net/Qmilumilu/article/details/104677033