Hadoop-MapReduce (2)

InputFormat data input

  • Detailed explanation of job submission process and slice source code
waitForCompletion()
submit();
// 1. Establish the connection
	connect();
		// 1) Create the proxy used to submit the job
		new Cluster(getConfiguration());
			// (1) Determine whether this is local or a remote YARN cluster
			initialize(jobTrackAddr, conf);
	// 2. Submit the job
submitter.submitJobInternal(Job.this, cluster)
	// 1) Create the staging path used to submit data to the cluster
	Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
	// 2) Get the job id and create the job path
	JobID jobId = submitClient.getNewJobID();
	// 3) Copy the jar to the cluster
copyAndConfigureFiles(job, submitJobDir);
	rUploader.uploadFiles(job, jobSubmitDir);
// 4) Compute the slices and generate the slice planning file
writeSplits(job, submitJobDir);
	maps = writeNewSplits(job, jobSubmitDir);
		input.getSplits(job);
// 5) Write the XML configuration file to the staging path
writeConf(conf, submitJobFile);
	conf.writeXml(out);
// 6) Submit the job and return the submission status
status = submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials());
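
On the user side, this entire trace is kicked off by a single waitForCompletion(true) call in the driver. Below is a minimal driver sketch for context; the identity Mapper/Reducer, the class name, and the path arguments are placeholders for illustration, not part of the source walkthrough above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "minimal-driver");
        job.setJarByClass(MinimalDriver.class);
        job.setMapperClass(Mapper.class);      // identity mapper, placeholder only
        job.setReducerClass(Reducer.class);    // identity reducer, placeholder only
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() -> submit() -> connect() -> submitJobInternal(), as traced above
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}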
    • FileInputFormat source code analysis (input.getSplits(job))
    • Find the directory where your data is stored.
    • Start traversing each file in the directory (slices are planned file by file)
    • Traverse the first file ss.txt
      • Get file size fs.sizeOf(ss.txt);
      • Compute the slice size: computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))) = blockSize = 128M
      • By default, slice size=blocksize
      • Start cutting to form slices: the first slice is ss.txt 0–128M, the second ss.txt 128–256M, and the third ss.txt 256–300M (before each cut, check whether the remaining data is more than 1.1 times the block size; if not, the remainder becomes a single slice). A simplified sketch of this computation appears after this list.
      • Write slice information to a slice planning file
      • The core of the whole slicing process is carried out in the getSplits() method.
      • Data slicing only divides the input data logically; it does not physically split it into pieces stored on disk. An InputSplit records only the metadata of a slice, such as its starting position, its length, and the list of nodes it resides on.
      • Note: a block is data physically stored in HDFS, while a slice is a purely logical division of the data.
    • The slice planning file is submitted to YARN, and the MrAppMaster on YARN calculates the number of MapTasks to start based on it.
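
The slice-size formula and the 1.1x rule above can be condensed into a few lines. The sketch below is a simplified re-implementation for illustration only, not the actual FileInputFormat source; the SPLIT_SLOP constant and the method name mirror the ones discussed here.

public class SplitPlanningSketch {
    private static final double SPLIT_SLOP = 1.1; // remaining/splitSize must exceed this to keep cutting

    // slice size = Math.max(minSize, Math.min(maxSize, blockSize)), as in the source analysis above
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Plan slices for one file of length fileLen, printing (offset, length) pairs.
    static void planSlices(long fileLen, long blockSize, long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long bytesRemaining = fileLen;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            System.out.println("slice: offset=" + (fileLen - bytesRemaining) + " length=" + splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) { // whatever is left (at most 1.1 x splitSize) becomes the last slice
            System.out.println("slice: offset=" + (fileLen - bytesRemaining) + " length=" + bytesRemaining);
        }
    }

    public static void main(String[] args) {
        // 300M file with 128M blocks -> 0:128M, 128:256M, 256:300M, matching the ss.txt example above
        planSlices(300L << 20, 128L << 20, 1L, Long.MAX_VALUE);
    }
}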
  • FileInputFormat slicing mechanism

    • The default slicing mechanism in FileInputFormat:

      • Simply slice according to the content length of the file

      • Slice size, the default is equal to the block size

      • When slicing, the data set as a whole is not considered; each file is sliced separately.

        For example, there are two files for the data to be processed:

        file1.txt 320M

        file2.txt 10M

      • After the slicing mechanism of FileInputFormat, the slice information formed is as follows:

        file1.txt.split1-- 0~128

        file1.txt.split2-- 128~256

        file1.txt.split3-- 256~320

        file2.txt.split1-- 0~10M

    • FileInputFormat slice size parameter configuration

      • From the source code (around line 280 of FileInputFormat), the logic for computing the slice size is: Math.max(minSize, Math.min(maxSize, blockSize));

      • The slice size is therefore determined mainly by these parameters:

      • mapreduce.input.fileinputformat.split.minsize = 1 (the default value is 1)

      • mapreduce.input.fileinputformat.split.maxsize = Long.MAX_VALUE (the default value is Long.MAX_VALUE)

        Therefore, by default, slice size = blocksize.

      • maxsize (maximum slice size): if this parameter is set smaller than the block size, the slices become smaller and equal to the configured value.

      • minsize (minimum slice size): if this parameter is set larger than the block size, the slices can be made larger than the block size.
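
In a driver, these two parameters can be adjusted either through the property names above or through the FileInputFormat helper methods. The sketch below is an example only; the 64M value and the class/method placement are illustrative, not a recommendation.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // Call from the driver after creating the Job; the 64M value is an example only.
    public static void shrinkSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 1L);                // mapreduce.input.fileinputformat.split.minsize
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // maxsize < blockSize -> 64M slices instead of 128M
    }
}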

    • Get slice information API

// Get the slice information according to the input file type
FileSplit inputSplit = (FileSplit) context.getInputSplit();
// Get the file name of the slice
String name = inputSplit.getPath().getName();
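
These two lines are typically used inside a Mapper, for example in setup(), to make per-file decisions. A hedged sketch follows; the class name and the key/value types are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private String fileName;

    @Override
    protected void setup(Context context) {
        // The cast is safe when the input format produces FileSplit (e.g. TextInputFormat).
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        fileName = inputSplit.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the source file name once per input line, as a trivial demonstration.
        context.write(new Text(fileName), NullWritable.get());
    }
}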

CombineTextInputFormat slicing mechanism

  • Optimization strategy for a large number of small files

    • By default, TextInputFormat's slicing mechanism plans slices file by file: no matter how small a file is, it becomes a separate slice and is handed to its own MapTask. With a large number of small files, this produces a large number of MapTasks, which is extremely inefficient.

    • Optimization Strategy

      • The best way is to merge small files into large files at the forefront of the data processing system (preprocessing/collection), and then upload them to HDFS for subsequent analysis.

      • Remedy: if a large number of small files already exist in HDFS, another InputFormat can be used for slicing: CombineTextInputFormat. Its slicing logic differs from TextInputFormat in that it can logically plan multiple small files into one slice, so that multiple small files are handed to a single MapTask.

      • The minimum slice size is satisfied first, without exceeding the maximum slice size:

        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);// 4m

        CombineTextInputFormat.setMinInputSplitSize(job, 2097152);// 2m

        Example: 0.5m + 1m + 0.3m + 5m = 2m + 4.8m = 2m + 4m + 0.8m

    • Specific implementation steps

// If no InputFormat is set, TextInputFormat.class is used by default
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);// 4m
CombineTextInputFormat.setMinInputSplitSize(job, 2097152);// 2m
  • Case practice

  • InputFormat interface implementation class

    The input files of a MapReduce job are generally stored in HDFS. Input formats include line-based log files, binary files, and so on, and these files are usually very large, tens of gigabytes or more. So how does MapReduce read this data? We first look at the InputFormat interface.

    Common interface implementation classes of InputFormat include: TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat and custom InputFormat.

  • TextInputFormat

    TextInputFormat is the default InputFormat. Each record is one line of input. The key, of type LongWritable, stores the byte offset of the line within the whole file. The value is the content of the line, excluding any line terminators (newline, carriage return). (A minimal mapper sketch appears at the end of this subsection.)

    The following is an example: a slice contains these 4 text records.

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(0,Rich learning form)
(19,Intelligent learning engine)
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
  • Obviously, the key is not the row number. Under normal circumstances, it is difficult to obtain the line number, because the file is divided into slices by bytes rather than by lines.
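
To make the key/value types concrete, the mapper below simply re-emits what TextInputFormat hands it. The class name is illustrative; this is a sketch, not part of any particular example job.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Re-emits each record unchanged: key = byte offset of the line, value = line content.
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(offset, line); // e.g. (0, "Rich learning form"), (19, "Intelligent learning engine"), ...
    }
}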

  • KeyValueTextInputFormat

    Each line is one record, split into key and value by a separator. The separator can be set in the driver class via conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, ""); the default separator is tab (\t). (A driver sketch appears at the end of this subsection.)

    The following is an example; the input is a slice containing 4 records, where ——> represents a (horizontal) tab character.

line1 ——>Rich learning form

line2 ——>Intelligent learning engine

line3 ——>Learning more convenient

line4 ——>From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(line1,Rich learning form)
(line2,Intelligent learning engine)
(line3,Learning more convenient)
(line4,From the real demand for more close to the enterprise)
  • The key at this time is the Text sequence of each row before the tab.
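
In the driver this usually takes two lines: set the separator, then switch the input format. A hedged sketch follows; the tab separator and class name are chosen as examples, and with KeyValueTextInputFormat the mapper's input key/value types are both Text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputConfig {
    // Returns a Job configured to split each line at the first separator into (Text key, Text value).
    public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t"); // explicit here; tab is already the default
        Job job = Job.getInstance(conf, "kv-input-example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}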

  • NLineInputFormat

    If NLineInputFormat is used, the InputSplit handled by each map process is no longer divided by block, but by the number of lines N specified for NLineInputFormat. That is, the number of slices = total number of lines in the input file / N; if this does not divide evenly, the number of slices = quotient + 1. (A driver sketch appears at the end of this subsection.)

    The following example again uses the 4 lines of input above.

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

For example, if N is 2, each input split contains at most two lines and 2 MapTasks are started. The first mapper receives the first two lines:

(0,Rich learning form)
(19,Intelligent learning engine)

The other mapper receives the last two lines:

(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
  • The keys and values here are the same as those generated by TextInputFormat.
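
Configuring this in the driver is a two-line change; the sketch below assumes N = 2 as in the example above, and the class name is illustrative.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineInputConfig {
    // Makes each split carry at most 2 lines, so the 4-line sample above yields 2 MapTasks.
    public static void applyTo(Job job) {
        NLineInputFormat.setNumLinesPerSplit(job, 2);
        job.setInputFormatClass(NLineInputFormat.class);
    }
}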

  • Custom InputFormat

    • Overview

      • Customize a class that inherits FileInputFormat.
      • Override the RecordReader so that it reads one complete file at a time and packages it as a KV pair.
      • Use SequenceFileOutputFormat to output the merged files. (A minimal sketch of such an InputFormat follows at the end of this subsection.)
    • Case practice

      For details, see 7.4 Small File Processing (Custom InputFormat).
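
As a minimal sketch (not the code from section 7.4), the following custom InputFormat reads one whole file per record, following the three steps above. The class name and the choice of Text path as key and BytesWritable contents as value are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // one file = one slice, never split
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire file as a single key/value record.
    static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = split.getPath();
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = path.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(path);
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.toString());                 // key: full file path
            value.set(contents, 0, contents.length);  // value: whole file contents
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}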

MapTask working mechanism

  • Parallelism decision mechanism

    • Motivating question

      The parallelism of MapTasks determines the concurrency of task processing in the map phase, which in turn affects the processing speed of the whole job. So, are more parallel MapTasks always better?

    • MapTask parallelism determination mechanism

      The parallelism (number) of MapTask in the map phase of a job is determined by the number of slices when the client submits the job.
      [Figure: MapTask parallelism determination mechanism]

MapTask working mechanism

      [Figure: MapTask working mechanism]

(1) Read phase: the MapTask uses the user-written RecordReader to parse individual key/value pairs out of the input InputSplit.

(2) Map phase: the parsed key/value pairs are handed to the user-written map() function, which produces a series of new key/value pairs.

(3) Collect phase: inside the user-written map() function, OutputCollector.collect() is usually called once a record has been processed. Internally, it partitions the generated key/value pairs (by calling the Partitioner) and writes them into a ring-shaped in-memory buffer.

(4) Spill phase: when the ring buffer fills up, MapReduce writes the data to local disk, producing a temporary file. Note that before the data is written to local disk, it is first sorted locally and, when necessary, combined and compressed.
	Details of the spill phase:
	Step 1: sort the data in the buffer with quicksort, first by partition number and then by key within each partition. After sorting, the data is grouped by partition, and within each partition all records are ordered by key.
	Step 2: write the data of each partition, in increasing order of partition number, to the temporary file output/spillN.out under the task's working directory (N is the current spill count). If the user has configured a Combiner, the data of each partition is aggregated once before being written to the file.
	Step 3: write the metadata of the partitions to the in-memory index structure SpillRecord; the metadata of each partition includes its offset in the temporary file, its size before compression, and its size after compression. If the in-memory index grows beyond 1MB, it is written to the file output/spillN.out.index.

(5) Combine phase: when all data has been processed, the MapTask merges all temporary files once, so that in the end only one data file is produced.
	When all data has been processed, the MapTask merges all temporary files into one large file, saved as output/file.out, together with the corresponding index file output/file.out.index.
	During the merge, the MapTask merges partition by partition. For a given partition, it uses multiple rounds of recursive merging: each round merges io.sort.factor (default 10) files and adds the resulting file back to the list of files to merge; after sorting the files, the process is repeated until a single large file is obtained.
	Having each MapTask produce only one data file avoids the overhead of opening a large number of files at once and of the random reads caused by reading a large number of small files simultaneously.
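
The buffer size and the per-round merge fan-in mentioned above are configurable. A hedged sketch using the current property names (mapreduce.task.io.sort.mb and mapreduce.task.io.sort.factor, the successors of io.sort.*) is shown below; the values and class name are examples, not recommendations.

import org.apache.hadoop.conf.Configuration;

public class MapTaskTuning {
    // Example knobs for the spill/merge behavior described above; values are illustrative.
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);     // size of the in-memory ring buffer, in MB (default 100)
        conf.setInt("mapreduce.task.io.sort.factor", 50);  // number of spill files merged per round (default 10)
        return conf;
    }
}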
