Hadoop Big Data Technology: MapReduce (3) - MapReduce Framework Principles (Part 1)

Chapter 3 MapReduce framework principles

3.1 InputFormat data input

3.1.1 Slices and the MapTask parallelism decision mechanism
  • Motivating question
    MapTask parallelism determines how many map tasks process data concurrently during the Map phase, which in turn affects the processing speed of the whole Job.
    Consider: with 1 GB of data, starting 8 MapTasks improves the cluster's concurrent processing capacity. With only 1 KB of data, would starting 8 MapTasks also improve cluster performance? Is more MapTask parallelism always better? What factors determine the degree of MapTask parallelism?
  • MapTask parallelism decision mechanism
    Data block: a block is the unit into which HDFS physically divides data for storage.
    Data slice (called a split in the Hadoop API): a slice divides the input only logically; the data is not cut into pieces and stored on disk that way.
  • Figure: MapTask parallelism decision mechanism (image not included)
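To make the difference between a physical block and a logical slice concrete, here is a minimal sketch in plain Java (no Hadoop dependency). The 300 MB file, 128 MB block size, and the two candidate slice sizes are illustrative assumptions, and BlockVsSlice and blocksTouched are hypothetical names. It lists which blocks each slice would have to read from.

```java
import java.util.ArrayList;
import java.util.List;

class BlockVsSlice {
    /** Indexes of the physical blocks touched by the byte range [start, start + length). Requires length > 0. */
    static List<Long> blocksTouched(long start, long length, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        long first = start / blockSize;
        long last = (start + length - 1) / blockSize;
        for (long b = first; b <= last; b++) {
            blocks.add(b);
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 300 * mb;   // assumed file size
        long blockSize = 128 * mb;    // assumed HDFS block size
        for (long sliceSize : new long[] {100 * mb, 128 * mb}) {
            System.out.println("slice size = " + sliceSize / mb + " MB");
            // A slice is only (start, length) metadata; nothing is rewritten on disk.
            for (long start = 0; start < fileLength; start += sliceSize) {
                long length = Math.min(sliceSize, fileLength - start);
                System.out.println("  slice at " + start / mb + " MB, length " + length / mb
                        + " MB -> blocks " + blocksTouched(start, length, blockSize));
            }
        }
    }
}
```

With a 100 MB slice size, the second slice spans blocks 0 and 1, so a MapTask processing it may need data from two blocks. When the slice size equals the block size, each slice maps to exactly one block.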
3.1.2 Job submission source code and slicing source code in detail
  • Job submission source code in detail
waitForCompletion()

submit();

// 1. Establish the connection
    connect();
        // 1) Create the proxy used to submit the Job
        new Cluster(getConfiguration());
            // (1) Determine whether the job runs locally or on a remote YARN cluster
            initialize(jobTrackAddr, conf);

// 2. Submit the job
submitter.submitJobInternal(Job.this, cluster)
    // 1) Create the staging path used to submit data to the cluster
    Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);

    // 2) Get the job ID and create the Job path
    JobID jobId = submitClient.getNewJobID();

    // 3) Copy the jar to the cluster
    copyAndConfigureFiles(job, submitJobDir);
        rUploader.uploadFiles(job, jobSubmitDir);

    // 4) Compute the slices and generate the slice plan file
    writeSplits(job, submitJobDir);
        maps = writeNewSplits(job, jobSubmitDir);
            input.getSplits(job);

    // 5) Write the XML configuration file to the staging path
    writeConf(conf, submitJobFile);
        conf.writeXml(out);

    // 6) Submit the Job and return the submission status
    status = submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials());
  • FileInputFormat slicing source code walkthrough (input.getSplits(job))
3.1.3 FileInputFormat slicing mechanism

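As a rough illustration of how FileInputFormat plans slices for a single file, the sketch below re-creates the core arithmetic in plain Java. It is modeled on the slice size formula max(minSize, min(maxSize, blockSize)) and the 1.1 slop factor used by Hadoop's FileInputFormat, neither of which is spelled out in the text of this article. FileSliceSketch and planSlices are hypothetical names, and the file and block sizes are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class FileSliceSketch {
    // The last slice may be up to 10% larger than sliceSize before another slice is cut.
    static final double SPLIT_SLOP = 1.1;

    /** Slice size: the block size, clamped between the configured minimum and maximum. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Plans the slices of one file as {start, length} pairs; zero-length files are ignored in this sketch. */
    static List<long[]> planSlices(long fileLength, long sliceSize) {
        List<long[]> slices = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / sliceSize > SPLIT_SLOP) {
            slices.add(new long[] {fileLength - remaining, sliceSize});
            remaining -= sliceSize;
        }
        if (remaining != 0) {
            // The tail (at most 1.1 x sliceSize) becomes the final slice.
            slices.add(new long[] {fileLength - remaining, remaining});
        }
        return slices;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed defaults: minSize = 1, maxSize = Long.MAX_VALUE, so slice size = block size.
        long sliceSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long[] s : planSlices(300 * mb, sliceSize)) {
            System.out.println("start=" + s[0] / mb + " MB, length=" + s[1] / mb + " MB");
        }
    }
}
```

Under these assumptions a 300 MB file yields three slices (128 MB, 128 MB, 44 MB), while a 129 MB file stays in a single slice because 129 / 128 is below the 1.1 threshold. Each file is planned on its own, which is why many small files lead to many slices.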

3.1.4 CombineTextInputFormat slicing mechanism
  • The framework's default TextInputFormat plans slices file by file: no matter how small a file is, it becomes its own slice and is handed to its own MapTask. With a large number of small files this generates a large number of MapTasks, and processing efficiency is extremely low.
  1. Use case:
    CombineTextInputFormat is intended for scenarios with too many small files. It can logically plan multiple small files into a single slice, so that multiple small files are handled by one MapTask.
  2. Setting the maximum virtual storage slice size
    CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
    NOTE: The maximum virtual storage slice size is best chosen according to the actual sizes of the small files.
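For context, the setting above normally sits in a job driver. The following is a minimal driver sketch, assuming the Hadoop MapReduce client libraries are on the classpath. The class name CombineSmallFilesDriver, the job name, and reading the input and output paths from args are illustrative choices that do not come from the original article. No Mapper or Reducer is set, so the job falls back on the framework defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Replace the default TextInputFormat so that small files can share a slice.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Maximum virtual storage slice size: 4 MB, as in the text above.
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304L);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (assumed argument)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (assumed argument)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```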
  3. Slicing mechanism
    Generating slices consists of two parts: the virtual storage process and the slicing process.
    (1) Virtual storage process:
  • Compare the size of each file in the input directory, in turn, with the setMaxInputSplitSize value. If the file is not larger than the maximum, it forms one logical block. If the file is larger than twice the maximum, cut off a block of the maximum size. When the remaining data is larger than the maximum but not larger than twice the maximum, divide it evenly into two virtual storage blocks (this prevents very small slices).
  • For example, with setMaxInputSplitSize set to 4M and an input file of 8.02M, a 4M block is logically cut off first. If the remaining 4.02M were cut at 4M again, a tiny 0.02M virtual storage file would be left over, so the remaining 4.02M is instead divided into two files of 2.01M each.
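The virtual storage rule can be sketched in a few lines of plain Java. This follows the simplified description above rather than Hadoop's actual CombineFileInputFormat code. VirtualStorageSketch and virtualBlocks are hypothetical names, and sizes are given in hundredths of a megabyte (so 4M is 400 and 8.02M is 802), an assumption made only to keep the arithmetic exact.

```java
import java.util.ArrayList;
import java.util.List;

class VirtualStorageSketch {
    /** Divides one file into virtual storage block sizes according to the rule described above. */
    static List<Long> virtualBlocks(long fileSize, long maxSize) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        // Larger than twice the maximum: cut off one block of exactly maxSize.
        while (remaining > 2 * maxSize) {
            blocks.add(maxSize);
            remaining -= maxSize;
        }
        if (remaining > maxSize) {
            // Larger than the maximum but at most twice the maximum: divide evenly in two.
            long half = remaining / 2;
            blocks.add(half);
            blocks.add(remaining - half);
        } else if (remaining > 0) {
            // Not larger than the maximum: a single block.
            blocks.add(remaining);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // 8.02M file with a 4M maximum, in hundredths of a megabyte.
        System.out.println(virtualBlocks(802, 400)); // prints [400, 201, 201]
    }
}
```

The printed result corresponds to the 4M, 2.01M and 2.01M blocks of the example.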
    (2) Slicing process:
  • Check whether each virtual storage file is greater than or equal to the setMaxInputSplitSize value; if so, it forms a slice on its own.
  • If not, it is merged with the next virtual storage file, and together they form one slice.
  • Test example: given four small files of 1.7M, 5.1M, 3.4M and 6.8M, the virtual storage process produces 6 file blocks with sizes:
    1.7M, (2.55M, 2.55M), 3.4M and (3.4M, 3.4M)
    which finally form 3 slices with sizes:
    (1.7 + 2.55)M, (2.55 + 3.4)M, (3.4 + 3.4)M
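The slicing step of the test example can be checked with a small plain-Java sketch. It takes the six virtual storage block sizes as input and merges them as described, again following the simplified description rather than Hadoop's actual implementation. CombineSliceSketch and combine are hypothetical names, sizes are in hundredths of a megabyte, and emitting any leftover blocks as a final slice is an assumption the text does not cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombineSliceSketch {
    /** Merges consecutive virtual storage blocks until each slice reaches at least maxSize. */
    static List<Long> combine(List<Long> virtualBlocks, long maxSize) {
        List<Long> slices = new ArrayList<>();
        long current = 0;
        for (long block : virtualBlocks) {
            current += block;
            if (current >= maxSize) {
                // Large enough: close the current slice and start a new one.
                slices.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            // Assumption: blocks left over at the end still form a (smaller) slice.
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        // Virtual blocks for the 1.7M, 5.1M, 3.4M and 6.8M files with a 4M maximum.
        List<Long> blocks = Arrays.asList(170L, 255L, 255L, 340L, 340L, 340L);
        System.out.println(combine(blocks, 400)); // prints [425, 595, 680]
    }
}
```

The three values correspond to the (1.7 + 2.55)M, (2.55 + 3.4)M and (3.4 + 3.4)M slices listed above.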

Origin blog.csdn.net/zy13765287861/article/details/104684219