[Eleven] Big data assignment: distributed parallel computing with MapReduce

Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/3319

 

1. In your own words, explain the functions, working principles, and working processes of HDFS and MapReduce on the Hadoop platform.

Hadoop HDFS is a distributed file system, used mainly to store and read data.

Work process: The working process can be divided into two kinds of operations, writes and reads.

(1) Write operation: Suppose a user wants to write a 100 MB file a into HDFS, with HDFS using its default configuration (a block size of 64 MB) and the cluster spread across three racks, Rack1, Rack2, and Rack3. The file is first divided into two blocks of at most 64 MB each, block1 and block2. The client then sends a write request to the NameNode. The NameNode records the block information and returns the DataNodes that are available to hold the data.
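To make the split concrete, here is a minimal sketch in plain Java of the arithmetic behind it; the 100 MB file size and 64 MB block size are the assumed values from the example above, not read from any real cluster configuration:

```java
public class BlockSplitExample {
    public static void main(String[] args) {
        long fileSize  = 100L * 1024 * 1024; // the 100 MB file "a" from the example
        long blockSize = 64L  * 1024 * 1024; // assumed HDFS default block size of 64 MB

        long fullBlocks = fileSize / blockSize;                  // 1 full 64 MB block (block1)
        long remainderMb = (fileSize % blockSize) / (1024 * 1024); // 36 MB left over (block2)
        long numBlocks = fullBlocks + (remainderMb > 0 ? 1 : 0);   // 2 blocks in total

        System.out.println("blocks = " + numBlocks + ", last block = " + remainderMb + " MB");
    }
}
```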

(2) Read operation: The client opens the file it wants to read by calling the open() method on the file system object; DistributedFileSystem then calls the NameNode over RPC to determine the locations of the first blocks of the file, and for each block the NameNode returns the addresses of the DataNodes holding its replicas. These first two steps return an FSDataInputStream object, which wraps a DFSInputStream; the DFSInputStream manages the streams to the NameNode and the DataNodes. The client then calls the read() method on the input stream. DFSInputStream, which has stored the DataNode addresses for the first block of the file, connects to the nearest DataNode, and repeated calls to read() transfer data from that DataNode to the client. When the end of a block is reached, DFSInputStream closes the connection to that DataNode and finds the best DataNode for the next block. Once the client has finished reading, it calls close() on the FSDataInputStream to close the file.
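A minimal Java sketch of this read path using the standard Hadoop FileSystem API; the path /user/demo/a.txt is a made-up example, and the configuration is assumed to point at a running HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // DistributedFileSystem when fs.defaultFS is hdfs://

        // open() asks the NameNode (over RPC) for the block locations and
        // returns an FSDataInputStream that wraps a DFSInputStream.
        FSDataInputStream in = fs.open(new Path("/user/demo/a.txt"));

        // Repeated read() calls stream the data from the nearest DataNode holding each block.
        IOUtils.copyBytes(in, System.out, 4096, false);

        in.close();                                // close() releases the DataNode connections
    }
}
```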

Working principle (write path): The client creates a new file by calling the create() method on DistributedFileSystem. DistributedFileSystem makes a remote procedure call to the NameNode to create a new file that has no blocks associated with it yet. After these first two steps an FSDataOutputStream object is returned, which wraps a DFSOutputStream; the DFSOutputStream coordinates with the NameNode and the DataNodes. As the client starts writing, DFSOutputStream cuts the data into small packets and places them in a queue. A DataStreamer consumes this queue; it first asks the NameNode which DataNodes are most suitable for storing the new block. After the client finishes writing the data, it calls the close() method to close the write stream.
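The corresponding write path looks roughly like this in the FileSystem API; again a sketch, with the output path and the string being written invented purely for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to register a new file with no blocks yet and
        // returns an FSDataOutputStream that wraps a DFSOutputStream.
        FSDataOutputStream out = fs.create(new Path("/user/demo/a.txt"));

        // Data written here is cut into packets and queued; the DataStreamer ships the
        // packets to the DataNodes that the NameNode picked for each block.
        out.writeBytes("hello hdfs\n");

        out.close();                               // close() flushes the remaining packets
    }
}
```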

 

MapReduce is a scalable parallel computing model, mainly used for offline batch processing of massive data.

Work process: During job execution there is one JobTracker and multiple TaskTrackers, corresponding roughly to the NameNode and DataNodes of HDFS. The JobClient on the client side packages the configured job into a JAR file, stores it in HDFS, and submits the storage path to the JobTracker; the JobTracker then creates the individual Tasks and distributes them to the TaskTracker services for execution.
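A minimal driver sketch of this submission from the client side. It uses the newer org.apache.hadoop.mapreduce API rather than the classic JobTracker-era JobClient, but it triggers the same flow described above; the TokenizerMapper and IntSumReducer classes it references are sketched after the working-principle paragraph below, and the input/output paths are taken from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);  // the job JAR that gets shipped to the workers
        job.setMapperClass(WordCountMapReduce.TokenizerMapper.class);
        job.setReducerClass(WordCountMapReduce.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits come from here
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final reduce output lands here

        // Submit the job and wait; the framework hands the map and reduce tasks to the workers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```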

Working principle: According to the InputFormat, the framework divides the input files into splits, and each split becomes the input of one map task. Each map task has an in-memory buffer: the intermediate results produced by the map stage are written into this buffer, and the partitioner decides which partition (that is, which reduce task) each record belongs to. When the data in the buffer reaches a threshold, a thread is started to spill the buffered data to disk, without stopping the map task from writing further intermediate results into the buffer. During the spill, the MapReduce framework sorts the records by key; if the intermediate results are large, several spill files are produced, and whatever remains in the buffer at the end is spilled to disk as a final spill file. If there is more than one spill file, they are merged into a single file at the end. When every map task finishes, it has therefore produced one final output file, divided into partitions. Before the reduce tasks start, as soon as a map task completes, threads begin pulling that map task's output to the appropriate reduce tasks and merging it, preparing the reduce input. Once all map tasks have completed and their data has been pulled and merged, the reduce tasks start, and the final results are stored in the output directory on HDFS.
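A minimal word-count sketch of the map and reduce sides described above (the word-count logic itself is an assumed example, not part of the original text): each map task tokenizes its split and emits (word, 1) pairs into the buffer/spill machinery, and each reduce task receives the merged, sorted values for a key and sums them.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapReduce {

    // One map task per input split; context.write() feeds the in-memory buffer,
    // which the framework partitions, sorts by key, and spills to disk.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Each reduce task pulls its partition from every finished map task, merges it,
    // and receives all values for one key in a single reduce() call.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```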

 
