MapReduce distributed parallel computing -- Assignment 11

Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/3319


1. In your own words, explain the functions, working principles, and processes of HDFS and MapReduce on the Hadoop platform.

  HDFS

    Features

      A distributed file system for storing massive amounts of data.

    Working principle

    1. An HDFS cluster has two kinds of roles: the NameNode and the DataNodes (plus a Secondary NameNode).

    2. The NameNode manages the metadata of the entire file system.

    3. The DataNodes manage the users' file data blocks.

    4. A file is cut into blocks of a fixed size (blocksize) and the blocks are distributed across several DataNodes.

    5. Each block of a file can have multiple replicas, stored on different DataNodes.

    6. Each DataNode periodically reports the blocks it stores to the NameNode, and the NameNode is responsible for keeping each file at its configured replication factor.

    7. The inner workings of HDFS are transparent to the client; every client request to HDFS goes through the NameNode first.
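
    The block and replica bookkeeping described above can be inspected from a client machine. Below is a minimal sketch, assuming a running HDFS cluster, the hdfs command on the PATH, and a hypothetical file /user/hadoop/input/demo.txt; it calls hdfs fsck from Python to show how the file is cut into blocks and on which DataNodes the replicas live.

      import subprocess

      # Hypothetical HDFS path -- replace with a file that exists in your cluster.
      path = "/user/hadoop/input/demo.txt"

      # fsck reports, per file, how it is split into blocks and where the replicas
      # of each block are stored -- i.e. the metadata the NameNode maintains.
      result = subprocess.run(
          ["hdfs", "fsck", path, "-files", "-blocks", "-locations"],
          capture_output=True, text=True, check=True,
      )
      print(result.stdout)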

    Work process

      Write    

      1. The client sends a request to the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.

      2. The NameNode replies whether the file can be uploaded.

      3. The client asks which DataNode servers the first block should be transferred to.

      4. The NameNode returns three DataNode servers, say A, B and C.

      5. The client requests A to upload the data (essentially an RPC call that establishes a pipeline); A passes the request on to B, and B passes it on to C, so the whole pipeline is set up and an acknowledgement is returned to the client step by step.

      6. The client starts uploading the first block to A (reading the data from disk into a local memory cache), packet by packet; A forwards each packet it receives to B, and B forwards it to C. For every packet A sends, it puts the packet into a reply queue and waits for the acknowledgement.

      7. When one block has been transferred, the client again asks the NameNode which servers the second block should be uploaded to.

      Read

      1. The client communicates with the NameNode, queries the metadata, and finds the DataNode servers that hold the file's blocks.

      2. It picks a DataNode (nearest first, then at random) and requests it to establish a socket stream.

      3. The DataNode starts sending the data (reading it from disk into the stream, packet by packet, verifying checksums).

      4. The client receives the packets, caches them locally, and then writes them to the target file.
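
      From the client's point of view, the pipeline and packet mechanics of the write and read processes above are hidden behind simple calls. Below is a minimal sketch, assuming a running cluster, the hdfs command on the PATH, and a hypothetical local file /home/hadoop/wc/demo.txt; it uploads the file and then reads it back.

        import subprocess

        local_file = "/home/hadoop/wc/demo.txt"   # hypothetical local file
        hdfs_dir = "user/hadoop/input"            # hypothetical HDFS directory
        hdfs_file = hdfs_dir + "/demo.txt"

        # Write path: HDFS cuts the file into blocks and replicates them
        # through the DataNode pipeline described above.
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)

        # Read path: the client locates the blocks via the NameNode, reads them
        # packet by packet from the DataNodes and verifies the checksums.
        result = subprocess.run(["hdfs", "dfs", "-cat", hdfs_file],
                                capture_output=True, text=True, check=True)
        print(result.stdout)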

  MapReduce

    Features

      A parallel processing framework that handles task decomposition and scheduling.

    Working principle

     1. The Job's submit() method creates a JobSubmitter instance and calls its submitJobInternal() method.

     2. The job is submitted to the ResourceManager, and an application ID is obtained from the ResourceManager.

     3. The job client checks the job's output specification, computes the input splits, and copies the job resources (including the job JAR, the configuration, and the split information) to HDFS.

     4. The job is submitted by calling submitApplication() on the ResourceManager.

     5. When the ResourceManager receives the submitApplication() message, it hands the request to the scheduler. The scheduler allocates a container, and the ResourceManager then starts the application master process in that container, under the management of a NodeManager.

     6. Job initialization: the application master creates a number of bookkeeping objects to keep track of the job's progress, because it will receive progress and completion reports from the tasks.

     7. The application master retrieves from HDFS the input split information computed by the client.

     8. It connects to the ResourceManager and requests resources (containers) for the tasks from the ResourceManager.

     9. The application master starts the containers by communicating with the NodeManagers; each task is executed by a Java program whose main class is YarnChild.

     10. Before the task in step 9 runs, it localizes the resources the task needs, including the job JAR, the configuration, and the split files from HDFS.

     11. Finally the task runs: either a map task or a reduce task.
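
     None of the steps above are written by the user; submitting a job (for example the streaming job shown in part 2 below) triggers the whole chain. While the job runs, the applications registered with the ResourceManager, including the MapReduce application master started in step 5, can be listed with the yarn CLI; a minimal sketch, assuming the yarn command is on the PATH:

       import subprocess

       # Lists the applications the ResourceManager currently knows about
       # (application ID, type, state, and the application master's tracking URL).
       subprocess.run(["yarn", "application", "-list"], check=True)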

    Work process

    A MapReduce job runs in two stages: map and reduce. The input and output of each stage take the form of key-value pairs, and the key and value types can be chosen by the user. In the map stage the input data is split up and processed in parallel; the map results are handed to reduce, and the reduce function produces the final summary.
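
    The key-value flow can be illustrated without a cluster. The sketch below is not Hadoop code; it only mimics the data flow in plain Python for a word count: the map function emits (word, 1) pairs, the pairs are grouped by key (the shuffle), and the reduce function sums the values for each key.

      from itertools import groupby
      from operator import itemgetter

      def map_fn(line):
          # map stage: emit a (key, value) pair for every word in the line
          for word in line.split():
              yield (word, 1)

      def reduce_fn(key, values):
          # reduce stage: combine all the values that share the same key
          return (key, sum(values))

      lines = ["hello hadoop", "hello mapreduce"]

      # map
      pairs = [pair for line in lines for pair in map_fn(line)]
      # shuffle: sort and group the intermediate pairs by key
      pairs.sort(key=itemgetter(0))
      # reduce
      for key, group in groupby(pairs, key=itemgetter(0)):
          print(reduce_fn(key, (v for _, v in group)))    # e.g. ('hello', 2)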

2. Running MapReduce on HDFS

  1) Prepare a text file locally under /home/hadoop/wc.

  2) Write a map function and a reduce function, and test them by running them locally (a sketch is given after this list).

  3) Start Hadoop: HDFS, JobTracker, TaskTracker.

  4) Upload the text file to user/hadoop/input in the HDFS file system.

  5) Add the path of the streaming jar file to an environment variable and make the environment variable take effect.

  6) Create a shell script file, run.sh, that invokes the streaming interface to run the job.

  7) Run source run.sh to execute the MapReduce job.

  8) Check the results of the run.
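
  For step 2, the map and reduce functions used with the streaming interface are ordinary scripts that read from standard input and write key-value pairs to standard output, so they can be tested locally with a pipe before being handed to Hadoop. Below is a minimal word-count sketch; the file names mapper.py and reducer.py, the HDFS paths, and the location of the streaming jar are assumptions that depend on your installation, and run.sh in step 6 would essentially wrap the hadoop jar command shown in the trailing comment.

    # ---- mapper.py : map stage, emits "word<TAB>1" for every word on stdin ----
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # ---- reducer.py : reduce stage; streaming sorts the map output by key, ----
    # ---- so all lines for the same word arrive one after another            ----
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

    # Local test (step 2):
    #   cat /home/hadoop/wc/*.txt | python3 mapper.py | sort | python3 reducer.py
    #
    # run.sh (step 6) essentially contains the streaming invocation, for example
    # (the jar path depends on the Hadoop version and installation directory):
    #   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    #     -input user/hadoop/input -output user/hadoop/output \
    #     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py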

 
