Hadoop learning summary

I. HDFS

1. Basic concepts of HDFS

  HDFS (Hadoop Distributed File System) is Hadoop's distributed file system, used mainly to solve the problem of storing massive amounts of data.

 

2. HDFS file write process

Suppose a client wants to save a file to an HDFS cluster.

  1. The client accesses the NameNode through RPC (remote procedure call) and requests to write a file.
  2. The NameNode checks whether the client has write permission. If it does, it returns a response; if not, an exception is thrown to the client.
  3. The client splits the file into Blocks according to the BlockSize (128 MB by default), then requests to write the first Block.
  4. Based on its load-balancing mechanism, the NameNode returns to the client a list of nodes satisfying the replication factor (3 by default): the BlockId plus each node's host, port, and storage directory.
  5. According to the returned list, the client builds a pipeline: client -> first node -> second node -> third node.
  6. Data transfer begins: each Block is sent as a series of Packets. When a Packet has been successfully transferred to the first DataNode, that DataNode starts copying it and passes the Packet down the pipeline to the next DataNode; after receiving the Packet, that DataNode continues copying and forwards it to the next DataNode.
  7. When a Block has been transferred successfully, ACKs are returned along the pipeline, starting from the last DataNode, back to the client.
  8. The client maintains an internal ACK queue and matches the returned ACKs against it; as long as at least one DataNode has written successfully, the write operation is considered complete.
  9. The next Block is then written, repeating steps 3-8.

If a DataNode goes down during the transfer, it drops out of the pipeline and the remaining DataNodes continue the transfer. After the transfer finishes, the NameNode instructs a DataNode that holds a successfully written replica of the Block to copy it to a new DataNode.
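The same write path can be exercised from a client program through the standard FileSystem API. Below is a minimal sketch; the NameNode address (hdfs://localhost:9000) and the target path are placeholder assumptions, not values from this article.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode for permission and target DataNodes;
        // the returned stream then writes Packets through the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}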

 

3. HDFS file read process

 

  1. The client sends a read request to the NameNode through RPC.
  2. The NameNode checks whether the client has read permission. If it does, it returns a response; if not, an exception is thrown to the client.
  3. The client asks the NameNode for the file it needs to read.
  4. The NameNode returns a list of the locations of every Block of the file.
  5. From the returned list the client selects the nearest node for each Block, establishes a connection, and reads the Block; the Block's verification (checksum) information is read along with it.
  6. After a Block has been read, the client computes a checksum and compares it with the checksum it read. If they match, the read is correct; if not, the Block is read again from another node.
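As a counterpart, here is a minimal read sketch using the same FileSystem API: open() asks the NameNode for block locations, and the stream verifies checksums while reading from the nearest DataNode. The address and path are again placeholder assumptions.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}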

 

II. MapReduce

1. MapReduce basic concepts

  MapReduce is a distributed computing model proposed by Google, originally used in the search field to solve the problem of computing over massive amounts of data. MR consists of two phases, Map and Reduce; users only need to implement the map() and reduce() functions to achieve distributed computation. The parameters of these two functions are key-value pairs that represent the function's input. The MR execution flow is as follows:

 (1) Map task

   ① Read the contents of the input file and parse them into key-value pairs: each line of the input file is parsed into one key-value pair, and the map function is called once for each pair.

   ② Write your own logic to process the input key and value and convert them into new key-value output.

   ③ Partition the output key-value pairs.

   ④ Within each partition, sort and group the data by key; values with the same key are put into one set.

   ⑤ (Optional) locally reduce (combine) the grouped data; see the combiner sketch after the Reduce task steps.

 (2) Reduce task

   ① The outputs of the multiple map tasks are copied over the network to different reduce nodes according to their partitions.

   ② The outputs of the multiple map tasks are merged and sorted. Write your own logic in the reduce function to process the input key and value and convert them into new key-value output.

   ③ Store the reduce output in a file.
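Step ⑤ of the Map task, the optional local reduction, is normally implemented with a combiner. The sketch below is illustrative only: WCCombiner is a name invented here, and reusing reduce-style summing as a combiner is only safe because adding partial counts is associative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative combiner: sums the partial counts emitted by one map task
// before they are shuffled across the network (step ⑤ above).
public class WCCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Registered in the driver (see the WordCount program below):
//   job.setCombinerClass(WCCombiner.class);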

 

2. The WordCount program

  What the program does: given a test text file, the WordCount MR program uses the MapReduce model to count the total number of times each word appears in the text.

 

  The idea is: the MapTask calls our map method once for each row of data; the ReduceTask then groups records with the same key and calls the reduce method once per group.

 The code is as follows:

 

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountTest {
    // KEYIN:    start position of each row of data (the row offset)
    // VALUEIN:  contents of each row
    // KEYOUT:   key type of the map output
    // VALUEOUT: value type of the map output
    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // key:     the row offset
        // value:   the data of the row
        // context: used to send out the data processed by our map
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String word : words) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    // KEYIN:    key type received by the reduce side
    // VALUEIN:  value type received by the reduce side
    // KEYOUT:   key type of the reduce output
    // VALUEOUT: value type of the reduce output
    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // key:     the map output key received by the reduce side
        // values:  the set of values that share the same key
        // context: used to send out the data processed by our reduce
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountTest.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        // specify the output key/value types of map and reduce
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path("E:\\input"));
        // delete the output directory if it already exists, otherwise the job fails
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path("E:\\output");
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);
        // submit the job and wait for it to finish, printing progress
        job.waitForCompletion(true);
    }
}
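When this driver is packaged into a jar and run on a cluster, it is typically launched with the hadoop jar command; the E:\input and E:\output paths above only make sense for a local-mode test on Windows.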

 

 

 

3. How MapReduce works

  The default splitSize is 128 MB, the same as the HDFS block size.

  FileInputFormat first splits the input into slices and scans each slice line by line; the RecordReader class's getCurrentKey() and getCurrentValue() return a key (the row offset) and a value (the contents of the row).

  The context passes the returned key and value to the MapTask so that the map method can process them.

  After the map method has processed them, the resulting key and value are serialized and written to a ring buffer (100 MB by default). When the ring buffer reaches 80% capacity, its contents are spilled (overflow-written) to disk.
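A hedged sketch of how these two thresholds can be tuned from the driver, assuming the Hadoop 2.x configuration keys mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent (check mapred-default.xml for your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative only: inside a driver main such as WordCountTest above,
// tune the spill buffer described in this section before creating the Job.
Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 100);        // ring buffer size in MB (default 100)
conf.set("mapreduce.map.sort.spill.percent", "0.80"); // spill when 80% full (default 0.80)
Job job = Job.getInstance(conf);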

  During the spill the data is partitioned: by default the key's hashCode is taken modulo the number of reduceTasks, and keys with the same result are assigned to the same partition. Within each partition the data is sorted by key, lexicographically by default, using quicksort.

  key -> key.hashCode() -> modulo the number of reduceTasks -> the modulo result decides the partition.
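The default partitioner implements exactly this rule. The class below is a re-implementation written for illustration; Hadoop's own version lives at org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.

import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative re-implementation of the default hash partitioning rule.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the hash is non-negative, then take it modulo
        // the number of reduce tasks to select the partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}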

  When the MapTask finishes, its spill files are merged so that data belonging to the same partition ends up together, sorted with a merge sort.

  At this point the MapTask ends.

 

  After the map side finishes, the reduce side pulls the files belonging to its partition from the finished MapTasks and merges them with a merge sort.

  Each ReduceTask processes the data of one partition.

  The ReduceTask then groups the data by key: records with the same key form one group.

  The reduce method is called once for each group.

  After a ReduceTask finishes processing, it writes its output to a file.

 

III. Yarn resource scheduling flow

  1. The client submits its program to Yarn.
  2. The RM returns a jobid and a path to the client.
  3. The client submits the program's information (jar package, split information, serialized files) to that path.
  4. After the submission completes, the client sends an acknowledgment to the RM.
  5. The RM has the NodeManager on the node storing the submitted information create a container and start our ApplicationMaster in it.
  6. The ApplicationMaster registers with the RM and, based on the split information and the submitted program, applies for containers.
  7. After receiving the resource request, the RM communicates with the NMs, and each NM creates the required containers on its own node.
  8. The ApplicationMaster sends the corresponding task information to the NM on each node, which runs the MapTasks in the created containers.
  9. The ReduceTasks are then run.
  10. After all Tasks have finished, the ApplicationMaster deregisters from the RM, and the RM reclaims the resources.

RM: responsible for allocating resources.

ApplicationMaster: applies for resources and monitors the program.

NM: responsible for creating containers and running Tasks.
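From the client's point of view, this entire flow is driven by a single call in the driver; a minimal sketch, assuming job is the configured Job from the WordCount program above:

// Submits the job to the ResourceManager (steps 1-4), then polls progress while
// the ApplicationMaster and the Tasks run (steps 5-10); returns true on success.
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);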

 
