【Big Data】A short summary

Hadoop: an ecosystem built around HDFS (distributed storage), MapReduce (computing framework), and YARN (resource scheduling).

HDFS: a distributed file system, used to manage data across n servers.

          Each server in the cluster needs to run the HDFS software.
          Client: interacts with the NameNode and the DataNodes, and splits files into blocks.
          NameNode: stores the metadata (block sizes, block locations, ...) and
                    coordinates with the DataNodes.
          DataNode: performs the actual data storage operations.
          Secondary NameNode: assists the NameNode by merging the fsimage and edits files.
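
          For a concrete picture of the client's role, here is a minimal sketch using
          the HDFS Java API; the hdfs://namenode:9000 address, the "hadoop" user, and
          the file paths are assumptions for illustration:

             import java.net.URI;
             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;

             public class HdfsClientDemo {
                 public static void main(String[] args) throws Exception {
                     // The client asks the NameNode for metadata (block locations, ...)
                     // and then exchanges the actual blocks directly with the DataNodes.
                     Configuration conf = new Configuration();
                     FileSystem fs = FileSystem.get(
                             URI.create("hdfs://namenode:9000"), conf, "hadoop"); // address/user assumed
                     fs.copyFromLocalFile(new Path("/tmp/local.txt"),   // local source (assumed)
                                          new Path("/data/local.txt")); // HDFS target (assumed)
                     fs.close();
                 }
             }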
          
YARN: resource scheduling manager
     ResourceManager: handles resource requests.
     NodeManager: finds resources on its node and starts containers (memory, CPU).
    
MapReduce: a computing framework.
     Implementation: write a class that extends the Mapper class and one that extends
          the Reducer class, then create a Job that encapsulates the startup (driver)
          class, the mapper class, the reducer class, the mapper output types, the
          reducer output types, and the input and output file paths, and finally
          submit the job.
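
     As a concrete sketch of that structure, here is a minimal word-count job;
     the class names and the args[0]/args[1] paths are illustrative, not part
     of the original notes:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {

            // Mapper: (byte offset, line) -> (word, 1)
            public static class WordCountMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final Text word = new Text();
                private final IntWritable one = new IntWritable(1);

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    for (String w : value.toString().split("\\s+")) {
                        if (w.isEmpty()) continue;  // skip leading-whitespace artifacts
                        word.set(w);
                        context.write(word, one);
                    }
                }
            }

            // Reducer: (word, [1, 1, ...]) -> (word, count)
            public static class WordCountReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values,
                                      Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) sum += v.get();
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCountDriver.class);      // startup class
                job.setMapperClass(WordCountMapper.class);     // mapper class
                job.setReducerClass(WordCountReducer.class);   // reducer class
                job.setMapOutputKeyClass(Text.class);          // mapper output types
                job.setMapOutputValueClass(IntWritable.class);
                job.setOutputKeyClass(Text.class);             // reducer output types
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.setInputPaths(job, new Path(args[0]));  // input path
                FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
                System.exit(job.waitForCompletion(true) ? 0 : 1);       // start the job
            }
        }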

The following components are involved in the MapReduce process:
  1. InputFormat component: reads the input file and turns it into key-value pairs.
     For example, given this input file (line numbers shown only for reference)

         1 hello taiyuanligong
         2 hello chongqing
         3 welcome to taiyuanligong

     the default TextInputFormat produces one pair per line, where the key is the
     line's starting byte offset and the value is the line text:

         0  hello taiyuanligong
         20 hello chongqing
         36 welcome to taiyuanligong

     RecordReader: hands each pair over through getCurrentKey() and getCurrentValue();
     the value arrives in the custom map() method, where it is typically read with
     value.toString(), as in the sketch below.
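
     A minimal sketch of how these pairs arrive in a custom mapper; the
     pass-through logic is only for illustration:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // With TextInputFormat, the RecordReader's getCurrentKey() returns the line's
        // byte offset (LongWritable) and getCurrentValue() returns the line (Text).
        public class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();  // the value.toString() step from above
                context.write(new Text(line), NullWritable.get());
            }
        }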
  2. Partitioner: determines which partition each key-value pair goes to, according
     to our own needs. There is the default partitioner (hash-based) and there can
     be custom partitioners. Each partition corresponds to one ReduceTask, and each
     ReduceTask produces one result file, so the number of ReduceTasks must match
     the number of partitions. After writing a custom partitioner, set the
     corresponding number of ReduceTasks in the driver, as in the sketch below.
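
     A sketch of a custom partitioner for the word-count types above; the
     first-letter routing rule is an assumption for illustration:

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        // Two partitions: words starting with "h" go to partition 0, the rest to 1.
        public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
            @Override
            public int getPartition(Text key, IntWritable value, int numPartitions) {
                return key.toString().startsWith("h") ? 0 : 1;
            }
        }

        // In the driver:
        //     job.setPartitionerClass(FirstLetterPartitioner.class);
        //     job.setNumReduceTasks(2); // one ReduceTask (one result file) per partition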
  3. CombineTextInputFormat: packs many small files into larger input splits,
      reducing the number of MapTasks (not to be confused with the Combiner, which
      pre-aggregates map output). With 1000 small files there are 1000 MapTasks;
      if 50 tasks can run in parallel at a time, 20 rounds are needed. Merged down
      to 500 files, there are 500 MapTasks and only 10 rounds are needed. See the
      driver sketch below.
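
      A sketch of the driver-side settings, assuming a 4 MB maximum split size
      (the helper class name and the size are illustrative):

        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

        public class SmallFilesConfig {
            // Call from the driver's main() after the Job is created: packs many
            // small files into fewer input splits, so fewer MapTasks are started.
            public static void useCombinedSplits(Job job) {
                job.setInputFormatClass(CombineTextInputFormat.class);
                CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024); // 4 MB
            }
        }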
   
  4. Sorting:
      Full sort: have the custom bean implement the WritableComparable interface
          and write your own sorting rule in its compareTo method: from large to
          small, or from small to large.
      Partial sort: combine the full sort with our custom partitioner: in the
          driver, set the partitioner and use the custom bean as the key type.
          A bean sketch follows this item.
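
      A sketch of such a bean with a single numeric field, sorted from large to
      small; the field name is an assumption:

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.WritableComparable;

        // Custom bean used as a MapReduce key; compareTo defines the sort order.
        public class FlowBean implements WritableComparable<FlowBean> {
            private long totalFlow;                   // illustrative field

            public FlowBean() {}                      // no-arg constructor required by Hadoop
            public long getTotalFlow() { return totalFlow; }
            public void setTotalFlow(long f) { this.totalFlow = f; }

            @Override
            public void write(DataOutput out) throws IOException {
                out.writeLong(totalFlow);             // serialization
            }

            @Override
            public void readFields(DataInput in) throws IOException {
                totalFlow = in.readLong();            // deserialization, same field order
            }

            @Override
            public int compareTo(FlowBean other) {
                // From large to small; swap the arguments for small to large.
                return Long.compare(other.totalFlow, this.totalFlow);
            }

            @Override
            public String toString() { return String.valueOf(totalFlow); }
        }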
 


Origin blog.csdn.net/Qmilumilu/article/details/104677234