hadoop ecosystem: hdfs (storage), mapreduce (computing framework), yarn (resource scheduling)
Hdfs: distributed file system, used to manage data across n servers.
Each server in the cluster needs the HDFS software installed.
client: interacts with the Namenode and Datanodes, and splits files into blocks (file fragmentation)
Namenode: responsible for storing metadata --> block size, block locations . . .
and for interacting with the Datanodes
Datanode: actual data storage operations
Secondary Namenode: assists the Namenode by merging the fsimage and edits files (checkpointing)
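The Namenode/Datanode split described above can be sketched as a toy model: the Namenode keeps only metadata (which blocks make up a file and where each replica lives), while Datanodes hold the actual bytes. This is a plain-Python illustration, not the HDFS API; the paths, block ids, and node names are made up.

```python
# Toy model of the Namenode/Datanode division of labor (not the HDFS API).
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

# Namenode side: pure metadata — block list per file, replica locations per block.
namenode_metadata = {
    "/logs/app.log": [
        {"block": "blk_001", "datanodes": ["dn1", "dn2", "dn3"]},  # 3 replicas
        {"block": "blk_002", "datanodes": ["dn2", "dn3", "dn4"]},
    ],
}

def locate_blocks(path):
    """The client asks the Namenode where the blocks are, then reads
    the block data directly from the Datanodes listed."""
    return namenode_metadata[path]

for info in locate_blocks("/logs/app.log"):
    print(info["block"], "->", info["datanodes"])
```

Note the client never streams file data through the Namenode; after the metadata lookup it talks to the Datanodes directly.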
yarn: resource scheduling manager
Resourcemanager: responsible for processing resource requests and global scheduling
NodeManager: responsible for supplying resources on its node and starting containers (memory, cpu)
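A minimal sketch of that division of labor, assuming made-up node names and capacities (this simulates the idea in plain Python; it is not the YARN API):

```python
# ResourceManager: receives a resource request and picks a node with capacity.
# NodeManager: the chosen node actually hosts the container (memory, cpu).
nodes = {
    "nm1": {"mem_mb": 8192, "vcores": 8},  # hypothetical NodeManager capacities
    "nm2": {"mem_mb": 4096, "vcores": 4},
}

def resource_manager_allocate(mem_mb, vcores):
    """Find a NodeManager with enough free resources and 'launch' a container."""
    for name, free in nodes.items():
        if free["mem_mb"] >= mem_mb and free["vcores"] >= vcores:
            free["mem_mb"] -= mem_mb      # reserve the resources on that node
            free["vcores"] -= vcores
            return {"node": name, "mem_mb": mem_mb, "vcores": vcores}
    return None  # no capacity: the request waits until resources free up

container = resource_manager_allocate(2048, 2)
print(container)
```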
mapreduce: a computing framework
Implementation: implement the Mapper class, implement the Reducer class,
create a Job, and in the driver set the startup (driver) class, mapper class, reducer class,
mapper output types, reducer output types, and the input and output file paths, then start the job.
Components involved in the mapreduce process:
1. InputFormat component: reads the input file into key-value pairs.
With the default TextInputFormat, the key is the byte offset at which the line starts
and the value is the line content, e.g.:
0   hello taiyuanligong
20  hello chongqing
36  welcome to taiyuanligong
(note the keys are byte offsets, not line numbers 1, 2, 3)
RecordReader: getCurrentKey() / getCurrentValue() hand each pair
to the custom map() method, which typically calls value.toString() on the Text value
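The offset/value pairs above can be reproduced with a small Python simulation of what TextInputFormat's RecordReader hands to map() (this re-creates the behavior; it is not the Hadoop API):

```python
# Simulates the default TextInputFormat/RecordReader:
# key = byte offset of the line start, value = the line's text.
def record_reader(text):
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")  # getCurrentKey(), getCurrentValue()
        offset += len(line)              # advance by line length incl. newline

data = "hello taiyuanligong\nhello chongqing\nwelcome to taiyuanligong\n"
for key, value in record_reader(data):
    print(key, value)
```

Running this prints offsets 0, 20, 36 for the three lines, matching the example.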
2. Partitioner: partitions are determined according to our own needs.
There is a default partitioner (HashPartitioner) and the option of a custom partitioner.
Each partition corresponds to one ReduceTask,
and each ReduceTask produces one result file,
so the number of partitions determines the number of ReduceTasks.
After writing a custom partitioner, set the matching number of reduce tasks in the driver (job.setNumReduceTasks).
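The default HashPartitioner's rule is simple: mask off the sign bit of the key's hash and take it modulo the number of ReduceTasks, so equal keys always land in the same partition. A Python re-creation of that logic (Python's `hash` differs from Java's `hashCode`, so the partition numbers won't match Hadoop's, but the mechanism is the same):

```python
# Re-creation of Hadoop's default HashPartitioner logic.
NUM_REDUCE_TASKS = 3  # hypothetical job.setNumReduceTasks(3)

def get_partition(key):
    # Mask the sign bit, then modulo the reduce-task count,
    # so the result is always in [0, NUM_REDUCE_TASKS).
    return (hash(key) & 0x7FFFFFFF) % NUM_REDUCE_TASKS

for word in ["hello", "chongqing", "hello"]:
    print(word, "-> partition", get_partition(word))
```

Because partitioning is deterministic per key, both "hello" records go to the same ReduceTask, which is what makes per-key aggregation in the reducer possible.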
3. Combining small files (CombineTextInputFormat): merge many small input files into larger splits to reduce the number of MapTasks.
(This is distinct from the Combiner class, which pre-aggregates map output.)
Example: 1000 small files -> 1000 tasks; with 50 tasks able to run in parallel, the job needs 20 waves.
Merged into 500 files -> 500 tasks -> the job needs only 10 waves.
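The wave arithmetic above is just a ceiling division — tasks divided by the parallel-slot capacity:

```python
import math

PARALLEL_SLOTS = 50  # tasks that can run in parallel at the same time

def waves(num_tasks):
    """Number of sequential 'waves' needed to run num_tasks tasks."""
    return math.ceil(num_tasks / PARALLEL_SLOTS)

print(waves(1000))  # 1000 small files -> 1000 MapTasks -> 20 waves
print(waves(500))   # merged into 500 files -> 500 MapTasks -> 10 waves
```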
4. Sorting:
Full sort: have the custom bean implement the WritableComparable interface's compareTo method,
and write your own sorting rule in compareTo: from large to small or small to large.
Partial (per-partition) sort: combined with our custom partitioner;
in the driver, set the partitioner and use the custom bean as the key.
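In Hadoop the bean's compareTo() defines the shuffle's sort order; the sketch below plays the same role in Python with `__lt__` (via `functools.total_ordering`). The `FlowBean` class, its fields, and the large-to-small rule are illustrative assumptions, not code from the notes:

```python
# Full-sort sketch: a custom key bean whose comparison method (the analogue
# of WritableComparable.compareTo) sorts records from large to small.
from functools import total_ordering

@total_ordering
class FlowBean:
    def __init__(self, phone, amount):
        self.phone = phone    # hypothetical fields for illustration
        self.amount = amount

    def __eq__(self, other):
        return self.amount == other.amount

    def __lt__(self, other):
        # "compareTo": reversed comparison gives descending order
        return self.amount > other.amount

beans = [FlowBean("135", 200), FlowBean("136", 500), FlowBean("137", 100)]
for b in sorted(beans):       # the shuffle sorts keys with this same rule
    print(b.phone, b.amount)
```

For the per-partition variant, the same comparison applies, but records are first routed by the custom partitioner, so each ReduceTask's output file is sorted independently.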