Hadoop simple learning summary

1. What is Hadoop?
Hadoop is software for data processing and analysis. It consists of HDFS (a distributed file storage system), MapReduce (a distributed computing framework), and Yarn (a distributed resource-scheduling and task-allocation framework).
Hadoop grew out of the three papers Google published starting in 2003. Based on those papers, Doug Cutting (the creator of Lucene) spent two years of spare time developing prototype code for MapReduce and HDFS, and in 2005 the Hadoop project entered the Apache Software Foundation.
Hadoop characteristics:
① High fault tolerance: every piece of data in Hadoop has copies, i.e. the data is backed up.
② High reliability: Hadoop has fault-recovery mechanisms; if a node goes down, the data it stored is regenerated on another node.
③ Efficiency: Hadoop's MapReduce jobs run in parallel.
④ High scalability: a Hadoop cluster can be extended to nearly thousands of nodes.

2. What is HDFS?
HDFS is a distributed file storage system (it solves the problem of storing massive amounts of data).
HDFS uses a master/slave architecture:
① NameNode: the master node; it manages metadata (information about the stored data).
② DataNode: the slave nodes; they store the actual data.

HDFS blocks:

In version 1.x the default block size is 64 MB.

In version 2.x the default is 128 MB.

HDFS replication mechanism: for fault tolerance, every block has copies. The block size and replication factor are configurable per file, and an application can specify the number of copies of a file. The replication factor can be specified when the file is created and can be changed later.
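A minimal sketch of the arithmetic behind blocks and replicas (plain Python, not Hadoop's API; the function names are made up for illustration), using the 2.x defaults of 128 MB blocks and a replication factor of 3:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 2.x default block size
REPLICATION = 3                 # default replication factor

def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks needed for a file (the last block may be partial)."""
    return max(1, -(-file_size // block_size))  # ceiling division

def total_replicas(file_size: int, replication: int = REPLICATION) -> int:
    """Total block copies stored across the cluster for one file."""
    return block_count(file_size) * replication

# A 300 MB file occupies 3 blocks (128 + 128 + 44 MB), stored as 9 copies.
print(block_count(300 * 1024 * 1024))    # → 3
print(total_replicas(300 * 1024 * 1024)) # → 9
```

Note that a partial last block only occupies its actual size on disk; the block size is an upper bound, not an allocation unit.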
HDFS is well suited for:
write once, read many times (i.e. massive data storage).
Files cannot be modified in place, so HDFS is not suitable for use as a network drive.
HDFS read/write data flow:
The write process:

① The client initiates a file-upload request and establishes communication with the NameNode via RPC. The NameNode checks whether the target file already exists and whether its parent directory exists, and returns whether the upload is allowed.
② The client asks the NameNode which DataNode servers the first block should be transferred to.
③ The NameNode allocates DataNodes according to the number of replicas specified in the configuration file and the replica placement policy, and returns the addresses of available DataNodes, e.g.: A, B, C.
Note: weighing data safety against efficiency, HDFS by default places three copies of the data: one on the local node, one on another node in the same rack, and one on a node in a different rack.
④ The client asks one of the three DataNodes, A, to upload the data (essentially an RPC call that establishes a pipeline). After receiving the request, A calls B, and B calls C, so the whole pipeline is created and then acknowledged back, step by step, to the client.
⑤ The client starts uploading the first block to A (first reading the data from disk into a local memory cache), in units of packets (64 KB by default). A passes each packet it receives on to B, and B passes it to C; for every packet it sends, A places the packet in a reply queue to wait for the acknowledgement.
⑥ The data is divided into packets that are transmitted one after another along the pipeline; in the opposite direction, individual ACKs (correct-response acknowledgements) are sent back, finally from A, the first DataNode in the pipeline, to the client.
⑦ When one block transfer is complete, the client again asks the NameNode which servers to upload the second block to.
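The packet pipeline in steps ④–⑥ can be sketched as a toy simulation (plain Python; the data structures are invented for illustration and do not reflect Hadoop's internals):

```python
# Simulate the write pipeline: the client streams 64 KB packets to DataNode A,
# which forwards each packet to B, which forwards to C; the ACK for each
# packet flows back C -> B -> A -> client.

PACKET_SIZE = 64 * 1024  # default packet size in the write pipeline

def write_through_pipeline(data: bytes, pipeline: list) -> int:
    """Send data packet by packet through the DataNode pipeline.
    Each element of `pipeline` is a dict standing in for a DataNode's storage.
    Returns the number of packets acknowledged."""
    acked = 0
    for offset in range(0, len(data), PACKET_SIZE):
        packet = data[offset:offset + PACKET_SIZE]
        for node in pipeline:                       # forward: A -> B -> C
            node.setdefault("chunks", []).append(packet)
        acked += 1                                  # ACK travels back to the client
    return acked

nodes = [{"name": n} for n in ("A", "B", "C")]
acks = write_through_pipeline(b"x" * (200 * 1024), nodes)
print(acks)  # 200 KB in 64 KB packets -> 4 packets (the last one partial)
```

Every node in the pipeline ends up with an identical copy of the block, which is exactly the replication the note above describes.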

The read process:


① The client sends an RPC request to the NameNode to determine the locations of the file's blocks.
② Depending on the situation, the NameNode returns all or part of the file's block list; for each block, it returns the addresses of the DataNodes holding a copy. The returned DataNode addresses are sorted by two rules, according to the cluster topology and the client's (client) distance from each DataNode: DataNodes nearest to the client in the network topology are ranked first, and DataNodes whose heartbeat reports have timed out are marked STALE and ranked last.
③ The client picks the top-ranked DataNode to read each block from; if the client itself is a DataNode, it reads the data directly from the local copy (the short-circuit read feature).
④ Reading is essentially based on an underlying SocketStream (FSDataInputStream) that repeatedly calls the read method of its DataInputStream parent class until the block's data has been fully read.
⑤ After finishing the current block list, if the end of the file has not been reached, the client asks the NameNode for the next batch of blocks.
⑥ Each block read is verified with a checksum; if an error occurs while reading a DataNode, the client notifies the NameNode and then continues reading from another DataNode that holds a copy of the block.
⑦ The read method fetches block information in parallel rather than one block at a time; the NameNode only returns the addresses of the blocks the client requested, not the block data itself.
⑧ Finally, all the blocks read are merged into one complete file.
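Steps ③ and ⑥ together form a simple fallback loop, sketched below in plain Python (the data structures are invented; HDFS actually uses CRC32 checksums rather than the MD5 used here for brevity):

```python
import hashlib

# For each block, try the replicas in sorted (nearest-first) order, verify the
# checksum, and fall back to the next replica on a mismatch; finally merge
# all blocks into the complete file.

def read_file(block_locations, checksums):
    """block_locations: per block, a list of candidate replica payloads,
    nearest replica first. checksums: expected hex digest per block."""
    out = []
    for replicas, expected in zip(block_locations, checksums):
        for payload in replicas:                          # nearest first
            if hashlib.md5(payload).hexdigest() == expected:
                out.append(payload)                       # good copy found
                break                                     # stop trying replicas
        else:
            raise IOError("all replicas corrupt for a block")
    return b"".join(out)                                  # step 8: merge blocks

good, corrupt = b"hello-", b"XXXXXX"
cks = [hashlib.md5(good).hexdigest(), hashlib.md5(b"world").hexdigest()]
data = read_file([[corrupt, good], [b"world"]], cks)
print(data)  # → b'hello-world'
```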

3. What is MapReduce?
MapReduce is a distributed computing programming framework (it solves the problem of computing over massive data).
The core idea of MapReduce is "divide and conquer".
MapReduce program development is divided into eight steps:
Map stage
① Read the file and parse it into key/value pairs (k1, v1).
② A custom Map class receives the (k1, v1) pairs read in and, through our custom logic, converts them into new key/value pairs (k2, v2) for output.
Shuffle stage (this stage is completed entirely by the framework; it takes the Map output and delivers it as input to Reduce)
③ Partition (pairs with the same k2 are sent to the same reducer, where their v2 values form a collection).
④ Sort (sorted lexicographically).
⑤ Combine (a reduce operation within the map phase: whatever each map task can pre-aggregate is counted before being sent to reduce for the unified calculation, which lightens the reduce workload).
⑥ Group.
Reduce stage
⑦ Custom reduce logic receives our (k2, [v2]) (a collection), applies our own business logic, and converts it into new (k3, v3) pairs ready for output.
⑧ Output the file: the data processed by reduce is written out.
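The eight steps above can be sketched as a single-process word count in plain Python (this mimics the map/shuffle/reduce phases in-memory; it is not the Hadoop API, which would be Java Mapper/Reducer classes):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Steps 1-2: parse input into (k1, v1)=(line) and emit (k2, v2)=(word, 1)."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """Steps 3-6: sort lexicographically by k2 and group equal keys,
    so each k2 gets the collection of its v2 values."""
    pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped):
    """Steps 7-8: turn each (k2, [v2]) into (k3, v3) = (word, total count)."""
    return {k: sum(vs) for k, vs in grouped}

lines = ["hello hadoop", "hello hdfs"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# → {'hadoop': 1, 'hdfs': 1, 'hello': 2}
```

The combine step (⑤) would simply run `reduce_phase` on each map task's local output before the shuffle; with `sum` as the reducer this changes nothing in the result, only the amount of data shuffled.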
A complete MapReduce program has three kinds of instance processes at runtime:
① MRAppMaster: responsible for scheduling the whole program's process and coordinating its state.
② MapTask: responsible for the entire data-processing flow of the map stage.
③ ReduceTask: responsible for the entire data-processing flow of the reduce stage.


4. What is Yarn?
Yarn is a distributed resource-scheduling framework consisting of three main modules:
① ResourceManager: responsible for monitoring, allocating, and managing all resources.
② ApplicationMaster: responsible for the scheduling and coordination of each specific application.
③ NodeManager: responsible for maintaining each node. For all applications, the RM has absolute control over resources and the right to allocate them; the AM negotiates resources with the RM and communicates with the NMs to launch and monitor tasks.
Yarn workflow:
① The client submits an application to the RM, including the information necessary to launch the application's ApplicationMaster, e.g. the ApplicationMaster program itself, the command to start the ApplicationMaster, the user program, and so on.
② The ResourceManager starts a container to run the ApplicationMaster. The ApplicationMaster registers itself with the ResourceManager and, after starting successfully, keeps up a heartbeat with the RM.
③ The ApplicationMaster sends a request to the ResourceManager to apply for a corresponding number of containers.
④ The ResourceManager returns the information of the granted containers to the ApplicationMaster. Successfully granted containers are initialized by the ApplicationMaster. Once a container's initialization information is ready, the AM communicates with the corresponding NM, asking the NM to start the container. The AM keeps up a heartbeat with the NMs and thereby monitors and manages the tasks running on them.
⑤ While a container is running, the AM monitors it; the container reports its own progress and status information to its AM via an RPC protocol.
⑥ While the application is running, the client communicates directly with the AM to obtain the application's status, progress updates, and other information.
⑦ After the application finishes running, the AM deregisters itself from the RM and allows its containers to be reclaimed.

Yarn schedulers:

The first scheduler: FIFO Scheduler (queue scheduler)
After one task ends, the next task begins.
The second scheduler: Capacity Scheduler (the default scheduler of the Apache version)
It divides the cluster's resources into several parts; each group can acquire one part of the resources, and sub-queues can be configured inside a group.
The third scheduler: Fair Scheduler (the default scheduler of the CDH version of Hadoop)
The design goal of the Fair Scheduler is to allocate resources fairly to all applications (the definition of fairness can be set through parameters). Fair scheduling can work across multiple queues. For example, suppose there are two users A and B, each with their own queue. When A starts a job and B has none, A gets all the cluster resources; when B then starts a job, A's job keeps running, and after a while the two tasks each hold half of the cluster resources. If B starts a second job while the other jobs are still running, it shares queue B's resources with B's first job, so B's two jobs each get 1/4 of the cluster resources, while A's job still uses 1/2; the final result is that resources are shared equally between the two users.
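The A/B example above reduces to simple fractions: each user with running jobs gets an equal share of the cluster, and jobs within a user's queue split that share equally. A small sketch (the function is invented for illustration; the real Fair Scheduler also supports weights, minimum shares, and preemption):

```python
from fractions import Fraction

def job_shares(jobs_per_user: dict) -> dict:
    """Return each user's per-job fraction of cluster resources under
    simple fair scheduling: equal shares per active user, split equally
    among that user's jobs."""
    active = [u for u, n in jobs_per_user.items() if n > 0]
    user_share = Fraction(1, len(active)) if active else Fraction(0)
    return {u: user_share / n for u, n in jobs_per_user.items() if n > 0}

print(job_shares({"A": 1}))          # A alone: its job gets the whole cluster
print(job_shares({"A": 1, "B": 1}))  # each job gets 1/2
print(job_shares({"A": 1, "B": 2}))  # A's job keeps 1/2, each B job gets 1/4
```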

Common Yarn parameters:
First parameter: the minimum memory of a container
yarn.scheduler.minimum-allocation-mb 1024   minimum memory allocated to an application's container
Second parameter: the maximum memory of a container
yarn.scheduler.maximum-allocation-mb 8192   maximum memory allocated to an application's container
Third parameter: the minimum number of virtual cores per container
yarn.scheduler.minimum-allocation-vcores 1   default minimum number of virtual cores allocated to each container
Fourth parameter: the maximum number of virtual cores per container
yarn.scheduler.maximum-allocation-vcores 32   maximum number of virtual cores that can be allocated to each container
Fifth parameter: the memory a NodeManager can allocate
yarn.nodemanager.resource.memory-mb 8192   maximum memory a NodeManager can allocate, default 8192 MB
We can modify the following parameters in yarn-site.xml to change the default values:
Define the amount of memory each machine may use:
yarn.nodemanager.resource.memory-mb 8192
Define the number of virtual cores each machine may use:
yarn.nodemanager.resource.cpu-vcores 8
Define the ratio of virtual memory that may be used (swap space, i.e. hard disk used as memory); here it is 2.1 times the NodeManager's physical memory:
yarn.nodemanager.vmem-pmem-ratio 2.1
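In yarn-site.xml the three NodeManager parameters above look like this (a config fragment using the values quoted in this summary; tune them to each machine's actual hardware):

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
    <!-- memory (MB) this NodeManager can hand out to containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
    <!-- virtual cores this NodeManager can hand out -->
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
    <!-- allowed virtual-memory-to-physical-memory ratio per container -->
  </property>
</configuration>
```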


Origin www.cnblogs.com/hyf199/p/11371805.html