A Personal Understanding of Hadoop Terminology

2012-05-16

 

About the processing workflow

Think of distributed computing as an agricultural production process, and of Hadoop as one solution for running that production.

     Map : The sowing and growing stage, which eventually yields the raw grain. Think of it as the farmland where the cultivation takes place.

     Combiner : As the name suggests, a combine harvester. It gathers and packages the raw product on the farm before it is sent off for processing. It can make the processing stage more efficient, but using it is optional. Each farm uses its own harvester; I have never heard of a single harvester that serves every farm. In other words, a combiner works locally rather than across the whole cluster.

     Shuffle : Literally, shuffling. In this picture it is the transport step that delivers the harvest from the farmland to different processing plants, sorted by category (key).

     Reduce : The processing stage, where the raw agricultural products are turned into finished food, the final result of the solution. Think of it as the processing plant.
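
To make the four stages concrete, here is a minimal single-process sketch of a word count that walks through them in order. It is only an illustration of the idea, not Hadoop's API: the function names, the toy partitioning rule, and the sample input are all made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map (sowing): turn one input line into raw (key, value) pairs."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner (harvester): pre-aggregate the output of ONE map task locally."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle(map_outputs, num_reducers):
    """Shuffle (transport): route every pair to a reducer chosen by its key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            # Toy partitioner (an assumption of this sketch): the sum of the
            # key's code points decides which reducer receives the key.
            index = sum(ord(ch) for ch in key) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reduce (processing plant): fold all values of a key into one result."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(splits, num_reducers=2):
    """Run the four stages over a list of input splits (one split per farm)."""
    map_outputs = []
    for split in splits:
        raw = [pair for line in split for pair in map_phase(line)]
        map_outputs.append(combine(raw))  # each farm harvests its own field
    result = {}
    for partition in shuffle(map_outputs, num_reducers):
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    print(word_count([["rice wheat rice"], ["wheat corn"]]))
```

Dropping the call to `combine` gives the same final counts; the only difference is that more pairs travel through the shuffle, which is why the combiner is optional.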

 


About the physical structure

     HDFS involves three concepts: the NameNode, the DataNode, and the Secondary NameNode. Think of HDFS as a dock crew working under a foreman-and-laborers model, where the entire file system is the port and the files are the cargo.

      The NameNode is the foreman, responsible for roll call and for assigning work to the laborers (it maintains the file system tree and the index of the files in it).

      DataNodes are the laborers who do the actual work. Their job is to move the cargo (store files and serve access requests).

      The laborers report their working status and the cargo they are in charge of to the foreman at regular intervals. This behavior is called a heartbeat. Once the foreman finds that a laborer has no heartbeat, he will either have that laborer recover or arrange a new copy of its data.

      The Secondary NameNode sounds like foreman No. 2, but it has no foreman duties. It regularly backs up the foreman's data (a mirror of the namespace). If the foreman goes down, we can manually promote it to foreman so that the system keeps working. Because these backups lag behind, some loss is inevitable.

      The disadvantage of this model is that once the foreman goes down, the entire system fails. Without a backup there is no chance of restoring the system. There are two ways to prevent irreparable loss: have the NameNode also persist its data to another location, or use a Secondary NameNode.
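
The heartbeat check described above can be sketched in a few lines. This is a toy model under my own assumptions, not HDFS code: the class name, the timeout value, and the node and block ids are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Foreman-side bookkeeping: remember when each laborer last reported in."""

    def __init__(self, timeout_seconds):
        # How long a laborer may stay silent before being considered gone
        # (the value is chosen by the caller; it is not an HDFS default).
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node_id, now):
        """Store the time of the latest heartbeat from one DataNode."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


def blocks_needing_new_copies(block_locations, dead):
    """Return the blocks that had a copy on at least one dead node.

    block_locations maps a block id to the set of nodes holding a copy.
    """
    dead_set = set(dead)
    return sorted(
        block_id
        for block_id, nodes in block_locations.items()
        if nodes & dead_set
    )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=30)
    monitor.record_heartbeat("dn1", now=0)
    monitor.record_heartbeat("dn2", now=0)
    monitor.record_heartbeat("dn1", now=40)  # dn2 has gone silent
    dead = monitor.dead_nodes(now=50)
    print(dead, blocks_needing_new_copies({"blk_a": {"dn1", "dn2"}, "blk_b": {"dn1"}}, dead))
```

Note that this only covers the laborers. Nothing in it watches the foreman itself, which is exactly the weakness discussed above.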

 


About the task structure

     JobTracker : A Java application that coordinates the running of jobs and is responsible for creating, splitting, and assigning tasks. Tasks are divided into Map tasks and Reduce tasks. To improve efficiency, when assigning Map tasks it prefers a node that already holds the input data (data-local), then a node in the same rack as the data (rack-local), and only then a node that is neither. Reduce tasks are assigned without this consideration.

     TaskTracker : A Java application that executes tasks and accepts assignments from the JobTracker.

     BlackList : If task failures on a given TaskTracker node add up to N (configurable), counted across tasks rather than within a single one, the JobTracker adds that node to the blacklist. While blacklisted, the node receives no further assignments from the JobTracker, unless it is restarted manually or automatically pardoned by the JobTracker after 24 hours. [Personal experience] Blacklisting is often caused by the node's hard disk being full.
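
The two JobTracker behaviors described above, locality-aware assignment of Map tasks and blacklisting, can be sketched roughly as follows. This is my own simplified model rather than the JobTracker's actual code: `pick_map_task`, the `Blacklist` class, the task dictionaries, and the node and rack names are all hypothetical, and only the 24-hour pardon comes from the description above.

```python
from collections import defaultdict

# Preference order for Map tasks: best locality first.
LOCALITY_ORDER = ("data-local", "rack-local", "off-rack")


def locality(task, node, node_rack):
    """Classify how close a pending Map task's input data is to a node."""
    if node in task["hosts"]:
        return "data-local"
    if node_rack in task["racks"]:
        return "rack-local"
    return "off-rack"


def pick_map_task(pending, node, node_rack):
    """Pick the pending Map task with the best locality for this node."""
    if not pending:
        return None
    return min(
        pending,
        key=lambda task: LOCALITY_ORDER.index(locality(task, node, node_rack)),
    )


class Blacklist:
    """Count task failures per TaskTracker and blacklist repeat offenders."""

    def __init__(self, max_failures, pardon_seconds=24 * 3600):
        self.max_failures = max_failures      # the configurable N
        self.pardon_seconds = pardon_seconds  # automatic pardon after 24 hours
        self.failures = defaultdict(int)
        self.listed_at = {}

    def record_failure(self, tracker, now):
        """Register one failed task; blacklist the tracker at N failures."""
        self.failures[tracker] += 1
        if self.failures[tracker] >= self.max_failures and tracker not in self.listed_at:
            self.listed_at[tracker] = now

    def restart(self, tracker):
        """A restart of the node clears its record."""
        self.failures.pop(tracker, None)
        self.listed_at.pop(tracker, None)

    def is_blacklisted(self, tracker, now):
        """Check the list, pardoning the tracker once the period has passed."""
        listed = self.listed_at.get(tracker)
        if listed is None:
            return False
        if now - listed >= self.pardon_seconds:
            self.restart(tracker)
            return False
        return True
```

A scheduler built on these two pieces would first skip any TaskTracker for which `is_blacklisted` returns true, and only then call `pick_map_task` for the remaining nodes.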
