Good Programmer Hadoop Big Data Course: Practical Notes Shared

  These are practical notes shared from the Good Programmer Hadoop big data course. Apache Hadoop is open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows large data sets (massive amounts of data) to be processed in a distributed manner across clusters of machines using a simple programming model.
The framework includes the following modules:

  • Hadoop Common: the common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: a framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

Each of these modules has its own independent function, and the modules are also related to one another.

Broadly speaking, Hadoop usually refers to a wider concept: the Hadoop ecosystem.
Its core is open-source software for reliable, scalable, distributed computing, composed of HDFS, MapReduce, and YARN.

HDFS
The Hadoop Distributed File System. A cluster generally consists of one or two NameNode processes and several DataNode processes; when the HDFS HA mechanism is enabled, there are also ZKFC processes (usually running on the same machines as the NameNode processes) and several JournalNode (JN) processes.
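
To make the client's view of HDFS concrete, here is a minimal sketch that writes and reads a file through the Java FileSystem API. It is only an illustration, not code from the original course; the NameNode address hdfs://namenode:8020 and the path /tmp/hello.txt are placeholders.

```java
// Minimal sketch: write and read a file through the HDFS Java API.
// The NameNode URI below is a placeholder for illustration only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's value.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");

            // Write a small file; the blocks land on DataNodes,
            // the NameNode only records the metadata.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read it back.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```

The client only contacts the NameNode for metadata; the file bytes themselves flow to and from the DataNodes.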

Node
A machine running a NameNode or DataNode process is called a node: a machine running the NameNode process is a NameNode node, and a machine running a DataNode process is a DataNode node. The machine may be a physical machine or a virtual machine.

MapReduce
A distributed, parallel, offline computation framework: a programming framework for distributed computing programs, and the core framework with which users develop "Hadoop-based data analysis applications". Its core function is to integrate the user's business-logic code with MapReduce's own default components into a complete distributed computing program that runs concurrently on a Hadoop cluster. The principle is similar to the way HDFS solves its problem: HDFS splits a large file into a number of blocks and stores them on the hosts of the cluster; in the same way, MapReduce splits a complex computation into sub-computations, hands them to the hosts of the cluster, and the hosts execute them in parallel.
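
As a concrete example of such a program, here is the classic word-count job written against the Hadoop MapReduce API. It is a minimal sketch added for illustration, not code from the original course.

```java
// Minimal WordCount sketch: map tasks split lines into words,
// reduce tasks sum the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is typically submitted with `hadoop jar wordcount.jar WordCount /input /output`; the framework then runs the map and reduce tasks in parallel across the cluster's hosts.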

Glossary

  • Job: each computation requested by a user is called a job.
  • Task: each job has to be split up and handed to several hosts to complete; the units produced by this split are the tasks to be executed. Tasks are divided into the following three types:
    • Map: responsible for the map phase of the overall data-processing flow
    • Reduce: responsible for the reduce phase of the overall data-processing flow
    • MRAppMaster: responsible for scheduling the whole program's flow and coordinating its state
YARN
Yet Another Resource Negotiator: the framework for job scheduling and cluster resource management. It is composed of the ResourceManager and the NodeManagers; the ResourceManager has two main components, the Scheduler and the ApplicationsManager.

Scheduler
The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, and so on. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status, and it offers no guarantee of restarting tasks that fail because of application errors or hardware failures. The Scheduler performs its scheduling function based on the applications' resource requirements; it does so using the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network.
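
To show what such a resource request looks like in code, here is a minimal sketch that builds a container request (memory plus virtual cores) with the YARN client-side API. The values 1024 MB and 2 vcores are arbitrary illustration numbers, and this is only the request object, not a full ApplicationMaster.

```java
// Minimal sketch: describe an application's resource needs as a
// container request, which is what the Scheduler allocates against.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class ContainerRequestExample {
    public static void main(String[] args) {
        // Ask for 1024 MB of memory and 2 virtual cores per container.
        Resource capability = Resource.newInstance(1024, 2);
        Priority priority = Priority.newInstance(1);

        // No node or rack preference (null, null): the Scheduler may
        // place the container anywhere in the cluster.
        AMRMClient.ContainerRequest request =
                new AMRMClient.ContainerRequest(capability, null, null, priority);

        System.out.println("container request: " + request);
    }
}
```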

ApplicationsManager
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service of restarting the ApplicationMaster container on failure. Each application's ApplicationMaster is in turn responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring progress.
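
One way to observe this tracking from the outside is to ask the ResourceManager for the applications it knows about. The sketch below does that with the YarnClient API; it assumes the cluster configuration (yarn-site.xml) is on the classpath and is only an illustration.

```java
// Minimal sketch: query the ResourceManager for the applications it
// is currently tracking and print their state and progress.
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();
        try {
            // Each report reflects the status the ApplicationMaster has
            // reported to the ResourceManager for that application.
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.printf("%s  %s  %s  %.0f%%%n",
                        app.getApplicationId(),
                        app.getName(),
                        app.getYarnApplicationState(),
                        app.getProgress() * 100);
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```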

Zookeeper
A distributed coordination service for distributed applications. It consists of a number of QuorumPeerMain processes whose functions are essentially the same; while ZooKeeper is running, however, one of the processes acts in the leader role and the remaining processes act as followers.

znode
A node in the tree-shaped data structure that ZooKeeper maintains internally in memory. A znode carries meta-information such as permissions, type, and version, plus information about its child nodes, its parent node, and its own content. ZooKeeper is responsible for monitoring changes in node state, including node creation, deletion, content changes, and changes to the set of child nodes, but it is not responsible for acting on a node after a change; instead, ZooKeeper notifies the node's Watcher, and that watcher is then responsible for handling the change.
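
The watcher behaviour described above can be seen in a few lines of the ZooKeeper Java client. The following minimal sketch is only an illustration; the connect string localhost:2181 and the path /demo are placeholders.

```java
// Minimal sketch: create a znode and register a Watcher so ZooKeeper
// notifies us when the node's data changes.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeWatchExample {
    public static void main(String[] args) throws Exception {
        // Session-level watcher just logs connection-state events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000,
                event -> System.out.println("session event: " + event.getState()));

        String path = "/demo";
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ZooKeeper only reports that the node changed; this watcher is
        // responsible for deciding what to do about it.
        Watcher dataWatcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                    System.out.println("znode changed: " + event.getPath());
                }
            }
        };
        zk.getData(path, dataWatcher, null);   // registers the watch

        zk.setData(path, "v2".getBytes(), -1); // triggers the watcher
        Thread.sleep(1000);                    // give the notification time to arrive
        zk.close();
    }
}
```

Note that a watch set this way fires only once; a long-running application re-registers the watch each time it is notified.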

HA
HA means high availability (7×24-hour uninterrupted service); the SecondaryNameNode only guarantees "reliability", not availability. The key to achieving high availability is eliminating single points of failure. Strictly speaking, Hadoop HA should be divided into the HA mechanisms of the individual components: HDFS HA and YARN HA.

  • HDFS HA in detail: the single point of failure is removed by running two NameNodes, and the two NameNodes must coordinate on two points:
    • metadata management has to change accordingly
    • a state-management module is needed

Origin blog.51cto.com/14479068/2432970