Hadoop Big Data Development Foundations, Part 1: Getting to Know Hadoop

Contents

1. Hadoop Overview

    1.1 About Hadoop

    1.2 History of Hadoop Development

    1.3 Hadoop Features

2. Hadoop Core

    2.1 Distributed File System -- HDFS

    2.2 Distributed Computing Framework -- MapReduce

    2.3 Cluster Resource Manager -- YARN

3. Hadoop Ecosystem

4. Hadoop Application Scenarios

5. Summary


I. Hadoop Introduction

1. Hadoop Overview

Two core components: HDFS and MapReduce

Resource and task scheduling framework: YARN

    1.1 About Hadoop

        Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed applications without having to understand the details of the underlying distributed layer, taking full advantage of a cluster's capacity for high-speed computation and storage. It is designed to scale out from a single server to thousands of machines: the cluster is deployed across many machines, each providing local computation and storage, and data is replicated across multiple nodes. Availability is therefore improved by the cluster itself rather than by hardware upgrades, since when one machine goes down other nodes can still serve the replicated data and continue computing. The core of the framework is HDFS and MapReduce.

    1.2 History of Hadoop Development (adapted from Baidu Encyclopedia)

        Hadoop originated as an implementation of Google's MapReduce programming model. Google's MapReduce framework decomposes an application into many parallel computing tasks and runs them over very large data sets across a large number of computing nodes. A typical use of the framework is running search algorithms over web data. Hadoop was initially associated with web indexing and crawling and rapidly developed into a leading platform for analyzing big data.

        Many companies began offering Hadoop-based commercial software, support, services, and training. Cloudera, a US enterprise software company, began offering Hadoop-based software and services in 2008. GoGrid, a cloud computing infrastructure company, partnered with Cloudera in 2012 to accelerate enterprise adoption of Hadoop-based applications. Dataguise, a data security company, also launched a data protection and risk assessment product for Hadoop in 2012.

    1.3 Hadoop Features

        Hadoop is a framework that allows users to easily build and use a distributed computing platform. Users can easily develop and run applications that process massive amounts of data on Hadoop. Its main advantages are the following:

        (1) High reliability: Hadoop assumes that computing and storage elements can fail, so it maintains multiple copies of the data and automatically redistributes work away from failed nodes.

        (2) High scalability: the cluster can be expanded simply by adding nodes.

        (3) Efficiency: Hadoop moves data dynamically between nodes, processes data in parallel on the nodes where it resides, and keeps the load on the nodes dynamically balanced, so processing is very fast.

        (4) High fault tolerance: when storing a file, HDFS keeps backup copies on multiple machines or nodes to ensure that programs keep running. If a task fails, Hadoop reruns it or starts another task to complete the unfinished part.

        (5) Low cost: Hadoop is open source.

        (6) Runs on inexpensive commodity machines; the basic framework is written in Java.

2. Hadoop Core

    2.1 Distributed File System -- HDFS (Hadoop Distributed File System)

        2.1.1 HDFS Architecture and Introduction

            HDFS is a distributed file system used for storage; it is responsible for storing and retrieving the cluster's data. HDFS has a master/slave architecture. It supports a traditional hierarchical file organization: users or applications can create directories and store files in them, and files can be created, read, updated, or deleted through their file paths. However, because storage is distributed, HDFS also differs significantly from a traditional file system.

            [Figure: HDFS basic architecture diagram]

            An HDFS file system consists of one NameNode, one Secondary NameNode, and multiple DataNodes.

            (1) Metadata

            Metadata is not the actual content of files; it consists of three important parts: ① the attributes of files and directories, such as file name, directory name, parent directory, file size, creation time, and modification time; ② information about where file content is stored, such as the blocks that make up a file, the number of replicas, and the DataNode on which each replica resides; ③ information about all DataNodes in HDFS, used to manage the DataNodes.

            (2) NameNode

            The NameNode stores the metadata and handles requests sent by clients. The file metadata is persisted in the fsimage file on the NameNode. While the system is running, all operations on the metadata are performed in memory and are persisted to another file, the edits log. When the NameNode starts, it loads fsimage into memory and replays the operations recorded in edits against the in-memory data, so that the metadata held in memory stays up to date.

            (3) Secondary NameNode

            The Secondary NameNode backs up the NameNode's data: it periodically merges the edits file into the local fsimage file, sends the new fsimage back to the NameNode to replace the old one, and deletes the old edits file. A new edits file is then created to continue recording changes to the metadata.

            (4) DataNode

            DataNodes are where the data is actually stored. On a DataNode, files are stored in the form of data blocks. When a file is uploaded to HDFS it is cut into 128 MB blocks, and each block, together with its backup replicas (three by default), is stored on the same or different DataNodes. The NameNode records the block information of the file, so that all blocks can be located and reassembled when the file is read.

            (5) Data block (block)

            Files uploaded to HDFS are divided into data blocks according to the file system's default block size. In Hadoop 2.x the default block size is 128 MB, so a 129 MB file, for example, is split into two blocks and stored separately. The blocks are distributed across the nodes, and each block also has backup replicas.
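            As an assumed illustration (not code from the original article), the following minimal Java sketch prints the file system's configured block size and the block locations of one file; the path /data/example.txt is hypothetical.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.BlockLocation;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class BlockInfo {
                public static void main(String[] args) throws Exception {
                    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
                    FileSystem fs = FileSystem.get(conf);

                    // Default block size of the file system (128 MB in Hadoop 2.x unless overridden).
                    System.out.println("Default block size: " + fs.getDefaultBlockSize(new Path("/")) + " bytes");

                    // List the blocks of one file and the DataNodes holding each replica.
                    Path file = new Path("/data/example.txt");  // hypothetical path
                    FileStatus status = fs.getFileStatus(file);
                    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                    for (BlockLocation b : blocks) {
                        System.out.println("offset=" + b.getOffset()
                                + " length=" + b.getLength()
                                + " hosts=" + String.join(",", b.getHosts()));
                    }
                    fs.close();
                }
            }

            Run against a 129 MB file with the default 128 MB block size, this would list two blocks, one of 128 MB and one of 1 MB, each with its own set of replica hosts.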

        2.1.2 The Distributed Principle of HDFS

            What is a distributed system? A distributed system is split into multiple modules or subsystems, each running on a different machine; the subsystems or modules communicate over a network and collaborate to deliver the overall functionality. In other words, a system in which multiple nodes work together to complete one or more specific functions is a distributed system.

            A distributed file system is a subset of distributed systems; the problem it solves is data storage. In other words, it is a storage system that spans multiple computers. Data stored in a distributed file system is automatically distributed across different nodes.

            As a distributed file system, HDFS is distributed mainly in the following three respects:

            (1) HDFS is not a single-machine file system; it is a file system distributed over the multiple nodes of a cluster. The nodes collaborate through network communication to present the files on all nodes as one file system, so every user sees the same file system, and multiple users on multiple machines can share files and storage space.

            (2) Files are stored across multiple nodes, which is where the concept of a data block comes in: a file is not stored as a single unit but is split into one or more data blocks. The blocks are not stored on a single node; they are distributed across the nodes, and replicas of each block are stored on other nodes.

            (3) Data is read from multiple nodes. When a file is read, its data blocks are located on the various nodes and all of the distributed blocks are read, until the last block has been read.
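            To make this concrete, here is a minimal assumed sketch (not from the original article) of a client reading a file through the HDFS Java API: the client opens one logical path and reads a stream, while HDFS locates and fetches the underlying blocks from whichever DataNodes hold them. The path is hypothetical.

            import java.io.BufferedReader;
            import java.io.InputStreamReader;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class ReadFromHdfs {
                public static void main(String[] args) throws Exception {
                    Configuration conf = new Configuration();
                    FileSystem fs = FileSystem.get(conf);

                    // The client sees one logical file; the blocks behind it may live on many nodes.
                    try (FSDataInputStream in = fs.open(new Path("/data/example.txt"));   // hypothetical path
                         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            System.out.println(line);
                        }
                    }
                    fs.close();
                }
            }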

        2.1.3 How HDFS Handles Node Failures

            When data is stored in a file system, a node going down can easily lead to data loss. HDFS provides the following protections against this problem:

            (1) Redundant storage

            Every data block is stored redundantly; the number of replicas can be configured (a small configuration sketch is given at the end of this subsection).

            (2) Replica placement

            Placement strategy: with dfs.replication set to 3, for example, one replica is kept on each of two nodes in the same rack, and another replica is placed on a node in a different rack. The former guards against a single node going down; the latter guards against losing data when an entire rack goes down.

            (3) Handling node failures

            ① Each DataNode periodically sends a heartbeat to the NameNode (every 3 s by default). If the NameNode receives no heartbeat within the configured time (10 min by default), it considers the DataNode faulty and removes it from the cluster. HDFS then detects which data blocks have fewer replicas than required and creates additional replicas for them until the configured replication factor is met again. A DataNode may leave the cluster because of hard-disk failure, motherboard failure, power outages, network aging, or other problems.

            ② When HDFS reads a data block and the node holding it is down, the client reads the block from another node that stores a replica of it; HDFS also detects that the block has too few replicas and creates new ones to bring the count back up to the requirement.

            ③ When HDFS is storing data and the node it intends to write to goes down, HDFS assigns the data block to another node and later restores the replicas that were lost with the failed node.
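            As a simple assumed sketch of configuring redundancy (not from the original article; the file path is hypothetical), the replication factor can be set in the client configuration or changed for an existing file through the Java API:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class ReplicationExample {
                public static void main(String[] args) throws Exception {
                    Configuration conf = new Configuration();
                    // Default replication factor for files written by this client
                    // (cluster-wide defaults normally live in hdfs-site.xml).
                    conf.set("dfs.replication", "3");

                    FileSystem fs = FileSystem.get(conf);

                    // Change the replication factor of an existing file; the NameNode
                    // will schedule extra replicas or remove surplus ones over time.
                    fs.setReplication(new Path("/data/example.txt"), (short) 2);
                    fs.close();
                }
            }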

        2.1.4 HDFS Characteristics

            (1) Advantages:

            High fault tolerance; well suited to processing large data sets; streaming data access (write once, read many: once a file is written it cannot be modified, only appended to, which guarantees data consistency).

            (2) Disadvantages:

            Not suitable for low-latency data access; cannot store large numbers of small files efficiently; does not support multiple writers or arbitrary modification of a file (writes can only happen at the end of a file, i.e., only append operations are allowed; a small append sketch follows).
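            As an assumed illustration of the append-only write model (not code from the original article; it presumes a file that already exists on HDFS and a cluster with append enabled, and the path is hypothetical):

            import java.nio.charset.StandardCharsets;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class AppendExample {
                public static void main(String[] args) throws Exception {
                    FileSystem fs = FileSystem.get(new Configuration());
                    Path logFile = new Path("/logs/app.log");   // hypothetical existing file

                    // HDFS does not allow in-place edits; new data can only be
                    // appended to the end of an existing file.
                    try (FSDataOutputStream out = fs.append(logFile)) {
                        out.write("one more line\n".getBytes(StandardCharsets.UTF_8));
                    }
                    fs.close();
                }
            }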

    2.2 Distributed Computing Framework -- MapReduce

        2.2.1 MapReduce Introduction

            MapReduce is Hadoop's core computation framework, a programming model for parallel processing of large data sets (larger than 1 TB). It consists of two parts, Map and Reduce. When a MapReduce job starts, the Map side reads data from HDFS, maps it into key-value pairs of the required type, and passes them to the Reduce side. The Reduce side receives the key-value pairs sent from the Map side, groups them by key, processes each group of records with the same key to produce new key-value pairs, and writes the output to HDFS. This is the core idea of MapReduce.

        2.2.2 How MapReduce Works

            (1) The MapReduce execution flow: input, splitting, the Map data processing phase, the Reduce data processing phase, and data output.

            The Reduce phase deserves emphasis: there can be multiple Reduce tasks, and their number is determined by the partitioning of the data configured in the Map phase (data is partitioned by key); each partition is handled by one Reduce task. A Reduce task receives data from several different Map tasks, and the data arriving from each Map task is sorted. Each Reduce invocation processes all the records that share the same key, reduces them, and writes a new key-value pair to HDFS.

            (2) The essence of MapReduce:

            (3) A small example to help understand the Map and Reduce processes (a word-count sketch follows):
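            A minimal word-count sketch in Java is given below (an assumed illustration, not code from the original article): the Mapper emits a (word, 1) pair for every word it reads, and the Reducer sums the counts for each word.

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.IntWritable;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapreduce.Job;
            import org.apache.hadoop.mapreduce.Mapper;
            import org.apache.hadoop.mapreduce.Reducer;
            import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
            import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

            public class WordCount {

                // Map: read one line of text and emit (word, 1) for every word in it.
                public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
                    private static final IntWritable ONE = new IntWritable(1);
                    private final Text word = new Text();

                    @Override
                    protected void map(LongWritable key, Text value, Context context)
                            throws IOException, InterruptedException {
                        for (String token : value.toString().split("\\s+")) {
                            if (!token.isEmpty()) {
                                word.set(token);
                                context.write(word, ONE);
                            }
                        }
                    }
                }

                // Reduce: all counts for the same word arrive together; sum them up.
                public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                    @Override
                    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                            throws IOException, InterruptedException {
                        int sum = 0;
                        for (IntWritable v : values) {
                            sum += v.get();
                        }
                        context.write(key, new IntWritable(sum));
                    }
                }

                public static void main(String[] args) throws Exception {
                    Job job = Job.getInstance(new Configuration(), "word count");
                    job.setJarByClass(WordCount.class);
                    job.setMapperClass(TokenizerMapper.class);
                    job.setCombinerClass(SumReducer.class);    // optional local aggregation on the Map side
                    job.setReducerClass(SumReducer.class);
                    job.setOutputKeyClass(Text.class);
                    job.setOutputValueClass(IntWritable.class);
                    FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
                    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not exist yet)
                    System.exit(job.waitForCompletion(true) ? 0 : 1);
                }
            }

            For an input line "hello world hello", the Map phase emits (hello, 1), (world, 1), (hello, 1); after shuffling and grouping by key, the Reduce phase receives (hello, [1, 1]) and (world, [1]) and outputs (hello, 2) and (world, 1).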


    2.3 Cluster Resource Manager -- YARN

        2.3.1 YARN Introduction

            YARN provides a more general resource management and distributed application framework; its purpose is to make Hadoop's data processing capability available beyond MapReduce. On top of this framework, users can implement data processing applications customized to their needs, and MapReduce itself is just one application running on YARN. Another goal of YARN is to expand Hadoop so that it not only supports MapReduce computation but also makes applications such as Hive, HBase, Pig, and Spark easy to manage. With YARN, different applications can run in the same Hadoop cluster without interfering with one another, sharing the resources of the whole cluster.

        2.3.2 YARN Basic Architecture and Workflow

            (1) Basic architecture of YARN

            Overall, YARN still follows a master/slave architecture: the ResourceManager is the master and the NodeManagers are the slaves, and the ResourceManager is responsible for the unified management and scheduling of the resources on all NodeManagers. When a user submits an application, an ApplicationMaster must be provided to track and manage it; the ApplicationMaster requests resources from the ResourceManager and asks NodeManagers to start tasks that occupy some of those resources. Because different ApplicationMasters are distributed on different nodes, they do not affect one another (that is, multiple applications can execute concurrently).

            ① RM (ResourceManager): consists of two components, the Scheduler and the Applications Manager (ASM).

            The ResourceManager allocates individual resources (compute, memory, bandwidth, and so on) to the underlying NodeManagers (YARN's per-node agents). The ResourceManager also works with the ApplicationMasters to allocate resources, and with the NodeManagers to start and monitor their underlying applications. In this architecture, the ApplicationMaster takes over part of the role of the old TaskTracker, and the ResourceManager takes over the role of the JobTracker.

            The Scheduler allocates resources to running applications; it does not perform any work specific to a particular application.

            The ASM handles job submissions from clients, negotiates a Container (an object that packages resources) in which the ApplicationMaster runs, and restarts the ApplicationMaster when it fails.

            ② NM (NodeManager):

            The resource and task manager on each node. On the one hand, it periodically reports the node's resource usage and the running state of each Container to the RM; on the other hand, it receives and handles requests from ApplicationMasters to start or stop Containers, among other things.

            A Container is YARN's abstraction of a resource: it encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. When an ApplicationMaster requests resources from the RM, the resources the RM returns are packaged as Containers. YARN assigns a Container to each task, and a task can use only the resources described by its Container.

            ③ AM (ApplicationMaster): in effect, a dedicated manager assigned to each application.

            An AM is generated for each application a user submits and is included with the submission. Its main functions are: negotiating with the RM's scheduler to acquire resources (expressed as Containers); dividing the acquired resources further among the tasks inside the application; communicating with NMs to start or stop tasks; and monitoring the running state of all tasks, re-requesting resources for a failed task and restarting it when necessary.

            ④ CA (Client Application): the client application.

            The client submits the application to the RM: it first creates an application context object and fills in the resource request information the AM needs, and then submits this to the RM.

            (2) YARN workflow

                The complete YARN workflow, from job submission to task completion, is as follows:

                ① A user submits an application to YARN through the Client; the submission includes the AM, the command to start the AM, the user program, and so on.

                ② The RM allocates the first Container for the application and communicates with the NodeManager at that Container's location, asking it to start the application's AM inside this Container. This Container is used to launch the AM and the AM's subsequent commands.

                ③ After starting, the AM first registers with the RM, so that the user can check the application's running state directly through the RM. The AM then requests resources for the application's tasks as needed and monitors their running state until the run ends (i.e., steps ④-⑦ are performed repeatedly). A small sketch of querying application state through the YARN client API is given after this list.

                ④ The AM requests and receives resources from the RM by polling it over an RPC protocol, so when multiple applications are submitted, the first one submitted is not necessarily the first to execute.

                ⑤ Once the AM has obtained resources, it communicates with the corresponding NM and asks it to start the task within the allocated resources.

                ⑥ The NM sets up the running environment for the task, writes the task start command into a script, and launches the task by running that script.

                ⑦ The started tasks report their state and progress to the AM over an RPC protocol, so the AM always knows how each task is running and can restart a task when it fails. While the application is running, the user can query its current state from the AM over RPC at any time.

                ⑧ After the application finishes running, the ApplicationMaster deregisters itself from the RM.

                ⑨ The Client and the AM are shut down.
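                As a small assumed sketch (not from the original article) of how a client can ask the RM about application state, as mentioned in step ③, using the YARN Java client API:

                import org.apache.hadoop.conf.Configuration;
                import org.apache.hadoop.yarn.api.records.ApplicationReport;
                import org.apache.hadoop.yarn.client.api.YarnClient;
                import org.apache.hadoop.yarn.conf.YarnConfiguration;

                public class ListYarnApplications {
                    public static void main(String[] args) throws Exception {
                        Configuration conf = new YarnConfiguration();   // reads yarn-site.xml (ResourceManager address, etc.)
                        YarnClient yarnClient = YarnClient.createYarnClient();
                        yarnClient.init(conf);
                        yarnClient.start();

                        // Ask the ResourceManager for a report on every application it knows about.
                        for (ApplicationReport report : yarnClient.getApplications()) {
                            System.out.println(report.getApplicationId()
                                    + "  name=" + report.getName()
                                    + "  state=" + report.getYarnApplicationState()
                                    + "  progress=" + report.getProgress());
                        }
                        yarnClient.stop();
                    }
                }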

3. Hadoop Ecosystem

    After Hadoop became available, many components and related projects were developed that provide services alongside Hadoop, gradually forming a family of components known as the Hadoop ecosystem.

    (1) HBase

        HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system; using HBase, large-scale structured storage clusters can be built on inexpensive commodity PC servers.

    (2) Hive

        Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work and offers mechanisms for storing, querying, and analyzing large-scale data stored in Hadoop.

    (3) Pig

        Pig is a large-scale, Hadoop-based data analysis platform. It provides a SQL-like language called Pig Latin, whose compiler translates SQL-like data analysis requests into a series of optimized MapReduce operations.

    (4) Sqoop

        Sqoop is an open-source tool used mainly to transfer data between traditional databases (MySQL, PostgreSQL, ...) and Hadoop (Hive): data in a relational database can be imported into HDFS, and data in HDFS can be exported into a relational database.


    (5) Flume

        Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume supports customizing the various data senders in a logging system to collect data, and it also provides the ability to do simple processing of the data and write it to various (customizable) data receivers.

    (6) Oozie

        Oozie is a Hadoop-based scheduler: scheduling workflows are written in XML, and it can schedule MapReduce, Pig, Hive, shell, and jar tasks, among others.

        Its main features are:

        * Workflow: executes nodes in order, with support for fork (splitting into multiple branch nodes) and join (merging multiple nodes into one).

        * Coordinator: triggers workflows on a schedule.

        * Bundle Job: binds multiple Coordinators together.

    (7) ZooKeeper

        ZooKeeper is a distributed, open-source coordination service for distributed applications; it is an open-source implementation of Google's Chubby and a key component of Hadoop and HBase. It provides consistency services for distributed applications, with features including configuration maintenance, naming services, distributed synchronization, and group services.


    (8) Mahout

        Mahout is an open-source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, designed to help developers create intelligent applications more quickly and easily. Mahout includes many implementations, covering clustering, classification, recommendation filtering, and frequent itemset mining. In addition, by using the Apache Hadoop library, Mahout can scale effectively into the cloud.

4. Hadoop Application Scenarios

    Ten typical scenarios: (1) online travel, (2) mobile data, (3) e-commerce, (4) image processing, (5) energy extraction, (6) fraud detection, (7) IT security, (8) healthcare, (9) search engines, (10) social platforms.

5. Summary

    This article introduced the basic concepts and features of Hadoop to convey its core ideas, described the three core frameworks HDFS, MapReduce, and YARN to give a deeper understanding of Hadoop's overall architecture, and briefly covered the Hadoop ecosystem and some of its application scenarios.

 

6. References

    This article draws on the book "Hadoop Big Data Development Foundations" by Yu Minghui and Zhang Liangjun.
