Hadoop study notes (1)

1. The concept of Hadoop

2. The development history of Hadoop

3. The ecosystem of Hadoop 1.x

HBase : Real-time distributed database

  Plays a role similar to a relational database, but it is not one: HBase is a distributed, column-oriented (NoSQL) database. Its data is stored in files, and those files are stored in HDFS, so HBase is a database built on top of HDFS. Real-time: read and write latency is very low.

  For example, querying 10,000 records from a table of nearly 1.8 billion records takes only about 1.58 s, which is very hard to match with ordinary database clusters (Oracle cluster, MySQL cluster).

HDFS : Distributed File System

MapReduce : Distributed Computing Framework

ZooKeeper : Distributed Coordination Service

  Works together with HBase to store, manage, and query data. ZooKeeper is a solid framework for distributed coordination.

Hive : Data Warehouse

  Data warehouse :

  For example, suppose you are given a 1,000-square-meter warehouse for storing fruit. Fruit arrives in spring, summer, autumn, and winter, so you first sort it by season. Within each season it is divided by kind (bananas, apples, and so on), and each kind is then divided into good fruit and bad fruit, and so on.

  A data warehouse works the same way. It is one large store organized into categories, each of which is divided into smaller categories, and so on. For a whole system such as a file system, how should the files be managed? Hive exists to solve this problem.

  Hive :

  Categorizes and manages files and data, and provides HiveQL, a query language similar to SQL, so that you can analyze the data through a friendly interface. Under the hood, Hive runs on MapReduce: when a HiveQL statement is executed, Hive's engine converts it into MapReduce jobs and then runs them.

  Hive design purpose : to make it easy for DBAs to move quickly into big data mining and analysis.
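To make the conversion of HiveQL into MapReduce more concrete, here is a minimal sketch in plain Python. It is not Hive's actual engine; it only shows how a query such as `SELECT fruit, COUNT(*) FROM sales GROUP BY fruit` could be expressed as map, shuffle, and reduce phases. The table `sales`, the column `fruit`, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a "sales" table; in Hive these would come from files in HDFS.
ROWS = [
    {"fruit": "apple", "season": "autumn"},
    {"fruit": "banana", "season": "summer"},
    {"fruit": "apple", "season": "winter"},
]

def map_phase(row):
    """Emit (key, 1) for the GROUP BY column, as a mapper would."""
    yield row["fruit"], 1

def shuffle(pairs):
    """Group values by key; the framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate one group; here the aggregate is COUNT(*)."""
    return key, sum(values)

def run_query(rows):
    pairs = [pair for row in rows for pair in map_phase(row)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

result = run_query(ROWS)
print(result)
```

In a real cluster the map and reduce functions run on many machines in parallel, and the shuffle moves data across the network; the sketch runs everything in one process.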

Pig : Data Stream Processing

  Built on MapReduce and oriented toward data-flow processing. Scripts written in its scripting language (Pig Latin) are converted into MapReduce jobs for execution, similar to Hive.

Mahout : data mining library

  A library of data mining and machine learning algorithms.

Sqoop : Database ETL Tool

  ETL : Extract --> Transform --> Load.

  Obtain data from the database, clean and filter it, convert the qualifying data into a defined format, and store the formatted data on the HDFS file system, where the computing framework can analyze and mine it.

  Data formats:

    |- TSV format : each row is one record, with columns separated by a tab character (\t)

    |- CSV format : each row is one record, with columns separated by commas
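The extract, transform, and load steps and the two text formats can be sketched in Python as follows. This is only an illustration under stated assumptions, not how Sqoop itself is implemented: an in-memory SQLite database stands in for the source database, an in-memory string stands in for a file on HDFS, and the table `fruit` and the functions `extract`, `transform`, and `load` are hypothetical names.

```python
import csv
import io
import sqlite3

def extract(conn):
    """Extract: read raw rows from the (hypothetical) source table."""
    return conn.execute("SELECT id, name, price FROM fruit").fetchall()

def transform(rows):
    """Transform: drop rows with a missing or negative price, normalize names."""
    return [(i, name.strip().lower(), price)
            for i, name, price in rows
            if price is not None and price >= 0]

def load(rows, delimiter):
    """Load: serialize rows as delimited text (tab for TSV, comma for CSV)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (id INTEGER, name TEXT, price REAL)")
conn.executemany("INSERT INTO fruit VALUES (?, ?, ?)",
                 [(1, " Apple ", 3.5), (2, "banana", None), (3, "Pear", -1.0)])

clean = transform(extract(conn))
tsv_text = load(clean, "\t")  # one record per line, tab-separated columns
csv_text = load(clean, ",")   # one record per line, comma-separated columns
print(tsv_text, csv_text, sep="")
```

Only the first row survives the cleaning step, so each output holds a single line; the two outputs differ only in the column separator.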

  Sqoop : Imports and exports data between relational databases and HDFS (HDFS files, HBase tables, Hive tables).

Flume : log collection tool

  Collects the logs from each machine in a large cluster and automatically writes them to the HDFS path you specify.

Ambari : installation, deployment, configuration and management tool

  Provides a graphical tool to install, deploy, configure, and manage clusters without requiring manual operations on the command line.

4. The ecosystem of Hadoop 2.x

YARN : Cluster Resource Management System

  Manages the resources (CPU, memory, etc.) of every machine in the cluster and allocates them to each service, job, and application.
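As a rough illustration of cluster-wide resource management, the sketch below tracks free memory and CPU cores per node and grants each request to the first node that can hold it. This is a toy first-fit allocator, not YARN's real scheduler, which is considerably more elaborate; the class names `Node` and `ToyResourceManager` and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_mem_mb: int
    free_vcores: int

class ToyResourceManager:
    """Tracks free resources per node and grants containers first-fit."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, mem_mb, vcores):
        """Return the name of the node that received the container, or None."""
        for node in self.nodes:
            if node.free_mem_mb >= mem_mb and node.free_vcores >= vcores:
                node.free_mem_mb -= mem_mb
                node.free_vcores -= vcores
                return node.name
        return None  # no node has enough free capacity

rm = ToyResourceManager([Node("node1", 2048, 2), Node("node2", 4096, 4)])
first = rm.allocate(1536, 1)   # fits on node1
second = rm.allocate(1024, 1)  # node1 has only 512 MB left, so node2 is used
third = rm.allocate(8192, 1)   # larger than any node's free memory
print(first, second, third)
```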

HDFS2 : Distributed File System

  Some features have been enhanced. The most important are solving the NameNode single point of failure (high availability) and allowing the NameNode to scale horizontally (federation).

Tez : DAG computing framework

Storm : Streaming computing framework

5. Components of Hadoop 1.x

For distributed systems and frameworks, the architecture is generally divided into two parts:

  Part 1: the management layer, which manages the application layer.

  Part 2: the application layer, which does the actual work.

HDFS core background daemons:

  NameNode : Metadata server

    NameNode is the master node. It stores file metadata such as the file name, the directory structure, and file attributes (creation time, number of replicas, permissions), as well as the block list of each file and the DataNodes where each block is located.

    It belongs to the management layer and is used to manage the storage of data.
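The metadata described above can be pictured as two lookup tables: one from file path to attributes and block list, and one from block to the DataNodes that hold it. The following is a minimal sketch of that idea only; `FileMeta`, `ToyNameNode`, and the block and node names are hypothetical and do not mirror the real NameNode classes.

```python
from dataclasses import dataclass, field

@dataclass
class FileMeta:
    replication: int
    permissions: str
    blocks: list = field(default_factory=list)  # ordered block IDs

class ToyNameNode:
    def __init__(self):
        self.files = {}            # file path -> FileMeta
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, path, block_ids, replication=3, permissions="rw-r--r--"):
        self.files[path] = FileMeta(replication, permissions, list(block_ids))

    def report_block(self, datanode, block_id):
        """Record that a DataNode holds a replica of a block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return [(block_id, [DataNodes])] so a client knows where to read."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.files[path].blocks]

nn = ToyNameNode()
nn.add_file("/logs/a.log", ["blk_1", "blk_2"])
nn.report_block("dn1", "blk_1")
nn.report_block("dn2", "blk_1")
nn.report_block("dn2", "blk_2")
print(nn.locate("/logs/a.log"))
```

Note that the NameNode holds only this bookkeeping; the block contents themselves never pass through it.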

  Secondary NameNode : Secondary metadata server

    An auxiliary daemon that monitors HDFS status and takes snapshots of the HDFS metadata at regular intervals.

    It belongs to the management layer and assists the NameNode for management.

  DataNode : block storage

    Stores file block data in the local file system, together with checksums of the block data.

    It belongs to the application layer. It stores the data and is managed by the NameNode: it reports to the NameNode regularly and carries out the tasks the NameNode assigns to it.
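Block storage with checksums can be sketched as follows. This is a simplification, assuming one CRC32 checksum per block and an in-memory dictionary in place of the local file system; `ToyDataNode`, `split_into_blocks`, and the tiny block size are hypothetical and chosen only for illustration.

```python
import zlib

class ToyDataNode:
    def __init__(self):
        self.blocks = {}  # block ID -> (data bytes, CRC32 checksum)

    def write_block(self, block_id, data):
        """Store the block together with its checksum."""
        self.blocks[block_id] = (data, zlib.crc32(data))

    def read_block(self, block_id):
        """Return the block data, verifying the checksum first."""
        data, checksum = self.blocks[block_id]
        if zlib.crc32(data) != checksum:
            raise IOError(f"block {block_id} is corrupt")
        return data

    def block_report(self):
        """The block IDs this node would report to the NameNode."""
        return sorted(self.blocks)

def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

dn = ToyDataNode()
for n, chunk in enumerate(split_into_blocks(b"hello hadoop", 5)):
    dn.write_block(f"blk_{n}", chunk)
print(dn.block_report(), dn.read_block("blk_0"))
```

Storing the checksum next to the data lets a reader detect silent corruption on disk and fall back to another replica.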

MapReduce, a distributed computing framework:

  JobTracker : task scheduler

    Receives the jobs submitted by users, then starts their tasks and tracks task execution.

    It belongs to the management layer: it manages cluster resources, schedules tasks, and monitors their execution.

  TaskTracker : task execution

    Executes the tasks assigned by the JobTracker and manages the execution of each task on its own node.

    It belongs to the application layer: it executes the tasks assigned by the JobTracker and reports their status back to the JobTracker.
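The division of labor between the two daemons can be sketched as a small simulation: the JobTracker hands pending tasks to TaskTrackers that have free slots, and each TaskTracker reports finished tasks back. This is a toy model under simplifying assumptions (every task finishes by the next report, no failures or retries); the class names, slot counts, and task names are hypothetical.

```python
class ToyTaskTracker:
    def __init__(self, name, slots):
        self.name, self.slots, self.running = name, slots, []

    def has_free_slot(self):
        return len(self.running) < self.slots

    def run(self, task):
        self.running.append(task)

    def heartbeat(self):
        """Finish all running tasks and report them, freeing the slots."""
        done, self.running = self.running, []
        return done

class ToyJobTracker:
    def __init__(self, trackers):
        self.trackers = trackers
        self.pending, self.completed = [], []

    def submit(self, tasks):
        self.pending.extend(tasks)

    def schedule(self):
        """Hand pending tasks to trackers that have free slots."""
        for tracker in self.trackers:
            while self.pending and tracker.has_free_slot():
                tracker.run(self.pending.pop(0))

    def collect(self):
        """Gather the status reports from every tracker."""
        for tracker in self.trackers:
            self.completed.extend(tracker.heartbeat())

jt = ToyJobTracker([ToyTaskTracker("tt1", 2), ToyTaskTracker("tt2", 1)])
jt.submit(["map-0", "map-1", "map-2", "map-3"])
while jt.pending or any(t.running for t in jt.trackers):
    jt.schedule()
    jt.collect()
print(jt.completed)
```

With three slots in total, the four tasks need two scheduling rounds: three run in the first round and the last one in the second.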

6. HDFS Architecture Diagram

7. MapReduce Architecture Diagram
