Hadoop series (1) basic concepts

1. Introduction to Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It enables users to develop distributed programs without knowing the underlying details of the distributed layer, and to make full use of the power of clusters for high-speed computation and storage.

From its definition, it can be found that it solves two major problems: big data storage and big data analysis. That is, the two cores of Hadoop: HDFS and MapReduce.

  1. HDFS (Hadoop Distributed File System) is a scalable, fault-tolerant, high-performance distributed file system with asynchronous replication and a write-once, read-many access model; it is mainly responsible for storage.

  2. MapReduce is a distributed computing framework, including map (mapping) and reduce (reduction) processes, responsible for computing on HDFS.

Let's first understand the development history of Hadoop, as shown in Figure 1-1.

Figure 1-1 Hadoop development history

From 2002 to 2004, the first Internet bubble had just burst, and many Internet practitioners lost their jobs. Our "protagonist" Doug Cutting was no exception; he could only write technical articles and live on the royalties. But Doug Cutting was unwilling to stay idle. Driven by his longing for his dreams and the future, he and his good friend Mike Cafarella developed an open source search engine, Nutch, and spent a year getting the system to handle searches over billions of web pages. But the number of web pages at the time was far larger than that, so the two kept improving it, hoping to support an order of magnitude more pages.

In 2003 and 2004, Google published its papers on GFS and MapReduce, respectively. Doug Cutting and Mike Cafarella found that this was not quite what they had imagined; it was more polished, completely free of manual operations and maintenance, and fully automated.

After careful consideration and detailed planning, in 2006 Doug Cutting decided to strike out on his own and, after several twists and turns, joined Yahoo (the relevant part of Nutch went with him). The project was named Hadoop, after his son's toy elephant.

After the project entered Yahoo, it gradually developed and matured. First came cluster scale: from an initial few dozen machines to clusters able to support thousands of nodes, with a great deal of engineering work along the way. Then, beyond search, Yahoo gradually migrated its advertising system and its data-mining work onto Hadoop, further maturing the system.

In 2007, The New York Times used Hadoop on 100 Amazon virtual machine servers to convert 4 TB of image data, further raising Hadoop's profile.

In 2008, a Google engineer found that it was still very difficult to get the Hadoop of that time running on an arbitrary cluster, so he and a few friends founded a company dedicated to commercializing Hadoop: Cloudera. In the same year, the Facebook team found that many of their colleagues could not write Hadoop programs but were familiar with SQL, so they built a tool called Hive on top of Hadoop to convert SQL into Hadoop MapReduce programs.

In 2011, Yahoo spun off its Hadoop team into a subsidiary, Hortonworks, to provide Hadoop-related services.

Having said that, what are the advantages of Hadoop?

Hadoop is a distributed computing platform that users can easily build on and use. Users can easily develop and run applications that process massive amounts of data on Hadoop. Its main advantages are as follows:

(1) High reliability: Hadoop's ability to store and process data bit by bit can be trusted.

(2) High scalability: Hadoop distributes data and computing tasks across the available computer clusters, and these clusters can easily be scaled to thousands of nodes.

(3) Efficiency : Hadoop can dynamically move data between nodes and ensure the dynamic balance of each node, so the processing speed is very fast.

(4) High fault tolerance : Hadoop can automatically save multiple copies of data, and can automatically redistribute failed tasks.

(5) Low cost: Compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.

Hadoop comes with a framework written in Java, so it is ideal to run on Linux production platforms, but applications on Hadoop can also be written in other languages, such as C++.

2. Hadoop Storage - HDFS

Hadoop's storage system is HDFS (Hadoop Distributed File System), a distributed file system. To external clients, HDFS looks like a traditional hierarchical file system in which files and folders can be created, deleted, moved, or renamed, similar to a Linux file system.
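As a rough illustration of these operations, the sketch below uses the HDFS Java FileSystem API; the paths and the fs.defaultFS address are placeholders for this example, not values from the article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasicOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/demo");
        fs.mkdirs(dir);                        // create a folder
        fs.rename(dir, new Path("/data"));     // move/rename it
        fs.delete(new Path("/data"), true);    // delete it recursively

        fs.close();
    }
}
```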

However, the architecture of Hadoop HDFS is built around a specific set of nodes (see Figure 1-2): the name node (NameNode, of which there is only one), which provides the metadata services inside HDFS; the secondary name node (Secondary NameNode), a helper to the NameNode whose main job is to merge metadata operations (note that it is not a backup of the NameNode); and the data nodes (DataNode), which provide storage blocks for HDFS. Since there is only one NameNode, this is a weakness of HDFS (a single point of failure, much improved after Hadoop 2.x).

Figure 1-2 Hadoop HDFS architecture

Files stored in HDFS are divided into blocks, and these blocks are replicated to multiple data nodes (DataNodes), which is very different from traditional RAID architectures. The block size (usually 128 MB) and the number of replicas are determined by the client when the file is created. The NameNode controls all file operations. All communication within HDFS is based on the standard TCP/IP protocol.
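For example, here is a minimal sketch of a client choosing both values at file-creation time through the FileSystem API; the path, replication factor, and buffer size are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client decides replication and block size when the file is created.
        short replication = 3;                  // three copies of every block
        long blockSize = 128L * 1024 * 1024;    // 128 MB blocks
        try (FSDataOutputStream out = fs.create(
                new Path("/demo/sample.txt"), true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```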

The specific description of each component is as follows:

(1) Name Node (NameNode)

It is a component that typically runs on a separate machine in an HDFS deployment and is responsible for managing the file system namespace and controlling access by external clients. The NameNode decides how files are mapped to replicated blocks on the DataNodes. In the most common case of three replicas, the first replica is stored on a different node in the same rack, and the last replica is stored on a node in a different rack.

(2) Data Node (DataNode)

A data node is also a component that usually runs on a separate machine in an HDFS architecture. A Hadoop cluster contains one NameNode and a large number of DataNodes. Data nodes are usually organized in racks that connect all systems through a switch.

DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on periodic heartbeat messages from each DataNode; each message contains a block report, against which the NameNode can verify the block mapping and other file system metadata. If a DataNode fails to send heartbeat messages, the NameNode takes remedial action and re-replicates the blocks that were lost on that node.

(3) Secondary NameNode

The role of the Secondary NameNode is to provide a checkpoint for the NameNode in HDFS; it is only a helper to the NameNode, which is why the community also refers to it as the Checkpoint Node.

As shown in Figure 1-3, edits are merged into the fsimage file only when the NameNode restarts, producing an up-to-date snapshot of the file system. But NameNodes in a production cluster are rarely restarted, which means the edits file can grow very large when a NameNode runs for a long time. Moreover, if the NameNode goes down, many of the changes recorded in edits could be lost. How can this problem be solved?

Figure 1-3 Name node function

fsimage is a snapshot of the entire file system when the NameNode is started; edits is a sequence of changes to the file system after the NameNode is started.

As shown in Figure 1-4, the Secondary NameNode regularly fetches the edits file from the NameNode and promptly merges it into its own fsimage. In this way, if the NameNode goes down, the Secondary NameNode's copy can be used to help restore it. In addition, once the Secondary NameNode's new fsimage reaches a certain threshold, it is copied back to the NameNode, so that the NameNode uses the new fsimage on its next restart, reducing restart time.

Figure 1-4 NameNode Helper Node Secondary NameNode
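How often this merge happens is driven by configuration. A minimal sketch of the two relevant properties follows, set programmatically here only for illustration; the values shown are the common Hadoop 2.x defaults, and in practice they would live in hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Merge edits into fsimage at least once per period (in seconds)...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or as soon as this many un-checkpointed transactions accumulate.
        conf.setLong("dfs.namenode.checkpoint.txns", 1000000);
        System.out.println("checkpoint period = "
                + conf.getLong("dfs.namenode.checkpoint.period", 0) + "s");
    }
}
```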

Take an example of data uploading to understand how HDFS works internally, as shown in Figure 1-5.

Figure 1-5 HDFS file upload

The file is divided into blocks on the client side. Here you can see that the file is divided into 5 blocks, namely A, B, C, D, and E. At the same time, for load balancing, each node holds 3 blocks. Let's look at the specific steps:

  1. The client divides the file to be uploaded into 128 MB blocks.

  2. The client sends a write data request to the namenode.

  3. The NameNode records information about each DataNode and returns a list of available DataNodes.

  4. The client sends the divided file blocks directly to the DataNodes; the data is written in a streaming fashion.

  5. After the write is complete, the DataNode sends a message to the NameNode to update the metadata.

Note here:

  1. Writing a 1 TB file requires 3 TB of storage and 3 TB of network traffic (with three replicas per block).

  2. While reads and writes are being executed, the NameNode and the DataNodes communicate through heartbeats to confirm that the DataNodes are alive. If a DataNode is found to be dead, the data it held is placed on other nodes, and reads are served from those other nodes.

  3. Losing one node does not matter, because other nodes hold backup replicas; even losing an entire rack does not matter, because there are backups on other racks.
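A hedged sketch of the client side of this flow, using copyFromLocalFile (the local and HDFS paths are made up); the block splitting, DataNode selection, and replication described above all happen beneath this single call:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client asks the NameNode for target DataNodes, splits the local
        // file into blocks, and streams each block to the chosen DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/bigfile.dat"),
                             new Path("/data/bigfile.dat"));

        fs.close();
    }
}
```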

3. Hadoop Computing - MapReduce

MapReduce is a software framework proposed by Google for parallel computation over large-scale data sets (larger than 1 TB). The concepts "Map" and "Reduce" and their main ideas are borrowed from functional programming languages, along with features borrowed from vector programming languages.

In the current software implementation, the user specifies a Map function that maps a set of key-value pairs into a new set of intermediate key-value pairs, and a concurrent Reduce function that merges all intermediate values associated with the same key, as shown in Figure 1-6.

Figure 1-6 Simple understanding of Map/Reduce

Next, we will analyze the logic of MapReduce with Hadoop's "Hello World" routine—word count, as shown in Figure 1-7. A general MapReduce program goes through the following processes: Input, Splitting, Map, Shuffle, Reduce, and Final result.

Figure 1-7 Hadoop MapReduce word count logic

1) Input: there is not much to say about the input itself; the data is generally stored on HDFS, and the files are divided into blocks. The relationship between file blocks and input splits is described under input splitting.

2) Input splitting: Before the Map stage, the MapReduce framework computes input splits from the input files. Each input split corresponds to one Map task, and input splits are closely related to HDFS blocks. For example, with an HDFS block size of 128 MB, if we input two files of 27 MB and 129 MB, the 27 MB file becomes one input split (anything smaller than 128 MB is treated as a single split), and the 129 MB file becomes two input splits (129 - 128 = 1 MB, which is less than 128 MB, so the remaining 1 MB is also treated as a split). So, in general, one file block corresponds to one split. As shown in Figure 1-7, Splitting produces the three pieces of data below it, which should be understood as three splits.

3) Map stage: The processing logic of this stage is the Map function written by the programmer. Because each split corresponds to one Map task and to one file block, this is a data-local operation, often described as "moving computation rather than moving data". As shown in Figure 1-7, the operation here is to split each line into words and then map each word to the key-value pair (word, 1).

4) Shuffle stage: This is where the "miracle" happens; the core of MapReduce is actually the Shuffle. So what does Shuffle do? It collects the output of the Map tasks and delivers it to Reduce as input. Put simply, the output of all the Maps is sorted by key, and key-value pairs with the same key are grouped together. As shown in Figure 1-7, Bear, Car, Deer, and River are sorted, and the Bear key has two key-value pairs.

5) Reduce stage: Similar to the Map stage, this is where users write their own code to process the grouped key-value pairs. As shown in Figure 1-7, an addition is performed over all the values of the same key Bear, and we get the pair (Bear, 2).

6) Output: The output of Reduce is directly written to HDFS, and the output file is also divided into blocks.
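Putting the six steps together, here is a minimal sketch of the word-count job using the Hadoop MapReduce Java API; the class names and the /input and /output paths are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: split each line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: sum the 1s for each key, e.g. (Bear, [1, 1]) -> (Bear, 2).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));     // input on HDFS
        FileOutputFormat.setOutputPath(job, new Path("/output"));  // output on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles splitting, shuffling, and sorting; the programmer only supplies the Map and Reduce logic shown above.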

Having said so much, in fact, the essence of MapReduce can be fully expressed in a graph, as shown in Figure 1-8.

Figure 1-8 The essence of MapReduce

The essence of MapReduce is to transform one set of key-value pairs into another set of key-value pairs.

Hadoop MapReduce can be divided into MR v1 and YARN/MR v2 versions according to the resource management framework it uses, as shown in Figure 1-9.

In MR v1, resource management was mainly handled by the JobTracker and the TaskTrackers. The JobTracker was responsible for job control (job decomposition and status monitoring), mainly for MR tasks, as well as for resource management, while each TaskTracker was responsible for scheduling the subtasks of a job and for receiving commands from the JobTracker.

Figure 1-9 MapReduce development history

In the YARN/MR v2 version, YARN divides the job of JobTracker into two parts:

  1. The ResourceManager globally manages the allocation of computing resources for all applications.

  2. The ApplicationMaster is responsible for the scheduling and coordination of its own application.

The NodeManager is the per-machine agent of the framework; it manages the containers that execute applications, monitors their resource usage (CPU, memory, disk, network), and reports to the scheduler.

4. Hadoop Resource Management - YARN

In the previous section, we saw that when MapReduce evolved to 2.x, it no longer used the JobTracker as its resource management framework, but chose YARN instead. Note that when the JobTracker is used as the resource management framework of a Hadoop cluster, nothing but MapReduce tasks can be run, which means that if the cluster's MapReduce workload is not full, cluster resources are wasted. Therefore, another resource management framework, YARN (Yet Another Resource Negotiator), was proposed. Note also that YARN is not a simple upgrade of the JobTracker but a major redesign, and Hadoop 2.x is built around this architecture. The Apache Hadoop 2.x project contains the following modules.

  • Hadoop Common: The base module that provides support for other modules of Hadoop.

  • HDFS: Hadoop Distributed File System.

  • YARN: A framework for task allocation and cluster resource management.

  • MapReduce: Parallel and scalable pattern for processing big data.

As shown in Figure 1-10, the YARN resource management framework includes the ResourceManager (resource manager), the ApplicationMaster, and the NodeManager (node manager). The individual components are described below.

Figure 1-10 YARN architecture

(1)ResourceManager

The ResourceManager is a global resource manager responsible for resource management and allocation across the entire system. It mainly consists of two components: the scheduler (Scheduler) and the applications manager (ApplicationsManager).

The Scheduler is responsible for allocating to each application the minimum amount of resources it needs to run. The Scheduler schedules based only on resource usage; it does not monitor or track application status, and it does not handle failed tasks.

The ApplicationsManager is responsible for handling jobs submitted by clients and for negotiating the first Container in which the ApplicationMaster runs, and it restarts the ApplicationMaster if it fails. (YARN uses the concept of a Resource Container to manage cluster resources; a Container is an abstraction of resources, and each Container includes a certain amount of memory, I/O, network, and other resources.)

(2)ApplicationMaster

The ApplicationMaster is a framework-specific library; each application has its own ApplicationMaster, which manages and monitors that application as it runs on the YARN cluster.

(3)NodeManager

The NodeManager is mainly responsible for starting the Containers that the ResourceManager assigns to the ApplicationMaster and for monitoring those Containers as they run. When starting a Container, the NodeManager sets up the necessary environment variables and related files; once all preparations are done, the Container is started. After startup, the NodeManager periodically monitors the resources the Container occupies; if it exceeds the amount of resources it declared, the process represented by the Container is killed.

As shown in Figure 1-11, two applications are running on the cluster (corresponding to the AMs on Node2 and Node6); the application on Node2 has 4 Containers executing its tasks, while the application on Node6 has 2 Containers executing its tasks.
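As a hedged illustration of this division of labor, the sketch below uses the YarnClient API to ask the ResourceManager for the node reports that the NodeManagers have heartbeated in; the cluster address comes from the local yarn-site.xml, and the printed fields are just examples:

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        // The YarnClient talks to the ResourceManager defined in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each NodeReport reflects what the NodeManager on that machine has
        // reported: total capability, currently used resources, and containers.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()
                    + " used=" + node.getUsed()
                    + " containers=" + node.getNumContainers());
        }

        yarn.stop();
    }
}
```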

5. Hadoop Ecosystem

As shown in Figure 1-12, the Hadoop ecosystem is actually a carnival of animals. Let's take a look at some of the main frameworks.

Figure 1-11 YARN cluster

Figure 1-12 Hadoop ecosystem

(1)HBase

HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. Using HBase, a large-scale structured storage cluster can be built on inexpensive PC servers.

(2)Hive

Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for extract-transform-load (ETL), a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop.

(3)Pig

Pig is a platform for large-scale data analysis on Hadoop; the SQL-like language it provides is called Pig Latin. The language's compiler converts SQL-like data analysis requests into a series of optimized MapReduce operations.

(4)Sqoop

Sqoop is an open source tool mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, etc.). It can import data from a relational database into Hadoop's HDFS, and it can also export HDFS data into a relational database, as shown in Figure 1-13.

(5)Flume

Flume is a highly available, highly reliable, distributed massive log collection, aggregation, and transmission system provided by Cloudera. Flume supports customizing various data senders in the log system for data collection. At the same time, Flume provides the ability to simply process data and write to various data receivers (customizable), as shown in Figure 1-14.

Figure 1-13 Sqoop function

Figure 1-14 Flume data transmission

(6) Oozie

Oozie is a Hadoop-based workflow scheduler; its scheduling workflows are written in XML, and it can schedule MapReduce, Pig, Hive, shell, and jar tasks.

The main functions are as follows.

  1. Workflow: executes process nodes in order, with support for fork (branching into multiple nodes) and join (merging multiple nodes into one).

  2. Coordinator: Trigger Workflow regularly.

  3. Bundle Job: Bind multiple Coordinators.

(7) Chukwa

Chukwa is an open source data collection system for monitoring large distributed systems. It is built on Hadoop's HDFS and MapReduce framework and inherits the scalability and robustness of Hadoop. Chukwa also includes a powerful and flexible toolset for presenting, monitoring and analyzing collected data.

(8)ZooKeeper

ZooKeeper is an open source coordination service for distributed applications, an open source implementation of Google's Chubby, and an important component of Hadoop and HBase, as shown in Figure 1-15. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.

Figure 1-15 ZooKeeper architecture
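As a hedged illustration of the configuration-maintenance use case (the connection string, znode path, and value are invented for the example), a client can publish and read a small piece of shared configuration through the ZooKeeper Java API:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        // Connect to an (assumed) local ZooKeeper ensemble and wait until
        // the session is established before issuing requests.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Configuration maintenance: publish a value under a znode...
        String path = "/app-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=128".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ...and any other client in the cluster can read it back consistently.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```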

(9) Avro

Avro is a data serialization system. It provides rich data structures, a compact and fast binary data format, a container file for storing persistent data, and remote procedure call (RPC).

(10)Mahout

Mahout is an open source project under the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, aiming to help developers create intelligent applications more easily and quickly. Mahout contains many implementations, including clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, Mahout can scale effectively to the cloud by using the Apache Hadoop libraries.


Record a little bit every day. Content may not be important, but habits are!

This article is from "Hadoop and Big Data Mining"
