"Offline and real-time big data development combat" (3) Hadoop principle combat

Preface

Following the first two chapters, which built a knowledge map of big data development and gave an overview of big data platform architecture and technology, I continue to share my study notes on "Offline and Real-Time Big Data Development in Action". This chapter turns to the main battlefield of big data development: offline data development. Offline data technology has been evolving for more than ten years and is now relatively stable, having formed an offline data processing stack with Hadoop, MapReduce, and Hive as de facto standards. The offline data platform is the foundation of the entire data platform and remains its mainstay.

1. Analysis of the advantages and disadvantages of HDFS and MapReduce

1.1 HDFS

The full name of HDFS is Hadoop Distributed File System, and it is a core subproject of Hadoop. Hadoop actually defines a general file system abstraction that exposes a variety of file system interfaces; HDFS is just one implementation of this abstract file system, but among all the implementations it is the most widely used and best known.

HDFS was developed for streaming access to and processing of very large files. Its main features are as follows:

  • Support for very large files: a single file can be hundreds of megabytes, gigabytes, or even terabytes in size.

  • Streaming data access: files are typically written once and read many times, and the design is optimized for overall read throughput rather than for the latency of any single access.

  • Runs on commodity hardware: HDFS is designed for clusters of ordinary, inexpensive machines, so node failure is treated as the norm and data reliability is ensured through block replication.
Of course, the features above make HDFS very well suited to batch processing of large amounts of data, but for certain problems it not only offers no advantage but also has clear limitations, mainly in the following aspects:

  1. Not suitable for low-latency data access

HDFS is optimized for high data throughput, which comes at the cost of access latency. For applications with low-latency requirements, especially those that need millisecond-level response times when accessing massive data sets, HBase is a better choice.
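As a contrast, here is a minimal sketch that uses the standard HBase client API to do a point lookup by row key, the kind of millisecond-level random read that HDFS itself is not designed for. The table name, column family, and row key are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();            // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) { // illustrative table name
            Get get = new Get(Bytes.toBytes("row-0001"));             // point lookup by row key
            Result result = table.get(get);                           // typically returns in milliseconds
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```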

  2. Inability to store large numbers of small files efficiently

The root of the problem is that the NameNode keeps the metadata of every file and block in memory, so a huge number of small files quickly exhausts NameNode memory. There are several ways to make HDFS handle small files better. One is to archive small files, for example with SequenceFile, MapFile, or HAR (Hadoop Archive); the idea is to merge many small files into larger ones for management, and HBase's storage builds on a similar idea. With this approach, retrieving the content of an original small file requires knowing its mapping to the archive file. Another approach is to scale the metadata layer horizontally: if one NameNode is not enough, the single NameNode can be replaced by a cluster of multiple master nodes. Alibaba's DFS uses such a multi-master design, which separates metadata storage from metadata management and consists of multiple metadata storage nodes plus a query master node.
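A minimal sketch of the SequenceFile idea mentioned above, assuming a local directory of small files and an HDFS target path that are purely illustrative: each small file becomes one (file name, file bytes) record in a single large SequenceFile, so the NameNode only has to track the archive instead of every small file.

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path target = new Path("hdfs:///archive/small-files.seq");        // assumed target path
        File localDir = new File("/data/small-files");                    // assumed local directory of small files
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),                 // key: original file name
                SequenceFile.Writer.valueClass(BytesWritable.class))) {   // value: file content
            for (File f : localDir.listFiles()) {
                byte[] content = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        }
        // To read an original small file back, the (file name -> archive) mapping must be known,
        // exactly as the text above points out.
    }
}
```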

  3. Does not support multi-user writing or random file modification

A file in HDFS has only one writer at a time, and writes can only be made at the end of the file; in other words, only append operations are supported.
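A minimal sketch of what this looks like from the client side, using the standard FileSystem API (the path is illustrative, and append support must be enabled on the cluster): data can be appended to the end of an existing file, but there is no call for modifying bytes in the middle of it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendOnly {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/events.log");         // assumed existing HDFS file

        try (FSDataOutputStream out = fs.append(path)) {  // all writes go to the end of the file
            out.writeBytes("new event\n");
        }
        // There is no API for in-place random modification of file content;
        // rewriting part of a file means rewriting the whole file.
    }
}
```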

1.2 MapReduce

MapReduce is the core computing model proposed by Google. It abstracts the complex parallel computations that run on large clusters into just two functions: Map and Reduce. MapReduce in Hadoop is an easy-to-use software framework; applications built on it can run on large clusters composed of thousands of commodity machines and process terabyte-scale data sets in parallel, reliably, and with fault tolerance.

MapReduce is currently very popular, especially among Internet companies, and the reason is that it has the following characteristics:

(Figure: main characteristics of MapReduce)

2. Basic architecture of HDFS and MapReduce

HDFS and MapReduce are the two cores of Hadoop, and their division of labor is also very clear. HDFS is responsible for distributed storage, and MapReduce is responsible for distributed computing.

Let's start with the HDFS architecture. HDFS adopts a master/slave structure: an HDFS cluster consists of one NameNode and several DataNodes. The NameNode is the master server; it manages the file system namespace (that is, which blocks each file consists of and on which nodes those blocks are stored) and handles clients' file access operations, while the DataNodes in the cluster manage the data actually stored on their nodes. HDFS exposes data to users in the form of files.

Internally, a file is split into several data blocks, and these blocks are stored on a set of DataNodes. The NameNode performs the namespace operations of the file system, such as opening, closing, and renaming files or directories, and it also maintains the mapping from data blocks to specific DataNodes. The DataNodes handle read and write requests from file system clients and create, delete, and replicate data blocks under the unified scheduling of the NameNode.
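This block-to-DataNode mapping can be observed from the client side. The sketch below, assuming an existing HDFS file whose path is made up for illustration, uses the FileSystem API to ask which blocks the file consists of and on which hosts each block is stored.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/big-file.dat");                        // assumed existing file
        FileStatus status = fs.getFileStatus(path);
        // Ask the NameNode for the block list and the DataNodes holding each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d: offset=%d length=%d hosts=%s%n",
                    i, blocks[i].getOffset(), blocks[i].getLength(),
                    String.join(",", blocks[i].getHosts()));
        }
    }
}
```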

HDFS architecture
Both the NameNode and the DataNodes are designed to run on ordinary commodity computers, which typically run the Linux operating system. HDFS is developed in Java, so any machine that supports Java can run a NameNode or a DataNode.

A typical deployment is that one machine in the cluster runs a NameNode instance and each of the other machines runs a DataNode instance.

MapReduce also adopts a master/slave architecture; its structure is shown in the figure:

MapReduce architecture

MapReduce consists of four components: the Client, the JobTracker, the TaskTracker, and the Task.

3. MapReduce internal principles in practice

As the MapReduce architecture above shows, MapReduce job execution is mainly driven by the JobTracker and the TaskTrackers.

  • The MapReduce program written by the client, together with its job configuration, constitutes a MapReduce job. The job is submitted to the JobTracker, which assigns it a new job ID and then checks that the output directory specified by the job does not already exist and that the input paths do exist; if either check fails, an error is thrown. (A minimal client-side driver sketch is given after this list.)

  • At the same time, the JobTracker computes the input splits from the input files. Once these checks pass, the JobTracker allocates the resources needed by the job and initializes it, that is, it puts the job into an internal queue so that the configured job scheduler can schedule it. Initialization means creating a running Job object (which encapsulates the tasks and bookkeeping information) so that the JobTracker can track the job's status and progress.

  • When the job scheduler picks up the job, it obtains the input split information and creates one Map task per split. It then assigns Map and Reduce tasks to TaskTrackers according to how busy each TaskTracker is and how many free resources it has. Through the heartbeat mechanism, the JobTracker also monitors the status and progress of each TaskTracker, from which it can compute the status and progress of the whole job.

  • When the JobTracker is notified that the last task of the job has completed successfully on its TaskTracker, it sets the status of the whole job to successful. The next time the client polls the job's running status (note that this query is asynchronous), it finds that the job has completed.

  • If the job fails partway through, MapReduce has corresponding mechanisms to handle it. Generally speaking, unless the failure is caused by a bug in the programmer's own code, MapReduce's error-handling mechanisms can ensure that the submitted job still completes normally.
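For reference, here is a minimal client-side driver sketch of the submission flow described above, assuming the WordCountMapper and WordCountReducer classes sketched later in this section: the client configures the job and submits it to the framework, and waitForCompletion() then polls its status until it finishes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // sketched later in this section
        job.setReducerClass(WordCountReducer.class);    // sketched later in this section
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // must exist, or submission fails
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must NOT exist, or submission fails
        // waitForCompletion() submits the job and then polls its status until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```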

So, how exactly does MapReduce work?

In chronological order, the execution of a MapReduce job goes through the input split, Map, Shuffle, and Reduce stages, and the output of each stage is exactly the input of the next.

MapReduce execution phase and flowchart
The figure above gives an overall view of how MapReduce is divided into stages.

MapReduce execution stages and flowchart, illustrated with the word count example
The specific function of each stage can be illustrated with the classic word count example, a minimal sketch of which follows:
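This is the usual word count sketch (paired with the driver shown earlier): the Map stage emits a (word, 1) pair for every word in each input line, the Shuffle stage groups those pairs by word, and the Reduce stage sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: one input line -> (word, 1) for every word in the line
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce stage: (word, [1, 1, ...]) -> (word, total count); the shuffle has already grouped pairs by word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```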

4. Summary

This chapter introduced Hadoop-related knowledge from the perspective of data processing. Hadoop's HDFS and MapReduce are the underlying technologies of offline data processing. In actual development, however, you rarely write raw MapReduce programs to process big data; instead you mainly use Hive, the higher-level abstraction built on MapReduce, which is more efficient to develop with and easier to use. Hive is also the main offline data processing technology to be highlighted next.


Source: blog.csdn.net/BeiisBei/article/details/108884636