Getting Started with Hadoop Big Data Platform - HDFS and MapReduce

As hardware keeps improving, the amount of data we need to process keeps growing with it. Everyone knows how popular big data is right now; many even call the 21st century the century of big data. Naturally, I want to ride that wave too, so today let's learn about big data storage and processing.

As the data continues to grow, bottlenecks appear in data processing: storage capacity, read and write rates, computing efficiency, and so on.

Google, true to its reputation as a company at the forefront of the field, proposed a set of technologies to handle big data: MapReduce, BigTable, and GFS.

These technologies brought great changes to big data processing:

1. The cost of big data processing drops: clusters of ordinary PCs can store and process big data, with no need for mainframes or high-end storage equipment.

2. Hardware failure is treated as the normal state, and reliability is guaranteed in software through fault tolerance.

3. Parallel distributed computing is simplified: programmers no longer need to manage node synchronization and data exchange themselves, which lowers the barrier to big data processing.

 

Google's technology is excellent, but Google did not open-source it. Fortunately, an open-source implementation modeled on Google's papers soon appeared: Hadoop.

 

What is Hadoop?

Hadoop mainly accomplishes two things, distributed storage and distributed computing.

Hadoop mainly consists of two core parts:

1. HDFS: Distributed file system used to store massive data.

2. MapReduce: A parallel processing framework that implements task decomposition and scheduling.

 

What Hadoop can do

Hadoop can handle the storage, processing, analysis, and statistics of big data, and is widely used in fields such as data mining.

 

Advantages of Hadoop

1. High scalability. Capacity and performance can be increased simply by adding more nodes.

2. Low cost: it runs on clusters of ordinary PCs.

3. A mature ecosystem, including Hive, HBase, ZooKeeper, and more, which makes Hadoop even more convenient to use.

 

Having said all that, we still haven't looked at how Hadoop actually works.

We first need to understand the two core components of Hadoop: HDFS and MapReduce.

 

What is HDFS?

As mentioned earlier, HDFS is a distributed file system used to store and read data.

Every file system has a smallest unit of processing; in HDFS that unit is the block. Files saved in HDFS are split into blocks for storage, and the default block size is 64 MB (128 MB since Hadoop 2.x).
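To make the idea of blocks concrete, here is a minimal sketch using the Hadoop Java FileSystem API that asks the NameNode which blocks a file consists of and where each block's replicas live. The NameNode address hdfs://namenode:9000 and the path /data/sample.log are hypothetical placeholders; adjust them to your own cluster.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address and file path.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/sample.log");

        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode which blocks make up the file and which DataNodes hold them.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```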

There are two types of nodes in HDFS: the NameNode and the DataNode.

NameNode:

The NameNode is the management node. It stores the file system metadata: the mapping from files to data blocks, and the mapping from data blocks to DataNodes.

In other words, through the NameNode we can find where a file's blocks are stored and then go retrieve the data.

DataNode:

The DataNode is a worker node. It stores the data blocks, i.e. it is where file contents actually live.

That's a bit abstract, so let's look at the diagram:

To read a file, the client first asks the NameNode for the metadata, and the NameNode looks up its block map to find the corresponding DataNodes. The client then goes to those DataNodes, fetches the data blocks, and stitches them back together into the file. This is the basic read flow; writes also go through the NameNode in a similar way.
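As a rough sketch of that read flow in code, the snippet below opens a file through the standard FileSystem API: open() obtains the block metadata from the NameNode, and the bytes are then streamed block by block from the DataNodes. The address and path are again hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; the client asks it for block locations.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // open() contacts the NameNode for metadata; the actual bytes are then
        // read from the DataNodes that hold the blocks.
        try (FSDataInputStream in = fs.open(new Path("/data/sample.log"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```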

As a distributed system, HDFS achieves reliability in software: as the figure shows, each data block has three replicas, distributed across two racks.

This way, if one replica of a block is corrupted, the block can still be read from another replica, and if an entire rack fails, the data can still be read from the other rack, giving high reliability.

The figure also shows that, because blocks have multiple replicas, the NameNode needs to know which DataNodes are alive; this is done through heartbeat messages, a technique used by many distributed systems.
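The replication factor can also be inspected and changed per file through the same Java API. A small sketch, reusing the hypothetical address and path from above:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/sample.log");

        // Read the replication factor recorded in the NameNode metadata.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("current replication: " + current);

        // Request a different replication factor for this one file; the NameNode
        // then schedules DataNodes to copy or drop replicas accordingly.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```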

We can also see that there is a Secondary NameNode. It periodically checkpoints the NameNode's metadata, so if the NameNode fails, that checkpoint can be used to recover, improving reliability (note that it is not a hot standby).

 

What are the characteristics of HDFS?

1. Data redundancy: fault tolerance is handled in software and is very high.

2. Streaming data access: HDFS is write-once, read-many. A file cannot be modified in place; it can only be deleted and rewritten (see the sketch after this list).

3. Suitable for storing large files. If there are many small files, each smaller than a single block, every one of them still needs its own block and its own metadata entry on the NameNode, which is very wasteful.
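A minimal write sketch, assuming the same hypothetical cluster address: the file is created and written once; to "change" it later you would delete it and write a new file, since in-place modification is not supported.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnce {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/notes.txt");

        // create() writes a brand-new file; HDFS has no random in-place update,
        // so "modifying" a file means deleting it and writing a new one.
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```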

 

Applicability and limitations of HDFS:

1. Batch data read and write, high throughput.

2. Not suitable for interactive applications, high latency.

3. It is suitable for writing once and reading many times, and reading sequentially.

4. Concurrent writes to the same file by multiple users are not supported.

 

Now that we understand HDFS, it is MapReduce's turn.

 

What is MapReduce?

MapReduce is a parallel processing framework that implements task decomposition and scheduling.

The general principle is divide and conquer: a large task is decomposed into many small tasks (map), and after the small tasks finish, their results are combined (reduce).

Concretely, once the JobTracker receives a job, it splits the job into many MapTasks and ReduceTasks and hands them out for execution. The input and output of the map and reduce functions are both in the form of <key, value> pairs. The input data stored in HDFS is parsed into key-value pairs and fed to the map() function, which outputs a series of key-value pairs as intermediate results. In the reduce phase, the intermediate pairs with the same key are merged to form the final result.
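A classic illustration of this <key, value> flow is word counting. The sketch below is a minimal version using the standard Hadoop Java API (the class names are just examples): the mapper turns each input line into <word, 1> pairs, and the reducer sums the values that share the same key.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line arrives as an <offset, line> pair and is
// broken into <word, 1> intermediate pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: all intermediate pairs with the same word are grouped
// together and their counts are summed into the final <word, total> pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```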

First of all, we need to know a few small concepts:

1. Job  2. Task  3. JobTracker  4. TaskTracker

Job: Inside Hadoop, a Job represents the collection of all the jar files and classes required to run a MapReduce program. These are finally packaged into a single jar file and submitted to the JobTracker, which then runs the MapReduce program (a minimal driver sketch follows the task roles below).

Task: The job is divided into multiple tasks, which come in two kinds: MapTask and ReduceTask.

JobTracker: the management node. It splits the job into multiple map tasks and reduce tasks.

Functions:
1. Job scheduling
2. Assign tasks, monitor task execution progress
3. Monitor TaskTracker status

TaskTracker: the task node. It usually runs on the same machine as a DataNode, so that computation can move to where the data is and overhead is minimized.

Functions:

1. Perform tasks

2. Report task status
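To tie these concepts together, here is a minimal driver sketch showing how the mapper and reducer classes from the earlier example are packaged into a job and submitted to the cluster. It uses the standard org.apache.hadoop.mapreduce API; the input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // The jar containing these classes is what gets shipped to the cluster;
        // the scheduler then splits the job into map tasks and reduce tasks.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS input and output paths.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```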

 

In MapReduce, there is also a fault tolerance mechanism.

1. Re-execution. A failed task attempt is retried, by default up to 4 attempts before the job is marked as failed.

2. Speculative execution. Because the reduce phase only starts after all the maps have finished, a single very slow map would hold everything up; so a duplicate task is launched to do the same work, and whichever copy finishes first is used (see the sketch below).
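Both mechanisms are controlled by configuration. A small sketch, assuming Hadoop 2.x property names (older releases use the mapred.* equivalents):

```java
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Retry a failed task attempt; 4 attempts is the usual default.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Speculative execution: launch a duplicate attempt for straggler tasks
        // and keep whichever copy finishes first.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        System.out.println("map attempts = "
                + conf.getInt("mapreduce.map.maxattempts", 4));
    }
}
```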

 

With that, we have a general understanding of how Hadoop works: mainly how the HDFS file system stores data, and how MapReduce schedules and assigns jobs.

 
