Global sorting of large-scale data: Hadoop as the suggested framework

Hadoop is in essence a data-driven model combining HDFS and MapReduce: tasks run on the compute nodes where their data is stored, so that unifying compute and storage nodes makes full use of computing resources and also saves the overhead of transmitting data over the network.

Hadoop

1. Hello Hadoop~~!

Hadoop (a made-up name; it is what someone's son called his toy elephant) is a thing complex to the extreme, and simple to the extreme.

It is complex because a Hadoop cluster often consists of dozens, even hundreds or thousands, of low-cost computers. Every task you run must be distributed across those machines, the intermediate data must be sorted and the final results aggregated, and along the way the framework handles node discovery, task retries, replacement of failed nodes, and all the rest of the maintenance and exception handling. And because a Hadoop cluster is typically built from such commodity computers, the odd worker going on strike, that is, a machine failing, is not unusual at all.

And it is simple because you need to manage none of the above. All you have to do is write a program, which can of course be a script, that reads data from standard input, processes it, and writes the result to standard output.
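
This contract, read lines from standard input and write results to standard output, is exactly what Hadoop Streaming expects of user code. A minimal illustrative sketch in Java (the upper-casing step is only a placeholder for real processing; any language or script works the same way):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// A stdin-to-stdout filter of the kind Hadoop Streaming can run as a map step.
public class StreamingMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            // The "processing" here is trivial; real mappers emit key\tvalue pairs.
            System.out.println(line.toUpperCase());
        }
    }
}
```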

Now perhaps you see it: Hadoop is a computing model. A distributed computing model.

1.1 Map and Reduce

As the old saying about the affairs of the world goes: long divided, they must unite; long united, they must divide.

So-called distributed computing means taking a big pile of raw data to be processed, chopping it up, and throwing the pieces into several pots: blanch what needs blanching, fry what needs frying. When everything is nearly done, combine it in a certain order, say what cooks slowly first and what cooks quickly after, stir-fry it into one dish, plate it, and serve.

The earlier steps, the chopping and distributing, are map. Map's role is to break up the input data, do some simple processing, and emit output. Hadoop then sorts this intermediate data first, in a phase referred to as the shuffle, after which reduce merges the intermediate data together and outputs the final result.

Here is a simple example. Suppose the task of computing, from the Public Security Bureau's ID-number database, the population of each city in the country falls on your head (fine, this ought to be the Statistics Office's job). You export all the ID numbers to files, one per line, and hand those files to map. All map does is take the first six digits of each ID number, which encode the region, and output those six digits directly. Hadoop then sorts by these six-digit prefixes, so that identical prefixes are routed together to the same reduce. Reduce checks whether each input number is the same as the previous one: if so, it accumulates the count; when the number changes, it outputs the previous number together with its count. That way you end up with population statistics for every city.
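
As an illustrative sketch (not code from the original article), the map and reduce just described might look like this with the Hadoop Java API, assuming one ID number per line of input; the class names are made up:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RegionCount {
    // Map: emit the first six digits of each ID (the region code) with a count of 1.
    public static class PrefixMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text prefix = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String id = line.toString().trim();
            if (id.length() >= 6) {
                prefix.set(id.substring(0, 6));
                ctx.write(prefix, ONE);
            }
        }
    }

    // Reduce: the shuffle has already grouped identical prefixes together,
    // so the manual "compare with the previous number" step from the text
    // reduces to summing the counts for each key.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text prefix, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(prefix, new IntWritable(sum));
        }
    }
}
```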

The figure below illustrates the map and reduce processing.

[Figure: MapReduce processing flow]

The figure shows a MapReduce data-processing flow, divided into three parts: map, shuffle, and reduce. Each map task reads its slice of the partitioned data and processes it into a series of key:value pairs. The shuffle distributes this intermediate output to the corresponding reduce tasks in a programmer-defined way, and also sorts the data sent to each reduce task. Reduce performs the final processing of the data.

The MapReduce computing framework suits ultra-large-scale data (on the order of 100 TB) in which the individual records have low correlation with one another.

1.2 HDFS

You may have heard of NTFS, VFS, NFS, and so on before. Yes, HDFS is the Hadoop file system.

Why do we need a specialized file system?

This is because Hadoop runs on computers joined by a loosely coupled network (loose in the sense that the failure of any single computer in the cluster does not matter and will not affect the cluster). These multiple computers need unified file access; that is, different computers must be able to locate the same file by the same path.

HDFS is a distributed file system that provides better fault tolerance and scalability.
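
As a small illustrative sketch of that unified access, reading a file through the HDFS Java API looks like this; the namenode address and the path are hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
        // The same path resolves to the same file from any node in the cluster.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/ids/part-00000"))))) {
            System.out.println(in.readLine());
        }
    }
}
```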

1.3 Nodes and Slots

A Hadoop cluster is composed of many low-cost computers, and these computers are called nodes. The computers making up a Hadoop cluster are usually general-purpose machines; there is no section specially dedicated to computation or to storage.

The benefit of this is obvious: an especially large hard drive or an especially fast CPU always means an unacceptable price, and when such a "special" node goes down, finding a substitute for it becomes very difficult.

Another benefit of unifying compute and storage nodes is that files generated during a task's computation can be placed directly on the local storage of the node running it, reducing network bandwidth consumption and latency.

When measuring the map and reduce processing capacity of a Hadoop cluster, the usual unit is the slot. The number of slots is the total CPU concurrency of all computers in the cluster (number of CPUs × cores per CPU × hyperthreads per core). Each task is scheduled into one slot; a task that cannot get a slot waits. As a hypothetical example, a ten-node cluster where each node has 2 CPUs with 4 hyperthreaded cores would offer 10 × 2 × 4 × 2 = 160 slots.

2. A Hadoop application example: sorting large-scale data

The Hadoop platform does not itself provide global sorting of data, yet globally ordered data is a very common need in large-scale data processing. Many schemes that cut a large-scale data task into small-scale processing tasks must first sort the large-scale data globally. For example, joining the attributes of two large data sets can be achieved by first globally sorting both sets of data and then decomposing the join into a series of small merge problems.

2.1 Using Hadoop for global sorting of large-scale data

The most intuitive way to sort a large amount of data with Hadoop is to hand all the files to map tasks that do no processing at all and output directly to a single reduce task, letting Hadoop's own shuffle mechanism sort all the data, which reduce then outputs directly.

However, this approach is no different from sorting on a single machine; it completely fails to exploit the convenience of distributed computing across multiple machines. Therefore this method is not acceptable.

Following a divide-and-conquer computing model instead, we can borrow the idea of quicksort for Hadoop. Let us briefly recall quicksort. Its basic step is to select one of the data items to be sorted as a pivot, then put everything larger than the pivot on one side and everything smaller on the other.

Now imagine we have N pivots (we may call them boundaries here); all the data can then be divided into N + 1 parts. Throw these N + 1 parts at N + 1 reduce tasks, let Hadoop do its automatic sorting, and the final output is N + 1 internally sorted files. Concatenate these N + 1 files end to end into one file, and call it a day.

Thus we can summarize the steps for sorting a large amount of data with Hadoop:

1) Sample the data to be sorted;

2) Sort the sampled data and generate the boundaries;

3) In map, compute for each input record which two boundaries it falls between, and send the record to the reduce with the corresponding partition ID;

4) Each reduce outputs the data it receives directly. (A driver sketch implementing these steps follows.)
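
Hadoop in fact ships with building blocks for exactly this recipe: InputSampler covers steps 1 and 2, and TotalOrderPartitioner covers step 3. A minimal driver sketch with the default identity map and reduce; the paths, reduce count, and sampling parameters are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class GlobalSort {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "global sort");
        job.setJarByClass(GlobalSort.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4); // N + 1 = 4 partitions, so N = 3 boundaries
        FileInputFormat.addInputPath(job, new Path("/data/urls"));     // illustrative
        FileOutputFormat.setOutputPath(job, new Path("/data/sorted")); // illustrative

        // Steps 1 and 2: sample the input and write the N boundary keys.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions")); // illustrative
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000));

        // Steps 3 and 4: the default identity map and reduce suffice;
        // the shuffle does the per-partition sorting.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```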

Here, sorting a set of urls serves as the example:

[Figure: example of globally sorting a set of urls]

There is still one small problem to address: how is data sent to the reduce with a specified ID? Hadoop offers a variety of partitioning algorithms. Based on the key of the data output by map, these algorithms determine which reduce the data should go to (reduce-side sorting also depends on the key). Therefore, if data needs to be sent to a particular reduce, it is enough to attach a suitable key when outputting the data (in the example above, the reduce ID + url), and the data goes exactly where it should.
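
The routing can also be implemented by hand with a custom Partitioner, which makes the mechanism explicit. In this illustrative sketch the boundary keys are hard-coded; a real job would load the N boundaries produced by the sampling step and register the class with job.setPartitionerClass(...):

```java
import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends each record to the reduce whose key range contains it: keys below
// BOUNDARIES[0] go to reduce 0, keys in [BOUNDARIES[i-1], BOUNDARIES[i])
// go to reduce i, and keys at or above the last boundary go to reduce N.
public class BoundaryPartitioner extends Partitioner<Text, Text> {
    // N = 3 hard-coded boundaries for N + 1 = 4 reduces; illustrative only.
    private static final String[] BOUNDARIES = { "g", "n", "t" };

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int pos = Arrays.binarySearch(BOUNDARIES, key.toString());
        // binarySearch returns the index if found, else -(insertionPoint) - 1;
        // either way the expression below yields the target partition.
        int partition = pos >= 0 ? pos + 1 : -(pos + 1);
        return Math.min(partition, numPartitions - 1);
    }
}
```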

2.2 Considerations

1) The boundaries should be extracted so that the resulting partitions are as uniform as possible; this is the same reason many variants of the quicksort algorithm emphasize the choice of pivot.

2) HDFS is a file system whose read and write performance is asymmetric. One should exploit its strong read performance as much as possible and reduce reliance on writing files and on shuffle operations. For example, when the processing of data has to be decided according to statistics over that data, splitting the statistics and the processing into two map-reduce passes will be faster than merging both the statistics and the processing into a single reduce.

3. Summary

Hadoop is in essence a data-driven model combining HDFS and MapReduce: tasks run on the compute nodes where their data is stored, so that unifying compute and storage nodes makes full use of computing resources and also saves the overhead of transmitting data over the network.

Hadoop provides a platform that makes parallel computing on a cluster easy to use. Any computing model in which the correlations between pieces of data can be isolated can be applied well on Hadoop.

Origin: blog.csdn.net/chengxvsyu/article/details/92206181