Detailed explanation of the Hadoop big data basic framework technology

1. What is big data

Since the beginning of this century, and especially since 2010, the growth of data has exploded with the development of the Internet, above all the mobile Internet. It is difficult to estimate how much data is stored in electronic devices around the world. The units used to measure data volume have kept climbing: MB (roughly one million bytes), GB (1024 MB), TB (1024 GB), and so on. At present, data systems at the PB level (1024 TB) are already very common. Data from social networking sites, scientific computing, securities trading, website logs, and sensor networks keeps growing, and the total amount of domestic data has already exceeded the ZB level (1 ZB = 1024 EB, 1 EB = 1024 PB).

The traditional approach to data processing is to keep upgrading the hardware as the data volume grows: more powerful CPUs, larger disks, and so on. The reality, however, is that the volume of data grows far faster than the computing and storage power of a single machine.

The "big data" approach is to process large amounts of data with multiple machines and multiple nodes. Adopting this new approach requires a new kind of big data system, one that handles communication between nodes, coordination, data partitioning, and a series of related issues.

In short, the thinking behind "big data" is to handle massive data with multiple machines and multiple nodes, solving the problems of communication coordination, data coordination, and computation coordination among the nodes. Its defining characteristic is horizontal scaling: as the amount of data keeps growing, more machines can be added, and a big data system can reach tens of thousands of machines or even more.

2. Hadoop overview

Hadoop is a software platform for developing and running large-scale data processing. It is an open-source Apache software framework, implemented in Java, for the distributed computing of massive data on clusters made up of large numbers of computers.

The core designs of the Hadoop framework are HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over that data.

Besides the community's Apache Hadoop, vendors such as Cloudera, Hortonworks, IBM, Intel, Huawei, and Dakuai Search all provide their own commercial distributions. The commercial versions mainly add professional technical support, which is especially important for large enterprises. DK.Hadoop, for example, is a deeply integrated, recompiled Hadoop distribution that can be released on its own and is a required component when FreeRCH (the Dakuai big data integrated development framework) is deployed independently. DK.Hadoop integrates a NoSQL database, which simplifies programming across the file system and the non-relational database, and it improves the cluster synchronization system, making Hadoop's data processing more efficient.

3. Detailed explanation of Hadoop development technology

1. Hadoop operating principle

Hadoop is an open source distributed parallel programming framework that can run on large-scale clusters. Its core designs include: MapReduce and HDFS. Based on Hadoop, you can easily write distributed parallel programs that can process massive data and run them on large-scale computer clusters consisting of hundreds or thousands of nodes.

It is relatively simple to write distributed parallel programs with the MapReduce computing model. The programmer's main job is to design and implement the Map and Reduce classes; the other hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are handled by the MapReduce framework and the HDFS file system, so programmers do not have to worry about them at all. In other words, programmers only need to care about their own business logic, not the underlying communication mechanisms, and can still write complex and efficient parallel programs. If the difficulty of distributed parallel programming was once enough to daunt ordinary programmers, the emergence of open-source Hadoop has greatly lowered that threshold.
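As a small illustration of this division of labor, the driver sketch below (assuming a Hadoop 2.x client library) only wires a job together: which Mapper and Reducer classes carry the business logic, and where the input and output live. WordCountMapper and WordCountReducer are the classes from the word-count example later in this article, and the input/output paths are passed in as command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml etc. from the classpath
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The only "business logic" the programmer supplies:
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations on HDFS (supplied on the command line).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Everything else (splits, scheduling, retries, shuffle) is handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```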

2. The principle of MapReduce

The MapReduce processing flow mainly involves the following four parts:

• Client process: submits the MapReduce job;

• JobTracker process: a Java process whose main class is JobTracker;

• TaskTracker process: a Java process whose main class is TaskTracker;

• HDFS: the Hadoop distributed file system, used to share job-related files among these processes.

The JobTracker process acts as the master, scheduling and managing the TaskTracker processes. The JobTracker can run on any computer in the cluster; usually it is configured to run on the NameNode. The TaskTrackers are responsible for executing the tasks assigned by the JobTracker and must run on DataNodes, which means a DataNode is both a storage node and a compute node. The JobTracker distributes Map tasks and Reduce tasks to idle TaskTrackers, lets them run in parallel, and monitors their status. If a TaskTracker fails, the JobTracker transfers its tasks to another idle TaskTracker to run again.
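For orientation, here is a minimal sketch using the classic MRv1 client API: the JobTracker address and port below are hypothetical and would normally live in mapred-site.xml rather than in code. It simply asks the JobTracker how many TaskTrackers have registered with it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MRv1 clients find the master through this property; the host:port here is hypothetical.
        conf.set("mapred.job.tracker", "namenode-host:9001");

        JobClient client = new JobClient(new JobConf(conf));
        // The JobTracker knows every TaskTracker that has registered with it.
        System.out.println("Active TaskTrackers: " + client.getClusterStatus().getTaskTrackers());
        client.close();
    }
}
```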

3. The mechanism of HDFS storage

Hadoop's distributed file system, HDFS, is a virtual distributed file system built on top of the Linux file system. It consists of a management node (NameNode) and N data nodes (DataNodes), each of which is an ordinary computer. Using it feels very similar to the single-machine file systems we are familiar with: you can create directories, create, copy, and delete files, view file contents, and so on. Underneath, however, a file is cut into blocks, and these blocks are scattered across different DataNodes; each block can also be replicated to several DataNodes for fault tolerance and disaster recovery. The NameNode is the core of the entire HDFS: by maintaining a few data structures it records how many blocks each file is cut into, which DataNodes those blocks can be fetched from, and the status of each DataNode.
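The sketch below, written against the standard org.apache.hadoop.fs.FileSystem API (the file path is hypothetical), asks the NameNode which blocks a given file was cut into and on which DataNodes the replicas of each block live.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client handle backed by the NameNode
        Path file = new Path("/user/demo/input.txt");  // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        // The block-to-DataNode mapping is metadata maintained by the NameNode.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d lives on DataNodes: %s%n",
                    i, String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}
```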

HDFS data blocks

Every disk has a default data block size, which is the basic unit of reading and writing on the disk. A file system built on a single disk manages its data in file-system blocks, which are generally integer multiples of the disk block size; disk blocks are typically 512 bytes. HDFS also has the concept of a block, 64MB by default (the amount of data processed by one map). Files on HDFS are likewise divided into block-sized chunks but, unlike other file systems, a file smaller than one block does not occupy the entire block's space.
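As a sketch (using the Hadoop 1.x property name dfs.block.size and its 64MB default; later releases renamed it dfs.blocksize), the block size is a cluster-wide default but can also be chosen per file when the file is created.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default ("dfs.block.size" in Hadoop 1.x; 64MB if nothing is configured).
        long defaultBlock = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
        System.out.println("default block size = " + defaultBlock);

        // A per-file block size can be requested at creation time (here: 128MB blocks, 3 replicas).
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/blocksize-demo.txt");   // hypothetical path
        FSDataOutputStream stream = fs.create(out, true, 4096, (short) 3, 128L * 1024 * 1024);
        stream.writeUTF("hello hdfs");
        stream.close();

        System.out.println("block size recorded for the file = " + fs.getFileStatus(out).getBlockSize());
        fs.close();
    }
}
```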

Task granularity: data splits (Splits)

When the original large data set is divided into small data sets, each small data set is usually made smaller than or equal to the size of one HDFS block (64MB by default), which guarantees that a small data set sits on a single machine and is convenient for local computation. When there are M small data sets to process, M Map tasks are started; note that these M Map tasks are distributed across N computers and run in parallel, while the number of Reduce tasks, R, can be specified by the user.
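A hedged sketch of how these two knobs look from the client side (newer mapreduce API; the 64MB figure and the choice of 4 reducers are only examples): the maximum split size caps how much data one Map task reads, and the reducer count R is set explicitly by the user.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSettings {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "split demo");
        // Keep each split no larger than one HDFS block (64MB here) so a Map task can read locally.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // The number of Map tasks (M) follows from the number of splits;
        // the number of Reduce tasks (R) is chosen by the user.
        job.setNumReduceTasks(4);
        return job;
    }
}
```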

The first obvious benefit of block storage in HDFS is that a file can be larger than the capacity of any single disk in the cluster, because its blocks can be stored on any of the available disks. The second is that it simplifies the design of the system: making the block the unit of control simplifies storage management, since it is easy to calculate how many blocks a single disk can hold, and it separates out concerns about metadata, such as permission information, which can be managed by other systems.

4. A simple example to illustrate how MapReduce operates

Take a program that counts the number of occurrences of each word in a text file. Here <k1, v1> can be <offset of a line in the file, the text of that line>. After the Map function processes it, a batch of intermediate results of the form <word, count> is produced, and the Reduce function then processes these intermediate results, accumulating the counts for the same word to obtain the total number of occurrences of each word.
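A minimal sketch of this word-count example in Java, written against the org.apache.hadoop.mapreduce API (the class names are the ones assumed by the driver sketch earlier): the Mapper receives <line offset, line text> and emits <word, 1>; the Reducer sums the 1s for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: <offset of the line in the file, the line itself>  ->  <word, 1>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate result <word, 1>
        }
    }
}

// Reduce: <word, [1, 1, ...]>  ->  <word, total number of occurrences>
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```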

5. The core process of MapReduce: Shuffle and Sort

Shuffle is the heart of MapReduce. Understanding this process helps you write more efficient MapReduce programs and tune Hadoop better.

Shuffle refers to the process that starts from the Map output and covers the system sorting the data and delivering the Map output to the Reducer as its input.

Start the analysis from the Map side. When a Map task begins to produce output, it does not simply write the data to disk, because frequent disk operations would cause serious performance degradation; its handling is more sophisticated. The data is first written to a buffer in memory and pre-sorted there to improve efficiency.

Each Map task has a circular memory buffer into which it writes its output. The default size of this buffer is 100MB (configurable through the io.sort.mb property). When the amount of data reaches a threshold (io.sort.mb * io.sort.spill.percent, where io.sort.spill.percent defaults to 0.80), the system starts a background thread that spills the buffer's contents to disk. During the spill, the Map output continues to be written to the buffer, but if the buffer fills up, the Map blocks until the spill completes. Before the spill thread writes the buffered data to disk, it performs a two-level sort: first by the partition the data belongs to, then by key within each partition. The output consists of an index file and a data file. If a Combiner is configured, it runs on the sorted output. A Combiner is a mini Reducer: it runs on the very node that executes the Map task and performs a simple Reduce on the Map output, making it more compact so that less data is written to disk and transmitted to the Reducer. Spill files are stored in the directory specified by mapred.local.dir and are deleted after the Map task ends.
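These sort-buffer settings can also be overridden per job. The sketch below uses the MRv1 property names exactly as given above; the values and the local directory are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration tune() {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                      // in-memory sort buffer in MB (default 100)
        conf.setFloat("io.sort.spill.percent", 0.80f);       // spill threshold (default 0.80)
        conf.set("mapred.local.dir", "/data/mapred/local");  // where spill files land (hypothetical path)
        return conf;
    }
}
```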

Whenever the data in memory reaches the spill threshold, a new spill file is generated, so by the time the Map task writes its last output record there may be several spill files. Before the Map task completes, all spill files are merged and sorted into a single index file and data file. This is a multi-way merge; the maximum number of merge ways is controlled by io.sort.factor (default 10). If a Combiner is configured and there are at least 3 spill files (controlled by the min.num.spills.for.combine property), the Combiner runs to compact the data before the output file is written to disk.
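To tie the knobs from this paragraph together, here is a hedged sketch (values illustrative; WordCountReducer is the reducer from the word-count example, reused as a Combiner because summing counts works the same way on partial results).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MergeSettings {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "word count with combiner");
        // Mini-reduce on the map side: compacts <word, 1> pairs before they hit disk and the network.
        job.setCombinerClass(WordCountReducer.class);
        // Merge up to 20 spill files in one pass (default 10).
        job.getConfiguration().setInt("io.sort.factor", 20);
        // Run the combiner during the merge only if there are at least 3 spill files.
        job.getConfiguration().setInt("min.num.spills.for.combine", 3);
        return job;
    }
}
```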

Da Kuai Big Data Platform (DKH) is a one-stop, search-engine-grade, general-purpose big data computing platform designed by Da Kuai to open up a channel between the big data ecosystem and traditional, non-big-data companies. With DKH, traditional companies can easily cross the technical gap of big data and achieve search-engine-level big data platform performance.

• DKH effectively integrates all the components of the Hadoop ecosystem and is deeply optimized and recompiled into a complete, higher-performance, general-purpose big data computing platform, achieving organic coordination among the components. Compared with the open-source big data platform, DKH's computing performance is therefore improved by up to five times.

• Through Dakuai's proprietary middleware technology, DKH simplifies the complex big data cluster configuration down to three node types (master node, management node, computing node), which greatly simplifies cluster management, operation, and maintenance and enhances the cluster's availability, maintainability, and stability.

• Although highly integrated, DKH still retains all the advantages of the open-source systems and is 100% compatible with them. Big data applications developed on the open-source platforms run efficiently on DKH without any changes, with performance improved by up to five times.

DKH standard platform technology framework
