Big Data Hadoop Tutorial: A Detailed Analysis of Hadoop's Core Architecture

At this stage of the series, this summary takes a closer look at the internal mechanisms by which HDFS, MapReduce, HBase, and Hive run, and at how a distributed database and a Hadoop-based distributed data warehouse are built on top of their concrete implementations. Anything missing will be revised in follow-up posts.


HDFS architecture

The Hadoop architecture as a whole relies on HDFS to provide the underlying support for distributed storage, and on MapReduce (MR) to provide support for distributed parallel processing of tasks.

HDFS follows a master/slave structural model: an HDFS cluster consists of one NameNode and a number of DataNodes (the latest Hadoop 2.2 release already supports multiple NameNodes in its configuration, so functionality that some large companies had previously obtained by modifying the Hadoop source code is now available out of the box). The NameNode, acting as the master server, manages the file system namespace and client access to files; the DataNodes manage the storage on the nodes where they run. HDFS lets users store data in the form of files.

Internally, a file is split into several data blocks, and these blocks are stored on a set of DataNodes. The NameNode carries out the namespace operations of the file system, such as opening, closing, and renaming files or directories, and it is also responsible for mapping data blocks to specific DataNodes. The DataNodes handle read and write requests from file system clients and, under the unified direction of the NameNode, create, delete, and replicate blocks. The NameNode is the manager of all HDFS metadata; user data never flows through the NameNode.

Figure: HDFS architecture diagram.

The figure involves three roles: NameNode, DataNode, and Client. The NameNode is the manager, the DataNodes are where files are stored, and the Client is the application that needs to access the distributed file system.


File write:

1) The Client sends a file-write request to the NameNode.

2) Based on the file size and the block configuration, the NameNode returns to the Client information about the DataNodes it manages.

3) The Client splits the file into blocks and, following the DataNode addresses, writes each block to the DataNodes in order.
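To make the write path concrete, here is a minimal sketch using the Hadoop FileSystem Java API (the NameNode address and file path are assumptions for illustration, not values from this article). The client only asks the NameNode for metadata; the stream returned by create() ships the block data to the DataNodes the NameNode selected.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt"); // hypothetical path

        // create() asks the NameNode to allocate blocks and DataNodes;
        // the returned stream writes the block data to those DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}
```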

File read:

1) The Client sends a file-read request to the NameNode.

2) The NameNode returns information about the DataNodes that store the file.

3) The Client reads the file data from those DataNodes.
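The read path can be sketched the same way: open() asks the NameNode where the blocks live, and the returned stream pulls the data from those DataNodes (again, the address and path are assumed for illustration).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address

        FileSystem fs = FileSystem.get(conf);
        // open() fetches the block locations from the NameNode; the stream
        // then reads the block contents directly from the DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```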

Points worth learning from how HDFS, as a distributed file system, manages data:

Block placement: each block has three replicas — one on the DataNode designated by the NameNode, one on a DataNode in a different rack from the designated DataNode, and one on a DataNode in the same rack as the designated DataNode but not on the same machine. The purpose of the replicas is data safety; this placement scheme balances the risk of a whole rack failing against the performance cost of copying data between different racks.
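The placement described above can be observed from client code. A small sketch (file path assumed) that requests three replicas for a file and prints which DataNodes, and which racks when rack awareness is configured, hold each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacementExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/hello.txt"); // hypothetical file

        // Ask for 3 replicas of this file (the usual HDFS default).
        fs.setReplication(file, (short) 3);

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // getHosts() lists the DataNodes holding this block's replicas;
            // getTopologyPaths() includes the rack (e.g. /rack1/host)
            // when rack awareness is configured.
            System.out.println(String.join(",", block.getHosts())
                    + "  " + String.join(",", block.getTopologyPaths()));
        }
        fs.close();
    }
}
```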

MapReduce architecture

The MR framework consists of a single JobTracker running on the master node and a TaskTracker running on each slave node of the cluster. The master node is responsible for scheduling all the tasks that make up a job; these tasks are spread across the different slave nodes. The master node monitors their execution and re-runs any task that failed. The slave nodes are responsible only for the tasks the master node assigns to them. When a Job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, and at the same time schedules the tasks and monitors their execution on the TaskTrackers. The JobTracker can run on any machine in the cluster. The TaskTracker is responsible for executing tasks and must run on a DataNode, so a DataNode is both a storage node and a compute node. The JobTracker hands map tasks and reduce tasks to idle TaskTrackers, lets these tasks run in parallel, and monitors their progress. If a TaskTracker fails, the JobTracker transfers its tasks to another idle TaskTracker to be re-run.

HDFS and MR together form the core of Hadoop's distributed system architecture. HDFS provides the distributed file system on the cluster, while MR provides distributed computing and task processing on top of that cluster. HDFS supplies the file-operation and storage support needed while MR processes tasks; MR, building on HDFS, handles task distribution, tracking, and execution, and collects the results. The two interact to carry out the main work of a distributed cluster.

MR is the programming framework for developing parallel applications on Hadoop. The principle of the MR programming model is to take a set of key-value pairs as input and produce a set of key-value pairs as output. The MR library implements this framework through two functions, Map and Reduce. The user-defined map function takes an input key-value pair and produces a set of intermediate key-value pairs. MR groups together all intermediate values that share the same key and passes them to the reduce function. The reduce function accepts a key and its associated set of values and merges those values into a smaller set of values. Usually the intermediate values are supplied to the reduce function through an iterator (the iterator's role is to collect these values), which makes it possible to handle sets of values too large to fit entirely in memory.
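The classic illustration of this key-value model is word count. Below is a minimal sketch of the user-defined map and reduce functions with the Hadoop MapReduce Java API (class and field names are my own choices for the example):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (k1 = byte offset, v1 = line of text) -> intermediate (k2 = word, v2 = 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit intermediate pair
                }
            }
        }
    }

    // reduce: (k2 = word, all its values via an iterator) -> (k3 = word, v3 = total)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) { // the framework supplies values through an iterator
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```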


Explanation of the third figure (a companion illustration):

Briefly, the process is as follows: a large data set is divided into many small data blocks, and these blocks are processed on the cluster nodes to produce intermediate results. Within a task on a single node, the map function reads the data line by line as (k1, v1) pairs and puts its output into a cache; the framework sorts the map output by key, producing intermediate (k2, v2) pairs. Every machine does the same. A sort-and-merge step across machines (the shuffle, which can be understood as the stage just before reduce) then brings the (k2, v2) pairs from different machines together, reduce merges them into the final (k3, v3) pairs, and the output files are written to HDFS.

On the subject of reduce: before reduce runs, the intermediate data can first be combined (Combine), meaning that intermediate pairs with the same key are merged. The Combine step is similar to the reduce step, but Combine runs as part of the map task and is executed only right after the map function. Combining the intermediate results reduces the number of key-value pairs and therefore the network traffic.

Once Combine and Partition are done, the intermediate results of a map task are stored as files on the local disk. The locations of these intermediate-result files are reported to the master JobTracker, and the JobTracker then tells the reduce tasks which DataNodes to fetch their intermediate results from. The intermediate results of all map tasks are split by a hash function over their keys into R parts, and each of the R reduce tasks is responsible for one key range. Each reduce task fetches, from many map-task nodes, the intermediate results falling within the key range it is responsible for, then executes the reduce function, and finally produces one final result. With R reduce tasks there are R final results; in many cases these R results do not need to be merged into a single one, because they can serve directly as the input to another computation task, kicking off another parallel computation. This is what forms the multiple output data segments (HDFS replicas) shown in the figure above.
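A driver program ties the pieces from the last few paragraphs together. The sketch below (reusing the word-count classes from the earlier example, with assumed input/output paths) registers the reducer as the combiner and asks for R = 4 reduce tasks, so the intermediate keys are hashed into four partitions and the job writes four result files to HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        // The combiner runs on the map side and pre-merges values with the
        // same key, cutting down the data shuffled across the network.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // R = 4: keys are hashed into 4 partitions (the default HashPartitioner),
        // and the job produces 4 output files in HDFS.
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path("/demo/input"));   // assumed paths
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```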

HBase Data Management

HBase is the Hadoop database. How does it differ from traditional databases such as MySQL and Oracle? It is the difference between column-oriented data and row-oriented data, and the difference between NoSQL and a traditional relational database:

HBase vs. Oracle

1. HBase suits scenarios with heavy inserts that also have a large volume of reads: given a key, get a value, or given some keys, get some values.

2. HBase's bottleneck is hard disk transfer rate. In HBase you can insert data and, to a degree, update data, but an update is really an insert: a new row with a new timestamp is inserted. Deleting data is also an insert: a row carrying a delete marker is inserted. All HBase operations are therefore append-style inserts. HBase is essentially a log-structured database; its storage looks like log files, and it writes to the hard disk in large batches, reading and writing in the form of files. Its read/write speed therefore depends on how fast data can be transferred between the hard disk and the machine. Oracle's bottleneck, by contrast, is hard disk seek time. It mostly performs random-access operations: to update a data block it must first locate the block on disk, load it into memory, modify it in the memory cache, and write it back some time later. Because the blocks being looked up differ, this involves random reads. Seek time is determined mainly by rotational speed, and that underlying technology has barely changed, so seek time has become the bottleneck.

3. HBase data can be stored as many versions distinguished by timestamp (that is, the same cell can keep many different versions; this data redundancy is also an asset). The data is ordered by time, so HBase is particularly suited to finding the top N items in chronological order: the N messages someone viewed most recently, the N blog posts they wrote most recently, the N most recent actions of some kind, and so on. This is why HBase appears so often in Internet applications.
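A minimal sketch of these append-style, versioned operations with the HBase Java client (the table name, column family, and row key are assumptions for illustration, and the column family is assumed to have been created with several versions enabled): a put writes a new timestamped version instead of modifying data in place, a get can read back several recent versions of the same cell, and a delete only appends a tombstone marker.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_actions"))) {

            byte[] row = Bytes.toBytes("user-42");   // assumed row key
            byte[] cf  = Bytes.toBytes("info");      // assumed column family (VERSIONS >= 3)
            byte[] col = Bytes.toBytes("last_post");

            // "Updating" the same cell twice really appends two versions,
            // each with its own timestamp.
            Put first = new Put(row);
            first.addColumn(cf, col, Bytes.toBytes("my first post"));
            table.put(first);

            Put second = new Put(row);
            second.addColumn(cf, col, Bytes.toBytes("my newest post"));
            table.put(second);

            // Read back the most recent versions of the cell, newest first.
            Get get = new Get(row);
            get.setMaxVersions(3);
            Result result = table.get(get);
            for (Cell cell : result.getColumnCells(cf, col)) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }

            // A delete does not rewrite data either; it appends a tombstone marker.
            table.delete(new Delete(row));
        }
    }
}
```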

4. HBase's limitation: it can only do simple key-value queries. It fits scenarios with high-speed inserts combined with a large volume of reads, and that is a rather extreme scenario; not every company has such a need. In some companies the workload is ordinary OLTP (online transaction processing) with random reads and writes. In that case, Oracle's reliability and maturity as a system give it the advantage.


