Hadoop Series: Introduction to the Hadoop Distributed System

        With the rapid development of the intelligent era and the Internet of Everything, the amount of data has begun to grow sharply. On the one hand, we need to think about how to store massive amounts of data efficiently and reliably; on the other hand, we also need to analyze and process that data to extract more valuable information. This is where Hadoop comes in.

        Hadoop is an open source distributed computing platform under the Apache Software Foundation. Its core components are HDFS (Hadoop Distributed File System) and MapReduce (Hadoop 2.0 added YARN, a resource scheduling framework that manages and schedules tasks at a fine granularity and can also host other computing frameworks, such as Spark). Together they provide users with a distributed infrastructure that is transparent to the underlying details of the system. The high fault tolerance, high scalability, and high efficiency of HDFS allow users to deploy Hadoop on low-cost hardware to form a distributed system.

Hadoop Ecosystem

        Beyond the basic components, Hadoop has today developed a very complete and extensive open source ecosystem: HDFS provides file storage, YARN provides resource management, and on this foundation run various processing frameworks, including MapReduce, Tez, Spark, and Storm, to meet data usage scenarios with different requirements.

HDFS Architecture

[Figure: HDFS architecture diagram]

        HDFS adopts a master-slave architecture. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode serves as the master server, managing the file system namespace and client access to files, while the DataNodes are responsible for managing the data they store. At the storage layer, HDFS splits each file into blocks, and these blocks are replicated across different DataNodes for fault tolerance and disaster recovery.
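
        To make this concrete, here is a minimal sketch of how a client might write and read a file through the HDFS Java API. The NameNode address (hdfs://localhost:9000) and the file path are assumptions for illustration; adjust them for your cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/demo/hello.txt");

        // Write: the NameNode allocates blocks, and the DataNodes
        // store the replicated copies.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode for block locations,
        // then streams the bytes directly from the DataNodes.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```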

MapReduce

        MapReduce is a computing model that originated at Google. It abstracts the complex parallel computation running on a large-scale cluster into two functions: Map and Reduce. The map function takes key/value pairs as input and generates a series of intermediate key/value pairs, which are written to the local disk. The MapReduce framework automatically groups these intermediate data by key, and all values with the same key are handed to the reduce function together. The reduce function takes a key and its list of values as input, merges the values for that key, and produces another series of key/value pairs that are written to HDFS as the final output.
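
        The canonical example is word counting. The sketch below uses the Hadoop Java API to show both functions: the mapper emits an intermediate (word, 1) pair for every word it sees, and the reducer receives all the 1s grouped under one word and sums them. Class and variable names here are assumptions for illustration, not part of the original article.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: take (offset, line) pairs as input and emit an intermediate
    // (word, 1) pair for every word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // intermediate output, spilled to local disk
                }
            }
        }
    }

    // Reduce: the framework has already grouped the intermediate pairs by key,
    // so each call receives one word and all of its 1s; summing them gives the count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // final output, written to HDFS
        }
    }
}
```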

Differences between Hive and HBase

        Within the basic Hadoop ecosystem, two components are worth comparing: Hive and HBase. Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables and provides simple SQL query capabilities, translating SQL statements into MapReduce jobs for execution. HBase is the Hadoop database: a distributed, scalable store for big data.
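
        As a hypothetical illustration of the Hive side, the sketch below submits a SQL query to HiveServer2 over JDBC; the server address, credentials, and the words table are all assumptions, and the Hive JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 address and credentials; adjust for your deployment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL into one or more MapReduce jobs;
             // "words" is a made-up table name for illustration.
             ResultSet rs = stmt.executeQuery(
                 "SELECT word, COUNT(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```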

        Hive itself neither stores nor computes data; it relies entirely on HDFS and MapReduce, and a Hive table is purely logical. Hive uses HDFS to store files and MapReduce as its computing framework, so it can be thought of as a wrapper around MapReduce: its purpose is to translate well-written Hive SQL into MapReduce programs that would be complex and tedious to write by hand. An HBase table, by contrast, is physical rather than logical. HBase is essentially a storage system, a NoSQL (Not Only SQL) database deployed on top of HDFS, and can be thought of as a wrapper around HDFS; it maintains a large hash-table-like structure in memory (used, for example, by search engines to store indexes and speed up queries) and thereby overcomes HDFS's weakness in random reads and writes.
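
        And as a sketch of the HBase side, the random read/write path looks like this through the HBase Java client. The user table and the info:name column are made-up names for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {

            // Random write: a single row keyed by "row1".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            table.put(put);

            // Random read: fetch that row directly by key, which is
            // exactly what plain HDFS files cannot do efficiently.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```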
