The differences, connections, and applicable scenarios of Hive and HBase

When learning big data analysis, Hive and HBase are two important topics that beginners easily confuse. Comparing how the two relate and how they differ helps us understand and position each component clearly. So what are the differences and connections between Hive and HBase, and which scenarios does each suit?
First, let's start with the concepts. Hive is a tool that runs on Hadoop; to be precise, it is a query tool. When querying large amounts of data, Hadoop's computing engine is MapReduce, but writing and operating MapReduce programs is complicated. Hive reduces that programming work to operating on massive data with SQL-like statements (HiveQL), which greatly reduces the programmer's workload. In short, Hive makes querying and managing massive data far more convenient. Hive's logo hints at this: it turns the elephant into a little bee, and simplifying complexity is its essential highlight.
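To make that simplification concrete, the sketch below hand-writes in plain Python the map, shuffle, and reduce steps that a one-line query such as SELECT level, COUNT(*) FROM logs GROUP BY level would replace. The LOGS data, the level column, and all function names are illustrative assumptions. It is a toy model of the MapReduce pattern, not Hive's actual execution plan.

```python
from collections import defaultdict

# Hypothetical log records; in Hive these would be rows of a table named logs.
LOGS = [
    {"level": "INFO", "msg": "started"},
    {"level": "ERROR", "msg": "disk full"},
    {"level": "INFO", "msg": "request served"},
    {"level": "WARN", "msg": "slow response"},
    {"level": "INFO", "msg": "stopped"},
]


def map_phase(records):
    """Emit one (key, 1) pair per record, keyed by the grouping column."""
    for record in records:
        yield record["level"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which is what COUNT(*) amounts to here."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_level(records):
    """Run the three phases in order, mimicking a GROUP BY ... COUNT(*) query."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_by_level(LOGS))
```

Even this small aggregation needs three hand-written phases, which is the kind of boilerplate a SQL layer hides.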
Now for HBase. It began as a Hadoop sub-project (it is now a top-level Apache project) and can also be understood as a tool. Hadoop's computation is done by MapReduce, and its storage is done by HDFS. HDFS is distributed storage, which is how Hadoop stores data, but the result is that data sits in scattered files with no record-level organization. HBase addresses this problem well. It maps the data into a table indexed by row key (conceptually a sorted key-value map rather than a plain hash table). Once the data is in a table with real storage structure, it goes from disordered to ordered, which greatly improves the efficiency of lookups and other operations.
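The idea of a table kept in row-key order can be sketched in a few lines of Python. The class below is a hypothetical in-memory toy, not the HBase client API. The class name, the method names, and the sample row keys are all assumptions made for illustration. It shows why an ordered key index makes both point lookups and range scans cheap.

```python
import bisect


class SortedKeyTable:
    """Toy model of a table kept sorted by row key (illustrative only)."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> row data

    def put(self, row_key, row):
        """Insert or overwrite a row, keeping the key list sorted."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = row

    def get(self, row_key):
        """Point lookup by row key; returns None if the key is absent."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Return (key, row) pairs with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


if __name__ == "__main__":
    table = SortedKeyTable()
    table.put("user#003", {"name": "Carol"})
    table.put("page#010", {"title": "Home"})
    table.put("user#001", {"name": "Ann"})
    print(table.get("user#001"))
    # Scan every key that starts with "user#" ("~" sorts after the digits).
    print(table.scan("user#", "user#~"))
```

Because the keys are ordered, a scan only touches the slice of keys in the requested range instead of reading every record.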

To sum up, both Hive and HBase are tools in the Hadoop ecosystem: Hive is a simplification layer over MapReduce, and HBase is the housekeeper of data stored on HDFS. So which scenarios does each apply to?
1. A Hive table is purely logical: it defines only the table's metadata. Hive has no physical storage of its own and relies entirely on HDFS and MapReduce. Mr. Chen from Shangxuetang points out that in this way a structured data file can be mapped to a database table with SQL query capability, and the SQL statements are ultimately converted into MapReduce jobs to run. HBase tables are physical tables, suitable for storing unstructured data.
2. Hive processes data via MapReduce, which works through data row by row; HBase is column-oriented (data is grouped by column family), which makes random access to massive data feasible.
3. HBase tables are sparse, so each row can define its own set of columns. A Hive table is dense: the number of columns is fixed by the table definition, and every row holds a value for each of those columns.
4. Hive uses Hadoop to analyze and process data, and Hadoop is a batch system, so queries have high latency; HBase is a near-real-time system that can serve real-time data queries.
5. Hive traditionally has no row-level updates; it suits batch processing of large append-only datasets (such as logs). HBase supports row-level queries and updates.
6. Hive supports a rich SQL dialect (HiveQL) and is generally used for mining and analysis of historical data. HBase is not suitable for scenarios that involve joins, multi-level indexes, or complex table relationships.
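Points 3 and 5 above can be illustrated with a small Python sketch. It is a plain in-memory model, not Hive or HBase code. The schema, the helper names, and the family:qualifier column names such as info:name are assumptions chosen for illustration.

```python
# Hive-style dense table: a fixed schema, and every row carries every column.
HIVE_SCHEMA = ("user_id", "name", "email")


def dense_row(values):
    """Build a fixed-width row; columns without a value are stored as None (NULL)."""
    return tuple(values.get(col) for col in HIVE_SCHEMA)


def sparse_put(table, row_key, column, value):
    """HBase-style row-level write: set one column of one row, creating it if needed."""
    table.setdefault(row_key, {})[column] = value


def build_example():
    """Return the same two users modeled as a dense table and as a sparse table."""
    dense = [
        dense_row({"user_id": 1, "name": "Ann"}),  # email is filled with None
        dense_row({"user_id": 2, "name": "Bob", "email": "b@example.com"}),
    ]
    sparse = {}
    sparse_put(sparse, "u1", "info:name", "Ann")
    sparse_put(sparse, "u2", "info:name", "Bob")
    sparse_put(sparse, "u2", "info:email", "b@example.com")
    # Row-level update: overwrite a single cell without touching other rows.
    sparse_put(sparse, "u1", "info:name", "Anna")
    return dense, sparse


if __name__ == "__main__":
    dense_table, sparse_table = build_example()
    print(dense_table)
    print(sparse_table)
```

In the dense model the missing email still occupies a slot, while in the sparse model row u1 simply has no info:email column at all.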
Differences in usage scenarios:
HBase is typically used to store crawled web page data. Because it is a key-value database, it fits all kinds of key-value scenarios, such as storing log information or CMS-like applications whose content does not need to be fully structured. Note that HBase is still aimed at OLTP applications.
Hive is mainly aimed at OLAP applications, and its underlying layer is the HDFS distributed file system. Its focus is a unified query and analysis layer that supports the join, grouping, and aggregation SQL statements used in OLAP applications. Hive is generally used only for query analysis and statistics, not for routine create, update, and delete (CUD) operations. Keep in mind that Hive's data must first be synchronized from an existing database or from logs into HDFS, and incremental real-time synchronization is currently quite difficult to achieve.
That concludes this discussion of the differences, connections, and applicable scenarios of Hive and HBase. I hope it is helpful to students studying big data analysis.
 
