Understand the difference between Hive and HBase in one article

Anyone who does big data development is probably familiar with Hive and HBase.

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query capability by converting SQL statements into MapReduce jobs for execution. HBase is Hadoop's database: a distributed, scalable store for big data. From these definitions alone, the difference between the two may be hard to see. Don't worry, the sections below introduce each in detail.

Characteristics of Both

Hive helps people who are familiar with SQL run MapReduce jobs. Because it is JDBC compatible, it can also be integrated with existing SQL tools. Running a Hive query can take a long time because, by default, it scans all the data in the table. This shortcoming can be mitigated with Hive's partition mechanism, which limits the amount of data scanned per query. Partitions let a filtering query run against datasets stored in separate folders, so only the data in the specified folders (partitions) is read. For example, this mechanism can be used to process only the files within a certain time range, as long as the folder or file names encode the time.
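To make the partition idea concrete, here is a minimal Python sketch that imitates it on the local file system. It is not Hive code: the `dt=<date>` folder naming, the `write_partitioned` and `scan` helpers, and the sample rows are all assumptions made for illustration. It only shows why a filter on the partition value lets a query skip whole folders.

```python
import tempfile
from pathlib import Path


def write_partitioned(root, rows):
    """Append each (dt, line) row to a folder named dt=<value> (assumed layout)."""
    for dt, line in rows:
        part_dir = Path(root) / f"dt={dt}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "data.txt", "a", encoding="utf-8") as f:
            f.write(line + "\n")


def scan(root, dt_from=None, dt_to=None):
    """Read only folders whose dt lies in [dt_from, dt_to].

    Returns (lines, folders_read). With no bounds, every folder is read,
    which mirrors the default full-table scan.
    """
    lines, folders_read = [], 0
    for part_dir in sorted(Path(root).iterdir()):
        dt = part_dir.name.split("=", 1)[1]
        # ISO dates compare correctly as plain strings.
        if dt_from is not None and dt < dt_from:
            continue
        if dt_to is not None and dt > dt_to:
            continue
        folders_read += 1
        for file in sorted(part_dir.iterdir()):
            lines.extend(file.read_text(encoding="utf-8").splitlines())
    return lines, folders_read


def demo():
    """Build three partitions, then compare a full scan with a pruned one."""
    rows = [
        ("2018-08-01", "a"),
        ("2018-08-02", "b"),
        ("2018-08-02", "c"),
        ("2018-08-03", "d"),
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(root, rows)
        full = scan(root)
        pruned = scan(root, "2018-08-02", "2018-08-02")
    return full, pruned
```

Calling `demo()` returns the result of the unfiltered scan, which reads all three folders, and of the filtered scan, which reads only the folder for 2018-08-02.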

HBase stores data as key/value pairs. It supports four main operations: adding or updating rows, scanning a range of cells, getting a specified row, and deleting a specified row, column, or column version. Version information is used to retrieve historical data (the historical data of each row can be deleted, and the space is then reclaimed through HBase compactions). Although HBase has tables, a schema is required only for tables and column families, not for individual columns. HBase tables also provide increment/counter functionality.
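The operations described above can be illustrated with a small in-memory toy model in Python. This is not the HBase client API: the `ToyTable` class, its method names, the logical clock used for timestamps, and the `family:qualifier` column strings in the usage note are assumptions made for this sketch. It only mimics put, get, range scan, delete, versioned cells, and a counter.

```python
import itertools


class ToyTable:
    """In-memory stand-in for a versioned key/value table (not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> {column -> [(timestamp, value), ...], newest first}
        self.rows = {}
        self._clock = itertools.count(1)  # logical timestamps, assumed

    def put(self, row, column, value):
        """Add or update a cell, keeping at most max_versions versions."""
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.insert(0, (next(self._clock), value))
        del cells[self.max_versions:]

    def get(self, row, column, versions=1):
        """Return up to `versions` values of a cell, newest first."""
        cells = self.rows.get(row, {}).get(column, [])
        return [value for _, value in cells[:versions]]

    def scan(self, start, stop):
        """Yield (row, column, latest value) for start <= row < stop, in key order."""
        for row in sorted(self.rows):
            if start <= row < stop:
                for column in sorted(self.rows[row]):
                    yield row, column, self.rows[row][column][0][1]

    def delete(self, row, column=None):
        """Delete a whole row, or a single column of that row."""
        if column is None:
            self.rows.pop(row, None)
        else:
            self.rows.get(row, {}).pop(column, None)

    def increment(self, row, column, amount=1):
        """Treat the cell as a counter and add `amount` to it."""
        current = self.get(row, column)
        new_value = (current[0] if current else 0) + amount
        self.put(row, column, new_value)
        return new_value
```

For example, after `t = ToyTable(max_versions=2)` and three calls to `t.put("user1", "info:name", ...)`, `t.get("user1", "info:name", versions=5)` returns only the two newest values, because the oldest version has been dropped. A real HBase table reclaims the space of deleted data later, during compaction, which this toy model does not attempt to reproduce.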

Limitations

Historically, Hive did not support update operations (newer releases add row-level updates only for tables explicitly created as transactional). In addition, because Hive runs batch jobs on Hadoop, a query usually takes minutes to hours to return results. Hive also requires a predefined schema to map files and directories to columns, and, outside of its optional transactional tables, Hive is not ACID compliant.
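The predefined schema mentioned above is applied when the files are read, not when they are written. The Python sketch below illustrates that idea only. The `SCHEMA` columns, the tab delimiter, and the `read_with_schema` helper are hypothetical, and filling unparseable fields with `None` is an assumption meant to resemble a NULL-like result rather than a statement of exactly how Hive behaves.

```python
# Hypothetical schema: column name and the type each field should be cast to.
SCHEMA = [("ip", str), ("status", int), ("bytes", int)]


def read_with_schema(lines, schema, delimiter="\t"):
    """Map raw delimited lines onto named, typed columns at read time."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row = {}
        for index, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[index])
            except (IndexError, ValueError):
                # Missing or malformed field: store None instead of failing.
                row[name] = None
        rows.append(row)
    return rows
```

Calling `read_with_schema(["10.0.0.1\t200\t512"], SCHEMA)` yields one row with `ip`, `status`, and `bytes` keys. The raw file itself carries no column names or types, which is why the schema has to be declared up front.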

HBase queries are written in HBase's own query language (its shell commands and API) rather than SQL, so it has to be learned separately. SQL-like functionality is available through Apache Phoenix, but at the cost of having to provide a schema. In addition, HBase does not support the full set of ACID properties, although it supports some of them. Last but not least, running HBase requires ZooKeeper, a distributed coordination service that provides configuration management, maintenance of metadata, and a naming service.

Application Scenario

Hive is suitable for analyzing and querying data collected over a period of time, for example to calculate trends or to analyze website logs. Hive should not be used for real-time queries, because it takes a long time to return results.

HBase is well suited to real-time queries over big data. Facebook uses HBase for messaging and real-time analytics. It can also be used to count Facebook connections.

Summary

Hive and HBase are two different technologies built on Hadoop: Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on top of Hadoop. Of course, the two tools can be used together. Just as you might use Google for search and Facebook for socializing, Hive can be used for statistical queries and HBase for real-time queries. Data can also be written from Hive to HBase and then written back from HBase to Hive.

Origin blog.csdn.net/wshyb0314/article/details/81475475