
08 | Can you tell the difference between HBase and Hive?

2021/02/25 Arakawa




In the last lecture, I walked you through the basic framework of HDFS and installed a Hadoop system by hand. As we know, HDFS is the system that manages files in Hadoop, and it is one of Hadoop's cores. In real production work, however, a file management system alone cannot support our business needs well; we also want more convenient ways to operate on the data. In this lecture, I will introduce two tools that are used frequently in daily work and are closely tied to HDFS: Hive and HBase.

Hive

I gave Hive a brief introduction in "05 | Essential Tools for Big Data Development: Hadoop". In the Hadoop system, the component responsible for computation is MapReduce, which means that to process the data stored in HDFS and perform all kinds of statistical analysis and calculation, we have to develop MapReduce programs.

Although MapReduce already encapsulates distributed computing well (we will cover this in "11 | What Are the Basic Ideas of MapReduce for Processing Big Data"), developing against its API is still very difficult for many people. For example, many data product managers or operations staff just want to count some numbers, yet they would have to learn to write a whole program; the difficulty is easy to imagine. This is why Hive came into being: Hive parses SQL statements and converts them into MapReduce programs to run. For a product manager, learning just SQL is much simpler.

Now let's introduce Hive properly. Simply put, Hive is a data warehouse, and the data in the warehouse consists of files managed in HDFS. At the same time, Hive supports a SQL-like query language: you write these statements to analyze data, and Hive converts them into executable MapReduce code and runs the computation. The computation here is limited to querying and analysis.

Hive processes data that has already been stored, and this kind of computation is what we call offline computation (as opposed to real-time computation). When you manipulate data through Hive, it feels much like using MySQL, which lowers your learning cost.
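To make this concrete, here is a minimal HiveQL sketch; the access_log table and its columns are hypothetical, used only to illustrate the idea:

```sql
-- A minimal HiveQL sketch (table and column names are hypothetical).
-- Hive parses this statement and compiles the GROUP BY into a
-- MapReduce job behind the scenes; no MapReduce code is written by hand.
SELECT url, COUNT(*) AS visits
FROM access_log
GROUP BY url;
```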

Next, let's take a look at the basic architecture of Hive. As shown below:

[Figure: Hive architecture]

The picture above comes from an early Hive architecture document and is divided into two parts: the left side is the main body of Hive; the right side is the Hadoop system, with MapReduce at the upper right and HDFS at the lower right. The lines connecting them in the middle illustrate the relationship between Hive and the two cores of Hadoop.

(1) UI

The user interface, which we can also think of as the client, is mainly responsible for interaction with the user: we submit queries and other operations to the system through the UI. Hive also encapsulates a Thrift server, so during development we can use Java, Python, or C++ to access this server and call Hive.

(2) Driver

After the driver receives a HiveQL statement, it creates a session to start executing the statement, and it monitors the life cycle and progress of the execution. As you can see in the figure, the driver is responsible not only for interaction with the compiler but also for interaction with the execution engine.

(3) Compiler

The compiler receives the HiveQL from the driver, obtains the required metadata from the metastore, and compiles the HiveQL statement into an executable plan, split into stages of MapReduce jobs and HDFS operations according to the execution steps. The plan for each stage is sent back to the driver.

(4) Execution Engine

After compilation and optimization, the execution engine runs the tasks. It tracks and interacts with the Hadoop jobs and schedules the tasks that need to run.

(5) Metastore

Metadata refers to information about the Hive tables we build: table names, fields, table structure, partitions, types, storage paths, and so on. Metadata is usually stored in a traditional relational database, such as MySQL.
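To see what this metadata looks like, you can ask Hive to describe a table (the table name below is hypothetical):

```sql
-- Prints the columns, partition keys, owner, creation time,
-- storage format, and HDFS location of the table --
-- all of which come from the metastore.
DESCRIBE FORMATTED access_log;
```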

Pros and cons of Hive

Hive has many advantages; I will explain the main ones from the following aspects.

(1) Simple and easy to use

You only need to know SQL to use Hive, which greatly lowers the barrier to analyzing data compared with writing MapReduce. Many Internet companies, such as Taobao and Meituan, use Hive for log analysis, for example to count a website's PV, UV, and other metrics. It is simple and convenient.
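As a hedged sketch, a daily PV/UV count might look like this, assuming a hypothetical access_log table with log_date and user_id columns:

```sql
-- PV = total page views, UV = distinct visitors.
-- Table and column names are hypothetical.
SELECT log_date,
       COUNT(*)                AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM access_log
GROUP BY log_date;
```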

(2) Hive provides unified metadata management

Because metadata is managed in one place and in a consistent format, the same table definitions can be shared with SQL query engines such as Presto, Impala, and SparkSQL.

(3) Good scalability

Like other components of Hadoop, Hive scales well: to grow a distributed Hive deployment, you only need to add machines.

(4) Support for custom functions (UDFs)

Although SQL has many built-in functions, enough for everyday statistics, some customized logic is clearly awkward to express in plain SQL. Hive supports user-defined functions, so developers can plug in functions they write themselves.
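A minimal sketch of registering and calling a UDF, using Hive's standard syntax; the JAR path, function name, and class name below are placeholders for code you would write yourself:

```sql
-- Load the JAR containing your UDF implementation (path is a placeholder).
ADD JAR /path/to/my_udfs.jar;

-- Register the Java class as a SQL-callable function
-- (function and class names are hypothetical).
CREATE TEMPORARY FUNCTION normalize_phone AS 'com.example.hive.NormalizePhone';

-- Use it like any built-in function.
SELECT normalize_phone(phone) FROM user_info;
```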

Of course, Hive also has disadvantages.

(1) Slow queries

Since Hive's underlying data still lives in HDFS, queries are relatively slow, so Hive is only suitable for offline analysis. In programs, Hive is generally used to load data in one pass; it is not suitable for repeated, interactive access from code.

(2) No row-level data operations

This, too, is a consequence of HDFS storage: we cannot modify data in HDFS in place, so neither can Hive. If you want to change data, you can only rewrite the whole file.
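In practice, a "modification" is therefore a rewrite. A minimal sketch, assuming a hypothetical user_info table and classic (non-ACID) Hive:

```sql
-- Classic Hive has no row-level UPDATE (ACID transactional tables
-- added this later); instead, you rewrite the whole table or partition.
INSERT OVERWRITE TABLE user_info
SELECT user_id,
       name,
       CASE WHEN phone = '' THEN NULL ELSE phone END AS phone
FROM user_info;
```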

HBase

Unlike Hive, HBase is a NoSQL database that is widely used in Hadoop big data systems. HBase originated from a paper published by Google: Bigtable. HBase also builds on HDFS, but it does not operate on raw data files directly; it only uses HDFS as its storage layer. In other words, HBase uses Hadoop's HDFS to manage the persistent files that hold its data. HBase provides the ability to access data in real time, which makes up for early Hadoop's limitation of only processing data offline.

HBase table structure

The following figure shows the table structure of HBase:
[Figure: HBase table structure]

(1) Row Key

The Row Key is the unique identifier of a row of data. For example, data usually has a unique ID, which can serve as the Row Key. Note, however, that HBase stores rows in lexicographic order of the Row Key, so if your Row Keys do not start with evenly distributed digits or letters, storage is likely to concentrate on one machine, which greatly reduces query efficiency. In that case you need to design the stored Row Key, for example by adding a hash prefix in front of each ID, to spread the data out and improve query performance.
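One common way to build such a salted key can be sketched in HiveQL itself, assuming a hypothetical user_info table and Hive's built-in md5 function (available in newer Hive versions):

```sql
-- Prefix each ID with a short hash so that keys spread evenly
-- across RegionServers instead of clustering lexicographically.
SELECT concat(substr(md5(cast(user_id AS STRING)), 1, 4),
              '_',
              cast(user_id AS STRING)) AS row_key
FROM user_info;
```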

(2) Column Family

A column family can be regarded as a set of columns; it is used to manage several columns together and to optimize query speed. Column family names should therefore be as short as possible, and columns that are frequently queried together should be placed under the same column family. For user information, for example, a user's static attributes (name, age, phone, address, and so on) could go under one column family, and dynamic attributes (likes, favorites, comments, and so on) under another. The column families of an HBase table must be defined in advance, but the columns need not be; to add a column family later, you must first disable the table.

(3) Cell

A cell is a single storage unit, determined by Row Key, column family, column name, and version number.

(4) Timestamp

The timestamp marks the different versions of the same cell mentioned above.

(5) Region

A Region can be regarded as a collection of multiple rows of data. When a table's data gradually grows and needs distributed storage, the table is automatically split into multiple Regions, which are then assigned to different RegionServers.

Pros and cons of HBase

HBase's strength is real-time access. Real-time data is written directly into HBase, and clients access HBase through its API, enabling real-time reads and writes. Its NoSQL, column-oriented structure gives it fast lookups even at big data scale; this is what sets it apart from MapReduce-style batch processing.

In addition, it has other advantages.

Large capacity and high performance. An HBase table can store tens of billions of rows and thousands of columns, and query efficiency does not degrade significantly as the table grows. At the same time, HBase supports highly concurrent read and write operations.

Column-oriented storage, no fixed table schema. Traditional databases such as MySQL store data by row, and retrieval relies mainly on pre-built indexes; when the data volume is large, adding columns or updating indexes is very slow. In HBase, each column is stored separately: the cell at each row and column is an independent unit of storage, so the data itself effectively acts as the index. Because of this, columns can be added dynamically as data is written rather than being fixed when the table is created, and in actual storage different rows may have different sets of columns.

HBase's weakness is the flip side of this design: it does not support rich conditional queries, nor can it be queried with SQL statements directly. In practice, you generally can only look data up by Row Key.

Use of HBase and Hive

In actual work, HBase and Hive occupy different positions in our big data architecture, and they are usually used together.

Since HBase supports real-time random reads, it is mainly used for random, real-time queries over large volumes of detailed records: user information keyed by user ID, product information keyed by item ID, all kinds of transaction details, and so on. After data is collected, it is parsed from the real-time stream and written into HBase for querying. An HBase query generally retrieves a complete record by its key.

As we introduced earlier, Hive itself does not solve the storage problem; it presents structured data in HDFS, and its core function is querying those structured files. In daily work, Hive partitioned tables are usually used to hold data for a period of time and to run offline batch computations, such as statistical analysis of various data metrics.
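For example, a query against such a partitioned table usually restricts the partition column so that Hive scans only that day's files (table and column names are hypothetical):

```sql
-- Hive prunes partitions: only the files under dt='2021-02-24'
-- are read, instead of scanning the whole table.
SELECT COUNT(*) AS pv
FROM access_log
WHERE dt = '2021-02-24';
```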

Since both are important tools in the Hadoop system, Hive and HBase also provide mechanisms to access each other. Sometimes we want to query HBase data from Hive; we can do this by creating an external table in Hive that is associated with the corresponding HBase table.
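A hedged sketch of such a mapping, using Hive's HBase storage handler; the table, column family, and column names below are hypothetical:

```sql
-- Maps a Hive external table onto an existing HBase table named 'user'.
-- ':key' binds to the HBase Row Key; 'info:name' and 'info:age'
-- bind to columns in the 'info' column family.
CREATE EXTERNAL TABLE hbase_user (
  row_key STRING,
  name    STRING,
  age     INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:age')
TBLPROPERTIES ('hbase.table.name' = 'user');
```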

Summary

In this lecture, I briefly introduced two widely used tools in the Hadoop system: Hive and HBase. On the surface they both depend heavily on HDFS storage, but there are many differences between them; the functions they provide and the problems they solve are quite different. I mainly covered their basic architecture, their pros and cons, and where each is applied, without going into concrete usage.

To encourage you to explore some usage after class, here is a small assignment: in Hive, we usually use partitioned tables to store data by time period, so how would you write the CREATE statement for a Hive table partitioned by day? I hope you will think it through, and if you run into questions or gain insights while exploring, feel free to share them with me in the comments.

In the next lecture, let's discuss why big companies are all building cloud services. See you then.


Featured Message
**Lin

I want to ask about installing big data components. Hadoop needs to be installed on every node to achieve distributed computing and storage. Do HBase and Hive also need to be installed on every node for distributed storage and computing? What about components like Flume, Sqoop, and Kafka? In the instructional videos I have seen, some components are installed on only one node while others are copied to every node, and this has puzzled me.

Lecturer's reply: Not every component needs to be installed on every machine.
1. HDFS in Hadoop is used for storage, so it runs on many machines to hold massive data; it is installed on multiple nodes.
2. MapReduce in Hadoop is installed on multiple nodes so that many machines can compute over massive data together, efficiently and at high speed.
3. HBase is a distributed columnar NoSQL database built on HDFS in Hadoop, so it needs to be installed on multiple nodes.
4. Hive is installed on a single node. You can think of Hive as a client tool: it translates the SQL-like statements we write into MapReduce tasks and submits them to the MapReduce cluster to run. Since it is just a client, one installation is enough.
5. Flume collects log data. If your logs are spread across multiple machines, install a Flume agent on each of those machines to collect that machine's logs; if the logs live on one machine, install it there only.
6. Sqoop does data migration, for example between MySQL and HDFS. One Sqoop installation can connect to a given MySQL instance and HDFS cluster, and you can write and run multiple scripts against it, so there is no need to install it on every node.
7. Kafka in cluster mode is installed on multiple nodes: it stores message data in a distributed way, relying on multiple machines, so cluster mode means a multi-node installation.


Origin blog.csdn.net/xianyu120/article/details/114848524