Hive in the era of big data: an introduction to Hive

I recently looked into Hive and its related technologies and gained some experience, which I will share with you here.

  First of all, we need to know what Hive does. The following two passages describe its characteristics well:

  1. Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality, converting SQL statements into MapReduce jobs to run. Its advantages are a low learning cost and the ability to implement simple MapReduce statistics quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in data warehouses.

  2. Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for extract-transform-load (ETL) and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language, called HQL, that allows SQL-savvy users to query data. The language also allows developers familiar with MapReduce to plug in custom mappers and reducers for complex analysis tasks that the built-in mappers and reducers cannot handle.
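
  To make the idea of "a SQL statement converted into a MapReduce task" more concrete, here is a minimal pure-Python sketch of how a query such as SELECT word, COUNT(*) FROM docs GROUP BY word could be broken into map, shuffle, and reduce phases. The table name docs, the column word, and all function names are hypothetical; the plan Hive actually generates is far more elaborate and runs on Hadoop rather than in a single process.

```python
from collections import defaultdict

# Hypothetical rows of a one-column table docs(word); illustration only.
ROWS = ["hive", "hadoop", "hive", "hdfs", "hive", "hadoop"]


def map_phase(rows):
    """Map: emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for word in rows:
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group's values; COUNT(*) is a sum of ones."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_count(rows):
    """Chain the three phases the way a single MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


print(run_group_by_count(ROWS))  # {'hive': 3, 'hadoop': 2, 'hdfs': 1}
```

  The point of the sketch is only that a GROUP BY aggregate has a natural map/reduce shape, which is why such queries do not need a hand-written MapReduce program.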

  To understand Hive, you must first understand Hadoop and MapReduce. Readers who are unfamiliar with them can look them up online (for example, on Baidu).

  Using Hive's command line interface feels a lot like operating a relational database, but the two differ considerably. The main differences are as follows:

Hive and relational databases store their files in different systems: Hive uses Hadoop's HDFS (the Hadoop Distributed File System), while a relational database uses the server's local file system;
Hive's computing model is MapReduce, while a relational database uses a computing model of its own design;
Relational databases are designed for real-time queries, while Hive is designed for mining massive data sets and has very poor real-time performance; this difference means that Hive's application scenarios are very different from those of relational databases;
Hive can easily scale its storage and computing capacity, a property inherited from Hadoop; relational databases are much weaker in this regard.
  That is a macro-level comparison of Hive and relational databases. There are many more similarities and differences between the two, which I will describe one by one later in the article.

  Now for Hive's technical architecture. Let's first look at the following architecture diagram:

[Figure: Hive architecture diagram]

  As can be seen from the diagram above, Hadoop and MapReduce are the foundation of the Hive architecture. The architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, Web GUI, Metastore, and Driver (Compiler, Optimizer, and Executor). These components fall into two categories: server components and client components.

  Let's talk about the server components first:

  Driver component: this component includes the Compiler, Optimizer, and Executor. It parses, compiles, and optimizes the HiveQL (SQL-like) statements we write, generates an execution plan, and then calls the underlying MapReduce computing framework.

  Metastore component: the metadata service component. It stores Hive's metadata, which is kept in a relational database; the databases Hive supports for this include Derby and MySQL. Metadata is very important to Hive, so Hive allows the metastore service to be split out and installed on a remote server cluster, decoupling the Hive service from the metastore service and making Hive more robust. A detailed explanation is given in the metastore section below.

  Thrift service: Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates this service so that programs written in different languages can call Hive's interfaces.

  Client components:

  CLI: the command line interface.

  Thrift client: the Thrift client does not appear in the architecture diagram above, but many of Hive's client interfaces are built on it, including the JDBC and ODBC interfaces.

  Web GUI: Hive also lets clients access its services through web pages. This interface corresponds to Hive's HWI component (Hive Web Interface), and the HWI service must be started before it can be used.

  Now I will focus on the metastore component:

  Hive's metastore component is the central repository for Hive metadata. It consists of two parts: the metastore service and the backing data store. The backing store is a relational database, such as Derby, Hive's default embedded on-disk database, or MySQL. The metastore service sits on top of the backing store and interacts with the Hive service. By default, the metastore service and the Hive service are installed together and run in the same process. We can also split the metastore service out of the Hive service, install it independently in a cluster, and have Hive call it remotely. This lets us put the metadata layer behind a firewall: clients access the Hive service, which in turn connects to the metadata layer, giving better manageability and security. With a remote metastore, the metastore service and the Hive service run in different processes, which also helps keep Hive stable and improves the efficiency of the Hive service.
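
  The split between metadata (kept in a relational database) and the data itself (kept in files) can be illustrated with a small sketch. This is only a conceptual model under stated assumptions: sqlite3 stands in for Derby or MySQL, a temporary local directory stands in for HDFS, and the two-table layout (tbls, cols) is a simplified, hypothetical schema, not the real metastore schema.

```python
import os
import sqlite3
import tempfile

# Stand-in for the metastore's backing database (Derby or MySQL in Hive).
# The tbls/cols layout is a simplified, hypothetical schema.
metastore = sqlite3.connect(":memory:")
metastore.execute("CREATE TABLE tbls (name TEXT PRIMARY KEY, location TEXT)")
metastore.execute("CREATE TABLE cols (tbl TEXT, position INTEGER, name TEXT, type TEXT)")

# Stand-in for HDFS: a plain local directory holding the table's data file.
warehouse = tempfile.mkdtemp()
data_file = os.path.join(warehouse, "test.txt")
with open(data_file, "w") as f:
    f.write("sharpxiajun\n")

# Creating a table only records metadata; the data file is not touched.
metastore.execute("INSERT INTO tbls VALUES (?, ?)", ("test", data_file))
metastore.execute("INSERT INTO cols VALUES (?, ?, ?, ?)", ("test", 0, "value", "string"))

FIELD_DELIMITER = "\x01"  # Ctrl-A, Hive's default field delimiter for text tables


def select_all(table):
    """Resolve the table through the metastore, then read the file it points to."""
    (location,) = metastore.execute(
        "SELECT location FROM tbls WHERE name = ?", (table,)
    ).fetchone()
    columns = [
        row[0]
        for row in metastore.execute(
            "SELECT name FROM cols WHERE tbl = ? ORDER BY position", (table,)
        )
    ]
    with open(location) as f:
        return [dict(zip(columns, line.rstrip("\n").split(FIELD_DELIMITER))) for line in f]


print(select_all("test"))  # [{'value': 'sharpxiajun'}]
```

  Because the query side only needs a connection to the metadata store, that store can live in the same process (the embedded case) or behind a separate service (the remote case) without changing how the data files are read.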

  The execution process of Hive is shown in the following figure:

[Figure: Hive execution flow]

The picture is very clear, so I will not repeat it here.

Below is a simple example that shows how Hive is used.

First, we create an ordinary text file with only one line of data, and this line only stores a string. The command is as follows:

echo 'sharpxiajun' > /home/hadoop/test.txt

Then we create a Hive table:

hive -e "create table test (value string);"

Next, load the data:

hive -e "load data local inpath '/home/hadoop/test.txt' overwrite into table test;"

Finally, we query the table:

hive -e 'select * from test;'

  As you can see, Hive is simple to use, easy to get started with, and very SQL-like. Below I analyze in more depth how Hive differs from relational databases. Some readers may not fully follow this part yet, but it is worth raising early: once you have read it, many questions that come up in my later Hive articles will be clearer. The points are as follows:

In a relational database, a table's schema is enforced when data is loaded (the schema here refers to the format in which the data is stored in the database). If the data being loaded does not conform to the schema, the database refuses to load it. This is called "schema on write": the data is checked against the schema at load time. Hive behaves differently. It does not check data when loading it, nor does it modify the loaded data files; the data format is checked only when a query runs. This is called "schema on read". In practice, schema on write indexes columns and compresses data at load time, so loading is slow, but queries on the loaded data are fast. However, when the data is unstructured and its storage format is unknown, working with a relational database becomes much more cumbersome, and this is where Hive shows its advantages.
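
The contrast between validating data at load time and validating it at query time can be sketched in a few lines of Python. This is a toy model, not Hive code: the two-column schema, the sample lines, and the function names are all hypothetical, and mapping an unparsable field to None is an assumption modelled on the NULL a query engine would typically return for a malformed value.

```python
SCHEMA = (int, str)  # hypothetical table: (id INT, name STRING)
RAW_LINES = ["1\talice", "oops\tbob", "3\tcarol"]  # second line violates the schema


def parse_strict(line):
    """Cast each tab-separated field to its column type; raises ValueError on mismatch."""
    return tuple(cast(field) for cast, field in zip(SCHEMA, line.split("\t")))


def load_schema_on_write(lines):
    """Relational style: validate every row while loading; one bad row fails the load."""
    return [parse_strict(line) for line in lines]


def load_schema_on_read(lines):
    """Hive style: loading just keeps the raw lines, unchecked and unmodified."""
    return list(lines)


def query_schema_on_read(stored_lines):
    """Apply the schema only at query time; unparsable fields become None."""
    rows = []
    for line in stored_lines:
        row = []
        for cast, field in zip(SCHEMA, line.split("\t")):
            try:
                row.append(cast(field))
            except ValueError:
                row.append(None)
        rows.append(tuple(row))
    return rows


try:
    load_schema_on_write(RAW_LINES)
except ValueError as err:
    print("load rejected at write time:", err)

stored = load_schema_on_read(RAW_LINES)  # always succeeds, no checks performed
print(query_schema_on_read(stored))  # [(1, 'alice'), (None, 'bob'), (3, 'carol')]
```

Loading is cheap in the second style because nothing is parsed or indexed up front; the cost of interpreting the data is paid by every query instead.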
An important feature of a relational database is that it can update and delete a single row or a chosen set of rows. Hive does not support operations on specific rows; it only supports overwriting existing data and appending data. Hive also does not support transactions or indexes. Updates, transactions, and indexes are all relational database features that Hive, at the time of writing, neither supports nor plans to support. The reason is that Hive is designed to process massive data, where full scans are the norm and operating on a few specific rows is very inefficient. To perform an update, Hive transforms the original table's data with a query and stores the result in a new table, which is very different from the update operation of a traditional database.
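
The "overwrite, append, or rewrite through a query" model can be sketched as follows. Python lists stand in for tables, and every name here (users, insert_overwrite, insert_into, update_by_rewrite) is hypothetical and chosen for illustration; the sketch only mirrors the behaviour described above and is not how Hive is implemented.

```python
def insert_overwrite(table, rows):
    """Replace the table's whole contents (overwrite the original data)."""
    table[:] = rows


def insert_into(table, rows):
    """Append rows to whatever the table already holds."""
    table.extend(rows)


def update_by_rewrite(source, predicate, transform):
    """Emulate a row-level update: scan every row and emit a complete new table."""
    return [transform(row) if predicate(row) else row for row in source]


users = []
insert_overwrite(users, [(1, "alice"), (2, "bob")])
insert_into(users, [(3, "carol")])

# A change to the single row with id 2 still becomes a full scan into a new table.
users_v2 = update_by_rewrite(
    users,
    predicate=lambda row: row[0] == 2,
    transform=lambda row: (row[0], "robert"),
)

print(users_v2)  # [(1, 'alice'), (2, 'robert'), (3, 'carol')]
```

Note that the original users table is left untouched, and that the cost of the "update" is proportional to the size of the whole table, not to the number of rows changed.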
Hive can also contribute to real-time querying on Hadoop by integrating with HBase. HBase supports fast queries but does not support SQL-like statements; Hive can provide HBase with a SQL-parsing layer, so that you can operate on an HBase database with SQL-like statements.
  That is all on Hive for today. I plan to write three articles about Hive, and this is the first. The next one will mainly cover the data models Hive supports, such as databases, tables, partitions, and buckets, as well as Hive's file storage formats and supported data types. The third article will cover the use of HiveQL, query optimization techniques that draw on MapReduce, custom functions, and examples of how we use Hive in company projects.

  When Jack Ma (Ma Yun) retired, he said that the Internet has entered the era of big data and that big data is where the Internet is heading. Hadoop is the core technology of the big data era, but working with Hadoop and MapReduce requires specialist skills, so Facebook developed the Hive framework on top of them. After all, far more people in the world understand SQL than understand Java. Hive is a good entry point for learning Hadoop-related technologies, and those teaching themselves Hadoop development can start with Hive.
