Hive

Original link: http://www.cnblogs.com/sharpxiajun/archive/2013/06/02/3114180.html

First of all, we need to know what hive does. The following paragraphs describe the characteristics of hive very well:

  1. Hive is a data warehouse tool built on Hadoop. It can map a structured data file onto a database table and provides full SQL query functionality, converting SQL statements into MapReduce jobs for execution. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it very well suited to statistical analysis in data warehouses.

  2. Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for extract-transform-load (ETL) and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language, called HQL, that allows SQL-savvy users to query the data. The language also allows developers familiar with MapReduce to plug in custom mappers and reducers for complex analysis tasks that the built-in mappers and reducers cannot handle.
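To see why this matters, consider the kind of statistic that is a one-line `group by` in HQL: without hive, you would write a mapper and a reducer by hand. Below is a minimal sketch of that map/shuffle/reduce flow in plain Python (a local simulation of the idea, not the actual Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # map phase: emit a (word, 1) pair for each word in the input line
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # reduce phase: sum all counts collected for one key
    return (word, sum(counts))

def run_job(lines):
    # shuffle/sort phase: sort the mapper output by key, as Hadoop would,
    # then group and hand each key's values to the reducer
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(k, (c for _, c in g)) for k, g in groupby(pairs, key=itemgetter(0))]

print(run_job(["hive on hadoop", "hive is sql on hadoop"]))
# → [('hadoop', 2), ('hive', 2), ('is', 1), ('on', 2), ('sql', 1)]
```

In HQL the same statistic is roughly `select word, count(*) from words group by word` — that gap in effort is the "low learning cost" the paragraph above describes.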

  To understand hive, you must first understand hadoop and mapreduce. If you are not familiar with them, it is worth looking up an introduction first.

  Using hive's command line interface feels a lot like operating a relational database, but there are still big differences between hive and relational databases. Let me compare them, as follows:

  1. Hive differs from relational databases in its storage system: hive stores data in Hadoop's HDFS (the Hadoop Distributed File System), while a relational database stores data in the server's local file system;
  2. The computing model hive uses is mapreduce, while a relational database uses a computing model of its own design;
  3. Relational databases are designed for real-time query workloads, while hive is designed for data mining over massive data sets, so its real-time performance is poor; this difference in latency leads to very different application scenarios for hive and relational databases;
  4. Hive can easily scale out its storage and computing capacity, a property it inherits from hadoop; relational databases are much weaker in this regard.

  The above is a comparison of the differences between hive and relational databases from a macro perspective. There are many similarities and differences between hive and relational databases. I will describe them one by one later in the article.

  Let me talk about the technical architecture of hive. Let's first look at the following architecture diagram:

(architecture diagram not reproduced here)

  As can be seen from the figure above, hadoop and mapreduce form the foundation of the hive architecture. The hive architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, WEB GUI, metastore and Driver (Compiler, Optimizer and Executor). These components fall into two categories: server components and client components.

   Let's talk about the server components first:

  Driver component: this component includes the Compiler, Optimizer, and Executor. Its job is to parse, compile, and optimize the HiveQL (SQL-like) statements we write, generate an execution plan, and then invoke the underlying mapreduce computing framework.

  Metastore component: the metadata service component. It stores hive's metadata, which lives in a relational database; the relational databases hive supports are derby and mysql. Metadata is essential to hive, so hive allows the metastore service to be split out and installed on a remote server cluster, decoupling the hive service from the metastore service and improving the robustness of hive. A detailed explanation is given later in the metastore section.

  Thrift service: thrift is a software framework developed by facebook for building scalable, cross-language services. hive integrates this service so that programs in different programming languages can call the hive interface.

  Client components:

  CLI: the command line interface.

  Thrift client: the Thrift client is not drawn in the architecture diagram above, but many of hive's client interfaces are built on top of it, including the JDBC and ODBC interfaces.

  WEBGUI: a hive client that provides access to hive's services through a web page. This interface corresponds to hive's hwi component (hive web interface), and the hwi service must be started before use.

  Now I will focus on the metastore component, as follows:

  Hive's metastore component is the centralized store for hive metadata. It consists of two parts: the metastore service and the backing data store. The backing store medium is a relational database, such as derby (hive's default embedded database) or mysql. The metastore service is the service component built on top of that backing store, and it is what the hive service interacts with. By default, the metastore service and the hive service are installed together and run in the same process. The metastore service can also be split out from the hive service, installed independently in a cluster, and called remotely by hive. This lets us put the metadata layer behind a firewall: a client only needs to reach the hive service to reach the metadata layer, which gives better manageability and security. Running the remote metastore service in a separate process from the hive service also improves hive's stability and the efficiency of the hive service.
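As a concrete illustration, a remote metastore setup is typically wired up in hive-site.xml roughly like this (the host names, port other than the conventional 9083, database name, and credentials below are placeholders, not values from this article):

```xml
<!-- client side: point hive at a remote metastore service -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>

<!-- metastore side: back the service with a mysql database instead of embedded derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db-host.example.com:3306/hive_metadata</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```

With `hive.metastore.uris` set, the hive service talks to the metastore over thrift rather than opening the metadata database directly, which is exactly the decoupling described above.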

  The execution flow of Hive is shown in the following figure:
(execution-flow diagram not reproduced here)
The picture describes the flow clearly, so I will not repeat it here.

Below I will show you a simple example to see how hive operates.

First, we create an ordinary text file containing just one line, which stores a single string. The command is as follows:

echo 'sharpxiajun' > /home/hadoop/test.txt

 Then we create a hive table:

hive -e "create table test (value string);"

 Next load the data:

load data local inpath '/home/hadoop/test.txt' overwrite into table test;

 Finally, we query the table:

hive -e 'select * from test'
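It is worth noting what the load actually did: hive copied test.txt into the table's directory under the warehouse path on HDFS (for a default configuration, something like /user/hive/warehouse/test/), and a table scan simply reads every file in that directory. That "a table is just a directory of files" idea can be sketched locally in Python (the local filesystem stands in for HDFS here; paths are made up for illustration):

```python
import os
import tempfile

def read_table(table_dir):
    # hive-style table scan: every file in the table's directory is table
    # data, and every line in those files is one row
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name)) as f:
            rows.extend(line.rstrip("\n") for line in f)
    return rows

# simulate "load data local inpath ... into table test":
# dropping a file into the table directory makes its lines queryable rows
warehouse = tempfile.mkdtemp()
table_dir = os.path.join(warehouse, "test")
os.mkdir(table_dir)
with open(os.path.join(table_dir, "test.txt"), "w") as f:
    f.write("sharpxiajun\n")

print(read_table(table_dir))  # → ['sharpxiajun']
```

This is also why loading is so cheap for hive: no parsing or indexing happens at load time, files are just moved into place.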

   As you can see, hive is simple and easy to get started with, and operating it is very similar to sql. Next, I will analyze the differences between hive and relational databases in more depth. Some readers may not fully understand this part yet, but it is worth raising in advance; I will go further into hive in future articles, and readers who did not quite follow the first time can come back to this part, at which point many questions will become much clearer:

  1. In a relational database, the table's schema is enforced at data-load time (the schema here refers to the format in which the database stores the data). If the data being loaded does not conform to the schema, the relational database refuses to load it. This is called "schema on write": the data is checked and validated against the schema as it is loaded. Hive is different. It does not check the data when loading it, nor does it change the loaded data file; the format check happens during the query instead. This is called "schema on read". In practice, schema on write builds column indexes and compresses the data at load time, so loading is slow, but once the data is loaded, queries are very fast. However, when the data is unstructured and its storage format is unknown, working with a relational database is much more troublesome, and that is where hive shows its advantage.
  2. An important feature of a relational database is that it can update and delete individual rows. Hive does not support operations on specific rows: its data operations only support overwriting existing data and appending data. Hive also does not support transactions or indexes. Updates, transactions, and indexes are relational-database features that hive does not support and does not intend to support. The reason is that hive is designed to process massive data, where scanning all of the data is the norm and operating on a specific subset of rows would be very inefficient. For an update, hive transforms the original table's data with a query and stores the result in a new table, which is very different from the update operation of a traditional database.
  3. Hive can also contribute to real-time querying on hadoop by integrating with hbase. hbase can perform fast lookups, but it does not support SQL-like statements; hive can provide hbase with a sql-parsing shell, so that hbase can be operated with SQL-like statements.
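The schema-on-read behavior in point 1 can be made concrete: hive keeps whatever bytes were loaded, and only when a query reads a column does it try to interpret them, producing NULL where the bytes do not fit the declared type. A small Python sketch of that behavior (the two-column `(name string, age int)` schema and the tab-delimited format are made up for illustration):

```python
def read_row(line, schema):
    # schema-on-read: parse each field at query time; a field that does not
    # match its declared type becomes NULL (None) instead of failing the load
    fields = line.split("\t")
    row = []
    for value, typ in zip(fields, schema):
        if typ is int:
            try:
                row.append(int(value))
            except ValueError:
                row.append(None)  # hive would return NULL for this cell
        else:
            row.append(value)
    return row

raw = ["alice\t30", "bob\tnot-a-number"]  # loaded as-is, never validated
table = [read_row(line, (str, int)) for line in raw]
print(table)  # → [['alice', 30], ['bob', None]]
```

A schema-on-write system would have rejected the second line at load time; hive accepts it and surfaces the problem only when the column is actually queried.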
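Point 2's "transform via query into a new table" pattern is worth a sketch too: what a relational database does with `UPDATE ... SET`, hive expresses as an `INSERT OVERWRITE TABLE ... SELECT ...` that scans everything, transforms it, and rewrites the table's contents wholesale. A minimal Python sketch of that rewrite-instead-of-update pattern (the employees table and salary column are made up for illustration):

```python
def insert_overwrite(table, transform):
    # hive-style "update": scan every row, apply the transformation,
    # and replace the table's contents wholesale with the result
    return [transform(row) for row in table]

employees = [{"name": "alice", "salary": 100},
             {"name": "bob", "salary": 200}]

# the effect of: UPDATE employees SET salary = salary * 2
# (a statement hive does not support) expressed as a full rewrite
employees = insert_overwrite(employees, lambda r: {**r, "salary": r["salary"] * 2})
print(employees)
# → [{'name': 'alice', 'salary': 200}, {'name': 'bob', 'salary': 400}]
```

Cheap for a full-table change over massive data, but clearly the wrong tool for touching one row, which is exactly the trade-off the paragraph above describes.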

  That is it for hive today. I plan to write three articles about hive, and this is the first. The next one will mainly cover the data models hive supports, such as databases, tables, partitions, and buckets, as well as hive's file storage formats and the data types hive supports. The third will cover the use of hiveQL, together with mapreduce query-optimization techniques, custom functions, and examples of how we use hive in company projects.

  When Jack Ma retired, he said that the Internet has entered the era of big data: big data is the trend of the Internet, and hadoop is the core technology of the big-data era. But hadoop and mapreduce are too specialized to operate directly, so facebook developed the hive framework on top of them. After all, far more people in the world understand sql than understand java. Hive can be said to be a gateway into hadoop-related technologies, and those teaching themselves hadoop can start with hive.
