A brief introduction to Hive architecture and Metastore functions

Over the past two days I have been investigating how to use Java to read the data in a Hive table through the Hive Metastore alone (without going through HiveServer2), ideally with direct SQL query support. After much searching and experimenting, I finally concluded that this approach does not work. Along the way I studied Hive's internal architecture, which mostly meant reading the official documentation. I am recording it here before I forget.

There are two main components in Hive: HiveServer2 and the Hive Metastore. The former provides DML and query services to clients, while the latter records metadata about the data and provides the basis for generating SQL execution plans. (Why can't you query the data in a table from the Metastore? Because the data simply isn't there.)


Hive architecture

[Figure: Hive architecture diagram]

This diagram is taken from the official Hive documentation. It shows the main components and their interaction with Hadoop (or Spark). The main components are as follows:

  • UI: The interface through which users submit queries and other operations to the system. As of 2011 it supports two interfaces: a command line and a web GUI.
  • Driver: Receives the user's request (including requests arriving via the JDBC/ODBC interfaces), creates a session for it, executes the request, and returns the result.
  • Compiler: Parses the query with the help of metadata from the Metastore, performs semantic analysis on the query blocks and expressions, and generates an execution plan.
  • Metastore: Stores the structural and partition information of all tables in the warehouse, including column and column-type information, the serializers and deserializers needed to read and write the data, and the location of the data on HDFS. The Metastore stores metadata about the data, not the data itself.
  • Execution Engine: This component is responsible for executing the execution plan generated by the compiler (a directed acyclic graph composed of multiple stages). It manages the dependencies between different stages in the execution plan and executes the stages in the appropriate system components.

The figure above also shows the execution flow of a typical query. A query is submitted to the Driver through the interactive interface (step 1). The Driver creates a session for the query and sends it to the Compiler to generate an execution plan (step 2). The Compiler obtains the necessary metadata from the Metastore (steps 3 and 4). The plan it produces is a directed acyclic graph of stages, where each stage is a MapReduce job. The Execution Engine manages these stages and submits them to the appropriate components (steps 6, 6.1, 6.2 and 6.3). In the figure the stages are submitted to an early version of Hadoop (no longer used today); they can also be submitted to Spark. When a task runs, it reads the table's data directly from HDFS using the deserializer associated with the table, and writes its intermediate output to temporary files on HDFS using the serializer, for later stages to consume. For DML operations, the final temporary file is moved to the table's location; this scheme ensures that dirty (partially written) data is never read. For queries, the Execution Engine reads the final file directly from HDFS and returns the result to the Driver (steps 7, 8 and 9).
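The hand-offs described above can be sketched as a toy simulation. All class and method names here are illustrative, not real Hive APIs, and the "plan" is just a list rather than a DAG:

```python
# Toy simulation of Hive's query flow: Driver -> Compiler (-> Metastore) -> Execution Engine.

class Metastore:
    def __init__(self):
        # table name -> metadata (columns, HDFS location); no row data lives here
        self.tables = {"sales": {"columns": ["id", "amount"],
                                 "location": "/warehouse/sales"}}

    def get_table(self, name):
        return self.tables[name]

class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, query, table):
        meta = self.metastore.get_table(table)      # steps 3 and 4: fetch metadata
        # A real plan is a DAG of MapReduce stages; here it is a flat list.
        return [("map", meta["location"]), ("reduce", "/tmp/intermediate")]

class ExecutionEngine:
    def run(self, plan):
        # Submits each stage to the right component (Hadoop/Spark in real Hive).
        return [f"ran {stage} on {target}" for stage, target in plan]

class Driver:
    def __init__(self):
        self.compiler = Compiler(Metastore())
        self.engine = ExecutionEngine()

    def execute(self, query, table):
        plan = self.compiler.compile(query, table)  # step 2: build the plan
        return self.engine.run(plan)                # steps 6-9: run and return

results = Driver().execute("SELECT * FROM sales", "sales")
print(results)
```

The real engine submits MapReduce or Spark jobs rather than calling Python functions; the sketch only shows how the Driver, Compiler, Metastore, and Execution Engine pass work to each other.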


Hive data model

Data in Hive is organized into:

  • Table: A Hive table is similar to a relational database table. It supports operations such as filter, project, join, and union. The table's data is stored directly on HDFS. External tables are also supported.
  • Partitions: A table's data can be divided into one or more partitions. Partitions determine how the data is laid out on HDFS: rows with the same partition value are stored in the same directory, and multiple partition keys produce nested directories. Time is commonly used as the partition key.
  • Buckets: The data within a partition can be further divided into buckets according to the hash of some column, each bucket being stored as a separate file. The resulting layout is that a partition's data lives in a single directory containing many bucket files. Bucketing improves the efficiency of certain queries, such as sampling and joins.
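A minimal sketch of this layout, assuming a table partitioned by a date column and bucketed by user id. The modulo hash and path scheme here are simplifications; Hive's actual bucketing hash depends on the column type:

```python
# Sketch: rows of a table partitioned by "dt" and bucketed by "user_id" into 4 buckets.
NUM_BUCKETS = 4

def bucket_of(user_id: int) -> int:
    # Simplified stand-in for Hive's bucketing hash function.
    return user_id % NUM_BUCKETS

def storage_path(table: str, dt: str, user_id: int) -> str:
    # Same partition value -> same directory; the bucket decides the file within it.
    return f"/warehouse/{table}/dt={dt}/bucket_{bucket_of(user_id):05d}"

print(storage_path("sales", "2019-01-01", 10))  # /warehouse/sales/dt=2019-01-01/bucket_00002
```

Two rows with the same `dt` always land in the same directory; two rows with the same `user_id` always land in the same bucket file, which is what makes sampling and bucketed joins cheap.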

Metastore

Original design intention

Hive Metastore was originally designed to solve the problem of managing table and partition metadata in Hive. This metadata includes table names, column names, data types, partition keys, storage locations, and so on, and it needs to be stored and managed somewhere; the Metastore is the component designed for that job.

Without the Metastore, Hive would store table and partition metadata in metadata files on HDFS, in what is often called the Hive metadata storage directory. Hive would create a directory with the same name as the table and, inside it, a file named "_metadata" containing the table's metadata (table name, column names, data types, etc.). For partitions, the partition key and value would be recorded in the table's metadata file, but the partition's storage location would not. Without the Metastore, users would therefore have to manage partition storage locations by hand, which makes partition management inconvenient and inefficient.

Specifically, the user would need to create a directory in HDFS for each partition and store that partition's data under it. For example, for a table named "sales" containing a partition named "2019", the storage location can be managed manually as follows:

  1. Create a directory named "sales" in HDFS, which will be used to store the data of the "sales" table.
  2. Create a directory named "2019" under the "sales" directory, which will be used to store the data of the "2019" partition.
  3. Store the data of the "2019" partition in the "sales/2019" directory.
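The three steps above can be sketched against a local filesystem standing in for HDFS (in practice you would use `hdfs dfs -mkdir` and `hdfs dfs -put` instead of local file operations):

```python
import os
import tempfile

# A local temp directory stands in for the HDFS warehouse root.
warehouse = tempfile.mkdtemp()

table_dir = os.path.join(warehouse, "sales")      # step 1: table directory
partition_dir = os.path.join(table_dir, "2019")   # step 2: partition directory
os.makedirs(partition_dir)                        # creates both levels at once

# step 3: store the partition's data inside its directory
with open(os.path.join(partition_dir, "part-00000"), "w") as f:
    f.write("1,100\n2,250\n")

print(os.path.exists(os.path.join(warehouse, "sales", "2019", "part-00000")))  # True
```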

When manually managing partition storage locations, you need to pay attention to the following points:

  1. The naming of the partition directory must be the same as the value of the partition key. For example, the directory name of the "2019" partition must be "2019".
  2. The partition directory must be located under the storage directory of the table. For example, the directory of the "2019" partition must be located under the storage directory of the "sales" table.
  3. The permissions of the partition directory must be the same as the storage directory of the table to ensure that Hive can access the data in the directory.
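The first two rules can be expressed as a small check. This is only a sketch; rule 3 (matching permissions) would require querying the real filesystem, so it is omitted here:

```python
import os

def valid_partition_dir(table_dir: str, partition_value: str, partition_dir: str) -> bool:
    # Rule 1: the directory name must equal the partition value.
    if os.path.basename(partition_dir) != partition_value:
        return False
    # Rule 2: the partition directory must live directly under the table directory.
    if os.path.dirname(partition_dir) != table_dir:
        return False
    return True

assert valid_partition_dir("/warehouse/sales", "2019", "/warehouse/sales/2019")
assert not valid_partition_dir("/warehouse/sales", "2019", "/warehouse/other/2019")
```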

How does Metastore manage metadata and partitions?

  1. Stores table and partition metadata: table names, column names, data types, partition keys, storage locations, and so on. When users create or modify tables and partitions, the Metastore updates this metadata automatically.
  2. Supports metadata queries: users can query table and partition metadata through Hive SQL statements; the Metastore translates these queries into operations on the underlying metadata store, making metadata lookups efficient.
  3. Supports metadata modification: users can modify table and partition metadata through Hive SQL statements, which the Metastore likewise translates into operations on the metadata store.
  4. Supports metadata versioning and transaction management, ensuring the consistency and reliability of the metadata.
  5. Manages partition storage locations: when users create or modify partitions, the Metastore records the correct storage location automatically, so users do not need to manage partition locations by hand. This makes partition management more efficient and reliable.
  6. Supports partition queries: users can query partition data through Hive SQL statements; the queries are translated into HDFS file-system operations for efficient access to partitioned data.
  7. Supports partition-level permission control: users can set access permissions on partitions through Hive SQL statements to protect partition data.

In short, Hive Metastore can automatically manage the storage location and metadata information of partitions, thereby improving the efficiency and reliability of partition management. At the same time, Hive Metastore also supports functions such as partition query and permission control, providing users with a more comprehensive partition management solution and better query efficiency.
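As a rough illustration of the bookkeeping described above, here is a toy in-memory metastore. The real Metastore persists this state in a relational database behind a Thrift service, and every class and method name below is invented for the example:

```python
# Toy in-memory metastore: create/alter/query metadata, auto-derive partition locations.

class ToyMetastore:
    def __init__(self):
        self.tables = {}

    def create_table(self, name, columns, location):
        self.tables[name] = {"columns": columns,
                             "location": location,
                             "partitions": {}}

    def add_partition(self, table, value):
        # The metastore records the partition's storage location automatically,
        # so the user never manages directories by hand.
        loc = f"{self.tables[table]['location']}/{value}"
        self.tables[table]["partitions"][value] = loc
        return loc

    def alter_table(self, name, **changes):
        # e.g. alter_table("sales", columns=[...]) updates the stored schema.
        self.tables[name].update(changes)

    def get_table(self, name):
        return self.tables[name]

ms = ToyMetastore()
ms.create_table("sales", ["id", "amount"], "/warehouse/sales")
print(ms.add_partition("sales", "2019"))   # /warehouse/sales/2019
print(ms.get_table("sales")["columns"])    # ['id', 'amount']
```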

Metadata objects

  • Database: The namespace of the table.
  • Table: The table's metadata contains all columns, the owner, storage information, and serialization/deserialization information, and may also contain user-specified key-value pairs. Storage information includes the location of the table's data, the input and output file formats, and bucketing information. All of this metadata is generated and stored when the table is created.
  • Partition: Each partition has its own column information, serialization/deserialization information, and storage information. This allows the table schema to be modified without affecting older partitions.
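A sketch of these objects as plain data structures, showing why per-partition schema information lets the table schema evolve without touching old partitions (field and class names are illustrative, not Hive's actual model classes):

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    values: list        # partition key values, e.g. ["2019"]
    columns: list       # schema frozen at partition-creation time
    location: str
    serde: str = "LazySimpleSerDe"   # each partition can carry its own serde

@dataclass
class Table:
    name: str
    columns: list       # the *current* table schema
    location: str
    partitions: list = field(default_factory=list)

t = Table("sales", ["id", "amount"], "/warehouse/sales")
# The partition snapshots the schema as it exists right now.
t.partitions.append(Partition(["2019"], list(t.columns), "/warehouse/sales/2019"))

# The table schema evolves; the old partition is untouched.
t.columns.append("region")
print(t.columns)                 # ['id', 'amount', 'region']
print(t.partitions[0].columns)   # ['id', 'amount']
```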

A few small questions

1. What is the difference between Hive Metastore and HiveServer2?

Their functions differ: HiveServer2 provides data query services to external clients, while the Metastore manages the metadata of Hive tables and provides metadata query services.

2. Can the data in the table be queried through Hive Metastore?

No. Table data cannot be queried through the Hive Metastore itself, because the Metastore is only Hive's metadata storage service: it manages the metadata of Hive tables (table structure, partition information, table location, etc.) rather than storing the actual table data. To query the data in a Hive table you need HiveServer2. The query service obtains the table's metadata from the Metastore (structure, partitions, location) and then reads the actual data from the storage system based on that information. So the Metastore is an essential component of the query service, but it cannot itself query table data.
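The point can be illustrated with a sketch: given only what a metastore returns (schema and location), the rows themselves still have to be fetched from the storage layer. A local temp directory stands in for HDFS, and the CSV format is an arbitrary choice for the example:

```python
import os
import tempfile

# Stand-in for data files on HDFS.
datadir = tempfile.mkdtemp()
with open(os.path.join(datadir, "part-00000"), "w") as f:
    f.write("1,100\n2,250\n")

# This dict is all a metastore can give you: schema + location, no rows.
metadata = {"columns": ["id", "amount"], "location": datadir}

def read_table(meta):
    # The query engine, not the metastore, does this part: go to the
    # location and deserialize the files found there.
    rows = []
    for fname in sorted(os.listdir(meta["location"])):
        with open(os.path.join(meta["location"], fname)) as f:
            rows += [line.strip().split(",") for line in f]
    return rows

print(read_table(metadata))   # [['1', '100'], ['2', '250']]
```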

3. Why can Spark read the data in a table through the Hive Metastore?

Spark can read data in Hive tables through the Hive Metastore because Spark SQL and Hive are both built on the Hadoop ecosystem and share Hadoop's storage and compute resources. Specifically, Spark SQL obtains the metadata of a Hive table from the Metastore; in other words, Spark learns what data it needs and where that data is stored. The Spark tasks then read the data from HDFS directly, without going through HiveServer2, and the entire read, write, and compute process happens inside the cluster, which keeps the data secure.

Source: blog.csdn.net/yy_diego/article/details/130887242