2. Hive concepts explained in detail: architecture, file read/write mechanism, and data storage

Apache Hive series of articles

1. Introduction and deployment of apache-hive-3.1.2 (three deployment modes: embedded, local, and remote), with detailed verification
2. Hive concepts explained in detail: architecture, file read/write mechanism, and data storage
3. Hive usage examples explained in detail: table creation, data types, internal and external tables, partitioned tables, and bucketed tables
4. Hive usage examples explained in detail: transaction tables, views, materialized views, and DDL (database, table, and partition) management
5. Hive load, insert, and transaction table usage, with detailed examples
6. Hive select (GROUP BY, ORDER BY, CLUSTER BY, SORT BY, LIMIT, union, CTE) and join usage, with detailed examples
7. Hive shell client and property configuration, built-in operators, and functions (built-in operators and custom UDFs)
8. Hive relational operations, logical operations, mathematical operations, numeric operations, date functions, conditional functions, and string functions, with syntax and usage examples
9. Hive explode, Lateral View, aggregate functions, window functions, and sampling functions, explained in detail
10. Hive comprehensive example: multi-delimiter data (RegexSerDe), URL parsing, and common row/column conversion functions (case when, union, concat, and explode), with detailed usage examples
11. Hive comprehensive application examples: JSON parsing, window function applications (consecutive logins, cascading accumulation, topN), and zipper tables
12. Hive optimization: file storage format and compression format optimization, and job execution optimization (execution plans, MR properties, joins, the optimizer, predicate pushdown, and data skew), with detailed introduction and examples
13. Java API access to Hive, with operation examples



This article introduces Hive's architecture, components, data model, and file read/write mechanism.
It is divided into two parts: an introduction to the architecture and components, and an explanation of the file read/write mechanism.

Some of the pictures in this article come from the Internet.

1. Introduction to architecture and components

1. Hive overall architecture diagram

(Figure: Hive overall architecture)

2. Hive components

  • User interfaces
    include the CLI, JDBC/ODBC, and a WebGUI.
    The CLI (command line interface) is the shell command line.
    Hive's Thrift server allows external clients to interact with Hive over the network, similar to the JDBC or ODBC protocols.
    The WebGUI accesses Hive through a browser.
  • Metadata storage
    Metadata is usually stored in a relational database such as MySQL or Derby. Hive metadata includes table names, the columns and partitions of each table with their attributes, table attributes (whether it is an external table, etc.), and the directory where each table's data is located.
  • Driver
    includes the syntax parser, plan compiler, optimizer, and executor. Together they take an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation; the generated query plan is stored in HDFS and subsequently invoked by the execution engine.
  • Execution engine
    Hive itself does not process data files directly; it delegates processing to an execution engine.
    Currently, Hive supports three execution engines: MapReduce, Tez, and Spark. The engine can be switched per session, as the sketch below shows.
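
The engine is selected with the hive.execution.engine property. A minimal sketch (assuming Tez is actually installed on the cluster; in Hive 3.x the valid values are mr, tez, and spark):

-- Print the execution engine configured for this session
SET hive.execution.engine;

-- Switch this session to Tez
SET hive.execution.engine=tez;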

3. Hive data model (Data Model)

Hive's data model describes how data is organized, stored, and operated on.
It is similar to the table structure of an RDBMS, but it also has models of its own.
By granularity, the data in Hive can be divided into three levels: Table, Partition, and Bucket.
(Figure: Hive data model)

1) Databases

Hive, as a data warehouse, contains databases (Schema), and each database has its own tables. The default database is default.
Hive data is stored on HDFS. There is a root directory by default, which is specified by the parameter hive.metastore.warehouse.dir in hive-site.xml. The default value is /user/hive/warehouse.
Therefore, the storage path of the database in Hive on HDFS is: ${hive.metastore.warehouse.dir}/databasename.db.
For example, the storage path of the database named test is: /user/hive/warehouse/test.db
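
A minimal sketch (the database name test is just an example); the Location field in the DESCRIBE output shows the HDFS path:

-- Creates a directory at ${hive.metastore.warehouse.dir}/test.db
CREATE DATABASE IF NOT EXISTS test;
-- Shows the database location, e.g. hdfs://.../user/hive/warehouse/test.db
DESCRIBE DATABASE test;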

2) Tables

Hive tables are analogous to tables in relational databases. The data of a Hive table is stored in the Hadoop file system, while the metadata describing the table is stored in the RDBMS.
In Hadoop, data is usually kept in HDFS, although it can live in any Hadoop file system, including the local file system or S3.
Hive has two types of tables:

  • Managed Table (internal table)
  • External Table

When creating a table, the default is a managed table. The storage path of table data on HDFS is: ${hive.metastore.warehouse.dir}/databasename.db/tablename.
For example, the storage path of the t_user table under the test database is: /user/hive/warehouse/test.db/t_user
(Figure: table directory on HDFS)
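
A minimal sketch of the two table types (the table, columns, and external path are illustrative):

-- Managed table: data lives under the warehouse directory and is removed by DROP TABLE
CREATE TABLE t_user (id INT, name STRING);

-- External table: the data at the given location survives DROP TABLE
CREATE EXTERNAL TABLE t_user_ext (id INT, name STRING)
LOCATION '/data/t_user_ext';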

3) Partitions

Partitioning is an optimization of Hive tables.
Partitioning divides a table into different partitions based on the value of a partition column (such as a date column), so that queries against a specific partition run faster.
At the storage level, partitions appear as subfolders under the table directory.
Each subfolder represents one partition and is named: partitioncolumn=partitionvalue.
Hive also supports creating partitions inside partitions, known as multi-level partitioning.
(Figure: partition subdirectories on HDFS)
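
A minimal sketch of a two-level (multi-level) partitioned table; the table and column names are illustrative:

CREATE TABLE t_log (id INT, msg STRING)
PARTITIONED BY (day STRING, province STRING);
-- On HDFS each partition becomes a nested subfolder, e.g.
-- .../t_log/day=2022-10-17/province=beijing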

4) Buckets

Bucketing is another optimization of Hive tables.
Bucketing divides a table's data files into a specified number of smaller files, using a hash computed over the value of a table column (such as an ID column).
Bucketing rule: hashfunc(ID) % number_of_buckets; rows whose hash gives the same remainder go into the same file.
The advantages of bucketing are that it can speed up join queries and makes sampling queries convenient. On HDFS, a bucketed table appears as the data in the table directory being hashed into multiple files.
(Figure: bucket files on HDFS)
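
A minimal sketch (the column and bucket count are illustrative):

-- Rows are assigned to files by hash(id) % 4
CREATE TABLE t_user_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;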

2. Hive's file read/write mechanism

1. SerDe overview

SerDe is short for Serializer and Deserializer, and is used for serialization and deserialization. Serialization converts an object into a byte sequence; deserialization converts a byte sequence back into an object.
Hive uses SerDe (together with FileFormat) to read and write row objects.

# Read path
HDFS files --> InputFileFormat --> <key,value> --> Deserializer --> Row Object
# Write path
Row Object --> Serializer --> <key,value> --> OutputFileFormat --> HDFS files

# Note that the "key" part is ignored when reading, and is always a constant when writing; the row object is carried in the "value".
# A table's SerDe information can be viewed with "desc formatted tablename". The default SerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) looks like this:

0: jdbc:hive2://server4:10000> desc formatted t_user;
INFO  : Compiling command(queryId=alanchan_20221017153821_c8ac2142-aacf-479c-a8f2-e040f2f791cb): desc formatted t_user
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col_name, type:string, comment:from deserializer), FieldSchema(name:data_type, type:string, comment:from deserializer), FieldSchema(name:comment, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=alanchan_20221017153821_c8ac2142-aacf-479c-a8f2-e040f2f791cb); Time taken: 0.024 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=alanchan_20221017153821_c8ac2142-aacf-479c-a8f2-e040f2f791cb): desc formatted t_user
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=alanchan_20221017153821_c8ac2142-aacf-479c-a8f2-e040f2f791cb); Time taken: 0.037 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
|           col_name            |                     data_type                      |                      comment                       |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| id                            | int                                                |                                                    |
| name                          | varchar(255)                                       |                                                    |
| age                           | int                                                |                                                    |
| city                          | varchar(255)                                       |                                                    |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | test                                               | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | alanchan                                           | NULL                                               |
| CreateTime:                   | Mon Oct 17 14:47:08 CST 2022                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://HadoopHAcluster/user/hive/warehouse/test.db/t_user | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"age\":\"true\",\"city\":\"true\",\"id\":\"true\",\"name\":\"true\"}} |
|                               | bucketing_version                                  | 2                                                  |
|                               | numFiles                                           | 0                                                  |
|                               | numRows                                            | 0                                                  |
|                               | rawDataSize                                        | 0                                                  |
|                               | totalSize                                          | 0                                                  |
|                               | transient_lastDdlTime                              | 1665989228                                         |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL                                               |
| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL                                               |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | -1                                                 | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
| Storage Desc Params:          | NULL                                               | NULL                                               |
|                               | field.delim                                        | ,                                                  |
|                               | serialization.format                               | ,                                                  |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
35 rows selected (0.081 seconds)

2. Hive file read and write process

  • Read process
    HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row Object
    Hive first calls the InputFormat (default TextInputFormat), which returns key-value Records (by default, one row corresponds to one record).
    It then calls the Deserializer of the SerDe (default LazySimpleSerDe) to split the value of each record into fields according to the delimiter.

  • Write process
    Row Object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
    When writing a Row to a file, Hive first calls the Serializer of the SerDe (default LazySimpleSerDe) to convert the row object into a byte sequence, and then calls the OutputFormat to write the data to an HDFS file.

3. SerDe-related syntax

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
In Hive's CREATE TABLE statement, the SerDe-related syntax is the ROW FORMAT clause.

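Reproduced from the Hive language manual linked above (in place of the original figure), the row_format grammar is:

ROW FORMAT row_format

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
      [COLLECTION ITEMS TERMINATED BY char]
      [MAP KEYS TERMINATED BY char]
      [LINES TERMINATED BY char]
      [NULL DEFINED AS char]
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, ...)]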
ROW FORMAT is the syntax keyword; DELIMITED and SERDE are mutually exclusive alternatives.
Using DELIMITED means the default LazySimpleSerDe class processes the data. If the data file format is special, ROW FORMAT SERDE serde_name can specify a different SerDe class, including a user-defined one.

1) LazySimpleSerDe delimiter specification

LazySimpleSerDe is Hive's default SerDe. Its DELIMITED form has four sub-clauses, which specify the delimiters between fields, between collection elements, between the keys and values of a map, and between lines. They can be combined as needed when creating a table, according to the characteristics of the data.
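
A minimal sketch using all four sub-clauses (the table layout is illustrative):

CREATE TABLE t_person (
  id INT,
  hobbies ARRAY<STRING>,
  scores MAP<STRING, INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '-'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n';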

2) Default delimiter

If a table is created without a ROW FORMAT clause, the default field delimiter is '\001', a non-printing character with ASCII code 1.
In the vim editor, pressing Ctrl+v followed by Ctrl+a inserts '\001', which vim displays as ^A. Some text editors display it as SOH.
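
A minimal sketch (the table name is illustrative); with no ROW FORMAT clause, the table's data files must separate fields with '\001':

-- Equivalent to: ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
CREATE TABLE t_default (id INT, name STRING);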

4. Hive data storage path

1) Default storage path

The default storage path of Hive tables is specified by the hive.metastore.warehouse.dir property in the ${HIVE_HOME}/conf/hive-site.xml configuration file; the default value is /user/hive/warehouse.
Under this path, data files are organized into folders according to the database and table they belong to.
(Figure: warehouse directory layout on HDFS)
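
The value in effect can be checked from a Hive session (a minimal sketch; SET with a property name prints its current value):

-- Prints e.g. hive.metastore.warehouse.dir=/user/hive/warehouse
SET hive.metastore.warehouse.dir;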

2) Specifying the storage path

When creating a table in Hive, the LOCATION clause can change where the table's data is stored on HDFS, making table creation and data loading more flexible.
Syntax: LOCATION '<hdfs_location>'.
Specifying a LOCATION is especially convenient when the data files have already been generated on HDFS.
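
A minimal sketch, assuming data files already exist under /data/logs (the path and layout are illustrative):

-- The table reads its data from the existing directory instead of the warehouse path
CREATE TABLE t_log2 (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/logs';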

This article introduced Hive's overall architecture, related components, and data model, as well as the process and mechanism by which Hive reads and writes files.
