Hive metastore: the metadata table structure explained

This article describes the table structure of the Hive metastore and some of its important uses, which helps in understanding how Impala, Spark SQL, Hive, and other components access the metastore.

1. Metadata table storing the Hive version (VERSION)

This table is simple, but very important.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| VER_ID | ID, primary key | 1 |
| SCHEMA_VERSION | Hive version | 1.1.0 |
| VERSION_COMMENT | Version comment | Set by MetaStore |

If there is a problem with this table, the Hive CLI will not start. For example, if the table does not exist, starting the Hive CLI fails with the error "Table 'hive.version' does not exist".
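
As a quick sanity check, this table can be queried directly in the metastore database (a minimal sketch, assuming a MySQL metastore; the table is expected to contain exactly one row):

SELECT VER_ID, SCHEMA_VERSION, VERSION_COMMENT FROM VERSION;
-- expected: a single row, e.g. 1 | 1.1.0 | Set by MetaStore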

2. Database-related metadata tables (DBS, DATABASE_PARAMS)

DBS: This table stores the basic information about all databases in Hive. It has the following fields:

| Table field | Explanation | Sample data |
| --- | --- | --- |
| DB_ID | Database ID | 1 |
| DESC | Database description | Default Hive database |
| DB_LOCATION_URI | HDFS path of the database | hdfs://193.168.1.75:9000/test-warehouse |
| NAME | Database name | default |
| OWNER_NAME | Database owner user name | public |
| OWNER_TYPE | Owner type | ROLE |

DATABASE_PARAMS: This table stores database parameters, i.e. the parameters specified with CREATE DATABASE ... WITH DBPROPERTIES (property_name=property_value, ...).

| Table field | Explanation | Sample data |
| --- | --- | --- |
| DB_ID | Database ID | 1 |
| PARAM_KEY | Parameter name | createdby |
| PARAM_VALUE | Parameter value | root |

The DBS and DATABASE_PARAMS tables are joined on the DB_ID field.
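
As a sketch, the parameters set with DBPROPERTIES can be listed per database by joining the two tables (the database name 'test' is just an illustrative value):

SELECT d.NAME, d.DB_LOCATION_URI, p.PARAM_KEY, p.PARAM_VALUE
FROM DBS d
JOIN DATABASE_PARAMS p ON d.DB_ID = p.DB_ID
WHERE d.NAME = 'test';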

3. Table- and view-related metadata tables

These are TBLS, TABLE_PARAMS, and TBL_PRIVS; all three tables are joined on TBL_ID.

TBLS: This table stores the basic information about Hive tables, views, and index tables.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| TBL_ID | Table ID | 21 |
| CREATE_TIME | Creation time | 1447675704 |
| DB_ID | Database ID | 1 |
| LAST_ACCESS_TIME | Last access time | 1447675704 |
| OWNER | Owner | root |
| RETENTION | Retention | 0 |
| SD_ID | Storage (serialization) configuration ID | 41, corresponds to SD_ID in the SDS table |
| TBL_NAME | Table name | ex_detail_ufdr_30streaming |
| TBL_TYPE | Table type | EXTERNAL_TABLE |
| VIEW_EXPANDED_TEXT | Expanded HQL statement of the view | |
| VIEW_ORIGINAL_TEXT | Original HQL statement of the view | |

TABLE_PARAMS: This table stores the property information of tables/views.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| TBL_ID | Table ID | 1 |
| PARAM_KEY | Property name | totalSize, numRows, EXTERNAL |
| PARAM_VALUE | Property value | 970107336, 21231028, TRUE |

TBL_PRIVS: This table stores the authorization information of tables/views.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| TBL_GRANT_ID | Grant ID | 1 |
| CREATE_TIME | Grant time | 1436320455 |
| GRANT_OPTION | | 0 |
| GRANTOR | User who performed the grant | root |
| GRANTOR_TYPE | Grantor type | USER |
| PRINCIPAL_NAME | Granted user | username |
| PRINCIPAL_TYPE | Granted principal type | USER |
| TBL_PRIV | Privilege | Select, Alter |
| TBL_ID | Table ID | 21, corresponds to TBL_ID in the TBLS table |
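
As an illustration of the TBL_ID association, the following sketch lists the properties of a single table (the table name is the sample value shown above):

SELECT t.TBL_NAME, tp.PARAM_KEY, tp.PARAM_VALUE
FROM TBLS t
JOIN TABLE_PARAMS tp ON t.TBL_ID = tp.TBL_ID
WHERE t.TBL_NAME = 'ex_detail_ufdr_30streaming';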

4. File-storage-related metadata tables

These mainly involve SDS, SD_PARAMS, SERDES, and SERDE_PARAMS. HDFS supports many file formats, and a Hive table can be created with any of them; when Hive parses HQL into MapReduce jobs, it needs to know where the HDFS files are and which format to use for reading and writing them. That information is stored in these tables.

SDS: This table stores the basic storage information of a table's files, such as INPUT_FORMAT, OUTPUT_FORMAT, and whether the data is compressed. Joining it with TBLS on SD_ID gives the storage information of a Hive table.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| SD_ID | Storage information ID | 41 |
| CD_ID | Column information ID | 21, corresponds to the CDS table |
| INPUT_FORMAT | Input file format | org.apache.hadoop.mapred.TextInputFormat |
| IS_COMPRESSED | Whether compressed | 0 |
| IS_STOREDASSUBDIRECTORIES | Whether stored in subdirectories | 0 |
| LOCATION | HDFS path | hdfs://193.168.1.75:9000/detail_ufdr_streaming_test |
| NUM_BUCKETS | Number of buckets | 0 |
| OUTPUT_FORMAT | Output file format | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| SERDE_ID | SerDe class ID | 41, corresponds to the SERDES table |
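
For example, joining TBLS and SDS on SD_ID shows where a table's data lives on HDFS and which formats it uses (a sketch; the table name is the sample value above):

SELECT t.TBL_NAME, s.LOCATION, s.INPUT_FORMAT, s.OUTPUT_FORMAT, s.IS_COMPRESSED
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'ex_detail_ufdr_30streaming';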

SD_PARAMS: This table stores storage-related property information, specified at table creation time with STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)].

| Table field | Explanation | Sample data |
| --- | --- | --- |
| SD_ID | Storage configuration ID | 41 |
| PARAM_KEY | Storage property name | |
| PARAM_VALUE | Storage property value | |

SERDES: This table stores information about the classes used for serialization (SerDes).

| Table field | Explanation | Sample data |
| --- | --- | --- |
| SERDE_ID | SerDe configuration ID | 41 |
| NAME | SerDe alias | NULL |
| SLIB | SerDe class | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |

SERDE_PARAMS: This table stores serialization properties and format information, such as row and column delimiters.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| SERDE_ID | SerDe configuration ID | 41 |
| PARAM_KEY | Property name | field.delim |
| PARAM_VALUE | Property value | \| |
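
Putting these tables together, the following sketch shows the SerDe class and its properties (such as the field delimiter) for a table, again using the sample table name from above:

SELECT t.TBL_NAME, sd.SLIB, sp.PARAM_KEY, sp.PARAM_VALUE
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN SERDES sd ON s.SERDE_ID = sd.SERDE_ID
JOIN SERDE_PARAMS sp ON sd.SERDE_ID = sp.SERDE_ID
WHERE t.TBL_NAME = 'ex_detail_ufdr_30streaming';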

5. Column-related metadata tables

This mainly involves COLUMNS_V2.

COLUMNS_V2: This table stores the column information of each table.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| CD_ID | Column information ID | 21 |
| COMMENT | Column comment | NULL |
| COLUMN_NAME | Column name | air_port_duration |
| TYPE_NAME | Column type | bigint |
| INTEGER_IDX | Column position | 119 |
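
COLUMNS_V2 is reached from a table via its storage descriptor: TBLS.SD_ID points to SDS, and SDS.CD_ID identifies the column set. A sketch that lists a table's columns in order (table name is the sample value used earlier):

SELECT c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID
WHERE t.TBL_NAME = 'ex_detail_ufdr_30streaming'
ORDER BY c.INTEGER_IDX;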

6. Partition-related metadata tables

These mainly involve PARTITIONS, PARTITION_KEYS, PARTITION_KEY_VALS, and PARTITION_PARAMS.

PARTITIONS: This table stores the basic information of table partitions.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| PART_ID | Partition ID | 21 |
| CREATE_TIME | Partition creation time | 1450861405 |
| LAST_ACCESS_TIME | Last access time | 0 |
| PART_NAME | Partition name | hour=15/last_msisdn=0 |
| SD_ID | Partition storage ID | 43 |
| TBL_ID | Table ID | 22 |
| LINK_TARGET_ID | | NULL |

PARTITION_KEYS: This table stores the partition key (column) information.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| TBL_ID | Table ID | 22 |
| PKEY_COMMENT | Partition key comment | NULL |
| PKEY_NAME | Partition key name | hour |
| PKEY_TYPE | Partition key type | int |
| INTEGER_IDX | Partition key position | 0 |

PARTITION_KEY_VALS: This table stores the partition key values.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| PART_ID | Partition ID | 21 |
| PART_KEY_VAL | Partition key value | |
| INTEGER_IDX | Partition key value position | 1 |

PARTITION_PARAMS: This table stores the property information of partitions.

| Table field | Explanation | Sample data |
| --- | --- | --- |
| PART_ID | Partition ID | 21 |
| PARAM_KEY | Partition property name | numFiles, numRows |
| PARAM_VALUE | Partition property value | 1, 502195 |
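
For example, partition-level statistics such as numRows can be read by joining PARTITIONS and PARTITION_PARAMS (a sketch, assuming statistics have been collected; the table name is the sample one used earlier):

SELECT p.PART_NAME, pp.PARAM_KEY, pp.PARAM_VALUE
FROM PARTITIONS p
JOIN PARTITION_PARAMS pp ON p.PART_ID = pp.PART_ID
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
WHERE t.TBL_NAME = 'ex_detail_ufdr_30streaming'
AND pp.PARAM_KEY = 'numRows';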

7. Other, less commonly used metadata tables

| Table | Explanation |
| --- | --- |
| DB_PRIVS | Database privilege information. Populated when privileges are granted on a database with a GRANT statement. |
| IDXS | Index table, storing Hive index-related metadata. |
| INDEX_PARAMS | Index-related property information. |
| TBL_COL_STATS | Statistics of table columns, recorded after analyzing columns with an ANALYZE statement. |
| TBL_COL_PRIVS | Authorization information of table columns. |
| PART_PRIVS | Authorization information of partitions. |
| PART_COL_PRIVS | Privilege information of partition columns. |
| PART_COL_STATS | Statistics of partition columns. |
| FUNCS | Information about user-registered functions. |
| FUNC_RU | Resource information of user-registered functions. |

8. Some queries against the metastore database

Sometimes Hive tables need to be processed in batches. In such cases you can run queries against the metastore database directly, but please be careful with these operations!!
Based on the table structures and relationships described above, query statements used in real work will be added here over time.

1. Query the partitions of a table

When querying a Hive table with Spark SQL, a TreeNodeException can be thrown because the files recorded in the metadata are inconsistent with the files on HDFS. For example, show partitions in Hive lists the partition pt=20160601, but the corresponding subdirectory does not exist under the HDFS path, so Spark SQL throws the exception. If you then need to look up the partitions of a table, you can use the following statement:

SELECT p.* FROM PARTITIONS p
JOIN TBLS t ON t.TBL_ID = p.TBL_ID
WHERE t.TBL_NAME = 'table'
AND p.PART_NAME LIKE '%pt=20160601%';

2. Query the names of all tables stored as textfile in a given database

select
  d.NAME,
  t.TBL_NAME,
  s.INPUT_FORMAT,
  s.OUTPUT_FORMAT
from TBLS t
join DBS d on t.DB_ID = d.DB_ID
join SDS s on t.SD_ID = s.SD_ID
where d.NAME = 'test'
and s.INPUT_FORMAT like '%TextInputFormat%';

3. Query the partitioned tables in a given database

select
  db.NAME,
  tb.TBL_NAME,
  pk.PKEY_NAME
from TBLS tb
join DBS db on tb.DB_ID = db.DB_ID
join PARTITION_KEYS pk on tb.TBL_ID = pk.TBL_ID
where db.NAME = 'test';

4. Query the non-partitioned tables in a given database

select
  db.NAME,
  tb.TBL_NAME
from TBLS tb
join DBS db on tb.DB_ID = db.DB_ID
where db.NAME = 'test'
and tb.TBL_ID not in (
  select distinct TBL_ID from PARTITION_KEYS
);

5. Query the partitioned tables of a given storage type in a given database

select
  db.NAME,
  tb.TBL_NAME,
  pk.PKEY_NAME,
  s.INPUT_FORMAT,
  s.OUTPUT_FORMAT
from TBLS tb
join DBS db on tb.DB_ID = db.DB_ID
join PARTITION_KEYS pk on tb.TBL_ID = pk.TBL_ID
join SDS s on tb.SD_ID = s.SD_ID
where db.NAME = 'test'
and s.INPUT_FORMAT like '%TextInputFormat%';

6. Query the non-partitioned tables of a given storage type in a given database

select
  db.NAME,
  tb.TBL_NAME,
  s.INPUT_FORMAT,
  s.OUTPUT_FORMAT
from TBLS tb
join DBS db on tb.DB_ID = db.DB_ID
join SDS s on tb.SD_ID = s.SD_ID
where db.NAME = 'test'
and s.INPUT_FORMAT like '%TextInputFormat%'
and tb.TBL_ID not in (select distinct TBL_ID from PARTITION_KEYS);

 
