This article describes the structure and purpose of some of the important tables in the Hive metastore database, to make it easier to understand how components such as Impala, SparkSQL, and Hive access it.
1. Metadata table that stores the Hive version (VERSION)
This table is simple, but very important.
| VER_ID | SCHEMA_VERSION | VERSION_COMMENT |
|---|---|---|
| Primary key ID | Hive version | Version comment |
| 1 | 1.1.0 | Set by MetaStore |
If something is wrong with this table, the Hive CLI cannot be entered at all. For example, if the table does not exist, starting the Hive CLI fails with the error "Table 'hive.version' does not exist".
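A quick way to rule this problem out is to read the version row directly. The following is a minimal sketch, not a definitive check: it uses Python's built-in `sqlite3` as an in-memory stand-in for the real metastore database (often MySQL), loads only the sample row shown above, and the helper name `get_schema_version` is made up for illustration.

```python
import sqlite3

def get_schema_version(conn):
    """Return SCHEMA_VERSION from the VERSION table, or None if the table or row is missing."""
    try:
        row = conn.execute(
            "SELECT SCHEMA_VERSION FROM VERSION ORDER BY VER_ID LIMIT 1"
        ).fetchone()
    except sqlite3.OperationalError:
        # SQLite raises this when the VERSION table does not exist.
        return None
    return row[0] if row else None

# Stand-in for the metastore database, loaded with the sample row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE VERSION ("
    "VER_ID INTEGER PRIMARY KEY, SCHEMA_VERSION TEXT, VERSION_COMMENT TEXT)"
)
conn.execute("INSERT INTO VERSION VALUES (1, '1.1.0', 'Set by MetaStore')")
print(get_schema_version(conn))  # prints 1.1.0
```

Against a real metastore, the same SELECT would be run through whatever database driver is in use; the exception type for a missing table depends on that driver.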
2. Metadata tables related to Hive databases (DBS, DATABASE_PARAMS)
DBS: This table stores the basic information of every database in Hive. It has the following fields:
| Table field | Explanation | Sample data |
|---|---|---|
| DB_ID | Database ID | 1 |
| DESC | Database description | Default Hive database |
| DB_LOCATION_URI | HDFS path of the database | hdfs://193.168.1.75:9000/test-warehouse |
| NAME | Database name | default |
| OWNER_NAME | User name of the database owner | public |
| OWNER_TYPE | Owner role | ROLE |
DATABASE_PARAMS: This table stores database parameters, that is, the parameters specified with CREATE DATABASE ... WITH DBPROPERTIES (property_name=property_value, ...).
| Table field | Explanation | Sample data |
|---|---|---|
| DB_ID | Database ID | 1 |
| PARAM_KEY | Parameter name | createdby |
| PARAM_VALUE | Parameter value | root |
The DBS and DATABASE_PARAMS tables are related through the DB_ID field.
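The DB_ID relationship can be tried out without touching a live metastore. The sketch below is only an illustration under stated assumptions: it rebuilds a cut-down DBS and DATABASE_PARAMS in an in-memory SQLite database (a stand-in for the real metastore), keeps only the columns the query needs, and loads the sample rows from the two tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT, DB_LOCATION_URI TEXT,
                  OWNER_NAME TEXT, OWNER_TYPE TEXT);
CREATE TABLE DATABASE_PARAMS (DB_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
INSERT INTO DBS VALUES (1, 'default', 'hdfs://193.168.1.75:9000/test-warehouse',
                        'public', 'ROLE');
INSERT INTO DATABASE_PARAMS VALUES (1, 'createdby', 'root');
""")

# LEFT JOIN keeps databases that were created without any DBPROPERTIES.
DB_WITH_PARAMS = """
SELECT d.NAME, d.DB_LOCATION_URI, p.PARAM_KEY, p.PARAM_VALUE
FROM DBS d
LEFT JOIN DATABASE_PARAMS p ON p.DB_ID = d.DB_ID
ORDER BY d.NAME, p.PARAM_KEY
"""
rows = conn.execute(DB_WITH_PARAMS).fetchall()
for row in rows:
    print(row)
```

The SELECT itself uses only standard join syntax, so it should carry over to a MySQL-backed metastore unchanged.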
3. Metadata tables related to Hive tables and views
The main ones are TBLS, TABLE_PARAMS, and TBL_PRIVS. The three tables are related through TBL_ID.
TBLS: This table stores the basic information of Hive tables, views, and index tables.
| Table field | Explanation | Sample data |
|---|---|---|
| TBL_ID | Table ID | 21 |
| CREATE_TIME | Creation time | 1447675704 |
| DB_ID | Database ID | 1 |
| LAST_ACCESS_TIME | Last access time | 1447675704 |
| OWNER | Owner | root |
| RETENTION | Reserved field | 0 |
| SD_ID | Serialization configuration ID | 41, corresponds to SD_ID in the SDS table |
| TBL_NAME | Table name | ex_detail_ufdr_30streaming |
| TBL_TYPE | Table type | EXTERNAL_TABLE |
| VIEW_EXPANDED_TEXT | Expanded HQL statement of the view | |
| VIEW_ORIGINAL_TEXT | Original HQL statement of the view | |
TABLE_PARAMS: This table stores the properties of tables and views.
| Table field | Explanation | Sample data |
|---|---|---|
| TBL_ID | Table ID | 1 |
| PARAM_KEY | Property name | totalSize, numRows, EXTERNAL |
| PARAM_VALUE | Property value | 970107336, 21231028, TRUE |
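Because TABLE_PARAMS holds one key/value row per property, reading several properties of the same table usually means pivoting those rows. The following is a rough sketch of one way to do that with conditional aggregation, again using an in-memory SQLite stand-in loaded with the sample values above. The output column aliases are made up, and on MySQL the cast would typically be written `CAST(... AS SIGNED)` instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_PARAMS (TBL_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
INSERT INTO TABLE_PARAMS VALUES
  (1, 'totalSize', '970107336'),
  (1, 'numRows', '21231028'),
  (1, 'EXTERNAL', 'TRUE');
""")

# Conditional aggregation turns the key/value rows into one row per table.
# PARAM_VALUE is stored as text, so the numeric properties are cast.
TABLE_STATS = """
SELECT TBL_ID,
       MAX(CASE WHEN PARAM_KEY = 'numRows'
                THEN CAST(PARAM_VALUE AS INTEGER) END) AS num_rows,
       MAX(CASE WHEN PARAM_KEY = 'totalSize'
                THEN CAST(PARAM_VALUE AS INTEGER) END) AS total_size,
       MAX(CASE WHEN PARAM_KEY = 'EXTERNAL'
                THEN PARAM_VALUE END) AS is_external
FROM TABLE_PARAMS
GROUP BY TBL_ID
"""
stats = conn.execute(TABLE_STATS).fetchall()
print(stats)  # [(1, 21231028, 970107336, 'TRUE')]
```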
TBL_PRIVS: This table stores the authorization information of tables and views.
| Table field | Explanation | Sample data |
|---|---|---|
| TBL_GRANT_ID | Grant ID | 1 |
| CREATE_TIME | Time of the grant | 1436320455 |
| GRANT_OPTION | | 0 |
| GRANTOR | User who performed the grant | root |
| GRANTOR_TYPE | Grantor type | USER |
| PRINCIPAL_NAME | User who received the grant | username |
| PRINCIPAL_TYPE | Type of the user who received the grant | USER |
| TBL_PRIV | Privilege | Select, Alter |
| TBL_ID | Table ID | 21, corresponds to TBL_ID in the TBLS table |
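As an illustration of the TBL_ID relationship, the sketch below lists who holds which privilege on which table and renders CREATE_TIME, a Unix timestamp in seconds, as a readable UTC time. It is only a sketch: the tables are cut-down SQLite stand-ins, the data comes from the samples above, and storing one row per privilege (with a made-up grant ID 2 for the second row) is an assumption.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT, TBL_TYPE TEXT);
CREATE TABLE TBL_PRIVS (TBL_GRANT_ID INTEGER PRIMARY KEY, CREATE_TIME INTEGER,
                        GRANTOR TEXT, PRINCIPAL_NAME TEXT, PRINCIPAL_TYPE TEXT,
                        TBL_PRIV TEXT, TBL_ID INTEGER);
INSERT INTO TBLS VALUES (21, 'ex_detail_ufdr_30streaming', 'EXTERNAL_TABLE');
-- Assumption: one row per granted privilege; grant ID 2 is made up.
INSERT INTO TBL_PRIVS VALUES
  (1, 1436320455, 'root', 'username', 'USER', 'Select', 21),
  (2, 1436320455, 'root', 'username', 'USER', 'Alter', 21);
""")

GRANTS = """
SELECT t.TBL_NAME, p.PRINCIPAL_NAME, p.TBL_PRIV, p.CREATE_TIME
FROM TBL_PRIVS p
JOIN TBLS t ON t.TBL_ID = p.TBL_ID
ORDER BY p.TBL_GRANT_ID
"""
grants = [
    # Convert the epoch seconds to an ISO 8601 string in UTC.
    (name, who, priv, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
    for name, who, priv, ts in conn.execute(GRANTS)
]
for grant in grants:
    print(grant)
```

The same conversion applies to the other CREATE_TIME and LAST_ACCESS_TIME columns shown in this article.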
4. Metadata tables related to Hive file storage
The main ones are SDS, SD_PARAMS, SERDES, and SERDE_PARAMS. HDFS supports many file formats, and a file format can be specified when a Hive table is created. When Hive parses HQL into MapReduce jobs, it needs to know where to read and write the HDFS files and in which format. That information is stored in these tables.
SDS: This table stores the basic information of file storage, such as INPUT_FORMAT, OUTPUT_FORMAT, and whether the data is compressed. It is linked from the SD_ID column of TBLS, which is how the storage information of a Hive table is obtained.
| Table field | Explanation | Sample data |
|---|---|---|
| SD_ID | Storage information ID | 41 |
| CD_ID | Column information ID | 21, corresponds to the CDS table |
| INPUT_FORMAT | Input file format | org.apache.hadoop.mapred.TextInputFormat |
| IS_COMPRESSED | Whether compressed | 0 |
| IS_STOREDASSUBDIRECTORIES | Whether stored as subdirectories | 0 |
| LOCATION | HDFS path | hdfs://193.168.1.75:9000/detail_ufdr_streaming_test |
| NUM_BUCKETS | Number of buckets | 0 |
| OUTPUT_FORMAT | Output file format | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| SERDE_ID | Serialization class ID | 41, corresponds to the SERDES table |
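SD_PARAMS: This table stores the storage properties of Hive tables, specified at table creation with STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)].
| Table field | Explanation | Sample data |
|---|---|---|
| SD_ID | Storage configuration ID | 41 |
| PARAM_KEY | Storage property name | |
| PARAM_VALUE | Storage property value | |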
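SERDES: This table stores information about the classes used for serialization.
| Table field | Explanation | Sample data |
|---|---|---|
| SERDE_ID | Serialization class configuration ID | 41 |
| NAME | Serialization class alias | NULL |
| SLIB | Serialization class | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |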
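SERDE_PARAMS: This table stores serialization properties and format information, such as row and column delimiters.
| Table field | Explanation | Sample data |
|---|---|---|
| SERDE_ID | Serialization class configuration ID | 41 |
| PARAM_KEY | Property name | field.delim |
| PARAM_VALUE | Property value | \| |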
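To see how these four tables fit together, the sketch below follows the chain TBLS.SD_ID, SDS.SERDE_ID, SERDE_PARAMS for a single table. It is a minimal illustration under stated assumptions: the tables are cut-down SQLite stand-ins holding the sample rows above, and the `?` placeholder is the SQLite driver's parameter style (MySQL drivers typically use `%s`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT, SD_ID INTEGER);
CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, INPUT_FORMAT TEXT, LOCATION TEXT,
                  SERDE_ID INTEGER);
CREATE TABLE SERDES (SERDE_ID INTEGER PRIMARY KEY, NAME TEXT, SLIB TEXT);
CREATE TABLE SERDE_PARAMS (SERDE_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
INSERT INTO TBLS VALUES (21, 'ex_detail_ufdr_30streaming', 41);
INSERT INTO SDS VALUES (41, 'org.apache.hadoop.mapred.TextInputFormat',
  'hdfs://193.168.1.75:9000/detail_ufdr_streaming_test', 41);
INSERT INTO SERDES VALUES (41, NULL,
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe');
INSERT INTO SERDE_PARAMS VALUES (41, 'field.delim', '|');
""")

# LEFT JOIN on SERDE_PARAMS so tables without an explicit delimiter still appear.
STORAGE_INFO = """
SELECT t.TBL_NAME, s.LOCATION, s.INPUT_FORMAT, r.SLIB,
       sp.PARAM_VALUE AS field_delim
FROM TBLS t
JOIN SDS s    ON s.SD_ID = t.SD_ID
JOIN SERDES r ON r.SERDE_ID = s.SERDE_ID
LEFT JOIN SERDE_PARAMS sp
       ON sp.SERDE_ID = r.SERDE_ID AND sp.PARAM_KEY = 'field.delim'
WHERE t.TBL_NAME = ?
"""
info = conn.execute(STORAGE_INFO, ("ex_detail_ufdr_30streaming",)).fetchone()
print(info)
```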
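5. Metadata tables related to Hive table columns
The main one is COLUMNS_V2.
COLUMNS_V2: This table stores the column information of tables.
| Table field | Explanation | Sample data |
|---|---|---|
| CD_ID | Column information ID | 21 |
| COMMENT | Column comment | NULL |
| COLUMN_NAME | Column name | air_port_duration |
| TYPE_NAME | Column type | bigint |
| INTEGER_IDX | Column position | 119 |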
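COLUMNS_V2 has no TBL_ID of its own, so listing the columns of a table means going through SDS: TBLS.SD_ID leads to SDS, and SDS.CD_ID leads to COLUMNS_V2. The sketch below shows that path on an in-memory SQLite stand-in. The helper name `list_columns` and the second column row are made up for illustration, and a real lookup would normally filter on the database as well, since table names are not unique across databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT, SD_ID INTEGER);
CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, CD_ID INTEGER);
CREATE TABLE COLUMNS_V2 (CD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT,
                         INTEGER_IDX INTEGER);
INSERT INTO TBLS VALUES (21, 'ex_detail_ufdr_30streaming', 41);
INSERT INTO SDS VALUES (41, 21);
-- 'example_col' is a made-up row so that the ordering is visible.
INSERT INTO COLUMNS_V2 VALUES (21, 'air_port_duration', 'bigint', 119),
                              (21, 'example_col', 'string', 0);
""")

def list_columns(conn, table_name):
    """Return (name, type) pairs for a table, ordered by column position."""
    sql = """
    SELECT c.COLUMN_NAME, c.TYPE_NAME
    FROM TBLS t
    JOIN SDS s        ON s.SD_ID = t.SD_ID
    JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
    WHERE t.TBL_NAME = ?
    ORDER BY c.INTEGER_IDX
    """
    return conn.execute(sql, (table_name,)).fetchall()

columns = list_columns(conn, "ex_detail_ufdr_30streaming")
print(columns)
```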
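6. Metadata tables related to Hive table partitions
The main ones are PARTITIONS, PARTITION_KEYS, PARTITION_KEY_VALS, and PARTITION_PARAMS.
PARTITIONS: This table stores the basic information of table partitions.
| Table field | Explanation | Sample data |
|---|---|---|
| PART_ID | Partition ID | 21 |
| CREATE_TIME | Partition creation time | 1450861405 |
| LAST_ACCESS_TIME | Last access time | 0 |
| PART_NAME | Partition name | hour=15/last_msisdn=0 |
| SD_ID | Partition storage ID | 43 |
| TBL_ID | Table ID | 22 |
| LINK_TARGET_ID | | NULL |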
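PARTITION_KEYS: This table stores the partition columns of tables.
| Table field | Explanation | Sample data |
|---|---|---|
| TBL_ID | Table ID | 22 |
| PKEY_COMMENT | Partition column comment | NULL |
| PKEY_NAME | Partition column name | hour |
| PKEY_TYPE | Partition column type | int |
| INTEGER_IDX | Partition column position | 0 |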
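PARTITION_KEY_VALS: This table stores the values of the partition columns.
| Table field | Explanation | Sample data |
|---|---|---|
| PART_ID | Partition ID | 21 |
| PART_KEY_VAL | Partition column value | 0 |
| INTEGER_IDX | Position of the partition column value | 1 |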
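PARTITION_PARAMS: This table stores the properties of partitions.
| Table field | Explanation | Sample data |
|---|---|---|
| PART_ID | Partition ID | 21 |
| PARAM_KEY | Partition property name | numFiles, numRows |
| PARAM_VALUE | Partition property value | 1, 502195 |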
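The sample rows suggest how a partition name such as hour=15/last_msisdn=0 is broken down: PARTITION_KEYS holds the column names per table, PARTITION_KEY_VALS holds the values per partition, and the two line up on INTEGER_IDX. The sketch below illustrates this on an in-memory SQLite stand-in. The second key row and its `int` type are inferred from the PART_NAME sample, not taken from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, PART_NAME TEXT,
                         TBL_ID INTEGER);
CREATE TABLE PARTITION_KEYS (TBL_ID INTEGER, PKEY_NAME TEXT, PKEY_TYPE TEXT,
                             INTEGER_IDX INTEGER);
CREATE TABLE PARTITION_KEY_VALS (PART_ID INTEGER, PART_KEY_VAL TEXT,
                                 INTEGER_IDX INTEGER);
INSERT INTO PARTITIONS VALUES (21, 'hour=15/last_msisdn=0', 22);
-- The 'last_msisdn' key row and its type are inferred from PART_NAME.
INSERT INTO PARTITION_KEYS VALUES (22, 'hour', 'int', 0),
                                  (22, 'last_msisdn', 'int', 1);
INSERT INTO PARTITION_KEY_VALS VALUES (21, '15', 0), (21, '0', 1);
""")

# Keys and values pair up on the same INTEGER_IDX position.
PARTITION_SPEC = """
SELECT p.PART_NAME, k.PKEY_NAME, v.PART_KEY_VAL
FROM PARTITIONS p
JOIN PARTITION_KEYS k     ON k.TBL_ID = p.TBL_ID
JOIN PARTITION_KEY_VALS v ON v.PART_ID = p.PART_ID
                         AND v.INTEGER_IDX = k.INTEGER_IDX
WHERE p.TBL_ID = ?
ORDER BY p.PART_ID, k.INTEGER_IDX
"""
spec = conn.execute(PARTITION_SPEC, (22,)).fetchall()
print(spec)
```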
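7. Other, less commonly used metadata tables
DB_PRIVS
Database privilege table. Privileges granted on a database with the GRANT statement are stored here.
IDXS
Index table; stores the metadata of Hive indexes.
INDEX_PARAMS
Index-related properties.
TBL_COL_STATS
Column statistics of tables. They are recorded here after the table columns are analyzed with the ANALYZE statement.
TBL_COL_PRIVS
Column-level authorization information of tables.
PART_PRIVS
Authorization information of partitions.
PART_COL_PRIVS
Column-level privilege information of partitions.
PART_COL_STATS
Column statistics of partitions.
FUNCS
Information about user-registered functions.
FUNC_RU
Resource information of user-registered functions.
8. Some queries against the metastore database
Sometimes the tables in Hive need to be processed in bulk, and querying the metastore database directly is one way to do that. Proceed with great care!
The query statements below, based on the table structures and relationships in the metastore database, are ones that have been used in day-to-day work. More will be added over time.
1. Query the partitions of a table
When Spark SQL queries a Hive table, a TreeNodeException can occur because the files recorded in the metadata do not match the files on HDFS. For example, if show partitions in Hive lists the partition pt=20160601 but the corresponding subdirectory does not exist under the HDFS path, Spark SQL throws this exception. To query the partitions of a table in that situation, use the following statement: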
SELECT p.* from PARTITIONS p
JOIN TBLS t
ON t.TBL_ID=p.TBL_ID
WHERE t.TBL_NAME='table'
AND PART_NAME like '%pt=20160601%';
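The query above only shows what the metastore believes exists. To find the partitions that actually trigger the mismatch, their locations have to be compared with what is really on HDFS. The sketch below is a hedged illustration of that comparison, not an established tool: the tables are SQLite stand-ins, every row and path in it is made up, the helper name `stale_partitions` is hypothetical, and the set of existing directories is assumed to have been collected separately, for example from an HDFS directory listing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT);
CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, LOCATION TEXT);
CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, PART_NAME TEXT,
                         SD_ID INTEGER, TBL_ID INTEGER);
-- All rows below are made up for the demonstration.
INSERT INTO TBLS VALUES (1, 'table');
INSERT INTO SDS VALUES (10, '/warehouse/table/pt=20160531'),
                       (11, '/warehouse/table/pt=20160601');
INSERT INTO PARTITIONS VALUES (100, 'pt=20160531', 10, 1),
                              (101, 'pt=20160601', 11, 1);
""")

def stale_partitions(conn, table_name, existing_dirs):
    """Return partition names whose LOCATION is not in existing_dirs."""
    sql = """
    SELECT p.PART_NAME, s.LOCATION
    FROM PARTITIONS p
    JOIN TBLS t ON t.TBL_ID = p.TBL_ID
    JOIN SDS s  ON s.SD_ID = p.SD_ID
    WHERE t.TBL_NAME = ?
    ORDER BY p.PART_NAME
    """
    return [name for name, location in conn.execute(sql, (table_name,))
            if location not in existing_dirs]

# Pretend that only the 20160531 directory exists on HDFS.
stale = stale_partitions(conn, "table", {"/warehouse/table/pt=20160531"})
print(stale)
```

The function only reports the mismatches; how to resolve them (dropping the partition in Hive or restoring the directory) is a separate decision.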
2. Query the names of all tables in a given database that are stored as textfile
SELECT
    d.NAME,
    t.TBL_NAME,
    s.INPUT_FORMAT,
    s.OUTPUT_FORMAT
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE d.NAME = 'test'
  AND s.INPUT_FORMAT LIKE '%TextInputFormat%';
3. Query the partitioned tables in a given database
SELECT
    db.NAME,
    tb.TBL_NAME,
    pk.PKEY_NAME
FROM TBLS tb
JOIN DBS db ON tb.DB_ID = db.DB_ID
JOIN PARTITION_KEYS pk ON tb.TBL_ID = pk.TBL_ID
WHERE db.NAME = 'test';
4. Query the non-partitioned tables in a given database
SELECT
    db.NAME,
    tb.TBL_NAME
FROM TBLS tb
JOIN DBS db ON tb.DB_ID = db.DB_ID
WHERE db.NAME = 'test'
  AND tb.TBL_ID NOT IN (
      SELECT DISTINCT TBL_ID FROM PARTITION_KEYS
  );
5. Query the partitioned tables of a given storage type in a given database
SELECT
    db.NAME,
    tb.TBL_NAME,
    pk.PKEY_NAME,
    s.INPUT_FORMAT,
    s.OUTPUT_FORMAT
FROM TBLS tb
JOIN DBS db ON tb.DB_ID = db.DB_ID
JOIN PARTITION_KEYS pk ON tb.TBL_ID = pk.TBL_ID
JOIN SDS s ON tb.SD_ID = s.SD_ID
WHERE db.NAME = 'test'
  AND s.INPUT_FORMAT LIKE '%TextInputFormat%';
6. Query the non-partitioned tables of a given storage type in a given database
SELECT
    db.NAME,
    tb.TBL_NAME,
    s.INPUT_FORMAT,
    s.OUTPUT_FORMAT
FROM TBLS tb
JOIN DBS db ON tb.DB_ID = db.DB_ID
JOIN SDS s ON tb.SD_ID = s.SD_ID
WHERE db.NAME = 'test'
  AND s.INPUT_FORMAT LIKE '%TextInputFormat%'
  AND tb.TBL_ID NOT IN (SELECT DISTINCT TBL_ID FROM PARTITION_KEYS);