Sample data:
1,小明1,lol-book-movie,beijing:shangxuetang-shanghai:pudong
2,小明2,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,lol-book-movie,beijing:shangxuetang-shanghai:pudong
4,小明4,lol-book-movie,beijing:shangxuetang-shanghai:pudong
5,小明5,lol-movie,beijing:shangxuetang-shanghai:pudong
6,小明6,lol-book-movie,beijing:shangxuetang-shanghai:pudong
7,小明7,lol-book,beijing:shangxuetang-shanghai:pudong
8,小明8,lol-book,beijing:shangxuetang-shanghai:pudong
9,小明9,lol-book-movie,beijing:shangxuetang-shanghai:pudong
Complete Hive CREATE TABLE DDL syntax
create [temporary] [external] table [if not exists] [db_name.]table_name -- (note: temporary available in hive 0.14.0 and later)
[(col_name data_type [comment col_comment], ... [constraint_specification])]
[comment table_comment]
[partitioned by (col_name data_type [comment col_comment], ...)]
[clustered by (col_name, col_name, ...) [sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
[skewed by (col_name, col_name, ...) -- (note: available in hive 0.10.0 and later)
on ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[stored as directories]]
[
[row format row_format]
[stored as file_format]
| stored by 'storage.handler.class.name' [with serdeproperties (...)] -- (note: available in hive 0.6.0 and later)
]
[location hdfs_path]
[tblproperties (property_name=property_value, ...)] -- (note: available in hive 0.6.0 and later)
[as select_statement]; -- (note: available in hive 0.5.0 and later; not supported for external tables)
Creating Hive tables (internal tables by default)
-- internal (managed) table
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
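As a sanity check on the delimiters above, here is a minimal Python sketch (not part of Hive; purely illustrative) that splits one of the sample rows into the same array<string> and map<string,string> shapes:

```python
# Parse one sample row with the table's delimiters:
# fields ',', collection items '-', map keys ':'.
line = "1,小明1,lol-book-movie,beijing:shangxuetang-shanghai:pudong"
person_id, name, likes_raw, addr_raw = line.split(",")
likes = likes_raw.split("-")                                 # -> array<string>
address = dict(kv.split(":") for kv in addr_raw.split("-"))  # -> map<string,string>
print(likes)    # ['lol', 'book', 'movie']
print(address)  # {'beijing': 'shangxuetang', 'shanghai': 'pudong'}
```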
-- external table
create external table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
location '/usr';
Viewing a table's description
Syntax: describe [extended|formatted] table_name
describe formatted person;
The difference between internal and external tables
Hive internal table:
create table [if not exists] table_name
Dropping the table deletes both the metadata and the data.
Hive external table:
create external table [if not exists] table_name location hdfs_path
Dropping an external table deletes only the metastore metadata; the table's data in HDFS is not deleted.
Three ways to create a table
1. create table table_name -- the standard create statement
2. create table ... as select ... (CTAS) -- copies both the schema and the query results
3. create table ... like ... -- copies only the schema (semi-automatic)
Reference: three ways to create a table: https://blog.csdn.net/qq_26442553/article/details/85621767
Notes on CTAS table creation: https://blog.csdn.net/qq_26442553/article/details/79593504
Purpose of partitioned tables: query optimization. Filter on the partition columns whenever possible; a query that does not use the partition columns triggers a full table scan.
Static partitions
1. Static partition table operations
For an internal table, dropping a partition deletes both the partition's metadata and its data.
-- create the partitioned table
create table p_person (
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (sex string,age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- add a partition: note that all partition columns must be specified
alter table p_person add partition (sex='man',age=20);

-- drop a partition: only the partition column(s) to drop need be specified
alter table p_person drop partition (age=20);
2. Querying a table's partition information:
show partitions day_hour_table;
Dynamic partitioning
1. Creating and loading a dynamically partitioned table
-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- first create the source table
create table person(
id int,
name string,
age int,
sex string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- then create the partitioned target table
create table psn_partitioned_dongtai(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- load data into the source table
load data local inpath '/root/hivedata/psn1' into table person;

-- load data into the partitioned table
from person
insert overwrite table psn_partitioned_dongtai partition(age,sex)
select id,name,likes,address,age,sex distribute by age,sex;
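To illustrate what a dynamic-partition insert does, this small Python sketch (a hypothetical in-memory model, not Hive code) routes rows into per-(age, sex) partition directories the way the partition columns group them:

```python
from collections import defaultdict

# Hypothetical in-memory model: each distinct (age, sex) pair
# becomes its own partition directory, e.g. age=20/sex=man.
rows = [
    {"id": 1, "age": 20, "sex": "man"},
    {"id": 2, "age": 20, "sex": "woman"},
    {"id": 3, "age": 30, "sex": "man"},
]
partitions = defaultdict(list)
for r in rows:
    partitions[f"age={r['age']}/sex={r['sex']}"].append(r["id"])
print(sorted(partitions))  # ['age=20/sex=man', 'age=20/sex=woman', 'age=30/sex=man']
```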
2. Parameters
Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;  -- default: false
set hive.exec.dynamic.partition.mode=nonstrict;  -- default: strict (at least one partition column must be static)
set hive.exec.max.dynamic.partitions.pernode;  -- maximum number of dynamic partitions each MR node may create (default 100)
set hive.exec.max.dynamic.partitions;  -- maximum number of dynamic partitions across all MR nodes (default 1000)
set hive.exec.max.created.files;  -- maximum number of files all MR jobs may create (default 100000)
Bucketing
1. Bucketing overview
Buckets are stored as files. A bucketed table hashes the value of the bucketing column and takes it modulo the number of buckets, so that different rows land in different files; each file is one bucket.
Application scenarios: data sampling and map-side joins.
2. Creating and sampling a bucketed table
-- enable bucketing
set hive.enforce.bucketing=true;
-- default: false. Once set to true, at MR run time the number of reduce tasks is kept
-- consistent with the number of buckets. (Users can also set the number of reduce tasks
-- themselves via mapred.reduce.tasks, but this is not recommended when bucketing.)
-- Note: the number of buckets (files) produced by one job equals the number of reduce tasks.

-- create the source table
create table psn_fentong(
id int,
name string,
age int
)
row format delimited
fields terminated by ',';

-- load data
load data local inpath '/root/hivedata/psn_fentong' into table psn_fentong;

-- create the bucketed table
create table psn_fentong2(
id int,
name string,
age int
)
clustered by (age) into 4 buckets
row format delimited
fields terminated by ',';

-- load data
insert into table psn_fentong2 select id, name, age from psn_fentong;

-- data sampling
select * from psn_fentong2 tablesample(bucket 2 out of 4 on age);

tablesample syntax: tablesample(bucket x out of y)
x: the bucket to start sampling from
y: must be a multiple or factor of the table's total bucket count
Example: with 32 total buckets, which data does TABLESAMPLE(BUCKET 2 OUT OF 16) sample?
32/16 = 2 buckets are sampled: bucket 2 and bucket 18 (2 + 16).
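The tablesample arithmetic can be sketched in a few lines of Python (illustrative only; it covers the case where y divides the total bucket count and ignores the fractional-bucket case where y exceeds it):

```python
def sampled_buckets(x, y, total_buckets):
    """Buckets (1-indexed) read by TABLESAMPLE(BUCKET x OUT OF y),
    assuming y divides the table's total bucket count."""
    assert total_buckets % y == 0
    return [x + i * y for i in range(total_buckets // y)]

print(sampled_buckets(2, 16, 32))  # [2, 18]
print(sampled_buckets(3, 8, 32))   # [3, 11, 19, 27]
```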
3. Loading data with from ... insert
from psn21
insert overwrite table psn22 partition(age, sex)
select id, name, likes, address, age, sex distribute by age, sex;
(Note: the dynamic partition columns, age and sex, must come last in the select list.)
Hive lateral view (virtual table)
1. Description and purpose
Usage: lateral view is used in combination with UDTF functions (explode, split, etc.).
The UDTF first splits a column into multiple rows; lateral view then joins those rows back to the source rows and exposes the combined result as an aliased virtual table.
It mainly solves the restriction that a select using a UDTF may contain only that single UDTF and no other columns, and may not use multiple UDTFs.
2. Example
Count how many distinct hobbies and how many distinct cities appear in the table.
-- view the table description
describe formatted psn2;

select count(distinct(myCol1)), count(distinct(myCol2)) from psn2
LATERAL VIEW explode(likes) myTable1 AS myCol1
LATERAL VIEW explode(address) myTable2 AS myCol2, myCol3;
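What the lateral view query computes can be mimicked on two of the sample rows with plain Python (an in-memory stand-in for psn2, not Hive):

```python
# In-memory stand-in for two rows of psn2.
rows = [
    {"likes": ["lol", "book", "movie"],
     "address": {"beijing": "shangxuetang", "shanghai": "pudong"}},
    {"likes": ["lol", "movie"],
     "address": {"beijing": "shangxuetang", "shanghai": "pudong"}},
]
# explode(likes) / explode(address) flatten collections into rows;
# the counts below are count(distinct ...) over those exploded rows.
distinct_likes = {like for r in rows for like in r["likes"]}
distinct_cities = {city for r in rows for city in r["address"]}
print(len(distinct_likes), len(distinct_cities))  # 3 2
```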
Hive views
1. Description
Like ordinary views in relational databases, Hive also supports views.
Characteristics:
Materialized views are not supported.
Views support queries only, not load or other data-manipulation operations.
Creating a view only stores metadata; the corresponding subquery is executed when the view is queried.
If the view definition contains an ORDER BY / LIMIT clause and the query against the view also has one, the clause in the view definition takes higher priority.
Views can be defined on top of other views (nested views).
2. SQL
-- syntax
create view [if not exists] [db_name.]view_name
[(column_name [comment column_comment], ...)]
[comment view_comment]
[tblproperties (property_name = property_value, ...)]
as select ...;

-- create a view
create view v_psn as select * from psn;

-- query a view
select * from v_psn;

-- drop a view
drop view v_psn;
Hive Index
1. Purpose: to optimize query and retrieval performance
2. SQL
-- create an index
create index t1_index on table psn2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild
in table t1_index_table;
-- as: specifies the index handler
-- in table: specifies the index table; if omitted, the index is stored in the
-- default__psn2_t1_index__ table by default

-- show indexes
show index on psn2;

-- rebuild the index (an index must be rebuilt after creation before it takes effect)
alter index t1_index on psn2 rebuild;

-- drop the index
-- Dropping an index also drops the index table that maintains it (t1_index_table).
-- Note: never drop the index table by hand; it is maintained by the system. To remove
-- an index, just drop the index itself.
drop index t1_index on psn2;
Ways to run Hive
1. Command line (CLI): console mode
--Hive SQL operations
--Interacting with HDFS:
for example: hive> dfs -ls /;
--Interacting with the Linux system: prefix the command with !
for example: hive> !pwd;
2. Script mode (the most common in production environments)
--hive -e 'select * from psn2'  -- results are printed to the console
--hive -e 'select * from psn2' > aaa  -- query results are written to the file aaa
--hive -S -e 'select * from psn2' > aaa  -- -S means silent output; rarely used
--hive -f file  -- (commonly used) the file contains Hive statements; -f reads the file and executes the statements in it
--hive -i file  -- like -f, but after executing the statements Hive does not exit; it stays in the hive console. The other forms above exit the console once execution completes.
--source  -- inside the hive console, reads a file of Hive statements from the Linux filesystem and executes it, with the file stored on the local filesystem:
hive> source /root/file
3. JDBC: start the Hive service with hiveserver2
4. Web GUI interfaces (hwi, hue, etc.; hue is generally used, hwi is quite poor)
Regex SerDe example
1. Creating the table
create table logdata (
host string,
identity string,
t_user string,
time string,
request string,
referer string,
agent string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
stored as textfile;
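The input.regex above can be checked outside Hive; this Python sketch applies the same pattern (unescaped for Python) to a hypothetical Apache-style log line, with the seven capture groups mapping positionally to the table's seven columns:

```python
import re

# Same pattern as input.regex, written as a Python raw string.
pattern = r'([^ ]*) ([^ ]*) ([^ ]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*)'
# Hypothetical sample line; each of the seven groups fills one table column.
line = '192.168.1.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
m = re.match(pattern, line)
print(m.groups())
```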
Serializer and Deserializer (SerDe)
A SerDe performs serialization and deserialization.
It sits between the data storage and the execution engine, decoupling the two.
Hive reads and writes row content through the serde specified in row format.
'org.apache.hadoop.hive.serde2.RegexSerDe' is a regex-based row converter;
other converters exist, and a custom converter can be implemented against the SerDe interface.
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
"input.regex" = "regular expression used to parse each row"
)
Loading data
1. The load data method
When data is loaded into a table, no transformation is applied: a load operation simply copies the data into the location corresponding to the Hive table. The table's directory is created automatically when data is loaded.
-- loading data into a regular table
load data local inpath '/root/hivedata/person' into table person;

-- loading data into a partitioned table (static partition)
load data local inpath '/root/hivedata/person' into table p_person partition (sex='man',age=10);
2. The from ... insert method
Syntax:
from from_statement
insert overwrite table tablename1 [partition (partcol1=val1, partcol2=val2 ...) [if not exists]] select_statement1
[insert overwrite table tablename2 [partition ... [if not exists]] select_statement2]
[insert into table tablename2 [partition ...] select_statement2] ...;
Example:
-- static partitioned table
from person
insert overwrite table p_person partition (sex='woman', age=60)
select id,name,likes,address;
Custom Functions
To develop a Hive UDF, you only need to implement the evaluate method of a class extending UDF. Example:

package com.hrj.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public class helloUDF extends UDF {
    public String evaluate(String str) {
        try {
            return "HelloWorld " + str;
        } catch (Exception e) {
            return null;
        }
    }
}

Calling the custom function: compile the java file into helloudf.jar, then:
hive> add jar helloudf.jar;
hive> create temporary function helloworld as 'com.hrj.hive.udf.helloUDF';
hive> select helloworld(t.col1) from t limit 10;
hive> drop temporary function helloworld;
Notes:
1. helloworld is a temporary function, so every new hive session must repeat the add jar and create temporary function steps.
2. A UDF implements a one-in, one-out operation; for many-in, one-out, implement a UDAF instead.
hive parameters
Namespace | Read/Write | Meaning
hiveconf  | read/write | configuration variables from hive-site.xml, e.g. hive --hiveconf hive.cli.print.header=true
system    | read/write | system variables, including JVM runtime parameters, e.g. system:user.name=root
env       | read-only  | environment variables, e.g. env:JAVA_HOME
hivevar   | read/write | user-defined variables, e.g. hive -d key=val
1. Parameters are referenced with ${}; system and env variables must be prefixed with their namespace.
2. Ways to set Hive parameters:
(1) Modify the configuration file ${HIVE_HOME}/conf/hive-site.xml.
(2) When starting the hive CLI, set values with --hiveconf key=value,
e.g.: hive --hiveconf hive.cli.print.header=true
(3) After entering the CLI, use the set command.
3. The hive set command
In the hive CLI console, parameters can be queried and set with set.
Setting a value:
set hive.cli.print.header=true;
Viewing a value:
set hive.cli.print.header;
Initial hive parameter configuration:
the .hiverc file in the current user's home directory, i.e. ~/.hiverc.
If it does not exist, create it and write the needed set statements into it; Hive loads this configuration file when it starts up.
Hive command history:
~/.hivehistory