Hive SQL, parameters, custom functions

Sample data:

1,小明1,lol-book-movie,beijing:shangxuetang-shanghai:pudong
2,小明2,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,lol-book-movie,beijing:shangxuetang-shanghai:pudong
4,小明4,lol-book-movie,beijing:shangxuetang-shanghai:pudong
5,小明5,lol-movie,beijing:shangxuetang-shanghai:pudong
6,小明6,lol-book-movie,beijing:shangxuetang-shanghai:pudong
7,小明7,lol-book,beijing:shangxuetang-shanghai:pudong
8,小明8,lol-book,beijing:shangxuetang-shanghai:pudong
9,小明9,lol-book-movie,beijing:shangxuetang-shanghai:pudong

Complete Hive table-creation (DDL) syntax

create [temporary] [external] table [if not exists] [db_name.]table_name    -- (note: temporary available in hive 0.14.0 and later)
  [(col_name data_type [comment col_comment], ... [constraint_specification])]
  [comment table_comment]
  [partitioned by (col_name data_type [comment col_comment], ...)]
  [clustered by (col_name, col_name, ...) [sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
  [skewed by (col_name, col_name, ...)                  -- (note: available in hive 0.10.0 and later)]
     on ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [stored as directories]
  [
   [row format row_format] 
   [stored as file_format]
     | stored by 'storage.handler.class.name' [with serdeproperties (...)]  -- (note: available in hive 0.6.0 and later)
  ]
  [location hdfs_path]
  [tblproperties (property_name=property_value, ...)]   -- (note: available in hive 0.6.0 and later)
  [as select_statement];   -- (note: available in hive 0.5.0 and later; not supported for external tables)

Creating Hive tables (internal/managed tables by default)

-- Internal (managed) table
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
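
With the delimiters above, the sample data at the top of this article maps directly onto the complex types. A minimal sketch, assuming the sample file has been saved locally as /root/hivedata/person (an illustrative path), of loading it and reading the array and map columns:

load data local inpath '/root/hivedata/person' into table person;

-- array elements are indexed from 0; map values are looked up by key
select name, likes[0], address['beijing'] from person;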

-- External table
create external table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
location '/usr';

Viewing a table's description

Syntax: describe [extended|formatted] table_name

describe formatted person;

The difference between internal and external tables

Hive internal (managed) table
       create table [if not exists] table_name
       Dropping the table deletes both the metadata and the data.
Hive external table
       create external table [if not exists] table_name location hdfs_path
       Dropping the table deletes only the metastore metadata; the table data in HDFS is not deleted.
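
A short sketch of the difference in drop behavior (person_ext is a hypothetical external table created with location '/usr'):

-- managed table: DROP removes both the metadata and the data files
drop table person;

-- external table: DROP removes only the metastore entry
drop table person_ext;
dfs -ls /usr;    -- the data files are still in HDFS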

Three ways to create a table

1. create table table_name — the regular CREATE TABLE statement

2. create table ... as select ... (CTAS) — creates the table and populates it from a query

3. create table ... like ... — semi-automatic: copies only the schema of an existing table

A sketch of the three forms follows the references below.

References: Three ways to build a table: https://blog.csdn.net/qq_26442553/article/details/85621767

                  CTAS caveats: https://blog.csdn.net/qq_26442553/article/details/79593504
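
A minimal sketch of the three forms (person2 and person3 are illustrative names):

-- 1. regular CREATE TABLE: as in the person examples above

-- 2. CTAS: the schema is derived from the query, and the result rows populate the new table
create table person2 as select id, name from person;

-- 3. LIKE: copies only the table definition of an existing table, no data
create table person3 like person;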

Purpose of partitioned tables: query optimization. Use the partition columns in queries whenever possible; if a query does not filter on the partition columns, Hive scans the whole table.

Static Partitioning

1. Static partition operations

     For an internal table, dropping a partition deletes both the partition's metadata and its data.

-- Create a partitioned table
create table p_person (
id int,
name string,
likes array<string>,
address map<string,string>
) 
partitioned by (sex string,age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- Add a partition: note that all partition columns must be specified
alter table p_person add partition (sex='man',age=20);

-- Drop a partition: only the partition column(s) being dropped need to be specified
alter table p_person drop partition (age=20);

2. Listing a table's partitions:

show partitions p_person;

Dynamic Partitioning

1. Creating a dynamically partitioned table

-- Enable the dynamic partitioning parameters first
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- First create the source table
create table person(
id int ,
name string,
age int,
sex string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- Then create the partitioned target table
create table psn_partitioned_dongtai(
id int ,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- Load data into the source table
load data local inpath '/root/hivedata/psn1' into table person;
-- Load data into the partitioned table (the age and sex partition values come from the query)
from person
insert overwrite table psn_partitioned_dongtai partition(age,sex)
select id,name,likes,address,age,sex distribute by age,sex;

2. Parameters

Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
Default: false
set hive.exec.dynamic.partition.mode=nonstrict;
Default: strict (in strict mode, at least one partition column must be static)


set hive.exec.max.dynamic.partitions.pernode;
Maximum number of dynamic partitions that may be created on each MR node (default 100)
set hive.exec.max.dynamic.partitions;
Maximum total number of dynamic partitions that may be created across all MR nodes (default 1000)
set hive.exec.max.created.files;
Maximum number of files that all MR jobs may create (default 100000)
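
If an insert fails because it produces too many partitions or files, these limits can be raised; the values below are purely illustrative:

set hive.exec.max.dynamic.partitions.pernode=200;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.max.created.files=200000;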

 

Bucketing

1. Description

      Buckets are stored as files. A bucketed table distributes rows by hashing the bucketing column and taking the value modulo the number of buckets, so different rows land in different files; each file corresponds to one bucket.
      Typical use cases: data sampling and map-side joins (map-join).

2. Creating a bucketed table

-- Enable bucketing enforcement
set hive.enforce.bucketing=true;
Note: with this parameter set, the number of buckets matches the number of reduce tasks when MR runs.
Default: false. When set to true, MR automatically sets the number of reduce tasks to the number of buckets. (Users can also set the number of reduce tasks via mapred.reduce.tasks, but this is not recommended when bucketing.)
Note: the number of buckets (output files) produced by one job equals the number of reduce tasks.


-- Create the source table
create table psn_fentong(
id int,
name string,
age int	
)
row format delimited 
fields terminated by ',';

-- Load data
load data local inpath '/root/hivedata/psn_fentong' into table psn_fentong;

-- Create the bucketed table
create table psn_fentong2(
id int,
name string,
age int
)
clustered by (age) into 4 buckets 
row format delimited fields terminated by ',';

-- Load data (the insert distributes rows into the buckets)
insert into table psn_fentong2 select id ,name ,age from psn_fentong;

-- Data sampling
select * from psn_fentong2 tablesample(bucket 2 out of 4 on age);

tablesample syntax:
tablesample(bucket x out of y)
x: the bucket to start sampling from
y: must be a multiple or a factor of the table's total number of buckets

Example:
For a table with 32 buckets in total,
which data does TABLESAMPLE(BUCKET 2 OUT OF 16) sample?
It samples 2 (32/16) buckets: bucket 2 and bucket 18 (16 + 2).
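
Applied to the 4-bucket table above, a minimal sketch: y = 2 is a factor of 4, so 4/2 = 2 buckets are sampled, namely bucket 1 and bucket 3 (1 + 2):

select * from psn_fentong2 tablesample(bucket 1 out of 2 on age);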

3. Loading data with FROM ... INSERT

from psn21
insert overwrite table psn22 partition(age, sex)  
select id, name, likes, address, age, sex distribute by age, sex;

 

Hive Lateral View

1. Description and purpose

      Usage: LATERAL VIEW is used in combination with UDTFs (table-generating functions such as explode and split).

      The UDTF first splits a column into multiple rows; LATERAL VIEW then joins those rows back to the source rows, producing a virtual table with an alias.

      It mainly solves the limitation that a SELECT using a UDTF may contain only that single UDTF: no other columns, and no additional UDTFs.

2. Example

      Count how many distinct hobbies and how many distinct cities appear in the table.

-- View the table description
describe formatted psn2;

select count(distinct(myCol1)), count(distinct(myCol2)) from psn2 
LATERAL VIEW explode(likes) myTable1 AS myCol1 
LATERAL VIEW explode(address) myTable2 AS myCol2, myCol3;
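
LATERAL VIEW also lets ordinary columns appear alongside the UDTF output, which a bare explode() in the select list does not allow; a small sketch:

-- one output row per (id, hobby) pair
select id, myCol1 from psn2 lateral view explode(likes) myTable1 as myCol1;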

Hive Views

1. Description

      Like ordinary views in relational databases, Hive also supports views.

      Characteristics:

              Materialized views are not supported

              Views are query-only; no data-loading or manipulation operations are allowed

              Creating a view stores only metadata; when the view is queried, the corresponding subquery is executed

              If the view definition contains an ORDER BY / LIMIT clause and the query on the view also has one, the view's definition takes higher priority

              Views can be defined on top of other views (nested views)

2. SQL

-- Syntax
create view [if not exists] [db_name.]view_name 
  [(column_name [comment column_comment], ...) ]
  [comment view_comment]
  [tblproperties (property_name = property_value, ...)]
  as select ... ;

-- Create a view
create view v_psn as select * from psn;

-- Query a view
select columns from view_name;

-- Drop a view
drop view v_psn;
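
Since views can be defined on top of other views, a one-line sketch (v_psn2 is an illustrative name):

create view v_psn2 as select id, name from v_psn where id > 5;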

 

Hive Index

1. Purpose: to optimize query and retrieval performance

2. SQL

-- Create an index
create index t1_index on table psn2(name) 
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild 
in table t1_index_table;
as: specifies the index handler;
in table: specifies the index table; if omitted, the index is stored in the default table default__psn2_t1_index__

-- Show indexes
show index on psn2;

-- Rebuild the index (a newly created index must be rebuilt before it takes effect)
alter index t1_index on psn2 rebuild;

-- Drop the index
Dropping an index also drops the index table that maintains it (t1_index_table).
Note: never delete the index table manually; it is maintained by the system. To remove an index, just drop the index itself.
drop index t1_index on psn2;

Ways to run Hive

1. Command-line CLI: console mode
        --Hive SQL operations
        --interacting with HDFS
            example: hive> dfs -ls /;
        --interacting with the Linux system, prefixed with !
            example: hive> !pwd;

2. Script mode (the most common in production environments)
        --hive -e 'select * from psn2'            results are printed to the console
        --hive -e 'select * from psn2' > aaa      results are written to the file aaa
        --hive -S -e 'select * from psn2' > aaa   -S means silent mode (less output); rarely used
        --hive -f file    Note: write the Hive statements into the file, then call the file with -f to execute them
        --hive -i file    Note: unlike -f, Hive does not exit after executing the file but stays in the Hive console; the variants above all exit the console once execution completes
        --source fetches the contents of a file (containing Hive statements) from the Linux filesystem into the Hive console and executes them
            hive> source /root/File
 3. JDBC

       Start the Hive service with hiveserver2 and connect to it over JDBC.
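
For example, with hiveserver2 running, the bundled beeline client can connect over JDBC; the host and port below are illustrative (10000 is the usual default):

# start the service (runs in the foreground by default)
hiveserver2

# connect with the beeline client over JDBC
beeline -u jdbc:hive2://localhost:10000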
 4. Web

      GUI interfaces (hwi, hue, etc.); hue is generally used, while hwi is rather poor.

SerDe case study: parsing log data

1. Creating the table

create table logdata (
    host string,
    identity string,
    t_user string,
    time string,
    request string,
    referer string,
    agent string)
  row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
  with serdeproperties (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
  )
  stored as textfile;

  Serializer and Deserializer

  A SerDe performs serialization and deserialization.

  It sits between the data storage and the execution engine, decoupling the two.

  Hive reads and writes row content through the SerDe declared in row format.

  'org.apache.hadoop.hive.serde2.RegexSerDe' is a regex-based row converter; other converters exist, and you can implement the SerDe interface to define your own:
    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    with serdeproperties (
        "input.regex" = "the regular expression used to parse each row"
    )
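
A minimal sketch of the table in use; /root/hivedata/access_log and its contents are hypothetical, with each line shaped to match the regex above:

-- a matching line looks like:
-- 192.168.1.1 - user1 [01/Jan/2020:00:00:00 +0800] "GET /index.html HTTP/1.1" 200 1024
load data local inpath '/root/hivedata/access_log' into table logdata;

-- each capture group of input.regex maps to one column, in order
select host, `time`, request from logdata limit 10;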

 

Loading Data

1. LOAD DATA

     When data is loaded into a table, no data transformation is performed. The load operation simply copies the data to the HDFS location corresponding to the Hive table, creating the table's directory automatically if needed.

-- Load data into a regular table
load data local inpath '/root/hivedata/person' into table person;

-- Load data into a partitioned table (static partition)
load data local inpath '/root/hivedata/person' into table p_person partition (sex='man',age=10);

2. FROM ... INSERT

      Syntax

from from_statement 
insert overwrite table tablename1 [partition (partcol1=val1, partcol2=val2 ...) [if not exists]] select_statement1 
[insert overwrite table tablename2 [partition ... [if not exists]] select_statement2] 
[insert into table tablename2 [partition ...] select_statement2] ...;

     Example

-- Static partitioned table
from person 
insert overwrite table p_person partition (sex='woman', age=60) 
select id,name,likes,address;

 

Custom Functions

Hive custom functions (UDF)
To develop a Hive UDF you only need to extend the UDF class and write an evaluate method. Example:

package com.hrj.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public class helloUDF extends UDF {
    // evaluate is resolved by reflection: one input value in, one value out
    public String evaluate(String str) {
        try {
            return "HelloWorld " + str;
        } catch (Exception e) {
            return null;
        }
    }
}

Calling a Hive custom function
Compile the Java file into helloudf.jar
hive> add jar helloudf.jar;
hive> create temporary function helloworld as 'com.hrj.hive.udf.helloUDF';
hive> select helloworld(t.col1) from t limit 10;
hive> drop temporary function helloworld;

Notes
1. helloworld is a temporary function, so the add jar and create temporary function steps must be repeated each time you enter Hive.
2. A UDF can only implement one-row-in, one-row-out operations; for many-rows-in, one-out, implement a UDAF instead.

 

Hive Parameters

Namespace    Permissions    Meaning
hiveconf     read/write     Configuration variables from hive-site.xml
                            e.g.: hive --hiveconf hive.cli.print.header=true
system       read/write     System variables, including JVM run-time parameters
                            e.g.: system:user.name=root
env          read-only      Environment variables
                            e.g.: env:JAVA_HOME
hivevar      read/write     User-defined variables
                            e.g.: hive -d key=val

1. Variables are referenced with ${}; system and env variables must be prefixed with their namespace (e.g. ${system:user.name}, ${env:JAVA_HOME}).
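
A small sketch, assuming the CLI was started as hive -d tname=psn (tname is an illustrative variable name):

select * from ${tname};            -- hivevar is the default namespace for ${}
select '${system:user.name}';      -- system variables require their prefix
select '${env:JAVA_HOME}';         -- env variables require their prefix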

2. Ways to set Hive parameters

     (1) Modify the configuration file ${HIVE_HOME}/conf/hive-site.xml

     (2) When starting the hive CLI, set parameters with --hiveconf key=value

                  e.g.: hive --hiveconf hive.cli.print.header=true

     (3) After entering the CLI, use the set command

 

3. The hive set command

      In the Hive CLI console, set can both query and set Hive's run-time parameters.

      Setting a parameter:

                   set hive.cli.print.header=true;

      Viewing a parameter:

                  set hive.cli.print.header

      Initial parameter configuration:

                 The .hiverc file in the current user's home directory,

                 i.e. ~/.hiverc

                 If it does not exist, create it and write the parameters you need into it; Hive loads this configuration file at startup.

      Hive command history:

                 ~/.hivehistory
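
A minimal example of what ~/.hiverc might contain; these particular settings are illustrative:

set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
set hive.exec.dynamic.partition.mode=nonstrict;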
