Hive Knowledge Summary, Part 2

Altering tables in Hive

Rename a table
alter table table_name rename to new_table_name;
Add columns
alter table dept_partition add columns(deptdesc string);
Update a column
alter table dept_partition change column deptdesc desc int;
Replace columns
alter table dept_partition replace columns(deptno string,dname string,loc string);

Basic syntax for loading data into a table

load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,....)]

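1. load data: loads data
2. local: load from the local filesystem into the Hive table; without it, the data is loaded from HDFS
3. inpath: the path of the data to load
4. overwrite: overwrite the data already in the table; without it, the data is appended
5. into table: which table to load into
6. student: the name of the target table
7. partition: load into the specified partition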
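
The options above can be combined as in the following sketch. It assumes that student is a table with columns (id int, name string) partitioned by month, as the later examples in these notes suggest; the HDFS path is a hypothetical placeholder.

```sql
-- Assumption: student(id int, name string) is partitioned by month (string).

-- Load from the local filesystem and append to the partition.
load data local inpath '/opt/module/datas/student.txt'
into table student partition (month='201709');

-- Load from HDFS and replace whatever the partition already contains.
-- The source path is hypothetical. A file loaded from HDFS is moved
-- into the table directory rather than copied.
load data inpath '/user/hive/input/student.txt'
overwrite into table student partition (month='201709');
```

Because the HDFS variant moves the file, the source path no longer holds the data after the load, while the local variant leaves the local file in place.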

Inserting data into a table with query statements

insert into student partition(month = '20190812') values(?,?) , (?,?);

insert overwrite table student partition (month = '201708') select id,name from student where month = '201709'; -- writes in overwrite mode

from  student 
insert overwrite table student partition(month='201707')  
select id,name from student where month = '201709' 
insert overwrite table student partition(month='201706')  
select id,name from student where month = '201709' ;

Creating a table from query results

create table if not exists student1 as select id,name from student;

Creating a table with a specified location on HDFS

create external table if not exists student5(
id int,name string
)
row format delimited fields terminated by '\t'
location '/student';

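Data import and export

Importing data into a specified Hive table

import table student2 partition(month ='202009') from '/usr/hive/warehouse/export/student';  -- the source path is on HDFS

Exporting data from a Hive table

insert overwrite local directory  '/usr/local/student'  select * from student;  -- exports the query result to the local filesystem

insert overwrite [local] directory '/usr/local/student' row format delimited fields terminated by '\t'  select * from student;  -- exports the query result with formatting to the local filesystem (or to HDFS when local is omitted)

Exporting to the local filesystem with a Hadoop command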

dfs -get /usr/local/data/hive/student/month=201708/00000_0 /usr/local/student.txt

Exporting to the local filesystem with a Hive shell command

bin/hive -e 'select * from default.student;' > /usr/local/hive/student.txt

Exporting to HDFS with export

export table default.student to '/usr/local/hive/student.txt'
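
Since export and import are designed to work as a pair, a round trip can be sketched as follows. This assumes the export directory on HDFS is empty or does not yet exist, and that student2 is a hypothetical table name that does not exist yet, so import creates it from the exported metadata.

```sql
-- Write the table's metadata and data to a directory on HDFS.
export table default.student to '/usr/hive/warehouse/export/student';

-- Recreate the table under a new (hypothetical) name from that directory.
import table student2 from '/usr/hive/warehouse/export/student';
```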

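Sorting

Global sort: order by

Sorting within each reducer: sort by

order by is very inefficient on large datasets, because a global sort has to pass through a single reducer. In many cases a global sort is not needed, and sort by can be used instead.

Set the number of reducers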

set mapreduce.job.reduces=3;
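
A minimal sketch of sort by in use, assuming the student table with columns id and name from the earlier sections; the local output directory is a hypothetical path.

```sql
-- Use three reducers so that sort by produces three sorted outputs.
set mapreduce.job.reduces=3;

-- Each reducer sorts its own rows by id descending;
-- the combined result is not globally ordered.
select id, name from student sort by id desc;

-- Writing to a directory makes the per-reducer output files visible.
insert overwrite local directory '/usr/local/student-sorted'
select id, name from student sort by id desc;
```

Each output file is sorted internally, but ordering across files is not guaranteed.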

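Partitioned sorting: distribute by

Rules:

distribute by assigns rows to partitions by taking the hash code of the distribution column modulo the number of reducers; rows with the same remainder go to the same partition.
Hive requires the distribute by clause to be written before the sort by clause.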
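
The two rules can be illustrated with the sketch below, again assuming the student table with columns id and name. The pmod(hash(id), 3) expression only illustrates the modulo rule; it is not guaranteed to be the exact value Hive computes internally for every input.

```sql
set mapreduce.job.reduces=3;

-- Rows with the same id hash remainder go to the same reducer;
-- within each reducer they are sorted by name descending.
-- distribute by must come before sort by.
select id, name from student
distribute by id
sort by name desc;

-- Illustration of the rule: the remainder of the hash divided by
-- the number of reducers (3 here) picks the partition.
select id, pmod(hash(id), 3) as partition_index from student;
```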

cluster by

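When distribute by and sort by use the same column, cluster by can be used instead. Sorting with cluster by is ascending only; a sort direction cannot be specified.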
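
As a short sketch, assuming the student table from the earlier sections, the following two queries distribute and sort the rows in the same way:

```sql
-- cluster by id is shorthand for distribute by id sort by id.
select id, name from student cluster by id;

select id, name from student distribute by id sort by id;
```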


Bucketing and sampling queries

Creating a bucketed table

create table school(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

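Remember to set this property (needed in Hive 1.x; from Hive 2.0 on, bucketing is always enforced):
set hive.enforce.bucketing=true;
Hive decides which bucket a record goes into by hashing the value of the bucketing column and taking the remainder after dividing by the number of buckets.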
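
One common way to populate a bucketed table is to insert from a plain staging table, so that Hive itself distributes the rows into buckets. The sketch below assumes a hypothetical staging table school_staging and a hypothetical tab-separated input file; neither appears in the original notes.

```sql
-- Hypothetical staging table with the same columns as school.
create table school_staging(id int, name string)
row format delimited fields terminated by '\t';

-- Hypothetical local input file.
load data local inpath '/opt/module/datas/school.txt' into table school_staging;

-- Let Hive hash each row into one of the 4 buckets while inserting.
set hive.enforce.bucketing=true;
insert into table school select id, name from school_staging;

-- Illustration of the hash-and-remainder rule for 4 buckets.
select id, pmod(hash(id), 4) as bucket_index from school;
```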

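Sampling queries

For very large datasets, users sometimes need a representative subset of the results rather than all of them. Hive supports this by sampling the table.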

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
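Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total number of buckets. Hive uses y to decide the sampling ratio.
For example, if the table has 4 buckets in total, then with y=2 Hive samples (4/2=) 2 buckets of data,
and with y=8 it samples (4/8=) 1/2 of one bucket. x indicates which bucket to start from; if more than one bucket is needed, each subsequent bucket number is the current bucket number plus y.
For example, if the table has 4 buckets in total, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data in total: bucket 1 (x) and bucket 3 (x+y).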
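
The cases described above can be written out as follows, assuming stu_buck is a table clustered by id into 4 buckets. The comments restate the rule from the note; they are not measured output.

```sql
-- y = 2: samples 4/2 = 2 buckets, namely bucket 1 and bucket 3 (1 + 2).
select * from stu_buck tablesample(bucket 1 out of 2 on id);

-- y = 4: samples 4/4 = 1 bucket, namely bucket 1.
select * from stu_buck tablesample(bucket 1 out of 4 on id);

-- y = 8: samples 4/8 = 1/2 of a bucket, taken from bucket 1.
select * from stu_buck tablesample(bucket 1 out of 8 on id);
```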