Hadoop 上Hive 的操作

数据dept表的准备：

--创建dept表
CREATE TABLE dept(
  deptno int, 
  dname string, 
  loc string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  STORED AS textfile;

数据文件准备：

vi detp.txt
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

数据表emp准备：

CREATE TABLE emp(
  empno int, 
  ename string, 
  job string, 
  mgr int, 
  hiredate string, 
  sal int, 
  comm int, 
  deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile;

表emp数据准备：

vi emp.txt
7369,SMITH,CLERK,7902,1980-12-17,800,null,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,null,20
7654,MARTIN,SALESMAN,7698,1981-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01,2850,null,30
7782,CLARK,MANAGER,7839,1981-06-09,2450,null,10
7788,SCOTT,ANALYST,7566,1987-04-19,3000,null,20
7839,KING,PRESIDENT,null,1981-11-17,5000,null,10
7844,TURNER,SALESMAN,7698,1981-09-08,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23,1100,null,20
7900,JAMES,CLERK,7698,1981-12-03,950,null,30
7902,FORD,ANALYST,7566,1981-12-02,3000,null,20
7934,MILLER,CLERK,7782,1982-01-23,1300,null,10

把数据文件装到表里

load data local inpath '/home/hadoop/tmp/dept.txt' overwrite into table dept;
load data local inpath '/home/hadoop/tmp/emp.txt' overwrite into table emp;

查询语句

select d.dname,d.loc,e.empno,e.ename,e.hiredate from dept d join emp e on e.deptno = d.deptno ;
  * 可以看到走的是map reduce 程序

扫描二维码关注公众号，回复： 5711315 查看本文章

二、Hive分区
hive分区的目的
* hive为了避免全表扫描，从而引进分区技术来将数据进行划分。减少不必要数据的扫描，从而提高效率。

hive分区和mysql分区的区别
* mysql分区字段用的是表内字段；而hive分区字段采用表外字段。

hive的分区技术
* hive的分区字段是一个伪字段，但是可以用来进行操作。
* 分区字段不进行区分大小写
* 分区可以是表分区或者分区的分区，可以有多个分区

hive分区根据
* 看业务，只要是某个标识能把数据区分开来。比如：年、月、日、地域、性别等

分区关键字
* partitioned by(字段)

分区本质
* 在表的目录或者是分区的目录下在创建目录，分区的目录名为指定字段=值

创建分区表：

create table if not exists u1( id int, name string, age int ) partitioned by(dt string) row format delimited fields terminated by ' ' 
stored as textfile;

数据准备：

[hadoop@master tmp]$ more u1.txt 
1 xm1 16
2 xm2 18
3 xm3 22

加载数据：

load data local inpath '/home/hadoop/tmp/u1.txt'  into table  u1 partition(dt="2018-10-14");

查询：

hive> select * from u1;
OK
1       xm1     16      2018-10-14
2       xm2     18      2018-10-14
3       xm3     22      2018-10-14
Time taken: 5.919 seconds, Fetched: 3 row(s)

查询分区：

hive> select * from u1 where dt='2018-10-15';
OK
1       xm1     16      2018-10-15
2       xm2     18      2018-10-15
3       xm3     22      2018-10-15
Time taken: 0.413 seconds, Fetched: 3 row(s)

Hive的二级分区

创建表u2

create table if not exists u2(id int,name string,age int)
partitioned by(month int,day int) row format delimited fields terminated by ' ' stored as textfile;

导入数据：

load data local inpath '/home/hadoop/tmp/u2.txt' into table u2 partition(month=9,day=14);

数据查询：

hive> select * from u2;
OK
1       xm1     16      9       14
2       xm2     18      9       14
Time taken: 0.303 seconds, Fetched: 2 row(s)

分区修改：

查看分区：

hive>  show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15

增加分区：

 > alter table u1 add partition(dt="2018-10-16");
  OK

查看新增加的分区：

hive>  show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15
dt=2018-10-16
Time taken: 0.171 seconds, Fetched: 3 row(s)

删除分区：

hive> alter table u1 drop partition(dt="2018-10-15");
Dropped the partition dt=2018-10-15
OK
Time taken: 0.576 seconds
hive>  select * from u1 ;
OK
1       xm1     16      2018-10-14
2       xm2     18      2018-10-14
3       xm3     22      2018-10-14
Time taken: 0.321 seconds, Fetched: 3 row(s)

三、hive动态分区

hive配置文件hive-site.xml 文件里有配置参数：

hive.exec.dynamic.partition=true;                 是否允许动态分区
hive.exec.dynamic.partition.mode=strict/nostrict; 动态区模式为严格模式
    strict:  严格模式，最少需要一个静态分区列(需指定固定值)
    nostrict:非严格模式，允许所有的分区字段都为动态。
hive.exec.max.dynamic.partitions=1000;        允许最大的动态分区
hive.exec.max.dynamic.partitions.pernode=100; 单个节点允许最大分区

创建动态分区表

动态分区表的创建语句与静态分区表相同，不同之处在与导入数据，静态分区表可以从本地文件导入，但是动态分区表需要使用from…insert into语句导入。

create table if not exists u3(id int,name string,age int) partitioned by(month int,day int) 
row format delimited fields terminated by ' ' stored as textfile;

导入数据，将u2表中的数据加载到u3中：

from u2
insert into table u3 partition(month,day)
select id,name,age,month,day;

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
解决方法：

要动态插入分区必需设置hive.exec.dynamic.partition.mode=nonstrict
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
hive> set hive.exec.dynamic.partition.mode=nonstrict;

然后再次插入就可以了

查询：

hive> select * from u3;
OK
1       xm1     16      9       14
2       xm2     18      9       14
Time taken: 0.451 seconds, Fetched: 2 row(s)

hive分桶

分桶目的作用
* 更加细致地划分数据；对数据进行抽样查询，较为高效；可以使查询效率提高
分桶原理关键字
* 分桶字段是表内字段，默认是对分桶的字段进行hash值，然后再模于总的桶数，得到的值则是分区桶数。
bucket
clustered by(id) into 4 buckets
分桶的本质
* 在表目录或者分区目录中创建文件。

分桶案例
* 分四个桶

create table if not exists u4(id int, name string, age int) partitioned by(month int,day int)
clustered by(id) into 4 buckets row format delimited fields terminated by ' ' stored as textfile;

Hadoop 上Hive 的操作

创建动态分区表

猜你喜欢