Hive Learning Notes (11) - New Hive Features: import, export; order by, sort by, distribute by, cluster by


1 Hive import and export

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport

1.1 export

  • The export target path here refers to a path in HDFS
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
  TO 'export_target_path' [ FOR replication('eventid') ]

export table emp to '/user/hive/warehouse/test_export/emp_exp';
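The exported directory contains a _metadata file plus a data subdirectory, and can be loaded back into Hive with IMPORT. A minimal sketch, assuming the export above succeeded (emp_imported is a hypothetical target table name):

IMPORT TABLE emp_imported FROM '/user/hive/warehouse/test_export/emp_exp';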

2 Sort

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

  • order by: global ordering over the whole result set, performed by a single reducer
  • sort by: sorts within each reducer only; the result is not globally ordered
  • distribute by: like the partition step in MapReduce, controls which reducer each row goes to; used together with sort by
  • cluster by: shorthand for when the distribute by and sort by columns are the same

2.1 order by (use with caution)

  • Orders the entire data set globally, e.g.:
    SELECT * FROM emp order by empno desc;
hive (default)> SELECT * FROM emp order by empno desc;
Query ID = hadoop_20190117103838_eb7585e6-fbbc-4fc8-978a-fb5216cc70d0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1547692669080_0013, Tracking URL = http://node1:8088/proxy/application_1547692669080_0013/
Kill Command = /home/hadoop/apps/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1547692669080_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-01-17 22:04:54,709 Stage-1 map = 0%,  reduce = 0%
2019-01-17 22:04:58,936 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.34 sec
2019-01-17 22:05:05,106 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1547692669080_0013
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.78 sec   HDFS Read: 8185 HDFS Write: 674 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 780 msec
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7934	Miller	clerk	7782	1982-1-23	1300.0	10.0	NULL
7902	Ford	analyst	7566	1981-12-3	3000.0	20.0	NULL
7900	Jamas	clerk	7698	1981-12-3	950.0	30.0	NULL
7876	Adams	clerk	7788	1987-5-23	1100.0	20.0	NULL
7844	Turner	salesman	7698	1981-9-8	1500.0	0.0	30
7839	King	president	NULL	5000.0	10.0	NULL	NULL
7788	Scott	analyst	7566	1987-4-19	3000.0	20.0	NULL
7782	Clark	manager	7839	1981-6-9	2450.0	10.0	NULL
7698	Blake	manager	7839	1981-5-1	2850.0	30.0	NULL
7654	Martin	salesman	7698	1981-9-28	1250.0	1400.0	30
7566	Jones	manager	7839	1981-4-2	2975.0	20.0	NULL
7521	Ward	salsesman	7698	1981-2-22	1250.0	500.0	30
7499	Allen	salesman	7698	1981-2-20	1600.0	300.0	30
7369	Smith	clerk	7902	1980-12-17	800.0	20.0	NULL
Time taken: 17.23 seconds, Fetched: 14 row(s)
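
Because the whole result set funnels through a single reducer, order by can be very slow on large tables, which is why it should be used with caution. In strict mode Hive even rejects an order by without a limit. A minimal sketch:

set hive.mapred.mode=strict;
-- rejected in strict mode: order by without limit
-- SELECT * FROM emp order by empno desc;
SELECT * FROM emp order by empno desc limit 5;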

2.2 sort by

  • Sorts the data within each reducer; there is no global order across reducers
set mapreduce.job.reduces=3;

INSERT overwrite local directory '/home/hadoop/tempdata/sortby_res'
SELECT * FROM emp sort by empno asc;

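With mapreduce.job.reduces=3 the output directory contains three files (000000_0, 000001_0, 000002_0), each sorted by empno internally but not relative to the others. Setting a single reducer makes sort by degenerate into a total order, which is a quick way to see the difference (a sketch reusing the same table):

set mapreduce.job.reduces=1;
SELECT * FROM emp sort by empno asc;   -- one reducer, so the result is fully ordered, same as order by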

2.3 distribute by

  • Partitions the data, similar to the partition step in MapReduce, and is used together with sort by
  • distribute by must appear before sort by
INSERT overwrite local directory '/home/hadoop/tempdata/distby_res'
SELECT * FROM emp distribute by deptno sort by empno asc;

2.4 cluster by

  • When the distribute by and sort by columns are the same, cluster by can be used as a shorthand; note that cluster by sorts in ascending order only (ASC/DESC cannot be specified)
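
A minimal sketch of the equivalence, following the same output convention as above (clusterby_res is a hypothetical directory name):

INSERT overwrite local directory '/home/hadoop/tempdata/clusterby_res'
SELECT * FROM emp cluster by deptno;

-- equivalent to:
-- SELECT * FROM emp distribute by deptno sort by deptno;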
