Build the data tables
/opt/datafiles/dept.txt
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
/opt/datafiles/emp.txt
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
create external table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
create external table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datafiles/dept.txt' into table dept;
load data local inpath '/opt/datafiles/emp.txt' into table emp;
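After loading, a quick sanity check confirms that the rows arrived (the expected counts come from the data files above):

```sql
select * from dept;        -- expect 4 rows
select count(*) from emp;  -- expect 14 rows
```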
Execution plan
View the execution plan of the statement
A query that generates no MR job:
explain select * from emp;
A query that generates an MR job:
explain select deptno,avg(sal) avg_sal from emp group by deptno;
View the detailed execution plan
explain extended select * from emp;
explain extended select deptno,avg(sal) avg_sal from emp group by deptno;
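Depending on the Hive version, EXPLAIN also accepts other modifiers besides EXTENDED; two commonly available ones are sketched below (check your version's documentation before relying on them):

```sql
-- Emit the plan as JSON, which is easier for tooling to parse
explain formatted select deptno, avg(sal) avg_sal from emp group by deptno;

-- List the tables and partitions the query reads
explain dependency select * from emp;
```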
Fetch task conversion
Fetch task conversion means that for certain queries Hive does not need to run MapReduce at all. For example: SELECT * FROM employees;
In this case, Hive can simply read the files in the storage directory of the employees table and write the results directly to the console.
In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (older Hive versions defaulted to minimal). With the property set to more, full-table selects, single-column selects, and LIMIT queries all skip MapReduce.
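To make the setting permanent rather than per-session, the property can also be placed in hive-site.xml. The valid values are none, minimal, and more; the description below is a paraphrase of the one in hive-default.xml.template:

```xml
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    none: always run MapReduce;
    minimal: fetch only for SELECT *, filters on partition columns, and LIMIT;
    more: fetch for SELECT, filters, and LIMIT (no subqueries or aggregates)
  </description>
</property>
```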
Examples
(1) Set hive.fetch.task.conversion to none and execute the query statements; MapReduce jobs will be launched.
set hive.fetch.task.conversion=none;
select * from emp;
select ename from emp;
select ename from emp limit 3;
All three statements launch a MapReduce job.
(2) Set hive.fetch.task.conversion to more and execute the query statements; the following queries will not launch MapReduce jobs.
set hive.fetch.task.conversion=more;
select * from emp;
select ename from emp;
select ename from emp limit 3;
None of the three statements launches a MapReduce job.
local mode
Most Hadoop jobs need the full scalability Hadoop provides in order to process large data sets. Sometimes, however, Hive's input is very small; in that case, the overhead of launching a distributed job can far exceed the actual execution time of the query. For many such queries, Hive can run all tasks on a single machine in local mode, which significantly reduces execution time for small data sets.
Users can set hive.exec.mode.local.auto to true to let Hive automatically apply this optimization when appropriate.
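Two companion properties control when a job is considered "small enough" for local mode; the values below are illustrative, not recommendations (defaults are 128 MB of input and 4 input files):

```sql
set hive.exec.mode.local.auto=true;
-- maximum total input size before falling back to cluster execution
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- maximum number of input files before falling back to cluster execution
set hive.exec.mode.local.auto.input.files.max=10;
```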
(1) Turn on local mode and execute the query statement
set hive.exec.mode.local.auto=true;
select * from emp cluster by deptno;
14 rows selected (8.13 seconds)
(2) Turn off local mode and execute the query statement
set hive.exec.mode.local.auto=false;
select * from emp cluster by deptno;
14 rows selected (38.737 seconds)
The difference in execution time (8.13 s vs. 38.737 s) is obvious.
table optimization
(to be perfected)