Hive tuning (to be improved)

Create the data tables

/opt/datafiles/dept.txt (tab-separated: deptno, dname, loc)
10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700
/opt/datafiles/emp.txt (tab-separated: empno, ename, job, mgr, hiredate, sal, comm, deptno)
7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20
7839	KING	PRESIDENT		1981-11-17	5000.00		10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
7900	JAMES	CLERK	7698	1981-12-3	950.00		30
7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
7934	MILLER	CLERK	7782	1982-1-23	1300.00		10
-- Department table backed by tab-delimited text
create external table if not exists dept(
    deptno int,
    dname  string,
    loc    int
)
row format delimited fields terminated by '\t';

-- Employee table backed by tab-delimited text
create external table if not exists emp(
    empno    int,
    ename    string,
    job      string,
    mgr      int,
    hiredate string,
    sal      double,
    comm     double,
    deptno   int
)
row format delimited fields terminated by '\t';

-- Load the local files into the tables
load data local inpath '/opt/datafiles/dept.txt' into table dept;
load data local inpath '/opt/datafiles/emp.txt' into table emp;
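
Before tuning anything, it can help to confirm that the files loaded as intended. The queries below are a minimal sanity check, assuming the two tables above were created in the current database and the files contain exactly the rows listed:

```sql
-- Row counts should match the source files: 4 departments and 14 employees.
select count(*) from dept;
select count(*) from emp;

-- Empty comm fields in emp.txt cannot be parsed as double, so they should come back as NULL.
select ename, comm from emp where comm is null;

-- KING has no manager in the file, so mgr should be NULL for that row.
select empno, ename from emp where mgr is null;
```

If the counts are off or whole columns are NULL, the delimiter in the file probably does not match the '\t' declared in the table definition.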

Execution plan

Use explain to view the execution plan of a statement.

A query that generates no MapReduce (MR) job:

explain select * from emp;

A query that generates an MR job:

explain select deptno,avg(sal) avg_sal from emp group by deptno;

View the detailed execution plan:

explain extended select * from emp;
explain extended select deptno,avg(sal) avg_sal from emp group by deptno;
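
Besides explain and explain extended, Hive accepts a few other explain keywords. The sketch below uses two of them on the same aggregate query; it assumes a Hive release recent enough to support both keywords, and the exact output format varies by version:

```sql
-- Lists the tables and partitions the query reads, as JSON.
explain dependency
select deptno, avg(sal) avg_sal from emp group by deptno;

-- Lists the query's inputs and outputs and the current user, for authorization checks.
explain authorization
select deptno, avg(sal) avg_sal from emp group by deptno;
```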

Fetch task conversion

Fetch task conversion means that, for certain queries, Hive does not need to run a MapReduce computation. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the query results to the console.
In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more; older versions of Hive default to minimal. With the property set to more, Hive does not use MapReduce for full-table selects (select *), selects on specific columns, or limit queries.
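
The difference between the levels can be sketched as follows. This assumes the emp table from the first section, the MapReduce execution engine, and the level descriptions in the Hive configuration documentation (minimal covers select *, filters on partition columns, and limit; more also covers column projections and filters on any column):

```sql
-- Show the value currently in effect for this session.
set hive.fetch.task.conversion;

-- minimal: only select *, partition-column filters, and limit skip MapReduce.
set hive.fetch.task.conversion=minimal;
select * from emp limit 3;   -- expected to run as a fetch task
select ename from emp;       -- column projection: expected to launch a MapReduce job

-- more: column projections and filters on ordinary columns are converted as well.
set hive.fetch.task.conversion=more;
select ename from emp where deptno = 20;   -- expected to run as a fetch task
```

Recent Hive versions also have hive.fetch.task.conversion.threshold, an input-size limit above which the conversion is not applied, so a very large table may still launch a job even with more.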

Example

(1) Set hive.fetch.task.conversion to none and run the queries below. Each of them executes a MapReduce job.

set hive.fetch.task.conversion=none;

select * from emp;


select ename from emp;
select ename from emp limit 3;
All of these queries execute a MapReduce job.

(2) Set hive.fetch.task.conversion to more and run the same queries. None of them executes a MapReduce job.

set hive.fetch.task.conversion=more;

select * from emp;
select ename from emp;
select ename from emp limit 3;
None of these queries executes a MapReduce job.
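
The effect of the setting can also be checked without running the queries, by comparing their plans. A minimal sketch, assuming the MapReduce execution engine:

```sql
-- With more, the plan is expected to contain only a Fetch Operator stage.
set hive.fetch.task.conversion=more;
explain select ename from emp limit 3;

-- With none, the plan is expected to gain a Map Reduce stage ahead of the fetch stage.
set hive.fetch.task.conversion=none;
explain select ename from emp limit 3;
```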

Local mode

Most Hadoop jobs need the full scalability of Hadoop to handle large data sets. Sometimes, however, the input to a Hive query is very small. In that case, the overhead of launching the job on the cluster can take much longer than the actual computation. For most of these cases, Hive can run all tasks on a single machine in local mode, which can significantly reduce execution time for small data sets.
Set hive.exec.mode.local.auto to true to let Hive apply this optimization automatically when appropriate.
(1) Turn on local mode and run the query

set hive.exec.mode.local.auto=true;
select * from emp cluster by deptno;
14 rows selected (8.13 seconds)

(2) Turn off local mode and run the query

set hive.exec.mode.local.auto=false;
select * from emp cluster by deptno;
14 rows selected (38.737 seconds)

The difference is clear: about 8 seconds with local mode on versus about 39 seconds with it off.
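
Whether automatic local mode actually applies depends on size thresholds. The sketch below sets them explicitly to the defaults listed in the Hive configuration documentation; it assumes a Hive version that uses hive.exec.mode.local.auto.input.files.max (older releases use a task-count limit instead):

```sql
-- Let Hive choose local mode automatically for small jobs.
set hive.exec.mode.local.auto=true;

-- Upper bound on total input size, in bytes, for local mode (documented default: 128 MB).
set hive.exec.mode.local.auto.inputbytes.max=134217728;

-- Upper bound on the number of input files for local mode (documented default: 4).
set hive.exec.mode.local.auto.input.files.max=4;

-- emp is a single small file, so this query should qualify for local mode.
select * from emp cluster by deptno;
```

Raising the limits lets more queries run locally, but a job that is too large for one machine will then run slower or fail, so the defaults are a reasonable starting point.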

Table optimization

(to be completed)


Origin blog.csdn.net/weixin_46322367/article/details/125031172