Good Programmer shares big data learning notes: Hive operation modes

Hive properties can be set in three places: 1. at the CLI with `set` (effective for the current session only); 2. in Java code (effective for the current connection); 3. in the configuration file (effective for all sessions).

The three are listed in descending order of priority. Properties that Hive needs before it starts up, such as logging properties and metastore connection properties, cannot be set at the CLI.

List all properties: hive> set; View current values, including Hadoop properties: hive> set -v; Fuzzy-search properties from the shell: hive -S -e "set" | grep current; hive -S -e "set" | grep index;

Hive variable namespaces: system, env, hivevar, hiveconf

system: system-level variables from the JVM, Hadoop, etc. (readable and writable).

hive> set system:min.limit=3;
hive> set system:min.limit;
system:min.limit=3

env: OS environment variables (e.g. HADOOP_HOME), read-only, cannot be written.

hive> set env:PWD;
env:PWD=/usr/local/hive-1.2.1

hivevar: user-defined temporary variables (readable and writable)

hive> set hivevar:min.limit=3;
hive> set hivevar:min.limit;
hivevar:min.limit=3
hive> set hivevar:min.limit=2;
hive> set hivevar:min.limit;
hivevar:min.limit=2

hiveconf: Hive configuration properties, also usable as user-defined temporary variables (readable and writable)

hive> set hiveconf:max.limit=10;
hive> set hiveconf:max.limit;
hiveconf:max.limit=10
hive> set hiveconf:max.limit=6;
hive> set hiveconf:max.limit;
hiveconf:max.limit=6

Hive operation modes: 1. run at the CLI (ad-hoc queries, development); 2. hive -S -e "hql statement"; (a single HQL query); 3. hive -S -f /path/to/file.hql; (an HQL script file)

Without variable arguments:

hive -S -e "use qf1603;select * from user1;"
hive -S -f /home/su.hql;

Hive versions before 0.9 do not support passing variables with -f; in later versions, variables can be passed with --hivevar and --hiveconf:

hive --hivevar min_limit=3 --hivevar t_n=user1 -e 'use qf1603;select * from ${hivevar:t_n} limit ${hivevar:min_limit};'

hive --hiveconf min_lit=3 -e "use qf1603;select * from user1 limit ${hiveconf:min_lit};"

hive -S --hiveconf t_n=user1 --hivevar min_limit=3 -f ./su.hql

Comments in Hive: -- comment content

-- write the first three rows of user1 to a local directory
insert overwrite local directory '/home/out/05'
select * from user1 limit 3;

### Three: Optimizing Hive

1. Environment optimization (Linux open-file handles, memory allocation for the application, load, etc.). 2. Configuration-property optimization. 3. Code optimization (HQL: try different ways of writing the same query).

1. Learn to read explain output

explain: shows the execution plan of the HQL query. explain extended: shows the execution plan and also the abstract syntax tree of the HQL (what the interpreter is doing).

explain select * from user1;
explain extended select * from user1;

An HQL statement is made up of one or more stages. Each stage corresponds to an MR job, or to an operation such as a fetch, a map join, or a limit. Stages with dependencies are executed in order and cannot run in parallel.

2. Optimizing limit (by default Hive runs the whole query and then applies the limit; these properties let it sample the input instead):

hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.enable=false
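The defaults above leave this optimization off. A minimal sketch of turning it on for a session (note: because it samples the input, some useful rows can be missed):

```sql
-- enable sampling-based limit optimization (results may come from a subset of the data)
set hive.limit.optimize.enable=true;
select * from user1 limit 10;
```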

3. Optimizing joins:

Always let the small table drive the large one (a small result set driving a large result set). When necessary, use the /*+ STREAMTABLE(alias) */ hint to mark which table is streamed, or adjust the query so a map-side join can be used: hive.auto.convert.join=true automatically converts joins against small tables (bounded by hive.mapjoin.smalltable.filesize) into map joins. Try to avoid Cartesian-product joins; even when one is unavoidable, filter with on or where. Hive joins currently support only equi-join conditions (= combined with and), nothing else.
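As a sketch of an explicit map-side join (the table names big_t and small_t are hypothetical), the MAPJOIN hint names the small table to be loaded into memory:

```sql
set hive.auto.convert.join=true;
-- hypothetical tables; small_t is loaded into memory on each mapper
select /*+ MAPJOIN(s) */ b.uid, s.uname
from big_t b
join small_t s on b.uid = s.uid;
```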

4. Use Hive local mode (the job runs inside a single JVM):

hive.exec.mode.local.auto=false
hive.exec.mode.local.auto.inputbytes.max=134217728
hive.exec.mode.local.auto.input.files.max=4
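The first default above leaves local mode off; to let Hive switch to it automatically for inputs under the size and file-count thresholds:

```sql
-- allow Hive to choose local mode when the input is small enough
set hive.exec.mode.local.auto=true;
```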

5. Hive parallel execution (stages with no interdependencies can run in parallel):

hive.exec.parallel=false
hive.exec.parallel.thread.number=8
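Parallel execution is off by default; it can be enabled per session like this:

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;  -- maximum stages to run concurrently
```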

6. Strict mode:

Hive strict mode blocks three kinds of queries: 1. queries on a partitioned table without a partition filter; 2. queries with order by but no limit; 3. join queries without an on or where condition (Cartesian products).
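A sketch of what strict mode rejects (logs is a hypothetical table partitioned by dt):

```sql
set hive.mapred.mode=strict;
-- each of the following would be rejected in strict mode:
-- select * from logs;                                      -- no partition filter
-- select * from logs where dt='2019-08-01' order by uid;   -- order by without limit
-- select * from logs a join logs b;                        -- join without on/where
```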

7. Mapper and reducer counts

Too many mappers wastes time on startup; too few leaves resources underutilized. Likewise for reducers: too many wastes startup time, too few underutilizes resources.

Mapper count, set manually:

set mapred.map.tasks=2;

Adjust the block size appropriately, thereby changing the number of splits and hence the number of mappers:

Reduce the number of mappers by merging small files:

set mapred.max.split.size=256000000;         -- about 256MB
set mapred.min.split.size.per.node=1;
set mapred.min.split.size.per.rack=1;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Reducer count (usually set manually):

set mapreduce.job.reduces=-1;  -- -1 lets Hive decide; set a positive number to fix it

8. JVM reuse in Hive

mapreduce.job.jvm.numtasks=1
set mapred.job.reuse.jvm.num.tasks=8;  -- number of tasks to run in each JVM before it is retired

9. Data skew (see: the Hive optimization .docx file)

Data skew: the values of a column are unevenly distributed. Causes: 1. the source data is already skewed; 2. certain HQL statements can cause it; 3. joins cause it very easily; 4. count(distinct col); 5. group by statements are also prone to it.

Solutions: 1. if the data itself is skewed, isolate the skewed values and inspect them directly (find the skewed keys); 2. compute the skewed data separately, then union all it with the normal data; 3. salt the skewed keys with random numbers before the join so work is balanced across tasks; 4. try rewriting the HQL while keeping the requirement unchanged.
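A sketch of approach 2 (tables a and b, and the skewed key value 0, are hypothetical):

```sql
-- normal keys join as usual
select a.uid, b.uname from a join b on a.uid = b.uid where a.uid <> 0
union all
-- the skewed key is handled separately, here as a map join against the small side
select /*+ MAPJOIN(b) */ a.uid, b.uname from a join b on a.uid = b.uid where a.uid = 0;
```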

A few property settings that help with skew:

hive.map.aggr=true                -- aggregate on the map side first
hive.groupby.skewindata=false     -- set true to add a balancing MR job for skewed group by
hive.optimize.skewjoin=false      -- set true to process skewed join keys separately

10. Controlling the number of jobs

Use the same join key across connected join queries wherever possible. A simple HQL statement usually generates one job; join, limit, and group by are each likely to generate an additional job. The two queries below are equivalent in result but may differ in the jobs they generate:

select
u.uid,
u.uname
from user1 u
where u.uid in (select l.uid from login l where l.uid=1 limit 1)
;
select
u.uid,
u.uname
from user1 u
join login l
on u.uid = l.uid
where l.uid = 1
;

Partitioning, bucketing, and indexing are themselves forms of Hive optimization.
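A sketch of a partitioned and bucketed table (the table and column names are hypothetical):

```sql
create table logs (
  uid bigint,
  uname string
)
partitioned by (dt string)         -- one directory per day
clustered by (uid) into 8 buckets; -- rows hashed on uid into 8 files per partition
```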


Origin blog.51cto.com/14479068/2427458