Article Directory
- 1. What is Hive?
- 2. Why Hive was developed
- 3. Hive's internal components and their roles
- 4. Storage formats supported by Hive
- 5. Data types supported by Hive
- 6. Ways to enter the Hive shell
- 7. Where Hive databases and tables are stored on HDFS
- 8. Difference between like and rlike
- 9. Difference between internal tables and external tables
- 10. What are the advantages of a partitioned table? What are the requirements for partition fields?
- 11. What are the advantages of a bucketed table? What are the requirements for the bucketing field? What is the bucketing rule?
- 12. Ways to import data into a table
- 13. Ways to export data from a table
- 14. Difference between order by and sort by
- 15. Difference between where and having in Hive
- 16. When to use distribute by, and what it is usually combined with
- 17. When to use cluster by
- 18. Difference between distribute by + sort by (same field) and cluster by
- 19. What hive -e / -f / -hiveconf each mean
- 20. Ways to declare Hive parameters, and their priority
- 21. What method name is used when writing a Hive UDF?
- 22. Which data storage formats and compression formats are commonly used with Hive in enterprises?
- 23. Types of Hive custom functions
- 24. What effect does setting fetch task conversion to more have, and what effect does none have?
- 25. What are the benefits of local mode?
- 26. How to handle data skew when one key has too much data
- 27. How to write an alternative to count(distinct)
- 28. How to use partition pruning and column pruning
- 29. How to understand dynamic partitioning
- 30. With skewed data, how to write the data into 10 files
- 31. How the number of reducers is calculated
- 32. What are the benefits of parallel execution?
- 33. Which statements cannot be executed in strict mode?
- 34. What are the advantages and disadvantages of JVM reuse?
- 35. What is Hive local mode?
- 36. What is MapReduce local computation?
- 36. Join optimization: filter first, then join
- 37. Factors that affect the number of map tasks
1. What is Hive?
Hive is a Hadoop-based data warehouse tool. It maps structured data files to tables and provides SQL-like queries over them.
2. Why Hive was developed
Querying with SQL instead of hand-written MapReduce programs reduces the development work for programmers and lowers the learning cost.
3. Hive's internal components and their roles
Metastore: holds the metadata
Parser: parses the SQL statement
Compiler: compiles the SQL statement into MapReduce programs
Optimizer: optimizes the MapReduce programs
Executor: submits the resulting MapReduce programs to run against the data on HDFS
4. Storage formats supported by Hive
Hive supports the storage formats TextFile, SequenceFile, Parquet, RCFile, ORC, and Avro.
5. Data types supported by Hive
Hive supports primitive and complex data types. Primitive types include numeric, Boolean, string, and timestamp types. Complex types include array, map, struct, and union.
Primitive (basic) data types:
Type | Description | Example / remark |
---|---|---|
TINYINT | 1-byte integer | 45Y |
SMALLINT | 2-byte integer | 12S |
INT | 4-byte integer | 10 |
BIGINT | 8-byte integer | 244L |
FLOAT | 4-byte single-precision floating point | 1.0 |
DOUBLE | 8-byte double-precision floating point | 1.0 |
DECIMAL | Arbitrary-precision signed decimal | DECIMAL(4, 2) covers -99.99 to 99.99 |
BOOLEAN | true/false | TRUE |
STRING | String of variable length with no declared limit | "Abcaa"; single and double quotes are interchangeable |
VARCHAR | Variable-length string with an upper limit | Introduced in version 0.12.0 |
CHAR | Fixed-length string | "A"; single and double quotes are interchangeable |
BINARY | Variable-length binary data | |
TIMESTAMP | Timestamp with nanosecond precision | 122327258765 |
DATE | Date | '2019-11-28' |
Complex types:
Type | Description | Syntax |
---|---|---|
ARRAY | Collection of elements of the same type | ARRAY<data_type> |
MAP | Key-value pairs; the key must be a primitive type, the value may be any type | MAP<primitive_type, data_type> |
STRUCT | Named fields that may have different types | STRUCT<col_name : data_type [COMMENT col_comment], …> |
UNION | A value that holds one of a fixed set of types | UNIONTYPE<data_type, data_type, …> |
6. Ways to enter the Hive shell
1. With the environment variables configured, run the hive command directly
2. Run hive --service hiveserver2, then open beeline and connect with !connect jdbc:hive2://hadoop01:10000
7. Where Hive databases and tables are stored on HDFS
By default they are stored under /user/hive/warehouse
8. Difference between like and rlike
like performs fuzzy matching with SQL wildcards
rlike matches with regular expressions
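A short sketch of the difference, written as standalone select statements. The literal strings are made up for illustration, and a select without a from clause assumes Hive 0.13 or later.

```sql
-- like uses SQL wildcards and must match the whole string:
-- % matches any run of characters, _ matches exactly one character
select 'hadoop' like 'had%';     -- true
select 'hadoop' like 'doo';      -- false, the pattern does not cover the whole string

-- rlike uses a Java regular expression and is true if any part of the string matches
select 'hadoop' rlike 'do+';     -- true
select 'hadoop' rlike '^h.*p$';  -- true
```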
9. Difference between internal tables and external tables
Dropping an internal table deletes both the table's metadata and its data.
Dropping an external table deletes only the metadata; the data itself is not deleted.
10. What are the advantages of a partitioned table? What are the requirements for partition fields?
Advantages:
1. Better query performance
(a query on a partitioned table can search only the partitions it cares about, which speeds up retrieval).
2. Better availability
(if one partition fails, the data in the other partitions is still available).
3. Easier maintenance
(if one partition fails and its data needs to be repaired, only that partition has to be repaired).
4. Balanced I/O
(different partitions can be mapped to different disks to balance I/O and improve overall system performance).
Requirements:
A partition field must not duplicate a field that already exists in the table.
Partition fields should not contain Chinese characters (otherwise an error occurs).
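As a rough sketch of these rules, the statements below create and fill a small partitioned table. The table name score_part and its columns are hypothetical and chosen only for illustration; insert ... values assumes Hive 0.14 or later.

```sql
-- month is the partition field, so it is declared in partitioned by
-- and must not be repeated in the regular column list
create table if not exists score_part (s_id string, c_id string, s_score int)
partitioned by (month string)
row format delimited fields terminated by '\t';

-- Each partition gets its own subdirectory under the table directory,
-- for example .../score_part/month=201807
insert into table score_part partition (month = '201807') values ('001', '002', 100);

-- List the partitions that now exist
show partitions score_part;
```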
11. What are the advantages of a bucketed table? What are the requirements for the bucketing field? What is the bucketing rule?
Advantages:
1. More efficient join queries (provided that the join field is the bucketing field)
2. More efficient sampling
Requirement:
The bucketing field must be a field in the table
Bucketing rule:
Compute the hash of the bucketing field, then take the remainder modulo the number of buckets; the remainder decides which bucket the row is placed in.
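The following is a minimal sketch of a bucketed table and of bucket-based sampling. The table name score_bucket is hypothetical, and the source table score is assumed to have the columns s_id, c_id, and s_score.

```sql
-- The bucketing field s_id is one of the table's own columns
create table if not exists score_bucket (s_id string, c_id string, s_score int)
clustered by (s_id) into 4 buckets
row format delimited fields terminated by '\t';

-- On Hive versions before 2.0, first run: set hive.enforce.bucketing=true;
-- Each row lands in bucket hash(s_id) mod 4
insert overwrite table score_bucket
select s_id, c_id, s_score from score;

-- Sampling reads one bucket out of four instead of the whole table
select * from score_bucket tablesample (bucket 1 out of 4 on s_id);
```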
12. Ways to import data into a table
-- (The following four all load data directly into a table)
-- 1. Load data from the local Linux file system into Hive
load data local inpath 'data_path' into table table_name;
-- 2. Load data from the local Linux file system into Hive, overwriting existing data
load data local inpath 'data_path' overwrite into table table_name;
-- 3. Load data from HDFS into Hive
load data inpath 'data_path' into table table_name;
-- 4. Load data from HDFS into Hive, overwriting existing data
load data inpath 'data_path' overwrite into table table_name;
-- 5. Insert data directly into a partitioned table
insert into table score3 partition(month='201807') values ('001','002','100');
-- 6. Multi-insert mode
from score
insert overwrite table score_first partition(month='201806') select s_id,c_id
insert overwrite table score_second partition(month='201806') select c_id,s_score;
-- 7. Create a table and load data from a query (as select)
create table tbname2 as select * from tbname1;
-- 8. Specify the data path with location when creating the table
create external table score6 (s_id string,c_id string,s_score int)
row format delimited fields terminated by '\t'
location '/myscore6';
13. Ways to export data from a table
-- 1. Export query results to the local file system
insert overwrite local directory '/export/servers/exporthive/a'
select * from score;
-- 2. Export formatted query results to the local file system
insert overwrite local directory '/export/servers/exporthive'
row format delimited fields terminated by '\t'
collection items terminated by '#'
select * from student;
-- 3. Export query results to HDFS (no local keyword)
insert overwrite directory '/export/servers/exporthive'
row format delimited fields terminated by '\t'
collection items terminated by '#'
select * from score;
-- 4. Export to the local file system with a Hadoop command
dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;
-- 5. Export with a hive shell command
bin/hive -e "select * from yhive.score;" > /export/servers/exporthive/score.txt
-- 6. Export to HDFS with export (whole-table export)
export table score to '/export/exporthive/score';
-- 7. Export with Sqoop
14. Difference between order by and sort by
order by produces globally ordered output, but with only one partition (a single reducer).
sort by produces output that is ordered within each partition but not globally; there may be multiple partitions.
15. Difference between where and having in Hive
1. where filters rows before grouping; having filters after grouping.
2. where cannot be followed by aggregate functions, but having can (where is evaluated before the aggregate functions, having after them).
3. where can filter on any column; having can only filter on columns of the query result.
16. When to use distribute by, and what it is usually combined with
Use it when rows need to be partitioned by a field; distribute by is usually used together with sort by.
17. When to use cluster by
Use cluster by when the partitioning field and the sort field are the same field.
18. Difference between distribute by + sort by (same field) and cluster by
cluster by can only sort in ascending order; you cannot specify ASC or DESC.
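The three clauses can be compared side by side in a short sketch. It assumes the score table has the columns s_id, c_id, and s_score; the reducer count of 3 is an arbitrary value chosen so that the distribution is visible.

```sql
-- Use several reducers so that rows are actually spread out
set mapreduce.job.reduces=3;

-- Rows with the same s_id go to the same reducer;
-- within each reducer the rows are sorted by s_score, highest first
select s_id, c_id, s_score from score distribute by s_id sort by s_score desc;

-- When both clauses use the same field in ascending order,
-- these two queries are equivalent
select s_id, c_id, s_score from score distribute by s_id sort by s_id;
select s_id, c_id, s_score from score cluster by s_id;

-- Descending order on that field needs the long form,
-- because cluster by accepts neither ASC nor DESC
select s_id, c_id, s_score from score distribute by s_id sort by s_id desc;
```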
19. What hive -e / -f / -hiveconf each mean
hive -e 'sql': runs the SQL query given in quotation marks
hive -f file: runs the SQL statements in a file
hive -hiveconf: sets configuration properties at runtime
20. Ways to declare Hive parameters, and their priority
Priority: parameter declaration > command line > configuration file
Parameter declaration: set param=value;
Command line: hive --hiveconf param=value
Configuration file: edit hive-site.xml
21. What method name is used when writing a Hive UDF?
evaluate()
22. Which data storage formats and compression formats are commonly used with Hive in enterprises?
In real projects, Hive table data is generally stored as ORC or Parquet, and Snappy is generally chosen for compression.
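As an illustration, the two combinations could be declared as follows. The table names score_orc and score_parquet and their columns are hypothetical.

```sql
-- ORC storage with Snappy compression
create table if not exists score_orc (s_id string, c_id string, s_score int)
stored as orc
tblproperties ("orc.compress" = "SNAPPY");

-- Parquet storage with Snappy compression
create table if not exists score_parquet (s_id string, c_id string, s_score int)
stored as parquet
tblproperties ("parquet.compression" = "SNAPPY");
```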
23. Types of Hive custom functions
Custom functions fall into three categories:
UDF (User Defined Function): one row in, one row out
UDAF (User Defined Aggregation Function): aggregate function, many rows in, one row out (e.g. count / max / min)
UDTF (User Defined Table-Generating Function): one row in, many rows out, e.g. lateral view explode()
24. What effect does setting fetch task conversion to more have, and what effect does none have?
With none, all SQL statements are converted into MapReduce programs.
With more, basic queries are executed directly and are not converted into MapReduce programs.
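The setting in question is hive.fetch.task.conversion (the property name does not appear in the text above). A minimal sketch, assuming a score table with the columns s_id and s_score:

```sql
-- more: simple select / filter / limit queries are answered by a fetch task
set hive.fetch.task.conversion=more;
select s_id, s_score from score where s_score > 60 limit 10;

-- none: the same query is now submitted as a MapReduce job
set hive.fetch.task.conversion=none;
select s_id, s_score from score where s_score > 60 limit 10;
```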
25. What are the benefits of local mode?
When the amount of data is small, it improves query efficiency.
26. How to handle data skew when one key has too much data
Enable map-side partial aggregation with load balancing for skewed data. Hive then creates two MapReduce jobs: the first partially aggregates the data, and the second produces the final aggregation.
Split a large file into multiple small files.
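A sketch of the settings usually associated with this approach. The property names hive.map.aggr and hive.groupby.skewindata are not named in the text above, and the query on bigtable is only an example.

```sql
-- Aggregate partially on the map side before data is sent to the reducers
set hive.map.aggr=true;

-- Balance skewed keys with two jobs: the first spreads rows randomly over the
-- reducers and aggregates partially, the second groups by the real key
set hive.groupby.skewindata=true;

select id, count(*) as cnt from bigtable group by id;
```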
27. How to write an alternative to count(distinct)
-- Deduplicate first, then count
select count(distinct id) from bigtable;
-- Alternative
select count(id) from (select id from bigtable group by id) a;
28. How to use partition pruning and column pruning
Partition pruning: read only the partitions you need
Column pruning: read only the columns you need
In short, take only what you use.
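For example, against the partitioned table score3 used earlier (its column names s_id and s_score are assumed here), a pruned query might look like this:

```sql
-- Reads every column of every partition
select * from score3;

-- Column pruning: only s_id and s_score are read
-- Partition pruning: only the month=201807 partition is scanned
select s_id, s_score from score3 where month = '201807';
```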
29. How to understand dynamic partitioning
When data is inserted from one partitioned table into a second table that follows the same partitioning rule, all partitions of the first table are carried over to the second table. The partition does not need to be specified when loading the second table; the partition values of the first table are used directly.
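A minimal sketch of a dynamic-partition insert. The target table score_copy is hypothetical, and score3 is assumed to have the columns s_id, c_id, and s_score plus the partition field month.

```sql
-- Allow dynamic partitions, including the case where no partition value is given statically
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Same columns and same partition field as the source table
create table if not exists score_copy like score3;

-- No partition value is written out; the partition field goes last in the select list,
-- and each row is routed to the partition matching its month value
insert overwrite table score_copy partition (month)
select s_id, c_id, s_score, month from score3;
```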
30. With skewed data, how to write the data into 10 files
Set the number of reduce tasks to 10, then use one of:
1. distribute by (field)
2. distribute by rand()
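The second variant might be written as below. The target table score_out is hypothetical, and the result of exactly 10 files assumes that no small-file merge step runs afterwards.

```sql
-- One output file per reducer
set mapreduce.job.reduces=10;

create table if not exists score_out like score;

-- rand() spreads the rows evenly over the reducers,
-- so a skewed key no longer ends up in a single file
insert overwrite table score_out
select * from score distribute by rand();
```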
31. How the number of reducers is calculated
Formula:
N = min(parameter 2, total input data size / parameter 1)
Parameter 1: the maximum amount of data processed by each reducer
Parameter 2: the maximum number of reducers for each job
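In Hive these two parameters correspond to hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max. The values below are illustrative, not recommendations, and the 10 GB input size is an assumed figure for the worked example.

```sql
-- Parameter 1: maximum amount of data per reducer, in bytes (256 MB here)
set hive.exec.reducers.bytes.per.reducer=268435456;

-- Parameter 2: maximum number of reducers for a job
set hive.exec.reducers.max=1009;

-- Worked example with 10 GB of input:
--   10 GB / 256 MB = 40
--   N = min(1009, 40) = 40 reducers
```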
32. What are the benefits of parallel execution?
When stages do not depend on each other, they can run in parallel, which shortens the execution time of the whole job and improves query efficiency.
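A sketch of how this is typically switched on. The thread count of 8 is an example value, and the query only serves to show two stages that do not depend on each other.

```sql
-- Let independent stages of one query run at the same time
set hive.exec.parallel=true;

-- Upper bound on the number of stages running in parallel
set hive.exec.parallel.thread.number=8;

-- The two branches of the union all do not depend on each other,
-- so their jobs can be executed in parallel
select t.src, t.cnt
from (
  select 'score' as src, count(*) as cnt from score
  union all
  select 'student' as src, count(*) as cnt from student
) t;
```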
33. Which statements cannot be executed in strict mode?
1. Queries that produce a Cartesian product
2. Queries that use order by without a limit clause
3. Queries on a partitioned table that would scan all partitions (no partition filter)
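The three restrictions can be illustrated as follows, assuming a Hive version where strict mode is switched on through hive.mapred.mode (newer releases split it into separate hive.strict.checks.* settings). The column s_score on score3 is assumed.

```sql
set hive.mapred.mode=strict;

-- Each of these is rejected in strict mode:
--   select * from score a join student b;     (Cartesian product, no join condition)
--   select * from score order by s_score;     (order by without limit)
--   select * from score3;                     (partitioned table, no partition filter)

-- Accepted: partition filter present, order by combined with limit
select * from score3 where month = '201807' order by s_score limit 10;
```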
34. What are the advantages and disadvantages of JVM reuse?
Advantages: 1. It reduces task start-up overhead and improves task efficiency.
2. One JVM can be used by multiple tasks.
Disadvantage: the JVM is not released until the whole job finishes, so it stays occupied for a long time. This can lead to insufficient resources and wastes resources that are held but not used.
35. What is Hive local mode?
The job is executed "locally" on the node where the SQL statement is submitted, and is not distributed to the cluster.
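Local mode is usually left to Hive to choose automatically. The following sketch shows the settings involved; the property names are not given in the text above, and the thresholds shown are example values.

```sql
-- Let Hive decide per query whether it is small enough to run locally
set hive.exec.mode.local.auto=true;

-- Run locally only if the total input is below this size in bytes (128 MB here)
set hive.exec.mode.local.auto.inputbytes.max=134217728;

-- ... and only if the number of input files does not exceed this value
set hive.exec.mode.local.auto.input.files.max=4;
```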
36. What is MapReduce local computation?
After the data is stored on HDFS, you write a program to analyze it. When that program is distributed for execution, it is preferentially placed on the nodes where the data it uses resides.
36. Join optimization: filter first, then join
-- 1. Join first, then filter
select a.id from bigtable a join bigtable b on (b.id>10 and a.id =b.id);
-- 2. Filter first, then join
select a.id from bigtable a join (select id from bigtable where id > 10) b on a.id = b.id;
37. Factors that affect the number of map tasks
With small files: the number of files
With large files: the number of blocks
Finally: my knowledge is limited, so if anything here is wrong, please leave a comment.