Big Data interview short-answer questions (c): Hive

1. What is Hive?

Hive is a Hadoop-based data warehouse tool. It maps structured data files to tables and provides SQL-like queries over them.

2. Why Hive matters (why it was originally developed)

It reduces development work for programmers and lowers the learning cost: queries are written in a SQL-like language instead of as hand-written MapReduce programs.

3. What are Hive's internal components, and what does each one do?

Metastore: stores the metadata (tables, columns, partitions, and where the data lives)
Parser: parses the SQL statement
Compiler: compiles the SQL statement into MapReduce programs
Optimizer: optimizes the MapReduce programs
Executor: submits the resulting MapReduce programs to Hadoop to run against the data on HDFS

4. Storage formats supported by Hive

Hive supports the storage formats TextFile, SequenceFile, Parquet, RCFile, ORC, and Avro.

5. Data types supported by Hive

Hive supports primitive and complex data types. Primitive types include numeric, Boolean, string, and timestamp types; complex types include array, map, struct, and union.

Primitive (basic) data types:
Type | Description | Example / remark
TINYINT | 1-byte integer | 45Y
SMALLINT | 2-byte integer | 12S
INT | 4-byte integer | 10
BIGINT | 8-byte integer | 244L
FLOAT | 4-byte single-precision floating point | 1.0
DOUBLE | 8-byte double-precision floating point | 1.0
DECIMAL | Arbitrary-precision signed decimal | DECIMAL(4, 2) covers -99.99 to 99.99
BOOLEAN | true/false | TRUE
STRING | String of unbounded length | "Abcaa" (single or double quotes both work)
VARCHAR | Variable-length string with an upper length limit | Introduced in version 0.12.0
CHAR | Fixed-length string | "A" (single or double quotes both work)
BINARY | Variable-length binary data |
TIMESTAMP | Timestamp with nanosecond precision | 122327258765
DATE | Date | '2019-11-28'

Complex types:
Type | Description | Example
ARRAY | Collection of elements of the same type | ARRAY<data_type>
MAP | Key-value pairs; the key must be a primitive type, the value may be any type | MAP<primitive_type, data_type>
STRUCT | Group of named fields that may have different types | STRUCT<col_name : data_type [COMMENT col_comment], ...>
UNION | Holds a value of exactly one of a fixed set of types | UNIONTYPE<data_type, data_type, ...>

6. Ways to enter the Hive shell

1. With the environment variables configured, run the hive command directly.
2. Start HiveServer2 with hive --service hiveserver2, then open beeline and connect with !connect jdbc:hive2://hadoop01:10000

7. Where are Hive databases and tables stored on HDFS?

By default they are stored under /user/hive/warehouse.

8. The difference between like and rlike

like performs fuzzy matching with the wildcards % and _.
rlike matches against a regular expression.
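The difference can be sketched in Python. This is only an approximation: the helper names like and rlike are made up for the illustration, Python's re module stands in for the Java regular expressions Hive uses, and escape characters in the like pattern are not handled.

```python
import re

def like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % matches any run of characters, _ matches exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # LIKE has to match the whole value, not just a part of it.
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

def rlike(value: str, regex: str) -> bool:
    """Approximate RLIKE: true when the regex matches anywhere in the value."""
    return re.search(regex, value) is not None

print(like("hadoop01", "hadoop%"))         # True
print(like("hadoop01", "01"))              # False: no wildcard, so the whole value must equal "01"
print(rlike("hadoop01", "01"))             # True: a substring match is enough
print(rlike("hadoop01", r"^hadoop\d+$"))   # True
```

In short, like only knows two wildcards and compares the whole value, while rlike accepts a full regular expression.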

9. The difference between internal (managed) tables and external tables

When an internal table is dropped, both its metadata and its data are deleted.
When an external table is dropped, only its metadata is deleted; the data itself is kept.

10. What are the advantages of partitioned tables? What are the requirements for partition fields?

Advantages:

1. Better query performance and faster retrieval
(a query on a partitioned table can scan only the partitions it cares about).
2. Better availability
(if one partition fails, the data in the other partitions is still available).
3. Easier maintenance
(if one partition fails and its data must be repaired, only that partition needs repairing).
4. Balanced I/O
(different partitions can be mapped to different disks to balance I/O and improve overall system performance).

Requirements:

A partition field must not duplicate a field that already exists in the table.
Partition fields should not contain Chinese characters (otherwise an error is raised).

11. What are the advantages of bucketed tables? What is the requirement for the bucketing field? What is the bucketing rule?

Advantages:

1. More efficient join queries (provided the join field is the bucketing field)
2. More efficient sampling

Requirement:

The bucketing field must be a field that exists in the table.

Bucketing rule:

Compute the hash of the bucketing field's value, then take the remainder after dividing by the number of buckets; the remainder determines which bucket the row is placed in.
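The rule can be sketched in a few lines of Python. The hash function is an assumption: the sketch treats an integer as its own hash and hashes a string the way Java's String.hashCode() does; the hash a given Hive version really uses may differ. The names java_string_hash and bucket_for are made up for the illustration.

```python
def java_string_hash(text: str) -> int:
    """32-bit signed hash computed the way Java's String.hashCode() does."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(value, num_buckets: int) -> int:
    """Bucket index = non-negative hash of the value modulo the number of buckets."""
    # Assumption: an integer hashes to itself; anything else is hashed as a string.
    h = value if isinstance(value, int) else java_string_hash(str(value))
    return (h & 0x7FFFFFFF) % num_buckets

print(bucket_for(7, 4))      # 3
print(bucket_for(10, 4))     # 2
print(bucket_for("abc", 4))  # 2
```

Rows with the same value in the bucketing field always land in the same bucket, which is what makes joins on that field and sampling by bucket cheaper.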

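12. Ways to import data into a table

	-- The first four statements load data files directly into a table.
	-- 1. Load data from the local Linux filesystem into Hive
	load data local inpath 'data_path' into table table_name;
	-- 2. Load data from the local Linux filesystem into Hive, overwriting existing data
	load data local inpath 'data_path' overwrite into table table_name;
	-- 3. Load data from HDFS into Hive
	load data inpath 'data_path' into table table_name;
	-- 4. Load data from HDFS into Hive, overwriting existing data
	load data inpath 'data_path' overwrite into table table_name;

	-- 5. Insert rows directly into a partitioned table
	insert into table score3 partition(month = 201807) values (001, 002, 100);
	-- 6. Multi-insert mode: read the source once and write to several tables
	from score
	insert overwrite table score_first partition(month = 201806) select s_id, c_id
	insert overwrite table score_second partition(month = 201806) select c_id, s_score;
	-- 7. Create a table and load it from a query (as select)
	create table tbname2 as select * from tbname1;
	-- 8. Specify the path of the data with location when creating the table
	create external table score6 (s_id string, c_id string, s_score int)
	row format delimited fields terminated by '\t'
	location '/myscore6';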

13. Ways to export data from a table

-- 1. Export query results to the local filesystem
	insert overwrite local directory '/export/servers/exporthive/a'
	select * from score;
-- 2. Export formatted query results to the local filesystem
	insert overwrite local directory '/export/servers/exporthive'
	row format delimited fields terminated by '\t'
	collection items terminated by '#'
	select * from student;
-- 3. Export query results to HDFS (no "local" keyword)
	insert overwrite directory '/export/servers/exporthive'
	row format delimited fields terminated by '\t'
	collection items terminated by '#'
	select * from score;
-- 4. Export to the local filesystem with a Hadoop command
	dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;
-- 5. Export with a hive shell command
	bin/hive -e "select * from yhive.score;" > /export/servers/exporthive/score.txt
-- 6. Export to HDFS with export (whole table)
	export table score to '/export/exporthive/score';
-- 7. Export with Sqoop

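14. The difference between order by and sort by

order by produces globally ordered output, but all the data goes through a single partition (one reducer).
sort by produces locally ordered output: each partition is sorted, but because there may be several partitions (reducers), the output as a whole is not globally ordered.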

15. The difference between where and having in Hive

1. where filters rows before grouping; having filters after grouping.
2. where cannot be followed by aggregate functions, while having can (where is evaluated before the aggregate functions, and having is evaluated after them).
3. where can filter on any column; having can only filter on columns that appear in the query result.
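The following sketch shows the difference on a tiny table. It assumes that Python's built-in sqlite3 module is an acceptable stand-in for Hive here, since where, group by, and having behave the same way for this query; the rows in score are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table score (s_id text, c_id text, s_score integer)")
conn.executemany(
    "insert into score values (?, ?, ?)",
    [("01", "a", 80), ("01", "b", 40), ("02", "a", 90), ("02", "b", 95), ("03", "a", 30)],
)

# where drops individual rows before grouping: scores below 60 never reach avg().
where_rows = conn.execute(
    "select s_id, avg(s_score) from score where s_score >= 60 group by s_id order by s_id"
).fetchall()

# having filters whole groups after avg() has been computed.
having_rows = conn.execute(
    "select s_id, avg(s_score) from score group by s_id having avg(s_score) >= 60 order by s_id"
).fetchall()

print(where_rows)   # [('01', 80.0), ('02', 92.5)]
print(having_rows)  # [('01', 60.0), ('02', 92.5)]
```

Student 01 has an average of 80 in the first query because the score of 40 was removed before grouping, and an average of 60 in the second because the filter only ran after the aggregate.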

16. When to use distribute by, and what it is usually combined with

Use distribute by when the rows need to be partitioned by a field. It is usually used together with sort by.
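A small Python simulation may make the combination easier to picture. It is not Hive code: the function name, the sample rows, and the partitioner (integer key modulo the number of reducers) are all assumptions made for the illustration.

```python
from collections import defaultdict

def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """Route rows to reducers by one field, then sort inside each reducer only."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumed partitioner for integer keys: key modulo the number of reducers.
        reducers[row[distribute_key] % num_reducers].append(row)
    return {
        reducer: sorted(part, key=lambda row: row[sort_key])
        for reducer, part in sorted(reducers.items())
    }

rows = [
    {"s_id": 1, "s_score": 70}, {"s_id": 2, "s_score": 55},
    {"s_id": 1, "s_score": 90}, {"s_id": 3, "s_score": 60},
    {"s_id": 2, "s_score": 85},
]
result = distribute_and_sort(rows, "s_id", "s_score", num_reducers=2)
for reducer, part in result.items():
    print(reducer, [(row["s_id"], row["s_score"]) for row in part])
# 0 [(2, 55), (2, 85)]
# 1 [(3, 60), (1, 70), (1, 90)]
```

All rows with the same s_id end up in the same reducer, and each reducer's output is sorted by s_score, but the two outputs taken together are not globally sorted.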

17. When to use cluster by

Use cluster by when the field to partition on and the field to sort on are the same field.

18. The difference between distribute by + sort by (on the same field) and cluster by

cluster by can only sort in ascending order; you cannot specify ASC or DESC as the sort order.

19. What do hive -e, -f, and -hiveconf mean?

hive -e 'sql': runs the SQL statement given inside the quotation marks
hive -f file: runs the SQL statements in the file
hive -hiveconf property=value: sets a configuration property for this run

20. In what ways can Hive parameters be set, and what is their priority?

Priority: parameter declaration > command line > configuration file
Parameter declaration: set param=value;
Command line: hive --hiveconf param=value
Configuration file: edit hive-site.xml
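The priority order can be pictured as layered lookups. The sketch below is only a model of the rule, not how Hive stores its configuration; the function name and the property values are invented for the example.

```python
from collections import ChainMap

def effective_config(session_set, command_line, config_file):
    """Layer the three sources; a mapping listed earlier wins on lookup."""
    return ChainMap(session_set, command_line, config_file)

conf = effective_config(
    session_set={"mapreduce.job.reduces": "10"},                                 # set param=value;
    command_line={"mapreduce.job.reduces": "5", "hive.exec.parallel": "true"},   # --hiveconf
    config_file={"mapreduce.job.reduces": "1", "hive.exec.parallel": "false"},   # hive-site.xml
)

print(conf["mapreduce.job.reduces"])  # 10, the set statement wins
print(conf["hive.exec.parallel"])     # true, the command line beats the file
```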

21. When writing a Hive UDF, what is the name of the method to implement?

evaluate()

22. Which data storage formats are commonly used with Hive in enterprises? Which compression format is common?

In real projects, Hive table data is generally stored as ORC or Parquet, and Snappy is generally chosen for compression.

23. Types of Hive custom functions

Custom functions fall into three categories:
UDF (User-Defined Function): one in, one out
UDAF (User-Defined Aggregation Function): aggregate function, many in, one out (e.g. count/max/min)
UDTF (User-Defined Table-Generating Function): one in, many out, such as lateral view explode()
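The three shapes can be mimicked with plain Python functions. These are analogies only: real Hive UDFs are Java classes, and the function names here are made up.

```python
def udf_upper(value):
    """UDF shape: one value in, one value out."""
    return value.upper()

def udaf_max(values):
    """UDAF shape: many values in, one value out."""
    return max(values)

def udtf_explode(values):
    """UDTF shape: one value (here a list) in, many rows out."""
    for item in values:
        yield item

print(udf_upper("hive"))               # HIVE
print(udaf_max([3, 9, 4]))             # 9
print(list(udtf_explode(["a", "b"])))  # ['a', 'b']
```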

24. What is the effect of setting fetch (hive.fetch.task.conversion) to more, and of setting it to none?

none: every SQL statement is converted into a MapReduce program.
more: basic queries are executed directly and are not converted into MapReduce programs.

25. What are the benefits of local mode?

When the amount of data is small, it improves query efficiency.

26. How to handle data skew when one key has too much data

Enable map-side partial aggregation: Hive generates two MapReduce jobs. The first aggregates the data partially, and the second produces the final summary.
Split large files into multiple small files.
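The two-stage idea can be simulated in Python. This is a sketch of the principle rather than of Hive's implementation: the function name, the random routing, and the sample keys are assumptions.

```python
import random
from collections import Counter

def two_stage_count(keys, num_reducers, seed=0):
    """Count keys in two stages so that one hot key is not handled by a single reducer."""
    rng = random.Random(seed)
    # Stage 1: spread rows randomly over the reducers; each one pre-aggregates its share.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: merge the small partial results by key into the final totals.
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return partials, totals

keys = ["hot"] * 1000 + ["a", "b", "c"]  # made-up skewed input
partials, totals = two_stage_count(keys, num_reducers=4)
print(totals["hot"])                   # 1000
print([p["hot"] for p in partials])    # the hot key is shared among the four reducers
```

The second stage only has to combine a handful of partial counts per key, so the hot key no longer overloads one reducer.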

27. How to write an alternative to count(distinct)

-- Deduplicate first, then count the total
select count(distinct id) from bigtable;
-- Alternative
select count(id) from (select id from bigtable group by id) a;
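A quick check that the two statements return the same number, using Python's built-in sqlite3 module as a stand-in for Hive (an assumption; the sample ids are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table bigtable (id integer)")
# 100 rows that contain only the 7 distinct ids 0..6.
conn.executemany("insert into bigtable values (?)", [(i % 7,) for i in range(100)])

distinct_count = conn.execute("select count(distinct id) from bigtable").fetchone()[0]
grouped_count = conn.execute(
    "select count(id) from (select id from bigtable group by id) a"
).fetchone()[0]

print(distinct_count, grouped_count)  # 7 7
```

The results are the same; what changes in Hive is how the work is distributed, because the group by step can be spread over several reducers before the final count.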

28. How to use partition pruning and column pruning

Partition pruning: read only the partitions you need.
Column pruning: read only the columns you need.
In short, take only what you use.

29. How to understand dynamic partitioning

When data is inserted from one partitioned table into a second table that is partitioned by the same rule, all partitions of the first table can be copied to the second. The partition does not need to be specified when loading the second table; the partition values of the first table are used directly.

30. With skewed data, how to write the output as 10 files

Set the number of reduce tasks to 10, then use one of:
1. distribute by (field)
2. distribute by rand()

31. How is the number of reducers calculated?

Formula:
N = min(parameter 2, total input data size / parameter 1)
Parameter 1: the maximum amount of data each reducer processes (hive.exec.reducers.bytes.per.reducer)
Parameter 2: the maximum number of reducers for each job (hive.exec.reducers.max)
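The formula is easy to try out in Python. Rounding up and returning at least one reducer are assumptions of this sketch, and the numbers passed in are example values, not necessarily the defaults of any particular Hive version.

```python
import math

def estimate_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
    """N = min(max_reducers, total_input_bytes / bytes_per_reducer), rounded up, at least 1."""
    needed = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, needed))

MIB = 1024 ** 2
GIB = 1024 ** 3

print(estimate_reducers(10 * GIB, 256 * MIB, 1009))    # 40
print(estimate_reducers(1024 * GIB, 256 * MIB, 1009))  # 1009, capped by parameter 2
```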

32. What are the benefits of parallel execution?

When stages do not depend on each other, running them in parallel shortens the execution time of the whole job and improves query efficiency.

33. Which statements cannot be executed in strict mode?

1. Queries that produce a Cartesian product
2. Queries that use order by without a limit clause
3. Queries on a partitioned table that would scan all partitions

34. What are the advantages and disadvantages of JVM reuse?

Advantages:
1. It reduces task start-up overhead and makes tasks more efficient.
2. One JVM can be used by multiple tasks.
Disadvantage: the JVM is not released until the whole job finishes, so it is occupied for a long time. Resources that are held but not used are wasted, which can leave the cluster short of resources.

35. What is Hive local mode?

The task is executed "locally" on the node where the SQL statement was submitted; it is not distributed to the cluster.

36. What is local computation in MapReduce?

After the data has been stored on HDFS and the analysis program has been written, the program is distributed for execution. When it is distributed, priority is given to the nodes where the data it uses resides.

37. Optimization: filter first, then join

-- 1. Join first, then filter
select a.id from bigtable a join bigtable b on (b.id > 10 and a.id = b.id);
-- 2. Filter first, then join
select a.id from bigtable a join (select id from bigtable where id > 10) b on a.id = b.id;
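A quick check that both forms return the same rows, again using Python's built-in sqlite3 module as a stand-in for Hive (an assumption; the 20 sample ids are made up). The first table is given the alias a in both queries so that a.id resolves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table bigtable (id integer)")
conn.executemany("insert into bigtable values (?)", [(i,) for i in range(1, 21)])

join_then_filter = conn.execute(
    "select a.id from bigtable a join bigtable b on (b.id > 10 and a.id = b.id) order by a.id"
).fetchall()
filter_then_join = conn.execute(
    "select a.id from bigtable a join (select id from bigtable where id > 10) b "
    "on a.id = b.id order by a.id"
).fetchall()

print(join_then_filter == filter_then_join)  # True
print(len(filter_then_join))                 # 10
```

The answers match; the benefit of the second form is that fewer rows take part in the join.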

38. Factors that affect the number of map tasks

With small files: the number of files.
With large files: the number of blocks.
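As a rough sketch, the two cases collapse into one estimate: each file needs at least one map task, and a large file needs one per block. The function below assumes that the split size equals the block size and that small files are not combined into a single split, which depends on the input format in use; the name and the sizes are illustrative.

```python
import math

def estimate_map_tasks(file_sizes, block_size):
    """One map task per block, so every non-empty file contributes at least one."""
    return sum(math.ceil(size / block_size) for size in file_sizes if size > 0)

MIB = 1024 ** 2
BLOCK = 128 * MIB  # assumed block size

print(estimate_map_tasks([1 * MIB, 2 * MIB, 3 * MIB], BLOCK))  # 3, one per small file
print(estimate_map_tasks([300 * MIB], BLOCK))                  # 3, one per block
```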

Finally: my knowledge is limited, so if anything here is wrong, please leave a comment.


Origin blog.csdn.net/hongchenshijie/article/details/103289115