Big Data Interview Questions (5): Hive Interview Questions


Hive interview questions: contents

1. How do you resolve data skew in Hive join queries?
2. Describe Hive's features. What are the similarities and differences between Hive and an RDBMS?
3. What do Sort By, Order By, Cluster By, and Distribute By each mean in Hive?
4. Briefly describe NULL in databases, explain how NULL is stored at the lowest level in Hive, and explain the meaning of the statement select a.* from t1 a left outer join t2 b on a.id = b.id where b.id is null;
5. Describe how the Hive functions split, coalesce, and collect_list are used (examples welcome).
6. In what ways can Hive store its metadata, and what are the characteristics of each?
7. What is the difference between internal and external tables in Hive?
8. How is HiveQL (HSQL) converted into MapReduce? (☆☆☆☆☆)
9. How does Hive interact with its underlying database? (☆☆☆☆☆)
10. In a Hive join, in what order should the large table and the small table be placed?
11. How would you implement a Hive two-table join with MapReduce? (☆☆☆☆☆)
12. What does Hive use in place of IN queries?
13. Do all Hive queries have to run MapReduce jobs?
14. Hive functions: what is the difference between UDF, UDAF, and UDTF?
15. What is your understanding of Hive bucketed tables?
16. What is the process for writing a custom Hive UDF?


1. How do you resolve data skew in Hive join queries? (☆☆☆☆☆)

1) Causes of skew:
       Map output is distributed to the reducers by the hash of the key. When the keys are unevenly distributed, whether because of the nature of the business data, a poorly considered table design, or similar reasons, the amount of data handled by each reducer differs too much.
(1) uneven key distribution;
(2) characteristics of the business data itself;
(3) poorly considered table design;
(4) some SQL statements inherently produce data skew;
       How to avoid it: for skew caused by empty (NULL) keys, assign the key a random value.
2) Solutions
(1) Parameter tuning:
       hive.map.aggr = true
       hive.groupby.skewindata = true
       These enable load balancing when the data is skewed. When hive.groupby.skewindata is set to true, the generated query plan has two MR jobs. In the first MR job, the Map output is distributed randomly across the reducers; each reducer performs a partial aggregation and outputs the result. Because of this, rows with the same Group By key may be sent to different reducers, which achieves load balancing. The second MR job then distributes the pre-aggregated results to the reducers by Group By key (this step guarantees that the same Group By key reaches the same reducer) and completes the final aggregation.
(2) SQL statement adjustments:
       ① How to join: choose the table whose join key is most evenly distributed as the driving table. Do column pruning and filtering well, so that the data volume is already relatively small when the two tables are joined.
       ② Small table joined to a large table: use a map join so that the small dimension table (fewer than 1000 records) is loaded into memory first and the join is completed on the map side, with no reduce phase.
       ③ Large table joined to a large table: turn the empty (NULL) key values into a string plus a random number, so that the skewed rows are spread across different reducers. Because NULL values never match in a join, this processing does not affect the final result.
       ④ count distinct with a large number of identical special values: when computing count distinct, handle the empty values separately. If only count distinct is being computed, the empty values need no processing: filter them out directly and add 1 to the final result. If other calculations that require group by are also needed, first process the records with empty values separately, then union that result with the other results.
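
The two-job plan described for hive.groupby.skewindata can be illustrated with a small simulation. This is only a sketch of the idea, not Hive's implementation: the function names, the reducer count, and the use of CRC32 as the partitioning hash are all assumptions made for the example.

```python
import random
import zlib
from collections import Counter


def stable_hash(key):
    """Deterministic stand-in for the partitioner's hash (an assumption)."""
    return zlib.crc32(key.encode("utf-8"))


def two_stage_count(keys, num_reducers=4, seed=0):
    """Count rows per key in two stages, mimicking the two-job plan."""
    rng = random.Random(seed)

    # Job 1: rows are spread randomly over the reducers, so a hot key is
    # split across several of them; each reducer pre-aggregates its share.
    stage1 = [Counter() for _ in range(num_reducers)]
    for key in keys:
        stage1[rng.randrange(num_reducers)][key] += 1

    # Job 2: the partial counts are routed by hash(key), so all partial
    # results for one key meet on a single reducer for the final sum.
    stage2 = [Counter() for _ in range(num_reducers)]
    for partial in stage1:
        for key, count in partial.items():
            stage2[stable_hash(key) % num_reducers][key] += count

    totals = Counter()
    for reducer in stage2:
        totals.update(reducer)
    stage1_loads = [sum(partial.values()) for partial in stage1]
    return totals, stage1_loads


# One key accounts for 90% of the rows.
keys = ["hot"] * 9000 + ["a", "b", "c", "d"] * 250
totals, stage1_loads = two_stage_count(keys)
print(totals["hot"], stage1_loads)
```

In this sketch no first-stage reducer receives all 9000 "hot" rows, while the second stage only has to combine a handful of partial counts per key.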

2. Describe Hive's features. What are the similarities and differences between Hive and an RDBMS?

       Hive is a Hadoop-based data warehouse tool. It maps structured data files to database tables, provides SQL query functionality, and converts SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications. It is well suited to statistical analysis in a data warehouse, but Hive does not support real-time queries.
       The differences between Hive and relational databases: (comparison table image not available)

3. What do Sort By, Order By, Cluster By, and Distribute By each mean in Hive?

        order by: performs a global sort of the input, so there is only one reducer (multiple reducers cannot guarantee a global order). With only one reducer, a large input leads to a long computation time.
        sort by: not a global sort; the data is sorted before it enters each reducer, so the output is ordered only within each reducer.
        distribute by: distributes rows to the different reducers according to the specified field.
        cluster by: has the function of distribute by and also the function of sort by on the same field.
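
To make the four clauses concrete, here is a minimal simulation with two reducers. It is a sketch under assumptions: the sample rows, the round-robin routing used for sort by, and CRC32 as the partitioning hash are invented for illustration and are not how Hive itself routes rows.

```python
import zlib

# Sample rows of (name, value); the data is made up for the example.
rows = [("b", 3), ("a", 5), ("b", 1), ("a", 2), ("c", 4)]
NUM_REDUCERS = 2


def partition(rows, key_index):
    """DISTRIBUTE BY: rows with the same key value reach the same reducer."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        h = zlib.crc32(str(row[key_index]).encode("utf-8"))
        reducers[h % NUM_REDUCERS].append(row)
    return reducers


# ORDER BY value: a single reducer sees every row, so the order is global.
order_by = sorted(rows, key=lambda r: r[1])

# SORT BY value: rows are routed with no key-based guarantee (round-robin
# here); each reducer's output is sorted, but there is no global order.
sort_by = [sorted(rows[i::NUM_REDUCERS], key=lambda r: r[1])
           for i in range(NUM_REDUCERS)]

# DISTRIBUTE BY name SORT BY value: route by name, sort each reducer by value.
distribute_sort = [sorted(part, key=lambda r: r[1])
                   for part in partition(rows, 0)]

# CLUSTER BY name: route by name and sort each reducer by that same column.
cluster_by = [sorted(part, key=lambda r: r[0])
              for part in partition(rows, 0)]

print(order_by)
print(sort_by)
print(cluster_by)
```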

4. Briefly describe NULL in databases, explain how NULL is stored at the lowest level in Hive, and explain the meaning of the statement select a.* from t1 a left outer join t2 b on a.id = b.id where b.id is null;

       The result of any operation involving NULL is NULL; use is null and is not null to test whether a value is NULL.
       By default, NULL is stored as '\N' in Hive's underlying files; this can be changed with alter table ... SET SERDEPROPERTIES('serialization.null.format' = 'a'). The statement returns all rows of t1 whose id has no match in t2: the left outer join keeps every row of t1, and where b.id is null then keeps only the rows for which no row of t2 matched.
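
The effect of left outer join ... where b.id is null can be checked with a small in-memory simulation. This is a sketch only: the tables t1 and t2 are modeled as lists of dicts with made-up rows, and None stands in for NULL.

```python
# Made-up sample data; None plays the role of NULL.
t1 = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": None, "name": "z"}]
t2 = [{"id": 2}, {"id": None}]


def left_outer_join(left, right):
    """Yield (a, b) pairs; b is None when no right row matches a's id."""
    for a in left:
        # NULL never equals anything, not even another NULL.
        matches = [b for b in right
                   if a["id"] is not None
                   and b["id"] is not None
                   and a["id"] == b["id"]]
        if matches:
            for b in matches:
                yield a, b
        else:
            # Unmatched left row: every column of b would be NULL.
            yield a, None


# WHERE b.id IS NULL keeps only the left rows that found no match.
unmatched = [a for a, b in left_outer_join(t1, t2) if b is None]
print([row["name"] for row in unmatched])
```

The row with id 2 is dropped because it matched, while the rows with id 1 and with a NULL id are kept, since neither has a partner in t2.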

5. Describe how the Hive functions split, coalesce, and collect_list are used (examples welcome).

       split: splits a string into an array, for example: split('a,b,c,d', ',') ==> ["a", "b", "c", "d"].
       coalesce(T v1, T v2, ...): returns the first argument that is not NULL; if every argument is NULL, it returns NULL.
       collect_list: collects all values of a column into a list without removing duplicates, for example: select collect_list(id) from table.
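
For readers who think more easily in a general-purpose language, the behavior of the three functions can be approximated as follows. These are rough Python equivalents written for illustration, not Hive code; the sample rows are made up, and NULL handling inside collect_list is deliberately not modeled.

```python
import re


def split(text, pattern):
    """Like Hive's split: break a string around matches of a regex pattern."""
    return re.split(pattern, text)


def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


def collect_list(rows, column):
    """Gather one column's values into a list, keeping duplicates."""
    return [row[column] for row in rows]


rows = [{"id": 1}, {"id": 2}, {"id": 2}]
print(split("a,b,c,d", ","))
print(coalesce(None, None, "fallback"))
print(collect_list(rows, "id"))
```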

6. In what ways can Hive store its metadata, and what are the characteristics of each?

       Hive supports three metastore configurations: embedded metastore, local metastore, and remote metastore. Each uses different configuration parameters.
       The embedded metastore is mainly used for unit testing. In this mode only one process can connect to the metastore at a time, and Derby is the default database for the embedded metastore.
       In local mode, each Hive client opens a connection to the datastore and issues SQL queries over that connection.
       In remote mode, all Hive clients open a connection to a metastore server, which in turn queries the datastore. The clients and the metastore server communicate using the Thrift protocol.

7. What is the difference between internal and external tables in Hive?

       At creation: when an internal (managed) table is created, the data is moved to the path the data warehouse points to; when an external table is created, only the path of the data is recorded and the location of the data is not changed.
       At deletion: when a table is dropped, an internal table's metadata and data are deleted together, while an external table has only its metadata removed and its data is kept. External tables are therefore relatively safer, allow more flexible data organization, and make it easier to share source data.

8. How is HiveQL (HSQL) converted into MapReduce? (☆☆☆☆☆)

       HiveSQL -> AST (abstract syntax tree) -> QB (query block) -> OperatorTree (operator tree) -> optimized operator tree -> MapReduce task tree -> optimized MapReduce task tree

The stages are as follows:
        SQL Parser: Antlr defines the SQL grammar rules and performs the lexical and syntactic analysis, converting the SQL into an abstract syntax tree (AST Tree);
        Semantic Analyzer: traverses the AST Tree and abstracts the basic unit of a query, the QueryBlock;
        Logical Plan: traverses the QueryBlock and translates it into an operator tree (OperatorTree);
        Logical Plan Optimizer: the logical-layer optimizer transforms the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data;
        Physical Plan: traverses the OperatorTree and translates it into MapReduce tasks;
        Physical Plan Optimizer: the physical-layer optimizer transforms the MapReduce tasks and generates the final execution plan;

9. How does Hive interact with its underlying database? (☆☆☆☆☆)

        Because Hive's metadata may be updated, modified, and read constantly, the Hadoop file system is clearly not a suitable place to store it. Hive currently stores its metadata in an RDBMS, such as MySQL or Derby. The metadata includes whether each table exists, the tables' columns, permissions, and other information.


10. In a Hive join, in what order should the large table and the small table be placed?

       Put the largest table rightmost in the JOIN statement, or name it directly with the hint /*+ STREAMTABLE(table_name) */.
       When writing a statement with a join, the table or subquery with fewer rows should go on the left of the Join operator. The reason is that in the Reduce phase the contents of the tables on the left of the Join operator are loaded into memory, so putting the table with fewer rows on the left reduces the chance of OOM (out of memory) errors. For a given key, then, the side with fewer values goes first and the side with more values goes last; this is the "small table first" principle. When a statement contains several Joins, it is processed differently depending on whether the Join conditions are the same.

11. How would you implement a Hive two-table join with MapReduce? (☆☆☆☆☆)

       If one of the tables is small, use a map-side join directly (the small table is loaded into memory on the map side).
       If both tables are large, use a composite key. The first part of the key is the common field used in the join on clause, and the second part is a flag, 0 for table A and 1 for table B, which lets the reducer tell the two tables' records apart (for example, customer records and order records). The Mapper processes both tables at once and sends records with the same join field to the same partition; they are then passed to a reducer, which performs the join.
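
The composite-key approach for two large tables can be sketched in plain Python. The customer and order rows below are invented for the example, and the shuffle is imitated by a sort; a real job would use a custom partitioner and grouping comparator rather than these hypothetical helper functions.

```python
from itertools import groupby

customers = [(1, "Ann"), (2, "Bob")]                        # table A: (id, name)
orders = [(1, "book"), (1, "pen"), (2, "ink"), (3, "cup")]  # table B: (id, item)


def map_phase():
    """Emit ((join_key, tag), payload); tag 0 marks table A, tag 1 table B."""
    for cid, name in customers:
        yield (cid, 0), name
    for cid, item in orders:
        yield (cid, 1), item


def reduce_side_join():
    # Shuffle: sorting on the composite key puts, within each join key,
    # the table A records (tag 0) ahead of the table B records (tag 1).
    shuffled = sorted(map_phase(), key=lambda kv: kv[0])
    joined = []
    # Group on the join key only, so both tables' records for one key
    # are handled by the same reduce call.
    for cid, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        names = []
        for (_, tag), payload in group:
            if tag == 0:
                names.append(payload)   # buffer the table A side
            else:
                joined.extend((cid, name, payload) for name in names)
    return joined


result = reduce_side_join()
print(result)
```

Because the tag sorts table A first, the reducer only has to buffer the table A records for a key while the table B records stream past. The order with customer id 3 has no partner and is dropped, which makes this an inner join.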

12. What does Hive use in place of IN queries?

       Before Hive 0.13, an IN subquery had to be implemented with a left outer join; from version 0.13 on, Hive supports IN queries directly.

13. Do all Hive queries have to run MapReduce jobs?

       No. Starting with Hive 0.10.0, a simple statement that needs no aggregation, of the form SELECT <col> FROM <table> LIMIT n, does not start a MapReduce job; the data is fetched directly by a Fetch task.

14. Hive functions: what is the difference between UDF, UDAF, and UDTF?

       UDF: one row in, one row out
       UDAF: multiple rows in, one row out
       UDTF: one row in, multiple rows out
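
The three shapes can be pictured with ordinary Python functions. This is only an analogy, not the Hive UDF API: the function names are chosen to echo the Hive built-ins upper (a UDF), sum (a UDAF), and explode (a UDTF).

```python
def to_upper(value):
    """UDF shape: one row in, one row out."""
    return value.upper()


def total(values):
    """UDAF shape: multiple rows in, one row out."""
    return sum(values)


def explode(items):
    """UDTF shape: one row in, multiple rows out."""
    for item in items:
        yield item


print([to_upper(v) for v in ["a", "b"]])   # applied once per row
print(total([1, 2, 3]))                    # applied once per group of rows
print(list(explode(["x", "y", "z"])))      # one input row fans out
```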

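15. What is your understanding of Hive bucketed tables?

       A bucketed table hashes the data and stores it in different files.
       When data is loaded into a bucketed table, the hash of the bucketing column is taken modulo the number of buckets, and each row is placed in the corresponding file. Physically, each bucket is a file in the table's (or partition's) directory, and the number of buckets (output files) a job produces equals the number of reduce tasks.
       Bucketed tables are intended specifically for sampling queries. They are a specialized tool rather than tables for everyday data storage; create and use a bucketed table only when sampling queries are needed.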
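
The hash-then-modulo placement can be sketched as follows. CRC32 is used here only as a deterministic stand-in, since Hive's own hash function differs; the column name user_id, the bucket count, and the helper names are assumptions made for the example.

```python
import zlib


def bucket_for(value, num_buckets):
    """Hash the bucketing column's value, then take it modulo the bucket count."""
    h = zlib.crc32(str(value).encode("utf-8"))  # stand-in hash, not Hive's
    return h % num_buckets


def write_buckets(rows, column, num_buckets):
    """Place each row in the bucket (file) chosen by its column value."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets


rows = [{"user_id": i % 10, "event": i} for i in range(40)]
buckets = write_buckets(rows, "user_id", num_buckets=4)

# A sampling query can read a single bucket instead of the whole table.
sample = buckets[0]
print([len(b) for b in buckets], len(sample))
```

Since every row with the same user_id lands in the same bucket, reading one bucket yields a sample that contains complete data for the ids it covers.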

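16. What is the process for writing a custom Hive UDF?

1) Write a class that extends the (org.apache.hadoop.hive.ql.exec.)UDF class;
2) Implement the evaluate() method;
3) Package the class as a JAR;
4) Add the JAR to Hive's classpath with the hive command:
       hive> add jar /home/ubuntu/ToDate.jar;
5) Register the function:
       hive> create temporary function xxx as 'XXX';
6) Use the function;
7) [Optional] drop the temporary function;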


Origin blog.csdn.net/silentwolfyh/article/details/103864595