Big Data Development Interview Questions: Hive

1. What are the commonly used models for data modeling?

(1) Star schema.
The star schema (Star Schema) is the most commonly used dimensional modeling method. It is centered on the fact table, with all dimension tables connected directly to the fact table like the points of a star. A star-schema model consists of one fact table and a set of dimension tables, and has the following characteristics (a minimal sketch follows after this list):
A. Dimension tables are associated only with the fact table; there are no associations between dimension tables;
B. The primary key of each dimension table is a single column, and that key is placed in the fact table as the foreign key connecting the two;
C. The fact table sits at the core, with the dimension tables distributed around it in a star shape.
(2) Snowflake schema.
The snowflake schema (Snowflake Schema) is an extension of the star schema: a dimension table in a snowflake schema can itself reference further dimension tables. Although this model is more normalized than the star schema, it is harder to understand and more expensive to maintain, and because multiple layers of dimension tables must be joined, its query performance is lower than that of the star schema.
(3) Constellation schema.
The constellation schema is another extension of the star schema. Where the star schema is built on a single fact table, the constellation schema is built on multiple fact tables that share dimension tables. The two modeling methods introduced above pair multiple dimension tables with a single fact table, but in practice a dimension space often contains more than one fact table, and a dimension table may be referenced by several fact tables. In the later stages of business development, most dimensional models use the constellation schema.
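As a minimal illustration of the star schema described in (1), the layout in Hive might look like the following; the table and column names are invented for the example, and Hive does not enforce primary/foreign key constraints (the keys are kept consistent by the ETL process):

    -- Fact table: one row per sale, holding foreign keys to each dimension
    CREATE TABLE fact_sales (
      sale_id     BIGINT,
      date_key    INT,            -- foreign key to dim_date
      product_key INT,            -- foreign key to dim_product
      store_key   INT,            -- foreign key to dim_store
      amount      DECIMAL(10, 2)
    );

    -- Dimension tables: single-column keys, joined only to the fact table
    CREATE TABLE dim_date    (date_key INT,    full_date STRING, month_name STRING, year_num INT);
    CREATE TABLE dim_product (product_key INT, product_name STRING, category STRING);
    CREATE TABLE dim_store   (store_key INT,   store_name STRING, city STRING);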

2. The difference between Hive internal tables and external tables.

Tables created without the EXTERNAL keyword are internal (managed) tables; tables created with EXTERNAL are external tables. The main differences between the two are as follows:
(1) Internal table data is managed by Hive itself, while external table data is managed by HDFS;
(2) Internal table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse), while the storage location of external table data is decided by the user (if no LOCATION is given, Hive creates a folder named after the external table under /user/hive/warehouse on HDFS and stores the table's data there);
(3) Dropping an internal table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are not removed.
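A minimal sketch of the two table types (the table names and the LOCATION path are illustrative):

    -- Managed (internal) table: data is stored under hive.metastore.warehouse.dir
    CREATE TABLE managed_logs (line STRING);

    -- External table: Hive only records the metadata; the data stays at the given LOCATION
    CREATE EXTERNAL TABLE external_logs (line STRING)
    LOCATION '/data/raw/logs';

    -- DROP TABLE managed_logs;   -- removes the metadata AND the data on HDFS
    -- DROP TABLE external_logs;  -- removes only the metadata; /data/raw/logs is untouched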

3. Does Hive support indexing? If so, what are the applicable scenarios?

Hive supports indexes (before version 3.0), but Hive's indexes are not the same as those in relational databases; for example, Hive does not support primary keys or foreign keys. Moreover, the functionality provided by Hive indexes is very limited and not very efficient, so they are rarely used.
Applicable scenarios for indexes: static fields that are not updated, so that the index does not have to be rebuilt constantly. Every time data is created or updated, the index must be rebuilt to regenerate the index table.
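For reference, the legacy index DDL (valid only in Hive versions before 3.0) looked roughly like this; the table and column names are illustrative:

    CREATE INDEX idx_user_id
    ON TABLE user_events (user_id)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD;

    -- The index table must be rebuilt whenever the base data changes
    ALTER INDEX idx_user_id ON user_events REBUILD;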

4. Briefly describe the mechanism of Hive indexing.

When Hive builds an index on a specified column, an index table (a physical Hive table) is generated. Its fields include: the value of the indexed column, the HDFS file path corresponding to that value, and the offset of the value within the file. The bitmap index handler was introduced in Hive 0.8. This handler is suitable for columns with few distinct values (for example, a field that can only take a few enumerated values), because an index trades space for time, and a column with too many distinct values would make the bitmap index table too large.
Note: whenever the data in Hive changes, the index must be updated in time, which is equivalent to rebuilding a new table; otherwise the efficiency and accuracy of queries are affected. The official Hive documentation explicitly advises against using indexes, and they have been deprecated in newer versions of Hive.
Extension: Hive has supported indexes since version 0.7 and introduced the bitmap index handler in version 0.8; the index feature was removed in version 3.0 and replaced by materialized views, introduced in version 2.3, which together with automatic query rewriting take over the role of indexes.
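A minimal sketch of the materialized-view replacement mentioned above (Hive 2.3+/3.x; the table and column names are illustrative, and in Hive 3 the source table typically needs to be transactional):

    CREATE MATERIALIZED VIEW mv_daily_pv AS
    SELECT dt, COUNT(*) AS pv
    FROM   page_views
    GROUP  BY dt;

    -- With hive.materializedview.rewriting enabled, a query such as
    --   SELECT dt, COUNT(*) FROM page_views GROUP BY dt
    -- can be rewritten by the optimizer to read mv_daily_pv instead of the base table.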

5. How does O&M schedule Hive tasks?

(1) Define the Hive SQL in a script (a minimal example follows below);
(2) Use Azkaban or Oozie to schedule the tasks;
(3) Monitor the tasks on the scheduling page.
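A minimal sketch of the kind of wrapper script a scheduler such as Azkaban or Oozie would invoke; the paths, databases, and table names are made up for illustration:

    #!/bin/bash
    # run_daily_pv.sh - submitted by the scheduler once a day
    dt=$(date -d "1 day ago" +%Y-%m-%d)

    hive -e "
    INSERT OVERWRITE TABLE dws.daily_pv PARTITION (dt = '${dt}')
    SELECT page_id, COUNT(*) AS pv
    FROM   dwd.page_views
    WHERE  dt = '${dt}'
    GROUP  BY page_id;
    "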

6. Briefly describe the advantages of Parquet columnar storage.

(1) Parquet supports nested data models, similar to Protocol Buffers. Each schema contains multiple fields, and each field has three attributes: repetition, data type, and field name.
The repetition can be one of three kinds: required (appears exactly once), repeated (appears 0 or more times), optional (appears 0 or 1 time). The data type of each field falls into two categories: group (complex type) and primitive (basic type).
(2) Parquet has no built-in complex data structures such as Map or Array, but they can be represented through combinations of repeated and group.
(3) Because the data model Parquet supports is relatively loose, a record may contain deep nesting. Maintaining a tree-like structure for every record could take up a large amount of storage, so the Dremel paper proposes an efficient encoding for nested data formats: the Striping/Assembly algorithm. With it, Parquet can represent complex nested formats using less storage, and since the repetition level and definition level are usually small integers, they can be further compressed with RLE to reduce storage space even more.
(4) Parquet files are stored in binary form and cannot be read or modified directly. Parquet files are self-describing: each file contains both the data and the file's metadata.
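In Hive, using Parquet is mostly a matter of declaring the storage format; a minimal sketch (the table name, columns, and the Snappy setting are illustrative):

    CREATE TABLE user_events_parquet (
      user_id BIGINT,
      event   STRING,
      props   MAP<STRING, STRING>    -- nested type, stored as repeated key/value groups
    )
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY');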

7. Briefly describe the advantages of ORC columnar storage.

(1) ORC files are self-describing; their metadata is serialized with Protocol Buffers, and the data in the file is compressed as much as possible to reduce storage consumption.
(2) Like Parquet, ORC files are stored in binary form, so they cannot be read directly. ORC files are also self-describing and contain a lot of metadata, which is likewise serialized with Protocol Buffers.
(3) ORC merges as many discrete read ranges as possible to reduce the number of I/O operations.
(4) ORC keeps fairly precise index information, so a read can start from any specified row, and the finer-grained statistics allow readers to skip entire row groups. By default ORC compresses both the data blocks and the index information with ZLIB, so ORC files take up relatively little storage space.
(5) Newer versions of ORC also add support for Bloom filters, which can further improve the efficiency of predicate pushdown; this support was added in Hive 1.2.0.
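A minimal ORC table sketch using the properties mentioned above (names and values are illustrative):

    CREATE TABLE user_events_orc (
      user_id BIGINT,
      event   STRING,
      dt      STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'             = 'ZLIB',     -- default compression codec
      'orc.create.index'         = 'true',     -- lightweight min/max indexes per row group
      'orc.bloom.filter.columns' = 'user_id'   -- Bloom filter to speed up predicate pushdown (Hive 1.2.0+)
    );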

8. Why layer the data warehouse?

(1) Trade space for time: a large amount of preprocessing improves the user experience (efficiency) of the application systems, so the data warehouse contains a lot of redundant data.
(2) Without layering, a change in the business rules of a source system would affect the entire data cleaning process, and the rework would be enormous.
(3) Layered data management simplifies the data cleaning process, because work that was originally done in one step is split into multiple steps, which is like splitting one complex job into several simple jobs and turning a big black box into a white box. The processing logic of each layer is relatively simple and easy to understand, so it is easier to guarantee the correctness of each step; when the data is wrong, we often only need to adjust a single step locally.

9. Briefly explain how to use Hive to parse JSON strings.

Hive generally handles JSON data in one of two ways (minimal sketches follow below):
(1) Load the JSON into the Hive table as a string, then parse the imported data with UDFs, for example using LATERAL VIEW with json_tuple to extract the required columns.
(2) Split the JSON into separate fields before importing, so the data loaded into the Hive table is already parsed. This requires a JSON SerDe (often a third-party one).
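Minimal sketches of both approaches; the column names, table names, and JSON paths are illustrative:

    -- (1) Keep the raw JSON as a string and parse it at query time
    SELECT t.uid, t.action
    FROM   raw_json_table a
    LATERAL VIEW json_tuple(a.json_str, 'uid', 'action') t AS uid, action;

    -- get_json_object does the same for a single field
    SELECT get_json_object(json_str, '$.uid') FROM raw_json_table;

    -- (2) Parse at load time with a JSON SerDe so the table already has typed columns
    CREATE TABLE events_json (uid STRING, action STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';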

10. Briefly describe the difference between sort by and order by.

(1) order by sorts the input globally, so there is only one reducer (multiple reducers cannot guarantee a global order), which leads to long computation times when the input is large.
(2) sort by is not a global sort; it sorts the data before it enters each reducer. Therefore, if sort by is used and mapred.reduce.tasks > 1 is set, sort by only guarantees that the output of each reducer is ordered, not that the overall output is globally ordered.
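A minimal illustration (table and column names are invented):

    -- Global order: everything goes through a single reducer
    SELECT * FROM sales ORDER BY amount DESC;

    -- Per-reducer order only: each of the 3 reducers writes a sorted file,
    -- but the combined output is not globally sorted
    SET mapred.reduce.tasks = 3;
    SELECT * FROM sales SORT BY amount DESC;

    -- Common combination: DISTRIBUTE BY routes rows to reducers, SORT BY orders within each reducer
    SELECT * FROM sales DISTRIBUTE BY region SORT BY amount DESC;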

11. What are the main data skews?

(1) Data skew caused by null values (a sketch for this case follows the list);
(2) Data skew caused by different data types;
(3) Data skew caused by large files that cannot be split;
(4) Data skew caused by data expansion;
(5) Data skew caused by table joins;
(6) Data skew caused by a data volume that genuinely cannot be reduced.
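As one example, the null-value skew in (1) is commonly handled by salting the null join keys with a random value so they spread across reducers instead of all landing on one. A minimal sketch, assuming string join keys and illustrative table names:

    SELECT a.*, b.user_name
    FROM   log a
    LEFT JOIN users b
      ON (CASE WHEN a.user_id IS NULL
               THEN CONCAT('null_', CAST(RAND() AS STRING))   -- salted key never matches, rows are kept but spread out
               ELSE a.user_id
          END) = b.user_id;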

12. How to solve the problem of too many small files in Hive?

(1) Use the concatenate command that comes with hive to automatically merge small files;
(2) Adjust parameters to reduce the number of Maps;
(3) Reduce the number of Reduces;
(4) Use hadoop's archive to archive small files.
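Sketches of approaches (1), (2)/(3), and (4); the table names and parameter values are illustrative:

    -- (1) Merge the files of a partition in place (works for ORC / RCFile tables)
    ALTER TABLE dwd.page_views PARTITION (dt = '2023-01-01') CONCATENATE;

    -- (2)(3) Merge the small output files of map-only and map-reduce jobs
    SET hive.merge.mapfiles = true;
    SET hive.merge.mapredfiles = true;
    SET hive.merge.size.per.task = 256000000;        -- target size of merged files (bytes)
    SET hive.merge.smallfiles.avgsize = 16000000;    -- start a merge job when the average output file size is below this

    -- (4) Archive a partition with Hadoop Archive (HAR)
    SET hive.archive.enabled = true;
    ALTER TABLE dwd.page_views ARCHIVE PARTITION (dt = '2023-01-01');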

13. What are Hive optimizations?

(1) Data storage and compression .
Hive tables are usually stored as ORC or Parquet, with Snappy generally used for compression. Compared with TEXTFILE tables, ORC takes up less storage. Because Hive's underlying engine uses the MR computing framework, data flows from HDFS to disk and back to HDFS, often several times, so using the ORC format together with a Snappy compression strategy reduces disk I/O and the volume of network transfer, which to a certain extent saves storage and improves the execution efficiency of HQL tasks;
(2) Optimization through parameter tuning (see the parameter sketch after this list).
A. Parallel execution, adjust parallel parameters;
B. Adjust jvm parameters, reuse jvm;
C. Set map and reduce parameters;
D. Turn on strict mode;
E. Turn off speculative execution settings.
(3) Effectively reduce the data set and split the large table into sub-tables; use external tables and partition tables in combination .
(4) SQL optimization .
A. Large table joined with large table: minimize the data set, and avoid scanning the whole table or all fields by using partitioned tables;
B. Large table joined with small table: enable automatic recognition of small tables so the small table is loaded into memory for execution (map join).
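A few of the parameters mentioned above, with illustrative values (defaults and exact names can vary across Hive versions):

    SET hive.exec.parallel = true;                -- A. run independent stages in parallel
    SET hive.exec.parallel.thread.number = 8;
    SET mapreduce.job.jvm.numtasks = 10;          -- B. JVM reuse on the MR engine
    SET hive.mapred.mode = strict;                -- D. strict mode (forbids full scans of partitioned tables, etc.)
    SET hive.mapred.reduce.tasks.speculative.execution = false;   -- E. turn off speculative execution
    SET hive.auto.convert.join = true;            -- map join: automatically load small tables into memory
    SET hive.mapjoin.smalltable.filesize = 25000000;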

14. What are the advantages of Tez engine?

Tez can convert multiple dependent jobs into a single job, so data only needs to be written to HDFS once and there are fewer intermediate stages, which greatly improves job performance.
Differences between MR / Tez / Spark:
(1) MR engine: multiple jobs chained together, disk-based, with many writes to disk. It is slow, but it will reliably produce a result. Generally used for weekly, monthly, and yearly indicators.
(2) Spark engine: it also spills to disk during the Shuffle process, but not every operator requires a shuffle, and in a multi-operator DAG the intermediate results do not all have to be written to disk. It balances reliability and efficiency, and is generally used for daily indicators.
(3) Tez engine: computation is done entirely in memory. Note: if the data volume is particularly large, use it with caution, as it can easily run out of memory (OOM). It is generally used in scenarios where results are needed quickly and the data volume is relatively small.
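Switching the execution engine is a single setting (the container size shown is just an example):

    SET hive.execution.engine = tez;     -- alternatives: mr, spark
    SET hive.tez.container.size = 4096;  -- memory (MB) per Tez container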
