Latest data warehouse interview questions: Zhixing Education data warehouse project


1. Can you briefly introduce your current project?

Including:
• What the project does
Our education big data analysis platform applies big data technology to the education industry to provide data support for business operations.

• What technology was used
Sqoop (data import)
MySQL 5.7
Hadoop, Hive, Hue and Oozie, running on CM (Cloudera Manager)
• What problems were solved
First, the data volume was too large, and the existing MySQL database could not meet the performance and efficiency needs of business statistics.
Second, there were many systems and the data was scattered, so the data from marketing, consulting, registration, teaching and other business stages could not be connected.
Third, statistical analysis was difficult and labor-intensive. Without standardized storage, every request from a business department required programmers and DBAs to look up data and build reports. At year end in particular, departments had to queue up and wait for a DBA to help produce the data.
• Which industry it is in, and which industry pain points the project addresses
The education industry. Information is not shared or used enough: years of informatization have left schools with a large amount of accumulated data, yet the barriers between information islands have not been broken down. If this data cannot be further mined, analyzed, processed and organized, it cannot provide scientific and effective data support for management decisions in education, teaching, research and development, general affairs and other areas.

Take the third topic, registered users, as an example:

37. How many tables does your raw data have?

Four tables.
customer_relationship (registration information), itcast_clazz (campus and subject information after registration), employee (internal employee information), scrm_department (department information).

38. Which tables are used in business?

The registration information table is the main one. It serves as the fact table, while the campus and subject information table, the internal employee information table and the department information table serve as dimension tables for multi-dimensional analysis. The earlier clue table and intention table are reused as well.

39. How many analysis requirements are there?

Ten requirements.

21. Please briefly describe the modeling of each of the 5 dashboards in the project

How to layer, and why

First, the ODS layer holds the raw data: customer_relationship (registration information), itcast_clazz (campus and subject information after registration), employee (internal employee information), scrm_department (department information).

Second, the DWD layer cleans, extracts and transforms the data. Here we clean the customer table, keeping only rows that are non-empty and paid, and transform it to derive the online/offline field and the year, month and day fields.

Next, the DWM layer builds on the DWD layer and joins the campus, subject and consulting center tables to obtain the required fields.

Finally, the DWS layer aggregates by the product's attribute dimensions to produce a wide statistics table. The product attribute dimensions include: campus, subject combination grouping, source channel, and consulting center.
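
The four layers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual Hive jobs: the column names (payment_state, origin_type, payment_time), the sample rows and the "NETSERVICE means online" rule are all hypothetical.

```python
from collections import Counter

# Hypothetical ODS rows: raw registration records copied from the source system.
ods_customer_relationship = [
    {"id": 1, "clazz_id": 10, "payment_state": "PAID", "origin_type": "NETSERVICE", "payment_time": "2021-01-05 10:20:00"},
    {"id": 2, "clazz_id": 11, "payment_state": "PAID", "origin_type": "VISITED", "payment_time": "2021-01-05 14:00:00"},
    {"id": 3, "clazz_id": 10, "payment_state": "UNPAID", "origin_type": "NETSERVICE", "payment_time": None},
    {"id": 4, "clazz_id": None, "payment_state": "PAID", "origin_type": "NETSERVICE", "payment_time": "2021-01-06 09:00:00"},
]

# Hypothetical dimension data: campus and subject for each class.
dim_clazz = {
    10: {"campus": "Beijing", "subject": "Java"},
    11: {"campus": "Shanghai", "subject": "Python"},
}


def to_dwd(rows):
    """DWD: keep paid, non-empty rows; derive the online/offline flag and the date."""
    cleaned = []
    for r in rows:
        if r["payment_state"] != "PAID" or r["clazz_id"] is None or not r["payment_time"]:
            continue
        cleaned.append({
            "id": r["id"],
            "clazz_id": r["clazz_id"],
            "origin_type_stat": "online" if r["origin_type"] == "NETSERVICE" else "offline",
            "dt": r["payment_time"][:10],  # year-month-day part of the timestamp
        })
    return cleaned


def to_dwm(dwd_rows, clazz):
    """DWM: attach campus and subject from the dimension table."""
    return [{**r, **clazz[r["clazz_id"]]} for r in dwd_rows if r["clazz_id"] in clazz]


def to_dws(dwm_rows):
    """DWS: count registrations per (date, campus)."""
    return Counter((r["dt"], r["campus"]) for r in dwm_rows)


dws = to_dws(to_dwm(to_dwd(ods_customer_relationship), dim_clazz))
print(dict(dws))
```

Each function reads only the output of the layer before it, which is the point of layering: a change in cleaning rules touches to_dwd alone.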

40. List a few requirements that you have implemented

Histogram of campus registration: the distribution of the number of applicants in each campus among all registered customers during the statistical period.
Histogram of subject registration: the distribution of the number of applicants for each subject among all the customers who signed up during the statistical period.
Total number of registrations: The total number of registered customers who have paid during the statistical period.

2. What is a data warehouse?

A data warehouse is a subject-oriented, integrated, non-volatile (relatively stable) and time-variant (reflecting historical changes) collection of data used to support management decision-making.

To put it plainly: companies want to analyze their data, but they face data islands and data volumes that are too large. A system that solves centralized storage and massive data computation in a systematic way, with the best possible SQL support, is called a data warehouse.

3. What is the difference between a data warehouse and a traditional business database?

Database: a logical concept, a warehouse for storing data, implemented by database software. A database consists of many tables. Tables are two-dimensional; each table has many fields arranged as columns, and data is written into the table row by row. Two-dimensional tables can still express multi-dimensional relationships. Examples: Oracle, DB2, MySQL.
Data warehouse: an extension of the database concept. Logically there is no difference between the two, since both store data through database software, but a data warehouse holds far more data than a database. Data warehouses are mainly used for data mining and data analysis to help leaders make decisions.
The main differences: data in a data warehouse is integrated or refined, while data in a database is detailed. A data warehouse mainly uses the star schema or the snowflake schema and is analysis-oriented, supporting decision-making; a database uses the entity-relationship (ER) model and is transaction-oriented, touching only a small amount of data per operation. In addition, a data warehouse stores historical data and does not include the very latest data; its data is read-only and append-only, and a single operation touches a large amount of data. A database is the opposite.

4. What are OLTP and OLAP? What's the difference?

OLTP: online transaction processing
OLAP: online analytical processing
OLTP involves frequent transactions on small amounts of data, mainly inserts, deletes and updates. OLAP works on large amounts of data and consists mainly of queries.
An OLTP system exists to keep the business running smoothly and stably; an OLAP system exists to analyze and process data efficiently.
OLTP also responds quickly and mainly serves business operators, while OLAP responds more slowly and mainly serves management decision-makers.

5. How the project is layered

It is generally divided into three layers:

  • ODS
  • DW
  • ADS

There may also be a dimension (Dimen) layer.

6. How do data warehouses generally do hierarchical processing?

ODS -> DWD -> DWM -> DWS

7. What is the role of data warehouse layering?

Since we are doing data analysis, computation in the data warehouse is generally iterative, and it is carried out layer by layer.

In practice, computing the indicators of a wide table directly from DWD or ODS means too much computation and too few dimensions. The usual approach is therefore to compute several small intermediate tables in the DWM layer and then join them into a wide DWS table.

Because the boundary between wide and narrow tables is hard to define, you can also drop the DWM layer and keep only the DWS layer, putting all of that data in DWS.

8. Is any of the analysis in the project organized by topic? If so, what are the topics?

The project analysis covers five topics:

  • Visits and consultations
  • Registered users
  • Effective clues
  • Intent users
  • Student attendance and questions

9. "Data analysis can determine the future development of an enterprise." Please discuss this view dialectically

For example: completely agree, and state the reason.
For example: completely disagree, and state the reason.
For example: partially agree, and state the reason.

Partially agree. Excellent data analysis can guide an enterprise's decisions, expose its weak points and increase its profit. However, "data is not a panacea". Data can only offer input to decision-makers; it cannot replace the decision itself. Decisions are still made by the leaders.

10. What is a fact table, what is a dimension table, and what are the differences and connections

Fact table: a fact is an event. Each row represents a real event that occurred in the system.

Dimension table: records the information on each dimension of an event or entity.

Difference: in terms of data volume, the fact table is huge, while a dimension table is relatively small.

Connection: through the association between the fact table and the dimension tables, we can analyze the data in the fact table from multiple dimensions.

A wide table is the combination of a fact table and its dimension tables.
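
To make the relationship concrete, here is a small sketch of how a fact table and its dimension tables combine into a wide table that can then be analyzed by any dimension. The table contents and column names are hypothetical and are only loosely modeled on the registration tables mentioned earlier.

```python
# Hypothetical fact table: one row per registration event, holding only keys.
fact_registration = [
    {"customer_id": 1, "clazz_id": 10, "employee_id": 100},
    {"customer_id": 2, "clazz_id": 11, "employee_id": 101},
    {"customer_id": 3, "clazz_id": 10, "employee_id": 100},
]

# Hypothetical dimension tables: small lookup tables keyed by id.
dim_clazz = {
    10: {"campus": "Beijing", "subject": "Java"},
    11: {"campus": "Shanghai", "subject": "Python"},
}
dim_employee = {
    100: {"department": "Consulting A"},
    101: {"department": "Consulting B"},
}


def build_wide_table(facts, clazz, employee):
    """Join every fact row with its dimension attributes."""
    wide = []
    for fact in facts:
        row = dict(fact)
        row.update(clazz.get(fact["clazz_id"], {}))
        row.update(employee.get(fact["employee_id"], {}))
        wide.append(row)
    return wide


def count_by(rows, dimension):
    """Count events per value of one dimension column."""
    counts = {}
    for row in rows:
        counts[row[dimension]] = counts.get(row[dimension], 0) + 1
    return counts


wide_table = build_wide_table(fact_registration, dim_clazz, dim_employee)
by_subject = count_by(wide_table, "subject")
by_department = count_by(wide_table, "department")
print(by_subject, by_department)
```

The same fact rows are counted once by subject and once by department, which is what analyzing the fact table from multiple dimensions means in practice.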

11. What are indicators, what are dimensions, and what are the differences and connections

Indicator:
In plain language: the data you want to look at.
Dimension:
In plain language: the angle from which you look at the data.
Indicators are generally numeric and are shown on the Y-axis. They are divided into absolute values and relative values.

Dimensions are generally character types that describe characteristics and are shown on the X-axis. They are divided into qualitative and quantitative dimensions.

Dimensions can be converted into indicators.

12. What are the main characteristics of a data warehouse?

  • Centralized storage
  • Mass data analysis and calculation
  • Support SQL language
  • Dedicated data analysis

13. What are the main problems that the data warehouse solves?

Please give a general overview of the problems encountered in the enterprise, and what problems have been solved by the data warehouse

Enterprises want to do data analysis, but they face data islands and excessive data volume. A system was therefore developed that solves centralized storage and massive data computation while providing the best possible SQL support. We call this a data warehouse.

14. How many data warehouses should an enterprise generally build? Explain

One is best.
The problem enterprises face is data islands. If data storage is too scattered, the advantages of a data warehouse are lost. Even with just two data warehouses you run into data synchronization problems, which waste time and reduce efficiency.

15. What is a slowly changing dimension? What scenarios is it suitable for?

A dimension whose attributes change slowly over time.

For example: a person's marital status, work experience, employer and training history.

16. What is a zipper table? What scenarios is it suitable for?

A table that stores a slowly changing dimension (SCD) with its history is also called a zipper table.
It suits scenarios where the complete version history of a record must be kept as it changes, and it saves a great deal of storage space.
It is currently the most widely used approach.
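
As a rough illustration of why a zipper table keeps full history cheaply, the sketch below stores one row per version of a record with a validity range, and reconstructs the state on any given day. The column names (start_date, end_date) and the 9999-12-31 "still open" marker are common conventions assumed here, not details taken from the project.

```python
from datetime import date

# Assumed convention: an open-ended version carries a far-future end date.
OPEN_END = date(9999, 12, 31)

# Hypothetical zipper table: one row per version, valid from start_date to end_date inclusive.
zipper_table = [
    {"customer_id": 1, "marital_status": "single", "start_date": date(2020, 1, 1), "end_date": date(2020, 6, 30)},
    {"customer_id": 1, "marital_status": "married", "start_date": date(2020, 7, 1), "end_date": OPEN_END},
    {"customer_id": 2, "marital_status": "single", "start_date": date(2020, 3, 1), "end_date": OPEN_END},
]


def snapshot(rows, as_of):
    """Return the version of every customer that was valid on the given day."""
    return {
        r["customer_id"]: r["marital_status"]
        for r in rows
        if r["start_date"] <= as_of <= r["end_date"]
    }


may_2020 = snapshot(zipper_table, date(2020, 5, 1))
latest = snapshot(zipper_table, date(2021, 1, 1))
print(may_2020, latest)
```

Customer 2 never changed, so it occupies a single row no matter how many days pass. A daily full snapshot would store that row once per day, which is where the storage saving comes from.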

17. What are the hierarchy and classification of dimensions? What are rolling up and drilling down?

Dimensions are not fixed; any dimension can be refined into sub-dimensions.

Dimensions therefore have hierarchical relationships.

The relationship between an upper level and a lower level is called hierarchy.

The relationship between dimensions on the same level is called classification.

Roll up: from the current dimension, move to the upper-level dimension for statistical analysis.

Drill down: from the current dimension, move to the lower-level dimension for statistical analysis.
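
Rolling up and drilling down amount to aggregating the same data at different levels of a dimension hierarchy. The sketch below assumes a hypothetical two-level hierarchy (city, then campus) and made-up signup counts.

```python
from collections import defaultdict

# Hypothetical registrations with a two-level location hierarchy: city -> campus.
rows = [
    {"city": "Beijing", "campus": "Changping", "signups": 30},
    {"city": "Beijing", "campus": "Shunyi", "signups": 20},
    {"city": "Shanghai", "campus": "Pudong", "signups": 25},
]


def aggregate(rows, *levels):
    """Sum signups grouped by the given dimension levels."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[level] for level in levels)] += r["signups"]
    return dict(totals)


by_campus = aggregate(rows, "city", "campus")  # drill down: city -> campus
by_city = aggregate(rows, "city")              # roll up: campus -> city
print(by_campus)
print(by_city)
```

Moving from by_city to by_campus is a drill down; moving back is a roll up. The underlying rows never change, only the grouping level.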

18. Please briefly describe the data mart

First of all, a data warehouse is the centralized storage of data.

But a company has different departments (business lines), and each cares about different things.

Each department (business line) is concerned with only one subset of the data in the data warehouse.

A data warehouse can therefore be divided into many data subsets. This model is called a data mart.

A data mart is not mandatory; whether to build one depends on specific needs.

19. Please briefly describe dimension degeneration and its role

When all descriptive attributes have been removed from a transaction dimension, what remains is an empty dimension. Such transaction numbers, the operational document numbers inherent to the business, should simply be stored in the fact table rather than linked to a dimension table.
For example: order number, invoice number and bill of lading number.
To make the detail layer easier to use, the DWD layer applies dimension degeneration, folding dimensions into the fact table to reduce the joins between the fact table and the dimension tables.

20. Please briefly describe the main functions of the following layers

ODS
Source data layer (ODS): the data in this layer is not changed. It uses the data structures and data of the peripheral systems directly and is not exposed externally. It is a staging layer, the temporary storage area for interface data, which prepares for the next step of data processing.
DW
Data warehouse layer (DW): the data in this layer should be consistent, accurate and clean, that is, source system data after cleaning (removing impurities).
This layer can be subdivided into three layers:
DWD
Detail layer DWD (Data Warehouse Detail): stores detailed data, the finest-grained fact data. This layer generally keeps the same data granularity as the ODS layer and provides a degree of data quality assurance. To make the detail layer easier to use, it also applies dimension degeneration, folding dimensions into the fact table to reduce the joins between the fact table and the dimension tables.
DWM
Middle layer DWM (Data Warehouse Middle): stores intermediate data, building intermediate tables for statistics. This data is generally aggregated over multiple dimensions and usually comes from the DWD layer.
DWS
Business layer DWS (Data Warehouse Service): stores wide table data, aggregated for a particular business area. Application layer data usually comes from this layer. It is called a wide table mainly because, to meet the needs of the application layer, all data related to a business is gathered and stored together here so that it is easy to fetch. The data in this layer usually comes from the DWD and DWM layers.
ADS (APP)
Application layer: the data source read directly by front-end applications. Its data is computed according to the needs of reports and thematic analysis.

22. Please briefly describe ways to implement SCD2 (at least 2; more is fine)

  1. By adding tables (a new table is generated for each collection)
  2. By adding columns (a validity period is added to each row of the table)
  3. By adding temporary tables (two temporary tables, update and tmp)
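
The "adding columns" approach can be sketched as follows: each row carries a validity period, and a new load closes the open version of any changed key and appends a fresh open version. The function name, column names and the 9999-12-31 end marker are assumptions for illustration, not the project's actual implementation.

```python
from datetime import date, timedelta

# Assumed convention: the current version of a key has this far-future end date.
OPEN_END = date(9999, 12, 31)


def apply_changes(history, changes, load_date):
    """Close the open version of each changed key and append a new open version."""
    changed = {c["id"]: c for c in changes}
    current = {r["id"]: r["value"] for r in history if r["end_date"] == OPEN_END}

    result = []
    for row in history:
        new = changed.get(row["id"])
        if new is not None and row["end_date"] == OPEN_END and row["value"] != new["value"]:
            # The old version stops being valid the day before the load.
            result.append({**row, "end_date": load_date - timedelta(days=1)})
        else:
            result.append(row)

    for key, new in changed.items():
        if current.get(key) != new["value"]:
            # New key, or a key whose value changed: open a new version.
            result.append({"id": key, "value": new["value"], "start_date": load_date, "end_date": OPEN_END})
    return result


history = [{"id": 1, "value": "single", "start_date": date(2020, 1, 1), "end_date": OPEN_END}]
changes = [{"id": 1, "value": "married"}, {"id": 2, "value": "single"}]
updated = apply_changes(history, changes, date(2020, 7, 1))
for row in updated:
    print(row)
```

Rows whose value did not change are left untouched, so repeated loads of unchanged data do not grow the table.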

23. Please use three words to summarize the main work of data warehouse layering

ETL
ETL (Extract, Transform, Load) covers three processes: data extraction, data transformation and data loading.

24. What is a version control tool?

The two common version control tools are Git and SVN. Git is a distributed version control system, and SVN is a centralized version control tool. SVN has poor fault tolerance, so most companies use Git to manage their code.

25. What is Git?

Git is a distributed version control system. It has no central server; every developer's computer holds a complete repository.

As a result, you do not need a network connection to work, because the full version history is on your own computer.

26. What is the role of the .git folder?

The .git folder is created in the current directory when Git is initialized, and it manages the Git repository.
It contains everything Git needs to operate.
If you delete the .git folder, the version history is lost.

27. What is a local library? What is a remote library?

The local library is the code repository stored on your own machine, and the remote library is a code repository hosted in the cloud. Well-known hosts include Gitee (Code Cloud) and GitHub.

28. What is the role of compression in the framework of big data?

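The essence of compression: CPU computation (running an algorithm).

Compression algorithms are a special kind of encoding algorithm (loosely speaking, an "encryption") with one requirement: the encoded output must be smaller than the original.

Their function is to shift load from the hard disk or the network onto the CPU, which is a profitable trade.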
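
The trade of CPU time for bytes can be seen with Python's standard zlib module. The payload below is made-up, deliberately repetitive data; real compression ratios depend heavily on the data and on the codec the cluster uses.

```python
import zlib

# Made-up, highly repetitive payload standing in for a data file.
payload = b"campus=Beijing;subject=Java;" * 1000

# Compressing and decompressing both cost CPU time...
compressed = zlib.compress(payload, 6)
restored = zlib.decompress(compressed)

# ...but far fewer bytes have to be written to disk or sent over the network.
print(len(payload), len(compressed))
```

The round trip is lossless, so the only price paid for the smaller size is the CPU work of the two calls.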

29. Please briefly describe the advantages, disadvantages and applicable scenarios of row storage and column storage

Row storage: data is stored row by row, and a single file can represent a two-dimensional table.
Advantages:
• The concept is easy to understand
• Row operations are faster
• Transaction support is better
Disadvantages:
• Column operations are slower than with column storage
• A compression algorithm can only be chosen for whole rows, so the compression ratio is not high
• Sorting can only be done on one column, ordering whole rows against each other
• Adding and deleting columns is inconvenient

Applicable scenarios: general business scenarios, for example CSV files and text files.
Column storage: each file stores one column, and multiple files combine into a two-dimensional table.
Advantages: the disadvantages of row storage are its advantages. For example, adding and deleting columns is simpler, and you can choose which columns to load into memory.
Disadvantages: operating on whole rows is slow, and transaction support is poor.
Applicable scenarios: much of the work in a data warehouse is column filtering, column search and column matching, so many data warehouse structures suit column storage better.
Column storage is also better suited to OLAP.
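
The difference between the two layouts can be mimicked in memory: a list of records stands in for row storage and a dictionary of lists for column storage. This is only an analogy with hypothetical data; real columnar formats add encoding, compression and metadata on top.

```python
# Row layout: each record is stored together.
rows = [
    {"id": 1, "campus": "Beijing", "fee": 100},
    {"id": 2, "campus": "Shanghai", "fee": 200},
    {"id": 3, "campus": "Beijing", "fee": 150},
]


def to_columns(records):
    """Column layout: each column is stored together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An analytical query touches only the one column it needs.
total_fee = sum(columns["fee"])

# Rebuilding a whole row has to visit every column, which is why row
# operations are the weak point of column storage.
second_row = {name: values[1] for name, values in columns.items()}

print(total_fee, second_row)
```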

30. What is partitioning in Hive? What is bucketing in Hive?

Partitioning splits table data coarsely, over a large range, by folder.

Bucketing splits table data finely by file.

The partitioning rule: the folder is determined by the value of the specified key.

The bucketing rule: a hash of the key determines which bucket file a row falls into.

Why partition? A huge amount of data takes a long time to process. Partitioning by different dimensions lets a query read only the relevant folders, which saves time and improves efficiency.
Why bucket? My personal understanding is that it reduces the cost of trial and error: by sampling and analyzing the data you get representative query results rather than all results, which greatly improves development efficiency.
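
The two rules can be sketched as a function that decides where a row is written. The path layout and the bucket file naming below are hypothetical, and zlib.crc32 merely stands in for Hive's own hash function, which works differently.

```python
import zlib


def partition_dir(table, dt):
    """Partitioning: the folder is chosen by the value of the partition key."""
    return f"{table}/dt={dt}"


def bucket_index(key, num_buckets):
    """Bucketing: a hash of the key, modulo the bucket count, picks the file.

    crc32 is an assumption used to keep the sketch deterministic.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def target_file(table, dt, key, num_buckets):
    """Combine both rules into the (hypothetical) file a row would land in."""
    return f"{partition_dir(table, dt)}/bucket_{bucket_index(key, num_buckets):05d}"


path = target_file("customer_relationship", "2021-01-05", 42, 4)
print(path)
```

A query filtered on dt only needs to open one folder, and a sample of the table can be taken by reading a single bucket file from it.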

31. What are static partitions, dynamic partitions, and mixed partitions in Hive?

Static partition: the partition must be specified manually when importing data.
Dynamic partition: the system determines the target partition dynamically when importing data.
Mixed partition: a combination of the two; a table can be partitioned by static and dynamic partition keys at the same time.

32. What is Map Join, what are its benefits, and what is its main principle?

As the name suggests, MapJoin joins tables in the Map phase, without entering the Reduce phase.
This avoids a large amount of data transfer in the Shuffle phase, which speeds up the job.
It suits joins between one large table and one small table, where the small table can be held in memory without hurting performance.
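
The principle is an in-memory hash join: the small table is loaded into a hash table once, and the large table is streamed past it. The sketch below shows the idea in plain Python with hypothetical rows; it is not how Hive implements it internally.

```python
def map_side_join(big_rows, small_rows, key):
    """Inner join that never needs to shuffle the big table."""
    # Build phase: the small table fits in memory, so index it by join key.
    lookup = {}
    for small in small_rows:
        lookup.setdefault(small[key], []).append(small)

    # Probe phase: each mapper streams its own split of the big table.
    for big in big_rows:
        for small in lookup.get(big[key], []):
            yield {**small, **big}


# Hypothetical data: a large fact table and a small dimension table.
registrations = [
    {"clazz_id": 10, "customer_id": 1},
    {"clazz_id": 11, "customer_id": 2},
    {"clazz_id": 12, "customer_id": 3},  # no matching class, dropped by the inner join
]
classes = [
    {"clazz_id": 10, "campus": "Beijing"},
    {"clazz_id": 11, "campus": "Shanghai"},
]

joined = list(map_side_join(registrations, classes, "clazz_id"))
print(joined)
```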

33. How to explicitly tell Hive to use MapJoin to execute tasks

set hive.auto.convert.join=true;
-- older versions use hive.mapjoin.smalltable.filesize
set hive.auto.convert.join.noconditionaltask.size=512000000;

34. What is Bucket Map Join, what are its benefits, and what are the main principles?

When two tables are joined and the small table is too large to fit in memory, but you still want a map-side join, use Bucket Map Join.

Both tables are hash-bucketed on the join key, and the bucket count of one table must be a multiple of the bucket count of the other. Rows with the same join key then land in corresponding buckets. The small table is still copied to all nodes, but at join time each map task loads only the matching buckets of the small table into a hash table and joins them against the corresponding bucket of the large table, so only part of the small table has to be in memory at any time.
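
A simplified sketch of the idea follows. It assumes integer join keys, the same bucket count on both sides, and a plain modulo as the bucketing function; all of these are simplifications chosen for illustration rather than Hive's actual behavior.

```python
def bucketize(rows, key, num_buckets):
    """Split rows into buckets by join key (integer keys assumed)."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucket_map_join(big_rows, small_rows, key, num_buckets):
    """Join bucket i of the big table against bucket i of the small table only."""
    big_buckets = bucketize(big_rows, key, num_buckets)
    small_buckets = bucketize(small_rows, key, num_buckets)

    joined = []
    for big_bucket, small_bucket in zip(big_buckets, small_buckets):
        # Only one bucket of the small table is held in memory at a time.
        lookup = {}
        for small in small_bucket:
            lookup.setdefault(small[key], []).append(small)
        for big in big_bucket:
            for small in lookup.get(big[key], []):
                joined.append({**small, **big})
    return joined


# Hypothetical data.
registrations = [
    {"clazz_id": 10, "customer_id": 1},
    {"clazz_id": 11, "customer_id": 2},
    {"clazz_id": 12, "customer_id": 3},
]
classes = [
    {"clazz_id": 10, "campus": "Beijing"},
    {"clazz_id": 11, "campus": "Shanghai"},
]

bucket_joined = bucket_map_join(registrations, classes, "clazz_id", 2)
print(bucket_joined)
```

Because matching keys always hash to the same bucket number on both sides, a row in bucket i of the large table can only find its partner in bucket i of the small table, so no other bucket needs to be loaded.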

35. What is SMB Join, what are its benefits, and what are the main principles

The full name is Sort Merge Bucket Join.
A large table joined with a small table should be optimized with MapJoin. For a large table joined with another large table, a shuffle is very costly: it is slow, and it is prone to failures. In that case SMB Join can improve performance.
SMB Join builds on Bucket Map Join with buckets that are also sorted on the join key. The join can then be completed on the map side, which reduces or avoids shuffle data.
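
Once two matching buckets are both sorted on the join key, they can be joined with a single forward pass over each and no hash table at all. The sketch below shows that merge step for one pair of buckets, using hypothetical rows; it illustrates the general sort-merge technique rather than Hive's implementation.

```python
def sort_merge_join(left, right, key):
    """Inner join of two lists that are already sorted by the join key."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][key] == left_key:
                for match in right[j:j_end]:
                    joined.append({**left[i], **match})
                i += 1
            j = j_end
    return joined


# Hypothetical sorted buckets from two large tables.
left_bucket = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 2, "a": "z"}]
right_bucket = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}]

merged = sort_merge_join(left_bucket, right_bucket, "k")
print(merged)
```

Each input is read once from start to end, so memory use stays small even when both buckets are large.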

36. Please briefly describe the execution principle of Hive

Simply put, Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like queries.
Hive is in effect a compiler and translator: it translates SQL into jobs such as MapReduce. Current Hive supports execution on Spark and Tez as well as MapReduce.

Origin blog.csdn.net/xianyu120/article/details/112646707