Zhixing Education Big Data Analysis Data Warehouse Project: Interview Questions (Highlights Edition)

1. Introduce the current project.
Can you introduce the project you worked on?
Our big data project mainly addresses several pain points in the education industry.
First, driven by the "Internet+" concept and the impact of the epidemic, online education, K12 education, and similar sectors are booming, and more and more platforms and institutions have emerged. However, because information is not shared and used well, companies have accumulated large amounts of data over many years. Because of information silos, this data has not been mined or analyzed further, so it cannot give the company's management and decision-makers effective data support.
In view of this, our education big data analysis platform applies big data technology to the education industry and uses an OLAP system, which is well suited to analysis, to provide data support for business operations. The approach is to first build an enterprise data warehouse and preprocess the scattered business data; second, mine and analyze the massive user behavior data according to business needs and build customized multi-dimensional data sets, forming a data mart for each scenario theme; and finally use BI tools for front-end display.
The technology stack includes MySQL, Sqoop, Hive managed by CM, Oozie, and FineBI. Since most of the data in the OLTP system is stored in MySQL, we chose Sqoop as the import and export tool. We extract the data into the data warehouse, use Hive (managed by CM) for data cleaning and analysis, export the results back to MySQL with Sqoop, and finally use FineBI to display the OLAP analysis results.
In this way, our solution addresses three major pain points for enterprises: first, there is too much data for traditional databases to handle; second, there are too many systems and the data is fragmented, which creates data silos; third, the statistical workload is too large, so analysis cannot provide timely data for the enterprise to refer to.

2. What is the difference between a data warehouse and a traditional business database?
Database: a logical concept, a store for data, implemented by database software.
Data warehouse: an extension of the database concept.
The main differences are as follows. A data warehouse holds integrated or refined data, is usually modeled with a star or snowflake schema, is analysis-oriented, and supports decision-making. A database holds detailed data, is modeled with the entity-relationship (ER) model, is transaction-oriented, and touches only a small amount of data per operation. In addition, a data warehouse stores historical data and may not include the very latest data; its data is read-only and append-only, and each operation works on a whole set of data, so the volume involved is large. A database is the opposite on each of these points.

3. How do data warehouses generally do hierarchical processing?
A data warehouse is generally divided into three layers:

  • ODS: source data layer
  • DW: data warehouse layer
  • ADS: application layer, the data source that front-end applications read directly

Dimension tables are placed in the DIM (Dimen) layer.
The DW layer can be further divided into the DWD detail layer, the DWM middle layer, and the DWS business layer.
The DIM layer can also be divided into high-cardinality and low-cardinality dimension tables.

ODS -> DWD -> DWM -> DWS -> APP

4. What is the role of data warehouse layering?
First, the data structure becomes clear: each layer has a defined responsibility, which makes the data easy to understand and problems easy to locate.
Second, complex problems are simplified: a complex task is split into several steps.
Third, maintenance is easier: when a problem occurs, repair can start from the step where it occurred. Developing common middle-layer data also reduces repeated development.
Finally, system performance improves: the required information is read directly from the data warehouse, which reduces joins and complex queries and speeds up statistics.
In plain language: layering lets data flow in an orderly way, so that designers and users can clearly see the whole life cycle of the data. The layers are clear and the dependencies are intuitive.

5. Is there any analysis by theme in the project? If yes, what are the themes?
The project analysis covers five general directions (themes):

  • Visit and consultation theme
  • Registered user theme
  • Effective lead theme
  • Intent user theme
  • Student attendance theme

6. What is a fact table, what is a dimension table, and what are the differences and connections?
Fact table: a table that records each fact (event).
Dimension table: a table that records the attributes of each dimension of an event or entity.

Difference: in terms of data volume, the fact table is huge, while a dimension table is small by comparison.

Connection: by joining the fact table to its dimension tables, we can analyze the data in the fact table from multiple dimensions.

A wide table combines a fact table with its dimension tables.
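
To make the relationship concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_signup, dim_campus), columns, and sample rows are made up for illustration and are not the project's actual schema. It joins a fact table to a dimension table to get a wide table, then counts facts along a dimension attribute.

```python
import sqlite3

# Hypothetical miniature star schema: one fact table and one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_campus (campus_id INTEGER PRIMARY KEY, campus_name TEXT, city TEXT);
    CREATE TABLE fact_signup (signup_id INTEGER PRIMARY KEY, campus_id INTEGER, signup_date TEXT);
    INSERT INTO dim_campus VALUES (1, 'North Campus', 'Beijing'), (2, 'South Campus', 'Shanghai');
    INSERT INTO fact_signup VALUES
        (101, 1, '2021-01-05'), (102, 1, '2021-01-06'), (103, 2, '2021-01-06');
""")

# Joining the fact table to the dimension table yields a wide table:
# every event row now carries its descriptive attributes.
wide_rows = conn.execute("""
    SELECT f.signup_id, f.signup_date, d.campus_name, d.city
    FROM fact_signup AS f
    JOIN dim_campus AS d ON d.campus_id = f.campus_id
    ORDER BY f.signup_id
""").fetchall()

# The same join lets the facts be analyzed along a dimension attribute.
signups_per_campus = dict(conn.execute("""
    SELECT d.campus_name, COUNT(*)
    FROM fact_signup AS f
    JOIN dim_campus AS d ON d.campus_id = f.campus_id
    GROUP BY d.campus_name
""").fetchall())
```

The fact table only stores the campus id; the campus name and city come from the dimension table at query time, which is why the dimension table can stay small while the fact table grows.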

7. What is an indicator, what is a dimension, and what is the difference and connection?
An indicator is the data being examined; a dimension is the angle from which it is examined.

Indicators are generally numeric. On a chart, the Y-axis shows indicators. Indicators are divided into absolute values and relative values.

Dimensions are generally character types and describe characteristics. The X-axis shows dimension information. Dimensions are divided into qualitative and quantitative dimensions.

Dimensions and indicators can be converted into each other under certain conditions.

8. What problem does the data warehouse mainly solve?
Enterprises want to analyze their data, but the data sits in silos and the volume is too large. A data warehouse solves this by storing the data centrally and handling computation over massive data, ideally while still supporting SQL.

9. What is a slowly changing dimension? What scenarios is it suitable for? How is it handled?
A slowly changing dimension (SCD) is a dimension whose attributes change over time.

For example: a person's marital status, work experience, employer, and training history.

Slowly changing dimensions are commonly handled with a zipper table. It suits scenarios where the complete version history of a record must be kept, and it greatly saves storage space. It is currently the most widely used approach.
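
As a rough illustration of how a zipper table keeps history, here is a minimal in-memory sketch in Python. The field names (start_date, end_date), the 9999-12-31 "still current" marker, and the half-open validity interval are assumptions chosen for the example; a real Hive implementation would do the same thing with table joins and inserts.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed marker meaning "this version is still current"

def apply_change(zipper, key, new_attrs, change_date):
    """Close the current version of `key` (if any) and append the new version."""
    for row in zipper:
        if row["key"] == key and row["end_date"] == OPEN_END:
            if row["attrs"] == new_attrs:
                return zipper  # nothing changed, keep the current version open
            row["end_date"] = change_date  # close the old version
            break
    zipper.append({"key": key, "attrs": new_attrs,
                   "start_date": change_date, "end_date": OPEN_END})
    return zipper

def as_of(zipper, key, when):
    """Return the attributes valid for `key` on `when` (start inclusive, end exclusive)."""
    for row in zipper:
        if row["key"] == key and row["start_date"] <= when < row["end_date"]:
            return row["attrs"]
    return None

# Hypothetical example: one person's marital status changes once.
history = []
apply_change(history, "emp_1", {"marital_status": "single"}, date(2020, 1, 1))
apply_change(history, "emp_1", {"marital_status": "married"}, date(2021, 6, 1))

status_2020 = as_of(history, "emp_1", date(2020, 12, 31))
status_now = as_of(history, "emp_1", date(2021, 12, 31))
```

Only a record that actually changes gets a new row, which is where the storage saving comes from compared with keeping a full snapshot for every day.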

10. What are the layering and classification of dimensions? What are rolling up and drilling down?

Dimensions are not fixed: any dimension can be refined into sub-dimensions.

Dimensions therefore have hierarchical relationships.

The relationship between an upper level and a lower level is called layering.

The relationship between members of the same level is called classification.

Roll up: move from the current dimension to its parent dimension for statistical analysis.

Drill down: move from the current dimension to its child dimensions for statistical analysis.
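
Moving up and down a dimension hierarchy can be sketched with a small Python example over a year > month > day time hierarchy. The counts and the aggregate helper are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical daily sign-up counts keyed by (year, month, day).
daily_signups = {
    (2021, 1, 5): 3,
    (2021, 1, 6): 2,
    (2021, 2, 1): 4,
    (2022, 1, 3): 1,
}

def aggregate(facts, level):
    """Sum the counts at a given depth of the time hierarchy.

    level=1 groups by year, level=2 by (year, month), level=3 by (year, month, day).
    """
    totals = Counter()
    for key, count in facts.items():
        totals[key[:level]] += count
    return dict(totals)

by_month = aggregate(daily_signups, 2)  # current view: month level
by_year = aggregate(daily_signups, 1)   # roll up: month -> year
by_day = aggregate(daily_signups, 3)    # drill down: month -> day
```

Going to the parent level shortens the grouping key and merges rows; going to the child level lengthens the key and splits them again.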

11. Please briefly describe the modeling of the 5 kanbans in the project.
The 5 kanbans are basically divided into 4 layers: the ODS, DWD, DWM, and DWS layers.
The ODS layer stores the original data; the DWD layer cleans, filters, and converts data; the DWM layer performs dimension degeneration; and the DWS layer performs aggregation by business theme.
Kanban one follows this modeling, and the other kanbans add a DIM layer to store dimension table data. Kanban five has no DWD layer because its data is already clean and needs no processing; everything else is unchanged. Whether there is an APP layer depends on whether the analysis results need to be stored.
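
The division of work among the layers can be sketched in plain Python. This is only an illustration: the row fields, the dim_campus lookup, and the function names are invented for the example, and in the project each step would be a Hive job rather than a Python function.

```python
# Hypothetical raw rows as they might land in the ODS layer.
ods_rows = [
    {"customer_id": 1, "campus_id": 10, "paid": True, "pay_time": "2021-01-05 09:30:00"},
    {"customer_id": 2, "campus_id": 10, "paid": False, "pay_time": None},
    {"customer_id": 3, "campus_id": 20, "paid": True, "pay_time": "2021-01-06 14:00:00"},
    {"customer_id": None, "campus_id": 20, "paid": True, "pay_time": "2021-01-06 15:00:00"},
]
dim_campus = {10: "North Campus", 20: "South Campus"}  # stand-in for a DIM table

def to_dwd(rows):
    """DWD: clean and convert. Drop invalid or unpaid rows, derive a date field."""
    return [
        {"customer_id": r["customer_id"], "campus_id": r["campus_id"],
         "pay_date": r["pay_time"][:10]}
        for r in rows
        if r["customer_id"] is not None and r["paid"]
    ]

def to_dwm(rows, campuses):
    """DWM: dimension degeneration. Fold the campus name into each detail row."""
    return [{**r, "campus_name": campuses[r["campus_id"]]} for r in rows]

def to_dws(rows):
    """DWS: aggregate by business dimensions (sign-ups per campus per day)."""
    totals = {}
    for r in rows:
        key = (r["campus_name"], r["pay_date"])
        totals[key] = totals.get(key, 0) + 1
    return totals

dwd_rows = to_dwd(ods_rows)
dws = to_dws(to_dwm(dwd_rows, dim_campus))
```

Each function reads only the output of the layer before it, so a problem in one step can be fixed and rerun without touching the earlier layers.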

12. How many analysis requirements are there in total?
There are 35 requirements in total.

13. Which requirements have you implemented?
Take the fourth kanban as an example.
Campus registration histogram: the requirement is to count the number of registrations at each campus within a given period.
Indicator: number of registrations.
Dimensions: year, month, day, online/offline, campus.
The tables involved are the customer intent table and the enrollment course table; the fields include class id, id, school name, payment status, and payment time. The join condition is that the class id of the customer intent table matches the class id of itcast.
Subject registration histogram: the dimensions are subject plus the shared dimensions.
Campus-subject registration TOP: counts the registrations for each subject at each campus; subject and campus are added to the shared dimensions.
Total registrations: counts all registered users. The indicator is the number of registrations and the dimensions are the shared dimensions. Online registrations are the online portion of total registrations.
Sign-up conversion rate of intent users: the total number of sign-ups divided by the total number of newly added intent users. The indicator and dimension data already exist from earlier work and can be reused directly. In the same way, the sign-up conversion rate of effective leads can reuse existing data.
For the remaining requirements (daily registration trend, source channel, and consulting center), the indicator is the number of registrations, and the dimensions are the shared dimensions plus day, source channel, and consulting center respectively. From this we can conclude that the common indicators across these ten requirements are the number of registrations, the number of intent users, and the number of effective leads. The intent user and effective lead figures can reuse the data from the previous kanbans.
Next is the modeling analysis. First, the original data in the ODS layer includes customer_relationship (registration information), itcast_clazz (campus and subject information after registration), employee (internal employee information), and scrm_department (department information). Second, the DWD layer cleans, extracts, and converts data: we clean the customer table to keep only non-empty, paid records, and convert fields to obtain the online/offline flag and the year, month, and day. Third, the DWM layer builds on the DWD layer and joins the campus, subject, and consulting center tables to obtain the required fields. Finally, the DWS layer aggregates by the product's attribute dimensions to produce a wide statistics table. The product attribute dimensions include campus, subject combination, source channel, and consulting center. That concludes my analysis of the registered-user kanban. Thank you.
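
The sign-up conversion rate mentioned in this answer is simple arithmetic; a small Python sketch with made-up figures shows the calculation. The function name and the choice to return None when there are no new intent users are assumptions for the example.

```python
def signup_conversion_rate(total_signups, new_intent_users):
    """Total sign-ups divided by newly added intent users."""
    if new_intent_users == 0:
        return None  # the rate is undefined when there were no new intent users
    return total_signups / new_intent_users

# Hypothetical figures for one campus in one month.
rate = signup_conversion_rate(total_signups=45, new_intent_users=300)
print(f"{rate:.1%}")  # prints 15.0%
```

Because both inputs are already produced per shared dimension by the earlier kanbans, the rate can be computed per dimension combination without going back to the detail data.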

14. What technical framework was used in the project? What role does each technical framework play?
The technology stack includes MySQL, Sqoop, Hive managed by CM, Oozie, and FineBI. Since most of the data in the OLTP system is stored in MySQL, we chose Sqoop as the import and export tool. We extract the data into the data warehouse, use Hive (managed by CM) for data cleaning and analysis, export the results back to MySQL with Sqoop, and finally use FineBI to display the OLAP analysis results.

15. How is the project data transferred?
Business database -> ODS (import and back up data with ETL tools) -> DWD (data cleaning: remove invalid data) -> DWM (pre-aggregation by dimension) -> DWS (complete aggregation for business themes) -> APP (stores the calculation results) -> MySQL

16. Please hand-draw the project architecture diagram and attach a description of the data flow.

17. What are the theme boards for the project? What aspect does each kanban focus on?
The project analysis covers five major directions (themes):

  • Visit and consultation theme
  • Registered user theme
  • Effective lead theme
  • Intent user theme
  • Student attendance theme

18. Which dimensions in the project are multi-level dimensions?
Multi-level dimension: a dimension whose members have upper-level and lower-level (hierarchical) relationships.
The multi-level dimensions in kanban four, which I am responsible for, are the time dimension and the campus dimension.

19. What fact tables are in the project?
A fact table holds the records of real events in the project.
For example:
Kanban one: inquiry table, visit table
Kanban two: intent table, lead table
Kanban three: lead table, intent table (appeal table)
Kanban four: intent table
Kanban five: student leave application table, student check-in record table

20. Briefly describe the difference and connection between SCD2 and zipper table.
SCD2: records the full history of changes. SCD2 can record data by adding fields or by adding tables.
Zipper table: can only add tables.
Connection: the zipper table is one of the SCD2 modes. The temporary table records all historical versions.

Origin blog.csdn.net/xianyu120/article/details/112857384