Data processing flow overview

Data processing is the most important part of a data product manager's work. Compared with the final report display, analysis reports, and data-driven decision making, this stage is time-consuming and low in visible value, yet an error here ripples through everything downstream. We often hear of data analyses of the same feature reaching diametrically opposite conclusions; tracing the cause usually reveals errors in the data processing pipeline.

This article looks at the flow of data after collection from the perspective of data products, and explains the data warehouse: a somewhat technical topic, but one closely tied to the output of data products.

1. Data processing

A large part of a data product manager's work is to turn raw, hard-to-read data into visible reports and conclusive analyses: summarizing data from various heterogeneous data sources and ultimately presenting it as reports, dashboards, dynamic data analysis queries, conclusive analysis reports, etc.

1. What are the heterogeneous data sources?

  • Server-side and client-side user behavior logs
  • Users' historical information: qualitative information (e.g. gender, occupation and other user profile data) and quantitative information (e.g. degree of interest in a category over the past 30 days)
  • Information obtained from third parties, e.g. crawler data, manually collated data, etc.

2. Most of this information requires secondary processing and cleaning to generate structured data

  • Clean and integrate dirty data, e.g. attribute delayed data to the date it actually occurred;
  • Generate base tables to improve the ease of use of the data, such as base tables of user profile data and behavior data;
  • Generate structured user & behavior business tables that can be applied directly to reports and analysis.
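The cleaning and base-table steps above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical field names (`occurred_at`, `arrived_at`), not a production pipeline:

```python
from datetime import datetime

# Raw, possibly delayed log records: each carries the time the event
# actually occurred and the time it arrived at the server.
raw_logs = [
    {"user_id": 1, "event": "play",
     "occurred_at": "2020-03-01T23:58:00", "arrived_at": "2020-03-02T00:05:00"},
    {"user_id": 2, "event": "play",
     "occurred_at": "2020-03-02T10:00:00", "arrived_at": "2020-03-02T10:00:03"},
]

def to_date(ts: str) -> str:
    """Standardize a timestamp string to a YYYY-MM-DD partition date."""
    return datetime.fromisoformat(ts).strftime("%Y-%m-%d")

# Step 1: attribute delayed records to the date they occurred, not arrived.
cleaned = [dict(r, dt=to_date(r["occurred_at"])) for r in raw_logs]

# Step 2: roll the cleaned detail up into a simple per-day base table.
base_table = {}
for r in cleaned:
    base_table[r["dt"]] = base_table.get(r["dt"], 0) + 1

print(base_table)  # {'2020-03-01': 1, '2020-03-02': 1}
```

Note how the first record arrived after midnight but is counted under 2020-03-01, the day it occurred; summarizing by arrival date instead is exactly the kind of subtle error that makes two analyses disagree.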

These two seemingly simple steps are the key factors that determine the quality of report presentation and analysis, and they are also where the data product manager needs to invest the most effort.

2. Data Warehouse

The data processing process often feels vague, but in the pipeline of "heterogeneous data sources -> structured data tables -> reports / analysis reports", the various database tables we commonly use are the entities of the data warehouse, e.g. in Hive, Spark, Oracle, etc. So what data warehouse knowledge points should a data product manager pay attention to in daily data processing?

1. Data Warehouse Layering

Why do you want to layer?

  1. Clearer data management and tracking (clean table structure, clear lineage): helps us trace the entire link of data processing;
  2. Less redundant computation, by building common intermediate tables: a common intermediate table can directly feed downstream business data tables, avoiding re-deriving every business table from the raw data each time;
  3. Easier decomposition of the data processing process: clear layering breaks the complex "data -> business application" path into multiple steps, with each layer handling a single step.

What are the layers, and what should we pay attention to at each one?

Operational data store (ODS, Operational Data Store): the data at this layer is closest to the original form of the data source (content and granularity are consistent with the raw data). Usually, data from the source is stored here directly after ETL. From raw data to the ODS layer, complex data cleaning is not recommended, so as not to destroy the original data and create unnecessary troubleshooting costs.

It is recommended to do only the following:

  • Map JSON-formatted log records to named fields;
  • Clean out cheating / fraudulent data;
  • Data transcoding: map codes to values with real business meaning;
  • Data standardization, e.g. format all dates as YYYY-MM-DD;
  • Abnormal value repair, e.g. in a video playback log (including user id, video id, playback client, play duration, etc.), fixing obviously invalid values.

For a table in the ODS layer, confirm whether all meaningful fields of the raw data have been cleaned and preserved.
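The recommended ODS-layer operations can be sketched as one small cleaning function. The field names (`uid`, `gender_code`, `ts`) and the gender code mapping are hypothetical, chosen only to illustrate field mapping, transcoding, and date standardization:

```python
import json

# A raw JSON log line as it might land in the ODS layer (hypothetical fields).
raw_line = '{"uid": "42", "gender_code": 1, "ts": "2020/03/02"}'

# Transcoding table: map codes to values with real business meaning.
GENDER_MAP = {0: "female", 1: "male"}

def ods_clean(line: str) -> dict:
    rec = json.loads(line)                  # map the JSON log to named fields
    return {
        "user_id": int(rec["uid"]),
        "gender": GENDER_MAP.get(rec["gender_code"], "unknown"),
        "dt": rec["ts"].replace("/", "-"),  # standardize date to YYYY-MM-DD
    }

print(ods_clean(raw_line))
# {'user_id': 42, 'gender': 'male', 'dt': '2020-03-02'}
```

Each operation is reversible or additive; nothing about the original record is thrown away beyond format normalization, which matches the advice above to avoid destructive cleaning at this layer.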

Detail data layer (DWD, Data Warehouse Detail): performs business-level data cleaning and normalization on top of the ODS layer, e.g. a log-level table of users' video playback.

For a table in the DWD layer: does it clearly and completely record the detail data at the business level?

Summary data layer (DWS, Data Warehouse Summary): summarizes the data of the ODS / DWD layers according to business requirements, e.g. playback records joined with user profile information.

For a table in the DWS layer: can it effectively and conveniently serve the statistical requirements of its business direction?

Application data layer (ADS, Application Data Store): the statistical results the business actually needs, e.g. video playback statistics broken down by user type.

For a table in the ADS layer: can the business obtain the statistics it needs from it directly?

Dimension table (DIM): stores basic attribute information, e.g. a user attribute table with gender, age, etc.

For a table in the DIM layer: does it fully record the dimensions needed for subsequent analysis and statistics?

Besides these fixed layers, there are of course also temporary tables (TEM).

Alibaba's / Huawei's data warehouse layering: operational data layer (ODS), detail data layer (DWD), summary data layer (DWS), application data layer (ADS), and dimension tables (DIM); the operational, detail, and summary data layers together form the common (public) data layer.
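To make the layering concrete, the following sketch builds each layer on the one below it using SQLite. The table and field names (`ods_play_log`, `dim_user`, etc.) are illustrative, and views stand in for the materialized tables a real warehouse would use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# ODS: raw playback log, stored essentially as collected.
cur.execute("CREATE TABLE ods_play_log (user_id INT, video_id INT, play_seconds INT, dt TEXT)")
cur.executemany("INSERT INTO ods_play_log VALUES (?, ?, ?, ?)", [
    (1, 101, 120, "2020-03-02"),
    (1, 102, -5,  "2020-03-02"),   # abnormal value: negative duration
    (2, 101, 300, "2020-03-02"),
])

# DIM: user attribute dimension table.
cur.execute("CREATE TABLE dim_user (user_id INT, gender TEXT)")
cur.executemany("INSERT INTO dim_user VALUES (?, ?)", [(1, "male"), (2, "female")])

# DWD: business-level cleaning on top of ODS (drop abnormal durations).
cur.execute("""CREATE VIEW dwd_play_log AS
               SELECT * FROM ods_play_log WHERE play_seconds >= 0""")

# DWS: summarize DWD detail joined with the user dimension.
cur.execute("""CREATE VIEW dws_user_play AS
               SELECT d.user_id, u.gender, SUM(d.play_seconds) AS total_seconds
               FROM dwd_play_log d JOIN dim_user u ON d.user_id = u.user_id
               GROUP BY d.user_id, u.gender""")

# ADS: the statistic the business asked for, e.g. playback by gender.
ads = cur.execute("""SELECT gender, SUM(total_seconds) FROM dws_user_play
                     GROUP BY gender ORDER BY gender""").fetchall()
print(ads)  # [('female', 300), ('male', 120)]
```

Each layer handles a single step, so a wrong number in the ADS result can be traced back one layer at a time, which is exactly the troubleshooting benefit layering is meant to provide.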

In addition, when designing a table, consider carefully which roles will use it. Is the table easy to use? Is its content redundant? Is it secure?

  • Can business-line colleagues get the data results they need with a few simple SQL statements?
  • Can the statistics be obtained from a single table, or do multiple tables need to be joined?
  • Is the content of a single table redundant? Will that affect query efficiency?
  • When multiple tables are joined, are there pitfalls in business understanding, e.g. are the relationships between tables one-to-one, one-to-many, or many-to-many? How do we make this clear to users?
  • Does the table contain sensitive fields, such as monetary amounts? Do the user groups accessing it have the appropriate permissions?

2. Metadata management

Metadata and its applications are also an important part of a data warehouse. Metadata is "data about data": attribute information that describes the data, and it helps us conveniently find the data we care about.

What information does the metadata record?

  • Table structure of the data: field information, partition information, index information, etc.;
  • Data usage & permissions: storage footprint, read/write records, modification records, permission ownership, audit records, and other such information;
  • Data lineage information: lineage is simply the upstream-downstream relationship of data (where does this data come from?). Through lineage we can understand the dependencies between the tasks that produce the data, which helps the scheduling system order jobs, helps determine which downstream data a failed or erroneous task may affect, and helps us locate problems during data troubleshooting;
  • Business attribute information: the business purpose of the table, the exact statistical caliber of each field, business descriptions, historical change records, reasons for changes, etc.
    This part of the metadata is mostly filled in manually, but it greatly improves convenience when using the data.
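Lineage in particular can be modeled as a directed graph from each table to its direct consumers; a simple traversal then answers "which downstream tables does a failed task affect?". The table names below are hypothetical:

```python
from collections import deque

# Upstream table -> tables that consume it directly (hypothetical lineage).
lineage = {
    "ods_play_log": ["dwd_play_log"],
    "dwd_play_log": ["dws_user_play"],
    "dim_user": ["dws_user_play"],
    "dws_user_play": ["ads_play_stats"],
}

def affected_downstream(table: str) -> set:
    """BFS over the lineage graph: every table transitively impacted by `table`."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(affected_downstream("ods_play_log")))
# ['ads_play_stats', 'dwd_play_log', 'dws_user_play']
```

Reversing the edges of the same graph answers the opposite question ("where does this data come from?"), which is what assists scheduling and troubleshooting.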

3. Offline data warehouse & real-time data warehouse

In addition, by data timeliness, a data warehouse can be divided into an offline data warehouse and a real-time data warehouse.

  • The offline data warehouse mainly records data at t-1 and older, computed at day, week, and month granularity;
  • The real-time data warehouse emerged from the demand for real-time data display, analysis, and algorithms.

4. Summary

The data processing process is the most time-consuming part of a data product manager's work of producing reports and analyses. Understanding the concepts and key points of the data warehouse helps us process data clearly and effectively, improve work efficiency, and spend more time on business insight.

Origin blog.csdn.net/edward_2017/article/details/98207648