Data Governance Practice at Meituan Hotel & Travel

This article introduces the history and practical experience of data governance at Meituan Hotel & Travel, the problems the data system ran into at each stage of business development and how they were solved, and closes with the current construction approach and future direction of data governance.


Background

Data governance has been a hot topic over the past two years, and many companies, especially large Internet companies, are running data governance programs. Why does everyone need data governance? My personal understanding is that problems can be introduced at every link of the data lifecycle, from generation, collection, production, and storage through application and destruction. In the early stage of development these problems have little impact, and everyone tolerates them. But as the business demands higher data quality and stability, as data keeps accumulating, and as requirements for refinement keep rising, we gradually find many problems that need to be governed. New problems keep being introduced during data development; data governance is the continuous elimination of those problems, so that data can be provided to the business in a high-quality, highly available, and highly secure manner.

1. What issues need to be addressed

What issues need to be governed in the data governance process? In summary, there are five major categories.


  • Quality issues are the most important. A common backdrop for data governance in many companies' data departments or business lines is that data quality has many problems, such as the timeliness, accuracy, consistency, and standardization of the data warehouse, and the logical consistency of data application indicators.
  • Cost issues: data in the Internet industry expands very rapidly, large Internet companies spend a very high proportion of their investment on big data infrastructure, and the cost keeps rising as data volume grows.
  • Security issues, especially for the user data the business cares most about: once leaked, the impact on the business is huge and can even determine the survival of the entire business.
  • Standardization issues: when a company has many business departments, the data standards of each business department and development team are inconsistent, and many problems arise when connecting and integrating data.
  • Efficiency issues: data development and data management run into inefficiencies that are often papered over by adding headcount.

2. The data situation at Meituan Hotel & Travel

Meituan's hotel & travel business became an independent business unit in 2014 and by 2018 had become an important online booking platform for domestic hotel and travel services. The business developed quickly and data grew quickly as well: in 2017 and 2018, the number of production tasks more than doubled each year, and data volume likewise more than doubled each year. Without governance, under this exponential growth the complexity and cost burden of future data production would be enormous.

In response to the situation we faced at the time, we summarized five categories of problems:

  • Lack of standardization. The business was growing very fast when data construction started, and standardization across multiple business lines existed only as specification documents; everyone understood them differently, so data developed by different engineers was hard to keep consistent.
  • Many data quality problems, concentrated in a few areas. First, heavy data redundancy: judging from the growth of data tasks, many new tasks went online while few were taken offline, and table life cycles were poorly controlled. Second, much of the application-layer data was built in chimney (siloed) style, many indicators had no unified management standard, and data consistency could not be guaranteed.
  • Costs grew very fast. In some business lines, big data storage and computing resources accounted for more than 35% of machine costs; left uncontrolled, big data costs would only keep rising.
  • Weak data security control. A lot of data can be shared between business lines, but there was no unified data permission management across them.
  • Inefficient data management and operations. Data was used and consulted frequently, and data engineers had to spend a lot of time answering business users' questions.

Governance practices

Before 2018, the Hotel & Travel data team also did data governance, optimizing and standardizing processes for data warehouse modeling, indicator management, and applications, but there was no systematic data governance plan. Since 2018, based on the five problems above, we have formulated an overall data governance strategy.

We divide data governance into several parts: organization, standards and specifications, technology, and measurement indicators. The overall implementation path rests on standardized specifications and organizational guarantees, with the execution of the governance strategy ensured by the overall technical system. At the same time, we built a data governance measurement system so we can observe and monitor the effect of governance at any time and ensure its long-term development.


Standardization and organizational guarantees

Every company mentions standardization when doing data governance, and our overall thinking is not much different. Data standardization includes three aspects: standard formulation, standard implementation, and the organizational guarantees behind both, for example, how to align the data technology department, the business departments, and the related business analysis teams around the same standards.


For standard formulation, we developed a full-link data standard method, establishing standards from data collection and data warehouse development to indicator management and data life cycle management. During standardization, we jointly established a data management committee for the business unit. The committee is a virtual organization whose main members come from the technical side and the business side: the technical side is the business data development team, and the business side is the business data product team. These two teams are responsible for implementation, each coordinating its own stakeholders: the technical team coordinates the back-end development team, the big data platform team, and the data analysis system team, while the business team coordinates business analysis, product operations, and the relevant business departments. Each department dispatches people to run the data management committee, providing organizational guarantees for the formulation and implementation of standards. This gives everyone a more unified understanding of standardization, reduces resistance during implementation, and lets information be synchronized within the organization regularly.


Technology System

During implementation, we do not want to rely entirely on manpower and organization; generally we hope to carry it out in an automated way. Let me introduce our technical system.

① Data quality. Data quality is the most important issue in data governance; at present, most problems in data governance are data quality problems. There are four major issues here:

  • The standardization of the data warehouse is relatively weak. Although there are specification documents, execution depends largely on individual understanding.
  • There are many data consistency problems, mainly in indicator management. Indicators used to be defined in documents, with no systematic, unified management of their calculation and query logic.
  • There are many data applications. Data is consumed via table synchronization, interface and message push, OLAP engine queries, and so on, so consistency on the application side cannot be guaranteed.
  • There are many products: more than ten business data product portals with no unified entrance and no unified review, leading to many differences in data applications and usage.

Our technical approach targets these four categories of quality problems: first unify data warehouse standardization, then unify indicator logic, unify the data service interface on top of that, and finally unify the user-facing product entrance. From these four directions, common data quality problems are governed and controlled; the specific technical implementations are as follows.


Data warehouse modeling specification

The unified data warehouse modeling specification is implemented in three parts: beforehand, in-process, and afterwards. In the past we only had some beforehand specifications, and everyone modeled and implemented according to their own understanding. On that basis we added the in-process and afterwards parts. For the in-process part, we developed systematic tools and configuration-driven data warehouse development; afterwards, we run regular verification. Beforehand, standardized documents are published and promoted so everyone understands them in advance; in-process, many standardization items are enforced automatically through configuration; afterwards, checks at release time and weekly inspections after release verify whether the warehouse modeling conforms to the standard, flag non-conforming items promptly, and drive timely improvement.

The beforehand standards cover several directions. The first is the data warehouse design specification: before starting a new business or module, a design is produced in document form. The second is the development specification, covering development processes, code-writing conventions, and comment and annotation information.

Once these specifications are in place, we also want to enforce them systematically, so that differences in individual understanding do not affect the standardization of the data warehouse. There are mainly three kinds of tools:

  • The model development tool controls a model's basic information, its warehouse subject area and layer, and ETL code generation.
  • The naming standard tool implements systematic, standardized naming for models, tables, fields, and indicators.
  • The online rule monitoring tool monitors specification compliance and performance during release, so problems are discovered in time.

Afterwards, regular monitoring is conducted and reports are generated, showing the degree of standardization of the data warehouse for each business line, each team, and each individual.

As a concrete example, consider the naming convention tool used in data warehouse development and configuration. The essence of our tooling is still the path from specification and standardization to tools: we standardize first, then enforce the standard through the system. With the tool, when people build the warehouse, a field is named in one unified way, so even if the field appears in thousands of ETL jobs it can be located very quickly. The naming tool is also integrated with the warehouse ETL modeling tool: once the naming review passes, a code skeleton can be generated in the ETL platform with one click, and the developer only needs to fill in the query logic. This achieves control over the naming conventions of the data warehouse.
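To make this concrete, here is a minimal sketch of what a naming check plus one-click ETL skeleton generation could look like. Everything in it is hypothetical: the layer prefixes, the `<layer>_<subject>_<description>` rule, and the table and field names are illustrative, not Meituan's actual conventions.

```python
import re

# Hypothetical naming rule: <layer>_<subject>_<description> in lowercase
# snake_case, where the layer prefix is one of ODS/DWD/DWS/ADS.
TABLE_NAME_RE = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9]+(_[a-z0-9]+)+$")

def check_table_name(name: str) -> list[str]:
    """Return a list of naming violations; an empty list means review passes."""
    problems = []
    if not TABLE_NAME_RE.match(name):
        problems.append(f"'{name}' does not match <layer>_<subject>_<description>")
    return problems

def generate_etl_skeleton(table: str, fields: list[str]) -> str:
    """After naming review passes, emit a SQL skeleton with one click;
    the developer only fills in the query logic."""
    cols = ",\n  ".join(fields)
    return (
        f"INSERT OVERWRITE TABLE {table}\n"
        f"SELECT\n  {cols}\n"
        "FROM ...  -- TODO: source tables, joins, and filters\n"
    )

name = "dwd_hotel_order_detail"  # illustrative table name
if not check_table_name(name):
    print(generate_etl_skeleton(name, ["order_id", "user_id", "pay_amount"]))
```

The design point this sketch tries to capture is that the naming review and the ETL platform are connected, so the standard is enforced by the tool rather than by a document.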



Unified indicator management system

Indicators are very important in data warehouses, and all data applications consume data as indicators. Systematic indicator management mainly means standardizing the management process, the indicator definitions, and indicator usage. The system is divided into three layers: physical table management, model management, and indicator management, with all of this information unified under metadata management.

Unified registration is only the first step of indicator management. Beyond managing indicators, all data applications can also query data through this tool. Concretely, an application queries two kinds of data, dimensions and indicators, and an indicator query may carry dimensional restrictions. In the indicator management module, the specified indicator is used to locate the data warehouse model and determine how the indicator is obtained (sum, count, and so on). The corresponding model can be a star schema, a wide table, or another model type, and the underlying physical tables can be resolved from the model. After resolution, the indicators, dimensions, and filter conditions are combined and translated into different query statements for different storage engines. With indicator management controlled this way, data applications obtain consistent results through the indicator management module.
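A highly simplified sketch of this resolution step might look like the following. The indicator registry, table names, and the single SQL dialect are illustrative assumptions; the real system resolves full warehouse models and targets multiple storage engines.

```python
from dataclasses import dataclass

@dataclass
class Indicator:
    name: str    # indicator name exposed to applications
    agg: str     # acquisition method, e.g. "SUM" or "COUNT"
    column: str  # physical column the aggregation applies to
    table: str   # physical table resolved from the warehouse model

# Hypothetical registry; in the real system this comes from unified metadata.
REGISTRY = {
    "room_nights": Indicator("room_nights", "SUM", "room_night_cnt",
                             "dws_hotel_order_day"),
}

def build_query(indicator: str, dims: list[str], filters: dict[str, str]) -> str:
    """Combine indicator, dimensions, and filters into one query statement."""
    ind = REGISTRY[indicator]
    select = ", ".join(dims + [f"{ind.agg}({ind.column}) AS {ind.name}"])
    where = " AND ".join(f"{k} = '{v}'" for k, v in filters.items()) or "1=1"
    sql = f"SELECT {select} FROM {ind.table} WHERE {where}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    return sql

print(build_query("room_nights", ["dt", "city_id"], {"dt": "2018-06-01"}))
```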


Unified data service

Our data is consumed by many downstream systems, such as data products, business systems, operations systems, and management systems. Some downstream consumers need not just data tables but interfaces, yet it is hard for a data team to develop and maintain back-end interfaces, and once interfaces are handed out it is hard to control how the data is used. So we built a unified data service platform. The platform's goals are to improve efficiency, improve data accuracy, provide data monitoring, and connect the whole link from data warehouse to data application. It serves data in two ways: for B-side applications, on-demand (pull) queries, on the order of tens of thousands of calls per day; for C-side scenarios, push, for example pushing the latest data once a day. Push and pull together ensure the completeness of the service.


The specific implementation is divided into several layers:

  1. The data import layer.
  2. The storage layer: data is stored in different ways for different usage scenarios. For example, KV storage is best suited to fetching a record by key; simple aggregations with high responsiveness requirements use MySQL; very large but infrequently queried data goes to an OLAP engine.
  3. The service layer encapsulates the storage-engine queries.
  4. The control layer handles permission management, parameter validation, and business resource isolation.
  5. The interface layer provides different query methods, such as aggregate query, KV query, detail query, and group query.
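As a rough sketch of how these layers could fit together (the application names, grants, and stubbed backends are all illustrative assumptions, not the platform's real API):

```python
# Control layer -> service layer -> storage routing, with in-memory stubs.

AUTHORIZED = {("demo_app", "kv"), ("demo_app", "agg")}  # hypothetical grants

KV_STORE = {"order:1001": {"order_id": 1001, "status": "paid"}}

def control_layer(app: str, query_type: str, params: dict) -> None:
    """Permission check and parameter validation before any query runs."""
    if (app, query_type) not in AUTHORIZED:
        raise PermissionError(f"{app} is not authorized for {query_type} queries")
    if not params:
        raise ValueError("query parameters must not be empty")

def service_layer(query_type: str, params: dict):
    """Route each query type to the storage engine that suits the scenario."""
    if query_type == "kv":       # point lookup by key -> KV store
        return KV_STORE.get(params["key"])
    if query_type == "agg":      # small, latency-sensitive aggregation -> MySQL (stubbed)
        return {"sum": 42}       # placeholder result
    raise NotImplementedError(query_type)  # detail/group queries -> OLAP engine, etc.

def query(app: str, query_type: str, params: dict):
    """Interface layer: the single entry point exposed to downstream systems."""
    control_layer(app, query_type, params)
    return service_layer(query_type, params)

print(query("demo_app", "kv", {"key": "order:1001"}))
```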

Unified user product entry

Because there were so many data entrances, we built a unified data entrance, divided into three categories:

  • Analysis and decision products for managers and business analysts
  • Business sales data products for sales and operations
  • Data asset management products

In this way, each type of user only needs one type of product at one entrance, and data within the same type of product will not be inconsistent. We also ensure the consistency of the three types of underlying data marts through unified data warehouse modeling and indicator management, thus ensuring the consistency of all data.



Overall system architecture

The overall technical architecture is layered: from unified data modeling to unified indicator logic, unified data service, and a unified product entrance, which together guarantee data quality end to end. Combined with the organizational guarantee system and the process specifications of data governance, this improves overall data quality.

② Data operation efficiency

As a data provider, we hold a lot of data assets, but can data users quickly find the data, know how to use it, and know what data exists? The questions fall into three main types:

  • Can't find it: users don't know whether the data exists or where it is.
  • Can't understand it: many business users are not technical teams; they don't understand what the data means, how to join it in queries, or which business system it comes from.
  • Can't use it: users don't know how to write the SQL or which product can query the indicator they want.

Based on this, there are three goals: let users find the data, understand it, and use it correctly. To improve efficiency, we replace manual work with systems wherever possible. For operational data questions, we first provide a systematic data guide containing three kinds of information: indicators, data warehouse models, and recommended usage. The guide can resolve roughly 60% of questions; of the remaining 40%, an answering robot handles about 60%, letting machines answer instead of people. What still cannot be answered falls through to a small amount of manual Q&A. Through this automation, manual work has been reduced to less than 20% of what it was.

For the implementation, we built a system for the data usage guide that manages indicator metadata, dimension metadata, data tables, and metadata for the various products. Users can quickly locate what they need from the entrance; the guide supports category browsing and keyword search, provides rankings for key recommendations, and classifies and describes each subject area. Many questions can be solved through the guide; those that cannot are passed to the answering robot, which mainly handles questions not covered by metadata. Questions and answers from our daily communication tools are collected into a knowledge base, cleaned, and rule-matched: each Q&A pair is analyzed as one question mapped to one answer and stored after matching rules and keywords. Later, when a user types a question, the system infers from this analysis the several questions they may be asking and returns those answers.
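Here is a minimal sketch of the retrieval step such an answering robot might use, ranking knowledge-base entries by keyword overlap; the entries and the scoring rule are illustrative assumptions, not the production logic:

```python
import re

# Cleaned Q&A knowledge base: keywords extracted from past questions,
# each mapped to one stored answer (contents are hypothetical).
KNOWLEDGE_BASE = [
    {"keywords": {"hotel", "order", "detail", "table"},
     "answer": "See dwd_hotel_order_detail in the DWD layer."},
    {"keywords": {"room", "nights", "calculated"},
     "answer": "room_nights = SUM(room_night_cnt); see the indicator registry."},
]

def tokenize(text: str) -> set[str]:
    """Lowercase and strip punctuation so matching is robust."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def answer(user_question: str, top_k: int = 3) -> list[str]:
    """Return up to top_k candidate answers, best keyword match first."""
    words = tokenize(user_question)
    scored = [(len(words & e["keywords"]), e["answer"]) for e in KNOWLEDGE_BASE]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [a for _, a in scored[:top_k]]

print(answer("Which table has hotel order detail?"))
```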

③ Data cost

The data cost of Meituan's business is also very high, and the cost of data storage and computation rises quickly every year. The current approximate split is 70% computing, 20% storage, and 10% log collection. For these three categories we have built corresponding cost management solutions.
For the computation class, the main work was:

  • Governing invalid tasks; optimizing extra-long tasks
  • Improving resource utilization; unified resource management

For the storage class:

  • Cold data governance; duplicate data governance
  • Data life cycle management; storage format compression

For the log collection class:

  • Monitoring downstream applications of logs
  • Optimizing the log reporting method
  • Eliminating invalid tracking points ("buried points")

The overall governance strategy has been refined hierarchically, as the sketch below illustrates. Resources are broken down by tenant (the user of each business line); under a tenant there are queues, split into offline and real-time; under a queue there are computation, storage, and collection; and within computation there are offline, real-time, and configuration/usage details. This makes it easy to locate exactly which tenants and which data warehouses have problems, and to govern them quickly.
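A toy sketch of this kind of hierarchical cost attribution, assuming hypothetical tenants, queues, and usage records:

```python
from collections import defaultdict

# Hypothetical usage records: (tenant, queue, resource_type, cost).
USAGE = [
    ("hotel_bu", "offline_q1", "compute", 1200.0),
    ("hotel_bu", "offline_q1", "storage", 300.0),
    ("hotel_bu", "realtime_q1", "compute", 800.0),
    ("travel_bu", "offline_q2", "compute", 2500.0),
]

def rollup(records):
    """Aggregate cost at each level of the tenant -> queue -> resource hierarchy."""
    totals = defaultdict(float)
    for tenant, queue, rtype, cost in records:
        totals[(tenant,)] += cost
        totals[(tenant, queue)] += cost
        totals[(tenant, queue, rtype)] += cost
    return dict(totals)

for key, cost in sorted(rollup(USAGE).items()):
    print("/".join(key), cost)
```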

We have also built a lot of systematic tooling here. For example, there is logic for judging data redundancy: after each warehouse model is created and its metadata is generated, the metadata is pre-processed and compared against existing data to judge whether the data already exists. If the configured comparison logic judges the data to be duplicated, it is marked and pushed to the data governance dashboard every week, so redundant data can be governed in time.
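One plausible minimal form of this redundancy judgment is a column-overlap comparison over table metadata; the similarity rule, threshold, and table names below are assumptions for illustration:

```python
# Flag a new table whose column set heavily overlaps an existing table's.
EXISTING_TABLES = {
    "dws_hotel_order_day": {"dt", "city_id", "order_cnt",
                            "room_night_cnt", "pay_amount"},
}

def find_redundant(new_columns: set[str], threshold: float = 0.8):
    """Return existing tables whose column overlap exceeds the threshold."""
    suspects = []
    for table, columns in EXISTING_TABLES.items():
        overlap = len(new_columns & columns) / len(new_columns | columns)  # Jaccard
        if overlap >= threshold:
            suspects.append((table, round(overlap, 2)))
    return suspects  # in practice, pushed to the weekly governance dashboard

print(find_redundant({"dt", "city_id", "order_cnt",
                      "room_night_cnt", "pay_amount"}))
```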

④ Data security

Data security is handled in three phases: prevention beforehand, monitoring during use, and tracing afterwards. In practice, it is realized through three layers of system control plus five principles of use. In the source business systems where data is generated, very sensitive user data is encrypted. The data warehouse layer desensitizes and re-encrypts the data at each layer. The third layer performs data audits, providing information prompts and audit reports throughout the process.

Five principles that should be followed in the use of data:

The ciphertext handling principle: all highly sensitive data must be transmitted as ciphertext.

The latest-possible decryption principle: if data is used in application-layer products, do not decrypt it in the data warehouse layer.

The minimum-extraction principle: if only 10,000 records are needed, only those 10,000 records may be decrypted.

The minimum-authorization principle: grant only as much access as is actually used.

The whole-process audit principle: measures cover the entire process from the moment data leaves a system to the moment it is used.
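As a small sketch of warehouse-layer desensitization in the spirit of these principles (the field names and masking rules are hypothetical, and the hash stands in for real encryption):

```python
import hashlib

def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 digits, mask the middle."""
    return phone[:3] + "****" + phone[-4:]

def pseudonymize(value: str, salt: str = "demo_salt") -> str:
    """One-way hash so joins still work without exposing the raw value;
    a stand-in for real encryption, which would be decrypted as late as
    possible in the application layer."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

record = {"user_id": "u1001", "phone": "13812345678", "city": "Beijing"}
safe_record = {
    "user_id": pseudonymize(record["user_id"]),
    "phone": mask_phone(record["phone"]),
    "city": record["city"],  # non-sensitive field passes through
}
print(safe_record)
```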

3. Measurement indicators

To comprehensively measure the effect of data warehouse governance, we built a data measurement indicator system divided into five categories: quality, cost, security, ease of use, and value. Monitoring is split into daily monitoring and periodic monitoring (weekly, monthly, quarterly), which tells us whether overall data governance is improving, holding steady, or deteriorating.

Following the PDCA (Plan-Do-Check-Act) principle, data governance is run as a routine operational project, relying on the data indicator system for monitoring at the bottom. The cycle is: discover a problem and propose an optimization plan, follow up and handle it, then return to daily monitoring.


Summary and planning

Generally speaking, data governance is divided into three major stages: passive governance, active governance, and automatic governance.

  • In the first stage we did passive governance, i.e., phased governance. There was no overall consideration; governance centered on individual problems, and after a while the same problems might need governing again. This stage relied on people: a project was set up and a few people were coordinated to complete it under a project system, with no systematic planning and no organizational guarantee.
  • The second stage is active governance, with long-term overall planning covering every link of the data life cycle. In the process, methods and experience are turned into processes, standards, and systems, so data problems stay solved over the long term and data governance remains controllable.
  • The third stage is automatic governance, which is also intelligent governance. Once long-term planning and the data life cycle links are settled, existing experience, processes, and standards are distilled into strategies; when a problem occurs, it is automatically detected and resolved by systematic means. The first step of automatic governance is turning governance plans into strategies, which relies heavily on metadata and on accumulating experience and technology across the various processes of data governance. Once the strategies are distilled, they are automated and carried out by tools: when the system finds a data problem, it handles it automatically.
  • At present, Hotel & Travel's data governance sits between the second and third stages. Although we have an overall governance plan, a technical architecture, and organizational guarantees, it still takes a lot of manpower. Going forward, Hotel & Travel data will continue to move toward intelligence and build out automated governance.

Source: blog.csdn.net/qq_43081842/article/details/114436053