Summary of data asset governance: using data to govern data

|0x00 Why data governance is difficult

Writing is not easy. If you find this useful, follow the official account: Xiaoyang's Data Station.

Chairman Mao said: "In studying any complex process in which there are two or more contradictions, we must devote every effort to finding its principal contradiction. Once this principal contradiction is grasped, all problems can be readily solved."

For data governance, the principal contradiction is the one between limited machine resources and the unbounded growth of storage and compute.

It is because of this principal contradiction that "data governance" remains a hot topic in the data field more than ten years after it was first proposed. The direction of the solution is equally simple: restrain the growth of storage and compute as much as possible. Technical means such as data compression and columnar storage, and methodologies such as dimensional modeling and storage health scores, can all postpone the data-growth dilemma.

But the biggest problem is still a human one: whether in data warehouse roles or data development roles, sensitivity to data governance and data risk issues is generally insufficient. These shortcomings show up at three levels:

Global level

  1. Weak risk awareness: first, data governance tends to be a post-mortem exercise, while day-to-day production and release habits are casual; second, data quality checks do not cover enough, or do not identify data problems accurately enough; third, most team members focus on business development, so investment in basic governance work is limited.
  2. Unreasonable governance methods: data governance usually suffers from an imbalance between rewards and punishments: do it badly and you get punished, do it well and nobody praises you. Governance actions also tend to be periodic campaigns in which only a few individuals take part, which fails to stimulate individual initiative.

Business level

  1. Refined operations: ever finer segmentation of business scenarios drives more demand; the same piece of data has to be presented in many different scenarios, which objectively leads to growth in storage and compute.
  2. Temporary tasks: for some old businesses, even when the data is no longer used because the maintainer has changed, nobody is in a position to decide whether to take the task offline.
  3. Frequent data backfills: in scenarios such as e-commerce there are many requests to refresh (backfill) data, which keeps the cluster's compute resources running at full capacity.

Development level

  1. Efficiency first: since an individual's main task is to support business development quickly, the willingness to optimize cost consumption is low.
  2. Insufficient resources: little or no time is set aside for resource governance.
  3. Insufficient skill: because of gaps in modeling ability or adherence to standards, problems such as duplicated computation, data skew, crude processing, brute-force scans, and unreasonable parameters are very common.

Therefore, data governance is first and foremost about unifying "human consensus" and establishing a "rule-of-law" process.

|0x01 Analyze the core issues of data governance

Since we want to unify the "consensus of people", let us start from the "common" problems and work out, step by step, where the breakthroughs for a solution lie.

As data developers, what "common" problems do we run into most often? I think there are roughly three:

  1. Can't find the data: why do we emphasize modeling standards? So that others can see at a glance what a table does. As the company keeps growing and data dependency chains keep deepening, if upstream and downstream differ in conventions such as naming, comments, and refresh cycles, then even if you can locate the upstream table through lineage, you cannot tell what the data means or how it was designed, so you cannot use it and end up rebuilding it yourself.
  2. Don't dare to use the data: duplicate data has always been a big problem in data governance. Metadata often turns up many tables or fields with similar names but different processing calibers; seeing this, people don't dare to use them and rebuild the data themselves (a minimal sketch for flagging such lookalike tables follows this list).
  3. Not allowed to use the data: as companies realize that machine costs are growing too fast, they impose strict requirements on the data budget, so large businesses that already occupy too many resources struggle to take on new demands. Everyone talks about governing data and cutting storage and compute, but few will tell you how to keep developing data within a limited budget.
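To make the "duplicate data" problem concrete, here is a minimal sketch of how lookalike tables could be flagged from metadata. It assumes the metadata is already available as (table name, column set) pairs; the sample names, thresholds, and the idea of combining name similarity with column overlap are illustrative assumptions, not an existing tool.

```python
from difflib import SequenceMatcher
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Overlap of two column sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_lookalike_tables(tables, name_thresh=0.8, col_thresh=0.6):
    """tables: list of (table_name, column_name_set). Returns candidate duplicate pairs."""
    suspects = []
    for (n1, c1), (n2, c2) in combinations(tables, 2):
        name_sim = SequenceMatcher(None, n1, n2).ratio()
        col_sim = jaccard(c1, c2)
        if name_sim >= name_thresh or col_sim >= col_thresh:
            suspects.append((n1, n2, round(name_sim, 2), round(col_sim, 2)))
    return suspects

# Hypothetical metadata pulled from the metastore.
tables = [
    ("dws_trd_order_1d", {"order_id", "buyer_id", "gmv", "ds"}),
    ("dws_trd_order_di", {"order_id", "buyer_id", "gmv", "pay_amt", "ds"}),
    ("dim_item",         {"item_id", "item_name", "category_id"}),
]
for pair in find_lookalike_tables(tables):
    print(pair)  # pairs with similar names or heavily overlapping columns
```

The flagged pairs would still need a manual comparison of calibers before any merge or offlining decision.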

Let's imagine a case:
Indicator A is one of the company's core assets, but for objective reasons its calculation rules need to change. We may then run into one of the following situations.

  1. The company's indicators are all computed directly from ODS: every downstream table has to modify its calculation logic, involving X tables, Y interfaces, and Z product modules;
  2. Only the corresponding DWS table needs to be modified, but downstream teams still have to investigate the scope of impact step by step, involving Y interfaces and Z product modules;
  3. The indicator has a single, well-defined meaning and calculation rule within the company, and the data is exposed through only one interface; only a few fixed tables need to be modified.

Although a company's business is usually very complex, with good abstraction a change to the underlying logic has little impact on users and avoids meaningless data rework.

From this example, we can sort out some common problems:

  1. From the perspective of data production: public-layer modeling must be standardized, and at minimum reviewed by analysts or the business side; tables cannot be created at will. At the same time, data output times need to be guaranteed, and the corresponding quality checks need a monitoring mechanism (a minimal sketch follows this list);
  2. From the perspective of data usage: R&D tools must be unified, and historical tables must have an offlining mechanism.
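As an illustration of the output-time and quality monitoring mentioned in point 1, here is a minimal sketch. The partition_info helper, the SLA, and the row-count threshold are hypothetical placeholders; in a real setup they would come from the metastore or the scheduler.

```python
from datetime import datetime, time

def partition_info(table: str, ds: str) -> dict:
    """Hypothetical stand-in: in practice, query the metastore or the scheduler API."""
    return {"ready_at": datetime(2021, 1, 15, 9, 40), "row_count": 1_250_000}

def check_output(table: str, ds: str, sla: time, min_rows: int) -> list:
    """Return alert messages; an empty list means the partition passed both checks."""
    info = partition_info(table, ds)
    if info["ready_at"] is None:
        return [f"{table} ds={ds}: partition not produced yet"]
    alerts = []
    if info["ready_at"].time() > sla:
        alerts.append(f"{table} ds={ds}: ready at {info['ready_at']:%H:%M}, SLA is {sla:%H:%M}")
    if info["row_count"] < min_rows:
        alerts.append(f"{table} ds={ds}: {info['row_count']} rows, expected at least {min_rows}")
    return alerts

print(check_output("dws_trd_order_1d", "20210114", sla=time(8, 0), min_rows=1_000_000))
```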

Don't underestimate the unification of R&D tools. When the business is growing rapidly, technical solutions change constantly, and the more "flexibly" you build, the higher the technical debt you carry later.

Data assets rely on the Hadoop ecosystem, and their governance cost is high, especially for unstructured data, which occupies a large amount of storage and compute while delivering relatively limited value. In the past we focused mainly on storage governance, but as the number of tasks grows, compute governance is also on the agenda. So, from a global perspective, a company needs its own unified modeling and evaluation methods and a unified development and operations platform; only on the basis of unified development standards and methods can we talk about effective data asset governance.

"Books with the same text, cars with the same track, unified weights and measures" is the core idea of ​​data governance.

|0x02 Use data to manage data

Once consensus and "weights and measures" are unified, we have the means to govern data. More concretely, once work behaviors are standardized they can be measured through "data indicators", revealing the overall state of the data assets and the key directions for improvement.

People who work on user growth know the importance of building an indicator system; people who work on data governance must likewise have the awareness of "using data to manage data".

So what is the concrete approach? There are two main parts: monitoring of the data model itself, and monitoring of business complexity.

Monitoring the data model is easy to understand, but why monitor business complexity? Because business complexity largely determines the complexity and cost of the data model, it needs to be monitored as well.

Let me start with monitoring the data model. Simply put, there are four strategies: names should conform to the standard; the reuse rate should be high; the utilization rate should be high; the dependency depth should not be too deep.

Names should conform to the standard: every developer knows that basic standards are needed. Take table naming: from the name alone you should be able to tell which business domain a table belongs to, which product module it serves, whether it is synchronized out or exposed as a view, what its refresh cycle is, and so on. All of this needs to be standardized through naming. Once the standard is set, you can count the tables that do not comply and rectify them within a fixed time window (a minimal sketch of such a check follows).
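Here is a minimal sketch of such a compliance check, assuming a hypothetical naming convention of the form layer_domain_description_refreshcycle; the pattern and the sample names are illustrative and would need to be adapted to your own standard.

```python
import re

# Hypothetical convention: <layer>_<domain>_<description>_<refresh cycle>,
# e.g. dws_trd_order_1d; adjust the pattern to your own standard.
NAME_PATTERN = re.compile(
    r"^(ods|dwd|dws|dim|ads)_"   # layer
    r"([a-z0-9]+)_"              # business domain / product module
    r"([a-z0-9_]+)_"             # description of the business process
    r"(1d|1h|di|hi|df)$"         # refresh cycle suffix
)

def non_compliant_tables(table_names):
    """Return the tables whose names do not match the convention."""
    return [t for t in table_names if not NAME_PATTERN.match(t)]

tables = ["dws_trd_order_1d", "ads_mkt_coupon_effect_df", "tmp_zhangsan_test"]
print(non_compliant_tables(tables))  # -> ['tmp_zhangsan_test']
```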

The reuse rate should be high: this one targets the CDM layer. In dimensional modeling theory, the main purpose of CDM is to improve data reuse, so CDM (including DWD, DWS, and DIM) must not be built demand by demand but abstracted from business processes. Counting the number of downstream dependencies of each CDM table is an effective way to assess how well the public layer is built; a CDM table that hardly anyone uses is not up to standard (a minimal sketch follows).
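A minimal sketch of counting downstream dependencies per CDM table, assuming table-level lineage is available as (upstream, downstream) edges; the edge list and layer prefixes are illustrative.

```python
from collections import Counter

# Hypothetical lineage edges (upstream_table, downstream_table),
# e.g. exported from the scheduler or a lineage service.
edges = [
    ("dwd_trd_order_di", "dws_trd_order_1d"),
    ("dwd_trd_order_di", "ads_mkt_gmv_daily"),
    ("dws_trd_order_1d", "ads_mkt_gmv_daily"),
    ("dws_usr_profile_1d", "ads_crm_tagging"),
]

def cdm_reuse(edges, cdm_prefixes=("dwd_", "dws_", "dim_")):
    """Number of distinct downstream tables per CDM-layer table."""
    downstream = Counter()
    for up, down in set(edges):
        if up.startswith(cdm_prefixes):
            downstream[up] += 1
    return downstream

# Tables at the bottom of the ranking are candidates for review or offlining.
for table, n in sorted(cdm_reuse(edges).items(), key=lambda kv: kv[1]):
    print(table, n)
```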

The utilization rate should be high: this one targets the ODS layer. ODS usually stores the most data, so if an ODS table is rarely referenced, the business behind it is usually not that important, and its retention period can be shortened accordingly, striking a balance between reference count and retention. There are of course exceptions, but exceptions do not represent the general case. In addition, some ADS tables reference ODS directly; this may be acceptable for a business in its early stage, but not for a mature one. The way to tell them apart is, again, to infer the business domain and product from the table name and link that to the maturity of the business domain (a minimal sketch of the retention check follows).
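A minimal sketch of linking ODS reference counts to retention periods; the reference counts, current retention values, and thresholds are invented for illustration, and any shortening would of course be confirmed with the table owner first.

```python
# Hypothetical inputs: downstream reference counts and current retention (days)
# per ODS table, e.g. derived from lineage and table properties.
ods_tables = {
    "ods_trd_order":   {"refs": 12, "retention_days": 365},
    "ods_log_click":   {"refs": 1,  "retention_days": 365},
    "ods_crm_contact": {"refs": 0,  "retention_days": 180},
}

def suggest_retention(refs: int) -> int:
    """Map a reference count to a suggested retention in days (illustrative thresholds)."""
    if refs == 0:
        return 7     # candidate for offlining once the owner confirms
    if refs <= 2:
        return 90
    return 365

for name, meta in ods_tables.items():
    suggestion = suggest_retention(meta["refs"])
    if suggestion < meta["retention_days"]:
        print(f"{name}: refs={meta['refs']}, retention {meta['retention_days']}d -> suggest {suggestion}d")
```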

The dependency depth should not be too deep: this one targets the ADS layer. What troubles data people most is layering that loops back and forth, with chains so long that nobody dares to touch them. Statistics on the dependency depth of the ADS layer, including the maximum depth and the distribution across depths, reveal problems in how the ADS layer is built (a minimal sketch follows).
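A minimal sketch of measuring dependency depth over an upstream map derived from lineage; the map is illustrative and assumed to be acyclic.

```python
from functools import lru_cache

# Hypothetical upstream map: table -> tables it reads from (assumed acyclic).
upstreams = {
    "ads_mkt_gmv_daily":  ["dws_trd_order_1d"],
    "dws_trd_order_1d":   ["dwd_trd_order_di"],
    "dwd_trd_order_di":   ["ods_trd_order"],
    "ods_trd_order":      [],
    "ads_crm_tagging":    ["dws_usr_profile_1d", "ads_mkt_gmv_daily"],  # ADS depending on ADS
    "dws_usr_profile_1d": ["ods_crm_contact"],
    "ods_crm_contact":    [],
}

@lru_cache(maxsize=None)
def depth(table: str) -> int:
    """Longest upstream path length in edges; source (ODS) tables have depth 0."""
    ups = upstreams.get(table, [])
    return 0 if not ups else 1 + max(depth(u) for u in ups)

for t in upstreams:
    if t.startswith("ads_"):
        print(t, depth(t))  # unusually deep ADS tables point to layering problems
```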

As for monitoring business complexity, there are likewise four strategies: total link length, total code volume, total cost estimation, and project management. The precondition is to sort out the core ADS product export tables, that is, which ADS tables each product module or interface corresponds to.

Total link length: compute the full path length from ODS to ADS for a product's export table. The longer the link, the more storage and compute resources it occupies.

Total code volume: compute the total amount of code involved from ODS to ADS for a product's export table. The more code, the more compute resources it consumes.

Total cost estimation: based on the data volume stored along the link and the machine resources consumed, estimate the data cost a product incurs (a minimal sketch follows).
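A minimal sketch of estimating the daily cost of one product export table by walking its upstream link and pricing storage and compute; all stats and unit prices are invented for illustration, and in practice shared upstream tables would need to be apportioned across the products that use them.

```python
# Illustrative unit prices (assumptions, not real billing figures).
STORAGE_PRICE_PER_GB_DAY = 0.02
COMPUTE_PRICE_PER_CU_HOUR = 0.50

# Hypothetical per-table stats: storage in GB, daily compute in CU-hours.
table_stats = {
    "ods_trd_order":     {"storage_gb": 800, "cu_hours": 6.0},
    "dwd_trd_order_di":  {"storage_gb": 300, "cu_hours": 4.0},
    "dws_trd_order_1d":  {"storage_gb": 50,  "cu_hours": 2.0},
    "ads_mkt_gmv_daily": {"storage_gb": 1,   "cu_hours": 0.5},
}

upstreams = {
    "ads_mkt_gmv_daily": ["dws_trd_order_1d"],
    "dws_trd_order_1d":  ["dwd_trd_order_di"],
    "dwd_trd_order_di":  ["ods_trd_order"],
}

def upstream_closure(table, upstreams):
    """All tables on the link from the export table back to ODS, including itself."""
    seen, stack = set(), [table]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(upstreams.get(t, []))
    return seen

def daily_cost(export_table):
    tables = upstream_closure(export_table, upstreams)
    storage = sum(table_stats[t]["storage_gb"] for t in tables)
    compute = sum(table_stats[t]["cu_hours"] for t in tables)
    return storage * STORAGE_PRICE_PER_GB_DAY + compute * COMPUTE_PRICE_PER_CU_HOUR

print(round(daily_cost("ads_mkt_gmv_daily"), 2))  # estimated cost of this product link per day
```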

Project management: governing chaotic requirements at their root; this topic is not expanded here.

Of course, as our understanding of the data deepens, we can do more valuable analysis, such as checking whether each SQL statement is written reasonably. In any case, with statistical indicators in hand, we can see the overall picture and govern in a targeted way.

|0xFF Short-, medium- and long-term strategies for data governance

Just as any plan has its "best, middle, and fallback" options, solving this problem also calls for short-, medium- and long-term strategies.

The short-term plan focuses on improving the statistical indicators described above and quickly fixing the low-level problems, because once people have the concept of indicators, the initiative of R&D colleagues can be mobilized.

The medium-term plan is to build out the data architecture system, including a complete system of standards and the technical architecture, and to influence every team member through methodology and culture.

The long-term plan is to use technological innovation to automate task optimization and reduce the workload of data maintenance and governance, for example the recently popular "cloud native" concept.

But whatever the strategy, it has to account for historical debt and consider how to stop accumulating new debt.

A perfect solution usually does not exist, and compromise is what most people end up choosing. When technology cannot solve the problem, it is worth trying a different line of thinking.

Of course, data asset governance in the broad sense extends to more areas, such as data security and data silos, each of which needs a systematic theory of its own.

The last point touches on career choice. Improving corporate efficiency comes down to two things: reducing cost and improving efficiency. Efficiency gains can be tackled from the data analysis side, while cost reduction has to be driven through data asset governance. When choosing a career, if all you have is proficiency with tools, you are easy to replace; if you have mastered the methodology of cost reduction and efficiency improvement, you will face the midlife crisis with far more composure.

Source: blog.csdn.net/gaixiaoyang123/article/details/112634786