Is the data catalog dead? Why it's time to rethink metadata management and data governance

Full text: 4,233 words. Estimated reading time: 11 minutes.


As companies increasingly use data to power digital products, decision-making, and innovation, understanding the state and reliability of these critical assets is essential. For decades, companies have relied on data catalogs to support data governance. But is that enough?

Debashis Saha, vice president of engineering at AppZen, and Barr Moses, CEO and co-founder of Monte Carlo, discuss why data catalogs fall short of the needs of the modern data stack, and why a new approach to metadata management, data discovery, is urgently needed.

It's no secret: knowing where your data lives and who has access to it is essential to understanding its impact on the business. In fact, the key to building a successful data platform is organizing and centralizing data while keeping it easy to find.

Much like a physical library catalog, a data catalog serves as an inventory of metadata and gives users the information they need to assess the accessibility, health, and location of data. In the era of self-service business intelligence, data catalogs have also become powerful tools for data management and data governance.

No wonder, then, that for most data leaders one of the first tasks is to establish a data catalog. At a minimum, the data catalog should answer:

· Where can I find the data?

· Is this data important?

· What does this data represent?

· Is this data relevant and important?

· How can I use this data?

However, as data operations mature and data pipelines grow more complex, traditional data catalogs often cannot keep up with these requirements. As a result, some of the best data engineering teams are rethinking their approach to metadata management. What are they doing differently?

Where data catalogs fall short

Although data catalogs can document data, they largely fail to solve the fundamental problem: letting users "discover" data and gather meaningful, real-time information about its condition. Data catalogs cannot keep up with this new reality for three main reasons: a lack of automation, an inability to scale with the growth and diversity of the data stack, and their undistributed format.

Growing demand for automation

Traditional data catalogs and governance approaches usually rely on the data team to do the heavy lifting of manual data entry, and to keep the catalog updated as data assets evolve. This is not only time-intensive but also demands a great deal of manual work that could have been automated.

For data practitioners, understanding the state of their data is an uphill battle, which underscores the need for a higher degree of automation tailored to their stack.
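
To make the contrast concrete, here is a minimal sketch of what automated metadata collection can look like: instead of filling in catalog entries by hand, a scheduled script pulls table names, owners, comments, and last-modified times from the warehouse's information_schema. This is only an illustration: the CatalogRecord structure is invented for the example, and column names such as last_altered assume a Snowflake-style information_schema and differ across warehouses.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class CatalogRecord:
    """One auto-generated catalog entry; the fields are illustrative."""
    schema: str
    table: str
    owner: Optional[str]
    comment: Optional[str]
    last_altered: Optional[datetime]


def harvest_metadata(cursor) -> List[CatalogRecord]:
    """Pull basic table metadata from a warehouse's information_schema.

    Assumes a Snowflake-style information_schema; column names vary
    across warehouses (BigQuery, Redshift, etc.), so adjust the query.
    """
    cursor.execute(
        """
        SELECT table_schema, table_name, table_owner, comment, last_altered
        FROM information_schema.tables
        WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
        """
    )
    return [CatalogRecord(*row) for row in cursor.fetchall()]


# Usage (hypothetical connection object):
#   records = harvest_metadata(conn.cursor())
#   for r in records:
#       print(f"{r.schema}.{r.table} owned by {r.owner}, last altered {r.last_altered}")
```

Run on a schedule, a script like this keeps the inventory current without anyone typing entries by hand, which is exactly the gap manual catalogs leave open.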

Perhaps this scene sounds familiar: before a stakeholder meeting, do you find yourself frantically searching Slack channels to figure out which data set feeds the specific report or model in question, and why last week's data never arrived? And to solve it, did you and your team cram into a room and start mapping every upstream and downstream connection on a whiteboard, just to get one key report out the door?

We'll spare you the gory details, but it probably looked something like this:

Does your data lineage look like a tangle of lines and arrows? You wouldn't be the first. | Source: Shutterstock

If so, many people feel the same way; you are not alone. Many companies facing this dependency puzzle have embarked on multi-year journeys to manually map all of their data assets. Some invest resources in short-term hacks, or even build internal tools, so they can search and explore their own data.

Even when the end goal is reached, the effort places a heavy burden on the data organization, costing the data engineering team time and money that could have been spent elsewhere, such as on product development or actually putting the data to use.
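
Those internal tools often amount to the whiteboard exercise written down in code: a hand-maintained dependency map plus a traversal that answers "what feeds this report?". A toy sketch below, with made-up table and dashboard names, shows how little such a tool does and why keeping it current by hand becomes a full-time job.

```python
from typing import Dict, Optional, Set

# Hand-curated dependency map: each asset -> the assets it reads from.
# All table and dashboard names here are hypothetical.
UPSTREAM: Dict[str, Set[str]] = {
    "revenue_dashboard": {"fct_revenue"},
    "fct_revenue": {"stg_salesforce_opps", "stg_invoices"},
    "stg_salesforce_opps": {"raw_salesforce"},
    "stg_invoices": {"raw_billing"},
}


def all_upstream(asset: str, visited: Optional[Set[str]] = None) -> Set[str]:
    """Return every asset that feeds `asset`, directly or transitively."""
    visited = visited or set()
    for parent in UPSTREAM.get(asset, set()):
        if parent not in visited:
            visited.add(parent)
            all_upstream(parent, visited)
    return visited


if __name__ == "__main__":
    # If the dashboard looks wrong, these are the tables to check first.
    print(sorted(all_upstream("revenue_dashboard")))
    # -> ['fct_revenue', 'raw_billing', 'raw_salesforce', 'stg_invoices', 'stg_salesforce_opps']
```

The logic is trivial; the cost is in maintaining the UPSTREAM map by hand as pipelines change, which is exactly the work teams end up wanting to automate.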

The inability to scale as data changes

Data catalogs work well when data is structured, but in 2020 that is not always the case. As machine-generated data grows and companies invest in machine learning initiatives, unstructured data is becoming more and more common, accounting for over 90 percent of all newly generated data.

Unstructured data is usually stored in a data lake, has no predefined model, and must be transformed multiple times before it can be used. It is also highly dynamic: its shape, source, and meaning change constantly across stages of processing, including transformation, modeling, and aggregation. All of this work (transformation, modeling, aggregation, and visualization) makes it difficult to catalog the data in its "ideal" state.

Moreover, beyond simply describing which data users access and use, there is a growing need to understand data according to its intent and purpose. The way data producers describe a data asset can differ completely from how data consumers understand its function, and even different data consumers may interpret the same data very differently.

For example, a data set extracted from Salesforce means something entirely different to a data engineer than it does to the sales team. The engineer knows exactly what "DW_7_V3" refers to, while the sales team racks their brains trying to work out whether that data set has anything to do with the "2021 revenue forecast" dashboard in Salesforce. And the list goes on.

Static descriptions of data are inherently limited. In 2021 and beyond, we must accept and adapt to these new, evolving dynamics to truly understand our data.

The data is distributed, but the catalog is not

Although distributed modern data architectures and semi-structured and unstructured data have become the norm, most data catalogs still treat data as a one-dimensional entity. As data is aggregated and transformed, it flows through different elements of the data stack, making it nearly impossible to document.

 

Traditional data catalogs manage metadata (data about data) as it was at ingestion, but data is constantly changing, making it difficult to understand its state as it evolves through the pipeline. | Source: Barr Moses

Today, data increasingly tends to be self-describing: the data and the metadata describing its format and meaning travel together in a single package.
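
As a rough illustration of what self-describing data looks like in practice, columnar formats such as Parquet let you ship the schema and arbitrary descriptive metadata inside the same file as the data. The sketch below assumes pyarrow is installed; the metadata keys (owner, description, source_system) are invented for the example and are not part of any standard.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The data and its description travel together in one file.
table = pa.table({"account_id": [101, 102], "arr_usd": [12000.0, 8500.0]})
table = table.replace_schema_metadata({
    b"owner": b"sales-ops",                       # illustrative keys, not a standard
    b"description": b"Quarterly ARR by account",
    b"source_system": b"salesforce",
})
pq.write_table(table, "arr_by_account.parquet")

# Any downstream consumer can read the description without a separate lookup.
schema = pq.read_schema("arr_by_account.parquet")
print(schema.metadata[b"description"])  # b'Quarterly ARR by account'
```

The point is not the specific format but the pattern: meaning rides along with the data, rather than living only in a catalog entry somewhere else.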

Because traditional data catalogs are not distributed, it is nearly impossible to use them as a central source of truth about the data. As more and more users (from BI analysts to operations teams) gain access to data, and as the pipelines powering machine learning, operations, and analytics grow more complex, the problem only gets worse.

Today's data catalogs need to federate meaning across domains. Data teams need to understand how these data domains relate to one another and which aspects of the aggregate view matter. They need a centralized way to answer these distributed questions as a whole, in other words, a distributed, federated data catalog.

Investing in the right approach to the data catalog from the start helps build a better data platform, makes it easier for the team to explore data, and keeps important data assets visible so their potential can be fully realized.

Data Catalog 2.0 = Data Discovery

Data catalogs work well when data has a rigid model, but as data pipelines grow more complex and unstructured data becomes the norm, our understanding of that data (what it is used for, who uses it, how it is used, and so on) fails to reflect reality. We believe the next generation of data catalogs will be able to learn, understand, and infer from the data, enabling users to leverage its insights in a self-service manner. But how?

 

Data discovery can replace today's data catalogs by providing distributed, real-time insight into data across different domains, while complying with a centralized set of governance standards. | Source: Barr Moses

In addition to cataloging data, metadata and data management strategies must also incorporate data discovery, a new approach to understanding the health of distributed data assets in real time.

Data discovery draws on the distributed, domain-oriented architecture proposed by Zhamak Dehghani and Thoughtworks' data mesh model. It assumes that different data owners are accountable for their data as products, while also facilitating communication between distributed data across different locations. Once data has been served to and transformed by a given domain, the domain's data owners can use it to meet their operational or analytical needs.

Data discovery replaces the need for a data catalog because it provides domain-specific, dynamic understanding of data based on how it is being ingested, stored, aggregated, and used by consumers. As with data catalogs, governance standards and tooling are federated across domains (allowing for greater accessibility and interoperability), but unlike data catalogs, data discovery surfaces the current state of the data in real time, rather than its ideal or "cataloged" state.

Data discovery can answer these questions about the current state of the data in each domain, not just its ideal state (a brief sketch of how a couple of them can be answered from live metadata follows the list):

· Which data sets are the most current? Which data sets can be deprecated?

· When was this table last updated?

· What does a given field mean in my domain?

· Who has access to this data? When was it last used, and by whom?

· What are the upstream and downstream dependencies of this data?

· Is this data production-quality?

· What data matters to my domain's business needs?

· What assumptions do I make about this data, and are they being met?
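
Here is the promised sketch of how questions like "when was this table last updated?" and "which data sets are stale?" can be answered from live warehouse metadata rather than a static catalog entry. It assumes a Snowflake-style information_schema and a DB-API cursor with %s parameter binding; the staleness threshold and schema name are illustrative.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)  # illustrative freshness threshold


def stale_tables(cursor, schema: str):
    """Return (table, last_altered) pairs not updated within STALE_AFTER.

    Assumes a Snowflake-style information_schema with a LAST_ALTERED
    column; other warehouses expose freshness information differently.
    """
    cursor.execute(
        """
        SELECT table_name, last_altered
        FROM information_schema.tables
        WHERE table_schema = %s
        """,
        (schema,),
    )
    now = datetime.now(timezone.utc)
    return [
        (name, last_altered)
        for name, last_altered in cursor.fetchall()
        if last_altered is not None and now - last_altered > STALE_AFTER
    ]


# Usage (hypothetical):
#   for name, ts in stale_tables(conn.cursor(), "ANALYTICS"):
#       print(f"{name} has not been updated since {ts}")
```

Because the answer is computed from the warehouse's own metadata at query time, it reflects the data as it is now, not as it was described when someone last edited the catalog.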

In other words, the next-generation data catalog, data discovery, will have the following characteristics:

· Self-service discovery and automation

Data teams should be able to use the data catalog easily, without a dedicated support team. Self-service, automation, and workflow orchestration for data tooling remove silos between stages of the data pipeline and make data easier to understand and access. Greater accessibility naturally leads to greater data adoption, which in turn reduces the load on the data engineering team.

· Scalability as data evolves

As companies ingest more and more data and unstructured data becomes the norm, the ability to keep up will be critical to the success of data initiatives. Data discovery uses machine learning to maintain a bird's-eye view of data assets, ensuring that understanding evolves as the data evolves. Data consumers can then make better-informed decisions instead of relying on outdated documentation or, worse, gut instinct.

· Data lineage for distributed discovery

Data discovery relies heavily on automated table-level and field-level lineage to map the upstream and downstream dependencies between data assets. Lineage helps surface the right information at the right time (a core function of data discovery) and draw the connections between data assets, enabling faster troubleshooting when data pipelines break, a problem that becomes more and more common as modern data stacks evolve to handle increasingly complex use cases.
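
One common way to automate table-level lineage is to parse the SQL that builds each model and extract the tables it touches. A minimal sketch, assuming the sqlglot parsing library is available; the query is invented for illustration, and real field-level lineage requires considerably more work than this.

```python
import sqlglot
from sqlglot import exp


def referenced_tables(sql: str) -> set:
    """Return the names of all tables referenced in a SQL statement.

    Note: this includes the INSERT target; separating reads from writes
    (true upstream vs. downstream) takes a little more traversal logic.
    """
    parsed = sqlglot.parse_one(sql)
    return {table.name for table in parsed.find_all(exp.Table)}


# Hypothetical transformation that builds fct_revenue:
query = """
    INSERT INTO fct_revenue
    SELECT o.account_id, SUM(i.amount) AS arr_usd
    FROM stg_salesforce_opps AS o
    JOIN stg_invoices AS i ON i.account_id = o.account_id
    GROUP BY o.account_id
"""

print(referenced_tables(query))
# e.g. {'fct_revenue', 'stg_salesforce_opps', 'stg_invoices'}
```

Run this over every scheduled query in the warehouse and the dependency graph that teams otherwise sketch on whiteboards falls out automatically, and stays current as the SQL changes.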

· Data reliability to keep data at the gold standard, always

In fact, your team has likely already been investing in data discovery in one way or another, whether through manual validation of the data by the team, custom validation rules written by engineers, or simply the cost of decisions made on broken data and silent errors that went unnoticed.

Today, data teams are beginning to use automated approaches to ensure highly trusted data at every stage of the pipeline, from data quality monitoring to more robust end-to-end data observability platforms that monitor data pipelines and alert on issues. Such solutions notify you when data breaks, so the root cause can be identified and resolved quickly and future downtime prevented.
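
In its simplest form, such an automated check is just an assertion about the data that runs on a schedule and alerts someone when it fails. A toy sketch follows; the table name, threshold, and Slack webhook URL are placeholders, and a real observability platform would track many such signals (freshness, volume, schema, distribution) automatically rather than one hand-written rule at a time.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def alert(message: str) -> None:
    """Post an alert to a Slack incoming webhook (URL above is a placeholder)."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)


def check_row_volume(cursor, table: str, min_rows: int) -> None:
    """A toy volume check: alert if today's load is suspiciously small."""
    # Table name comes from internal config here, not user input.
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    (count,) = cursor.fetchone()
    if count < min_rows:
        alert(f"{table} has only {count} rows (expected at least {min_rows})")


# Usage (hypothetical), e.g. from a daily scheduler:
#   check_row_volume(conn.cursor(), "analytics.fct_revenue", min_rows=10_000)
```

The value is less in any single check than in running many of them continuously, so that broken data is caught before a stakeholder opens the dashboard.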

Data discovery gives data teams confidence that their assumptions about the data match reality, enabling dynamic discovery and a high degree of reliability across the data infrastructure, regardless of domain.

 

What's next?

If bad data is worse than no data, then a data catalog without data discovery is worse than no data catalog at all. To make data truly discoverable, it is important that your data is not just "cataloged" but also accurate, clean, and fully observable from ingestion to consumption; in other words, reliable.

A robust approach to data discovery relies on automated, scalable data management suited to the newly distributed nature of data systems. To truly achieve data discovery in your organization, you need to rethink how you approach the data catalog.

Only with a full picture of your data, its state, and how it is used at every stage can you begin to trust it.
