[Translation] DataHub: An analysis of the third-generation metadata management architecture

Original article: DataHub: Popular metadata architectures explained (https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained)

Preface

When I started working at LinkedIn ten years ago, the company was just beginning to experience tremendous growth in the volume, variety, and velocity of data. Over the next few years, my colleagues on the LinkedIn Data Infrastructure team and I developed foundational technologies such as Espresso, Databus, and Kafka to ensure that LinkedIn could survive and thrive in the next wave of growth. A few years later, I became the technical lead on the then fairly small "Data Analytics Infrastructure" team, which was responsible for running and supporting LinkedIn's use of Hadoop and maintaining a hybrid data warehouse that spanned Hadoop and Teradata.

The first thing I noticed was that people often asked around for the "right dataset" to use for their analysis. This made me realize that while we had built highly scalable dedicated data stores, streaming capabilities, and cost-effective batch compute, we were still wasting time just finding the right datasets to analyze.

Data Discovery: One problem, many solutions 

Today, we are living in the golden age of data. When data scientists join a data-driven company, they expect to find a data discovery tool (i.e., a data catalog) that can be used to find out what data sets exist at the company and how they can be used to test new hypotheses and generate new insights. Most data scientists don't really care about how the tool actually works under the hood, as long as it makes them more productive.

In fact, there are many data discovery solutions to choose from: proprietary software available for purchase, open source software backed by specific companies, and software developed in-house. Over the past few years, LinkedIn, Airbnb, Lyft, Spotify, Shopify, Uber, and Facebook have all shared details of their respective data discovery solutions. This begs the question: how do these platforms differ, and which option is best for a company considering adopting one of these tools?

The architecture of your data catalog will impact how well your business can truly extract value from your data. Additionally, data catalogs are sticky and take a long time to integrate and implement within a company. Therefore, it is important to choose your data discovery solution carefully.

In this article, I'll look at the three generations of architectures the industry has produced for data discovery tools so far, and explain where many of the best-known options fit within these three generations. The progression across these three generations also mirrors the evolution of LinkedIn's DataHub architecture, where we have been driving the latest best practices (first open sourced and shared with the world as WhereHows in 2016, then completely rewritten in 2019 and shared with the open source community as DataHub).

Hopefully this article helps you make the best decision when choosing your own data discovery solution.

What is a data catalog?

Before we dive into the different architectures, let's clear up the definitions. One of the simplest definitions of a data catalog I found comes from the Oracle website: "Simply put, a data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance."

Thirty years ago, a data asset might have been a table in an Oracle database. In the modern enterprise, however, we have a dizzying array of assets: tables in your relational database or NoSQL store, streams in your favorite streaming store, features in your AI system, metrics in your metrics platform, dashboards in your favorite visualization tool, and more. A modern data catalog is expected to contain an inventory of all of these kinds of data assets and to enable data workers to use them more effectively in their work.

Why do we need a data catalog?

Before you decide to purchase or adopt a specific data catalog solution or build your own data catalog, you should first ask yourself what you want your data catalog to accomplish for your business. An important question related to this is what types of metadata you want to store in the data catalog, as this directly affects the types of use cases you can enable.

Here are some common use cases and examples of the types of metadata they require: 

  • Search and discovery : data schemas, fields, tags, data usage information
  • Access control : access control groups, users, policies
  • Data lineage : pipeline executions, queries, API logs, API schemas
  • Compliance : taxonomy of data privacy/compliance annotation types
  • Data management : data source configuration, data ingestion configuration, data retention configuration, data purge policies (e.g., for the General Data Protection Regulation (GDPR) "right to be forgotten"), data export policies (e.g., for the GDPR "right of access")
  • AI explainability and reproducibility : feature definitions, model definitions, training run executions, problem statements
  • Data operations : pipeline executions, data partitions processed, data statistics
  • Data quality : data quality rule definitions, rule execution results, data statistics

An interesting phenomenon is that each use case often brings its own special metadata needs, but also needs to be connected to existing metadata brought by other use cases. We'll return to this point when we dive into the different architectures of these data catalogs and their impact on success.

First generation architecture: everything is monolithic

The diagram below depicts the first generation metadata architecture. It is usually a classic monolithic frontend (perhaps a Flask app) connected to a primary store for lookups (usually MySQL/Postgres), a search index to serve search queries (usually Elasticsearch), and, for generation 1.5 of this architecture, maybe a graph index to serve lineage-style graph queries (usually Neo4j) once you hit the limits of relational databases for "recursive queries".

First generation architecture: pull-based ETL via active crawling

Metadata is usually ingested by active crawling: connecting to metadata sources such as your database catalog, the Hive catalog, the Kafka schema registry, or your workflow orchestrator's log files, then writing that metadata to the primary store and adding the portions that need indexing to the search index and the graph index. The crawl is typically a single process (not parallelized) and runs about once a day. During the crawl, the raw metadata is usually transformed into the application's metadata model, because the data is rarely in exactly the form the catalog wants. Typically, this transformation is embedded directly into the crawl job.
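To make the mechanics concrete, here is a minimal sketch of what such a crawl job might look like, assuming a catalog backed by MySQL and Elasticsearch and a Hive metastore reachable through PyHive; the host names, table names, and index names are illustrative, not part of any real product.

```python
# Minimal sketch of a first-generation, pull-based metadata crawler.
# Assumptions: a HiveServer2 endpoint reachable via PyHive, a MySQL primary
# store, and an Elasticsearch index named "datasets" (all illustrative).
import pymysql
from elasticsearch import Elasticsearch
from pyhive import hive


def crawl_hive():
    """Scrape table metadata from Hive and load it into the catalog's stores."""
    hive_cursor = hive.connect(host="hive-server", port=10000).cursor()
    catalog_db = pymysql.connect(host="catalog-mysql", user="catalog",
                                 password="secret", database="catalog")
    search = Elasticsearch(["http://catalog-es:9200"])

    hive_cursor.execute("SHOW TABLES")
    for (table,) in hive_cursor.fetchall():
        hive_cursor.execute(f"DESCRIBE {table}")
        fields = [{"name": row[0], "type": row[1]} for row in hive_cursor.fetchall()]

        # The transformation into the catalog's own model is embedded
        # directly in the crawl job, as described above.
        doc = {"name": table, "platform": "hive", "fields": fields}

        # Upsert into the primary store...
        with catalog_db.cursor() as c:
            c.execute("REPLACE INTO datasets (name, platform, num_fields) "
                      "VALUES (%s, %s, %s)", (table, "hive", len(fields)))
        catalog_db.commit()

        # ...and add the searchable portion to the search index.
        search.index(index="datasets", id=f"hive::{table}", document=doc)


if __name__ == "__main__":
    crawl_hive()  # typically scheduled about once a day
```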

Slightly more advanced versions of this architecture also allow batch jobs (such as Spark jobs) to process metadata at scale, compute relationships, recommendations, etc., and then load this metadata into storage and indexes.
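As a rough illustration of that batch path, the sketch below uses PySpark to derive dataset usage counts from query audit logs so the catalog can rank popular datasets higher; the log location and its schema (a `datasets_read` array per query) are assumptions made for the example.

```python
# Hedged sketch of the batch enrichment path: a Spark job that derives
# dataset usage counts from query audit logs and writes them where the
# catalog's loader can pick them up. Paths and schema are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("metadata-usage-stats").getOrCreate()

# Each audit-log record is assumed to list the datasets a query touched.
query_logs = spark.read.json("hdfs:///logs/query_audit/")

usage = (
    query_logs
    .select(F.explode("datasets_read").alias("dataset"))
    .groupBy("dataset")
    .count()
    .withColumnRenamed("count", "query_count")
)

# The catalog's loader later folds these counts into its search index so
# that frequently used datasets rank higher in discovery results.
usage.write.mode("overwrite").parquet("hdfs:///metadata/dataset_usage/")
```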

Typically, it would take a few engineers about two weeks to build a first prototype of this basic backend architecture and load data into it. Additionally, it would take weeks to build a simple front-end that could display this metadata and support simple searches.

Advantages

  • Few moving parts : With just a primary store, a search index, and a few crawl jobs, you can quickly aggregate metadata and build a useful application that makes data workers more productive. You don't need a lot of infrastructure up and running to prove the value.
  • One team can get a lot done : This architecture works well for a single team that has access to the metadata sources and can build the application that serves them.

Disadvantages

However, this architecture also has some real drawbacks. I'll highlight the two most important ones here.

  • Push vs. pull: It's easy to get started with crawling data sources as a way to collect metadata and aggregate it in one place, but it doesn't take long for these ingestion pipelines to start showing signs of fragility. The crawlers run in an environment separate from the data sources, and their configuration has to be managed by a central team. So one set of problems with these pipelines is operational friction, such as network connectivity (firewall rules) or credential sharing (passwords can change without the central team being notified).
    The other set of problems is more subtle, but also operational in nature. Crawl-based ingestion usually creates batch workloads (how hard and how often do you hammer the data source?) that make the data source's operations team very unhappy. No one likes being woken up in the middle of the night by a melting database or an unresponsive HDFS NameNode, only to find it groaning because a metadata crawler has pushed it to the brink. The first casualty of such operational problems is usually the metadata crawl pipeline, regardless of whether it was actually responsible! Your metadata ingestion pipeline will be paused while the operations team works to bring the system back to health, and it will often be asked to stay paused for an extended period even after the system has recovered. Meanwhile, your metadata gets staler and staler, eroding the trust people place in the catalog. This leads to the second problem.
  • Metadata freshness: Closely related to the push-versus-pull decision is the question of how fresh the data (in this case, metadata) is. Early in my metadata journey, it seemed perfectly fine to crawl the Hive metastore (or an S3 bucket) once a night and populate the catalog. After all, you just want your data scientists to be more productive than before. However, once you start plugging into data creation workflows (e.g., as soon as you create a dataset, you come here to attach data classification tags) or serving operational metadata (e.g., data quality check results for the latest partition), the freshness of the metadata starts to matter much more. If all you have is a crawl-based metadata catalog, you are basically out of luck at that point.

What does this mean to me?

As a reader, you may be wondering: "So, which first-generation metadata systems are out there today?" Amundsen uses this architecture, as did the original version of WhereHows, which we open sourced in 2016. Among internal systems, Spotify's Lexikon, Shopify's Artifact, and Airbnb's Dataportal also use the same architecture.

These systems play an important role in making humans more efficient at using data, but fall short in maintaining high-fidelity data inventories and enabling programmatic use cases for metadata.

Second generation architecture: 3-tier applications with service APIs 

 The diagram below depicts what I consider to be a second generation metadata architecture. The monolithic application is split into a service that sits in front of a metadata storage database. The service provides an application programming interface (API) that allows metadata to be written to the system using a push mechanism. Programs that need to read metadata programmatically can use the API to read metadata. However, all metadata accessed through the API is still stored in a metadata repository, which can be a relational database or a scaled-out key-value store.

Second generation architecture: Services with push APIs 

Advantages

Let’s talk about the benefits of this evolution.

  • Better contracts lead to better results : Providing a schema-based push interface immediately establishes a good contract between metadata producers and the "central metadata team". You still need to convince the producing teams to publish their metadata and accept the dependency, but it is much easier to do so with a well-defined schema.
  • Enable programmatic use cases : With a service API, the central team can finally enable programmatic use cases for metadata. For example, if your data portal application lets datasets and fields be tagged with the field's semantic type (e.g., email address, customer identifier) and stores that information in the metadata system, then your data management infrastructure can start using this metadata to automatically purge data assets for customer IDs that have exercised the right to be forgotten, or to automatically create aliased versions of those datasets for data scientists. In fact, at LinkedIn we use Apache Gobblin to do this work, powered by DataHub's metadata. A hedged sketch of this push-and-read pattern appears after this list.
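The sketch below shows what that push-and-read pattern could look like against a generic second-generation service API; the endpoint paths, payload shape, and hostname are hypothetical rather than any specific product's API.

```python
# Hedged sketch: pushing a field-level semantic-type tag into a
# second-generation catalog's service API, then reading it back
# programmatically. Endpoints and payloads are hypothetical.
import requests

CATALOG_API = "http://metadata-service:8080"


def tag_field(dataset: str, field: str, semantic_type: str) -> None:
    """Producer side: push a tag such as CUSTOMER_ID onto a dataset field."""
    resp = requests.post(
        f"{CATALOG_API}/datasets/{dataset}/fields/{field}/tags",
        json={"semanticType": semantic_type},
        timeout=10,
    )
    resp.raise_for_status()


def fields_with_type(dataset: str, semantic_type: str) -> list:
    """Consumer side: a purge job asks which fields hold customer identifiers."""
    resp = requests.get(
        f"{CATALOG_API}/datasets/{dataset}/fields",
        params={"semanticType": semantic_type},
        timeout=10,
    )
    resp.raise_for_status()
    return [f["name"] for f in resp.json()]


# Example: a right-to-be-forgotten workflow discovers which columns to scrub.
# columns = fields_with_type("events.page_views", "CUSTOMER_ID")
```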

Disadvantages

However, there are still some issues with this architecture that are worth highlighting.

  • No change log: The second-generation architecture provides a microservice API for reading and writing metadata, but it supports neither streaming metadata changes in from external systems nor subscribing to the stream of metadata changes happening inside the data catalog.
    You may be familiar with the popular blog post on why logs should be at the center of your data ecosystem. It turns out the same is true for metadata. A modern data catalog should offer real-time subscription to metadata changes as a first-class capability.
    Without a metadata log, it is hard to reliably bootstrap (re-create) or repair the search and graph indexes when problems arise. Without a real-time metadata change log, it is impossible to build efficient reactive systems on top of the central metadata platform, such as data triggers or access-control abuse detection. To build such a system, you have to access metadata through the key-value API with excessive polling or full scans, or wait for the nightly ETL of the metadata database before a snapshot can finally be processed. We have already been through this painful journey on the data side, so we would love to skip it on the metadata side! Yet modern metadata systems often forget to design for this important capability.
  • Problems with a centralized team: Another big problem with this architecture is that it still relies on a single centralized team to do too many things: own the metadata model; run the central metadata service, storage, and indexes; and support all downstream consumers and their different ways of accessing metadata. This severely limits the ability of the centralized system to support the variety of use cases a company has today (productivity, governance, AI explainability, etc.). Taking LinkedIn as an example, while we were still on the second-generation metadata architecture, the data quality team built a separate user interface and metadata store for storing rules and displaying data quality results on datasets.
    The operational impact of service-based integration is that it tightly couples the availability of producers to the central service, which makes adopters worry about adding another source of downtime to their stack.

“The central metadata team has the same problems that the central data warehouse team has”.

Data engineering itself is evolving toward a different paradigm, with decentralization becoming the norm. A central metadata team should not repeat the mistakes of the past if it wants to keep up with the rapidly evolving complexity of the metadata ecosystem.

What does this mean to me?

Among commercial metadata systems, Collibra and Alation appear to have second-generation architectures. Among open source metadata systems, Marquez has a second-generation metadata architecture.

My experience is that second generation metadata systems often serve as reliable search and discovery portals for a company's data assets, so they do meet the productivity needs of data workers. They can also begin to provide service-based integration into programmatic workflows, such as access control configuration. In fact, we also went through this process when evolving WhereHows from Gen 1 to Gen 2, where we added a push-based architecture and services specifically for storing and retrieving metadata.

However, the centralization bottleneck often leads companies to build or adopt new, independent catalog systems for different use cases, which dilutes the power of a single, consistent metadata graph. Companies that build or adopt a search-and-discovery portal for data scientists sometimes end up installing a different data governance product for the business units, complete with its own metadata backend. To keep dataset definitions and vocabularies in sync, they must then build and monitor new data pipelines to reliably replicate metadata from one catalog to the other across different metadata models. This problem is not limited to large companies; it affects any organization that has reached a certain level of data maturity and has enabled multiple use cases for metadata.

Third Generation Architecture: Event Sourced Metadata

The key insight behind the third-generation metadata architecture is that a "central service"-based metadata solution cannot keep up with the demands of enterprise metadata use cases. Solving this requires two things. First, the metadata itself must be free-flowing, event-based, and subscribable in real time. Second, the metadata model must support constant evolution, with new extensions and additions made available without being blocked by a central team. That way, metadata can always be consumed at scale by many kinds of consumers.

Step 1: Log-oriented metadata architecture

Metadata providers can push metadata to a stream-based API or perform CRUD operations against the catalog's service API, depending on their preference. The resulting metadata mutations in turn generate a metadata change log. This log can be automatically and deterministically materialized into the appropriate stores and indexes (e.g., search index, graph index, data lake, OLAP store) to serve all query patterns. As shown in the figure below, the result is an unbundled metadata database architecture that is ready for the modern enterprise. Because the log is the center of the metadata universe, in the event of any inconsistency you can bootstrap the graph index or search index at will and deterministically repair the error.
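A minimal sketch of this log-oriented flow is shown below, using Kafka as the log; the topic name and message shape are simplified placeholders rather than DataHub's actual wire format.

```python
# Hedged sketch of a log-oriented metadata flow. Assumptions: a Kafka topic
# named "metadata-change-log" and a JSON event shape of
# {"urn": ..., "aspect": ..., "value": {...}} -- both illustrative.
import json

from elasticsearch import Elasticsearch
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def emit_change(entity_urn: str, aspect: str, value: dict) -> None:
    """Producer side: every metadata mutation becomes an event on the log."""
    producer.send("metadata-change-log",
                  {"urn": entity_urn, "aspect": aspect, "value": value})
    producer.flush()


def materialize_search_index() -> None:
    """Consumer side: deterministically (re)build a search index from the log.
    Because the log is the source of truth, replaying it from offset zero
    repairs an inconsistent index."""
    search = Elasticsearch(["http://catalog-es:9200"])
    consumer = KafkaConsumer(
        "metadata-change-log",
        bootstrap_servers="kafka:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    for msg in consumer:
        event = msg.value
        search.index(index="metadata", id=event["urn"],
                     document={event["aspect"]: event["value"]})
```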

Third Generation Architecture: Unbundled Metadata Database

Step 2: Domain-oriented decoupled metadata model

In addition to its "stream-first" architecture, a third-generation catalog supports an extensible, strongly typed metadata model and relationships that the enterprise defines collaboratively. Strong typing is important because without it, we end up storing only a lowest-common-denominator bag of generic properties in the metadata store, and programmatic consumers of the metadata get no guarantee of backward compatibility when processing it.

In the metadata model diagram below, we use the DataHub terms "entity type", "aspect", and "relationship" to describe a graph containing three entity types: datasets, users, and groups. Different teams can attach different aspects, such as ownership and profiles, to these entities and create relationships between these entity types. Note that there are many ways to describe such graph models, ranging from RDF-based models, to full ER models, to custom hybrid approaches like the one used by DataHub.

Metadata model diagram example: types, aspects, relationships

This approach to modeling enables teams to evolve the global metadata model by adding domain-specific extensions without being bottlenecked by a central team. For example, the compliance team might check in the ownership aspects, while the core metadata team might check in the schema aspects of the dataset entities. At the same time, the data ingestion team may design and check in the ReplicationConfig aspects for the dataset entities. All of these additions to the model can be made independently with minimal points of conflict. Of course, we need to manage and agree on core entity types before they can be introduced into the graph.
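As an illustration of how independent teams might extend the model, the sketch below uses plain Python dataclasses in place of DataHub's actual strongly typed model definitions; the aspect names and URN strings are illustrative.

```python
# Sketch of domain-oriented, strongly typed aspects attached to one entity.
# Plain dataclasses stand in for DataHub's real PDL/Avro models; names and
# URNs are illustrative.
from dataclasses import dataclass, field


@dataclass
class Ownership:                  # checked in by the compliance/core team
    owners: list
    ownership_type: str = "DATAOWNER"


@dataclass
class SchemaMetadata:             # checked in by the core metadata team
    fields: dict                  # field name -> type


@dataclass
class ReplicationConfig:          # checked in by the data ingestion team
    replicas: int = 3
    regions: list = field(default_factory=list)


# All three aspects attach to the same dataset entity, so each team can
# evolve its own aspect without coordinating changes to the others' models.
dataset_aspects = {
    "urn:example:dataset:(hive,pageviews,PROD)": [
        Ownership(owners=["urn:example:corpuser:jdoe"]),
        SchemaMetadata(fields={"member_id": "long", "url": "string"}),
        ReplicationConfig(replicas=2, regions=["us-east", "eu-west"]),
    ]
}
```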

Benefits

Through this evolution, customers can interface with the metadata database in different ways according to their own needs. They get stream-based metadata logging (for ingestion and change consumption), low-latency querying of metadata, full-text and sorted search capabilities for metadata attributes, metadata graph queries, and full scanning and analysis capabilities. Different use cases and applications can extend the core metadata model differently without sacrificing consistency or freshness. You can also integrate these metadata with your preferred developer tools (such as git) to write and version the metadata while writing code. Metadata can be refined and enriched through low-latency processing of metadata change logs, or batch processing of compressed metadata logs as tables on a data lake.

The figure below shows the complete implementation of this architecture:

Third generation architecture: end-to-end data flow

Any global enterprise metadata needs, such as global lifecycle management, auditing, or compliance, can be addressed by building workflows that can query these global metadata in streaming or batch form.
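For instance, a reactive governance workflow might look like the following sketch, which watches the metadata change log and flags datasets whose schemas change before any compliance annotation arrives; the topic name, aspect names, and event shape are assumptions for illustration.

```python
# Hedged sketch of a reactive compliance workflow built on the change log.
# Topic name, aspect names, event shape, and the alerting call are assumed.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "metadata-change-log",
    bootstrap_servers="kafka:9092",
    group_id="compliance-watcher",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

annotated = set()  # dataset URNs that already carry a compliance aspect


def alert(urn: str) -> None:
    # Placeholder for a real ticketing or alerting integration.
    print(f"[compliance] {urn} changed schema without compliance annotations")


for msg in consumer:
    event = msg.value  # e.g. {"urn": ..., "aspect": ..., "value": {...}}
    if event["aspect"] == "complianceInfo":
        annotated.add(event["urn"])
    elif event["aspect"] == "schemaMetadata" and event["urn"] not in annotated:
        alert(event["urn"])
```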

"Good metadata architecture" is very similar to "good data architecture"

A classic sign of a good third-generation metadata architecture implementation is that you can always read and act on the latest metadata in its most detailed form without losing consistency. At LinkedIn, when we transitioned from WhereHows (second generation) to DataHub (third generation), we found that we could dramatically increase the trustworthiness of metadata, putting the metadata system at the center of the enterprise. The metadata system is gradually becoming the starting point for data workers to research new hypotheses, discover new metrics, and manage the lifecycle of existing data assets.

Disadvantages

Sophistication often goes hand in hand with complexity. Third-generation metadata systems typically have several moving parts that need to be set up for the entire system to work well. Companies with a handful of data engineers may find themselves intimidated by the amount of work required to bring a simple use case to life, wondering whether the initial investment of time and effort is worth the long-term payoff. However, third-generation metadata systems like DataHub have begun to make significant advances in usability and out-of-the-box experience to ensure that this does not happen.

What does this mean to me?

Of all the systems we investigated, only Apache Atlas, Egeria, Uber Databook, and DataHub use a third-generation metadata architecture. Among them, Apache Atlas is closely integrated with the Hadoop ecosystem. Some companies are trying to put Amundsen on top of Atlas to get the best of both worlds, but this integration appears to have some challenges. For example, you must ingest metadata into Atlas's graph and search indexes, completely bypassing Amundsen's data ingestion, storage, and indexing modules. This means that any new concept you want to model must be introduced as an Atlas concept and then wired into Amundsen's user interface, which leads to considerable complexity. Egeria supports integrating different catalogs via a metadata event bus, but so far the functionality seems incomplete. Uber Databook appears to be based on design principles very similar to DataHub's, but it is not open source. Of course, our opinion is biased by our personal experience with DataHub, but open source DataHub offers all the advantages of a third-generation metadata system: support for multiple types of entities and relationships, and a stream-first architecture.

At LinkedIn, the DataHub deployment covers datasets, schemas, streams, compliance annotations, GraphQL endpoints, metrics, dashboards, features, and AI models, making it a truly third-generation metadata system in both scale and readiness. It routinely processes more than ten million entity and relationship change events per day, indexes more than five million entities and relationships in total, and serves operational metadata queries with low-millisecond SLAs, enabling data productivity, compliance, and governance workflows for all our employees.

Below is a simple visual representation of the state of metadata today.

Of course, this is just a snapshot of the different systems out there. As enterprises' needs for metadata grow, Generation 3 systems are likely to be further consolidated and updated.

Is a good architecture enough?

With the third generation architecture implemented by DataHub, we appear to have obtained a good metadata architecture that is scalable and works well for our multiple use cases. Are there other issues in this area that need to be addressed? The short answer is "yes". Third generation metadata architecture ensures you can integrate, store and process metadata in the most scalable and flexible way. But this is not enough.

“Making metadata work is harder than putting it together.”

First, you need to define the right metadata model that truly captures the concepts that are meaningful to your business. Then, you need an AI-powered path to transition from this complete, reliable inventory of data assets to a trusted metadata knowledge graph. This will allow you to truly unlock the productivity and governance capabilities of your business. But that’s for another blog post! 
