Selection and comparison of open source data asset (metadata) management platforms

0. Foreword

The data industry's buzzwords have moved from big data platform to data governance to data middle platform to digital transformation (the modern data stack). Yet the building blocks behind all of them, namely data asset management platforms, metadata management platforms, and data catalog platforms, are still in the Slope of Enlightenment phase of the Gartner hype cycle. Many platforms are flourishing, but no open source platform or commercial product has come to dominate the field. For an enterprise pursuing digital transformation and data governance, choosing a data asset management, metadata management, or data catalog platform is still a real test of judgment.

1. Atlas

Open source address: https://github.com/apache/atlas

Atlas was first developed by Hortonworks, one of the three major big data platform vendors (Cloudera, Hortonworks, MapR), to manage metadata in Hadoop projects, and was later designed as a data governance framework. It gives Hadoop clusters centralized metadata governance capabilities, including data classification, a policy engine, data lineage, security, and lifecycle management.

It was later donated to the Apache community for incubation, and companies such as Aetna, Merck, Target, SAS, and IBM supported its development. Because it scales horizontally, integrates well, and is open source, most vendors in China either use Atlas directly or build custom products on top of it. At present, Cloudera and Hortonworks have both been acquired, and MapR ships few new products.

The big data landscape has changed enormously since 2016, when the Hadoop platform was in the limelight. The Hadoop ecosystem is gradually moving out of center stage, while MPP databases, the modern data stack, and cloud-native databases such as ClickHouse, Doris, StarRocks, Databend, Materialize, and RisingWave are taking its place.

Advantages of Atlas:

  • Open sourced by a major vendor, deeply integrated with Hive in the Hadoop ecosystem, and supports table-level and column-level lineage

  • Natively integrated with HDP; works with Ranger for row-level and column-level data access control; installation is easy

  • Powerful metadata type model that supports custom and extended metadata

  • The source code is not complicated, and many platforms in China have been customized from Atlas into commercial products
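
As a rough illustration of the extensible type model mentioned above, the sketch below builds the JSON body for a custom entity type. It assumes the Atlas v2 REST API, where type definitions are submitted to /api/atlas/v2/types/typedefs; the type name clickhouse_table and its attributes are hypothetical, and the field set should be checked against the Atlas version in use. The code only builds and prints the payload; it does not contact a server.

```python
import json


def build_entity_typedef(name, super_type, attributes):
    """Build a typedef payload that adds one custom entity type.

    `attributes` maps attribute name -> Atlas type name (e.g. "string").
    """
    attribute_defs = [
        {
            "name": attr_name,
            "typeName": type_name,
            "isOptional": True,
            "cardinality": "SINGLE",
            "isUnique": False,
            "isIndexable": False,
        }
        for attr_name, type_name in attributes.items()
    ]
    return {
        "entityDefs": [
            {
                "name": name,
                "superTypes": [super_type],
                "attributeDefs": attribute_defs,
            }
        ]
    }


# Hypothetical custom type: a ClickHouse table with two extra attributes.
payload = build_entity_typedef(
    name="clickhouse_table",
    super_type="DataSet",
    attributes={"engine": "string", "retention_days": "int"},
)
print(json.dumps(payload, indent=2))
```

Extending the built-in DataSet type is what would let such a custom entity take part in lineage alongside the Hive types that Atlas ships with.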

Disadvantages of Atlas:

  • Its strengths are also its weaknesses: the company that open sourced it has been acquired, and its long history is now a burden rather than an advantage

  • The Hadoop ecosystem is in decline, and excellent support for Hive and Hadoop alone no longer meets today's fast-changing technical requirements

  • The interface is complicated and dated, and the data catalog and data search are not convenient enough

  • The product is complex to use and focuses on the problems of technical staff rather than end users of data, such as business staff

  • The ecosystem is losing momentum while newer, similar platforms keep emerging

Related introduction: https://mp.weixin.qq.com/s/MvaxSF74NE0E43i4rQEb3g

Selection suggestion:

1) If you only run the Hadoop ecosystem, Atlas is worth trying.

2) If your data asset platform is aimed at the technical staff of the data team, Atlas is worth trying.

2. DataHub

Open source address: https://github.com/datahub-project/datahub (7.2K stars)

DataHub was open sourced by LinkedIn. Its official slogan is "The Metadata Platform for the Modern Data Stack". It aims to solve metadata management across a wide variety of data ecosystems, providing metadata search, data discovery, data monitoring, and data governance capabilities to help tame the complexity of data management.

DataHub is released under the Apache License 2.0. It uses a push-based metadata collection architecture (pull is also supported), so it can continuously capture metadata as it changes. The current version already integrates with most popular data systems, including but not limited to Kafka, Airflow, MySQL, SQL Server, Postgres, LDAP, Snowflake, Hive, and BigQuery.
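
The difference between push and pull collection can be sketched with a toy model. This is not DataHub's API; every class and function name below is hypothetical and exists only to show why a push-fed catalog stays current while a pull-fed one lags until its next scan.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Toy in-memory catalog keyed by dataset name."""
    records: dict = field(default_factory=dict)

    def upsert(self, dataset, schema):
        self.records[dataset] = schema


@dataclass
class SourceSystem:
    """Toy source that owns table schemas and can notify subscribers."""
    tables: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def alter_table(self, name, schema):
        self.tables[name] = schema
        # Push: the source emits the change as soon as it happens.
        for callback in self.subscribers:
            callback(name, schema)


def pull_scan(source, store):
    """Pull: the catalog crawls the source, typically on a schedule."""
    for name, schema in source.tables.items():
        store.upsert(name, schema)


source = SourceSystem()
pushed, pulled = MetadataStore(), MetadataStore()
source.subscribers.append(pushed.upsert)

source.alter_table("orders", ["id", "amount"])
pull_scan(source, pulled)  # the scheduled crawl runs once here
source.alter_table("orders", ["id", "amount", "currency"])

print(pushed.records["orders"])  # already reflects the new column
print(pulled.records["orders"])  # stale until the next pull_scan
```

In this toy model the pull path stays simple and needs no cooperation from the source, which is one reason a platform might offer both modes.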

Advantages of DataHub:

  • Open sourced by a well-known company, with the same origin as Kafka (LinkedIn). The community is active, development is moving fast, and releases are frequent.

  • The positioning is clear and ambitious. The slogan shows the team's ambition and long-term investment, and the steady stream of releases backs it up.

  • The underlying architecture is flexible and modern, designed from the start for extension and integration, and supports both push and pull modes. For details, see: https://datahubproject.io/docs/architecture/architecture/

  • The UI is simple and easy to use, friendly to both technical and business staff

  • Rich APIs and comprehensive features

Disadvantages of DataHub:

  • The front end does not support internationalization, and the way the interface is structured and used does not match Chinese users' habits

  • Releases come quickly, which makes upgrading difficult once it is in use

  • Several features are still under construction, such as Hive column-level lineage

  • Some features need performance optimization, such as SQL profiling

  • Chinese-language materials and Chinese-language communities are scarce

Related introduction:

https://mp.weixin.qq.com/s/74gK3hTt7-j1lTbKFagbTQ

https://mp.weixin.qq.com/s/iP6sc2DzPaeAKpSWNmf8hQ

Selection suggestion:

1) If you can dedicate at least half a front-end developer plus a back-end developer;

2) If you need a data asset management platform with a better user experience;

3) If you need to extend metadata support to a variety of platforms and systems.

In these cases, put DataHub at the top of your list. Despite the shortcomings listed above, DataHub is currently the best choice among open source products. The author also runs it in production and is happy to discuss any questions.

Commercial version: Metaphor (https://metaphor.io/) is the SaaS version of DataHub.

3. Marquez

Open source address: https://github.com/MarquezProject/marquez (1.3K stars)

Advantages of Marquez:

  • The interface is attractive, and interaction details are well designed

  • Simple deployment and concise code

  • Built on the OpenLineage standard, so the architecture is sound
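
For readers unfamiliar with OpenLineage, the sketch below builds a minimal run event of the kind Marquez consumes. It assumes the core OpenLineage run event fields (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL) and that Marquez accepts such events over HTTP at /api/v1/lineage; the namespace, job name, dataset names, producer URL, and the spec version in schemaURL are illustrative assumptions. Nothing is sent over the network.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="example"):
    """Build a minimal OpenLineage-style run event as a plain dict."""

    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Hypothetical producer identifier for this sketch.
        "producer": "https://example.com/hypothetical-producer",
        # Assumed spec version; use the one your tooling targets.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


event = build_run_event(
    "COMPLETE", "daily_orders", inputs=["raw.orders"], outputs=["mart.orders"]
)
print(json.dumps(event, indent=2))
```

Because the event names both input and output datasets for a job run, a back end that receives it has what it needs to draw one edge of the lineage graph.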

Disadvantages of Marquez:

  • Focuses on visualizing data assets and lineage; some data asset management features require additional development work

Related introduction: https://mp.weixin.qq.com/s/OMm6QEk9-1bFdYKuimdxCw

Selection suggestion:

1) If you already have a powerful metadata and data asset management back end and only need visualization and lineage display for data assets, Marquez is worth considering.

2) The interface is relatively good: it supports highlighting the dependency path of a selected node and hiding branch dependencies. Full data asset management and metadata collection, however, would still require a lot of work.
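
The highlighting behavior described above amounts to a graph traversal: keep the selected node plus everything upstream and downstream of it, and hide the rest. A minimal sketch follows, using a hypothetical lineage graph; it illustrates the idea and is not Marquez's implementation.

```python
from collections import deque


def reachable(edges, start):
    """Return every node reachable from `start` by following `edges`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def highlight(downstream, selected):
    """Nodes to keep visible: the selection, its ancestors, its descendants."""
    # Invert the graph so the same traversal can walk upstream.
    upstream = {}
    for src, targets in downstream.items():
        for target in targets:
            upstream.setdefault(target, []).append(src)
    return {selected} | reachable(downstream, selected) | reachable(upstream, selected)


# Hypothetical lineage: dataset -> datasets derived from it.
lineage = {
    "raw.orders": ["stg.orders"],
    "raw.users": ["stg.users"],
    "stg.orders": ["mart.revenue"],
    "stg.users": ["mart.user_profile"],
}
print(sorted(highlight(lineage, "stg.orders")))
```

Everything outside the returned set (here the raw.users branch) is a candidate for hiding.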

Commercial version: Datakin (https://datakin.com/) is the SaaS version of Marquez. It supports Apache Hive, Amazon RDS, Teradata, Amazon Redshift, Amazon S3, and Cassandra.

4. Amundsen

Open source address: https://github.com/amundsen-io/amundsen (3.8K stars)

Amundsen is an open source metadata management and data discovery platform from Lyft. It is feature-complete, with a fairly complete front end, back end, and data ingestion framework.

Advantages of Amundsen:

  • Open sourced by Lyft, with an active community and frequent releases

  • The positioning is clear: like DataHub, it aims to be the data catalog of the modern data stack

  • Integrates with a wide range of data platforms and tools

Disadvantages of Amundsen:

  • The UI is unremarkable, and it is not convenient enough to operate

  • Chinese-language documentation is scarce

  • Lineage, tags, glossary, and similar features are less convenient than in DataHub

  • Many of the components it supports best are not widely used in China

Related introduction:

https://mp.weixin.qq.com/s/yGZ1RJs2seu943sswxYYzw

https://mp.weixin.qq.com/s/5w6euvUWzm5RWXgisB-rMg

https://mp.weixin.qq.com/s/iVocnMV8zuQN-jcID83nSg

Selection suggestion:

1) If you have people with time to tinker, choose DataHub. If you do not, Amundsen alone will already demand plenty of tinkering.

Commercial version: Stemma (https://www.stemma.ai/) is the SaaS version of Amundsen.

5. Open Data Discovery

Open source address: https://github.com/opendatadiscovery/odd-platform (692 stars)

Open Data Discovery (ODD) is an open source data discovery and observability platform. It aims to help data-driven businesses democratize their data by making it easier to discover, manage, observe, trust, and secure. Because ODD supports open data standards, data teams can exchange data more efficiently between data tools.

To be honest, the platform's UI really is attractive. Its ingestion is specification-based. The platform is still a work in progress, however, so some features are still being developed.

Advantages of Open Data Discovery:

  • Provides an online demo environment, which helps with promotion and attracts newcomers

  • The UI is attractive, and its interaction logic matches Chinese users' habits

  • The project is young and can learn from many existing data asset projects

  • Integrated data quality module

  • Some of DataHub's best features are already on the roadmap

  • Supports open data standards, although these currently see little practical use in China

  • Provides an interface for alerts from scheduled workflows

  • A new design concept built around data observability

  • Treats ML as a first-class citizen, a bet on the future development of AI

Disadvantages of Open Data Discovery:

  • The project is at an early stage and the community is not very active

  • Many of its features overlap with DataHub's

  • Chinese-language materials are very scarce

  • Product positioning is unclear

Related introduction: https://demo.oddp.io/ Seeing it beats hearing about it, and trying it beats seeing it.

Selection suggestion: The project is at an early stage and has not yet gained traction in China. Those who enjoy trying new things and do not mind tinkering can follow and study it. To deploy it in production, be prepared to handle front-end and back-end issues and to dig deep into the source code.

6. Open Metadata

Open source address: https://github.com/open-metadata/OpenMetadata (1.9K stars)

OpenMetadata is an open standard for metadata that provides the foundation for end-to-end metadata management. It provides all the components needed for data discovery, data governance, data collaboration, data quality, and observability.

Like Open Data Discovery, its UI is attractive, and its interaction logic also matches the habits of business users.

Advantages of Open Metadata:

  • Provides an online demo environment, which helps with promotion and attracts newcomers

  • The UI is attractive, and its interaction logic matches Chinese users' habits

  • The project is young and can learn from many existing data asset projects

  • Integrated data quality module

  • Supports open data standards, although these currently see little practical use in China

  • A new design concept built around data observability

Disadvantages of Open Metadata:

  • The project is at an early stage, and few Chinese developers are involved

  • Not very different from Open Data Discovery

  • The product is still under rapid development

  • Chinese-language materials are very scarce

Related introduction: https://sandbox.open-metadata.org/ Seeing it beats hearing about it, and trying it beats seeing it.

Selection suggestion: The project is at an early stage and has not yet gained traction in China. Those who enjoy trying new things and do not mind tinkering can follow and study it. To deploy it in production, be prepared to handle front-end and back-end issues and to dig deep into the source code.

Commercial version: Collate (https://www.getcollate.io/) is the SaaS version of Open Metadata.

7. Magda

Open source address: https://github.com/magda-io/magda (408 stars)

Magda is a data catalog system that provides data cataloging, enrichment, search, tracking, and ranking. It supports internal and external data sources, handles both big and small data, and can serve data assets externally through files, databases, or APIs.

Target users: data professionals such as data analysts, data scientists, and data engineers.

Value proposition: give data professionals supporting features such as historical data version management and duplicate data detection, improving the efficiency and quality of data query and management.

Advantages of Magda:

  • Lightweight and simple data catalog platform

  • Supports data preview

  • Focused functionality, deployed independently

  • Simple, clean interface

  • Supports map data

Disadvantages of Magda:

  • Narrow in scope: like CKAN below, it is positioned for data cataloging, display, and sharing

  • Performance problems when transferring massive amounts of data

  • Does not support modern big data synchronization and integration

  • Relatively limited feature set

Related introduction: https://demo.dev.magda.io/ Hearing about it a hundred times is not as good as seeing it once, and seeing it a hundred times is not as good as trying it once.

Selection suggestion: Today's data middle platforms and data asset platforms already include a similar data portal, so Magda's functions tend to be absorbed into them, and few enterprise scenarios call for using it on its own.

8. CKAN

Open source address: https://github.com/ckan/ckan (3.7K stars)

CKAN is the world's leading open source data portal platform, a tool for building open data websites. CKAN makes it easy to publish, share, and process data. It is a data management system that provides a powerful platform for cataloging, storing, and accessing datasets, with a rich front end, a full API (for both data and catalog), visualization tools, and more.

The description above is a direct Baidu translation of the description on CKAN's GitHub home page. In plain terms, CKAN is a tool for publishing personal or corporate datasets through a website, where others can browse, search, preview, catalog, and download them. CKAN is well suited to open data published by national and local governments, research institutes, schools, and other organizations.
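
Browsing and searching in CKAN is also available programmatically through its Action API. The sketch below builds a dataset search request, assuming the standard /api/3/action/package_search endpoint and a hypothetical site at ckan.example.org; it constructs the URL without making a request.

```python
from urllib.parse import urlencode


def package_search_url(base_url, query, rows=10):
    """Build a CKAN Action API URL that searches datasets by keyword."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


# Hypothetical portal address; replace with a real CKAN site to try it.
url = package_search_url("https://ckan.example.org/", "air quality", rows=5)
print(url)
```

On a live site, fetching this URL should return a JSON document whose result field holds the matching datasets.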

Advantages of CKAN:

  • Written mainly in Python, so getting started should not be a problem

  • Has a long history and is used by many governments and research organizations to publish public data

  • Simple to use, deployed independently

  • Focused functionality: cataloging, development, preview, and download of small and medium-scale data

Disadvantages of CKAN:

  • Focuses on the data portal: cataloging and organizing data, and providing data preview and download

  • Performance problems when transferring massive amounts of data

  • Does not support modern big data synchronization and integration

  • Relatively limited feature set

Related introduction: https://blog.csdn.net/iCloudEnd/article/details/125676123

Selection suggestion: Today's data middle platforms and data asset platforms already include a similar data portal, so CKAN's functions tend to be absorbed into them, and few enterprise scenarios call for using it on its own. Governments, schools, and similar institutions, by contrast, have many applicable scenarios.

Summary

Data governance and data asset management are foundational infrastructure for enterprise digital transformation. They matter a great deal, yet their effect and value are hard to demonstrate. If higher-level issues such as data strategy, data architecture, data processes, and data standards have not been resolved at the organizational level, then however well the data asset platform is planned and implemented, its impact will be a drop in the bucket.

References:

1. WeChat public account (Big Data Flow) - "Analysis on Selection of 12 Open Source Data Asset (Metadata) Management Platforms (1)"

2. WeChat public account (Big Data and Digital Transformation) - "Analysis on Selection of 12 Open Source Data Asset (Metadata) Management Platforms (2)"

Origin blog.csdn.net/u011487470/article/details/128897051