Selection analysis of 12 open source data asset (metadata) management platforms (1)

Two years ago, in an article presenting the most complete mind map of big data open source components, I compiled a mind map of the open source technology components in the big data ecosystem. It has been downloaded about 4K times so far.

The buzzwords of the data industry have shifted from big data platform -> data governance -> data middle platform -> digital transformation (modern data stack). Yet the technical solutions that underpin all of them, namely data asset management platforms, metadata management platforms and data catalog platforms, are still on the Slope of Enlightenment of the Gartner Hype Cycle. Many platforms are flourishing, but no open source platform or commercial product has come to dominate the field. When driving enterprise digital transformation and data governance, choosing a data asset management, metadata management or data catalog platform is still a task that tests one's judgment.

Three articles are planned to introduce 12 excellent open source data asset/metadata management platforms in detail. The third article will use a two-dimensional selection table to compare the functional characteristics of all 12 platforms.

This article covers four products, Apache Atlas, Datahub, Marquez and Amundsen, and briefly analyzes their advantages and disadvantages for reference:

Apache Atlas

Open source address: https://github.com/apache/atlas (1.5K stars)

Atlas was first developed by Hortonworks, one of the three major Hadoop platform vendors (Cloudera, Hortonworks, MapR), to manage metadata in Hadoop projects, and was later designed as a data governance framework. It provides Hadoop clusters with core metadata governance capabilities, including data classification, a centralized policy engine, data lineage, security and lifecycle management. It was later donated to the Apache community for incubation and has been developed with support from Aetna, Merck, Target, SAS, IBM and other companies. Because it scales horizontally to large volumes, integrates well and is open source, most vendors in China either use Atlas directly or build secondary developments on top of it.
Today Cloudera and Hortonworks have both been acquired, and MapR releases few new products. Compared with 2016, when the Hadoop platform was in the limelight, the big data field has changed greatly, and Hadoop is gradually moving away from center stage. MPP databases, the modern data stack and cloud-native databases have taken its place, for example ClickHouse, Doris, StarRocks, Databend, Materialize and RisingWave.

Advantages of Atlas:

  • Open sourced by a major vendor, deeply integrated with Hive in the Hadoop ecosystem, and supports table-level and column-level lineage
  • Natively integrated with HDP; it connects to Ranger for row-level and column-level data access control, and installation is easy and trouble-free
  • Powerful metadata metamodel that supports custom metadata types and extension
  • The source code is not complicated, and many platforms in China are commercial products customized from Atlas
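
To make the difference between table-level and column-level lineage concrete, here is a minimal sketch in plain Python. It does not use the Atlas API; the table names, column names and helper functions are all hypothetical and exist only for illustration. Column-level edges are the finer-grained record, and table-level lineage can be derived by collapsing them.

```python
from collections import defaultdict

# Hypothetical example data: each edge points from an upstream column
# to a downstream column, written as "table.column".
column_edges = [
    ("ods_orders.amount", "dwd_orders.amount"),
    ("ods_orders.user_id", "dwd_orders.user_id"),
    ("dwd_orders.amount", "ads_revenue.total_amount"),
]


def table_of(column):
    """Return the table part of a 'table.column' identifier."""
    return column.split(".", 1)[0]


def to_table_edges(edges):
    """Collapse column-level edges into a de-duplicated set of table-level edges."""
    return {(table_of(src), table_of(dst)) for src, dst in edges}


def downstream(start, edges):
    """Return every node reachable from `start` by following the edges."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

Calling `downstream("ods_orders.user_id", column_edges)` follows only that column, while `downstream("ods_orders", to_table_edges(column_edges))` reports every table fed by the source table. That difference is why column-level lineage matters for impact analysis.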

Disadvantages of Atlas:

  • Its advantages are also its disadvantages: the company behind it has been acquired and the project is old, so its pedigree is now more of a burden than an advantage
  • The Hadoop ecosystem is in decline, and however well Atlas supports Hive and Hadoop, that alone no longer meets today's fast-changing technical requirements
  • The interface design is complicated and feels dated, and the data catalog and data search are not convenient enough
  • The product is geared toward solving problems for technical staff rather than for end users of data such as business staff
  • The ecosystem is gradually losing momentum while newer platforms of the same kind keep developing

Related introduction: https://mp.weixin.qq.com/s/MvaxSF74NE0E43i4rQEb3g
Suggestions for selection: 1) If your stack consists only of the Hadoop ecosystem, you can try it. 2) If your data asset platform is aimed at the technical staff of the data team, you can try it.

Datahub

Open source address: https://github.com/datahub-project/datahub (7.2K stars)
DataHub was open sourced by LinkedIn. Its official slogan is "The Metadata Platform for the Modern Data Stack". Its purpose is to solve metadata management problems across diverse data ecosystems. It provides metadata search, data discovery, data monitoring and data governance capabilities to help tame the complexity of data management.

DataHub is open sourced under the Apache License 2.0 and adopts a push-based metadata collection architecture (pull is also supported), so it can continuously capture changing metadata. The current version already integrates with most popular data systems, including but not limited to: Kafka, Airflow, MySQL, SQL Server, Postgres, LDAP, Snowflake, Hive, BigQuery.
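
The difference between push and pull collection can be shown with a small stand-alone sketch. This is not DataHub code: `MetadataStore`, `SourceSystem` and `pull_sync` are hypothetical names invented for the illustration, and the store is just an in-memory dictionary. With push, the source reports each change as it happens. With pull, a crawler copies whatever it finds on its next scheduled run.

```python
class MetadataStore:
    """Hypothetical in-memory stand-in for a metadata service."""

    def __init__(self):
        self.schemas = {}

    def upsert(self, dataset, columns):
        """Record the latest known column list for a dataset."""
        self.schemas[dataset] = list(columns)


class SourceSystem:
    """Hypothetical data source that owns table schemas."""

    def __init__(self):
        self.tables = {}
        self.listeners = []

    def alter_table(self, name, columns):
        self.tables[name] = list(columns)
        # Push: notify every subscriber immediately with the change itself.
        for listener in self.listeners:
            listener(name, columns)


def pull_sync(source, store):
    """Pull: a scheduled crawler scans the source and copies everything it finds."""
    for name, columns in source.tables.items():
        store.upsert(name, columns)


source = SourceSystem()
pushed, pulled = MetadataStore(), MetadataStore()
source.listeners.append(pushed.upsert)

source.alter_table("orders", ["id", "amount"])
# At this point `pushed` already knows the schema, while `pulled` is still empty.
pull_sync(source, pulled)
# Only after the crawl does `pulled` catch up.
```

The trade-off in this sketch is that push keeps metadata fresh but requires the source to emit events, whereas pull needs no cooperation from the source but is only as fresh as the last crawl.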

Advantages of Datahub:

  • Open sourced by a well-known company (LinkedIn, which also created Kafka). The community is active, development momentum is strong, and releases iterate quickly.
  • The positioning is clear and ambitious. The slogan shows the team's ambition and long-term investment, and the steady stream of releases confirms it.
  • The underlying architecture is flexible and advanced, built for extension and integration, and supports both push and pull modes. For details, see: https://datahubproject.io/docs/architecture/architecture/
  • The UI is simple and easy to use, friendly to both technical and business staff
  • Rich APIs and comprehensive functionality

Disadvantages of Datahub:

  • The front end does not support internationalization, and the way the interface is organized and used does not match the habits of Chinese users
  • Releases iterate quickly, which makes upgrading difficult once it is in use
  • Some features are still under construction, such as Hive column-level lineage
  • The performance of some features needs optimization, such as SQL profiling
  • There is little Chinese-language material and few Chinese-language community channels

Related introduction:
https://mp.weixin.qq.com/s/74gK3hTt7-j1lTbKFagbTQ
https://mp.weixin.qq.com/s/iP6sc2DzPaeAKpSWNmf8hQ
Suggestions for selection:
1) If you have at least half a front-end developer plus a back-end developer;
2) If you need a data asset management platform with a better user experience;
3) If you need to extend metadata support to a variety of platforms and systems.
In these cases, list Datahub as your first choice.
Although some shortcomings are listed above, Datahub is currently the best choice among open source products. The author also uses it in production; feel free to get in touch with any questions.
Commercial version: Metaphor (https://metaphor.io/) is the SaaS version of Datahub.

Marquez

Open source address: https://github.com/MarquezProject/marquez (1.3K stars)

Advantages of Marquez:

  • The interface is attractive, and the interaction details are well designed
  • Simple deployment and concise code
  • Built on the underlying OpenLineage protocol, which gives it a cleaner architecture
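
To give a feel for what building on OpenLineage means, the sketch below assembles a dictionary shaped like an OpenLineage run event. The field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`) follow my reading of the OpenLineage specification, which also defines further fields, such as a schema URL and facets, that are left out here. The helper `build_run_event`, the namespaces, the job name and the producer URL are hypothetical. Nothing is sent over the network; how to submit the event to a backend such as Marquez should be taken from its documentation.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event (simplified sketch)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        # Datasets are identified by a (namespace, name) pair.
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical producer identifier, for illustration only.
        "producer": "https://example.com/hypothetical-etl-job",
    }


event = build_run_event(
    "COMPLETE",
    job_namespace="etl",
    job_name="load_orders",
    inputs=[("warehouse", "ods_orders")],
    outputs=[("warehouse", "dwd_orders")],
)
payload = json.dumps(event, indent=2)  # JSON body an OpenLineage backend would receive
```

Because every job run declares its inputs and outputs in this common shape, a lineage backend can assemble the graph from events alone, without understanding each scheduler or engine.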

Disadvantages of Marquez:

  • It focuses on visualizing data assets and lineage; other data asset management functions require more development work

Related introduction: https://mp.weixin.qq.com/s/OMm6QEk9-1bFdYKuimdxCw
Suggestions for selection:
1) If you already have a capable metadata and data asset management backend and only need data asset visualization and lineage display, Marquez is worth trying.
2) The interface display is relatively good: it can highlight the dependency path of a selected node and hide side-branch dependencies. Full data asset management and metadata collection, however, would still require a lot of work.

Commercial version:
Datakin (https://datakin.com/) is the SaaS version of Marquez. It supports Apache Hive, Amazon RDS, Teradata, Amazon Redshift, Amazon S3, and Cassandra.

Amundsen


Open source address: https://github.com/amundsen-io/amundsen (3.8K stars)
Amundsen is an open source metadata management and data discovery platform from Lyft. It is full-featured, with a relatively complete front end, back end and data processing framework.

Advantages of Amundsen:

  • Open sourced by Lyft, with an active community and frequent releases
  • The positioning is clear: like Datahub, it aims to be the data catalog product of the modern data stack
  • Supports integration with a fairly large number of data platforms and tools

Disadvantages of Amundsen:

  • The UI is unremarkable, and operations are not convenient enough
  • There is little Chinese-language documentation
  • Lineage, tags, glossary terms and similar features are less convenient to use than in Datahub
  • Many of the components it supports best are not widely used in China

Related introduction:
https://mp.weixin.qq.com/s/yGZ1RJs2seu943sswxYYzw
https://mp.weixin.qq.com/s/5w6euvUWzm5RWXgisB-rMg
https://mp.weixin.qq.com/s/iVocnMV8zuQN-jcID83nSg
Suggestions for selection:
1) If you have people who can spend time tinkering with the platform, Datahub is recommended. If not, choose Amundsen.
Commercial version: Stemma (https://www.stemma.ai/) is the SaaS version of Amundsen.

Summary

Data governance and data asset management are foundational infrastructure for enterprise digital transformation. They are very important, yet their effect and value are hard to demonstrate. If upper-level issues such as data strategy, data architecture, data processes and data standards have not been resolved at the organizational level, then no matter how well the data asset platform is planned and implemented, its effect will be a drop in the bucket.


Origin blog.csdn.net/zdsx1104/article/details/128892219