Selection analysis of 12 open source data asset (metadata) management platforms (2)

When ChatGPT reached 100 million users in January, people marveled at the seemingly magical power of AI, as if everything would ultimately lead to AI. However, those who have studied AI in depth or worked in AI-related fields know that behind every impressive AI model lies a pile of messy data that is hard to describe.

Data has become the fifth factor of production after land, labor, capital, and technology. With digitalization and digital transformation in full swing around the world, and with digital twins and the metaverse emerging, more and more enterprises are starting to use artificial intelligence, machine learning, and big data analytics to mine the value of their data. However, as enterprises push deeper into data-driven value creation, they realize that no information technology, past or future, can deliver a magic solution that changes everything about a business overnight. Technology is only a means to an end. Only fundamental, sustained, long-term changes to an organization's culture, technical architecture, and operating model make it possible to reach the desired goal in the foreseeable future.

Nevertheless, many of these technologies depend on one key component: the data catalog (also called a data asset management platform or metadata platform). It organizes your data in one place and lets you tag it with metadata, so that more teams and people can discover and manage data efficiently.
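
As a minimal sketch of this idea, the following Python snippet models a catalog as a single registry in which datasets are tagged and then discovered by tag. The `DataCatalog` class, its method names, and the sample dataset names are hypothetical, invented for illustration, and do not correspond to any platform discussed in this series.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A catalog entry: where the data lives plus descriptive metadata."""
    name: str
    location: str
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """Hypothetical in-memory catalog; real platforms persist and index this."""

    def __init__(self) -> None:
        self._datasets: dict[str, Dataset] = {}

    def register(self, name: str, location: str) -> Dataset:
        """Add a dataset to the single shared registry."""
        dataset = Dataset(name=name, location=location)
        self._datasets[name] = dataset
        return dataset

    def tag(self, name: str, *tags: str) -> None:
        """Attach one or more metadata tags to a registered dataset."""
        self._datasets[name].tags.update(tags)

    def find_by_tag(self, tag: str) -> list[str]:
        """Return the names of all datasets carrying the given tag."""
        return sorted(d.name for d in self._datasets.values() if tag in d.tags)


catalog = DataCatalog()
catalog.register("orders", "warehouse.sales.orders")
catalog.register("customers", "warehouse.crm.customers")
catalog.tag("orders", "sales", "pii-free")
catalog.tag("customers", "sales", "pii")
print(catalog.find_by_tag("sales"))
```

The platforms below add persistence, search, lineage, and access control on top of this basic register, tag, and discover loop.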

In Selection analysis of 12 open source data asset (metadata) management platforms (1), we discussed four open source data asset management platforms. This article, the second in the series, continues with four more: Open Metadata, Open Data Discovery, Magda, and CKAN.

Open Metadata

Open source address: https://github.com/open-metadata/OpenMetadata (1.9K stars)
OpenMetadata is an open standard for metadata that provides the basic capabilities for an end-to-end metadata management solution. It includes all the components needed for data discovery, data governance, data collaboration, data quality, and observability.

Like Open Data Discovery, it has a very attractive UI, and its operation and usage logic suit the habits of business users.

Advantages of Open Metadata:

  • Provides an online demo environment, which helps with promotion and attracts newcomers
  • The UI is attractive, and its interaction logic suits the usage habits of Chinese users
  • The project is young and can learn from many existing data asset projects
  • Integrated data quality module
  • Supports open data standards, although these currently see little practical use domestically
  • A new design concept built around data observability

Shortcomings of Open Metadata:

  • The project is in its infancy, and few Chinese contributors are involved
  • Not very different from Open Data Discovery
  • The product is still changing rapidly
  • Very little Chinese-language material is available

Related introduction: https://sandbox.open-metadata.org/ Seeing is better than hearing, and trying it yourself is better than seeing.

Selection advice: The project is at an early stage, and a domestic ecosystem has not yet formed. Those who like trying new things and are willing to tinker can follow along and study it. Anyone building a production environment should be prepared to handle front-end and back-end issues and to dig deep into the source code.

Commercial version: Collate (https://www.getcollate.io/) is the SaaS version of Open Metadata.

Open Data Discovery

Open source address: https://github.com/opendatadiscovery/odd-platform (692 stars)

Open Data Discovery (ODD) is an open source data discovery and observability platform. It aims to help data-driven businesses democratize their data by making it easier to discover, manage, observe, trust, and secure. Because ODD supports open data standards, data teams can exchange data more efficiently between different data tools.

The platform's UI is genuinely attractive, and its metadata ingestion is specification-based. However, the platform is a work in progress, and some features are still being developed.

Advantages of Open Data Discovery:

  • Provides an online demo environment, which helps with promotion and attracts newcomers
  • The UI is attractive, and its interaction logic suits the usage habits of Chinese users
  • The project is young and can learn from many existing data asset projects
  • Integrated data quality module
  • Plans to adopt some of DataHub's best features
  • Supports open data standards, although these currently see little practical use domestically
  • Provides an interface for alerts on scheduled workflows
  • A new design concept built around data observability
  • Treats ML as a first-class citizen, which is a bet on the future development of AI

Disadvantages of Open Data Discovery:

  • The project is in its infancy, and the community is not very active
  • Many of its functions overlap with DataHub's
  • Very little Chinese-language material is available
  • Its product positioning is unclear

Related introduction: https://demo.oddp.io/ Seeing is better than hearing, and trying it yourself is better than seeing.

Selection advice: The project is at an early stage, and a domestic ecosystem has not yet formed. Those who like trying new things and are willing to tinker can follow along and study it. Anyone building a production environment should be prepared to handle front-end and back-end issues and to dig deep into the source code.

Magda

Open source address: https://github.com/magda-io/magda (408 stars)
Magda is a data catalog system that provides data cataloging, enhancement, search, tracking, and sorting. It supports internal and external data sources, handles both big and small data, and exposes data assets to external consumers through files, databases, or APIs.

Target users: Data professionals such as data analysts, data scientists, and data engineers.
Value goal: Provide data professionals with supporting functions such as historical data version management and duplicate data detection, improving the efficiency and quality of data querying and management.
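
Duplicate data detection can be illustrated with a short Python sketch that groups files by a hash of their contents. This is not Magda's implementation, which this article does not describe; the function name and the sample file contents are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Only groups with more than one member count as duplicates.
    return sorted(sorted(names) for names in by_digest.values() if len(names) > 1)


# Hand-written sample data, not taken from any real catalog.
sample_files = {
    "sales_2022.csv": b"id,amount\n1,10\n",
    "sales_2022_copy.csv": b"id,amount\n1,10\n",
    "sales_2023.csv": b"id,amount\n1,12\n",
}
print(find_duplicates(sample_files))
```

Exact hashing only catches identical copies; detecting near-duplicates (reordered rows, renamed columns) would need a different technique.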

Advantages of Magda:

  • A lightweight, simple data catalog management platform
  • Supports data preview
  • Focused functionality and independent deployment
  • Simple, clean interface
  • Supports map data

Disadvantages of Magda:

  • Narrow in scope: like CKAN below, it is positioned around data cataloging, display, and sharing
  • Performance problems when transferring massive amounts of data
  • No support for modern big data synchronization or integration
  • Relatively limited feature set

Related introduction: https://demo.dev.magda.io/ Seeing is better than hearing, and trying it yourself is better than seeing.

Selection advice: Today's data middle platforms and data asset platforms usually include a similar data portal, so Magda's functions tend to be built in. There are few enterprise scenarios in which Magda would be used on its own.

CKAN

Open source address: https://github.com/ckan/ckan (3.7K stars)
CKAN is the world's leading open source data portal platform, a tool for building open data websites. CKAN makes it easy to publish, share, and process data. It is a data management system that provides a powerful platform for cataloging, storing, and accessing datasets, with a rich front end, a full API (for both data and catalog), visualization tools, and more.

The description above is a direct Baidu translation of the description on CKAN's GitHub homepage. In plain terms, CKAN is a tool that helps you publish personal or organizational datasets through a website, where others can browse, search, preview, catalog, and download them. CKAN is well suited to open data use by national and local governments, research institutes, schools, and other organizations.
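
To give a flavor of the API mentioned above, here is a minimal Python sketch of a dataset search against CKAN's Action API. The `package_search` action and its `q` and `rows` parameters are part of CKAN; the helper function names, the `https://demo.ckan.org` base URL, and the sample response are assumptions for illustration. The sketch only builds the request URL and parses a response body, so it runs without network access.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a URL for CKAN's package_search action (dataset search)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_body: str) -> list[str]:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    # Fall back to the dataset's machine name when no title is set.
    return [item.get("title", item["name"]) for item in payload["result"]["results"]]


# Hypothetical portal URL; substitute the address of a real CKAN site.
url = build_search_url("https://demo.ckan.org", "air quality", rows=5)
print(url)

# Hand-written sample in the shape CKAN returns, not real portal data.
sample = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{"name": "air-quality", "title": "Air Quality"}]},
})
print(dataset_titles(sample))
```

To query a live portal, the URL could be fetched with `urllib.request.urlopen` and the response body passed to `dataset_titles`.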

Advantages of CKAN:

  • Python is the main development language, so getting started should not be difficult
  • It has a long history and is used by many governments and research organizations to publish open data
  • Simple to use, with independent deployment
  • Focused functionality: cataloging, development, preview, and download of small and medium-scale data

Shortcomings of CKAN:

  • Narrowly focused on data portals: cataloging and organizing data, with preview and download
  • Performance problems when transferring massive amounts of data
  • No support for modern big data synchronization or integration
  • Relatively limited feature set

Related introduction: https://blog.csdn.net/iCloudEnd/article/details/125676123

Selection advice: Today's data middle platforms and data asset platforms usually include a similar data portal, so CKAN's functions tend to be built in, and there are few enterprise scenarios in which CKAN would be used on its own. Governments, schools, and similar institutions, however, have many suitable application scenarios.

Summary

Of the four open source data asset management platforms introduced in this article, Open Data Discovery and Open Metadata have similar functions and the same positioning, and their development paths and trends differ only in minor ways. Both have attractive interfaces and ambitious feature sets, and their future looks promising. Organizations and teams with strong research and development capabilities can try them out and keep following their progress.

CKAN and Magda also have similar functions and positioning. Both focus on the last mile of data asset management: they catalog data well and let non-technical users quickly search, retrieve, preview, and download it. If you have no complicated data processing, integration, or processing workflows and simply want to share good-quality data of small or medium scale, CKAN and Magda are worth considering.

Origin blog.csdn.net/zdsx1104/article/details/128909771