Hello everyone, I am Dugufeng. I used to work as a coal handler at a port; today I lead big data at a state-owned enterprise and run the official account Big Data Flow. Over the last two years, driven by my company's needs and by where big data is heading, I have been learning data governance. Today I will share an all-in-one metadata management platform: OpenMetadata.
This article is compiled from the official website and my own hands-on notes. Follow-up articles will be published on the official account Big Data Flow, which will continue to be updated~
This article has four parts: open source metadata management platforms, an introduction to OpenMetadata, the installation process, and a function demonstration.
1. Open Source Metadata Management Platform
Metadata management is the starting point for an enterprise's data governance effort, and new metadata management tools and platforms keep appearing.
Open source metadata management platforms are tools for collecting, storing, and managing metadata. They provide a scalable way to organize and maintain the metadata that describes your data. Here are some common ones:
Apache Atlas: Apache Atlas is an open source big data metadata management and data governance platform designed to help organizations collect, organize and manage metadata information of data. It provides rich metadata models and search functions, and can be integrated with various data storage and processing platforms.
LinkedIn DataHub: LinkedIn DataHub is LinkedIn's open source metadata search and discovery platform. It provides a centralized metadata repository for managing and browsing metadata information of various types of datasets and data assets.
Amundsen: Amundsen is Lyft's open source data discovery and metadata management platform. It provides a user-friendly interface that enables users to search, browse and contribute metadata information of datasets. Amundsen also supports integration with other data tools and platforms.
Metacat: Metacat is Netflix's open source data discovery and metadata management platform. It provides a unified interface to find and browse metadata information of various datasets, and supports integration with other data tools and services.
These platforms provide functions such as metadata storage, search, browsing, data asset relationship management, and data lineage tracking, which help organizations manage and use their metadata better.
OpenMetadata, the platform introduced today, aims to provide a metadata standard so that metadata can be managed more consistently.
2. Introduction to OpenMetadata
OpenMetadata is an all-in-one platform for data discovery, data lineage, data quality, observability, governance, and team collaboration. It is one of the fastest growing open source projects, with a vibrant community and adoption by numerous companies across industry verticals. It is powered by a centralized metadata store based on open metadata standards and APIs, supports connectors to a wide range of data services, and enables end-to-end metadata management so that you can unlock the value of your data assets.
At the time of writing, OpenMetadata has 2.5k stars on GitHub and has just released version 1.1.
For readers with network restrictions, reply "OpenMetadata1.1" to the Big Data Flow official account to download the source code and installation package. The link is valid for one month.
OpenMetadata consists of the following components:
Metadata Schema - Defines the core abstraction and vocabulary of metadata using a schema of types, entities, and relationships between entities. This is the basis for open metadata standards. Extensibility for entities and types with custom properties is also supported.
Metadata Store - Stores the metadata graph that connects data assets with user-generated and tool-generated metadata.
Metadata API - APIs for producing and consuming metadata, built on the metadata schema and used by the user interface and by integrations with tools, systems, and services.
Ingestion Framework - A pluggable framework for integrating tools and ingesting metadata into the metadata store, with about 55 connectors. It supports well-known data warehouses such as Google BigQuery, Snowflake, Amazon Redshift, and Apache Hive; databases such as MySQL, Postgres, Oracle, and MSSQL; dashboard services such as Tableau, Superset, and Metabase; messaging services such as Kafka and Redpanda; and pipeline services such as Airflow, Glue, Fivetran, and Dagster.
OpenMetadata User Interface - A single place for users to discover and collaborate on all data.
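To make the idea of "entities and relationships stored as a graph" more concrete, here is a toy in-memory sketch. It is not OpenMetadata's actual schema or storage code; the class names, the relation names "owns" and "upstream_of", and the sample entities are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A typed node in the graph, e.g. a table, a dashboard, or a user."""
    entity_type: str
    name: str


@dataclass
class MetadataGraph:
    """Toy store of typed relationships between entities (illustration only)."""
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def relate(self, source, relation, target):
        """Record that `source` has `relation` to `target`."""
        self.edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Return the entities directly related to `source` by `relation`."""
        return set(self.edges.get((source, relation), set()))

    def downstream(self, source):
        """Follow 'upstream_of' edges transitively from `source`."""
        seen, stack = set(), [source]
        while stack:
            for nxt in self.related(stack.pop(), "upstream_of"):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Hypothetical sample data: a raw table feeds a summary table, which feeds a dashboard.
orders = Entity("table", "orders")
daily_revenue = Entity("table", "daily_revenue")
dashboard = Entity("dashboard", "revenue_overview")
alice = Entity("user", "alice")

graph = MetadataGraph()
graph.relate(alice, "owns", orders)
graph.relate(orders, "upstream_of", daily_revenue)
graph.relate(daily_revenue, "upstream_of", dashboard)

# Everything that would be affected by a change to the orders table.
print(sorted(e.name for e in graph.downstream(orders)))
```

Keeping ownership and lineage as edges in the same graph is what lets a single query answer questions such as "which dashboards depend on this table, and who owns it".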
Core functions
Data Collaboration - Get event notifications through activity feeds. Use webhooks to send alerts and notifications. Add announcements to notify the team of upcoming changes. Add tasks to request descriptions or glossary term approvals. Mention users and collaborate in conversation threads.
Data Quality and Profiler - Standardized tests and data quality metadata. Group related tests into test suites. Custom SQL data quality tests are supported. An interactive dashboard lets you drill down into the details.
Data Lineage - Supports rich column-level lineage. Efficiently filter queries to extract lineage. Manually edit lineages as needed and connect entities using the no-code editor.
Comprehensive roles and policies - handle complex access control use cases and hierarchical teams.
Connectors - Supports 55 connectors to various databases, dashboards, pipelines, and messaging services.
Glossary - Add a controlled vocabulary to describe important concepts and terms within your organization. Add glossaries, terms, tags, descriptions and reviewers.
Data Security - Supports Google, Okta, custom OIDC, Auth0, Azure, Amazon Cognito, and OneLogin as identity providers for SSO. Additionally, AWS SSO and Google SAML-based authentication are supported.
3. Installation process
This walkthrough uses the Docker installation method, which takes only a few minutes.
First, check the Python version.
python3 --version
One of Python 3.7, 3.8, or 3.9 is required.
Check the Docker version.
docker --version
Docker 20.10.0 or later is required.
docker compose version
Docker Compose 2.1.1 or later is required.
Create a working directory.
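The three checks above can also be scripted. The following is a minimal sketch, assuming the version requirements stated in this article; the helper names are illustrative, and the parsing simply takes the first dotted number found in each command's output.

```python
import re
import subprocess

# Requirements as stated in this article.
SUPPORTED_PYTHON = {(3, 7), (3, 8), (3, 9)}
MIN_DOCKER = (20, 10, 0)
MIN_COMPOSE = (2, 1, 1)


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups(default="0"))


def python_ok(text):
    """True if the major.minor version in text is one of the supported ones."""
    return parse_version(text)[:2] in SUPPORTED_PYTHON


def at_least(text, minimum):
    """True if the version in text is greater than or equal to minimum."""
    return parse_version(text) >= minimum


def command_output(*args):
    """Run a command and return its stdout, or '' if it is missing or fails."""
    try:
        result = subprocess.run(args, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return ""
    return result.stdout.strip()


def check_prerequisites():
    """Return a dict mapping each tool name to True or False."""
    checks = {
        "python": (("python3", "--version"), python_ok),
        "docker": (("docker", "--version"), lambda out: at_least(out, MIN_DOCKER)),
        "docker compose": (("docker", "compose", "version"),
                           lambda out: at_least(out, MIN_COMPOSE)),
    }
    results = {}
    for name, (command, is_ok) in checks.items():
        try:
            results[name] = is_ok(command_output(*command))
        except ValueError:  # empty or unparseable output
            results[name] = False
    return results


if __name__ == "__main__":
    for tool, ok in check_prerequisites().items():
        print(f"{tool}: {'ok' if ok else 'missing or unsupported version'}")
```

Comparing tuples of integers avoids the classic string-comparison mistake where "3.10" sorts before "3.9".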
mkdir openmetadata-docker && cd openmetadata-docker
Create a virtual environment.
python3 -m venv env
Activate the virtual environment.
source env/bin/activate
Upgrade pip and setuptools.
pip3 install --upgrade pip setuptools
Install the OpenMetadata ingestion package with the Docker extra.
pip3 install --upgrade "openmetadata-ingestion[docker]"
Verify that the installation succeeded.
metadata docker --help
Start the containers.
metadata docker --start
To use PostgreSQL as the backend database, start the containers with:
metadata docker --start -db postgres
Once the containers are running, open the UI in a browser:
http://localhost:8585
If the OpenMetadata page loads, the installation succeeded!
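The containers can take a while to become ready after the start command returns, so the page may not load on the first try. Below is a small polling sketch that waits for the address above to answer; the function names, the number of attempts, and the delay are assumptions chosen for illustration, not values taken from the OpenMetadata documentation.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=5.0):
    """Return True if an HTTP GET to url succeeds with a status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_up(url, attempts=30, delay=10.0, probe=is_up, sleep=time.sleep):
    """Poll url until probe(url) is True.

    Returns the attempt number that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return 0


if __name__ == "__main__":
    if wait_until_up("http://localhost:8585"):
        print("OpenMetadata UI is reachable")
    else:
        print("UI did not respond; check the container status with `docker ps`")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server or real waiting.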
4. Function demonstration
Home page
Multilingual support
Overview page
Data quality monitoring page
Data assets
Business glossary
Data source configuration
To be continued~
For more on big data, data governance, and artificial intelligence, follow the official account Big Data Flow.