Metadata Management: Importing MySQL Database Metadata into DataHub

MySQL Metadata Sync Overview

MySQL metadata sync is a pull-based sync: it can only do full or batched incremental metadata sync. Push-based sync, by contrast, can achieve real-time metadata sync; at the time of writing only Airflow, Spark, Great Expectations, and Protobuf Schemas support it.

1. Install the MySQL metadata sync dependency

(docker-compose-v1-py) [root@datahub ~]# pip3 install 'acryl-datahub[mysql]'
......(output truncated)......
Successfully installed Ipython-8.4.0 Jinja2-3.0.3 Send2Trash-1.8.0 altair-4.2.0 argon2-cffi-21.3.0 argon2-cffi-bindings-21.2.0 asttokens-2.0.5 backcall-0.2.0 beautifulsoup4-4.11.1 bleach-5.0.0 colorama-0.4.5 debugpy-1.6.0 decorator-5.1.1 defusedxml-0.7.1 executing-0.8.3 fastjsonschema-2.15.3 great-expectations-0.15.2 greenlet-1.1.2 importlib-metadata-4.11.4 ipykernel-6.15.0 ipython-genutils-0.2.0 jedi-0.18.1 jsonpatch-1.32 jsonpointer-2.3 jupyter-client-7.3.4 jupyter-core-4.10.0 jupyterlab-pygments-0.2.2 matplotlib-inline-0.1.3 mistune-0.8.4 nbclient-0.6.3 nbconvert-6.5.0 nbformat-5.4.0 nest-asyncio-1.5.5 notebook-6.4.12 numpy-1.22.4 pandas-1.4.2 pandocfilters-1.5.0 parso-0.8.3 pexpect-4.8.0 pickleshare-0.7.5 prometheus-client-0.14.1 prompt-toolkit-3.0.29 ptyprocess-0.7.0 pure-eval-0.2.2 pygments-2.12.0 pymysql-1.0.2 pyparsing-2.4.7 pyzmq-23.1.0 ruamel.yaml-0.17.17 scipy-1.8.1 soupsieve-2.3.2.post1 sqlalchemy-1.3.24 stack-data-0.3.0 terminado-0.15.0 tinycss2-1.1.1 toolz-0.11.2 tornado-6.1 tqdm-4.64.0 traitlets-5.2.1.post0 wcwidth-0.2.5 webencodings-0.5.1 zipp-3.8.0
(docker-compose-v1-py) [root@datahub ~]#

2. Write the mysql_recipe.yaml ingestion recipe

Define the connection details of the source database whose metadata will be ingested, plus the sink configuration; the sink type here is datahub-rest. The file contents are shown below.

Metadata is pulled from MySQL and then written to DataHub via its REST API.

[root@datahub ~]# cat mysql_recipe.yaml 
source:
  type: mysql
  config:
    host_port: 192.168.23.121:3306
    database: d_general
    username: root
    password: Root_123
    
    
sink:
  type: "datahub-rest"
  config:
    server: 'http://datahub:8080'

[root@datahub ~]#
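Hard-coding the root password in the recipe is risky if the file is shared or checked into version control. DataHub recipes support environment-variable expansion, so the password can be supplied at run time instead (a sketch; the variable name `MYSQL_PASSWORD` is our own choice):

```yaml
source:
  type: mysql
  config:
    host_port: 192.168.23.121:3306
    database: d_general
    username: root
    # expanded from the shell environment when the recipe is loaded
    password: ${MYSQL_PASSWORD}

sink:
  type: "datahub-rest"
  config:
    server: 'http://datahub:8080'
```

Then run `export MYSQL_PASSWORD='Root_123'` before invoking `python3 -m datahub ingest -c /root/mysql_recipe.yaml`, and the recipe file itself no longer contains the secret.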

Note that the database option here does not appear to act as a filter: metadata for all databases and tables on the MySQL server gets synced.
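To ingest only a subset, the source config accepts regex allow/deny patterns. In the DataHub CLI version used here, MySQL databases are filtered via schema_pattern (in MySQL, "schema" and "database" are the same thing); a sketch, with option names worth double-checking against your CLI version's docs:

```yaml
source:
  type: mysql
  config:
    host_port: 192.168.23.121:3306
    username: root
    password: Root_123
    # only ingest the d_general database
    schema_pattern:
      allow:
        - "d_general"
    # skip tables with a tmp_ prefix (hypothetical naming convention)
    table_pattern:
      deny:
        - ".*\\.tmp_.*"
```

The patterns are regular expressions matched against the fully qualified names, so an allow list on schema_pattern is usually the simplest way to keep the sync scoped to one database.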

3. Run the metadata sync

(docker-compose-v1-py) [root@datahub ~]# python3 -m datahub ingest -c /root/mysql_recipe.yaml
[2022-06-19 13:38:56,423] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38
[2022-06-19 13:38:58,748] INFO     {datahub.cli.ingest_cli:115} - Starting metadata ingestion
[2022-06-19 13:38:59,362] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-info-d_general-urn:li:container:1ea81cc18487dc3ee6ce8a98e45e4a55
[2022-06-19 13:38:59,529] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-platforminstance-d_general-urn:li:container:1ea81cc18487dc3ee6ce8a98e45e4a55
[2022-06-19 13:38:59,694] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-subtypes-d_general-urn:li:container:1ea81cc18487dc3ee6ce8a98e45e4a55
[2022-06-19 13:38:59,944] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-info-abdoor-urn:li:container:cdcc8d8abadaef84636c827c9383a2e5
[2022-06-19 13:39:00,096] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-platforminstance-abdoor-urn:li:container:cdcc8d8abadaef84636c827c9383a2e5
[2022-06-19 13:39:00,218] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-subtypes-abdoor-urn:li:container:cdcc8d8abadaef84636c827c9383a2e5
[2022-06-19 13:39:00,352] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-parent-container-abdoor-urn:li:container:cdcc8d8abadaef84636c827c9383a2e5-urn:li:container:1ea81cc18487dc3ee6ce8a98e45e4a55
[2022-06-19 13:39:00,853] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-urn:li:container:cdcc8d8abadaef84636c827c9383a2e5-to-urn:li:dataset:(urn:li:dataPlatform:mysql,abdoor.ab_system,PROD)
[2022-06-19 13:39:01,647] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit abdoor.ab_system
[2022-06-19 13:39:02,189] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit abdoor.ab_system-subtypes
[2022-06-19 13:39:02,970] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit container-urn:li:container:cdcc8d8abadaef84636c827c9383a2e5-to-urn:li:dataset:(urn:li:dataPlatform:mysql,abdoor.area_main,PROD)
[2022-06-19 13:39:03,233] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit abdoor.area_main
......(output truncated)......
Sink (datahub-rest) report:
{'records_written': 15720,
 'warnings': [],
 'failures': [],
 'downstream_start_time': datetime.datetime(2022, 6, 19, 13, 58, 50, 418010),
 'downstream_end_time': datetime.datetime(2022, 6, 19, 14, 18, 49, 555005),
 'downstream_total_latency_in_seconds': 1199.136995,
 'gms_version': 'v0.8.38'}

Pipeline finished with 14 warnings in source producing 15720 workunits
(docker-compose-v1-py) [root@datahub ~]#

On a re-run of the sync: if a MySQL table's metadata has not changed, no new version is produced in DataHub and the sync completes quickly; if the metadata has changed, a new version is produced in DataHub, and you can switch between the new and old versions to compare them.
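One limitation of repeated pull-based syncs is that tables dropped in MySQL are never removed from DataHub. DataHub's stateful ingestion feature can detect and soft-delete such stale entities across runs. A sketch, assuming the stateful-ingestion options available in this era of the CLI (a top-level pipeline_name is required so runs can be correlated; exact option names vary by version):

```yaml
source:
  type: mysql
  config:
    host_port: 192.168.23.121:3306
    username: root
    password: Root_123
    stateful_ingestion:
      enabled: true
      # soft-delete entities that existed in a previous run but not this one
      remove_stale_metadata: true

# identifies this recurring pipeline across runs
pipeline_name: mysql_d_general_sync

sink:
  type: "datahub-rest"
  config:
    server: 'http://datahub:8080'
```

Without this, a dropped table's dataset lingers in DataHub until it is deleted manually.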

Reposted from blog.csdn.net/yy8623977/article/details/125356024