Metadata management platform DataHub 0.10.5: installation, deployment, and importing metadata from various sources

Official website document link

DataHub Quickstart Guide | DataHub (datahubproject.io)

This article uses Python 3.8.16, Docker 20.10.0, and DataHub 0.10.5.

Python must be version 3.7 or above; DataHub 0.10.5 does not support earlier versions.

If you want to add data sources through the web UI, note that it invokes the `python` and `pip` commands directly, so you must set the environment variables so that those commands resolve; `python3` alone cannot be used.

Install python3

One thing to note is that DataHub requires OpenSSL 1.1.1 or above, so configure OpenSSL in advance when installing Python 3. You can refer to this post:

python error: ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1 (Mumunu's blog, CSDN)

Download and unpack the Python 3 source package

wget https://www.python.org/ftp/python/3.8.16/Python-3.8.16.tgz
tar -zxvf Python-3.8.16.tgz

Install the build dependencies

yum install -y zlib-devel bzip2-devel \
openssl-devel ncurses-devel epel-release gcc gcc-c++ xz-devel readline-devel \
gdbm-devel sqlite-devel tk-devel db4-devel libpcap-devel libffi-devel

Compile Python3

mkdir /usr/local/python3
cd Python-3.8.16
./configure --prefix=/usr/local/python3
make && make install
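
Because the web-based ingestion shells out to `python` and `pip` (see the note above), it helps to expose the freshly built interpreter under those names as well. A minimal sketch, assuming the `--prefix=/usr/local/python3` used above:

```shell
# Expose the new interpreter as python/pip as well as python3/pip3.
# Paths assume the --prefix=/usr/local/python3 configure option above.
ln -s /usr/local/python3/bin/python3 /usr/local/bin/python3
ln -s /usr/local/python3/bin/pip3 /usr/local/bin/pip3
ln -s /usr/local/python3/bin/python3 /usr/local/bin/python
ln -s /usr/local/python3/bin/pip3 /usr/local/bin/pip
```

Adjust the link targets if you configured a different prefix.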

Next, deploy Docker

# Download the docker-20.10.0 package
wget https://download.docker.com/linux/static/stable/x86_64/docker-20.10.0.tgz
# Download the docker-compose binary for this platform
curl -SL https://github.com/docker/compose/releases/download/v2.20.3/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose

chmod +x /usr/local/bin/docker-compose
tar -zxvf docker-20.10.0.tgz
# Copy the extracted docker binaries into /usr/bin/
cp docker/* /usr/bin/
# Check the Docker version
docker version
# Show Docker system info
docker info

Configure docker

Configure Docker to start automatically at boot
# Create the docker.service file
vi /etc/systemd/system/docker.service
# Press i to enter insert mode and paste the following:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
# Add executable permission to the file
chmod +x /etc/systemd/system/docker.service
# Reload the systemd configuration
systemctl daemon-reload
# Start Docker
systemctl start docker
# Check Docker's startup status
systemctl status docker
# List running containers
docker ps
# Enable start on boot
systemctl enable docker.service
# Check whether Docker is enabled at boot (enabled: on, disabled: off)
systemctl is-enabled docker.service

Install Datahub

pip3 install acryl-datahub==0.10.5

Check the version status.

python3 -m datahub version

The next step is to download the images. They total more than ten gigabytes, so be patient.

We choose to start it by reading a local configuration file:

python3 -m datahub docker quickstart --quickstart-compose-file ./docker-compose.consumers-without-neo4j.quickstart.yml
docker-compose -p datahub -f ./docker-compose.consumers-without-neo4j.quickstart.yml up -d

This file can be downloaded from https://github.com/datahub-project/datahub/tree/master/docker/quickstart
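
As a sketch, the file can also be fetched non-interactively; the raw URL below is an assumption based on the repository layout:

```shell
# Download the quickstart compose file (raw URL assumed from the repo layout)
wget https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose.consumers-without-neo4j.quickstart.yml
```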

If the command completes without errors, the deployment is working.

Check whether all of these containers have started; if some have not, simply restart them.
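
One way to check, assuming the compose project name `datahub` used in this guide:

```shell
# Show the status of all quickstart containers
docker-compose -p datahub -f ./docker-compose.consumers-without-neo4j.quickstart.yml ps
# Containers that died on startup show up here
docker ps -a --filter status=exited
```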

Open http://&lt;server IP&gt;:9002 in a browser; if the page loads, startup succeeded.

Some basic commands

# Start
docker-compose -p datahub -f ./docker-compose.consumers-without-neo4j.quickstart.yml up -d
# Stop
docker-compose -p datahub -f ./docker-compose.consumers-without-neo4j.quickstart.yml stop

Check which plugins are installed:
python3 -m datahub check plugins --verbose

When a plugin is missing, install the corresponding one:
pip3 install 'acryl-datahub[<data-source>]'
For example:
pip3 install 'acryl-datahub[mysql]'

Import Hive metadata

First, set up a Kerberos client on the machine where DataHub is deployed.

# Install the Kerberos client
yum -y install krb5-libs krb5-workstation

# Sync the KDC configuration
scp hadoop102:/etc/krb5.conf /etc/krb5.conf
scp hadoop102:/etc/security/keytab/ranger_all_publc.keytab /etc/security/keytab/

# Verify that you can connect to the service
kinit -kt /etc/security/keytab/ranger_all_publc.keytab hadoop/[email protected]
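
If kinit returns without an error, the ticket cache should contain a krbtgt entry; a quick sanity check:

```shell
# List current Kerberos tickets; a krbtgt/<REALM> entry means kinit worked
klist
```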

When configuring the Hive data source, do not use the web UI, or you will get an error that there is no corresponding principal in the Kerberos database; presumably DataHub's docker environment lacks the corresponding Kerberos authorization.

 
# Install SASL, otherwise a later step fails because this package is missing
yum install cyrus-sasl cyrus-sasl-lib cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5

pip install sasl

# Install the Hive plugin
pip install 'acryl-datahub[hive]'
 
 
Write the corresponding YAML for Hive and save it as hive.yml:

source:
  type: hive
  config:
    host_port: xxxx:10000
    database: test 
    username: hive
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
        scheme: 'hive+https'
sink:
  type: "datahub-rest"
  config:
    server: 'http://IP:8080'
    token: ''  # fill in an access token if one is required


 
Then run the import: python3 -m datahub --debug ingest -c hive.yml
You can also drop --debug, otherwise the log output is very verbose.
 
Script for scheduled import of Hive metadata
 
 
import os
import subprocess

# Collect every ingestion recipe in the directory
yml_dir = '/root/datalineage'
yml_files = [f for f in os.listdir(yml_dir) if f.endswith('.yml')]

for file in yml_files:
    # os.listdir returns bare filenames, so join with the directory
    # to make the script independent of the current working directory
    cmd = f"python3 -m datahub ingest -c {os.path.join(yml_dir, file)}"
    subprocess.run(cmd, shell=True, check=True)
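
The script does not schedule itself; a minimal crontab sketch that runs it nightly (the script path and log file are assumptions) might look like:

```shell
# Edit root's crontab (crontab -e) and add a line like this to run the
# batch ingestion every day at 02:00; the script path is an assumption.
0 2 * * * /usr/local/bin/python3 /root/datalineage/ingest_all.py >> /var/log/datahub-ingest.log 2>&1
```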

Import MySQL metadata

# Install the MySQL plugin
pip install 'acryl-datahub[mysql]'
 
 
Write the corresponding YAML and save it as mysql.yml:
  
source:
  type: mysql
  config:
    # Coordinates
    host_port: master:3306
    database: dolphinscheduler
    # Credentials
    username: root
    password: lovol
    # If you need to use SSL with MySQL:
    # options:
    #   connect_args:
    #     ssl_ca: "path_to/server-ca.pem"
    #     ssl_cert: "path_to/client-cert.pem"
    #     ssl_key: "path_to/client-key.pem"
sink:
  # sink configs
  type: datahub-rest
  config:
    server: http://slave1:8080


 
Then run the import: python3 -m datahub --debug ingest -c mysql.yml
 

However, I did not manage to import successfully this way, so I used the web interface instead.

Select MySQL and fill in the basic information; the fields are self-explanatory, so just click Next, there are no pitfalls. Once execution starts you can watch the log to check for problems. Note again that the web UI invokes the `python` and `pip` commands directly, so the environment variables must be set accordingly; `python3` cannot be used.

Origin blog.csdn.net/h952520296/article/details/132848338