Official documentation link:
DataHub Quickstart Guide | DataHub (datahubproject.io)
The versions used in this article are Python 3.8.16, Docker 20.10.0, and DataHub 0.10.5.
Python must be version 3.7 or above; DataHub 0.10.5 does not support earlier versions.
If you want to add data sources from the web UI, note that it invokes the `python` and `pip` commands directly, so you need to set the environment variables accordingly; `python3` alone will not work.
Install python3
One thing to note is that DataHub requires OpenSSL 1.1.1 or above, so configure it before building Python 3. You can read this post for details:
"python error: ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1" (Mumunu's blog, CSDN)
Download and unzip the Python3 installation package
wget https://www.python.org/ftp/python/3.8.16/Python-3.8.16.tgz
tar -zxvf Python-3.8.16.tgz
Install the build dependencies
yum install -y zlib-devel bzip2-devel \
openssl-devel ncurses-devel epel-release gcc gcc-c++ xz-devel readline-devel \
gdbm-devel sqlite-devel tk-devel db4-devel libpcap-devel libffi-devel
Compile Python3
mkdir /usr/local/python3
cd Python-3.8.16
./configure --prefix=/usr/local/python3
make && make install
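Since the web-based ingestion mentioned above shells out to `python` and `pip` rather than `python3`, it helps to expose the freshly built interpreter under those names as well. A minimal sketch, assuming the `/usr/local/python3` prefix used in the configure step above:

```shell
# Expose the new interpreter as python3/pip3 and also as python/pip,
# since DataHub's web ingestion invokes the latter names directly.
# Paths assume the --prefix=/usr/local/python3 configure flag above.
ln -sf /usr/local/python3/bin/python3.8 /usr/local/bin/python3
ln -sf /usr/local/python3/bin/pip3     /usr/local/bin/pip3
ln -sf /usr/local/python3/bin/python3.8 /usr/local/bin/python
ln -sf /usr/local/python3/bin/pip3     /usr/local/bin/pip
# Sanity check: both names should now report 3.8.16
python --version
pip --version
```

Alternatively, put `/usr/local/python3/bin` on `PATH` in `/etc/profile`; either way, make sure the bare `python` and `pip` names resolve before using the web UI ingestion.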
Next, deploy Docker
# Download the docker-20.10.0 package
wget https://download.docker.com/linux/static/stable/x86_64/docker-20.10.0.tgz
# Download the docker-compose binary for your platform
curl -SL https://github.com/docker/compose/releases/download/v2.20.3/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
tar -zxvf docker-20.10.0.tgz
# Move the extracted docker binaries into /usr/bin/
cp docker/* /usr/bin/
# Check the docker version
docker version
# Show docker info
docker info
Configure Docker
Set up Docker as a systemd service that starts on boot.
# Create the docker.service file
vi /etc/systemd/system/docker.service
# Press i for insert mode and paste the following:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this option.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
# Make the file executable
chmod +x /etc/systemd/system/docker.service
# Reload the systemd configuration
systemctl daemon-reload
# Start Docker
systemctl start docker
# Check Docker's status
systemctl status docker
# List running containers
docker ps
# Enable start on boot
systemctl enable docker.service
# Check whether Docker starts on boot (enabled: on, disabled: off)
systemctl is-enabled docker.service
Install Datahub
pip3 install acryl-datahub==0.10.5
Check the version status.
python3 -m datahub version
The next step is to download the images. They are large (more than ten GB in total), so be patient.
We choose to start from a locally saved compose file:
python3 -m datahub docker quickstart --quickstart-compose-file ./docker-compose.consumers-without-neo4j.quickstart.yml
Alternatively, start it with docker-compose directly:
docker-compose -p datahub -f ./docker-compose.consumers-without-neo4j.quickstart.yml up -d
Download this file from https://github.com/datahub-project/datahub/tree/master/docker/quickstart
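As a sketch, the compose file can be fetched with curl; the raw.githubusercontent.com URL is an assumption derived from the repository path above (adjust the branch or tag if needed):

```shell
# Download the quickstart compose file referenced above (raw URL is
# assumed from the repo path; pin a release tag for reproducibility)
curl -fsSL -o docker-compose.consumers-without-neo4j.quickstart.yml \
  https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose.consumers-without-neo4j.quickstart.yml
```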
If the command completes without errors, the deployment is fine.
Check whether all of the containers have started; if some have not, just restart them.
Visit http://IP:9002 — if the page loads, startup succeeded.
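A quick reachability check from the shell (a sketch; substitute your server's IP for 127.0.0.1):

```shell
# The frontend should answer on port 9002 once startup has finished;
# -f makes curl fail on HTTP errors, -w prints the status code
curl -fsS -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9002
```

A `200` here means the frontend is up; a connection error usually means the containers are still starting.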
Some basic commands
# Start
docker-compose -p datahub -f ./docker-compose.consumers-without-neo4j.quickstart.yml up -d
# Stop
docker-compose -p datahub -f ./docker-compose.consumers-without-neo4j.quickstart.yml stop
List the available plugins
python3 -m datahub check plugins --verbose
When a plugin is missing, install the corresponding one:
pip3 install 'acryl-datahub[<source>]'
For example:
pip3 install 'acryl-datahub[mysql]'
Import hive metadata
First, set up the Kerberos client environment on the machine where DataHub is deployed.
Install the Kerberos client
yum -y install krb5-libs krb5-workstation
Sync the KDC configuration
scp hadoop102:/etc/krb5.conf /etc/krb5.conf
scp hadoop102:/etc/security/keytab/ranger_all_publc.keytab /etc/security/keytab/
Verify that you can connect to the service
kinit -kt /etc/security/keytab/ranger_all_publc.keytab hadoop/[email protected]
When configuring the Hive data source, do not use the web UI, or you will get an error about there being no corresponding authorization in the Kerberos database — presumably because DataHub's docker environment has no matching Kerberos credentials.
Install SASL, otherwise a later step will fail because this package is missing
yum install cyrus-sasl cyrus-sasl-lib cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5
pip install sasl
Install the hive plugin
pip install 'acryl-datahub[hive]'
Write the corresponding YAML config and save it as hive.yml
source:
  type: hive
  config:
    host_port: xxxx:10000
    database: test
    username: hive
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
    scheme: 'hive+https'
sink:
  type: "datahub-rest"
  config:
    server: 'http://IP:8080'
    token: # fill in if you have one
Then run the import:
python3 -m datahub --debug ingest -c hive.yml
You can also drop --debug, otherwise the logs get too verbose.
Import Hive metadata on a schedule with a script
import os
import subprocess

# Directory holding the ingestion recipes (one .yml per source)
RECIPE_DIR = '/root/datalineage'

yml_files = [f for f in os.listdir(RECIPE_DIR) if f.endswith('.yml')]
for file in yml_files:
    # Use the full path so the script works regardless of the current directory
    path = os.path.join(RECIPE_DIR, file)
    subprocess.run(['python3', '-m', 'datahub', 'ingest', '-c', path], check=True)
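To actually run this on a schedule, a cron entry can drive the script. A sketch, assuming the script above is saved as /root/datalineage/ingest_all.py (a hypothetical filename — adjust to wherever you saved it):

```shell
# crontab -e entry: run the batch ingestion nightly at 02:00;
# the script path/name is an assumption, and the interpreter path
# matches the /usr/local/python3 prefix used earlier
0 2 * * * /usr/local/python3/bin/python3 /root/datalineage/ingest_all.py >> /var/log/datahub-ingest.log 2>&1
```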
Import mysql metadata
Install the mysql plugin
pip install 'acryl-datahub[mysql]'
Write the corresponding YAML config and save it as mysql.yml
source:
  type: mysql
  config:
    # Coordinates
    host_port: master:3306
    database: dolphinscheduler
    # Credentials
    username: root
    password: lovol
    # If you need to use SSL with MySQL:
    # options:
    #   connect_args:
    #     ssl_ca: "path_to/server-ca.pem"
    #     ssl_cert: "path_to/client-cert.pem"
    #     ssl_key: "path_to/client-key.pem"
sink:
  # sink configs
  type: datahub-rest
  config:
    server: http://slave1:8080
Then run the import:
python3 -m datahub --debug ingest -c mysql.yml
However, I did not manage to import successfully this way, so I used the web UI instead.
Select mysql and fill in the basic information; it is all straightforward — just click Next, with no pitfalls. Once execution starts, check the log to see whether there are any problems. Note again that the web UI calls the `python` and `pip` commands directly, so the environment variables must be set; `python3` alone will not work.