1. Overview
This Hive deployment with docker-compose builds directly on the Hadoop deployment from the previous article. Hive is one of the most commonly used data warehouse services, so it is worth integrating. Deploying with docker-compose is mainly a way to bring services up quickly with minimal resources and time, which is convenient for learning, testing, and verifying functionality.
For Hadoop deployment, please refer to my following articles:
- Detailed tutorial on quickly deploying Hadoop clusters through docker-compose
- A minimalist tutorial on quickly deploying Hadoop clusters through docker-compose
It is best to read through the Hadoop deployment articles first. If you would rather skip them, this article can still be followed on its own.
For an introduction to Hive, please refer to my following article: Big Data Hadoop - Data Warehouse Hive
2. Preliminary preparation
1) Deploy docker
# Install the yum-config-manager tool
yum -y install yum-utils
# The Aliyun yum mirror is recommended; the official repo is left commented out
#yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
# Install docker-ce
yum install -y docker-ce
# Start Docker now and enable it at boot
systemctl enable --now docker
docker --version
2) Deploy docker-compose
curl -SL https://github.com/docker/compose/releases/download/v2.16.0/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
docker-compose --version
3. Create a network
# Create the network. Do not name it hadoop_network, otherwise the hs2 (HiveServer2) service will have problems at startup!
docker network create hadoop-network
# List networks
docker network ls
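Running `docker network create` a second time fails because the network already exists. The following is a minimal sketch of an idempotent variant; the function name `ensure_network` is hypothetical and not part of the original setup.

```shell
# Create the shared Docker network only if it does not exist yet.
# The default name is the one used in this article.
ensure_network() {
    network_name="${1:-hadoop-network}"
    if docker network inspect "$network_name" >/dev/null 2>&1; then
        echo "network $network_name already exists"
    else
        docker network create "$network_name" >/dev/null &&
            echo "network $network_name created"
    fi
}

# Usage (requires a running Docker daemon):
#   ensure_network hadoop-network
```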
4. MySQL deployment
1) MySQL image
docker pull mysql:5.7
docker tag mysql:5.7 registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/mysql:5.7
2) Configuration
mkdir -p conf/ data/db/
cat >conf/my.cnf<<EOF
[mysqld]
character-set-server=utf8
log-bin=mysql-bin
server-id=1
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
datadir = /var/lib/mysql
sql_mode=STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
symbolic-links=0
secure_file_priv =
wait_timeout=120
interactive_timeout=120
default-time_zone = '+8:00'
skip-external-locking
skip-name-resolve
open_files_limit = 10240
max_connections = 1000
max_connect_errors = 6000
table_open_cache = 800
max_allowed_packet = 40m
sort_buffer_size = 2M
join_buffer_size = 1M
thread_cache_size = 32
query_cache_size = 64M
transaction_isolation = READ-COMMITTED
tmp_table_size = 128M
max_heap_table_size = 128M
log-bin = mysql-bin
sync-binlog = 1
binlog_format = ROW
binlog_cache_size = 1M
key_buffer_size = 128M
read_buffer_size = 2M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 64M
lower_case_table_names = 1
explicit_defaults_for_timestamp=true
skip_name_resolve = ON
event_scheduler = ON
log_bin_trust_function_creators = 1
innodb_buffer_pool_size = 512M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_log_buffer_size = 4M
innodb_log_file_size = 256M
innodb_max_dirty_pages_pct = 90
innodb_read_io_threads = 4
innodb_write_io_threads = 4
EOF
3) Orchestration (mysql-compose.yaml)
version: '3'
services:
  db:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/mysql:5.7 # MySQL version
    container_name: mysql
    hostname: mysql
    volumes:
      - ./data/db:/var/lib/mysql
      - ./conf/my.cnf:/etc/mysql/mysql.conf.d/mysqld.cnf
    restart: always
    ports:
      - 13306:3306
    networks:
      - hadoop-network
    environment:
      MYSQL_ROOT_PASSWORD: 123456 # root password
      secure_file_priv:
    healthcheck:
      # mysqladmin ships with the MySQL image; curl may not be installed and does not speak the MySQL protocol
      test: ["CMD-SHELL", "mysqladmin ping -h 127.0.0.1 -uroot -p$$MYSQL_ROOT_PASSWORD || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3
# Attach to the external network
networks:
  hadoop-network:
    external: true
4) Deploy mysql
docker-compose -f mysql-compose.yaml up -d
docker-compose -f mysql-compose.yaml ps
# Log in to MySQL inside the container
docker exec -it mysql mysql -uroot -p123456
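Because the compose file defines a healthcheck for the mysql container, later steps can wait until Docker reports the container as healthy instead of guessing when MySQL is ready. This is a minimal sketch, assuming the healthcheck command actually works inside your image; `wait_healthy` is a hypothetical helper name.

```shell
# Poll the health status that Docker records for a container until it is
# "healthy", or give up after a number of attempts.
# Arguments: container name, max attempts (default 30), delay in seconds (default 2).
wait_healthy() {
    container="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
        if [ "$status" = "healthy" ]; then
            echo "$container is healthy"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "$container did not become healthy (last status: ${status:-unknown})" >&2
    return 1
}

# Usage:
#   wait_healthy mysql 30 2 && echo "MySQL is ready for the metastore"
```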
5. Hive deployment
1) Download hive
Download address: http://archive.apache.org/dist/hive
# Download
wget http://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
# Extract
tar -zxvf apache-hive-3.1.3-bin.tar.gz
2) Configuration
images/hive-config/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- HDFS warehouse directory -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive_remote/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <!-- Address of the MySQL database to connect to. hive_metastore is the database name; it is created automatically and can be customized. Note that & must be written as &amp; inside XML. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql:3306/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;serverTimezone=Asia/Shanghai</value>
  </property>
  <!-- MySQL driver -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <!--<value>com.mysql.cj.jdbc.Driver</value>-->
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- MySQL user -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <!-- MySQL password -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <!-- Whether to verify the metastore schema version -->
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>root</value>
    <description>user name</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hive-metastore:9083</value>
  </property>
  <!-- host -->
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>0.0.0.0</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
  </property>
  <!-- HiveServer2 port, default 10000 -->
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.active.passive.ha.enable</name>
    <value>true</value>
  </property>
</configuration>
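In XML, a literal `&` inside a value has to be written as `&amp;`, and JDBC URLs with several query parameters are a common place to trip over this. The sketch below lists lines of a file that still contain a bare `&`; the function name `find_bare_ampersands` is hypothetical, and the check is a plain text filter rather than a full XML validation.

```shell
# Print (with line numbers) every line of an XML file that contains an '&'
# which is not part of an entity or character reference such as &amp; or &#38;.
# Exit status is 0 when such a line is found and 1 when the file is clean.
find_bare_ampersands() {
    sed -E 's/&(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);//g' "$1" | grep -n '&'
}

# Usage:
#   find_bare_ampersands images/hive-config/hive-site.xml || echo "no bare ampersands"
```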
3) Start script
#!/usr/bin/env sh

# Block until host $1 accepts TCP connections on port $2
wait_for() {
    echo "Waiting for $1 to listen on $2..."
    while ! nc -z "$1" "$2"; do echo waiting...; sleep 1s; done
}

start_hdfs_namenode() {
    # Format the namenode only on the first start
    if [ ! -f /tmp/namenode-formated ]; then
        ${HADOOP_HOME}/bin/hdfs namenode -format >/tmp/namenode-formated
    fi
    ${HADOOP_HOME}/bin/hdfs --loglevel INFO --daemon start namenode
    tail -f ${HADOOP_HOME}/logs/*namenode*.log
}

start_hdfs_datanode() {
    wait_for "$1" "$2"
    ${HADOOP_HOME}/bin/hdfs --loglevel INFO --daemon start datanode
    tail -f ${HADOOP_HOME}/logs/*datanode*.log
}

start_yarn_resourcemanager() {
    ${HADOOP_HOME}/bin/yarn --loglevel INFO --daemon start resourcemanager
    tail -f ${HADOOP_HOME}/logs/*resourcemanager*.log
}

start_yarn_nodemanager() {
    wait_for "$1" "$2"
    ${HADOOP_HOME}/bin/yarn --loglevel INFO --daemon start nodemanager
    tail -f ${HADOOP_HOME}/logs/*nodemanager*.log
}

start_yarn_proxyserver() {
    wait_for "$1" "$2"
    ${HADOOP_HOME}/bin/yarn --loglevel INFO --daemon start proxyserver
    tail -f ${HADOOP_HOME}/logs/*proxyserver*.log
}

start_mr_historyserver() {
    wait_for "$1" "$2"
    ${HADOOP_HOME}/bin/mapred --loglevel INFO --daemon start historyserver
    tail -f ${HADOOP_HOME}/logs/*historyserver*.log
}

start_hive_metastore() {
    # Wait for the host and port passed in by the compose command, if any
    if [ -n "$1" ] && [ -n "$2" ]; then
        wait_for "$1" "$2"
    fi
    # Initialize the metastore schema only on the first start
    if [ ! -f ${HIVE_HOME}/formated ]; then
        schematool -initSchema -dbType mysql --verbose > ${HIVE_HOME}/formated
    fi
    $HIVE_HOME/bin/hive --service metastore
}

start_hive_hiveserver2() {
    # Wait for the host and port passed in by the compose command, if any
    if [ -n "$1" ] && [ -n "$2" ]; then
        wait_for "$1" "$2"
    fi
    $HIVE_HOME/bin/hive --service hiveserver2
}

case $1 in
    hadoop-hdfs-nn)
        start_hdfs_namenode
        ;;
    hadoop-hdfs-dn)
        start_hdfs_datanode "$2" "$3"
        ;;
    hadoop-yarn-rm)
        start_yarn_resourcemanager
        ;;
    hadoop-yarn-nm)
        start_yarn_nodemanager "$2" "$3"
        ;;
    hadoop-yarn-proxyserver)
        start_yarn_proxyserver "$2" "$3"
        ;;
    hadoop-mr-historyserver)
        start_mr_historyserver "$2" "$3"
        ;;
    hive-metastore)
        start_hive_metastore "$2" "$3"
        ;;
    hive-hiveserver2)
        start_hive_hiveserver2 "$2" "$3"
        ;;
    *)
        echo "Please enter a valid service start command"
        ;;
esac
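The `wait_for` helper in the start script loops until the target port opens, so a dependency that never comes up keeps the container waiting forever. A possible variant with an upper bound is sketched below; `wait_for_with_timeout` and its third argument are hypothetical additions, not part of the original script, and like the original it relies on `nc` being installed.

```shell
# Variant of wait_for that gives up after a maximum number of attempts
# instead of looping forever.
# Arguments: host, port, max attempts (default 60, one attempt per second).
wait_for_with_timeout() {
    host="$1"
    port="$2"
    max_attempts="${3:-60}"
    attempt=0
    echo "Waiting for $host to listen on $port..."
    until nc -z "$host" "$port"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Gave up waiting for $host:$port after $attempt attempts" >&2
            return 1
        fi
        sleep 1
    done
    echo "$host:$port is reachable"
}

# Usage:
#   wait_for_with_timeout hive-metastore 9083 120 || exit 1
```

Returning a non-zero status lets the container exit, so that the `restart: always` policy can retry instead of leaving the process stuck.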
4) Dockerfile for building the image
FROM registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop:v1
COPY hive-config/* ${HIVE_HOME}/conf/
COPY bootstrap.sh /opt/apache/
COPY mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar ${HIVE_HOME}/lib/
RUN sudo mkdir -p /home/hadoop/ && sudo chown -R hadoop:hadoop /home/hadoop/
#RUN yum -y install which
Start building the image
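# Build the image
docker build -t registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1 . --no-cache
# Push the image (optional)
docker push registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
### Parameter notes
# -t: name (and tag) of the image
# . : build context; the Dockerfile in the current directory is used
# -f: path to the Dockerfile
# --no-cache: build without using the cache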
5) Orchestration (docker-compose.yaml)
version: '3'
services:
  hadoop-hdfs-nn:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-hdfs-nn
    hostname: hadoop-hdfs-nn
    restart: always
    privileged: true
    env_file:
      - .env
    ports:
      - "30070:${HADOOP_HDFS_NN_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-hdfs-nn"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:${HADOOP_HDFS_NN_PORT} || exit 1"]
      interval: 20s
      timeout: 20s
      retries: 3
  hadoop-hdfs-dn-0:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-hdfs-dn-0
    hostname: hadoop-hdfs-dn-0
    restart: always
    depends_on:
      - hadoop-hdfs-nn
    env_file:
      - .env
    ports:
      - "30864:${HADOOP_HDFS_DN_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-hdfs-dn hadoop-hdfs-nn ${HADOOP_HDFS_NN_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:${HADOOP_HDFS_DN_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hadoop-hdfs-dn-1:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-hdfs-dn-1
    hostname: hadoop-hdfs-dn-1
    restart: always
    depends_on:
      - hadoop-hdfs-nn
    env_file:
      - .env
    ports:
      - "30865:${HADOOP_HDFS_DN_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-hdfs-dn hadoop-hdfs-nn ${HADOOP_HDFS_NN_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:${HADOOP_HDFS_DN_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hadoop-hdfs-dn-2:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-hdfs-dn-2
    hostname: hadoop-hdfs-dn-2
    restart: always
    depends_on:
      - hadoop-hdfs-nn
    env_file:
      - .env
    ports:
      - "30866:${HADOOP_HDFS_DN_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-hdfs-dn hadoop-hdfs-nn ${HADOOP_HDFS_NN_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:${HADOOP_HDFS_DN_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hadoop-yarn-rm:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-yarn-rm
    hostname: hadoop-yarn-rm
    restart: always
    env_file:
      - .env
    ports:
      - "30888:${HADOOP_YARN_RM_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-yarn-rm"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HADOOP_YARN_RM_PORT} || exit 1"]
      interval: 20s
      timeout: 20s
      retries: 3
  hadoop-yarn-nm-0:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-yarn-nm-0
    hostname: hadoop-yarn-nm-0
    restart: always
    depends_on:
      - hadoop-yarn-rm
    env_file:
      - .env
    ports:
      - "30042:${HADOOP_YARN_NM_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-yarn-nm hadoop-yarn-rm ${HADOOP_YARN_RM_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:${HADOOP_YARN_NM_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hadoop-yarn-nm-1:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-yarn-nm-1
    hostname: hadoop-yarn-nm-1
    restart: always
    depends_on:
      - hadoop-yarn-rm
    env_file:
      - .env
    ports:
      - "30043:${HADOOP_YARN_NM_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-yarn-nm hadoop-yarn-rm ${HADOOP_YARN_RM_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:${HADOOP_YARN_NM_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hadoop-yarn-nm-2:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-yarn-nm-2
    hostname: hadoop-yarn-nm-2
    restart: always
    depends_on:
      - hadoop-yarn-rm
    env_file:
      - .env
    ports:
      - "30044:${HADOOP_YARN_NM_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-yarn-nm hadoop-yarn-rm ${HADOOP_YARN_RM_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:${HADOOP_YARN_NM_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hadoop-yarn-proxyserver:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-yarn-proxyserver
    hostname: hadoop-yarn-proxyserver
    restart: always
    depends_on:
      - hadoop-yarn-rm
    env_file:
      - .env
    ports:
      - "30911:${HADOOP_YARN_PROXYSERVER_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-yarn-proxyserver hadoop-yarn-rm ${HADOOP_YARN_RM_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HADOOP_YARN_PROXYSERVER_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hadoop-mr-historyserver:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hadoop-mr-historyserver
    hostname: hadoop-mr-historyserver
    restart: always
    depends_on:
      - hadoop-yarn-rm
    env_file:
      - .env
    ports:
      - "31988:${HADOOP_MR_HISTORYSERVER_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hadoop-mr-historyserver hadoop-yarn-rm ${HADOOP_YARN_RM_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HADOOP_MR_HISTORYSERVER_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  hive-metastore:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hive-metastore
    hostname: hive-metastore
    restart: always
    depends_on:
      - hadoop-hdfs-dn-2
    env_file:
      - .env
    ports:
      - "30983:${HIVE_METASTORE_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hive-metastore hadoop-hdfs-dn-2 ${HADOOP_HDFS_DN_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HIVE_METASTORE_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 5
  hive-hiveserver2:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hadoop_hive:v1
    user: "hadoop:hadoop"
    container_name: hive-hiveserver2
    hostname: hive-hiveserver2
    restart: always
    depends_on:
      - hive-metastore
    env_file:
      - .env
    ports:
      - "31000:${HIVE_HIVESERVER2_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hive-hiveserver2 hive-metastore ${HIVE_METASTORE_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HIVE_HIVESERVER2_PORT} || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 5
# Attach to the external network
networks:
  hadoop-network:
    external: true
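The compose file loads a `.env` file and interpolates variables such as `${HADOOP_HDFS_NN_PORT}`, but the content of that file is not listed in this article. The sketch below writes a hypothetical `.env`; every value is an assumption, based on the usual Hadoop 3 web ports, the ports set in hive-site.xml, and the host port mappings in the compose file. Check them against your own Hadoop configuration before use.

```shell
# Hypothetical .env for the compose file. All values are assumptions:
# common Hadoop 3 web UI ports plus the Hive ports from hive-site.xml.
cat > .env <<'EOF'
HADOOP_HDFS_NN_PORT=9870
HADOOP_HDFS_DN_PORT=9864
HADOOP_YARN_RM_PORT=8088
HADOOP_YARN_NM_PORT=8042
HADOOP_YARN_PROXYSERVER_PORT=9111
HADOOP_MR_HISTORYSERVER_PORT=19888
HIVE_METASTORE_PORT=9083
HIVE_HIVESERVER2_PORT=10000
EOF
```

The file has to sit next to docker-compose.yaml, since docker-compose reads `.env` from the project directory when it substitutes `${...}` variables.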
6) Start deployment
docker-compose -f docker-compose.yaml up -d
# Check status
docker-compose -f docker-compose.yaml ps
Simple test verification
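One way to verify the deployment is to run a statement through HiveServer2 with beeline. The following is a minimal sketch: the container name, host and port come from the compose file and hive-site.xml, while the user name `hadoop`, the presence of `beeline` on the container's PATH, and the helper name `hive_query` are assumptions.

```shell
# Run one HiveQL statement through HiveServer2 from inside the container.
# The user name "hadoop" is an assumption based on the compose "user" setting.
hive_query() {
    docker exec hive-hiveserver2 beeline \
        -u "jdbc:hive2://hive-hiveserver2:10000" \
        -n hadoop \
        -e "$1"
}

# Usage, once hive-hiveserver2 is up:
#   hive_query "show databases;"
#   hive_query "create database if not exists demo;"
```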
[Problem] If an error like the one below appears, the cause is repeated startups: the data from the previous run is still present, but the datanode's IP address has changed. (A deployment directly on the host does not have this problem, because the host's IP is fixed.) You can either refresh the nodes or clean up the old data. Cleaning up the old data is not recommended; refreshing the nodes is the better choice when the data is mounted externally. Nothing is mounted externally in my setup, so here the error occurs because the old containers still exist. Several solutions are listed below:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException): Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(172.30.0.12:9866, datanodeUuid=f8188476-4a88-4cd6-836f-769d510929e4, infoPort=9864, infoSecurePort=0, ipcPort=9867, storageInfo=lv=-57;cid=CID-f998d368-222c-4a9a-88a5-85497a82dcac;nsid=1840040096;c=1680661390829)
[Solution]
- Delete the old container and restart
# Remove old containers
docker rm `docker ps -a|grep 'Exited'|awk '{print $1}'`
# Restart the services
docker-compose -f docker-compose.yaml up -d
# Check status
docker-compose -f docker-compose.yaml ps
- Log in to the namenode and refresh the datanodes
docker exec -it hadoop-hdfs-nn hdfs dfsadmin -refreshNodes
- Log in to any node to refresh the datanode
# Using hadoop-hdfs-dn-0 as an example
docker exec -it hadoop-hdfs-dn-0 hdfs dfsadmin -fs hdfs://hadoop-hdfs-nn:9000 -refreshNodes
At this point, the containerized deployment of Hive is complete. If you have any questions, feel free to leave me a message. I will keep publishing related technical articles. You can also follow my public account [Big Data and Cloud Native Technology Sharing] for more in-depth technical discussion, or send me a private message with your questions.