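Tutorial Contents

0x00 Tutorial Overview
0x01 Writing the Dockerfile

1. Writing the Dockerfile
2. Key points when writing the Dockerfile
3. Complete Dockerfile for reference

0x02 Preparation Before Verifying Kafka

1. Environment and resource preparation

0x03 Verifying That Kafka Is Installed Correctly

1. Modify the container startup script
2. Build the image
3. Create the containers
4. Start Kafka

0xFF Summary

0x00 Tutorial Overview

Writing the Dockerfile
Preparation before verifying Kafka
Verifying that Kafka is installed correctly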

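0x01 Writing the Dockerfile

1. Writing the Dockerfile

For convenience, I made a copy of the flume_sny_all directory and named it kafka_sny_all.
a. Kafka installation steps
Reference article: D011 Copy-and-Paste Big Data: Installing and Configuring a Kafka Cluster

Regular installation	Dockerfile installation
1. Put the installation package in the container	1. Add the installation package and extract it
2. Extract and configure Kafka	2. Add environment variables
3. Add environment variables	3. Add configuration files (including environment variables)
4. Sync to all nodes and start	4. Sync to all nodes and start

The installation content is the same either way; the table only reorders it to match the steps I follow here.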

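2. Key points when writing the Dockerfile

Compared with "0x01 3. a. Dockerfile reference file" in D010 Copy-and-Paste Big Data: Installing a Flume Cluster with a Dockerfile, the differences are as follows.
Steps:
a. Add the installation package and extract it (the ADD instruction extracts a local tar archive automatically)

#Add Kafka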
ADD ./kafka_2.11-1.0.0.tgz /usr/local/

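b. Add environment variables (KAFKA_HOME, PATH)

#Kafka environment variable
ENV KAFKA_HOME /usr/local/kafka_2.11-1.0.0

#Append this to the PATH value
$KAFKA_HOME/bin:

c. Add the configuration files (note: append "&& \" to what was previously the last line, to indicate that the command continues)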

&& \
mv /tmp/server.properties $KAFKA_HOME/config/server.properties && \
mv /tmp/init_kafka.sh ~/init_kafka.sh

d. Set permissions on the Kafka initialization script

#Change init_kafka.sh permissions to 700
RUN chmod 700 init_kafka.sh
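
Because ADD unpacks a local tar archive into the target directory, KAFKA_HOME has to match the archive's top-level directory name. The following is a minimal sketch for checking that on the host before building, assuming tar is installed there. The helper name archive_top_dir is made up for this example and is not part of Docker or Kafka.

```shell
#!/bin/bash
# Print the top-level directory name inside a .tgz/.tar.gz archive.
# archive_top_dir is a hypothetical helper used only for this sketch.
archive_top_dir() {
    # List the archive, take the first entry, drop a leading "./" if present,
    # then keep only the first path component.
    tar -tzf "$1" | head -n 1 | sed 's#^\./##' | cut -d/ -f1
}

# Usage (run next to the Dockerfile):
#   archive_top_dir ./kafka_2.11-1.0.0.tgz
# The printed name should equal the last path component of KAFKA_HOME.
```

If the printed name differs from the directory in ENV KAFKA_HOME, the later mv commands that target $KAFKA_HOME/config would fail during the build.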

3. Complete Dockerfile for reference

a. Installs hadoop, spark, zookeeper, hbase, hive, flume, and kafka

FROM ubuntu
MAINTAINER shaonaiyi [email protected]

ENV BUILD_ON 2019-03-12

RUN apt-get update -qqy

RUN apt-get -qqy install vim wget net-tools  iputils-ping  openssh-server
#Add JDK
ADD ./jdk-8u161-linux-x64.tar.gz /usr/local/
#Add Hadoop
ADD ./hadoop-2.7.5.tar.gz  /usr/local/
#Add Scala
ADD ./scala-2.11.8.tgz /usr/local/
#Add Spark (archive name assumed to match SPARK_HOME below)
ADD ./spark-2.2.0-bin-hadoop2.7.tgz /usr/local/
#Add ZooKeeper
ADD ./zookeeper-3.4.10.tar.gz /usr/local/
#Add HBase
ADD ./hbase-1.2.6-bin.tar.gz /usr/local/
#Add Hive
ADD ./apache-hive-2.3.3-bin.tar.gz /usr/local/
#Add Flume
ADD ./apache-flume-1.8.0-bin.tar.gz /usr/local/
#Add Kafka
ADD ./kafka_2.11-1.0.0.tgz /usr/local/

ENV CHECKPOINT 2019-03-12
#JAVA_HOME environment variable
ENV JAVA_HOME /usr/local/jdk1.8.0_161
#Hadoop environment variable
ENV HADOOP_HOME /usr/local/hadoop-2.7.5
#Scala environment variable
ENV SCALA_HOME /usr/local/scala-2.11.8
#Spark environment variable
ENV SPARK_HOME /usr/local/spark-2.2.0-bin-hadoop2.7
#ZooKeeper environment variable
ENV ZK_HOME /usr/local/zookeeper-3.4.10
#HBase environment variable
ENV HBASE_HOME /usr/local/hbase-1.2.6
#Hive environment variable
ENV HIVE_HOME /usr/local/apache-hive-2.3.3-bin
#Flume environment variable
ENV FLUME_HOME /usr/local/apache-flume-1.8.0-bin
#Kafka environment variable
ENV KAFKA_HOME /usr/local/kafka_2.11-1.0.0
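#Add the environment variables to the system PATH (entry list assumed; the text above only confirms the $KAFKA_HOME/bin: prefix)
ENV PATH $KAFKA_HOME/bin:$FLUME_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin:$ZK_HOME/bin:$SPARK_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH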

RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 600 ~/.ssh/authorized_keys
#Copy the configuration files to /tmp
COPY config /tmp
#Move the configuration files to their proper locations
RUN mv /tmp/ssh_config    ~/.ssh/config && \
    mv /tmp/profile /etc/profile && \
    mv /tmp/masters $SPARK_HOME/conf/masters && \
    cp /tmp/slaves $SPARK_HOME/conf/ && \
    mv /tmp/spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf && \
    mv /tmp/spark-env.sh $SPARK_HOME/conf/spark-env.sh && \
    mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
    mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
    mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
    mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
    mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
    mv /tmp/master $HADOOP_HOME/etc/hadoop/master && \
    mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
    mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
    mv /tmp/init_zk.sh ~/init_zk.sh && \
    mkdir -p /usr/local/hadoop2.7/dfs/data && \
    mkdir -p /usr/local/hadoop2.7/dfs/name && \
    mkdir -p /usr/local/zookeeper-3.4.10/datadir && \
    mkdir -p /usr/local/zookeeper-3.4.10/log && \
    mv /tmp/zoo.cfg $ZK_HOME/conf/zoo.cfg && \
    mv /tmp/hbase-env.sh $HBASE_HOME/conf/hbase-env.sh && \
    mv /tmp/hbase-site.xml $HBASE_HOME/conf/hbase-site.xml  && \
    mv /tmp/regionservers $HBASE_HOME/conf/regionservers && \
    mv /tmp/hive-env.sh $HIVE_HOME/conf/hive-env.sh && \
    mv /tmp/flume-env.sh $FLUME_HOME/conf/flume-env.sh && \
    mv /tmp/server.properties $KAFKA_HOME/config/server.properties && \
    mv /tmp/init_kafka.sh ~/init_kafka.sh

RUN echo $JAVA_HOME
#Set the working directory
WORKDIR /root
#Start the sshd service
RUN /etc/init.d/ssh start
#Change start-hadoop.sh permissions to 700
RUN chmod 700 start-hadoop.sh
#Change init_zk.sh permissions to 700
RUN chmod 700 init_zk.sh
#Change init_kafka.sh permissions to 700
RUN chmod 700 init_kafka.sh
#Change the root password
RUN echo "root:shaonaiyi" | chpasswd
CMD ["/bin/bash"]

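0x02 Preparation Before Verifying Kafka

1. Environment and resource preparation

a. Install Docker
See the "0x01 Installing Docker" section of D001.5 Getting Started with Docker (Very Detailed Basics)
b. Prepare the Kafka installation package and place it in the same directory as the Dockerfile
c. Prepare the Kafka configuration file (placed in the config directory)
cd /home/shaonaiyi/docker_bigdata/kafka_sny_all/config
Configuration file 1: vi server.properties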

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/root/logs/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1, such as 3, is recommended to ensure availability.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
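# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.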

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=hadoop-master:2181,hadoop-slave1:2181,hadoop-slave2:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0
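
Most of this file is the stock template, so the few lines that matter for the cluster are easy to miss. The following is a small sketch for listing only the active settings. The function name active_settings is an assumption made for illustration.

```shell
#!/bin/bash
# List only the active (non-comment, non-blank) lines of a properties file.
# active_settings is a hypothetical helper name used for this sketch.
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Usage:
#   active_settings config/server.properties
```

Compared with the template shipped with Kafka, the lines that differ here appear to be log.dirs and zookeeper.connect; broker.id is adjusted per node later by init_kafka.sh.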

d. Modify the environment variable configuration file (placed in the config directory)
Configuration file 2: vi profile
Add the following:

export KAFKA_HOME=/usr/local/kafka_2.11-1.0.0
export PATH=$PATH:$KAFKA_HOME/bin
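
Since the profile file is carried over from the earlier tutorials and edited by hand each time, it is easy to append the same exports twice. The following is a minimal sketch that appends them only when they are missing. The helper name add_kafka_env is hypothetical; the two export lines are the ones shown above.

```shell
#!/bin/bash
# Append the Kafka exports to a profile file only if they are not there yet.
# add_kafka_env is a hypothetical helper name used for this sketch.
add_kafka_env() {
    local profile="$1"
    if ! grep -q '^export KAFKA_HOME=' "$profile"; then
        # Quoted heredoc delimiter: $PATH and $KAFKA_HOME are written literally.
        cat >> "$profile" <<'EOF'
export KAFKA_HOME=/usr/local/kafka_2.11-1.0.0
export PATH=$PATH:$KAFKA_HOME/bin
EOF
    fi
}

# Usage:
#   add_kafka_env config/profile
```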

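e. Write the Kafka initialization script and place it in the config directory (each node needs its own Kafka broker.id)
vi init_kafka.sh

#!/bin/bash
# Replace the string "broker.id=0" in the file with "broker.id=x"; the master line changes nothing and can be deleted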
ssh root@hadoop-master "sed -i 's/broker.id=0/broker.id=0/g' $KAFKA_HOME/config/server.properties"
ssh root@hadoop-slave1 "sed -i 's/broker.id=0/broker.id=1/g' $KAFKA_HOME/config/server.properties"
ssh root@hadoop-slave2 "sed -i 's/broker.id=0/broker.id=2/g' $KAFKA_HOME/config/server.properties"
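
The script above matches the literal text broker.id=0, so it only works while the file still holds the template value. As a possible variant, the sketch below rewrites whatever id is currently set, which makes it safe to run more than once. It assumes GNU sed (the -i flag without a suffix), which is what the Ubuntu base image provides; the function name set_broker_id is made up for this example.

```shell
#!/bin/bash
# Rewrite the broker.id line of a properties file to the given id.
# Unlike matching "broker.id=0", this matches any current value,
# so running it twice with the same id gives the same result.
# set_broker_id is a hypothetical helper; it assumes GNU sed.
set_broker_id() {
    local file="$1" id="$2"
    sed -i "s/^broker\.id=.*/broker.id=${id}/" "$file"
}

# Try it on a scratch copy first:
#   cp config/server.properties /tmp/server.properties.test
#   set_broker_id /tmp/server.properties.test 2
#   grep '^broker.id=' /tmp/server.properties.test
```

The pattern is anchored with ^ so that the comment line mentioning the broker id is left untouched.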

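0x03 Verifying That Kafka Is Installed Correctly

1. Modify the container startup script

a. Edit start_containers.sh (change the image name to shaonaiyi/kafka)
vi start_containers.sh
I changed the three occurrences of shaonaiyi/flume to shaonaiyi/kafka.
PS: You could of course create a new network and new IPs; I took a shortcut here and reused the old network, changing only the IPs.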
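
The image rename can also be done without opening an editor. This is a sketch only: the contents of start_containers.sh are not shown in this tutorial, so the function simply replaces every occurrence of one image name with another. It assumes GNU sed, and rename_image is a hypothetical helper name.

```shell
#!/bin/bash
# Replace every occurrence of the old image name with the new one, in place.
# '#' is used as the sed delimiter because image names contain '/'.
# rename_image is a hypothetical helper; it assumes GNU sed.
rename_image() {
    local script="$1" old="$2" new="$3"
    sed -i "s#${old}#${new}#g" "$script"
}

# Usage:
#   rename_image start_containers.sh shaonaiyi/flume shaonaiyi/kafka
#   grep -c 'shaonaiyi/kafka' start_containers.sh   # count the lines that now use the new name
```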

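2. Build the image

a. Remove the earlier Flume cluster containers (to save resources); skip this step if they are already removed
cd /home/shaonaiyi/docker_bigdata/flume_sny_all/config/
If the script was copied over, this command can be skipped: chmod 700 stop_containers.sh
./stop_containers.sh
b. Build the image with hadoop, spark, zookeeper, hbase, hive, flume, and kafka installed (if the earlier shaonaiyi/flume image has not been deleted, this build will be much faster because Docker can reuse its cached layers)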
cd /home/shaonaiyi/docker_bigdata/kafka_sny_all
docker build -t shaonaiyi/kafka .
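
A build fails partway through if a file named in ADD or mv is missing from the build context. The sketch below checks for the Kafka-related files from this tutorial before running docker build. The file list is limited to what section 0x02 adds, and the function name missing_kafka_files is an assumption made for this example.

```shell
#!/bin/bash
# Report which Kafka-related files expected by the Dockerfile are missing
# from the build context. Returns 0 if all are present, 1 otherwise.
# missing_kafka_files is a hypothetical helper used only for this sketch.
missing_kafka_files() {
    local ctx="$1" f missing=0
    for f in kafka_2.11-1.0.0.tgz \
             config/server.properties \
             config/init_kafka.sh \
             config/profile; do
        if [ ! -f "$ctx/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   missing_kafka_files /home/shaonaiyi/docker_bigdata/kafka_sny_all \
#       && docker build -t shaonaiyi/kafka .
```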

3. Create the containers

a. Create the containers (grant execute permission to start_containers.sh if it lacks it):
config/start_containers.sh
b. Enter the master container
sh ~/master.sh

4. Start Kafka

a. First make sure the ZooKeeper cluster is already running
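b. Start Kafka
The first start requires initialization:
./init_kafka.sh
Start command (run it on all three nodes; you can write a script to do this):
kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
Check the processes: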
./jps_all.sh
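
The start command has to be run on all three nodes, which is a natural fit for a small loop over ssh. The following is one possible sketch, not the author's script. It assumes the host names from zookeeper.connect, passwordless ssh as root (set up in the Dockerfile), and that /etc/profile on each node exports JAVA_HOME. The DRY_RUN switch is an addition of this sketch for previewing the commands.

```shell
#!/bin/bash
# Start the Kafka broker on every node over ssh.
# Host names follow zookeeper.connect in server.properties.
# DRY_RUN is specific to this sketch: set DRY_RUN=1 to print the
# commands instead of running them.
KAFKA_HOME="${KAFKA_HOME:-/usr/local/kafka_2.11-1.0.0}"
HOSTS="hadoop-master hadoop-slave1 hadoop-slave2"

start_kafka_all() {
    local host cmd
    # Source /etc/profile first because a non-interactive ssh session
    # does not load it (assumption: the profile exports JAVA_HOME).
    cmd="source /etc/profile; $KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties"
    for host in $HOSTS; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh root@$host \"$cmd\""
        else
            ssh "root@$host" "$cmd"
        fi
    done
}

# Usage:
#   DRY_RUN=1 start_kafka_all   # preview the three ssh commands
#   start_kafka_all             # start the brokers
```

After it finishes, ./jps_all.sh should show a Kafka process on each node.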

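0xFF Summary

The installation is simple once you know the steps; if anything is unclear, see the article D011 Copy-and-Paste Big Data: Installing and Configuring a Kafka Cluster
This tutorial also introduced a new script. It changes broker.id=0 to broker.id=1, and the same pattern can be copied for other values:
"sed -i 's/broker.id=0/broker.id=1/g' $KAFKA_HOME/config/server.properties"
For common Dockerfile instructions, see the article D004.1 Dockerfile Examples Explained and Common Instructions
At this point the basic big data framework stack is set up, so I can happily start writing tutorials on the underlying principles [laughing-through-tears.jpg]. It still needs some tuning, of course, which I will come back to when I have time.

About the author: 邵奈一 (Shao Naiyi)
University big data lecturer, university market observer, column editor
WeChat official account, Weibo, CSDN: 邵奈一
All lessons in this series are original work by me, 邵奈一; please credit the source when reposting

D012 Copy-and-Paste Big Data: Installing a Kafka Cluster with a Dockerfile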