Flume + ZooKeeper + Kafka + Spark Streaming

1. Flume installation and deployment


1.1 Download the installation package and extract it:

cd /usr/local/
wget http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.7.0.tar.gz
tar -zxvf flume-ng-1.6.0-cdh5.7.0.tar.gz

ln -s apache-flume-1.6.0-cdh5.7.0-bin/ flume


1.2 Configure Flume

cd /usr/local/flume/conf
cp flume-env.sh.template flume-env.sh
vi flume-env.sh   [add the JAVA_HOME path]
export JAVA_HOME=/usr/java/jdk1.8.0_152

vi /etc/profile

export FLUME_HOME=/usr/local/flume
export PATH=$FLUME_HOME/bin:$PATH

source /etc/profile


1.3 Verify the Flume installation and write a configuration file

Create a file named example.conf. Note: this configuration uses a netcat source, a memory channel, and a logger sink.

vi example.conf

# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.0.186
a1.sources.r1.port = 45678


# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k1.maxBytesToLog = 10


# Use a channel which buffers events in memory
a1.channels.c1.type = memory


# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


1.4 Test and verify

Start the Flume agent:

flume-ng agent --name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/example.conf -Dflume.root.logger=INFO,console

Install telnet and connect to the netcat source:
yum install -y telnet
telnet 192.168.0.186 45678
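
If telnet is not available, the same check can be done with nc (a sketch; assumes an nc/ncat package is installed, and note that the logger sink above prints at most the first 10 bytes of each event because of maxBytesToLog = 10):

echo "hello flume" | nc 192.168.0.186 45678    # Ctrl-C to exit if nc stays connected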


2. ZooKeeper deployment


Kafka stores its metadata in ZooKeeper, so ZK is a hard requirement: without ZK, Kafka cannot run. In older versions (before 0.9) the consumer also depended on ZK; the new consumer client removed that dependency, but the broker still relies on ZK. ZK must therefore be deployed and running before Kafka is configured.
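
For example, once Kafka is running (section 3), its metadata is visible directly in ZK through the zkCli.sh shell that ships with ZooKeeper (a sketch; the /kafka chroot matches the zookeeper.connect value used later in server.properties):

zkCli.sh -server 192.168.0.186:2181
ls /kafka               # znodes such as brokers, controller, config
ls /kafka/brokers/ids   # ids of the live brokers, e.g. [1]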

2.1 Download the installation package and extract it:

cd /usr/local/

wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

tar -zxvf zookeeper-3.4.6.tar.gz


2.2 Configure ZooKeeper

chown -R root:root zookeeper-3.4.6
ln -s zookeeper-3.4.6 zookeeper
vi /etc/profile
export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$ZOOKEEPER_HOME/bin:$PATH
source /etc/profile
cd zookeeper
mkdir data
cd conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
dataDir=/usr/local/zookeeper/data
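
The resulting standalone zoo.cfg then looks roughly like this (a sketch; tickTime, initLimit, syncLimit and clientPort are the defaults from zoo_sample.cfg, only dataDir is changed):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/zookeeper/data
clientPort=2181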


2.3 Start ZooKeeper

Server commands: zkServer.sh start / stop / status
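
A quick liveness check using ZooKeeper's built-in four-letter commands (a sketch; assumes nc is installed):

echo ruok | nc 192.168.0.186 2181    # a healthy server answers imok
echo stat | nc 192.168.0.186 2181    # shows version, mode (standalone) and client connections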


3. Kafka deployment


3.1 Download the installation package and extract it:

cd /usr/local/

wget https://archive.apache.org/dist/kafka/0.10.0.1/kafka_2.11-0.10.0.1.tgz
tar -zxvf kafka_2.11-0.10.0.1.tgz


3.2 Configure Kafka

ln -s kafka_2.11-0.10.0.1 kafka

cd kafka/config/

vi server.properties
# Broker ID; must be unique within the cluster
broker.id=1

# Socket server port
port=9092

# IP address the socket server listens on
host.name=192.168.0.186

# Directory where Kafka stores its log (message) files
log.dirs=/usr/local/kafka/kafka-logs

# ZooKeeper connection string; Kafka metadata is kept under the /kafka chroot
zookeeper.connect=192.168.0.186:2181/kafka


mkdir -p /usr/local/kafka/kafka-logs


vi /etc/profile
export KAFKA_HOME=/usr/local/kafka
export PATH=$KAFKA_HOME/bin:$PATH


source /etc/profile


3.3 Verify the Kafka installation

Start Kafka:
nohup bin/kafka-server-start.sh config/server.properties &
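
To confirm the broker came up (a sketch; jps ships with the JDK, and nohup.out is written in the directory the command was started from):

jps                    # a Kafka process should be listed
tail -n 50 nohup.out   # look for a "started" message from kafka.server.KafkaServer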



Create a Kafka topic:
kafka-topics.sh --create \
--zookeeper 192.168.0.186:2181/kafka \
--replication-factor 1 --partitions 1 --topic  test


Producer:
kafka-console-producer.sh \
--broker-list 192.168.0.186:9092 --topic test


Consumer:
kafka-console-consumer.sh \
--zookeeper 192.168.0.186:2181/kafka \
--from-beginning --topic test
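
Messages typed into the producer console should appear in the consumer console. The topic layout can also be inspected (a sketch, using the same ZooKeeper chroot as above):

kafka-topics.sh --describe \
--zookeeper 192.168.0.186:2181/kafka --topic test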


4. Flume + Kafka


4.1 Flume configuration

vi flume-kafka-memory.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the netcat source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = test
a1.sinks.k1.brokerList = 192.168.0.186:9092
# a1.sinks.k1.kafka.producer.compression.type = snappy

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.keep-alive = 90
a1.channels.c1.capacity = 2000000
a1.channels.c1.transactionCapacity = 6000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


4.2 Start Flume

flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume-kafka-memory.conf -Dflume.root.logger=INFO,console

Start Kafka:

kafka-server-start.sh -daemon /usr/local/kafka/config/server.properties 

Create a Kafka topic:
kafka-topics.sh --create \
--zookeeper 192.168.0.186:2181/kafka \
--replication-factor 1 --partitions 1 --topic  topicD

List topics:

kafka-topics.sh --list --zookeeper localhost:2181/kafka
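
An end-to-end check of the Flume-to-Kafka pipeline (a sketch; it assumes the agent above is running and uses the topic test that the sink writes to, created in section 3.3):

# terminal 1: consume the sink's topic
kafka-console-consumer.sh \
--zookeeper 192.168.0.186:2181/kafka --topic test

# terminal 2: feed the netcat source; each line should appear in terminal 1
telnet 192.168.0.186 44444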


4.3 Spark Streaming reads from Kafka


The following job reads the Kafka topic with the spark-streaming-kafka-0-10 direct stream API and prints a word count for each 5-second batch:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe


object DkafkatoStreaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("project").setMaster("local")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "192.168.0.186:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    //ssc.checkpoint("hdfs://116.207.129.109:9000/checkproint")
    val topics = Array("streaming_topic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    val lines = stream.map(_.value)
    lines.flatMap(_.split(",")).map(x => (x, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
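
One way to run the job (a sketch; the Spark version 2.2.0, the jar name project.jar, and pulling spark-streaming-kafka-0-10 in through --packages are assumptions, so adjust them to your own build; the master is already set in the code):

spark-submit --class DkafkatoStreaming \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 \
project.jar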

5. Errors encountered

Error while executing topic command : replication factor: 1 larger than available brokers: 0
[2018-04-24 01:26:45,715] ERROR kafka.admin.AdminOperationException: replication factor: 1 larger than available brokers: 0
        at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
        at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
        at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)
        at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)
        at kafka.admin.TopicCommand.main(TopicCommand.scala)
 (kafka.admin.TopicCommand$)

Fix: include the /kafka chroot in the command's --zookeeper address (e.g. 192.168.0.186:2181/kafka). Because server.properties sets zookeeper.connect=192.168.0.186:2181/kafka, the brokers register under the /kafka znode; if the chroot is omitted, the topic tool looks at the ZK root, finds no live brokers, and reports "replication factor: 1 larger than available brokers: 0".

6. Tuning points:

General: timeouts, heap size, RPC settings.

producer:

acks

buffer.memory

compression.type

retries

batch.size — the maximum size in bytes of a record batch sent to a partition (a byte size, not a record count)

broker:

max.message.bytes — the maximum size of a single message the broker will accept (e.g. 2 MB)

replica.fetch.max.bytes — the maximum number of bytes a replica fetch request will try to pull; must be greater than or equal to max.message.bytes above (e.g. 4 MB)

zookeeper.connection.timeout.ms 

consumer:

fetch.message.max.bytes
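
For reference, the producer-side settings above could be collected in a producer.properties file (a sketch with illustrative values, not tuned recommendations; recent console producers can typically load such a file via --producer.config):

# acknowledgement level: all in-sync replicas must confirm the write
acks=all
# total memory the producer may use for buffering, in bytes
buffer.memory=33554432
# compress batches before sending
compression.type=snappy
# retry transient send failures
retries=3
# batch size per partition, in bytes
batch.size=16384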


[From @若泽大数据]


Reposted from blog.csdn.net/weixin_39182877/article/details/80022054