Spark Streaming summary

1. Check that the topic is healthy: replicas and leader
This can be run on any node in the cluster
opt/kafka_2.11-0.10.0.0/bin/kafka-topics.sh    --describe   --zookeeper 10.25.133.192:12181,10.26.51.89:12181/kafka10 --topic  nongfunginxlog


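Check the current (latest) offset of each partition; the message count equals the offset as long as retention has not deleted any messages yet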
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic nongfunginxlog --time -1 --broker-list 10.25.133.192:19092,10.26.51.89:19092
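
GetOffsetShell prints one topic:partition:offset line per partition, so the per-topic total can be added up with a small filter. A minimal sketch: the function name sum_offsets is made up for illustration, and the sample input uses example values rather than a live broker.

```shell
#!/bin/bash
# sum_offsets: read "topic:partition:offset" lines on stdin and print the total.
# (Hypothetical helper, not part of Kafka.)
sum_offsets() {
    awk -F: 'NF == 3 { total += $3 } END { printf "%d\n", total + 0 }'
}

# Sample input in the format GetOffsetShell prints with --time -1.
printf 'nongfunginxlog:0:2134059\nnongfunginxlog:1:1384754\n' | sum_offsets
```

In real use the GetOffsetShell command above would be piped into sum_offsets instead of the printf sample.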

2. Verify that all replicas of every partition of the specified topic(s) are in sync
opt/kafka_2.11-0.10.0.0/bin/kafka-replica-verification.sh  --broker-list 10.25.133.192:19092,10.26.51.89:19092   --topic-white-list  nongfunginxlog


opt/kafka_2.11-0.10.0.0/bin/kafka-replica-verification.sh  --broker-list 10.25.133.192:19092,10.26.51.89:19092   --topic      nongfunginxlog

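3. With replication, each partition may have several copies. The replica list of a partition is called its AR (Assigned Replicas), and the first replica in the AR is the "Preferred Replica". When a new topic is created or partitions are added to an existing topic, Kafka makes sure the preferred replicas are spread evenly over all brokers in the cluster. Ideally the preferred replica is elected leader. Together these two points keep the leaders of all partitions evenly distributed across the cluster, which matters because all reads and writes go through the leader: if leaders are concentrated on a few brokers, the cluster load becomes unbalanced. As the cluster runs, however, broker failures can break this balance. The Preferred Replica Leader Election tool in item 4 is used to restore a balanced leader distribution.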


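4. Rebalance the partition leaders of the topics across the nodes so that every broker carries an even share of the load, which improves stability and availability (especially useful after some brokers have rejoined the ISR, to redistribute the leaders of each topic's partitions).
After the tool has run, the leaders are distributed more evenly than before.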
$KAFKA_HOME/bin/kafka-preferred-replica-election.sh  --zookeeper 10.25.133.192:12181,10.26.51.89:12181/kafka10

5. Check the state of the Kafka topic in ZooKeeper:
leader
opt/zookeeper-3.4.11/bin/zkCli.sh  -server 10.25.133.192:12181

Check the broker ids (whether the cluster's brokers are alive)
get /kafka10/brokers/ids/1

get /kafka10/brokers/ids/0

follower
opt/zookeeper-3.4.11/bin/zkCli.sh  -server 10.26.51.89:12181

6.
a) Check when the offset was last saved in Redis

keys * returns the timestamp


b) date -d @<timestamp>


c) Check the messages arriving in Kafka
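
The timestamp conversion in step b) can be wrapped in a small function. This is a sketch under assumptions: the name epoch_to_utc is made up, GNU date is required for the -d @N form, and since the note does not say whether the Redis key holds seconds or milliseconds, a 13-digit value is treated as milliseconds.

```shell
#!/bin/bash
# epoch_to_utc: print an epoch timestamp as a UTC date.
# Assumption: a 13-digit value is in milliseconds, anything else in seconds.
# Requires GNU date (for the -d @N syntax).
epoch_to_utc() {
    local ts=$1
    if [ "${#ts}" -eq 13 ]; then
        ts=${ts%???}    # drop the millisecond part
    fi
    date -u -d "@${ts}" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1530230400000   # prints 2018-06-29 00:00:00
```

Printing in UTC keeps the result independent of the machine's time zone; drop -u to get local time as in step b).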


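7. Kill the kafka-console-consumer.sh processes running on this machine
cat killKafkaConsole.sh 

#!/bin/bash
# Find the PIDs of all running ConsoleConsumer JVMs (jps prints "<pid> <main class>").
kafkaconsolePid=$(jps | grep ConsoleConsumer | cut -d " " -f 1)
#echo $kafkaconsolePid
# Only call kill when a PID was found; the variable is left unquoted so that
# several PIDs are passed as separate arguments.
if [ -n "$kafkaconsolePid" ]; then
    kill -9 $kafkaconsolePid
fi
echo "is done"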


8. Describe the topic
kafka-topics.sh    --describe   --zookeeper hadoop:2181/kafka10 --topic  nongfunginxlog


opt/kafka_2.11-0.10.0.0/bin/kafka-topics.sh    --describe   --zookeeper 10.26.51.89:12181,10.25.133.192:12181/kafka10 --topic nongfunginxlog


9. Change the topic's settings (--alter to raise the partition count, or --create to create the topic)
opt/kafka_2.11-0.10.0.0/bin/kafka-topics.sh    --alter   --zookeeper 10.26.51.89:12181,10.25.133.192:12181/kafka10 --topic  nongfunginxlog   --partitions 2
opt/kafka_2.11-0.10.0.0/bin/kafka-topics.sh    --create   --zookeeper 10.26.51.89:12181,10.25.133.192:12181/kafka10 --topic  nongfunginxlog   --partitions 2 --replication-factor 2

10. Collect the data received by a topic
/home/sznongfu/opt/kafka_2.11-0.10.0.0/bin/kafka-console-consumer.sh  --zookeeper 10.25.133.192:12181,10.26.51.89:12181/kafka10 --topic  __consumer_offsets  >> /tmp/test

/home/sznongfu/opt/kafka_2.11-0.10.0.0/bin/kafka-console-consumer.sh  --zookeeper 10.25.133.192:12181,10.26.51.89:12181/kafka10   --topic nongfunginxlog   >>/tmp/kafkaRM-7.1.dat &


/home/sznongfu/opt/kafka_2.11-0.10.0.0/bin/kafka-console-consumer.sh  --zookeeper 10.25.133.192:12181,10.26.51.89:12181/kafka10   --topic nongfunginxlog


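11. Rebalance the cluster's AR (when brokers are added or removed)
Kafka Reassign Partitions Tool
Purpose
  The design goal of this tool is similar to that of the Preferred Replica Leader Election Tool: both aim to improve load balancing in a Kafka cluster. The difference is that Preferred Replica Leader Election can only move a partition's leader within its AR so that leaders are evenly distributed, whereas this tool can also change the partition's AR.
  Followers have to fetch data from the leader to stay in sync, so keeping only the leaders balanced is not enough to balance the load of the whole cluster. Also, in production the cluster may need to grow as load increases. Adding brokers to a Kafka cluster is simple, but the partitions of existing topics are not migrated to the new brokers automatically; this tool can do that. In some scenarios the actual load is far below what was originally expected; the tool can then reassign the partitions spread over the whole cluster onto a subset of the machines, so that the unneeded brokers can be shut down to save resources.
  Note that the tool can change not only where a partition's AR lives but also the size of the AR, i.e. the topic's replication factor.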
  
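How it works
  The tool only writes the required information to the corresponding ZooKeeper node and then exits. It does not perform the reassignment itself; all adjustments are carried out by the Controller.

1. The /admin/reassign_partitions node is created in ZooKeeper, holding the list of target partitions and the target AR list for each of them.
2. The watch that the Controller registered on /admin/reassign_partitions fires, and the Controller reads the list.
3. For every partition in the list, the Controller does the following:
   a) Starts the replicas in RAR - AR, i.e. the newly assigned replicas (RAR = Reassigned Replicas, AR = Assigned Replicas)
   b) Waits for the new replicas to catch up with the leader
   c) If the leader is not in RAR, elects a new leader from RAR
   d) Stops and deletes the replicas in AR - RAR, i.e. the replicas that are no longer needed
   e) Deletes the /admin/reassign_partitions node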
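
The two set differences the Controller works with, RAR - AR (replicas to start) and AR - RAR (replicas to remove), can be illustrated with plain shell. This is only a sketch of the set arithmetic, not of Kafka's controller code; set_diff and the broker ids are made-up example values.

```shell
#!/bin/bash
# set_diff A B: print the ids that are in list A but not in list B
# (both given as space-separated strings), in A's order.
set_diff() {
    local a=$1 b=$2 id out=""
    for id in $a; do
        case " $b " in
            *" $id "*) ;;                    # id is also in B: skip it
            *) out="${out:+$out }$id" ;;     # id only in A: keep it
        esac
    done
    echo "$out"
}

AR="1 2 3 4"     # current replicas (example values)
RAR="3 4 5 6"    # target replicas (example values)
echo "start:  $(set_diff "$RAR" "$AR")"   # RAR - AR, prints "start:  5 6"
echo "remove: $(set_diff "$AR" "$RAR")"   # AR - RAR, prints "remove: 1 2"
```

Replicas 3 and 4 appear in both lists, so they are neither started nor removed.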
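Usage
  The tool has three modes:

generate mode: given the topics to reassign, automatically generates a reassign plan (without executing it)
execute mode: reassigns partitions according to the specified reassign plan
verify mode: checks whether the partition reassignment has completed successfully
  The following example uses the tool to reassign all partitions of a topic to brokers 4/5/6/7. The steps are:

1. Use generate mode to produce a reassign plan. Specify the topics to reassign ({"topics":[{"topic":"topic1"}],"version":1}), save that to /tmp/topics-to-move.json, and then run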

$KAFKA_HOME/bin/kafka-reassign-partitions.sh \
    --zookeeper localhost:2181 \
    --topics-to-move-json-file /tmp/topics-to-move.json \
    --broker-list "4,5,6,7" --generate
  The result is shown in the figure below
  [Figure: reassign_1]
2. Use execute mode to run the reassign plan
  Save the reassignment plan generated in the previous step to /tmp/reassign-plan.json and run


$KAFKA_HOME/bin/kafka-reassign-partitions.sh \
    --zookeeper localhost:2181 \
    --reassignment-json-file /tmp/reassign-plan.json --execute
  [Figure: reassign_2]


  At this point the /admin/reassign_partitions node has been created in ZooKeeper, and its value matches the contents of /tmp/reassign-plan.json.
  [Figure: reassign_3]

3. Use verify mode to check whether the reassignment has finished. Run the verify command
$KAFKA_HOME/bin/kafka-reassign-partitions.sh \
    --zookeeper localhost:2181 --verify \
    --reassignment-json-file /tmp/reassign-plan.json
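
The topics file for the generate step has to be plain JSON with straight double quotes, which is easy to get wrong when copying from a web page. A minimal sketch that writes it from the shell; write_topics_json is a made-up helper name, and the topic name and path are the ones used in the example above.

```shell
#!/bin/bash
# write_topics_json TOPIC FILE: write the topics-to-move file for one topic.
# (Hypothetical helper; the JSON layout is the one quoted for generate mode.)
write_topics_json() {
    local topic=$1 file=$2
    printf '{"topics":[{"topic":"%s"}],"version":1}\n' "$topic" > "$file"
}

write_topics_json topic1 /tmp/topics-to-move.json
cat /tmp/topics-to-move.json
```

The sketch handles a single topic only; a file listing several topics would need one {"topic":"..."} entry per topic inside the array.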

12. Check the offset of each partition of the topic
Using the default Kafka API
opt/kafka_2.11-0.10.0.0/bin/kafka-consumer-offset-checker.sh  --zookeeper 10.25.133.192:12181,10.26.51.89:12181/kafka10  --group testp --topic nongfunginxlog
[2018-07-03 11:28:11,076] WARN WARNING: ConsumerOffsetChecker is deprecated and will be dropped in releases following 0.9.0. Use ConsumerGroupCommand instead. (kafka.tools.ConsumerOffsetChecker$)
Group           Topic                          Pid Offset          logSize         Lag             Owner
testp           nongfunginxlog                 0   2134058         2134059         1               none
testp           nongfunginxlog                 1   1384753         1384754         1               none
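
The Lag column is logSize minus Offset for each partition. A minimal sketch that recomputes and sums the lag from output in the column layout shown above; total_lag is a made-up helper name, and lines whose third to fifth fields are not numeric (the header and the WARN line) are skipped.

```shell
#!/bin/bash
# total_lag: read ConsumerOffsetChecker output on stdin and print the summed lag.
# Columns: Group Topic Pid Offset logSize Lag Owner. The lag is recomputed as
# logSize - Offset rather than read from column 6.
total_lag() {
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { lag += $5 - $4 }
         END { printf "%d\n", lag + 0 }'
}

total_lag <<'EOF'
Group           Topic                          Pid Offset          logSize         Lag             Owner
testp           nongfunginxlog                 0   2134058         2134059         1               none
testp           nongfunginxlog                 1   1384753         1384754         1               none
EOF
```

For the two partitions above this prints 2.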
 

With the Java API, i.e. kafka.javaapi.consumer.ConsumerConnector, the offset is read from ZooKeeper:
zkCli.sh     -server 10.25.133.192:12181
[zk: localhost(CONNECTED) 0] get /kafka/consumers/zoo-consumer-group/offsets/my-topic/0
5662
cZxid = 0x20006d28a
ctime = Wed Apr 12 18:20:51 CST 2017
mZxid = 0x30132b0ed
mtime = Tue Aug 22 18:53:22 CST 2017
pZxid = 0x20006d28a
cversion = 0
dataVersion = 5758
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 4
numChildren = 0

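13. Offset updates
Regardless of which API is used, offsets are updated in roughly two ways:

Automatic commit: set enable.auto.commit=true; the commit frequency is controlled by auto.commit.interval.ms. This approach is also described as "at most once": the offset can be updated as soon as the messages have been fetched, whether or not they were processed successfully.
Manual commit: set enable.auto.commit=false; this approach is described as "at least once". After fetching messages, wait until processing has finished and then call consumer.commitSync() to update the offset manually. If processing fails, the offset is not updated and the message is consumed again.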
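
The manual-commit behaviour can be illustrated with a small simulation. No Kafka client is involved: the array, the committed variable and the process function are all made up for illustration, and process is rigged to fail once on message m1.

```shell
#!/bin/bash
# Simulation only: "committed" plays the role of the stored offset,
# and process() fails exactly once, on message m1.
messages=("m0" "m1" "m2")
committed=0
failed_once=0
log=""

process() {
    if [ "$1" = "m1" ] && [ "$failed_once" -eq 0 ]; then
        failed_once=1
        return 1                      # simulated processing failure
    fi
    return 0
}

while [ "$committed" -lt "${#messages[@]}" ]; do
    msg=${messages[$committed]}       # fetch from the committed offset
    log="$log $msg"
    if process "$msg"; then
        committed=$((committed + 1))  # commit only after successful processing
    fi                                # on failure the offset stays where it was
done

echo "delivered:$log"                 # prints "delivered: m0 m1 m1 m2"
```

m1 is delivered twice because its first attempt failed before the commit, which is the duplicate that at-least-once delivery allows.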


Changes made:


Kafka changes: keep Kafka running normally
1. Stop the application, the Spark cluster and Flume
2. Delete the log and index files of the affected Kafka topic
3. Delete the Kafka-related directories in ZooKeeper (zkCli.sh)
4. Edit Kafka's server.properties
zookeeper.connect=10.25.133.192:12181,10.26.51.89:12181/kafka10

5. Create the topic
opt/kafka_2.11-0.10.0.0/bin/kafka-topics.sh    --create --zookeeper 10.25.133.192:12181,10.26.51.89:12181/kafka10   --replication-factor 2  --partitions 1  --topic  nongfunginxlog

6. Start Spark, the application, Kafka and Flume


Flume changes: ensure complete delivery
vim opt/flume-1.7.0/conf/hdfs-kafka-sink.conf 
1. Edit the configuration file
a1.sinks.k2.kafka.flumeBatchSize = 500
a1.sinks.k2.kafka.producer.acks = -1

MongoDB changes
Add indexes to speed up queries
user:
db.ss.iss.user.ensureIndex({"ip" : 1, "userid" :1, "userAgentOrig" :1, "epoch" : 1,"nginxInTime":1,"session" : 1})

api:
db.ss.iss.api.ensureIndex({"epoch" :1, "method" : 1, "nginxInTime" : 1, "path" : 1,"isStatic" : 1})

Separate hot and cold data


Kafka in an abnormal state:

Cause:
[2018-06-29 03:35:50,306] WARN [ReplicaFetcherThread-0-0], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@27ed6fa5 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:87)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:84)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:84)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:80)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$2(NetworkClientBlockingOps.scala:137)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:80)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:244)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:229)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:107)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:98)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2018-06-29 03:36:08,548] INFO Partition [__consumer_offsets,0] on broker 1: Shrinking 


ZooKeeper shows that the corresponding broker is offline (only broker 0 is registered)
[zk: 10.25.133.192:12181(CONNECTED) 0] get /kafka10/brokers/topics/nongfunginxlog/partitions
null
cZxid = 0x16b000003f5
ctime = Mon Jun 25 16:08:24 CST 2018
mZxid = 0x16b000003f5
mtime = Mon Jun 25 16:08:24 CST 2018
pZxid = 0x17100001b68
cversion = 2
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2
[zk: 10.25.133.192:12181(CONNECTED) 2] ls  /kafka10/brokers
ids      topics   seqid
[zk: 10.25.133.192:12181(CONNECTED) 2] ls  /kafka10/brokers/ids
[0]
[zk: 10.25.133.192:12181(CONNECTED) 3]  

The broker process is still running
sznongfu@iZ2zednyjjxxq7p50wqzenZ:~$ jps
14224 SecondaryNameNode
3893 Worker
14087 DataNode
19481 QuorumPeerMain
9690 Kafka
4109 CoarseGrainedExecutorBackend
29663 Jps


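If a broker fails to register under the broker ids in ZooKeeper, Kafka cannot assign partitions to the consumers. The consumer keeps sending (re-)join group requests, which blocks job execution and processing:
18/06/29 03:59:15 INFO utils.AppInfoParser: Kafka version : 0.10.0.0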
18/06/29 03:59:15 INFO utils.AppInfoParser: Kafka commitId : b8642491e78c5a13
18/06/29 03:59:15 INFO internals.AbstractCoordinator: Discovered coordinator 10.26.51.89:19092 (id: 2147483646 rack: null) for group testp.
18/06/29 03:59:15 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [] for group testp
18/06/29 03:59:15 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:20 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:25 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:30 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:35 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:40 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:45 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:50 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 03:59:55 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 04:00:00 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 04:00:05 INFO internals.AbstractCoordinator: (Re-)joining group testp
18/06/29 04:00:10 INFO internals.AbstractCoordinator: (Re-)joining group testp

Fix: restart the Kafka broker whose broker id is missing
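
Finding which broker to restart amounts to comparing the ids listed by zkCli's ls /kafka10/brokers/ids with the ids that should be there. A minimal sketch of that comparison; missing_brokers is a made-up helper name, and the expected ids 0 and 1 are taken from the broker checks earlier in this note.

```shell
#!/bin/bash
# missing_brokers "<zkCli ls output>" id...: print the expected broker ids
# that are absent from the ls output, e.g. "[0]" or "[0, 1]".
missing_brokers() {
    # Turn every non-digit into a space so ids can be matched as whole words.
    local registered=" $(echo "$1" | tr -c '0-9' ' ') "
    shift
    local id out=""
    for id in "$@"; do
        case "$registered" in
            *" $id "*) ;;                    # id is registered
            *) out="${out:+$out }$id" ;;     # id is missing
        esac
    done
    echo "$out"
}

missing_brokers "[0]" 0 1    # prints 1
```

An empty result means every expected broker is registered.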


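Flume summary:

1. Increase the number of file groups to improve Flume's ability to read messages concurrently and to avoid duplicated or missed messages. (TAILDIR: after a thread reads data (events), it updates the position (offset).
If too many files are monitored, the source may move on to new messages in other files before it has updated the position of the file it just read, so the next read of that file starts again from the old position.)

2. Use a Kafka channel when there are multiple sinks
With multiple sinks messages can be lost: after one sink has stored the events and the position has been saved, the cleanup thread removes them before they have been delivered to the other sinks, so those messages are never transmitted.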

-rw-r--r-- 1 root root   725639 Jun 29 04:00 nongfu.merchant.client.api.log.1
-rw-r--r-- 1 root root 41865438 Jun 29 04:00 nongfu.merchant.cmn.api.log.1
-rw-r--r-- 1 root root  1803221 Jun 29 04:00 nongfu.merchant.dashboard.api.log.1
-rw-r--r-- 1 root root   994911 Jun 29 04:00 nongfu.merchant.passport.api.log.1
-rw-r--r-- 1 root root   229893 Jun 29 04:00 nongfu.merchant.repo.api.log.1


Flume changes


a1.sources.r1.filegroups = f1 f2 f3 f4 f5

a1.sources.r1.fileHeader = true
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.headers.f2.headerKey2 = value2
a1.sources.r1.headers.f3.headerKey3 = value3
a1.sources.r1.headers.f4.headerKey4 = value4
a1.sources.r1.headers.f5.headerKey5 = value5

a1.sources.r1.filegroups.f1 = /home/sznongfu/opt/nginx-1.9/logs/merchant-master-logs/nongfu.merchant.client.api.log
a1.sources.r1.filegroups.f2 =   /home/sznongfu/opt/nginx-1.9/logs/merchant-master-logs/nongfu.merchant.cmn.api.log
a1.sources.r1.filegroups.f3 =  /home/sznongfu/opt/nginx-1.9/logs/merchant-master-logs/nongfu.merchant.dashboard.api.log
a1.sources.r1.filegroups.f4 =  /home/sznongfu/opt/nginx-1.9/logs/merchant-master-logs/nongfu.merchant.passport.api.log
a1.sources.r1.filegroups.f5 =   /home/sznongfu/opt/nginx-1.9/logs/merchant-master-logs/nongfu.merchant.repo.api.log


 


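Check the Linux load:
Summary:

(1) Use top to check the load. Inside top, press "1" to show each CPU core, Shift+"p" to sort by CPU usage, and Shift+"m" to sort by memory usage;
(2) To see the full command line of each process, press "c" (command line) inside top;

(3) Use iostat -x to monitor whether disk I/O is too high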
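
The load average reported by top is easier to judge relative to the number of cores. A minimal sketch, assuming a Linux machine with /proc/loadavg and nproc available; load_per_core is a made-up helper name.

```shell
#!/bin/bash
# load_per_core LOAD CORES: print the load average divided by the core count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On Linux: 1-minute load from /proc/loadavg, core count from nproc.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    echo "load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

As a rough rule of thumb, a value that stays above 1 for a long time suggests the machine has more runnable work than cores.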


Reposted from blog.csdn.net/dymkkj/article/details/81189669