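Kafka Summary

Official site: http://kafka.apache.org

Overview

Kafka is a high-throughput distributed publish-subscribe messaging system. It is fast because it performs only sequential I/O on disk, relying mainly on the OS page cache and the sendfile (zero-copy) mechanism. It can handle all the activity-stream data of a consumer-scale website. By design, Kafka writes every message to slow but high-capacity hard disks, trading raw speed for much greater storage capacity.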

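JMS Concepts

JMS stands for Java Message Service. It is a set of technical specifications provided by Java, used mainly for data integration and communication. It improves system scalability and user experience, and makes the modules between systems more flexible and easier to work with.

How it works: the producer-consumer pattern (producer, server, consumer).

Examples: JDK, Kafka, ActiveMQ, and so on.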

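JMS Message Delivery Models

Point-to-point model (one-to-one; the consumer actively pulls data, and a message is removed once it has been received).

Publish/subscribe model (one-to-many; once data is produced, it is pushed to all subscribers).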

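JMS Core Components

Destination: where a message is sent, i.e. a Queue or a Topic.

Message: the message being sent.

Producer: the message producer; to send a message, you must send it through a producer.

MessageConsumer: the counterpart of the producer; the consumer or receiver through which a message is received.

Note: JMS messages come in the following types.

StreamMessage: a Java data-stream message, filled and read sequentially with standard stream operations.

MapMessage: a Map-style message; names are strings and values are Java primitive types.

TextMessage: a plain string message containing one String.

ObjectMessage: an object message containing one serializable Java object.

BytesMessage: a binary message containing a byte[].

XMLMessage: an XML message (not one of the five types in the standard JMS API; some providers offer it as an extension).

TextMessage and ObjectMessage are the most commonly used.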

Common JMS-like Message Servers

1. ActiveMQ (JMS message server)
2. Metamorphosis (distributed message middleware)
3. RocketMQ (distributed message middleware)
4. Other MQs

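Kafka Terminology and How It Works

Producer: the message producer, i.e. the client that sends messages to a Kafka broker.
Consumer: the message consumer, i.e. the client that fetches messages from a Kafka broker.
Topic: can be thought of as a queue.
Consumer Group (CG): the mechanism Kafka uses to implement both broadcast (deliver to every consumer) and unicast (deliver to any one consumer) of a topic's messages. A topic can have multiple CGs. The topic's messages are copied (conceptually, not physically) to every CG, but each partition delivers its messages to only one consumer within a given CG. To broadcast, give every consumer its own CG; to unicast, put all consumers in the same CG. CGs also let you group consumers freely without sending the same message to several topics.
Broker: one Kafka server is one broker. A cluster consists of multiple brokers, and one broker can host multiple topics.
Partition: for scalability, a very large topic can be spread across multiple brokers (servers). A topic can be divided into multiple partitions, each of which is an ordered queue. Every message in a partition is assigned a sequential id (the offset). Kafka guarantees delivery to consumers only in the order within a single partition, not across the partitions of a topic as a whole.
Offset: Kafka's storage files are named after offsets (offset.kafka in the naming described here), which makes lookups easy. For example, to find the message at position 2049, locate the file 2048.kafka. The first file is accordingly 00000000000.kafka.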
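
The broadcast and unicast behaviour of consumer groups described above can be sketched in a few lines of Python. This is a toy model rather than Kafka client code: the deliver function and the ownership rule (partition p goes to consumer p % n within a group) are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def deliver(
    messages: List[Tuple[int, str]],
    groups: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Simulate delivery of (partition, payload) messages to consumer groups.

    Every group sees every message; inside one group, each partition is
    owned by exactly one consumer.
    """
    received = defaultdict(list)
    for group, consumers in groups.items():
        for partition, payload in messages:
            # Hypothetical ownership rule, for illustration only.
            owner = consumers[partition % len(consumers)]
            received[f"{group}/{owner}"].append(payload)
    return dict(received)


msgs = [(0, "a"), (1, "b"), (0, "c"), (1, "d")]

# Unicast: both consumers share one group, so each message is handled once.
unicast = deliver(msgs, {"g1": ["c1", "c2"]})
print(unicast)    # {'g1/c1': ['a', 'c'], 'g1/c2': ['b', 'd']}

# Broadcast: one group per consumer, so every consumer gets every message.
broadcast = deliver(msgs, {"g1": ["c1"], "g2": ["c2"]})
print(broadcast)  # {'g1/c1': ['a', 'b', 'c', 'd'], 'g2/c2': ['a', 'b', 'c', 'd']}
```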

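Kafka Core Components

1. Topic: messages are categorized by topic

2. Producer: the sender of messages

3. Consumer: the receiver of messages

4. Broker: each Kafka instance (server)

5. ZooKeeper: Kafka relies on a ZooKeeper cluster to store metadata.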

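Why Kafka Loses Data and How to Address It

However, batching messages into a MessageSet also compromises availability to some extent. Each time data is sent, the producer treats the message as sent as soon as send() returns, but in most cases the message is still in the in-memory MessageSet and has not yet reached the network. If the producer dies at that point, the data is lost.

If losing some data is acceptable, set request.required.acks=0 to turn off the ack mechanism and send at full speed. Otherwise set it to 1 or -1. With 1, a message only needs to be received and acknowledged by the Leader. With -1, the ack is returned only after the message has been committed to every replica in the partition's ISR set, which makes sending safer.

The producer buffers multiple messages on the client and sends them to the broker in batches. Too many small I/O operations slow down the network overall, so delayed batch sending actually improves network efficiency. It carries a risk, though: if the producer fails, messages that have not yet been sent are lost.

When producing: in sync mode, set acks to 1 (be aware of the risk of this setting).

In async mode, configure the enqueue timeout so that blocking never times out (queue.enqueue.timeout.ms=-1).

When consuming: if you use Storm, enable Storm's ack/fail mechanism.

If you do not use Storm, use the new Kafka API, which updates the offset automatically.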

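Kafka Data Is Consumed More Than Once: How to Deduplicate

Deduplicate: store the messages in Redis and check on every consumption whether the message has already been consumed.
Ignore it: in big-data workloads, one record more or less has little effect on the reports, so duplicates can simply be tolerated.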
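
The first option, checking whether a message has already been consumed, can be sketched as follows. This is a minimal illustration that assumes every message carries a unique id. An in-memory Python set stands in for the Redis set so that the example runs without a server, and the names consume_once and seen_ids are made up for this sketch.

```python
from typing import Callable, Iterable, List, Set, Tuple


def consume_once(
    messages: Iterable[Tuple[str, str]],
    handle: Callable[[str], None],
    seen_ids: Set[str],
) -> List[str]:
    """Process each (message_id, payload) pair at most once.

    seen_ids plays the role of the Redis set. Returns the ids that were
    skipped because they had already been consumed.
    """
    skipped = []
    for message_id, payload in messages:
        if message_id in seen_ids:
            skipped.append(message_id)  # redelivery: ignore it
            continue
        handle(payload)
        seen_ids.add(message_id)  # record the id only after handling succeeded
    return skipped


processed: List[str] = []
seen: Set[str] = set()
redelivered = [("m1", "a"), ("m2", "b"), ("m1", "a"), ("m3", "c"), ("m2", "b")]
skipped_ids = consume_once(redelivered, processed.append, seen)
print(processed)    # ['a', 'b', 'c']
print(skipped_ids)  # ['m1', 'm2']
```

With a real Redis instance the membership check and the insert would become calls to the Redis client, and the set would survive consumer restarts.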

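Kafka Consumer Load Balancing

When a consumer joins or leaves a group, a partition rebalance is triggered. The goal of rebalancing is to increase the topic's concurrent consumption capacity. When the partitions divide evenly among the consumers, the assignment proceeds as follows:

Suppose topic1 has the partitions P0, P1, P2, P3
and the group contains the consumers C0, C1
First sort the partitions by partition index: P0, P1, P2, P3
Sort the consumers by consumer.id: C0, C1
Compute the multiple: M = [P0,P1,P2,P3].size / [C0,C1].size, which is M=2 in this example (rounded up)
Then assign the partitions in order: C0 = [P0,P1], C1 = [P2,P3], i.e. Ci = [P(i * M), ..., P((i + 1) * M - 1)]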
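
The steps above can be written out as a small function. This is a sketch of the arithmetic only, not Kafka code; the function name assign_by_ceiling is made up, and partitions are represented by their integer indexes.

```python
import math
from typing import Dict, List


def assign_by_ceiling(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give consumer i the partitions P(i*M) .. P((i+1)*M - 1), with M = ceil(P / C)."""
    parts = sorted(partitions)   # sort partitions by index
    members = sorted(consumers)  # sort consumers by id
    m = math.ceil(len(parts) / len(members))
    return {c: parts[i * m:(i + 1) * m] for i, c in enumerate(members)}


assignment = assign_by_ceiling([0, 1, 2, 3], ["C0", "C1"])
print(assignment)  # {'C0': [0, 1], 'C1': [2, 3]}
```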

Kafka Source Code Analysis:

class RangeAssignor() extends PartitionAssignor with Logging {

  def assign(ctx: AssignmentContext) = {
    val valueFactory = (topic: String) => new mutable.HashMap[TopicAndPartition, ConsumerThreadId]
    val partitionAssignment =
      new Pool[String, mutable.Map[TopicAndPartition, ConsumerThreadId]](Some(valueFactory))

    for (topic <- ctx.myTopicThreadIds.keySet) {
      // 1. Get the current consumer threads; the worked example assumes 5 (C0..C4)
      val curConsumers = ctx.consumersForTopic(topic)

      // 2. Get the current partitions; the worked example assumes 4 (p0..p3)
      val curPartitions: Seq[Int] = ctx.partitionsForTopic(topic)

      // Integer division: nPartsPerConsumer = 4 / 5 = 0
      val nPartsPerConsumer = curPartitions.size / curConsumers.size

      // Remainder: nConsumersWithExtraPart = 4 % 5 = 4
      val nConsumersWithExtraPart = curPartitions.size % curConsumers.size

      // 3. Iterate over all consumer threads and compute the partition range each one owns
      for (consumerThreadId <- curConsumers) {
        // Position of this consumer in the list: 0, 1, 2, 3, 4
        val myConsumerPosition = curConsumers.indexOf(consumerThreadId)
        assert(myConsumerPosition >= 0)

        // 4. Starting partition for each thread
        // C0: startPart = 0*0 + min(0, 4) = 0
        // C1: startPart = 0*1 + min(1, 4) = 1
        // C2: startPart = 0*2 + min(2, 4) = 2
        // C3: startPart = 0*3 + min(3, 4) = 3
        // C4: startPart = 0*4 + min(4, 4) = 4
        val startPart = nPartsPerConsumer * myConsumerPosition + myConsumerPosition.min(nConsumersWithExtraPart)

        // 5. Number of partitions each thread consumes from its starting position
        // C0: nParts = 0 + 1 = 1
        // C1: nParts = 0 + 1 = 1
        // C2: nParts = 0 + 1 = 1
        // C3: nParts = 0 + 1 = 1
        // C4: nParts = 0 + 0 = 0
        val nParts = nPartsPerConsumer + (if (myConsumerPosition + 1 > nConsumersWithExtraPart) 0 else 1)

        // Assigned range: startPart until startPart + nParts
        // C0: 0 until 1 -> p0
        // C1: 1 until 2 -> p1
        // C2: 2 until 3 -> p2
        // C3: 3 until 4 -> p3
        // C4: 4 until 4 -> empty, no partition is assigned

        // (the excerpt ends here; the rest of the method is not shown)
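
To check the startPart and nParts arithmetic from the excerpt without a Kafka build, the same two formulas can be reproduced in Python. This is an illustrative port of the arithmetic only; range_assign is a made-up name, and the sketch leaves out everything else the real assignor does.

```python
from typing import Dict, List


def range_assign(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Mirror the startPart / nParts computation for every consumer position."""
    parts = sorted(partitions)
    members = sorted(consumers)
    per_consumer = len(parts) // len(members)  # nPartsPerConsumer
    with_extra = len(parts) % len(members)     # nConsumersWithExtraPart
    result = {}
    for pos, consumer in enumerate(members):
        start = per_consumer * pos + min(pos, with_extra)
        count = per_consumer + (0 if pos + 1 > with_extra else 1)
        result[consumer] = parts[start:start + count]
    return result


# 4 partitions, 5 consumers: the last consumer gets nothing.
print(range_assign([0, 1, 2, 3], ["C0", "C1", "C2", "C3", "C4"]))
# {'C0': [0], 'C1': [1], 'C2': [2], 'C3': [3], 'C4': []}

# 6 partitions, 4 consumers: the first two consumers take one extra partition.
print(range_assign([0, 1, 2, 3, 4, 5], ["C0", "C1", "C2", "C3"]))
# {'C0': [0, 1], 'C1': [2, 3], 'C2': [4], 'C3': [5]}
```

The second call shows why running more consumers than partitions is wasteful, while fewer consumers than partitions simply means some consumers own several partitions.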

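Kafka File Storage Mechanism

Basic Structure of Kafka File Storage

In Kafka's file storage, one topic has multiple partitions and each partition is a directory. A partition directory is named after the topic name plus a sequence number; the first partition is numbered 0 and the largest number is the partition count minus 1.

Each partition (directory) is like one huge file split evenly across multiple equally sized segment data files. The number of messages per segment file is not necessarily equal, however. This layout lets old segment files be deleted quickly. Data is kept for 7 days by default.

Each partition only needs to support sequential reads and writes. The life cycle of a segment file (when it is created and when it is deleted) is determined by server-side configuration parameters.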
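
Because each segment file is named after an offset, finding the segment that holds a given offset is a binary search over the file names. The sketch below assumes that a segment is named after the first offset it stores, zero-padded to 20 digits with a .log extension (the form seen in the directory listings of this article); segment_file_for is a made-up helper, not a Kafka API.

```python
import bisect
from typing import List


def segment_file_for(offset: int, base_offsets: List[int]) -> str:
    """Return the name of the segment file whose range contains `offset`.

    base_offsets must be sorted ascending; each value is assumed to be the
    first offset stored in one segment.
    """
    idx = bisect.bisect_right(base_offsets, offset) - 1
    if idx < 0:
        raise ValueError(f"offset {offset} precedes the oldest segment")
    return f"{base_offsets[idx]:020d}.log"


bases = [0, 2048, 4096]
print(segment_file_for(2049, bases))  # 00000000000000002048.log
print(segment_file_for(0, bases))     # 00000000000000000000.log
```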

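A Discussion of Data Ordering

Is the data within one partition ordered? Yes, but only in an interval sense: it is in order, though not necessarily contiguous.

For the data in one topic, ordering can only be achieved within a partition, not globally.

Once consumers enter the picture, how can the data they consume be kept globally ordered? That is a false proposition.

Global order can be guaranteed in only one case: when there is a single partition.

For details, see: http://blog.csdn.net/xfg0218/article/details/52935368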
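
The ordering rule can be seen in a toy model: events with the same key always land in the same partition, so their relative order survives, while nothing relates the order of events in different partitions. The hash used here (zlib.crc32) is a stand-in chosen so the example is deterministic; it is not Kafka's partitioner, and the names partition_for and route are made up.

```python
import zlib
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration; Kafka's own partitioner differs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events: List[Tuple[str, int]], num_partitions: int) -> Dict[int, List[Tuple[str, int]]]:
    """Append each (key, sequence) event to the log of its key's partition."""
    logs: Dict[int, List[Tuple[str, int]]] = {p: [] for p in range(num_partitions)}
    for key, seq in events:
        logs[partition_for(key, num_partitions)].append((key, seq))
    return logs


events = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-b", 2), ("user-a", 3)]

# Several partitions: order is kept per key, not across the whole topic.
logs = route(events, num_partitions=3)
for partition, log in logs.items():
    print(partition, log)

# A single partition: the log is exactly the production order.
print(route(events, num_partitions=1)[0] == events)  # True
```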

Kafka Cluster Deployment

Software download: link: http://pan.baidu.com/s/1nvCUouH password: gnty. If the download fails, please contact the author.

Or: wget http://mirrors.hust.edu.cn/apache/kafka/0.8.2.2/kafka_2.11-0.8.2.2.tgz

1-1) Install the software

[root@hadoop1 local]# tar -zxvf kafka_2.11-0.9.0.1.tgz

[root@hadoop1 local]# mv kafka_2.11-0.9.0.1 kafka

1-2) Edit the configuration files

A) Configure server.properties

[root@hadoop1 config]# cat server.properties

# Licensed to the Apache Software Foundation (ASF) under one or more

# contributor license agreements. See the NOTICE file distributed with

# this work for additional information regarding copyright ownership.

# The ASF licenses this file to You under the Apache License, Version 2.0

# (the "License"); you may not use this file except in compliance with

# the License. You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.

# Globally unique broker id; must not be duplicated

broker.id=0

############################# Socket Server Settings #############################

# The port the socket server listens on

# Port to listen on for connections; producers and consumers connect to this port

port=9092

# Hostname the broker will bind to. If not set, the server will bind to all interfaces

# Hostname the broker binds to

host.name=hadoop1

# Hostname the broker will advertise to producers and consumers. If not set, it uses the

# value for "host.name" if configured. Otherwise, it will use the value returned from

# java.net.InetAddress.getCanonicalHostName().

#advertised.host.name=<hostname routable by clients>

# The port to publish to ZooKeeper for clients to use. If this is not set,

# it will publish the same port that the broker binds to.

#advertised.port=<port accessible by clients>

# The number of threads handling network requests

# Number of threads handling network requests

num.network.threads=3

# The number of threads doing disk I/O

# Number of threads handling disk I/O

num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server

# Socket send buffer size

socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server

# Socket receive buffer size

socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)

# Maximum size of a request that the socket server will accept

socket.request.max.bytes=104857600

############################# Log Basics #############################

# A comma seperated list of directories under which to store log files

# Directory where Kafka stores its message data (the partition logs), not the application's runtime logs

log.dirs=/usr/local/kafka/logs

# The default number of log partitions per topic. More partitions allow greater

# parallelism for consumption, but this will also result in more files across

# the brokers.

# Default number of partitions per topic

num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.

# This value is recommended to be increased for installations with data dirs located in RAID array.

# Number of threads per data directory used for log recovery at startup and flushing at shutdown

num.recovery.threads.per.data.dir=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync

# the OS cache lazily. The following configurations control the flush of data to disk.

# There are a few important trade-offs here:

# 1. Durability: Unflushed data may be lost if you are not using replication.

# 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.

# 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.

# The settings below allow one to configure the flush policy to flush data after a period of time or

# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk

# Flush data to disk once 10000 messages have accumulated

#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush

# Maximum time a message can sit in the log before a flush is forced

#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can

# be set to delete segments after a period of time, or after a given size has accumulated.

# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens

# from the end of the log.

# The minimum age of a log file to be eligible for deletion

# Maximum time a segment file is kept before it is deleted; the default is 7 days (24*7=168 hours)

log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining

# segments don't drop below log.retention.bytes.

#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.

# Size of each segment file in the log; the default is 1 GB

log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according

# to the retention policies

# Interval at which log segments are checked against the retention policies

log.retention.check.interval.ms=300000

# By default the log cleaner is disabled and the log retention policy will default to just delete segments after their retention expires.

# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.

log.cleaner.enable=false

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).

# This is a comma separated host:port pairs, each corresponding to a zk

# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".

# You can also append an optional chroot string to the urls to specify the

# root directory for all kafka znodes.

# The broker uses ZooKeeper to store metadata

zookeeper.connect=hadoop1:2181,hadoop2:2181,hadoop3:2181

# Maximum time in ms that a message may stay in the partition log before a flush to disk is forced

log.flush.interval.ms=3000

# Deleting a topic requires delete.topic.enable=true in server.properties; otherwise the topic is only marked for deletion

delete.topic.enable=true

# Timeout in ms for connecting to zookeeper

# Timeout for connecting to ZooKeeper

zookeeper.connection.timeout.ms=6000

B) Configure consumer.properties

[root@hadoop1 config]# cat consumer.properties

# Licensed to the Apache Software Foundation (ASF) under one or more

# contributor license agreements. See the NOTICE file distributed with

# this work for additional information regarding copyright ownership.

# The ASF licenses this file to You under the Apache License, Version 2.0

# (the "License"); you may not use this file except in compliance with

# the License. You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# see kafka.consumer.ConsumerConfig for more details

# Zookeeper connection string

# comma separated host:port pairs, each corresponding to a zk

# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002"

zookeeper.connect=hadoop1:2181,hadoop2:2181,hadoop3:2181

# timeout in ms for connecting to zookeeper

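# Maximum time the client waits while establishing a connection to ZooKeeper. (How long the other consumers take to notice a dead consumer and trigger a rebalance is governed by zookeeper.session.timeout.ms.)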

zookeeper.connection.timeout.ms=6000

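# How far a ZooKeeper follower may lag behind the ZooKeeper leader. (Offset updates to ZooKeeper are time-based rather than per message and are governed by auto.commit.interval.ms; if an error occurs while updating ZooKeeper and the consumer restarts, it may receive messages it has already received, as the author confirmed by testing.)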

zookeeper.sync.time.ms=2000

# Consumer group id

group.id=consumerCourse

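# After consuming a certain number of messages, the consumer automatically commits its offset to ZooKeeper. The offset is not committed after every message; it is kept locally (in memory) and committed periodically. The default is true. In the author's measurements the offset was updated once 500 messages had been consumed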

auto.commit.enable=true

# Auto-commit interval. The default is 60 * 1000

auto.commit.interval.ms=1000

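# Identifier of this consumer. It can be set explicitly or generated by the system; it is mainly used to track consumption and make it easier to observe. If it is not set, it is generated automatically

#consumer.id=

# Client id used to tell different clients apart. By default the client program generates it; it is best set to the same value as group.id

client.id=consumerCourse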

# Maximum number of message chunks buffered for the consumer (default 10)

queued.max.message.chunks=50

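# When a new consumer joins the group, a rebalance takes place and some partitions migrate to the new consumer. A consumer that gains ownership of a partition registers a "Partition Owner registry" node in ZooKeeper, but the previous owner may not have released that node yet. This value controls how many times the registration is retried.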

rebalance.max.retries=5

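# Maximum size of a fetch: the broker does not return message chunks larger than this. Each fetch returns multiple messages and this value is their total size; raising it uses more memory on the consumer. (This description matches fetch.message.max.bytes; fetch.min.bytes itself is the minimum amount of data the broker returns for a fetch request.)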

fetch.min.bytes=6553600

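# How long the server blocks when there is not enough data for a fetch; when the time is up, the available messages are sent to the consumer immediately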

fetch.wait.max.ms=5000

socket.receive.buffer.bytes=655360

# What to do when ZooKeeper has no offset or the offset is out of range: smallest resets to the current smallest offset, largest resets to the current largest offset, and anything else throws an exception. The default is largest

auto.offset.reset=largest

# Class used to decode (deserialize) messages

derializer.class=kafka.serializer.DefaultDecoder

C) Configure producer.properties

[root@hadoop3 config]# cat producer.properties

# Licensed to the Apache Software Foundation (ASF) under one or more

# contributor license agreements. See the NOTICE file distributed with

# this work for additional information regarding copyright ownership.

# The ASF licenses this file to You under the Apache License, Version 2.0

# (the "License"); you may not use this file except in compliance with

# the License. You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# see kafka.producer.ProducerConfig for more details

############################# Producer Basics #############################

# list of brokers used for bootstrapping knowledge about the rest of the cluster

# format: host1:port1,host2:port2 ... --> list of Kafka brokers used to fetch metadata; it need not list every broker

metadata.broker.list=hadoop1:9092,hadoop2:9092,hadoop3:9092

# name of the partitioner class for partitioning events; default partition spreads data randomly --> the partitioner class. The default is kafka.producer.DefaultPartitioner, which hashes the key to pick the partition

#partitioner.class=kafka.producer.DefaultPartitioner

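# Whether sending requires acknowledgement from the server. Three values: 0, 1, -1; the default is 0

# 0: the producer does not wait for an ack from the broker

# 1: the ack is sent once the leader has received the message

# -1: the ack is sent once all in-sync followers have replicated the message.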

request.required.acks=0

# specifies whether the messages are sent asynchronously (async) or synchronously (sync)

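# Send messages synchronously or asynchronously. The default "sync" means synchronous, "async" means asynchronous. Async raises send throughput,

# but messages sit in a local buffer and are sent in batches at suitable moments, so messages not yet sent may be lost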

producer.type=sync

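# Compression codec: none (0) means no compression, gzip (1), snappy (2). A header in the compressed message records the codec, so decompression on the consumer side is transparent and needs no configuration. Text data can compress to roughly 1:10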

compression.codec=none

# message encoder --> the serializer class

serializer.class=kafka.serializer.DefaultEncoder

# allow topic level compression

#compressed.topics=

############################# Async Producer #############################

# maximum time, in milliseconds, for buffering data on the producer queue

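# In async mode, once a message has been buffered longer than this value, the buffered messages are sent to the broker as a batch. The default is

# 5000ms

# This value works together with batch.num.messages.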

queue.buffering.max.ms=5000

# the maximum size of the blocking queue for buffering on the producer

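# In async mode, the maximum number of messages the producer may buffer

# If the producer cannot send messages to the broker fast enough, they pile up on the producer side

# Once the count reaches this threshold, the producer either blocks or drops messages; the default is 10000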

queue.buffering.max.messages=20000

# Timeout for event enqueue:

# 0: events will be enqueued immediately or dropped if the queue is full

# -ve: enqueue will block indefinitely if the queue is full

# +ve: enqueue will block up to this many milliseconds if the queue is full

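# When the number of messages queued on the producer reaches queue.buffering.max.messages and the queue still has no room after blocking for some time (the producer has not managed to send anything)

# the producer can either keep blocking or drop the message; this timeout controls how long it blocks (-1: block without a time limit, messages are never dropped; 0: do not block, the message is dropped immediately if the queue is full)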

queue.enqueue.timeout.ms=-1

# the number of messages batched at the producer --> in async mode, the number of messages sent per batch; the default is 200

batch.num.messages=500

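# Number of times a message may be resent when the producer receives an error ack or no ack at all

# The broker has no complete mechanism for avoiding duplicates, so on network failures (for example a lost ack)

# the broker may receive the same message more than once. The default is 3.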

message.send.max.retries=3

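# Interval at which the producer refreshes topic metadata. The producer needs to know where each partition leader is and the current state of the topic

# so it needs a way to obtain up-to-date metadata. On certain errors it refreshes immediately

# (for example topic failure, missing partition, leader failure); this parameter configures an additional periodic refresh. The default is 600000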

topic.metadata.refresh.interval.ms=60000

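1-1) Notes on the producer.type parameter

producer.type selects whether messages are sent synchronously or asynchronously: the default "sync" means synchronous, "async" means asynchronous.

A) Sending messages asynchronously

A send is triggered as soon as either of the following conditions is met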

# the number of messages batched at the producer

# In async mode, the number of messages sent per batch; the default is 200

batch.num.messages=500

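# In async mode, once a message has been buffered longer than this value, the buffered messages are sent to the broker as a batch. The default is

# 5000ms

# This value works together with batch.num.messages.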

queue.buffering.max.ms=5000
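
How the two triggers interact can be illustrated with a simplified model. Given message arrival times in milliseconds, the function below reports when each batch would be sent if a batch leaves as soon as it holds batch_num_messages messages or its oldest message has waited queue_buffering_max_ms. It is a sketch of the rule stated above, not of the producer's actual implementation; plan_batches is a made-up name.

```python
from typing import List, Tuple


def plan_batches(
    arrivals_ms: List[int],
    batch_num_messages: int,
    queue_buffering_max_ms: int,
) -> List[Tuple[int, int]]:
    """Return (send_time_ms, batch_size) pairs for the given arrival times."""
    batches = []
    pending = 0
    first_arrival = 0
    for t in sorted(arrivals_ms):
        # Time trigger: the open batch expired before this message arrived.
        if pending and t - first_arrival >= queue_buffering_max_ms:
            batches.append((first_arrival + queue_buffering_max_ms, pending))
            pending = 0
        if pending == 0:
            first_arrival = t
        pending += 1
        # Size trigger: the batch is full, send it right away.
        if pending == batch_num_messages:
            batches.append((t, pending))
            pending = 0
    if pending:
        batches.append((first_arrival + queue_buffering_max_ms, pending))
    return batches


print(plan_batches([0, 10, 20, 6000, 6001], batch_num_messages=3, queue_buffering_max_ms=5000))
# [(20, 3), (11000, 2)]
print(plan_batches([0, 100, 7000], batch_num_messages=500, queue_buffering_max_ms=5000))
# [(5000, 2), (12000, 1)]
```

In the first call the size trigger fires at 20 ms; the remaining two messages leave only when the time limit expires. In the second call the batch size of 500 is never reached, so every send is driven by the timer.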

B) Queue size and blocking behaviour when the producer's buffer is full

# the maximum size of the blocking queue for buffering on the producer

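# In async mode, the maximum number of messages the producer may buffer

# If the producer cannot send messages to the broker fast enough, they pile up on the producer side

# Once the count reaches this threshold, the producer either blocks or drops messages; the default is 10000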

queue.buffering.max.messages=20000

# Timeout for event enqueue:

# 0: events will be enqueued immediately or dropped if the queue is full

# -ve: enqueue will block indefinitely if the queue is full

# +ve: enqueue will block up to this many milliseconds if the queue is full

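# When the number of messages queued on the producer reaches queue.buffering.max.messages and the queue still has no room after blocking for some time (the producer has not managed to send anything)

# the producer can either keep blocking or drop the message; this timeout controls how long it blocks (-1: block without a time limit, messages are never dropped; 0: do not block, the message is dropped immediately if the queue is full)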

queue.enqueue.timeout.ms=-1
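
The three cases of the enqueue timeout (0, negative, positive) map naturally onto a bounded queue. The sketch below uses Python's standard queue.Queue as a stand-in for the producer's buffer; the enqueue helper and the tiny queue size are assumptions made for illustration, not part of any Kafka client.

```python
import queue


def enqueue(q: "queue.Queue[str]", message: str, enqueue_timeout_ms: int) -> bool:
    """Try to buffer a message; return False if it had to be dropped."""
    try:
        if enqueue_timeout_ms == 0:
            q.put_nowait(message)  # never block: drop when the queue is full
        elif enqueue_timeout_ms < 0:
            q.put(message)         # block until space becomes available
        else:
            q.put(message, timeout=enqueue_timeout_ms / 1000.0)  # block up to the limit
        return True
    except queue.Full:
        return False


# A queue of size 2 stands in for queue.buffering.max.messages=2.
buffer: "queue.Queue[str]" = queue.Queue(maxsize=2)
results = [enqueue(buffer, f"m{i}", enqueue_timeout_ms=0) for i in range(3)]
print(results)  # [True, True, False]

# With a positive timeout the call waits about 50 ms, then gives up.
print(enqueue(buffer, "m3", enqueue_timeout_ms=50))  # False
```

With a timeout of -1 and nothing draining the queue, the call would block forever, which is why that case is not exercised here.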

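C) ACK setting: whether sending requires acknowledgement from the server

# Whether sending requires acknowledgement from the server. Three values: 0, 1, -1; the default is 0

# 0: the producer does not wait for an ack from the broker

# 1: the ack is sent once the leader has received the message

# -1: the ack is sent once all in-sync followers have replicated the message.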

request.required.acks=0

1-3) Configure the environment variables

[root@hadoop1 kafka]# vi /etc/profile

Add the following:

export KAFKA_HOME=/usr/local/kafka

export PATH=$PATH:$KAFKA_HOME/bin

[root@hadoop1 kafka]# source /etc/profile

1-4) Start Kafka

A. Start ZooKeeper first

B. Then start Kafka on each node in turn with: kafka-server-start.sh config/server.properties

A) Start in the foreground

[root@hadoop1 kafka]# kafka-server-start.sh config/server.properties

[root@hadoop2 kafka]# kafka-server-start.sh config/server.properties

[root@hadoop3 kafka]# kafka-server-start.sh config/server.properties

************

[2016-10-02 00:12:32,691] INFO Property log.cleaner.enable is overridden to false (kafka.utils.VerifiableProperties)

[2016-10-02 00:12:32,691] INFO Property log.dirs is overridden to /usr/local/kafka/logs (kafka.utils.VerifiableProperties)

***********

[2016-10-02 00:12:32,841] INFO Client environment:user.home=/root (org.apache.zookeeper.ZooKeeper)

[2016-10-02 00:12:32,842] INFO Client environment:user.dir=/usr/local/kafka (org.apache.zookeeper.ZooKeeper)

**********

[2016-10-02 00:12:34,104] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)

[2016-10-02 00:12:34,138] INFO [Kafka Server 0], started (kafka.server.KafkaServer)

......

B) Start in the background

[root@hadoop1 kafka]# kafka-server-start.sh config/server.properties > /dev/null 2>&1 &

[root@hadoop2 kafka]# kafka-server-start.sh config/server.properties > /dev/null 2>&1 &

[root@hadoop3 kafka]# kafka-server-start.sh config/server.properties > /dev/null 2>&1 &

C) Check the processes

[root@hadoop1 logs]# jps

3338 SecondaryNameNode

3101 QuorumPeerMain

7257 Jps

3625 NodeManager

3499 ResourceManager

3211 DataNode

3105 NameNode

7189 Kafka

D) Check in ZooKeeper

The newly created nodes are: admin, config, brokers

Conclusion: different topics can share one consumer group.

E) Startup script

[root@hadoop1 start_sh]# cat kafka_start.sh

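# Read one hostname per line from the slave file and start Kafka on each host
cat /usr/local/start_sh/slave | while read line
do
echo $line
# -n keeps ssh from consuming the rest of the host list on stdin
ssh -n $line "source /etc/profile; nohup kafka-server-start.sh /usr/local/kafka/config/server.properties > /dev/null 2>&1 &"
wait
done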

Common Kafka Commands

1-1) List topics

[root@hadoop1 kafka]# kafka-topics.sh --list --zookeeper hadoop1:2181

kafka

test

1-2) Create a topic

[root@hadoop1 kafka]# kafka-topics.sh --create --zookeeper hadoop1:2181 --replication-factor 2 --partitions 3 --topic test1

Created topic "test1".

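A) Parameter description

--replication-factor 2: the number of replicas, i.e. each partition is stored as two copies

--partitions 3: the number of partitions

B) Inspect the created data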

[root@hadoop1 software]# cd /tmp/kafka-logs/

[root@hadoop1 kafka-logs]# ll

total 20

-rw-r--r--. 1 root root 0 Oct 25 04:42 cleaner-offset-checkpoint

drwxr-xr-x. 2 root root 4096 Oct 25 05:06 test1-0

drwxr-xr-x. 2 root root 4096 Oct 25 05:06 test1-2

-rw-r--r--. 1 root root 54 Oct 25 04:42 meta.properties

-rw-r--r--. 1 root root 18 Oct 25 04:45 recovery-point-offset-checkpoint

-rw-r--r--. 1 root root 18 Oct 25 04:45 replication-offset-checkpoint

[root@hadoop2 ~]# cd /tmp/kafka-logs/

[root@hadoop2 kafka-logs]# ll

total 16

-rw-r--r--. 1 root root 0 Oct 25 04:32 cleaner-offset-checkpoint

drwxr-xr-x. 2 root root 4096 Oct 25 05:06 test1-0

drwxr-xr-x. 2 root root 4096 Oct 25 05:06 test1-1

-rw-r--r--. 1 root root 54 Oct 25 04:32 meta.properties

-rw-r--r--. 1 root root 11 Oct 25 04:45 recovery-point-offset-checkpoint

-rw-r--r--. 1 root root 11 Oct 25 04:46 replication-offset-checkpoint

[root@hadoop3 ~]# cd /tmp/kafka-logs/

[root@hadoop3 kafka-logs]# ll

total 16

-rw-r--r--. 1 root root 0 Oct 25 04:32 cleaner-offset-checkpoint

drwxr-xr-x. 2 root root 4096 Oct 25 05:06 test1-1

drwxr-xr-x. 2 root root 4096 Oct 25 05:06 test1-2

-rw-r--r--. 1 root root 54 Oct 25 04:32 meta.properties

-rw-r--r--. 1 root root 11 Oct 25 04:45 recovery-point-offset-checkpoint

-rw-r--r--. 1 root root 11 Oct 25 04:46 replication-offset-checkpoint

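As the listings show, each partition is stored on two brokers, so when one machine goes down another replica can take over.

C) Inspect the data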

[root@hadoop1 kafka-logs]# cd test1-1/

[root@hadoop1 test1-1]# ls

00000000000000000000.index 00000000000000000000.log

The .index file is the index file; the .log file is the data file.

1-3) Delete a topic

[root@hadoop1 kafka]# kafka-topics.sh --delete --zookeeper hadoop1:2181 --topic h1

Topic h1 is marked for deletion.

Note: This will have no impact if delete.topic.enable is not set to true.

delete.topic.enable=true must be set in server.properties; otherwise the topic is only marked for deletion and is not actually removed.

1-4) Send messages from the shell

[root@hadoop1 kafka]# kafka-console-producer.sh --broker-list hadoop1:9092 --topic kafka1

[2016-10-02 00:33:27,087] WARN Property topic is not valid (kafka.utils.VerifiableProperties)

1-5) Consume messages from the shell

[root@hadoop1 ~]# kafka-console-consumer.sh --zookeeper hadoop1:2181 --from-beginning --topic kafka1

Aaaaaa

Bbbbbbbb

ccccccccccccccc

dddddddddddddddddddd

Eeeeeeeeeeeeeeeeeeeeee

1-6) View consumption information

A) View consumption information

[root@hadoop2 ~]# kafka-console-consumer.sh --zookeeper hadoop1:2181 --from-beginning --topic test1

aaaaa

Bbbbbbbbb

[root@hadoop3 ~]# kafka-console-consumer.sh --zookeeper hadoop1:2181 --topic test1

dddddddddddddddddddd

Eeeeeeeeeeeeeeeeeeeeee

[root@hadoop1 ~]# kafka-console-consumer.sh --zookeeper hadoop1:2181 --from-beginning --topic test1

bbbbbbbbb

aaaaa

dddddddddddddddddddd

cccccccccccccccc

Eeeeeeeeeeeeeeeeeeeeee

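As the output shows, the data is consumed out of order. To guarantee ordering there must be one producer, one partition and one consumer, although this setup is rarely used.

B) View in the web interface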

http://localhost:9092/#/group/console-consumer-1748

http://localhost:9092/#/activetopicsviz

1-7) View the details of a topic

[root@hadoop1 ~]# kafka-topics.sh --topic test1 --describe --zookeeper hadoop1:2181

Topic:test1 PartitionCount:3 ReplicationFactor:2 Configs:

Topic: test1 Partition: 0 Leader: 1 Replicas: 1,2 Isr: 1,2

Topic: test1 Partition: 1 Leader: 2 Replicas: 2,0 Isr: 0,2

Topic: test1 Partition: 2 Leader: 0 Replicas: 0,1 Isr: 0,1

1-8) Change the number of partitions

[root@hadoop1 kafka]# kafka-topics.sh --zookeeper hadoop1 --alter --partitions 15 --topic kafka

WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected

Adding partitions succeeded!

1-9) View the topic's status

[root@hadoop1 /]# kafka-topics.sh --describe --zookeeper hadoop1:2181 --topic kafka

Topic:kafka PartitionCount:15 ReplicationFactor:1 Configs:

Topic: kafka Partition: 0 Leader: 2 Replicas: 2 Isr: 2

Topic: kafka Partition: 1 Leader: 0 Replicas: 0 Isr: 0

Topic: kafka Partition: 2 Leader: 1 Replicas: 1 Isr: 1

Topic: kafka Partition: 3 Leader: 2 Replicas: 2 Isr: 2

Topic: kafka Partition: 4 Leader: 0 Replicas: 0 Isr: 0

Topic: kafka Partition: 5 Leader: 1 Replicas: 1 Isr: 1

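Partition: the Kafka partition

Leader: the node responsible for reads and writes of the given partition

Replicas: the list of nodes that replicate this partition's log

Isr: the "in-sync" replicas, i.e. the subset of replicas that are currently alive and eligible to become Leader.

If one node in the broker list goes down while a producer is sending messages, sending is not affected.

1-10) Balancing leaders in Kafka

Whenever a broker stops or crashes, leadership of its partitions moves to other replicas, which then serve client reads and writes, so Kafka keeps running normally.

Run the command manually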

bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot

Or edit the configuration file

Set the following in server.properties

auto.leader.rebalance.enable=true

KafkaOffsetMonitor Monitoring Tool

A) Download the software

Link: http://pan.baidu.com/s/1eSwUJK2 password: cthf. If the download fails, please contact the author.

B) Edit the launch files

Add the following to kafkaMonitor.bat:

java -cp KafkaOffsetMonitor-assembly-0.2.0.jar com.quantifind.kafka.offsetapp.OffsetGetterWeb --zk hadoop1:2181,hadoop2:2181,hadoop3:2181 --port 9092 --refresh 10.seconds --retain 1.days

On Linux:

[root@hadoop1 kafkaOffsetMonitor]# cat start.sh

java -cp /usr/local/kafkaOffsetMonitor/KafkaOffsetMonitor-assembly-0.2.0.jar com.quantifind.kafka.offsetapp.OffsetGetterWeb --zk hadoop1:2181,hadoop2:2181,hadoop3:2181 --port 9092 --refresh 10.seconds --retain 3.days &

C) Open the web interface

http://localhost:9092/

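topic: the topic name given at creation

partition: the partition number

offset: how many messages of this partition have been consumed

logSize: how many messages have been written to this partition

Lag: how many messages have not yet been consumed

Owner: the consumer

Created: when the partition was created

Last Seen: when the consumption status was last refreshed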
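
The relationship between the columns is simple: Lag is logSize minus offset. A short sketch with made-up sample numbers (the PartitionStatus type and lag_per_partition function are illustrative, not part of KafkaOffsetMonitor):

```python
from typing import Dict, NamedTuple


class PartitionStatus(NamedTuple):
    log_size: int  # messages written to the partition (logSize column)
    offset: int    # messages already consumed by the group (offset column)


def lag_per_partition(status: Dict[int, PartitionStatus]) -> Dict[int, int]:
    """Lag of each partition: messages written but not yet consumed."""
    return {p: s.log_size - s.offset for p, s in status.items()}


# Hypothetical readings for a topic with three partitions.
sample = {0: PartitionStatus(120, 120), 1: PartitionStatus(95, 80), 2: PartitionStatus(40, 0)}
lags = lag_per_partition(sample)
print(lags)                # {0: 0, 1: 15, 2: 40}
print(sum(lags.values()))  # 55
```

A lag that keeps growing on one partition usually points at a slow or missing consumer for that partition.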

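D) Where KafkaOffsetMonitor stores its data

An offsetapp.db file is created in the directory the tool was started from to store the offset information; it can be opened with other tools.

Summary of Questions

1-1) What is Kafka?

Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website.

1-2) Why is a message queue needed?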

**************

1-3) What if data produced to Kafka cannot be consumed?

*********************

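1-4) How does Kafka avoid duplicate consumption?

Duplicates in Kafka are unavoidable: the software is designed around at-least-once delivery. Kafka deletes messages after retaining them for a period (7 days); setting log.cleanup.policy = delete enables this periodic deletion, although it is not a good way to deal with duplicates. Alternatively, put the data collected from Kafka into a Redis set or hash to deduplicate it.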

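1-5) What if Kafka loses data?

A) Causes of loss

1. In sync mode, three acknowledgement levels (request.required.acks = 0, 1, -1) control how safely data is produced. If request.required.acks is set to 1 and the leader partition goes down right after acknowledging, the data is lost.
In async mode, if queue.enqueue.timeout.ms in producer.properties is set to 0, data is dropped immediately once the buffer is full.

2. Data on the broker's disk is lost if the disk fails.

3. Data still in memory that has not been flushed to disk can be lost if the broker goes down.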

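B) Solutions

Producer to broker

Producers push data. Set request.required.acks to 1 so that a lost message is resent; the probability of loss is then small.

Broker to consumer

An offset is written to ZooKeeper only after Storm has acked it, i.e. after the message has been processed successfully, so data is essentially never lost. Even if the spout thread crashes, it can read the corresponding offset from ZooKeeper after restarting.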

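1-6) Why does Kafka have high throughput?

1. On-disk persistence: messages are not cached in application memory but written straight to disk, making full use of the disk's sequential read/write performance

2. Zero-copy: fewer I/O steps

3. Messages are sent in batches

4. Data compression, with gzip or snappy

5. A topic is divided into multiple partitions, which increases parallelism

6. Kafka can produce about 250,000 messages per second (50 MB) and process about 550,000 messages per second (110 MB).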

Quick Learning Big Data -- Kafka Summary (21)
