Kafka Getting Started: Introduction, Usage Scenarios, Design Principles, Main Configuration and Cluster Setup (reprint)

 
Guiding questions:
1. What role does zookeeper play in kafka?
2. Why does kafka almost never allow "random access" to messages?
3. Where does a kafka cluster store consumer and producer state information?
4. What is the fundamental purpose of the partitions design?

 

I. Getting Started
    1、Introduction
    Kafka is a distributed, partitioned, replicated commit log service. It provides functionality similar to JMS, but its design and implementation are completely different, and it is not an implementation of the JMS specification. In kafka, messages are stored and classified by Topic; the party that sends messages is called the Producer and the party that receives them the Consumer. A kafka cluster consists of multiple kafka instances, and each instance (server) is called a broker. The kafka cluster as well as the producers and consumers all rely on zookeeper to store metadata and to guarantee the availability of the system.
 
   2、Topics/logs
    A Topic can be thought of as a category of messages. Each topic is divided into multiple partitions, and at the storage level each partition is an append-only log file. Any message published to a partition is appended to the end of its log file. The position of a message in the file is called its offset, a long integer that uniquely identifies the message. kafka does not provide any additional index structure for looking messages up by other criteria, because messages in kafka are almost never "randomly accessed".

 


 

    kafka differs from JMS implementations (such as ActiveMQ): even after a message has been consumed it is not deleted immediately. Log files are only removed after the retention period configured on the broker; for example, if log files are retained for two days, then two days after a file was written it is deleted, regardless of whether its messages have been consumed. kafka frees disk space with this simple mechanism, and it also avoids the disk IO that would be needed to modify the file contents to mark messages as consumed.
 
    For the consumer side, it is the consumer that needs to save the offset of the messages it has consumed; saving and using the offset is entirely under the consumer's control. When a consumer consumes messages normally, the offset advances "linearly", i.e. messages are consumed in order. In fact a consumer may consume messages in any order; it only needs to reset the offset to an arbitrary value. (The offset is saved in zookeeper, see below.)
 
    The kafka cluster keeps almost no state about consumers and producers; that information is stored in zookeeper. Producer and consumer clients are therefore very lightweight: they can join and leave at will without any extra impact on the cluster.
 
    The most fundamental reason for designing topics with multiple partitions is that kafka stores messages in files. Partitioning spreads the log contents across multiple servers, so that no single file hits the disk-size limit of one machine; each partition is held by the server (kafka instance) it currently lives on. In addition, a topic split into many partitions can be written to and consumed more efficiently, and more partitions also means more consumers can attach, which effectively increases the capacity for concurrent consumption (see below for the exact mechanism).
 
    3、Distribution
    The partitions of a Topic are distributed over multiple servers in the kafka cluster; each server (kafka instance) handles the reads and writes of the partitions it holds. In addition, kafka can be configured with a number of replicas; each partition is then backed up on several machines to improve availability.
 
    Once partitions are replicated, the replicas have to be coordinated: each partition has one server acting as its "leader". The leader handles all read and write operations for the partition; if the leader fails, another follower takes over and becomes the new leader. A follower simply follows the leader and keeps its messages in sync. The server acting as leader carries the whole request load for that partition, so from the point of view of the cluster as a whole, the number of partitions determines how many "leaders" there are; kafka spreads the leaders evenly over the instances to keep the overall performance stable.
 
    Producers
    A Producer publishes messages to the Topic of its choice, and the Producer also decides which partition each message goes to, for example in "round-robin" fashion or according to some other algorithm.
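    As an illustration, here is a minimal producer sketch. It assumes the 0.8-era Java producer API (kafka.javaapi.producer.Producer); the broker address, topic name and key are placeholders. When a key is supplied, the default partitioner hashes it to pick the partition.

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "192.168.0.1:9091");        // placeholder broker list
            props.put("serializer.class", "kafka.serializer.StringEncoder");

            Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
            // the key ("user-42") is hashed by the partitioner to choose a partition
            producer.send(new KeyedMessage<String, String>("my-replicated-topic", "user-42", "hello kafka"));
            producer.close();
        }
    }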
 
    Consumers
    Essentially kafka only supports Topic-based messaging. Each consumer belongs to one consumer group; conversely, each group may contain several consumers. For a message sent to a Topic, only one consumer in each group subscribed to that Topic will consume it.
 
    If all consumers have the same group, this behaves like a classic queue: messages are load-balanced across the consumers.
    If all consumers have different groups, this is "publish-subscribe": a message is broadcast to all consumers.
    In kafka, a message in a partition is consumed by only one consumer of a group, and consumers in different groups consume independently of each other. Seen from the Topic's partitions, a group acts as a single "subscriber"; a partition is consumed by only one consumer in the group, while one consumer may consume messages from several partitions. kafka only guarantees that the messages within one partition are consumed in order; seen across the whole Topic, messages are not ordered.
 
    It follows from kafka's design that, for a given topic, the number of consumers of the same group consuming at the same time cannot exceed the number of partitions; otherwise some consumers will never receive any messages. For example, with 4 partitions, at most 4 consumers of one group consume concurrently, and a 5th consumer in that group stays idle.
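    A minimal sketch of a consumer in a group, assuming the 0.8-era high-level consumer API (kafka.javaapi.consumer.ConsumerConnector); the zookeeper address, group id and topic name are placeholders. Running two copies of this program with the same group.id gives queue semantics (the partitions are split between them); giving each copy a different group.id gives publish-subscribe.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class ConsumerGroupSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "192.168.0.1:2181");  // placeholder zookeeper address
            props.put("group.id", "group-a");                    // consumers sharing this id split the partitions

            ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            // request one stream for the topic
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                    connector.createMessageStreams(Collections.singletonMap("my-replicated-topic", 1));
            ConsumerIterator<byte[], byte[]> it = streams.get("my-replicated-topic").get(0).iterator();
            while (it.hasNext()) {
                System.out.println(new String(it.next().message()));
            }
        }
    }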
 
    Guarantees
    1) Messages sent to a partition are appended to its log in the order in which they are received.
    2) Consumers see messages in the same order as they are stored in the log.
    3) If a Topic has a "replication factor" of N, kafka tolerates the failure of up to N-1 instances holding it.
 
II. Usage Scenarios
 
    1、Messaging   
    kafka is a good alternative to a conventional messaging system; partitions/replication and fault tolerance give kafka good scalability and a performance advantage. However, it should be well understood that, so far, kafka does not provide the JMS-style enterprise features such as "transactions", "message acknowledgement" based delivery guarantees or "message grouping"; kafka can only be used as a "normal" messaging system, and to a certain degree the sending and receiving of messages is not absolutely reliable (messages may, for example, be retransmitted or lost).
 
    2、Website activity tracking
    kafka is an excellent tool for "website activity tracking": page views and user actions can be published to kafka and then monitored in real time, or analysed offline and aggregated statistically.

 

    3、Log Aggregation
    kafka's characteristics make it very suitable as a "log collection center". Applications can send their operation logs to the kafka cluster "in bulk" and "asynchronously" instead of saving them locally or in a database; kafka supports batched and compressed message submission, so on the producer side the logging overhead is barely noticeable. On the consumer side the logs can then be handed to other systems, such as hadoop, for storage and analysis.
 
III. Design Principles
 
    kafka was designed to be a unified platform for collecting and delivering information in real time, able to support very large volumes of data and to tolerate failures well.
 
    1、Persistence
    kafka stores messages in files, which means that its performance depends heavily on the characteristics of the file system itself, and on any OS there is little room to optimise beyond what the file system already provides; file caching and direct memory mapping are the usual techniques. Because kafka only appends to its log files, the cost of disk seeks is small; and to reduce the number of disk writes, the broker temporarily buffers messages and only flushes them to disk when their number (or size) reaches a threshold, which reduces the number of disk IO calls.

2、Performance
    Many factors affect performance; besides disk IO, network IO must also be considered, since it directly affects kafka's throughput. kafka does not use any particularly clever tricks here. On the producer side, messages can be buffered and, once their number reaches a threshold, sent to the broker as a batch; the consumer side works the same way, fetching several messages per request (the fetch size can be specified in the configuration). On the broker side, kafka makes use of the sendfile system call, which can greatly improve network IO performance: the data file is mapped into system memory and the socket reads that memory region directly, without copying the data through user space again. For the producer, consumer and broker alike, CPU usage should not be large, so enabling message compression is a good strategy: compression costs a small amount of CPU, while for kafka network IO is the resource to watch; messages can be compressed before they are sent over the network. kafka supports gzip, snappy and other compression codecs.
 
    3、Producer
    Load balancing: the producer keeps socket connections to the partition leaders of the Topic; messages are sent by the producer directly to the broker over these sockets, without passing through any intermediate "routing layer". Which partition a message is routed to is decided on the producer client side, for example by "random", "key-hash" or "round-robin" strategies; if a topic has several partitions, it is the producer's job to distribute messages evenly across them.
 
    The position (host:port) of each partition leader is registered in zookeeper; the producer, acting as a zookeeper client, registers a watch to listen for partition leader change events.
    Asynchronous sending: multiple messages are temporarily buffered on the client and then sent to the broker in batches. Sending many small pieces of data one by one over the network increases latency and slows the whole system down, so delaying the sends and batching them actually improves network efficiency. It does have a drawback: if the producer fails, buffered messages that have not yet been sent are lost.
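    A hedged sketch of the producer-side settings this describes, assuming the 0.8-era producer configuration keys (producer.type, batch.num.messages, queue.buffering.max.ms, compression.codec); the values are examples only, and the Properties object is the one built in the producer sketch in section I.

    Properties props = new Properties();
    props.put("metadata.broker.list", "192.168.0.1:9091");   // placeholder broker list
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    props.put("producer.type", "async");           // buffer messages on the client and send them in batches
    props.put("batch.num.messages", "200");        // flush a batch once this many messages are buffered...
    props.put("queue.buffering.max.ms", "500");    // ...or after this much time has passed
    props.put("compression.codec", "gzip");        // trade a little CPU for less network IO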

 

    4、Consumer
    The consumer sends "fetch" requests to the broker, telling it the offset from which it wants to consume; the consumer then receives a certain number of messages starting at that offset. The consumer side can also reset the offset and consume messages again.
 
    In JMS implementations, the Topic model is push based: the broker pushes messages to the consumer. In kafka, a pull model is used instead: after the consumer has established a connection with the broker, it actively pulls (fetches) messages. This model has some advantages: the consumer can fetch messages at a pace that matches its own processing capacity, and it controls its own consumption progress (offset); it can also control how many messages it consumes at a time by batching the fetches.
 
    In other JMS implementations, the consumption position of each message is kept by the provider, which must also track message state in order to avoid duplicate sends or to retransmit messages that were not consumed successfully; this demands a lot of extra work from the JMS broker. In kafka, a message in a partition is consumed by only one consumer of a group, no per-message state is kept, and there is no complicated acknowledgement scheme, so the kafka broker side is quite lightweight. After receiving messages, a consumer can save the offset of the last consumed message locally and register the offset with zookeeper from time to time. The consumer client is therefore also very lightweight.
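    A hedged sketch of the consumer-side settings involved here, assuming the 0.8-era high-level consumer configuration keys; the values and addresses are placeholders, and these properties would go into the ConsumerConfig used in the sketch in section I.

    Properties props = new Properties();
    props.put("zookeeper.connect", "192.168.0.1:2181");  // placeholder zookeeper address
    props.put("group.id", "group-a");                    // placeholder group id
    props.put("auto.commit.enable", "true");             // register consumed offsets with zookeeper periodically
    props.put("auto.commit.interval.ms", "60000");       // the "from time to time" offset registration above
    props.put("auto.offset.reset", "smallest");          // where to start when no offset has been recorded yet
    props.put("fetch.message.max.bytes", "1048576");     // bound on the bytes fetched per partition per request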
 



    5、Message delivery semantics
    For JMS implementations the delivery guarantee is very straightforward: exactly once, no more and no less. kafka is slightly different:
    1) at most once: similar to "non-persistent" messages in JMS; a message is sent once and, whether that succeeds or fails, it is never resent.
    2) at least once: a message is sent at least once; if it is not received successfully it may be resent until it is.
    3) exactly once: a message is delivered once and only once.
    at most once: the consumer fetches messages, then saves the offset, then processes the messages. If the client saves the offset but an exception occurs during processing, some messages are never fully processed; those "unprocessed" messages cannot be fetched again afterwards. That is "at most once".
    at least once: the consumer fetches messages, processes them, then saves the offset. If processing succeeds but a zookeeper failure prevents the offset from being saved, the next fetch may return messages that have already been processed. That is "at least once"; the cause is that the offset was not committed to zookeeper in time, so when zookeeper recovers it still holds the old offset.
    exactly once: kafka does not implement this strictly (it would require two-phase commit or transactions); we consider that strategy unnecessary in kafka.
    Usually "at least once" is what we choose (compared with at most once, receiving duplicate data is better than losing data).
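    A minimal sketch of the difference between the two orderings, in plain Java; Message, fetch, process, saveOffset and offset are hypothetical helpers standing in for the real consumer calls.

    // Alternative 1, "at most once": save the offset before processing.
    // If process() fails after saveOffset(), the failed messages are never fetched again (lost).
    List<Message> first = fetch(offset);
    saveOffset(offset + first.size());
    process(first);

    // Alternative 2, "at least once": process before saving the offset.
    // If saveOffset() fails after process(), the next fetch re-reads the same batch (duplicates).
    List<Message> second = fetch(offset);
    process(second);
    saveOffset(offset + second.size());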
 
    6、Replication
    kafka replicates the data of each partition onto several servers. Every partition has one leader and zero or more followers; the number of replicas can be set in the broker configuration. The leader handles all read and write requests, and the followers have to stay in sync with the leader. A follower behaves like a consumer: it consumes messages from the leader and saves them in its own log. The leader tracks the state of all its followers; if a follower "lags" too far behind or fails, the leader removes it from the in-sync replica list. A message counts as "committed" only when all in-sync followers have saved it successfully, and only then can consumers consume it. Even if only one replica is alive, messages can still be sent and received normally, as long as the zookeeper cluster is alive (unlike other distributed stores such as hbase, which need a "majority" to stay alive).
    When the leader fails, a new leader has to be chosen from the followers; a follower may be lagging behind the leader, so an "up-to-date" follower should be chosen. Choosing the new leader also has to take another issue into account, namely how many partition leaders the candidate server already hosts: a server holding too many partition leaders carries more IO pressure, so "load balancing" has to be considered when electing the new leader.
 
    7、Log
    If a topic is named "my_topic" and has 2 partitions, its logs are kept in the two directories my_topic_0 and my_topic_1. A log file stores a sequence of "log entries"; each log entry has the format "a 4-byte number N giving the message length" followed by "N bytes of message content". Every message is uniquely identified by an offset, an 8-byte number giving the message's starting position within the partition. On disk, each partition consists of several log files called segments; a segment file is named after the smallest offset it contains, with the ".kafka" suffix, e.g. "00000000000.kafka", where the "smallest offset" is the offset of the first message in that segment.
    The list of segments held by each partition is stored in zookeeper.
    When a segment file reaches a size threshold (configurable, 1G by default), a new segment file is created. When the number of buffered messages reaches a threshold, they are flushed to the log file; a flush is also triggered when the time since the last flush exceeds a threshold. If the broker fails, messages that have not yet been flushed to file are very likely to be lost. Because an unexpected server crash may also corrupt the tail of the log file, at startup the server has to check whether the structure of the last segment is valid and repair it if necessary.
    To fetch messages, the client specifies an offset and a maximum chunk size. The offset marks the starting position of the messages; the chunk size bounds the total length of the messages returned (and thereby, indirectly, their number). From the offset the segment file containing the message can be located; subtracting the segment's smallest offset gives the message's relative position inside that file, from which it can be read and returned directly.
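    A small illustration of the lookup just described (not kafka's actual code), assuming a sorted map from each segment's base offset to its file:

    import java.io.File;
    import java.util.Map;
    import java.util.TreeMap;

    public class SegmentLookup {
        // base (smallest) offset of each segment -> segment file
        private final TreeMap<Long, File> segments = new TreeMap<Long, File>();

        public void addSegment(long baseOffset, File file) {
            segments.put(baseOffset, file);
        }

        // Returns the base offset of the segment holding `offset` and the relative position inside it.
        public long[] locate(long offset) {
            Map.Entry<Long, File> entry = segments.floorEntry(offset); // greatest base offset <= offset
            if (entry == null) {
                throw new IllegalArgumentException("offset " + offset + " is older than the earliest segment");
            }
            long relative = offset - entry.getKey(); // position relative to the start of the segment
            return new long[] { entry.getKey(), relative };
        }
    }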
    The deletion policy for log files is very simple: a background thread periodically scans the list of log files and deletes the files whose retention time has been exceeded (based on the file creation time). To avoid deleting a file that is still being read (consumed), a copy-on-write style approach is used.
 
    8、Coordination through zookeeper
    kafka uses zookeeper to store meta information, and uses zookeeper's watch mechanism to detect changes to that meta information and react accordingly (for example when a consumer fails, a rebalance is triggered).
    1) Broker node registry: when a kafka broker starts, it first registers its own node information in zookeeper (an ephemeral znode); when the broker loses its connection to zookeeper, the znode is deleted again.
    Format: /brokers/ids/[0...N] --> host:port, where [0..N] is the broker id. Every broker's configuration file must specify a numeric id (globally unique); the value of the znode is the broker's host:port.
    2) Broker Topic Registry: when a broker starts, it registers the topics and partitions it holds in zookeeper, again as ephemeral znodes.
    Format: /brokers/topics/[topic]/[0...N], where [0..N] is the partition index.
    3) Consumer and Consumer group: when a consumer client is created, it registers its own information in zookeeper; this is mainly used for "load balancing".
    Several consumers in one group can consume all partitions of a topic in an interleaved fashion; in short, all partitions of the topic are consumed by the group, and for performance the partitions are spread relatively evenly over the consumers.
    4) Consumer id Registry: each consumer has a unique id (host:uuid, either specified in the configuration file or generated by the system) used to identify the consumer.
    Format: /consumers/[group_id]/ids/[consumer_id]
    This is again an ephemeral znode; its value is {"topic_name": #streams...}, i.e. the list of topics and partitions the consumer currently consumes.
    5) Consumer offset Tracking: tracks the largest offset consumed so far by a consumer in each partition.
    Format: /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_value
    This znode is persistent. Note that the offset is keyed by group_id, so that when one consumer of the group fails, the other consumers can continue consuming from where it left off.
    6) Partition Owner registry: records which consumer is currently consuming a partition; an ephemeral znode.
    Format: /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] --> consumer_node_id
    When a consumer starts, it performs the following steps:
    A) first the "Consumer id Registry" described above;
    B) then it registers a watch under the "Consumer id Registry" node to listen for other consumers in the current group "leaving" and "joining"; any change to the children of this znode path triggers a rebalance of the consumers in the group (for example, if one consumer fails, the other consumers take over its partitions);
    C) under the "Broker id registry" node it registers a watch to monitor broker liveness; if the broker list changes, the consumers of all groups rebalance.
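    A hedged illustration of the watch in step B, using the plain ZooKeeper Java client; the group name and paths follow the formats above and are placeholders, and this is not kafka's own rebalancing code.

    import java.util.List;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class GroupMembershipWatch {
        public static void main(String[] args) throws Exception {
            Watcher rebalanceWatcher = new Watcher() {
                public void process(WatchedEvent event) {
                    // the children of the ids path changed: a consumer joined or left,
                    // which is where a rebalance would be triggered (and the one-shot watch re-registered)
                    System.out.println("group membership changed: " + event.getPath());
                }
            };
            ZooKeeper zk = new ZooKeeper("192.168.0.1:2181", 30000, rebalanceWatcher);
            // consumer id registry of a hypothetical group, following the format listed above
            List<String> members = zk.getChildren("/consumers/group-a/ids", rebalanceWatcher);
            System.out.println("current members: " + members);
        }
    }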
    1) The Producer uses zookeeper to "discover" the broker list, and to establish socket connections with the leader of each partition of the Topic and send messages to them.
    2) The Broker uses zookeeper to register its broker information and to monitor the liveness of partition leaders.
    3) The Consumer uses zookeeper to register its consumer information (including the list of partitions it consumes), to discover the broker list, to establish socket connections with the partition leaders, and to fetch messages.
 
IV. Main Configuration
 
    1、Broker configuration
    (the broker configuration table appears as an image in the original post)

    2、Main Consumer configuration
    (the consumer configuration table appears as an image in the original post)

    3、Main Producer configuration
    (the producer configuration table appears as an image in the original post)

 

 
The above is a basic description of kafka. From it we know that for kafka to run properly, zookeeper must be configured; otherwise neither the kafka cluster nor the producers and consumers on the client side can work. Below is a brief introduction to zookeeper:

 

V. ZooKeeper Cluster
    zookeeper is software that provides consistency services to distributed applications. It is a sub-project of the open-source Hadoop project and was implemented according to a paper published by Google. zookeeper offers efficient, easy-to-use coordination services to distributed systems, such as unified naming, configuration management, state synchronisation and group services. Its interface is simple, so we do not have to spend effort on the synchronisation and consistency problems that are hard to handle in distributed programming; we can use the off-the-shelf services zookeeper provides to implement configuration management, group management, leader election and other functions of a distributed system.
    To install the zookeeper cluster, prepare three servers: server1: 192.168.0.1, server2: 192.168.0.2,
    server3: 192.168.0.3.
    1) Download zookeeper
    Download the latest release, zookeeper-3.4.5.tar.gz, from http://zookeeper.apache.org/releases.html and save it to the ~ directory on server1.
    2) Install zookeeper
    Perform steps a-c on each of the servers first.
    a) Unpack
    tar -zxvf zookeeper-3.4.5.tar.gz
    After unpacking, a new directory zookeeper-3.4.5 appears under ~; rename it to zookeeper.
    b) Configure
    Copy conf/zoo_sample.cfg to a file named zoo.cfg in the same conf directory, then change the settings in it to the following values:
   
    # The number of milliseconds of each tick
    tickTime=2000
    # The number of ticks that the initial
    # synchronization phase can take
    initLimit=10
    # The number of ticks that can pass between
    # sending a request and getting an acknowledgement
    syncLimit=5
    # the directory where the snapshot is stored.
    # do not use /tmp for storage, /tmp here is just
    # example sakes.
    dataDir=/home/wwb/zookeeper/data
    dataLogDir=/home/wwb/zookeeper/logs
    # the port at which the clients will connect
    clientPort=2181
    #
    # Be sure to read the maintenance section of the
    # administrator guide before turning on autopurge.
    #
    # The number of snapshots to retain in dataDir
    #autopurge.snapRetainCount=3
    # Purge task interval in hours
    # Set to "0" to disable auto purge feature
    #autopurge.purgeInterval=1
    server.1=192.168.0.1:3888:4888
    server.2=192.168.0.2:3888:4888
    server.3=192.168.0.3:3888:4888
    tickTime: the basic time unit, in milliseconds, used for heartbeats between Zookeeper servers or between clients and servers; a heartbeat is sent every tickTime.
    dataDir: as the name suggests, the directory in which Zookeeper saves its data; by default Zookeeper also writes its transaction log files to this directory.
    clientPort: the port on which Zookeeper listens for client connections and accepts client requests.
    initLimit: how many heartbeat intervals (ticks) Zookeeper allows for the initial connection of the "clients" in the cluster sense, i.e. the Follower servers connecting to the Leader, not user clients. If the Leader has not received a response after more than 10 heartbeats (tickTime), the connection attempt is considered failed; the total time allowed is 10*2000 = 20 seconds.
    syncLimit: the maximum number of tickTime intervals allowed for a request and its reply between the Leader and a Follower; here the total is 5*2000 = 10 seconds.
    server.A=B:C:D: A is a number identifying which server this is; B is the server's IP address; C is the port this server uses to exchange information with the Leader of the cluster; D is the port used to run a new leader election in case the Leader fails. In a pseudo-cluster configuration B is the same for every entry, so the different Zookeeper instances must be given distinct port numbers.
Note: wwb in dataDir and dataLogDir is the login user name; the data and logs directories do not exist at first and have to be created with mkdir. In the data directory, create a file named myid; its content must be 1 on server1, 2 on server2 and 3 on server3.
For server2 and server3 the zookeeper directory from server1 can be copied to the corresponding location, but mind the dataDir and dataLogDir directories and set the content of myid to 2 and 3 respectively.
    3) Start zookeeper on server1, server2 and server3 in turn:
    /home/wwb/zookeeper/bin/zkServer.sh start, which prints something like the following:
    JMX enabled by default
    Using config: /home/wwb/zookeeper/bin/../conf/zoo.cfg
    Starting zookeeper ... STARTED
    4) Test whether zookeeper is working properly; run the following command on server1:
    /home/wwb/zookeeper/bin/zkCli.sh -server 192.168.0.2:2181, which prints something like the following:
    JLine support is enabled
    2013-11-27 19:59:40,560 - INFO      [main-SendThread(localhost.localdomain:2181):ClientCnxn$SendThread@736]- Session   establishmentcomplete on server localhost.localdomain/127.0.0.1:2181, sessionid =    0x1429cdb49220000, negotiatedtimeout = 30000
 
    WATCHER::
   
    WatchedEvent state:SyncConnected type:None path:null
    [zk: 127.0.0.1:2181(CONNECTED) 0] [root@localhostzookeeper2]#  
    This means the cluster was built successfully. If an error appears, the cluster was probably not started properly in step 3. First use
    ps aux | grep zookeeper to check whether the zookeeper processes exist; if not, the cluster failed to start. In that case run
    /home/wwb/zookeeper/bin/zkServer.sh stop on each server and then /home/wwb/zookeeper/bin/zkServer.sh start again; after that, step 4 normally works. If there is still a problem, stop first and then try ./bin/zkServer.sh start from the directory above bin.
 
Note: in a zookeeper cluster, more than half of the machines must be available for zookeeper to provide service.
 
VI. Kafka Cluster
(using server1, server2 and server3 from above; server1 is used as the example below)
    1) Download kafka 0.8 (http://kafka.apache.org/downloads.html) and save kafka-0.8.0-beta1-src.tgz (kafka_2.8.0-0.8.0-beta1.tgz) to the /home/wwb directory on the server.
    2) Unpack: tar -zxvf kafka-0.8.0-beta1-src.tgz, which produces the directory kafka-0.8.0-beta1-src; rename it to kafka01.
3) Configure
    Edit kafka01/config/server.properties. broker.id, log.dirs and zookeeper.connect must be changed to match your environment; the remaining settings can be tuned as needed. Roughly:
     broker.id=1  
     port=9091  
     num.network.threads=2  
     num.io.threads=2  
     socket.send.buffer.bytes=1048576  
    socket.receive.buffer.bytes=1048576  
     socket.request.max.bytes=104857600  
    log.dir=./logs  
    num.partitions=2  
    log.flush.interval.messages=10000  
    log.flush.interval.ms=1000  
    log.retention.hours=168  
    #log.retention.bytes=1073741824  
    log.segment.bytes=536870912  
    num.replica.fetchers=2  
    log.cleanup.interval.mins=10  
    zookeeper.connect=192.168.0.1:2181,192.168.0.2:2182,192.168.0.3:2183  
    zookeeper.connection.timeout.ms=1000000  
    kafka.metrics.polling.interval.secs=5  
    kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter  
    kafka.csv.metrics.dir=/tmp/kafka_metrics  
    kafka.csv.metrics.reporter.enabled=false
 
4) Initialise. kafka is written in Scala, so the Scala build environment has to be prepared before kafka can be run:
    > cd kafka01  
    > ./sbt update  
    > ./sbt package  
    > ./sbt assembly-package-dependency
The second command may take a while because it downloads and updates a number of dependencies, so please be patient.
5) Start kafka01
    >JMX_PORT=9997 bin/kafka-server-start.sh config/server.properties &  
a) kafka02 is set up in the same way as kafka01, except for the following:
    edit kafka02/config/server.properties
    broker.id=2
    port=9092
    ## the other settings stay the same as on kafka01
    Start kafka02:
    JMX_PORT=9998 bin/kafka-server-start.sh config/server.properties &  
b) kafka03 is set up in the same way as kafka01, except for the following:
    edit kafka03/config/server.properties
    broker.id=3
    port=9093
    ## the other settings stay the same as on kafka01
    Start kafka03:
    JMX_PORT=9999 bin/kafka-server-start.sh config/server.properties &
6) Create a Topic (with one partition and three replicas)
    >bin/kafka-create-topic.sh --zookeeper 192.168.0.1:2181 --replica 3 --partition 1 --topic my-replicated-topic
7) Check the topic
    >bin/kafka-list-topic.sh --zookeeper 192.168.0.1:2181
    topic: my-replicated-topic  partition: 0 leader: 1  replicas: 1,2,0  isr: 1,2,0
8) Start a producer
   >bin/kafka-console-producer.sh --broker-list 192.168.0.1:9091 --topic my-replicated-topic
    my test message1
    my test message2
    ^C
9) Start a consumer
    >bin/kafka-console-consumer.sh --zookeeper 127.0.0.1:2181 --from-beginning --topic my-replicated-topic
    ...
    my test message1
    my test message2
^C
10) Kill the broker on server1
  >pkill -9 -f config/server.properties
11) Check the topic
  >bin/kafka-list-topic.sh --zookeeper 192.168.0.1:2181
  topic: my-replicated-topic  partition: 0 leader: 1  replicas: 1,2,0  isr: 1,2,0
The topic still exists normally.
12) Start a consumer again to check whether the messages can still be read
    >bin/kafka-console-consumer.sh --zookeeper 192.168.0.1:2181 --from-beginning --topic my-replicated-topic
    ...
    my test message 1
    my test message 2
    ^C
This shows that everything works as expected.
 
OK, the above is my personal understanding of Kafka; please point out anything that is incorrect.
 
 
Supplementary notes:
1、public Map<String, List<KafkaStream<byte[], byte[]>>> createMessageStreams(Map<String, Integer> topicCountMap): the key of the Map parameter is the topic name and the value is the number of partitions for that topic. For example, if the corresponding topic does not yet exist in kafka, a topic is created with value partitions; if the topic already exists, the value has little effect here.
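For example (reusing the 0.8-era consumer connector from the sketch in section I; the topic name and stream count are placeholders):

    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put("my-replicated-topic", 2);   // ask for 2 streams for this topic
    Map<String, List<KafkaStream<byte[], byte[]>>> streams = connector.createMessageStreams(topicCountMap);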

 

2、To have the producer send data to a specific partition, set the partitioner.class property to specify which partitioner routes the data; if you want your own routing, you must write the corresponding class. The default is kafka.producer.DefaultPartitioner, which partitions based on a hash of the key.
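A minimal sketch of such a custom partitioner, assuming the 0.8-era kafka.producer.Partitioner interface; the constructor taking VerifiableProperties is assumed to be required so the producer can instantiate the class named by partitioner.class.

    import kafka.producer.Partitioner;
    import kafka.utils.VerifiableProperties;

    public class ModuloPartitioner implements Partitioner {
        public ModuloPartitioner(VerifiableProperties props) {
            // no custom properties are needed for this sketch
        }

        public int partition(Object key, int numPartitions) {
            // route messages with the same key to the same partition
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
    }

Setting partitioner.class to the fully qualified name of this class in the producer properties would make the producer use it instead of kafka.producer.DefaultPartitioner.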

 

3、When several consumers read the same topic and each message should be read by only one of them, these consumers must be given the same group_id; that builds a queue-like structure. If their group_ids differ, the behaviour is a broadcast-like structure instead.

 

4、In the consumer API, numeric parameters such as the Integer in Map<String, Integer> or numStreams mean that if the topic does not exist, a topic is created with that number of partitions; note that if the number is larger than the num.partitions setting in the broker configuration, the partitions are created according to num.partitions instead.
 
5、In the producer API, calling send for a topic that does not exist also creates the topic; this method has no parameter for the number of partitions, so in this case the partition count is determined by the broker's num.partitions setting.
 
For more about kafka, see: http://kafka.apache.org/documentation.html

 

Origin: www.cnblogs.com/momoyan/p/11616402.html