Kafka Getting Started - I. Introduction

Kafka background

Kafka is an open-source, lightweight, distributed streaming platform. It is a messaging system that partitions its data with replica backups and relies on ZooKeeper for coordination and management of the cluster. It has the following three key features:

  • It allows you to publish and subscribe to streams of data.
  • It stores streams of data in a fault-tolerant way.
  • It lets you process streams of data as they arrive.

The basic structure of Kafka

  • The component that produces messages (producer, Producer): the producer is responsible for creating messages and writing them to the Kafka cluster (a minimal sketch follows this list).
  • The component that consumes messages (consumer, Consumer): the consumer pulls messages from the Kafka cluster.
  • The Kafka cluster: manages, schedules, and load-balances messages.
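
A minimal sketch of the producer role with the official Java client is shown below. It assumes a broker reachable at localhost:9092 and a topic named my-topic, both hypothetical values chosen for illustration rather than taken from this article:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of at least one broker in the cluster (assumed value).
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The producer writes messages to a topic in the Kafka cluster;
        // closing it flushes any messages still buffered in memory.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "hello kafka"));
        }
    }
}
```

The consumer side of this pair is sketched after the concept list below, since it is easier to read once consumer groups and offsets have been introduced.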

Kafka basic concepts

  1. Topic
    Kafka groups a set of messages into a topic (Topic); in other words, a topic is a classification of messages. Producers send messages to a specific topic, and consumers subscribe to a topic, or to certain partitions of a topic, for consumption.
  2. Message
    A message is the basic unit of communication in Kafka. It consists of a fixed-length header and a variable-length message body.
    In old versions each message was called a Message; after the Java client was reimplemented, each message is called a Record.
  3. Partitions and replicas
    Kafka groups a set of messages into a topic, and each topic is further divided into one or more partitions (Partition). Each partition consists of a series of ordered, immutable messages; it is an ordered queue.
    Each partition corresponds to a physical folder on disk, named according to the rule: the topic name, followed by a "-" connector, followed by the partition number. Partition numbers start from 0 and go up to the total number of partitions minus one. Each partition has one or more replicas (Replica), and a partition's replicas are distributed across different brokers in the cluster to improve availability. From a storage point of view, each replica of a partition is abstracted as a log (Log) object, i.e., partition replicas correspond one-to-one with log objects. The number of partitions for each topic can be set in the configuration file that Kafka loads at startup, or specified when the topic is created. Clients can also change a topic's partition count after the topic has been created.
    Partitioning makes concurrent processing easier for Kafka. In theory, the more partitions, the higher the throughput, but the count should be chosen based on the actual cluster environment and business scenario. Partitions are also the foundation on which Kafka guarantees that messages are consumed in order, and on which it balances load.
    Kafka only guarantees message order within a single partition; it does not guarantee order across partitions. Each message appended to a partition is written to disk sequentially, which is very efficient and an important guarantee of Kafka's high throughput. Unlike traditional messaging systems, Kafka does not delete messages immediately after they are consumed. Because disk space is limited, messages cannot be kept forever (nor is that necessary), so Kafka provides two strategies for deleting old data: one based on how long messages have been stored, and one based on partition size. Both strategies can be set in the configuration file.
  4. Leader replicas and Follower replicas
    To keep the data consistent across a partition's multiple replicas, Kafka selects one replica of the partition as the Leader replica; the partition's remaining replicas are Follower replicas. The Leader replica handles read/write requests from clients, while the Follower replicas synchronize data from the Leader replica. The Leader and Follower roles are not fixed: if the Leader replica fails, a new Leader replica is elected from the other Follower replicas by the corresponding election algorithm.
  5. Offset
    Any message published to a partition is appended directly to the end of its log file, and each message's position in the log file corresponds to a sequentially increasing offset. The offset is a strictly ordered logical value within a partition; it does not indicate the message's physical position on disk. Since Kafka almost never needs to read or write messages randomly, it provides no additional indexing mechanism over the stored offsets. A consumer can control which messages it consumes through the offset, e.g., it can specify the starting offset from which to consume. To guarantee that messages are consumed in order, the consumer needs to save the offset of the messages it has already consumed. Note that a consumer's operations on its consumption offset do not affect the offsets of the messages themselves. Old-version consumers save their consumption offsets to ZooKeeper, whereas new-version consumers save them to an internal Kafka topic. Of course, a consumer can also save its consumption offset in an external system instead of in Kafka.
  6. Log segment
    A log is further divided into multiple log segments (LogSegment); the log segment is the minimum slicing unit of a Kafka log object. Like the log, a log segment is a logical concept; each log segment corresponds to one specific log file and two index files on disk. The log file, with the file-name suffix ".log", is the data file that stores the actual message data. The two index files, with the suffixes ".index" and ".timeindex", are the message offset index file and the message timestamp index file, respectively.
  7. Broker
    A Kafka cluster consists of one or more Kafka instances; each instance is called a broker (Broker), also commonly known as a Kafka proxy server (KafkaServer). In a production environment a Kafka cluster typically includes one or more servers, and one or more brokers can be configured on a single server. Each broker has a unique id, which is a non-negative integer. When a broker is added to a Kafka cluster, it must be configured with an id different from those of the other brokers in the cluster; any non-negative integer will do as long as it is unique, and the id identifies the broker for its whole lifetime in the Kafka cluster. It corresponds to the broker.id setting configured at broker startup. Because each broker is assigned a distinct brokerId, broker migration becomes more convenient: it is transparent to consumers and does not affect their consumption of messages.
  8. Producer
    The producer (Producer) is responsible for sending messages to brokers, i.e., it is the client that sends messages to the Kafka cluster.
  9. Consumers and consumer groups
    Consumers (Consumer) pull (pull) data from the cluster; they are the consuming clients. In Kafka, every consumer belongs to a particular consumer group (ConsumerGroup). We can specify a consumer group for each consumer; the group is identified by a groupId representing the consumer group's name, set via the group.id configuration item. If no consumer group is specified, the consumer belongs to the default consumer group test-consumer-group. Each consumer also has a globally unique id, specified via the client.id configuration item; if the client does not specify one, Kafka automatically generates a globally unique id for the consumer, of the form ${groupId}-${hostName}-${timestamp}-${first 8 characters of a UUID}. A message under one topic can be consumed by only one consumer within a given consumer group, but consumers in different consumer groups can consume the same message simultaneously. Consumer groups are Kafka's means of implementing both broadcast and unicast of a topic's messages: to broadcast a message, simply make each consumer belong to a different consumer group; to unicast a message, simply make the consumers belong to the same consumer group. (A minimal consumer sketch using group.id and client.id follows this list.)
  10. ISR
    Kafka dynamically maintains an ISR (In-sync Replica) list in ZooKeeper, i.e., a list of in-sync replicas. The list stores the broker node ids of all replicas whose messages are kept in sync with the Leader replica. If a Follower replica goes down or lags too far behind, its node is removed from the ISR list.
  11. ZooKeeper
    Kafka uses ZooKeeper to store its metadata. Kafka's metadata includes broker node information, Kafka cluster information, consumer information and the consumption offsets of old-version consumers, topic information, partition state information, partition replica assignment information, dynamic configuration information, and so on. At startup or while running, Kafka creates the corresponding nodes on ZooKeeper to store this metadata, and registers listeners on those nodes so that, through ZooKeeper's watch mechanism, it can track changes to the node metadata. ZooKeeper takes on the management and maintenance of the Kafka cluster, and through ZooKeeper we can conveniently scale the Kafka cluster horizontally and migrate data.
    [Figure: Kafka cluster structure diagram]
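
To tie several of these concepts together, the consumer-side counterpart of the earlier producer sketch is shown below: it pulls messages from the cluster, names its consumer group via group.id and its client id via client.id (concept 9), and prints the partition and offset of each record (concept 5). The broker address and topic name are the same hypothetical values as before:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Consumer group name and client id, as described in concept 9.
        props.put("group.id", "test-consumer-group");
        props.put("client.id", "consumer-1");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // When the group has no saved offset yet, start from the earliest
        // message still retained in the partition.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            // A real application polls in an endless loop; a few rounds are
            // enough for a sketch.
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record carries the partition it came from and its
                    // offset within that partition (concept 5).
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Running a second instance of this consumer with the same group.id would split the topic's partitions between the two instances (unicast within a group); giving the second instance a different group.id would make both receive every message (broadcast across groups).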


Source: blog.csdn.net/licheng989/article/details/90232958