Kafka rookie study notes

Kafka:

  • Runs in cluster mode; even a single node is treated as a cluster
  • A distributed messaging system coordinated by Zookeeper, and a distributed streaming platform, not just a message queue
  • Characterized by high throughput, high performance, real-time processing, and high reliability

 

Basic concepts:

  •     Broker: an independent Kafka server that receives messages from producers
  •     Broker cluster: a cluster composed of several brokers
  •     Topic: a virtual concept representing a category of messages; a topic can have multiple partitions, and the partitions are stored on different brokers
  •     Partition: the actual message storage unit
  •     Producer: message producer
  •     Consumer: message consumer

Five major APIs (the first three are the most commonly used):

  1.   Producers
  2.   Consumers
  3.   Stream Processors
  4.   Connectors
  5.   Admin

AdminClient API (a usage sketch follows this list):

  •     AdminClient: creates an AdminClient client object
  •     NewTopic: describes a Topic to be created
  •     CreateTopicsResult: the result of creating Topics
  •     ListTopicsResult: the result of listing Topics
  •     ListTopicsOptions: options for listing Topics
  •     DescribeTopicsResult: the result of describing Topics
  •     DescribeConfigsResult: the result of describing Topic configuration items
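A minimal sketch putting these classes together (the broker address and topic name are placeholder assumptions; run inside a main that declares throws Exception):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;

// create the client
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
AdminClient adminClient = AdminClient.create(props);

// create a topic with 1 partition and replication factor 1
NewTopic newTopic = new NewTopic("my-topic", 1, (short) 1);
CreateTopicsResult createResult = adminClient.createTopics(Collections.singleton(newTopic));
createResult.all().get(); // block until the broker confirms

// list topics, including internal ones
ListTopicsOptions options = new ListTopicsOptions().listInternal(true);
ListTopicsResult listResult = adminClient.listTopics(options);
System.out.println(listResult.names().get());

// describe the topic and its configuration items
DescribeTopicsResult topicsResult = adminClient.describeTopics(Collections.singleton("my-topic"));
System.out.println(topicsResult.all().get());
ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
DescribeConfigsResult configsResult = adminClient.describeConfigs(Collections.singleton(resource));
System.out.println(configsResult.all().get());

adminClient.close();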

Producer API:

Sending modes:

Synchronous sending, asynchronous sending, and asynchronous sending with a callback (sketched below).
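A sketch of the three modes against the producer API (broker address and topic name are assumptions; assumes a main that declares throws Exception):

import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.*;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "key", "value");

// 1. asynchronous: send() returns a Future immediately
Future<RecordMetadata> future = producer.send(record);

// 2. synchronous: block on the Future until the broker acknowledges
RecordMetadata metadata = producer.send(record).get();

// 3. asynchronous with callback: invoked when the send completes or fails
producer.send(record, (meta, exception) -> {
    if (exception != null) exception.printStackTrace();
    else System.out.println("sent to offset " + meta.offset());
});
producer.close();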

KafkaProducer:

  1. Read the MetricConfig (metrics configuration)
  2. Load the partitioner, which acts as the load balancer across partitions
  3. Initialize the Serializers
  4. Initialize the RecordAccumulator, a buffer that accumulates records into batches
  5. Start the Sender (via newSender) as a daemon thread

Notes from the source code:

  1. The Producer is thread-safe
  2. The Producer does not send each message immediately as it arrives
  3. The Producer sends in batches, reducing IO operations (a large amount of data is written at once), appending to the log files

producer.send(record):

  1. Compute the partition: which partition the message goes into
  2. Compute the batch: accumulator.append() adds the record to the batch waiting to be sent
  3. Main work:
    1. Create a batch
    2. Append the record to the batch

Producer sending principle analysis:

  1. Direct sending: Kafka's producer sends messages directly to the broker hosting the partition leader; generally no other routing layer is involved
  2. Load balancing: which partition the data goes to is decided on the client side; without customization, the default partitioner is used, which spreads records pseudo-randomly (a custom partitioner sketch follows this list)
  3. Asynchronous sending: send() returns a Future (whose result need not be fetched), and records are sent in batches to reduce per-message IO and increase throughput
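Since the client decides the target partition, a custom partitioner can be plugged in. A minimal sketch (the class name HashKeyPartitioner is an assumption):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class HashKeyPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // route by key hash (the mask keeps the value non-negative); keyless records go to partition 0
        return key == null ? 0 : (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

// register it on the producer:
// props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, HashKeyPartitioner.class.getName());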

Message delivery guarantees:

These depend on the Producer and Consumer jointly. On the producer side, the key question is how many confirmation signals the producer requires from the server after the server receives the data; this represents how many backups of the data exist. The three levels (a config sketch follows the list):

  1. At most once: delivered 0 or 1 times (fastest)
  2. At least once: delivered 1 or more times (second fastest)
  3. Exactly once: delivered once and only once
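On the producer side, these guarantees map onto the acks setting (how many confirmation signals are required), plus retries and idempotence. A sketch of the knobs, reusing the props object from the producer example above:

// acks = how many acknowledgements the producer requires from the server
props.put(ProducerConfig.ACKS_CONFIG, "0");      // at most once: fire and forget (fastest)
// props.put(ProducerConfig.ACKS_CONFIG, "1");   // leader has written the record
// props.put(ProducerConfig.ACKS_CONFIG, "all"); // all in-sync replicas have the record
props.put(ProducerConfig.RETRIES_CONFIG, 3);                // retries may resend: at least once
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);  // deduplicates retries, a step toward exactly once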


Consumer client operations

Set up the configuration, subscribe to one or more topics, and pull messages in batches in a loop (a sketch follows).

When pulling, you can enable automatic offset commits. This is the easiest way to use the consumer, but the offset cannot be rolled back if data processing fails. Alternatively, you can manually commit a batch through consumer.commitAsync().
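A minimal subscribe-and-poll loop with a manual commit (group id, topic name and broker address are assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually instead
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singleton("my-topic"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
    for (ConsumerRecord<String, String> r : records) {
        System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
    }
    consumer.commitAsync(); // commit the batch only after processing succeeds
}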

Precautions:

  1. A single partition's messages can only be consumed by one Consumer within a ConsumerGroup. That is, a partition's messages go to exactly one Consumer in the group, but one Consumer can pull messages from multiple partitions
  2. A Consumer consumes a partition's messages in order, starting from the beginning by default
  3. A single ConsumerGroup will consume all the messages in the partitions it subscribes to

Optimal: one partition consumed by one Consumer gives the best resource utilization. Use consumer.assign() to manually specify the partitions to subscribe to, as in the sketch below.
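A sketch of pinning the consumer from the loop above to one partition (topic name and partition number are assumptions):

import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

TopicPartition p0 = new TopicPartition("my-topic", 0);
consumer.assign(Collections.singleton(p0)); // instead of subscribe(): this consumer reads exactly this partition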

Multi-threaded usage:

KafkaConsumer is not thread-safe; you have to handle concurrency yourself. Two common patterns:

Classic mode (recommended for beginners): each thread class holds its own consumer attribute, i.e. every thread object owns one consumer object, which keeps things thread-safe. The drawback is that every thread needs its own consumer object, and creating and destroying them is relatively resource-intensive. (Sketched after this section.)

Distribution mode (suitable for streaming data): a single consumer pulls the messages and then distributes the data to different threads for fast processing. The drawback is that the business cannot be rolled back, because feedback from the worker threads cannot be monitored.
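A sketch of the classic mode (class name, topic and props are assumptions; the point is that each thread owns its own consumer):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;

public class ConsumerWorker implements Runnable {
    private final KafkaConsumer<String, String> consumer; // one consumer per thread

    public ConsumerWorker(Properties props, String topic) {
        this.consumer = new KafkaConsumer<>(props);
        this.consumer.subscribe(Collections.singleton(topic));
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    System.out.println(Thread.currentThread().getName() + " -> " + r.value());
                }
            }
        } finally {
            consumer.close(); // the owning thread also manages the consumer's lifecycle
        }
    }
}

// usage: new Thread(new ConsumerWorker(props, "my-topic")).start();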

Offsets:

Control the offset manually; when a program error occurs, consumer.seek() allows a batch to be consumed again. The typical flow (sketched in code below):

  1. Start consuming from 0 the first time (generally)
  2. If, for example, you consume 100 records in one batch, set the offset to 101 and store it in Redis
  3. Fetch the latest offset position from Redis before each pull
  4. Start consuming from that position each time
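A sketch of this flow, continuing from the consumer above; loadOffsetFromRedis and saveOffsetToRedis are hypothetical helpers standing in for a real Redis client:

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

TopicPartition partition = new TopicPartition("my-topic", 0);
consumer.assign(Collections.singleton(partition));

long offset = loadOffsetFromRedis(partition);  // step 3: latest position from Redis (hypothetical helper)
consumer.seek(partition, offset);              // step 4: resume from that position

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
// ... process the batch; on failure, simply do not advance the stored offset ...
saveOffsetToRedis(partition, offset + records.count()); // step 2: store the next offset (hypothetical helper)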

Stream API:

Basic concepts:

  1. A client library for processing and analyzing data stored in Kafka
  2. Streams can perform efficient stateful operations through state stores
  3. Supports both the primitive Processor API and the high-level DSL abstraction
  •         Stream and stream processor: a data stream, and a node that processes data
  •         Stream processing topology: the direction of the streams, i.e. the flow graph
  •         Source processor and sink processor: where the data enters (the source) and where the data exits (the sink)

Data sources and outputs are realized through an input topic and an output topic.

// create the stream
Properties props = .....
StreamsBuilder sB = ....
KafkaStreams streams = new KafkaStreams(sB.build(), props);
streams.start();
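A fuller, runnable sketch of the same pattern (topic names, application id and broker address are assumptions):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.KStream;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-stream");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("input-topic"); // source processor: the input topic
source.mapValues(v -> v.toUpperCase()).to("output-topic");      // sink processor: the output topic

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();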

Connect API

Connect is part of Kafka's streaming ecosystem. It is mainly used to establish streaming channels with other middleware, supporting the integration of streaming and batch processing.

 

 



Kafka cluster

Kafka natively supports clustering.

It relies on Zookeeper for coordination.

Different nodes are distinguished by their brokerId.

 

Kafka replicas:

The log is copied into multiple replicas.

A replication factor can be set for each topic.

The default replication factor can be set through the broker configuration.

 

Kafka core concepts

  • Broker: a Kafka deployment node
  • Leader: handles the produce and fetch requests for a partition
  • Follower: mainly used to back up the message data

 

Node failure

  • A node that fails to maintain its heartbeat with Zookeeper is treated as failed
  • A follower whose messages lag too far behind the leader is also treated as failed
  • Kafka removes failed nodes

Fault handling

  • Node failure basically causes no data loss
  • The semantic guarantees largely avoid data loss
  • Messages are balanced within the cluster to keep individual nodes from overheating, i.e. don't put all the eggs in one basket

 

Leader election

  • Kafka does not use majority voting to elect the leader
  • It dynamically maintains a set of replicas that are in sync with the leader's data (the ISR)
  • A sufficiently caught-up replica in the ISR is chosen as the new Leader

There is one helpless situation in Kafka: all the replicas in the ISR crash. There are then two options, the second being an "unclean" (dirty) leader election:

1. Wait until one of the ISR replicas comes back to normal

2. Elect a node outside the ISR, to ensure fast recovery

Leader election configuration recommendations:

  • Disable "unclean leader" dirty elections (unclean.leader.election.enable=false)

  • Manually specify the minimum ISR size (min.insync.replicas)

 



Kafka cluster monitoring


Source: blog.csdn.net/qq_20176001/article/details/108318050