kafka:
- Kafka always runs in cluster mode; even a single node is a one-node cluster
- A distributed messaging system coordinated by ZooKeeper, and a distributed streaming platform, not just a message queue
- Characterized by high throughput, high performance, real-time processing and high reliability
basic concepts:
- Broker: an independent Kafka server that accepts messages from producers and serves them to consumers
- Broker cluster: a cluster composed of several brokers
- Topic: a virtual concept representing a category of messages; a topic can have multiple partitions, and the partitions are stored on different brokers
- Partition: Partition. Actual message storage unit
- Producer: message producer
- Consumer: message consumer
Five major APIs (the first three are the most commonly used):
- Producers
- Consumers
- Stream Processors
- Connectors
- Admin
AdminClient API:
- AdminClient: creates an AdminClient client object
- NewTopic: Create Topic
- CreateTopicsResult: the return result of creating a Topic
- ListTopicsResult: Query the Topic list
- ListTopicsOptions: options for querying the Topic list
- DescribeTopicsResult: the result of describing Topics
- DescribeConfigsResult: the result of describing Topic configuration items
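The classes above fit together as below; a minimal sketch, assuming a broker at `localhost:9092`, the `kafka-clients` dependency on the classpath, and a hypothetical topic name `demo-topic`:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

class AdminDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            // NewTopic(name, numPartitions, replicationFactor)
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            // createTopics returns a CreateTopicsResult; all() blocks until done
            admin.createTopics(Collections.singletonList(topic)).all().get();
            // listTopics returns a ListTopicsResult
            Set<String> names = admin.listTopics().names().get();
            System.out.println(names);
        }
    }
}
```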
Producer API:
Sending mode:
Synchronous sending, asynchronous sending, asynchronous callback sending
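The three modes differ only in how `send()` is called; a sketch assuming a broker at `localhost:9092` and a hypothetical topic `demo-topic`:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

class SendModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("demo-topic", "key", "value");

            // 1. Asynchronous: fire and forget, ignore the returned Future
            producer.send(record);

            // 2. Synchronous: block on the Future to get the metadata
            RecordMetadata meta = producer.send(record).get();
            System.out.println("offset = " + meta.offset());

            // 3. Asynchronous with callback: invoked when the send completes
            producer.send(record, (metadata, exception) -> {
                if (exception != null) exception.printStackTrace();
            });
        }
    }
}
```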
KafkaProducer:
- Loads MetricConfig (metrics configuration)
- Loads the partitioner (the "load balancer" that decides which partition each record goes to)
- Initializes the key and value Serializers
- Initializes the RecordAccumulator, a buffer that collects records into batches
- Starts the Sender as a daemon thread
Notes from the source code:
- KafkaProducer is thread-safe
- The producer does not send each record immediately on send()
- The producer sends in batches to reduce IO operations (writing a large amount of data at once), appending to log files
producer.send(record):
- Calculate the partition: decide which partition the message goes to
- Calculate the batch: accumulator.append() adds the record to the batch to be sent
- main steps:
- Create the batch if one does not exist yet
- Append the record to the batch
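The "calculate the partition" step above can be sketched in plain Java; this is a simplified stand-in for the default rule (the real client hashes keys with murmur2, while `String.hashCode()` is used here only for illustration):

```java
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

class PartitionSketch {
    // Keyed records: hash the key, mod by the partition count, so the same
    // key always lands on the same partition.
    // Unkeyed records: pick a partition pseudo-randomly.
    static int choosePartition(Optional<String> key, int numPartitions) {
        if (key.isPresent()) {
            // mask the sign bit so the result is non-negative
            return (key.get().hashCode() & 0x7fffffff) % numPartitions;
        }
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }
}
```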
Producer sending principle analysis:
- Direct sending: the producer sends messages directly to the broker hosting the partition leader; generally no intermediate routing is involved
- Load balancing: the client decides which partition each record goes to; without a custom partitioner the default is used, which spreads unkeyed records pseudo-randomly
- Asynchronous sending: send() returns a Future (which can be ignored or blocked on), and records are sent in batches to reduce per-message IO and increase throughput
Message delivery guarantee:
Depends on the joint implementation of Producer and Consumer, mainly on the producer side:
after sending data, the producer can require acknowledgements from the server. The acks configuration
specifies how many such acknowledgements the producer waits for, and thus how many replicas hold the data
1. At most once: delivered 0 or 1 times (fastest)
2. At least once: delivered 1 or more times (second fastest)
3. Exactly once: delivered once and only once
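The three guarantees map roughly onto producer configs; a sketch using the standard Kafka producer property names (the broker address is a placeholder, and the retry counts are illustrative):

```java
import java.util.Properties;

class DeliveryConfigs {
    static Properties atMostOnce() {
        Properties p = base();
        p.put("acks", "0");     // fire and forget: no broker confirmation
        p.put("retries", "0");
        return p;
    }

    static Properties atLeastOnce() {
        Properties p = base();
        p.put("acks", "all");   // wait for all in-sync replicas
        p.put("retries", "3");  // retries may deliver duplicates
        return p;
    }

    static Properties exactlyOnce() {
        Properties p = atLeastOnce();
        p.put("enable.idempotence", "true"); // broker dedupes retried sends
        return p;
    }

    static Properties base() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // placeholder address
        return p;
    }
}
```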
Consumer client operations
Configure the consumer properties, subscribe to one or several topics, then pull messages in batches in a loop.
When pulling, you can enable automatic offset commits. This is the easiest way to use it, but offsets cannot be rolled back after data processing fails.
You can instead commit a batch manually through consumer.commitAsync()
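The loop described above, sketched with manual commits (assuming a broker at `localhost:9092`, a hypothetical topic `demo-topic`, and group id `demo-group`):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

class ConsumeLoop {
    static Properties config() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "false"); // commit manually instead
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return props;
    }

    public static void main(String[] args) {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(config())) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // pull a batch, process it, then commit the batch's offsets
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("%d: %s%n", r.offset(), r.value());
                }
                consumer.commitAsync();
            }
        }
    }
}
```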
Precautions:
- Messages from a single partition can only be consumed by one Consumer within a ConsumerGroup. That is: a partition's messages go to only one Consumer, but a Consumer can pull messages from multiple partitions
- A Consumer consumes a partition's messages in order, starting from the beginning by default
- Within a single ConsumerGroup, each partition's messages are consumed exactly once
Optimal: one partition per Consumer, for the best use of resources
consumer.assign() is used to specify the exact partitions to subscribe to
Multi-threaded situation:
KafkaConsumer is not thread-safe, so you need to handle concurrency yourself.
Classic mode: (recommended for beginners)
Simply put, each thread class has its own consumer field, i.e. every thread object holds its own consumer object, which is thread-safe.
But every thread then needs its own consumer object, and creating and destroying them is resource-intensive.
Distribution mode: (suitable for streaming data)
One consumer pulls the messages and then distributes the data to different threads to
process it quickly, but the business cannot be rolled back because per-thread feedback cannot be monitored.
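The distribution mode can be sketched with only the JDK: one "puller" hands records to a worker pool over a queue. Here a pre-filled queue stands in for the single consumer's poll loop, and `toUpperCase()` stands in for the business logic:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class DistributionMode {
    static List<String> process(List<String> records, int workers) {
        // the puller thread would enqueue records here from consumer.poll()
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(records);
        List<String> out = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String r;
                while ((r = queue.poll()) != null) {
                    out.add(r.toUpperCase()); // stand-in for business logic
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out;
    }
}
```

Note the trade-off the notes describe: once a record is handed to a worker, the puller cannot tell which records failed, so the batch cannot be rolled back.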
offset:
Control the offset manually so that, when a program error occurs,
consumer.seek() can be used to consume a batch again
- Start consuming from offset 0 the first time (generally)
- For example, if you consume records up through offset 100 in one batch, store 101 (last consumed offset + 1) in Redis
- Before each pull, fetch the latest offset position from Redis
- Start consuming from that position every time
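The bookkeeping above can be sketched with a `HashMap` standing in for Redis (the partition name and log are illustrative): each pull starts at the stored offset and stores back the next one.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OffsetStore {
    private final Map<String, Long> store = new HashMap<>(); // Redis stand-in

    long nextOffset(String partition) {
        return store.getOrDefault(partition, 0L); // first run starts at 0
    }

    // Consume up to batchSize records of the log, then persist the offset
    // of the first unconsumed record (last consumed offset + 1).
    List<String> pull(String partition, List<String> log, int batchSize) {
        int from = (int) nextOffset(partition);
        int to = Math.min(from + batchSize, log.size());
        List<String> batch = log.subList(from, to);
        store.put(partition, (long) to);
        return batch;
    }
}
```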
Stream API:
basic concepts:
- Client library for processing and analyzing data stored in Kafka
- Streams achieves efficient stateful operations through state stores
- Supports both the low-level Processor API and the high-level DSL
- Stream and stream processor: a data stream, and a node that processes it
- Stream processing topology: the direction data flows in, i.e. the flow graph
- Source processor and sink processor: where data enters the topology, and where it exits
The data source and output are wired up through an input topic and an output topic
// create the stream (config values below are illustrative)
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
StreamsBuilder sb = new StreamsBuilder();
sb.stream("input-topic").to("output-topic");
KafkaStreams streams = new KafkaStreams(sb.build(), props);
streams.start();
Connect API
Connect is part of Kafka's streaming stack. It is mainly used to build streaming channels to other middleware, supporting the integration of streaming and batch processing
Kafka cluster
Kafka naturally supports clusters
Rely on Zookeeper for coordination
Distinguish different nodes by brokerId
Kafka replication:
The log is copied into multiple replicas
A replica set can be configured per topic
The default replication factor can be set through configuration
Kafka core concepts
- Broker: Kafka deployment node
- Leader: handles produce and consume requests for a partition
- Follower: mainly used to back up message data
Node failure
- A node that fails to maintain its heartbeat with ZooKeeper is treated as failed
- A follower that lags too far behind the leader's messages is also treated as failed
- Kafka will remove the failed node
Fault handling
- Node failures basically cause no data loss
- The semantic guarantees largely avoid data loss
- Messages are balanced across the cluster to keep some nodes from overheating, i.e. don't put all your eggs in one basket
Leader election
- Majority voting is not used to elect the leader
- Kafka dynamically maintains a set of replicas that are in sync with the leader's data (the ISR)
- A replica from the ISR is chosen as the new leader
There is one awkward situation in Kafka: all the replicas in the ISR crash. In this case there are two options, the second being an "unclean leader" (dirty) election:
1. Wait until one of the ISR replicas comes back online (safe, but slow)
2. Elect a node outside the ISR to recover quickly (may lose data)
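The ISR logic above can be sketched in plain Java (this is an illustration of the concept, not Kafka's actual controller code; replica names and the lag threshold are made up):

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class IsrElection {
    // Replicas whose lag behind the leader is within maxLag form the ISR.
    static Set<String> isr(Map<String, Long> lagByReplica, long maxLag) {
        Set<String> in = new LinkedHashSet<>();
        lagByReplica.forEach((replica, lag) -> {
            if (lag <= maxLag) in.add(replica);
        });
        return in;
    }

    // On leader failure, any ISR member is safe to elect; an empty result
    // means only an "unclean" election (outside the ISR) is possible.
    static Optional<String> electLeader(Set<String> isr) {
        return isr.stream().findFirst();
    }
}
```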
Leader election configuration recommendations:
- Disable "unclean leader" (dirty) election
- Manually specify the minimum ISR size