Kafka concepts

1. Producers

A message sent by the producer to the broker can be acknowledged in three ways (request.required.acks):

acks = 0: The producer does not wait for the broker (leader) to send an ack. Because the message may be lost on the network or the broker may crash (1. the leader has not committed the message to the followers; 2. the leader's data is not yet synchronized), the message may be lost or may be sent again.

acks = 1: The leader sends an ack when it receives the message, and lost messages are retransmitted, so the probability of loss is small.

acks = -1: An ack is sent only after all followers have synchronized the message, so the probability of message loss is the lowest.

Kafka has three main ways of sending a message:

1. fire-and-forget  2. synchronous send  3. asynchronous send + callback

 

The following tests send 10,000 messages to a single node using each of the three methods:

Method 1: fire-and-forget (do not care whether the message arrives; do nothing with the return result)

This is essentially asynchronous: send the message and forget about it, without fetching the send result. Its throughput is the highest of the three, but it cannot guarantee message reliability:

import pickle
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    key_serializer=lambda k: pickle.dumps(k),
    value_serializer=lambda v: pickle.dumps(v)
)

start_time = time.time()
for i in range(0, 10000):
    print('------{}---------'.format(i))
    future = producer.send('test_topic', key='num', value=i, partition=0)

# push all buffered messages to the broker
producer.flush()
producer.close()

end_time = time.time()
time_counts = end_time - start_time
print(time_counts)
Test result: 1.88s

 

Method 2: synchronous send (wait for Kafka's response via the get method to determine whether the message was sent successfully)

When sending messages synchronously, each send returns a result, so the outcome of every sent message is known exactly. But because the synchronous get() call on the returned future blocks, the next message is sent only after the previous one's result comes back:

 

import pickle
import time
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    key_serializer=lambda k: pickle.dumps(k),
    value_serializer=lambda v: pickle.dumps(v)
)

start_time = time.time()
for i in range(0, 10000):
    print('------{}---------'.format(i))
    future = producer.send(topic="test_topic", key="num", value=i)
    # synchronous blocking: calling get() each time also guarantees ordering
    try:
        record_metadata = future.get(timeout=10)
        # print(record_metadata.topic)
        # print(record_metadata.partition)
        # print(record_metadata.offset)
    except KafkaError as e:
        print(str(e))

end_time = time.time()
time_counts = end_time - start_time
print(time_counts)
 

Test result: 16s

 

Method 3: asynchronous send + callback (send the message asynchronously; callbacks report whether the send succeeded or failed)

Call send() and attach callbacks: when the server returns a response, the matching callback is invoked, so failures can be handled in the error callback. The producer only finishes once outstanding callbacks have run; flush() and close() block until then:

import pickle
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    key_serializer=lambda k: pickle.dumps(k),
    value_serializer=lambda v: pickle.dumps(v)
)


def on_send_success(*args, **kwargs):
    """
    Callback invoked when a send succeeds.
    :param args:
    :param kwargs:
    :return:
    """
    return args


def on_send_error(*args, **kwargs):
    """
    Callback invoked when a send fails.
    :param args:
    :param kwargs:
    :return:
    """
    return args


start_time = time.time()
for i in range(0, 10000):
    print('------{}---------'.format(i))
    # on success the callback receives record_metadata; on failure, the exception
    producer.send(
        topic="test_topic", key="num", value=i
    ).add_callback(on_send_success).add_errback(on_send_error)

producer.flush()
producer.close()

end_time = time.time()
time_counts = end_time - start_time
print(time_counts)
Test result: 2.15s

 

Although the three methods take different amounts of time, faster is not automatically better; which one to use depends on the business scenario:

Scenario 1: If the business requires messages to be sent strictly in order, use synchronous sending, write to only one partition, set the retries parameter so failed sends are retried, and set max_in_flight_requests_per_connection = 1 so the producer sends the next message only after receiving the server's response for the previous one, which controls the order in which messages are sent;

Scenario 2: If the business only cares about throughput, tolerates a small number of failed sends, and does not care about message order, use fire-and-forget with acks = 0, so that the producer does not wait for the server's response and sends messages at the maximum speed the network supports;

Scenario 3: If the business needs to know whether each message was sent successfully but does not care about message order, use asynchronous send + callback, set retries = 0, and record failed messages in a log file; a sketch of all three configurations follows.
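
As a sketch of these three setups in kafka-python (the client used in the examples above; the broker address and the retries value are illustrative):

from kafka import KafkaProducer

# Scenario 1: strict ordering -- synchronous sends to a single partition,
# retries enabled, at most one in-flight request per connection.
ordered_producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    retries=5,
    max_in_flight_requests_per_connection=1
)

# Scenario 2: maximum throughput, small losses tolerated -- fire-and-forget
# with acks=0, so no server response is awaited.
throughput_producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    acks=0
)

# Scenario 3: per-message success/failure matters, order does not --
# asynchronous send + callbacks with retries=0; log failures in the errback.
callback_producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    retries=0
)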

2. Producer configuration

So far, we have introduced only a few of the producer's required configuration parameters: bootstrap.servers and the serializers.

Producers have many other configurable parameters, all explained in the Kafka documentation. Most of them have reasonable defaults, so there is usually no need to modify them. However, a few parameters have a large impact on the producer's memory usage, performance, and reliability; we go through them below.

1. acks

The acks parameter specifies how many partition replicas must receive a message before the producer considers the write successful.

This parameter has a major impact on the likelihood of message loss. It has the following options.

• If acks = 0, the producer does not wait for any response from the server before considering the message successfully written. That means if something goes wrong and the server never receives the message, the producer will not know, and the message is lost. However, because the producer does not wait for a response, it can send messages at the maximum speed the network supports, achieving very high throughput.

• If acks = 1, the producer receives a success response from the server as soon as the cluster's leader node receives the message. If the message cannot reach the leader (for example, the leader crashed and a new leader has not yet been elected), the producer receives an error response and resends the message to avoid data loss. However, if a node that did not receive the message becomes the new leader, the message is still lost. Throughput here depends on whether sends are synchronous or asynchronous. If the client waits for the server's response on each send (by calling the returned Future object's get() method), latency increases noticeably (one network round trip per message). With asynchronous callbacks the latency problem is eased, but throughput is still limited by the number of in-flight messages (that is, how many messages the producer may send before receiving responses from the server).

Tuning 1: use asynchronous sending.

• If acks = all, the producer receives a success response from the server only after all in-sync replicas have received the message. This mode is the safest: it guarantees that more than one server has the message, so the cluster keeps running even if one server crashes (Chapter 5 discusses this in more detail). However, its latency is higher than with acks = 1, because we wait for more than one server node to receive the message.
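
As a sketch, the three acks levels map onto the kafka-python producer as follows (acks=-1 and acks='all' are equivalent; the broker address is the example one used earlier):

from kafka import KafkaProducer

# acks=0: do not wait for any acknowledgment (fastest, messages may be lost)
p0 = KafkaProducer(bootstrap_servers=['192.168.33.11:9092'], acks=0)

# acks=1: wait for the leader only (small loss window during leader failover)
p1 = KafkaProducer(bootstrap_servers=['192.168.33.11:9092'], acks=1)

# acks='all': wait for all in-sync replicas (safest, highest latency)
p_all = KafkaProducer(bootstrap_servers=['192.168.33.11:9092'], acks='all')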

 

2. buffer.memory

This parameter sets the size of the producer's memory buffer, which the producer uses to hold messages waiting to be sent to the server. If the application produces messages faster than they can be sent to the server, the producer runs out of buffer space. When that happens, send() either blocks or throws an exception, depending on the block.on.buffer.full parameter (replaced by max.block.ms in version 0.9.0.0, which expresses how long to block before throwing an exception).

Tuning 2: increase buffer.memory so each batch is larger, reducing the number of disk writes.

3. compression.type

By default, messages are sent uncompressed. This parameter can be set to snappy, gzip, or lz4, specifying the compression algorithm applied to messages before they are sent to the broker. The snappy algorithm, invented by Google, uses little CPU while providing good performance and a decent compression ratio; use it if you care about both performance and network bandwidth. gzip usually uses more CPU but provides a higher compression ratio; use it if network bandwidth is limited. Compression reduces network transmission and storage overhead, which is often the bottleneck when sending messages to Kafka.

Tuning 3: enable compression.

4. retries

The producer may receive transient errors from the server (for example, no leader can be found for a partition). In that case, the value of the retries parameter determines how many times the producer resends the message; when that count is reached, the producer gives up retrying and returns an error. By default the producer waits 100ms between retries; this interval can be changed with the retry.backoff.ms parameter. Before setting the retry count and interval, first test how long it takes to recover from a node crash (for example, how long until all partitions have elected new leaders), and make the total retry time longer than the cluster's crash-recovery time, otherwise the producer gives up too early. Some errors are not transient, however, and cannot be solved by retrying (such as a "message too large" error). In general, because the producer retries automatically, there is no need to handle retriable errors in application code; you only need to handle non-retriable errors or the case where the retry limit has been exceeded.
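
A minimal sketch of these retry settings in kafka-python, where the retry interval is exposed as retry_backoff_ms (the values below are illustrative, not recommendations):

from kafka import KafkaProducer

# Retry transient errors up to 10 times, waiting 500 ms between attempts;
# the total retry time (about 5 s here) should exceed the cluster's
# crash-recovery (leader-election) time measured beforehand.
producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    retries=10,
    retry_backoff_ms=500
)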

5. batch.size

When multiple messages need to be sent to the same partition, the producer puts them into a single batch. This parameter specifies the amount of memory a batch may use, measured in bytes (not in number of messages). When a batch fills up, all the messages in it are sent out. But the producer does not necessarily wait until a batch is full: a half-full batch, or even a batch containing a single message, may also be sent. So setting the batch size very large does not cause delays; it only uses more memory. Setting it too small, however, adds overhead, because the producer must send messages more frequently.

6. linger.ms

This parameter specifies how long the producer waits for additional messages to join a batch before sending it. KafkaProducer sends a batch when the batch fills up or when the linger.ms limit is reached. By default, the producer sends messages as soon as a sender thread is available, even if there is only one message in the batch. Setting linger.ms to a value greater than 0 makes the producer wait a while before sending a batch, so that more messages can be added to it. This increases latency but also raises throughput (more messages are sent at once, so the per-message overhead is smaller).
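
Putting Tuning 2 and Tuning 3 together with batching, a throughput-oriented producer sketch in kafka-python might look like this (all values are illustrative):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    buffer_memory=64 * 1024 * 1024,   # 64MB buffer (Tuning 2)
    compression_type='gzip',          # compress batches (Tuning 3)
    batch_size=64 * 1024,             # up to 64KB of messages per batch
    linger_ms=20                      # wait up to 20 ms for a batch to fill
)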

7. client.id

This parameter can be an arbitrary string; the server uses it to identify the source of messages, and it can also be used in logs and quotas.

8. max.in.flight.requests.per.connection

This parameter specifies how many messages the producer may send before receiving responses from the server. The higher its value, the more memory is used, but throughput also rises. Setting it to 1 guarantees that messages are written to the server in the order they were sent, even when retries occur.

9. timeout.ms, request.timeout.ms and metadata.fetch.timeout.ms

request.timeout.ms specifies how long the producer waits for the server's response when sending data, and metadata.fetch.timeout.ms specifies how long the producer waits for the server's response when fetching metadata (such as who the leader of the target partition is). If the wait times out, the producer either retries the send or returns an error (by throwing an exception or invoking a callback). timeout.ms specifies how long the broker waits for in-sync replicas to acknowledge the message, matching the acks configuration: if the required acknowledgments are not received within the specified time, the broker returns an error.

10. max.block.ms

This parameter specifies how long the producer may block when calling send() or when fetching metadata via partitionsFor(). These methods block when the producer's send buffer is full or when no metadata is available. When the blocking time reaches max.block.ms, the producer throws a timeout exception.

11. max.request.size

This parameter controls the size of requests sent by the producer. It caps both the largest single message that can be sent and the total size of all messages in a single request. For example, if its value is 1MB, then the largest single sendable message is 1MB, or the producer can send a single request containing one batch of 1,000 messages, each 1KB in size. In addition, the broker has its own limit on the maximum message size it can accept (message.max.bytes), so it is best to keep the two sides' configurations matched, to avoid messages sent by the producer being rejected by the broker.
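
A sketch of keeping the two sides matched; in kafka-python the client-side limit is max_request_size, while message.max.bytes is a broker setting (shown as an assumed server.properties excerpt):

# Broker side (server.properties), assumed:
#   message.max.bytes=1048576

from kafka import KafkaProducer

# Keep the producer's limit at or below the broker's message.max.bytes,
# so oversized messages fail on the client instead of being rejected later.
producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    max_request_size=1048576   # 1MB, the default
)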

12. receive.buffer.bytes and send.buffer.bytes

These parameters specify the sizes of the TCP socket buffers used for receiving and sending packets. If they are set to -1, the operating system defaults are used. If the producer or consumer is in a different data center from the broker, these values can be increased appropriately, because networks across data centers generally have higher latency and lower bandwidth.

Ordering guarantees: Kafka guarantees that messages within the same partition are ordered. That is, if the producer sends messages in a certain order, the broker writes them to the partition in that order, and consumers read them in that same order. In some scenarios ordering is very important. If retries is set to a non-zero integer and max.in.flight.requests.per.connection is set greater than 1, then if the first batch fails to be written and the second batch succeeds, the broker retries writing the first batch; if that retry then succeeds, the order of the two batches is reversed.

In general, if a scenario requires messages to be ordered, then writing them successfully is also critical, so setting retries to 0 is not recommended. Instead, max.in.flight.requests.per.connection can be set to 1, so that while the producer is retrying the first message, no other messages are being sent to the broker. However, this severely limits the producer's throughput, so do it only when message ordering is a strict requirement; a sketch follows.
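
The ordering-safe combination, in kafka-python terms:

from kafka import KafkaProducer

# Retries stay enabled so writes still succeed, but only one request may be
# in flight at a time, so a retried batch cannot overtake a later one.
producer = KafkaProducer(
    bootstrap_servers=['192.168.33.11:9092'],
    retries=5,   # illustrative value
    max_in_flight_requests_per_connection=1
)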

3. Consumer configuration

So far, we have learned how to use the consumer API, but we introduced only a few of the required configuration properties, such as bootstrap.servers, key.deserializer, value.deserializer, and group.id. Kafka's documentation lists all the consumer-related configuration options. Most parameters have reasonable defaults and generally do not need to be modified, but some have a big impact on the consumer's performance and availability. The important ones are introduced below.

1. fetch.min.bytes

This property specifies the minimum number of bytes the consumer fetches from the server. When the broker receives the consumer's data request, if the amount of available data is less than fetch.min.bytes, it waits until enough data is available before returning it to the consumer. This reduces the workload of both the broker and the consumer, because they do not need to shuttle messages back and forth while the topic is not very active (or during off-peak hours). If there is not much data available but the consumer's CPU usage is high, set this property above its default. If there are many consumers, raising this property's value also reduces the broker's workload.

2. fetch.max.wait.ms

Through fetch.min.bytes we tell Kafka to wait until there is enough data before returning it to the consumer, while fetch.max.wait.ms specifies how long the broker waits; the default is 500ms. If not enough data flows into Kafka to meet the consumer's minimum, the result is a 500ms delay. To reduce this potential delay (for example, to meet an SLA), the parameter can be set smaller. If fetch.max.wait.ms is set to 100ms and fetch.min.bytes to 1MB, then after receiving the consumer's request, Kafka returns either 1MB of data or all the available data after 100ms, whichever condition is met first; a sketch follows.
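
In kafka-python these two knobs are fetch_min_bytes and fetch_max_wait_ms; a sketch of the 1MB/100ms example (topic and group names are illustrative):

from kafka import KafkaConsumer

# The broker returns either 1MB of data or whatever is available
# after 100 ms, whichever happens first.
consumer = KafkaConsumer(
    'test_topic',
    bootstrap_servers=['192.168.33.11:9092'],
    group_id='my_group',
    fetch_min_bytes=1024 * 1024,   # wait for at least 1MB...
    fetch_max_wait_ms=100          # ...but no longer than 100 ms
)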

3. max.partition.fetch.bytes

This property specifies the maximum number of bytes the server returns to the consumer per partition. Its default value is 1MB; that is, KafkaConsumer.poll() returns at most max.partition.fetch.bytes bytes of records from each partition. If a topic has 20 partitions and there are 5 consumers, each consumer is assigned 4 partitions and therefore needs at least 4MB of memory available for records. When allocating memory for consumers, allocate more than that, because if a consumer in the group crashes, the remaining consumers must handle more partitions. The value of max.partition.fetch.bytes must be larger than the largest message the broker can accept (configured by the max.message.size property), otherwise the consumer may be unable to read those messages and will hang retrying. Another factor to consider when setting this property is the time the consumer spends processing data. The consumer must call poll() frequently enough to avoid session expiry and the resulting rebalance; if a single call to poll() returns too much data, the consumer may need so much time to process it that it cannot poll again in time and the session expires. If this happens, either reduce max.partition.fetch.bytes or extend the session expiry time. A sketch of the memory arithmetic follows.
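
The kafka-python name is max_partition_fetch_bytes (topic and group names are illustrative):

from kafka import KafkaConsumer

# 20 partitions / 5 consumers = 4 partitions per consumer; at 1MB per
# partition a single poll() may return up to 4MB, so budget memory for that.
consumer = KafkaConsumer(
    'test_topic',
    bootstrap_servers=['192.168.33.11:9092'],
    group_id='my_group',
    max_partition_fetch_bytes=1024 * 1024   # 1MB per partition (the default)
)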

4. session.timeout.ms

This property specifies how long the consumer may go without contacting the server before it is considered dead; the default is 3s. If the consumer does not send a heartbeat to the group coordinator within the time specified by session.timeout.ms, it is considered dead, and the coordinator triggers a rebalance, assigning its partitions to the other consumers in the group. This property is closely related to heartbeat.interval.ms, which specifies how frequently poll() sends heartbeats to the coordinator, whereas session.timeout.ms specifies how long the consumer may go without sending one. So the two are generally modified together: heartbeat.interval.ms must be smaller than session.timeout.ms, typically one third of it. If session.timeout.ms is 3s, heartbeat.interval.ms should be 1s. Setting session.timeout.ms below the default detects and recovers from node crashes faster, but long polls or garbage collection may trigger unintended rebalances. Setting it higher reduces accidental rebalances, but node crashes take longer to detect. A sketch follows.
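
A sketch of the 3:1 ratio in kafka-python (session_timeout_ms and heartbeat_interval_ms; note that newer client and broker versions ship larger defaults):

from kafka import KafkaConsumer

# Heartbeat every 1 s; declared dead after 3 s without a heartbeat.
consumer = KafkaConsumer(
    'test_topic',
    bootstrap_servers=['192.168.33.11:9092'],
    group_id='my_group',
    session_timeout_ms=3000,
    heartbeat_interval_ms=1000
)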

5. auto.offset.reset

This property specifies what the consumer does when it reads a partition that has no committed offset or whose offset is invalid (for example, because the consumer was down for so long that the record at that offset has already been aged out and deleted). Its default value is latest, meaning that with an invalid offset, the consumer starts reading from the newest records (those produced after the consumer started). The other value is earliest, meaning that with an invalid offset, the consumer reads the partition's records from the beginning.

6. enable.auto.commit

We will cover several different ways of committing offsets later. This property specifies whether the consumer commits offsets automatically; the default value is true. To avoid duplicated and lost data, it can be set to false, so that you control when offsets are committed yourself (as sketched below). If it is set to true, the commit frequency can be controlled with the auto.commit.interval.ms property.
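
A manual-commit sketch in kafka-python, combining auto_offset_reset with enable_auto_commit=False (the processing step is a placeholder):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'test_topic',
    bootstrap_servers=['192.168.33.11:9092'],
    group_id='my_group',
    auto_offset_reset='earliest',   # read from the start if no valid offset
    enable_auto_commit=False        # we commit offsets ourselves
)

for message in consumer:
    print(message.offset, message.value)  # placeholder for real processing
    consumer.commit()                     # commit only after processing succeeds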

7. partition.assignment.strategy

We know that partitions are assigned to the consumers in a group. A PartitionAssignor decides, given the topics and the consumers, which partitions are assigned to which consumer. Kafka has two built-in assignment strategies.

  • Range

This strategy assigns several consecutive partitions of a topic to a consumer. Suppose consumers C1 and C2 both subscribe to topics T1 and T2, and each topic has three partitions. Then C1 is likely to be assigned partition 0 and partition 1 of both topics, and C2 partition 2 of both topics. Because each topic has an odd number of partitions, and the assignment is done independently within each topic, the first consumer ends up with more partitions than the second. This happens whenever the Range strategy is used and the number of partitions is not evenly divisible by the number of consumers.

  • RoundRobin

This strategy assigns all the partitions of all subscribed topics to consumers one by one. If the RoundRobin strategy assigns partitions to consumers C1 and C2, then C1 gets partitions 0 and 2 of topic T1 and partition 1 of topic T2, while C2 gets partition 1 of topic T1 and partitions 0 and 2 of topic T2. In general, if all consumers subscribe to the same topics (which is common), the RoundRobin strategy gives every consumer the same number of partitions (or at most one partition fewer).

The partitioning strategy is selected by setting partition.assignment.strategy. The default is org.apache.kafka.clients.consumer.RangeAssignor, which implements the Range strategy; it can be changed to org.apache.kafka.clients.consumer.RoundRobinAssignor. A custom strategy can also be used, in which case the value of the partition.assignment.strategy property is the name of the custom class.
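
In kafka-python the strategy is passed as assignor classes rather than Java class names; a sketch selecting round-robin assignment:

from kafka import KafkaConsumer
from kafka.coordinator.assignors.roundrobin import RoundRobinPartitionAssignor

consumer = KafkaConsumer(
    'test_topic',
    bootstrap_servers=['192.168.33.11:9092'],
    group_id='my_group',
    partition_assignment_strategy=[RoundRobinPartitionAssignor]
)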

8. client.id

This property can be any string; the broker uses it to identify which client a message came from. It is typically used in logs, metrics, and quotas.

9. max.poll.records

This property controls the maximum number of records a single call to poll() can return; it helps you control the amount of data processed per polling pass.

10. receive.buffer.bytes and send.buffer.bytes

These set the sizes of the TCP socket buffers used when reading and writing data. If they are set to -1, the operating system defaults are used. If the producer or consumer is in a different data center from the broker, these values can be increased appropriately, because networks across data centers generally have higher latency and lower bandwidth.

 


Origin: www.cnblogs.com/hejunhong/p/11450410.html