[Kafka] "Kafka: The Definitive Guide" - Writing Data to Kafka

Whether you use Kafka as a message queue, message bus, or data storage platform, you always need an application that writes data to Kafka (a producer), an application that reads data from Kafka (a consumer), or an application that plays both roles.

For example, in a credit card transaction processing system, there is a client application, perhaps an online store, that is responsible for sending each transaction to Kafka as payments happen. Another application checks the transaction against a rules engine and decides whether to approve or deny it. The approve or deny response is written back to Kafka and then propagated to the online store that initiated the transaction. A third application reads both the transactions and their review status from Kafka and stores them in a database, where analysts can examine the results and perhaps use them to improve the rules engine.

Developers can use Kafka's built-in client APIs to develop Kafka applications.

In this chapter we will discuss the design and components of the Kafka producer and learn how to use it. We will show how to create KafkaProducer and ProducerRecord objects, how to send records to Kafka, and how to handle the errors Kafka returns. We will then review the most important configuration options that control producer behavior, and finally take a deeper look at the different partitioning methods and serializers, and at how to write custom partitioners and serializers.

In the next chapter we will introduce the Kafka consumer client and how to read messages from Kafka.

Producer Overview

There are many situations in which an application needs to write messages to Kafka: recording user activity (for auditing and analysis), recording metrics, storing log messages, recording information from smart appliances, communicating asynchronously with other applications, buffering data before writing it to a database, and so on.

These diverse usage scenarios imply diverse requirements: Is every message critical, or can we tolerate losing a small fraction of messages? Is the occasional duplicate message acceptable? Are there strict latency or throughput requirements?

In the credit card transaction processing system mentioned earlier, neither lost nor duplicated messages are acceptable, the maximum acceptable latency is 500ms, and the throughput requirements are high: we would like to handle up to a million messages per second.

Storing website click information is a different usage scenario. In that case, losing a few messages or allowing a small number of duplicates is tolerable, and latency can be higher as long as it does not affect the user experience. In other words, as long as the page loads immediately when the user clicks a link, we do not mind if the message takes a few seconds to reach the Kafka server. Throughput depends on how heavily users use the site.

These different usage scenarios directly affect how you use and configure the producer API.

Although the producer API is simple to use, there is a bit more going on under the hood when a message is sent. The following figure shows the main steps involved in sending a message to Kafka.

Figure: Kafka producer components

We start by creating a ProducerRecord object, which must include the target topic and the content we want to send; we can also optionally specify a key or a partition. When we send the ProducerRecord, the producer first serializes the key and value objects into byte arrays so that they can be transmitted over the network.

Next, the data is passed to the partitioner. If a partition was specified in the ProducerRecord, the partitioner does nothing and simply returns that partition. If no partition was specified, the partitioner chooses one, usually based on the ProducerRecord key. Once a partition is selected, the producer knows which topic and partition the record is headed for. The record is then added to a batch of records that will all be sent to the same topic and partition. A separate thread is responsible for sending these batches to the appropriate brokers.
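To make the key-to-partition mapping concrete, here is a minimal sketch of hash-based partitioning. Note the assumptions: Kafka's default partitioner actually uses a murmur2 hash over the serialized key, whereas this sketch uses Java's own array hash purely for illustration, and the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SimplePartitionerSketch {
    // Map a key to a partition by hashing its serialized bytes.
    // (Kafka's default partitioner uses murmur2; Arrays.hashCode is used
    // here only to keep the sketch self-contained.)
    public static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        return (hash & 0x7fffffff) % numPartitions; // clear sign bit, then mod
    }
}
```

The property that matters is determinism: the same key always lands in the same partition, as long as the number of partitions does not change.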

When the broker receives the messages, it sends back a response. If the messages were successfully written to Kafka, it returns a RecordMetadata object containing the topic, the partition, and the offset of the record within the partition. If the write failed, it returns an error. When the producer receives an error, it may retry sending the message a few times before giving up and returning an error.

Creating a Kafka producer

To write messages to Kafka, you first create a producer object and set some of its properties.

The following code fragment shows how to create a new producer, specifying only the mandatory properties and using defaults for everything else.

private Properties kafkaProps = new Properties();

kafkaProps.put("bootstrap.servers", "broker1:9092,broker2:9092");

kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");

kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

producer = new KafkaProducer<String, String>(kafkaProps);

A Kafka producer has three mandatory properties:

bootstrap.servers

This property specifies the list of broker addresses, in host:port format. The list does not need to include every broker; the producer will discover the other brokers from the ones given. However, it is recommended to provide at least two, so that the producer can still connect to the cluster if one of them goes down.

key.serializer

Brokers expect the keys and values of received messages to be byte arrays. The producer interface allows parameterized types, so Java objects can be sent to the broker as keys and values. This makes for readable code, but it means the producer needs to know how to convert these Java objects into byte arrays. key.serializer must be set to the name of a class that implements the org.apache.kafka.common.serialization.Serializer interface; the producer uses this class to serialize the key into a byte array. The Kafka client ships with ByteArraySerializer (which does very little), StringSerializer, and IntegerSerializer, so if you use only a few common Java types, there is no need to implement your own serializer. Note that key.serializer must be set even if you intend to send only values.

value.serializer

Like key.serializer, value.serializer specifies the class used to serialize the value. If both keys and values are strings, the same serializer can be used for both. If the keys are integers and the values are strings, you need to use different serializers.
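To make the serializer's job concrete, here is a sketch of what a string serializer boils down to; Kafka's bundled StringSerializer behaves essentially like this, but the class name here is invented for the example and the real class implements the Serializer interface rather than a static method.

```java
import java.nio.charset.StandardCharsets;

public class StringSerializerSketch {
    // The broker only understands byte arrays, so a serializer's whole job
    // is to turn the Java object into bytes. For strings that is just an
    // encoding step; a null value stays null.
    public static byte[] serialize(String data) {
        if (data == null) {
            return null;
        }
        return data.getBytes(StandardCharsets.UTF_8);
    }
}
```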

There are three main ways to send messages:

1. Fire-and-forget: we send a message to the server and do not care whether it arrives. Most of the time it will, because Kafka is highly available and the producer retries sending automatically, but some messages can still be lost this way.

2. Synchronous send: we send a message using the send() method, which returns a Future object, and then call get() on it to wait and find out whether the send succeeded.

3. Asynchronous send: we call the send() method with a callback function, which is invoked when the broker returns a response.

In the examples that follow, we will show how to send messages using these methods and how to handle the exceptions that may occur.

All the examples in this chapter use a single thread, but a producer object can be used by multiple threads to send messages. You will probably want to start with one producer and one thread. If you need better throughput, you can add more threads that share the same producer. Once this ceases to help, you can add more producers.

Sending a message to Kafka

The simplest way to send a message is as follows:

ProducerRecord<String, String> record = new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
  producer.send(record);
} catch (Exception e) {
  e.printStackTrace();
}

The producer's send() method takes a ProducerRecord object as a parameter; the record includes the name of the target topic and the key and value to send, which here are all strings. The types of the key and value must match the serializers and the producer object.

We use the producer's send() method to send the ProducerRecord. As the producer architecture figure shows, the message is first placed in a buffer and then sent to the server by a separate thread. send() returns a Future object containing RecordMetadata, but since we ignore the return value here, we have no way of knowing whether the message was sent successfully. This way of sending is acceptable when silently dropping a message does not matter, for example when logging Twitter messages or less important application log records.

Although we ignore errors that may occur while sending to the broker or in the broker itself, the producer can still throw an exception before the message is even sent. These can be a SerializationException (the message could not be serialized), a BufferExhaustedException or TimeoutException (the buffer is full), or an InterruptException (the sending thread was interrupted).

Sending a message synchronously

ProducerRecord<String, String> record = new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record).get();
} catch (Exception e) {
    e.printStackTrace();
}

Here, producer.send() first returns a Future object, and we then call the Future's get() method to wait for Kafka's reply. If the broker returns an error, get() throws an exception. If no error occurred, we get a RecordMetadata object, from which we can retrieve the offset of the message. If any error occurred before or while sending, for example if the broker returned a non-retriable exception or the number of retries was exhausted, an exception is thrown. In this example we simply print it.

How the producer handles errors returned from Kafka

KafkaProducer generally encounters two kinds of errors. One kind is retriable errors, which can be resolved by sending the message again. For example, a connection error can be resolved by re-establishing the connection, and a "no leader" error can be resolved by electing a new leader for the partition. KafkaProducer can be configured to retry these automatically; the application only sees a retriable exception if the error could not be resolved after the configured number of retries. The other kind of error cannot be resolved by retrying, for example "message size too large". For these, KafkaProducer does not retry at all and throws the exception immediately.

Sending a message asynchronously

Suppose a message takes 10ms to make the round trip between the application and the Kafka cluster. If we wait for a reply after each message, sending 100 messages takes one second. If instead we send the messages without waiting for replies, sending 100 messages takes far less time. Most of the time we do not really need the reply; Kafka sends back the topic, the partition, and the offset of the record, which the sending application usually does not require. On the other hand, when a send fails we do need to know, so that we can throw an exception, log the error, or write the message to an "errors" file for later analysis.

To handle exceptions while still sending asynchronously, the producer supports callbacks. The following is an example of an asynchronous send with a callback.
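A minimal sketch of an asynchronous send with a callback might look like the following, assuming the producer and topic set up earlier in this chapter (the record contents are only illustrative):

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class AsyncSendExample {
    // Implementing the Callback interface lets us react to the broker's
    // response without blocking the sending thread.
    private static class DemoProducerCallback implements Callback {
        @Override
        public void onCompletion(RecordMetadata recordMetadata, Exception e) {
            if (e != null) {
                // The send failed even after automatic retries: log the error
                // or write the record to an "errors" file for later analysis.
                e.printStackTrace();
            }
        }
    }

    public static void sendAsync(Producer<String, String> producer) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");
        // Pass the callback as the second argument; send() returns immediately
        // and the callback runs when the broker's response arrives.
        producer.send(record, new DemoProducerCallback());
    }
}
```

Because send() returns right away, any per-record error handling has to live in onCompletion() rather than in a try/catch around the call.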

Producer configuration

So far we have covered only the mandatory configuration parameters of the producer API: bootstrap.servers and the serializers.

The producer has many more configurable parameters; they are all documented in the Kafka documentation, and most have reasonable defaults, so there is usually no need to change them. However, several parameters have a significant impact on the producer's memory usage, performance, and reliability, and we review those here.

1. acks

The acks parameter specifies how many partition replicas must receive the message before the producer considers the write successful.

This parameter has a major influence on the likelihood of losing messages. It has the following options.
• If acks=0, the producer does not wait for any response from the server before considering the message successfully written. This means that if something goes wrong and the server never receives the message, the producer will not know, and the message will be lost. However, since the producer does not wait for replies, it can send messages as fast as the network supports, achieving very high throughput.

• If acks=1, the producer receives a success response from the server as soon as the leader replica receives the message. If the message cannot be written to the leader (for example, the leader crashed and a new leader has not yet been elected), the producer receives an error response and resends the message to avoid data loss. However, the message can still be lost if a node that never received it becomes the new leader. Throughput in this mode depends on whether we send synchronously or asynchronously. If the client waits for the server's reply to each send (by calling get() on the Future object), latency increases noticeably (by at least the network round trip). If the client sends asynchronously with callbacks, the latency is hidden, but throughput is still limited by the number of in-flight messages (that is, how many messages the producer may send before receiving responses from the server).

• If acks=all, the producer receives a success response only after all in-sync replicas have received the message. This is the safest mode: it guarantees that more than one server has the message, so the cluster keeps running even if one server crashes (Chapter 5 discusses this in more detail). However, its latency is higher than with acks=1, because we wait for more than one server to receive the message.
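Since acks is just another producer property, choosing the safest setting is a one-line change to the configuration shown earlier in the chapter; a sketch (broker addresses are placeholders):

```java
import java.util.Properties;

public class AcksConfigSketch {
    public static Properties safeProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder hosts
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: wait until every in-sync replica has the message before
        // treating the write as successful -- safest, but highest latency.
        props.put("acks", "all");
        return props;
    }
}
```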

2. buffer.memory

This parameter sets the size of the producer's memory buffer, which the producer uses to hold messages waiting to be sent to the server. If the application sends messages faster than they can be delivered, the producer may run out of buffer space. When that happens, send() either blocks or throws an exception, depending on the block.on.buffer.full parameter (replaced by max.block.ms in release 0.9.0.0, which instead specifies how long send() may block before an exception is thrown).

3. compression.type

By default, messages are sent uncompressed. This parameter can be set to snappy, gzip, or lz4, specifying the compression algorithm applied to messages before they are sent to the brokers. The snappy algorithm was invented by Google; it uses little CPU while providing decent compression ratios and good performance, so it is a good choice when both performance and network bandwidth matter. gzip usually uses more CPU but provides higher compression ratios, so it is preferable when network bandwidth is restricted. Using compression reduces network and storage overhead, which is often the bottleneck when sending messages to Kafka.

4. retries

The producer may receive transient errors from the server (for example, no leader could be found for a partition). In that case, the value of the retries parameter determines how many times the producer will resend the message before giving up and returning an error. By default, the producer waits 100ms between retries; this interval can be changed with the retry.backoff.ms parameter. Before setting the retry count and interval, it is advisable to test how long it takes to recover from a crashed node (for example, how long until all partitions get new leaders), so that the total retry time is longer than the time the Kafka cluster needs to recover from the crash; otherwise the producer gives up too early. Note that some errors are not transient and cannot be resolved by retrying (such as a "message too large" error). In general, because the producer retries automatically, there is no need to handle retriable errors in your application logic; you only need to handle non-retriable errors or the case where the retry limit was exceeded.

5. batch.size

When multiple messages are to be sent to the same partition, the producer puts them in the same batch. This parameter specifies the amount of memory, in bytes (not the number of messages!), that can be used for each batch. When the batch is full, all its messages are sent. However, the producer does not necessarily wait for the batch to fill up: half-full batches, and even batches containing a single message, can also be sent. So setting the batch size very large will not cause delays in sending; it will just use more memory. Setting it too small, on the other hand, adds overhead, because the producer needs to send messages more frequently.

6. linger.ms

This parameter specifies how long the producer waits for additional messages before sending the current batch. KafkaProducer sends a batch when the batch is full or when the linger.ms limit is reached. By default, the producer sends messages as soon as a sender thread is available, even if there is only one message in the batch. Setting linger.ms to something greater than 0 makes the producer wait a few milliseconds before sending a batch, so that more messages can be added to it. Although this increases latency, it also increases throughput (since more messages are sent at once, the per-message overhead is lower).
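batch.size, linger.ms, and compression.type are often tuned together for throughput. A sketch follows; the values are purely illustrative, not recommendations, since the right numbers depend on your message sizes and traffic patterns:

```java
import java.util.Properties;

public class ThroughputTuningSketch {
    // Illustrative throughput-oriented settings; measure against your own
    // workload before adopting any of these numbers.
    public static Properties throughputProps() {
        Properties props = new Properties();
        props.put("batch.size", "65536");        // up to 64 KB of messages per batch
        props.put("linger.ms", "20");            // wait up to 20 ms to fill a batch
        props.put("compression.type", "snappy"); // cheap on CPU, decent ratio
        return props;
    }
}
```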

7. client.id

This parameter can be any string. The brokers use it to identify the source of messages, and it is also used in logging, metrics, and quotas.

8. max.in.flight.requests.per.connection

This parameter specifies how many messages the producer may send before receiving responses from the server. Setting it higher uses more memory but increases throughput. Setting it to 1 guarantees that messages are written to the server in the order they were sent, even when retries occur.

9. timeout.ms, request.timeout.ms and metadata.fetch.timeout.ms

request.timeout.ms specifies how long the producer waits for a reply from the server when sending data, and metadata.fetch.timeout.ms specifies how long the producer waits when requesting metadata (such as who the leader of a target partition is). If the timeout expires without a reply, the producer either retries the send or responds with an error (by throwing an exception or invoking a callback). timeout.ms specifies how long the broker will wait for in-sync replicas to acknowledge the message; the broker returns an error if the time elapses without the acknowledgments required by the acks configuration.

10. max.block.ms

This parameter specifies how long the producer's send() and partitionsFor() methods may block. These methods block when the producer's send buffer is full or when metadata is not available. When the blocking time reaches max.block.ms, the producer throws a timeout exception.

11. max.request.size

This parameter controls the size of a request sent by the producer. It caps both the size of the largest single message that can be sent and the total size of all messages in a single request. For example, with a value of 1MB, the largest single message that can be sent is 1MB, or the producer can batch 1,000 messages of 1KB each into a single request. In addition, the broker has its own limit on the maximum size of message it will accept (message.max.bytes), so it is best to keep the two configurations matched, to avoid the broker rejecting messages sent by the producer.

12. receive.buffer.bytes and send.buffer.bytes

These parameters specify the sizes of the TCP receive and send buffers used by the sockets when reading and writing data. If they are set to -1, the operating system defaults are used. It is a good idea to increase these values when the producer or consumer communicates with brokers in a different datacenter, because networks that span datacenters tend to have higher latency and lower bandwidth.

Ordering guarantees
Kafka preserves the order of messages within a partition. That is, if the producer sends messages in a certain order, the broker writes them to the partition in that order, and consumers read them in the same order. For some use cases, order is very important. If retries is set to a nonzero value and max.in.flight.requests.per.connection is set to more than one, then if writing the first batch fails while writing the second batch succeeds, the broker will retry writing the first batch. If the first batch then succeeds on the retry, the order of the two batches is reversed.

Generally, if order matters, successful writes matter too, so setting retries to 0 is not recommended. Instead, you can set max.in.flight.requests.per.connection to 1, so that no other messages are sent to the broker while the first message is still being retried. However, this severely limits producer throughput, so use it only when the order of messages is critical.
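The ordering recommendation above boils down to two settings; a sketch (the retry count is illustrative):

```java
import java.util.Properties;

public class OrderingConfigSketch {
    public static Properties orderPreservingProps() {
        Properties props = new Properties();
        // Keep retries enabled so transient errors don't lose messages...
        props.put("retries", "3");
        // ...but allow only one in-flight request, so a retried batch can
        // never leapfrog a later one. This trades away throughput.
        props.put("max.in.flight.requests.per.connection", "1");
        return props;
    }
}
```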

 

Origin www.cnblogs.com/weknow619/p/10941697.html