Interviewer: How do you ensure reliable message delivery? In other words, how do you handle the problem of lost messages?

This article was first published by yanglbme in Doocs, an open source technology community on GitHub; the project currently has more than 30k stars.
Project address: github.com/doocs/advan...

Interview question

How do you ensure reliable message delivery? In other words, how do you handle the problem of lost messages?

Interviewer's psychological analysis

This one is a given. A basic principle of using MQ is that data cannot be more, and cannot be less. "Not more" is the duplicate-consumption and idempotency problem discussed earlier. "Not less" means the data must not get lost. This is a question you have to think through.

If you use MQ to deliver truly core messages, such as billing and deduction messages, you must ensure that these messages are never lost while passing through MQ.

Interview question analysis

Data loss can occur at the producer, in MQ itself, or at the consumer. Let's analyze it for RabbitMQ and Kafka in turn.

RabbitMQ

The producer loses data

When the producer sends data to RabbitMQ, the data may be lost en route, for example because of network problems.

At this point you can choose to use the transaction feature RabbitMQ provides: the producer opens a transaction with channel.txSelect() before sending, then sends the message. If RabbitMQ did not receive the message successfully, the producer gets an exception, at which point it can roll back the transaction with channel.txRollback() and retry sending; if the message was received, it can commit the transaction with channel.txCommit().

// open the transaction
channel.txSelect();
try {
    // send the message here (queue name and payload are illustrative)
    channel.basicPublish("", "billing_queue", null, "bill #42".getBytes());
    // commit the transaction
    channel.txCommit();
} catch (Exception e) {
    // roll back the transaction on failure
    channel.txRollback();
    // resend the message here
}

But the problem is that RabbitMQ's transaction mechanism is synchronous, so in practice it basically kills throughput, because it costs too much performance.

So in general, if you want to make sure messages written to RabbitMQ are not lost, you can enable confirm mode. After the producer turns on confirm mode, each message it writes is assigned a unique id. If the message makes it into RabbitMQ, RabbitMQ sends you back an ack telling you the message was written ok. If RabbitMQ failed to process the message, it calls back your nack interface, telling you the write failed so you can retry. You can also combine this mechanism with keeping each message's id and state in memory yourself: if you haven't received a callback for a message after a certain time, you can resend it.

The biggest difference between the transaction mechanism and the confirm mechanism is that the transaction mechanism is synchronous: after you commit a transaction, you block there until it completes. The confirm mechanism is asynchronous: after sending one message you can immediately send the next, and once RabbitMQ has received a message, it asynchronously calls back your interface to notify you that the message arrived.

So to avoid losing data on the producer side, the confirm mechanism is generally used.
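As a rough illustration, here is a minimal sketch of confirm mode with the RabbitMQ Java client. The channel is assumed to come from an open connection; the queue name, payload, and the outstanding map used to track unconfirmed publishes are illustrative assumptions:

import com.rabbitmq.client.ConfirmListener;
import java.util.concurrent.ConcurrentSkipListMap;

// channel is assumed to come from an open Connection
channel.confirmSelect(); // switch the channel into confirm mode

// track unconfirmed messages by publish sequence number (the unique id)
ConcurrentSkipListMap<Long, String> outstanding = new ConcurrentSkipListMap<>();

channel.addConfirmListener(new ConfirmListener() {
    @Override
    public void handleAck(long deliveryTag, boolean multiple) {
        // RabbitMQ accepted the message(s): stop tracking them
        if (multiple) {
            outstanding.headMap(deliveryTag, true).clear();
        } else {
            outstanding.remove(deliveryTag);
        }
    }

    @Override
    public void handleNack(long deliveryTag, boolean multiple) {
        // RabbitMQ failed to handle the message(s): look them up in
        // `outstanding` and resend them
    }
});

long seqNo = channel.getNextPublishSeqNo();
outstanding.put(seqNo, "bill #42"); // illustrative payload
channel.basicPublish("", "billing_queue", null, "bill #42".getBytes());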

RabbitMQ loses data

To keep RabbitMQ itself from losing data, you must enable RabbitMQ's persistence: messages are persisted to disk after being written, so even if RabbitMQ itself goes down, it automatically reads the previously stored data after it recovers, and data is generally not lost. The exception is the extremely rare case where RabbitMQ dies before a message has been persisted, which can lose a small amount of data, but the probability is small.

Setting up persistence takes two steps:

  • When creating the queue, set it as durable.
    This ensures RabbitMQ persists the queue's metadata; it does not, however, persist the data in the queue.
  • When sending a message, set its deliveryMode to 2.
    This marks the message itself as persistent, and RabbitMQ will then persist it to disk.

Both must be set for persistence to work. Then even if RabbitMQ goes down, it will recover the queue from disk on restart, and recover the data in the queue as well.
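A minimal sketch of both settings with the RabbitMQ Java client (the queue name and payload are illustrative assumptions):

import com.rabbitmq.client.MessageProperties;

// step 1: durable = true (second argument) persists the queue's metadata
channel.queueDeclare("billing_queue", true, false, false, null);

// step 2: PERSISTENT_TEXT_PLAIN carries deliveryMode = 2, marking the
// message itself as persistent so RabbitMQ writes it to disk
channel.basicPublish("", "billing_queue",
        MessageProperties.PERSISTENT_TEXT_PLAIN,
        "bill #42".getBytes());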

Note that even with RabbitMQ's persistence mechanism enabled, it is still possible for a message to be written to RabbitMQ but not yet persisted to disk when, unluckily, RabbitMQ goes down at that exact moment, losing a small amount of in-memory data.

So persistence can be combined with the producer-side confirm mechanism: only after a message has been persisted to disk is the producer notified with an ack. That way, even if RabbitMQ dies before persisting the message and the data is lost, the producer never receives the ack and can resend the message itself.

The consumer loses data

If RabbitMQ loses data on the consumer side, it is mainly because the consumer process died right after consuming the message, before processing it, for example due to a restart. Awkwardly, RabbitMQ thinks you have already consumed the message, and the data is lost.

The fix is the ack mechanism RabbitMQ provides. Simply put, you turn off RabbitMQ's automatic ack (this is done through an API call), then have your own code send the ack explicitly once processing is finished. That way, if you haven't finished processing, there is no ack, so RabbitMQ assumes the message wasn't fully processed and redelivers it to another consumer; the message is not lost.
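A minimal sketch of manual acking with the RabbitMQ Java client; the queue name and the process method are illustrative assumptions:

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import java.io.IOException;

// autoAck = false: we acknowledge manually, only after processing succeeds
channel.basicConsume("billing_queue", false, new DefaultConsumer(channel) {
    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body)
            throws IOException {
        process(new String(body)); // your business logic (assumed)
        // if we crash before this ack, RabbitMQ redelivers the message
        getChannel().basicAck(envelope.getDeliveryTag(), false);
    }
});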

Kafka

The consumer loses data

The only case in which the consumer can lose data is this: you consume a message, and the consumer side automatically commits the offset, making Kafka think you have already consumed it, when in fact you were only about to process it and died before doing so. That message is then lost.

Isn't that almost exactly like RabbitMQ? Kafka commits offsets automatically by default, so just turn off automatic offset commits and commit the offset manually after processing; that guarantees the data is not lost. There may indeed still be duplicate consumption, though: for example, you finish processing but die before committing the offset, in which case the message is certainly consumed once more. Just guarantee idempotency yourself and you're fine.
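A minimal sketch with the Kafka Java consumer; the broker address, group id, topic name, and process method are illustrative assumptions:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
props.put("group.id", "billing-group");           // assumed consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("enable.auto.commit", "false");         // turn off auto offset commit

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("billing"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record.value()); // business logic first; should be idempotent
    }
    // commit only after processing; a crash before this line means the
    // batch is consumed again (duplicate consumption, not data loss)
    consumer.commitSync();
}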

One problem we hit in production: after our Kafka consumer got data, it first wrote it into an in-memory queue to buffer it. Sometimes a message had only just been written to the memory queue when the consumer automatically committed the offset. If we then restarted the system at that moment, the data in the memory queue that hadn't been processed yet was lost.

Kafka loses data

A fairly common scenario here is a Kafka broker going down, followed by re-election of a partition leader. Think about it: if some followers haven't finished syncing data at that point, the leader dies, and one of those followers is then elected leader, won't it be missing some data? Some data is lost.

We ran into this in production too: a Kafka leader machine went down, a follower was switched to leader, and some data was indeed lost.

So at this point, you generally need to set at least the following four parameters:

  • Set the replication.factor parameter on the topic: the value must be greater than 1, requiring each partition to have at least 2 replicas.
  • Set the min.insync.replicas parameter on the Kafka server: the value must be greater than 1, requiring the leader to perceive at least one follower that is still in contact with it and not lagging behind, so that one follower remains if the leader goes down.
  • Set acks=all on the producer: this requires each piece of data to be written to all (in-sync) replicas before the write is considered successful.
  • Set retries=MAX on the producer (a very large value, meaning retry without limit): this requires a failed write to be retried indefinitely, blocking here until it succeeds.

Our production environment is configured exactly this way. With this configuration, it is at least guaranteed on the Kafka broker side that data is not lost when the broker hosting a partition's leader fails and a leader switch occurs.
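A minimal producer-side sketch of the last two settings, using the Kafka Java client; the broker address and serializers are illustrative assumptions, and the first two settings live on the Kafka side rather than in the producer:

import org.apache.kafka.clients.producer.KafkaProducer;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "all");                 // wait for all in-sync replicas
props.put("retries", Integer.MAX_VALUE);  // retry failed writes indefinitely

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// replication.factor and min.insync.replicas are topic/broker-side settings,
// configured when the topic is created, not in the producer.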

Will the producer lose data?

If you set acks=all following the approach above, data will not be lost. The requirement is: a write is considered successful only after the leader has received the message and all in-sync followers have synchronized it. If that condition is not met, the producer automatically retries, without limit.


Welcome to follow the WeChat official account "Doocs Open Source Community" to be the first to receive original technical articles.

Source: juejin.im/post/5dd1ff725188254ebf7e1bbe