Taobao-level detail: 5 consistency schemes for MySQL and ES, do you know them?

Said up front

In the reader exchange groups (50+) of Nien, the 40-year-old veteran architect, some friends have recently obtained interview invitations from first-tier Internet companies such as Pinduoduo, J&T Express, Youzan, and Shein. A very important interview question they met:

  • Talk about 5 MySQL and Elasticsearch data consistency schemes

Similar questions encountered by other friends include:

  • MySQL and ES data consistency issues and solutions?
  • MySQL and Redis data consistency issues and solutions?
  • How to ensure data consistency between MySQL and Redis?
  • How to ensure data consistency between MySQL and HBase?
  • And so on…

Here, Nien gives you a systematic review of this topic, so that you can fully show off your strong "technical muscles" and make the interviewer love you to the point of "losing control and drooling".

This question and its reference answer are also included in the V70 edition of our "Nien Java Interview Collection", for the reference of friends who come later, to improve everyone's 3-high (high-concurrency, high-performance, high-availability) architecture, design, and development skills.

For the PDF files of the latest "Nien Architecture Notes", "Nien High Concurrency Trilogy" and "Nien Java Interview Collection", go to the official account [Technical Freedom Circle] at the end of the article.

Problem Scenario Analysis

In production, to support aggregated search and high-speed search of products, we adopt two optimization schemes:

  • Redundantly store product data in Elasticsearch to achieve high-speed search
  • Redundantly store product data in Redis to achieve high-speed caching

In many cases, high data consistency is required.

For example:

  • MySQL and ES are required to reach second-level data synchronization.
  • MySQL and Redis are required to reach second-level data synchronization.
  • MySQL and HBase are required to reach second-level data synchronization.

Next, we analyze data consistency between MySQL and ES as the business scenario. Other scenarios, such as data consistency between MySQL and Redis, are similar.

As long as you can talk through the following 5 data consistency solutions, the interviewer will definitely love it to the point of "losing control and drooling".

Solution 1: Synchronous double writing

Synchronous double writing is the simplest approach: when writing data to MySQL, write the same data to ES at the same time.
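As a rough illustration (not any particular codebase), here is a minimal runnable sketch of the double write, with both stores stood in by in-memory maps; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class SyncDoubleWrite {
    // In-memory stand-ins for the two real stores.
    public static final Map<Long, String> mysql = new HashMap<>();
    public static final Map<Long, String> es = new HashMap<>();

    // Synchronous double write: the caller waits for BOTH writes.
    // If the ES write throws after the DB write succeeded, the two
    // stores diverge -- exactly the "high risk" drawback of this scheme.
    public static void saveProduct(long id, String doc) {
        mysql.put(id, doc); // 1. write the source of truth
        es.put(id, doc);    // 2. synchronously mirror into the search index
    }

    public static void main(String[] args) {
        saveProduct(1L, "iPhone 15");
        System.out.println(mysql.get(1L).equals(es.get(1L))); // prints: true
    }
}
```

In real code the two `put` calls would be a MySQL transaction plus an ES index request, and the caller's response time includes both.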

Advantages of synchronous double writing:

This approach is simple and direct, and writes take effect in real time, within seconds.

Disadvantages of synchronous double writing:

  • Business coupling: this approach is highly intrusive to the code. A large amount of data synchronization code gets coupled into product management; everywhere MySQL was written before, and everywhere it will be written in the future, the ES write code must be added as well.
  • Performance impact: writing to two stores lengthens the response time. MySQL performance is not high to begin with; adding an ES write inevitably lowers system performance.
  • Inconvenient to extend: search may have personalized requirements, such as data aggregation, which are inconvenient to implement this way.
  • High risk: if one of the two writes fails, data can be lost.

Solution 2: Asynchronous double write

Synchronous operations have low performance and asynchronous operations have high performance.

Asynchronous double writing comes in two forms:

  • Asynchronous via an in-memory queue (such as a blocking queue)
  • Asynchronous via a message queue

Scheme 2.1: Asynchronous via an in-memory queue (such as a blocking queue)

First write the product data into the DB, then put the change into a BlockingQueue.

A consumer thread asynchronously drains the queue and writes batches to Elasticsearch, keeping the data consistent.
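The queue-and-drain flow above can be sketched like this; the DB and ES are stood in by maps, and `drainBatchToEs` plays the role of the consumer thread's loop body (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncQueueSync {
    public static final BlockingQueue<Long> changedIds = new LinkedBlockingQueue<>();
    public static final Map<Long, String> db = new ConcurrentHashMap<>();
    public static final Map<Long, String> es = new ConcurrentHashMap<>();

    // Write path: persist to the DB, then enqueue the id.
    // The caller does NOT wait for the ES write -- that is the performance win.
    public static void saveProduct(long id, String doc) {
        db.put(id, doc);
        changedIds.offer(id);
    }

    // Consumer side: drain up to maxBatch ids and bulk-index them into "ES".
    // In a real service this runs in a dedicated consumer thread.
    public static int drainBatchToEs(int maxBatch) {
        List<Long> batch = new ArrayList<>();
        changedIds.drainTo(batch, maxBatch);
        for (Long id : batch) {
            es.put(id, db.get(id)); // re-read the DB so ES gets the latest value
        }
        return batch.size();
    }

    public static void main(String[] args) {
        saveProduct(1L, "iPhone 15");
        saveProduct(2L, "MacBook Pro");
        System.out.println(drainBatchToEs(100) + " docs indexed"); // 2 docs indexed
    }
}
```

Note that the queue lives only in this process's memory, which is exactly the weakness Scheme 2.2 addresses.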

Scheme 2.2: Asynchronous via a message queue (such as RocketMQ or Kafka)

If data in the in-memory queue is lost, ES and the DB become inconsistent.

How can this be solved?

  • Method 1: periodically synchronize DB data to ES. The synchronization period is usually long, so long windows of inconsistency remain.
  • Method 2: guarantee the reliability of the queue by using a highly reliable message queue.

In a production scenario, there is usually a dedicated search service that subscribes to product-change messages to complete the synchronization.
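On the consumer side, the MQ reliability mentioned above usually comes down to ack-after-success: the message is acknowledged only once ES accepts the write; otherwise the broker redelivers it. A hedged sketch, with a stand-in `EsClient` interface instead of any real ES or MQ client API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MqConsumerSketch {
    public static final Map<Long, String> es = new ConcurrentHashMap<>();

    // Stand-in for a real ES client; only here so the sketch is self-contained.
    public interface EsClient { void index(long id, String doc); }

    // Returns true = ack the message; false = nack so the broker redelivers it.
    // Because the handler is re-run on redelivery, the ES write must be
    // idempotent (indexing by id is naturally an upsert).
    public static boolean onProductChanged(long id, String doc, EsClient esClient) {
        try {
            esClient.index(id, doc);
            return true;  // ack: the write succeeded
        } catch (RuntimeException e) {
            return false; // nack: ES was down or the write failed; retry later
        }
    }

    public static void main(String[] args) {
        EsClient healthy = (id, doc) -> es.put(id, doc);
        EsClient down = (id, doc) -> { throw new RuntimeException("ES down"); };
        System.out.println(onProductChanged(1L, "iPhone 15", healthy)); // true
        System.out.println(onProductChanged(2L, "MacBook", down));      // false
    }
}
```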

Advantages of asynchronous double write:

  • High performance;
  • Data loss is unlikely, mainly thanks to MQ's consumption guarantee mechanism: if ES is down or a write fails, the MQ message can be consumed again;
  • Writes to multiple sinks are isolated from each other, making it easy to add more data sinks.

Disadvantages of asynchronous double writing:

  • Hard-coding: each newly added data sink requires new consumer code;
  • System complexity increases, since message middleware is introduced;
  • MQ is an asynchronous consumption model, so data written by users may not be immediately visible; there is a delay.

Solution 3: Periodic synchronization

Ensuring data consistency between the DB and ES/HBase covers two aspects:

  • Incremental Data Consistency
  • Full data consistency

To guarantee full data consistency between the DB and ES/HBase, a periodic full synchronization is usually required.

If incremental data is rare and its consistency requirement is not high, the synchronous or asynchronous double write for incremental data can even be dropped, leaving only the periodic sync.

Advantages of periodic synchronization:

Simple to implement

Disadvantages of periodic synchronization:

  • Real-time performance is hard to guarantee
  • Greater pressure on the storage layer

Of course, incremental data can also be handled with a timed task:

  1. Add a timestamp field to the relevant database tables; any CRUD operation updates this field;
  2. The CRUD code in the original program needs no changes;
  3. Add a timer program that scans the specified tables on a fixed period and extracts the data changed within that period;
  4. Write the changes to ES one by one.
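The four steps above can be sketched as follows; the table and ES are in-memory stand-ins, and `scanOnce` is what the timer would invoke each period (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TimestampScanSync {
    // A row carries the updated_at timestamp that every CRUD operation refreshes.
    public record Row(long id, String doc, long updatedAt) {}

    public static final List<Row> table = new ArrayList<>(); // stands in for the DB table
    public static final Map<Long, String> es = new HashMap<>();
    static long lastScanAt = 0;

    // One timer tick: re-index only rows changed since the previous scan.
    // Returns how many rows were synchronized.
    public static int scanOnce(long now) {
        int synced = 0;
        for (Row r : table) {
            if (r.updatedAt() > lastScanAt) { // "changed within the period"
                es.put(r.id(), r.doc());
                synced++;
            }
        }
        lastScanAt = now; // next scan starts from here
        return synced;
    }

    public static void main(String[] args) {
        table.add(new Row(1L, "iPhone 15", 100));
        table.add(new Row(2L, "MacBook", 200));
        System.out.println(scanOnce(300)); // 2: both rows are newer than lastScanAt=0
        System.out.println(scanOnce(400)); // 0: nothing changed since the last scan
    }
}
```

In a real system the scan is a SQL query on the indexed timestamp column (`WHERE updated_at > ?`), and clock skew between the scanner and the DB needs a small safety margin.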

Solution 4: Data Subscription

If you want better real-time behavior with low intrusion, you can synchronize using MySQL's binlog.

MySQL implements master-slave replication through the binlog. Canal Server acts as a disguised slave node: after receiving the binlog, it sends the change events to MQ, and other services consume them from MQ to implement data subscription.

The architecture flow: MySQL binlog → Canal Server (disguised slave) → MQ → downstream consumers.

This approach is similar to asynchronous double writing, but data subscription has two advantages:

  • less intrusion into the product service;
  • better data real-time performance.
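Downstream of Canal/MQ, a consumer typically routes each change event by its operation type to the matching ES operation. This is an illustrative sketch, not the real Canal client API; the event and field names are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class BinlogEventHandler {
    public static final Map<Long, String> es = new HashMap<>();

    public enum Op { INSERT, UPDATE, DELETE }

    // Hypothetical shape of a binlog-derived change event: the operation
    // plus the (post-image) row data.
    public record ChangeEvent(Op op, long id, String doc) {}

    // Route the event: inserts/updates become upserts, deletes drop the doc.
    public static void apply(ChangeEvent e) {
        switch (e.op()) {
            case INSERT, UPDATE -> es.put(e.id(), e.doc()); // upsert into the index
            case DELETE -> es.remove(e.id());               // remove the document
        }
    }

    public static void main(String[] args) {
        apply(new ChangeEvent(Op.INSERT, 1L, "iPhone 15"));
        apply(new ChangeEvent(Op.UPDATE, 1L, "iPhone 15 Pro"));
        apply(new ChangeEvent(Op.DELETE, 1L, null));
        System.out.println(es.isEmpty()); // true: the document was removed again
    }
}
```

Because binlog events arrive in commit order per row, replaying them this way keeps ES convergent with the DB.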

As for the choice of data subscription framework, the mainstream options are roughly the following:

                     Canal                     Maxwell                      python-mysql-replication
Open sourced by      Alibaba                   Zendesk                      Community
Language             Java                      Java                         Python
Activity             Active                    Active                       Active
High availability    Supported                 Supported                    Not supported
Clients              Java/Go/PHP/Python/Rust   None                         Python
Message sinks        Kafka/RocketMQ, etc.      Kafka/RabbitMQ/Redis, etc.   Custom
Message format       Custom                    JSON                         Custom
Documentation        Detailed                  Detailed                     Detailed
Bootstrap            Not supported             Supported                    Not supported

Note: in the hands-on architecture of Nien's 100W-qps three-level cache component, it is also mentioned that this architecture has second-level delay.

If your scenario does not allow second-level delay, this architecture cannot be used.

For details, see Nien's 100W-qps three-level cache component architecture practice.

Solution 5: ETL tools

Synchronizing MySQL to Redis, MySQL to HBase, MySQL to ES, as well as cross-datacenter synchronization and master-slave synchronization, can all consider using an ETL tool.

What is an ETL tool?

ETL is the abbreviation of Extract-Transform-Load, describing the process of extracting, transforming, and loading data from a source to a destination. The term ETL is most commonly used in data warehousing, but its scope is not limited to data warehouses.

ETL is an important part of building a data warehouse: users extract the required data from the data source, clean it, and finally load it into the data warehouse according to a pre-defined warehouse model.
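A toy illustration of the three ETL steps described above, with plain collections standing in for the source rows and the warehouse (all types and names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EtlSketch {
    public record SourceRow(long id, String name, String priceCents) {}
    public record TargetDoc(long id, String name, double priceYuan) {}

    // Extract is simulated by receiving the rows; transform cleans and
    // converts them; load writes them into the destination map.
    public static Map<Long, TargetDoc> run(List<SourceRow> extracted,
                                           Map<Long, TargetDoc> warehouse) {
        List<TargetDoc> transformed = extracted.stream()
            .filter(r -> r.name() != null && !r.name().isBlank())  // cleaning
            .map(r -> new TargetDoc(r.id(), r.name().trim(),
                    Long.parseLong(r.priceCents()) / 100.0))       // conversion
            .collect(Collectors.toList());
        for (TargetDoc d : transformed) warehouse.put(d.id(), d);  // load
        return warehouse;
    }

    public static void main(String[] args) {
        Map<Long, TargetDoc> wh = new HashMap<>();
        run(List.of(new SourceRow(1L, " iPhone 15 ", "699900")), wh);
        System.out.println(wh.get(1L).priceYuan()); // prints: 6999.0
    }
}
```

Real ETL tools run exactly this pipeline at scale, with connectors for the extract and load ends and a configurable transform stage in the middle.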

Commonly used ETL tools include: Databus, Canal (Solution 4 uses this component, and it has some ETL capabilities), Otter, Kettle, etc.

Let's take Databus as an example.

Databus is a low-latency, reliable, transactional, and consistent data change capture system. Open sourced by LinkedIn in 2013.

By mining database logs, Databus pulls database changes in a real-time and reliable way; through a customized client, the business can obtain the changes in real time and perform further business logic.

Features:

  • Multiple data sources: Databus supports change capture from multiple data sources, including Oracle and MySQL.
  • Scalable, highly available: Databus can scale to thousands of consumers and transactional data sources while maintaining high availability.
  • Ordered transaction commit: Databus preserves the transactional integrity of the source database and delivers change events grouped by transaction, in source commit order.
  • Low latency, multiple subscription mechanisms: after a change commits at the source, Databus can deliver the transaction to consumers within milliseconds; consumers can also use Databus's server-side filtering to fetch only the specific data they need.
  • Unlimited backtracking: consumers can rewind arbitrarily far. For example, when a consumer needs to regenerate a complete copy of the data, this imposes no extra burden on the database; the feature is also useful when a consumer's data lags far behind the source database.

Let's look at the Databus system architecture.

Databus consists of Relays, the Bootstrap service, and the Client lib; the Bootstrap service includes the Bootstrap Producer and the Bootstrap Server.

  • Consumers that keep up with the change stream get events directly from the Relay;
  • If a consumer's position falls far behind, the data it wants is no longer in the Relay log; it must instead request the Bootstrap service, which returns a snapshot of all data changes since the consumer last processed a change.

Open source address: https://github.com/linkedin/databus

Said at the end

The data consistency scheme is a very common interview question.

If you can answer the above 5 solutions fluently and familiarly, the interviewer will basically be shocked and attracted by you.

In the end, the interviewer loves you to the point of "losing control and drooling", and the offer is on its way.

During the learning process, if you have any questions, you can come and discuss them with the 40-year-old architect Nien.

The question and reference answer of this article are included in the V70 edition of our "Nien Java Interview Collection"; go to the official account [Technical Freedom Circle] at the end of the article to get it.

References:

Tsinghua University Press, "Nien Java High Concurrency Core Programming, Volume 2, Enhanced Edition"

Topic 29, the multi-threading interview topic, in the 4000-page "Nien Java Interview Collection"

[1] https://www.infoq.cn/article/1afyz3b6hnhprrg12833
[2] https://www.iamle.com/archives/2900.html
[3] https://blog.51cto.com/lianghecai/4755693
[4] https://qinyuanpei.github.io/posts/1333693167/
[5] https://github.com/alibaba/canal/wiki/ClientAdapter

The realization path of technical freedom, in PDF:

Realize your architectural freedom:

" Have a thorough understanding of the 8-figure-1 template, everyone can do the architecture "

" 10Wqps review platform, how to structure it? This is what station B does! ! ! "

" Alibaba Two Sides: How to optimize the performance of tens of millions and billions of data?" Textbook-level answers are coming "

" Peak 21WQps, 100 million DAU, how is the small game "Sheep a Sheep" structured? "

" How to Scheduling 10 Billion-Level Orders, Come to a Big Factory's Superb Solution "

" Two Big Factory 10 Billion-Level Red Envelope Architecture Scheme "

… more architecture articles, being added

Realize your responsive freedom:

" Responsive Bible: 10W Words, Realize Spring Responsive Programming Freedom "

This is the old version of " Flux, Mono, Reactor Combat (the most complete in history) "

Realize your spring cloud freedom:

" Spring cloud Alibaba Study Bible "

" Sharding-JDBC underlying principle and core practice (the most complete in history) "

" Get it done in one article: the chaotic relationship between SpringBoot, SLF4j, Log4j, Logback, and Netty (the most complete in history) "

Realize your linux freedom:

" Linux Commands Encyclopedia: 2W More Words, One Time to Realize Linux Freedom "

Realize your online freedom:

" Detailed explanation of TCP protocol (the most complete in history) "

" Three Network Tables: ARP Table, MAC Table, Routing Table, Realize Your Network Freedom!" ! "

Realize your distributed lock freedom:

" Redis Distributed Lock (Illustration - Second Understanding - The Most Complete in History) "

" Zookeeper Distributed Lock - Diagram - Second Understanding "

Realize your king component freedom:

" King of the Queue: Disruptor Principles, Architecture, and Source Code Penetration "

" The King of Cache: Caffeine Source Code, Architecture, and Principles (the most complete in history, 10W super long text) "

" The King of Cache: The Use of Caffeine (The Most Complete in History) "

" Java Agent probe, bytecode enhanced ByteBuddy (the most complete in history) "

Realize your interview questions freely:

4000 pages of "Nin's Java Interview Collection" 40 topics



Origin blog.csdn.net/crazymakercircle/article/details/130966907