How to optimize tables with huge data volumes and slow reads/writes (2): query separation

As discussed in the previous article, the hot/cold separation solution is cost-effective but not optimal. It still has many shortcomings: cold-data queries are slow, the business cannot modify cold data, and once the cold data itself grows too large the system still cannot cope. To solve these problems one by one, we can turn to another solution: query separation. (Note: query separation is not the same thing as read-write separation.)

Business scenario two

In a SaaS customer-service system there is a work-order query feature. The work-order table stores tens of millions of rows; querying it requires joining more than a dozen sub-tables, and each sub-table holds over 100 million rows.

Faced with such a huge volume of data, just as in the hot/cold separation scenario, every customer query took tens of seconds to return. Even with database optimization techniques such as indexing and SQL tuning, the improvement was not obvious.

In addition, some of the data in the work-order table is several years old, but it is involved in litigation and must remain updatable. These old records therefore cannot be archived elsewhere, so the hot/cold separation scheme from the previous article does not apply.

In the end, query separation solved the problem: the data to be updated lives in one database, while the data to be queried lives in another system. Because updates now touch a single table with no joins or foreign keys, write speed improved immediately, and queries are handled by an engine specialized in large data volumes, which quickly met the actual query demand.

After this change, every query returned within 500 ms, and the customer complaints stopped.

The example above should give you a feel for the business scenario behind query separation; to master the whole approach, read on.

What is query separation?

The concept of query separation is easy to grasp from its name: every time data is written, a copy is saved to another storage system, and users read directly from that other system when querying. The schematic is as follows:

[Figure: query separation schematic — writes go to the main store, a copy is saved to the query store, and reads hit the query store]

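To make the idea concrete, here is a minimal in-memory sketch of the pattern. The store names and the work-order fields are hypothetical, not from the article; two dicts stand in for the write database and the query system:

```python
primary_store = {}   # stands in for the write database
query_store = {}     # stands in for the query/search system

def write_ticket(ticket_id, data):
    primary_store[ticket_id] = data       # 1. normal write
    query_store[ticket_id] = dict(data)   # 2. save a copy for querying

def query_ticket(ticket_id):
    """Reads never touch the primary store."""
    return query_store.get(ticket_id)

write_ticket(1, {"status": "pending review"})
print(query_ticket(1))  # {'status': 'pending review'}
```

In a real system the query store would be a separate engine (such as ES), and the copy step is exactly the "trigger" question discussed below.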
In what scenarios is query separation used?

When you encounter the following situations in actual business, you can consider a query separation solution.

  • The volume of data is large;
  • Write requests perform acceptably;
  • Query requests are very slow;
  • Any data may be modified at any time;
  • The business wants to optimize query performance.

Being familiar with the concept of query separation but knowing nothing about its use cases is not enough. Only by understanding where query separation really applies can you choose the right solution when an actual problem arises.

Implementation ideas for query separation

In actual work, if the business genuinely requires a query separation solution, we must grasp how to implement it; only then can we proceed in an orderly way when we really hit problems.

The implementation of a query separation solution comes down to four questions:

  1. How to trigger query separation?
  2. How to implement query separation?
  3. How to store the query data?
  4. How to use the query data?

Let's work through these questions one by one.

(1) How to trigger query separation?

This question asks when we should save a copy of the data into the query store — that is, when the query separation action should be triggered.

Generally speaking, there are three kinds of trigger logic for query separation.

(1) Modify the business code: after writing the main data, build the query data synchronously.

[Figure: trigger logic 1 — query data built synchronously inside the write path]

(2) Modify the business code: after writing the main data, build the query data asynchronously.

[Figure: trigger logic 2 — query data built asynchronously after the write returns]

(3) Monitor the database log: whenever data changes, update the query data.

[Figure: trigger logic 3 — a database-log monitor updates the query data on every change]

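The difference between these trigger styles can be sketched in a few lines. This is an illustrative in-memory model, not a real implementation: trigger 1 builds the query copy inside the write path, trigger 2 only enqueues a signal, and a deferred worker — which could just as well be a binlog consumer (trigger 3) — catches up later:

```python
import queue

main_db, query_db = {}, {}   # hypothetical main and query stores
pending = queue.Queue()      # stands in for MQ / a change log

def write_sync(oid, data):
    """Trigger 1: the write returns only after the query copy is built."""
    main_db[oid] = data
    query_db[oid] = dict(data)

def write_async(oid, data):
    """Trigger 2: the write returns immediately; the copy happens later."""
    main_db[oid] = data
    pending.put(oid)

def drain():
    """The deferred worker that builds the pending query copies."""
    while not pending.empty():
        oid = pending.get()
        query_db[oid] = dict(main_db[oid])

write_async("A", {"status": "pending review"})
print("A" in query_db)  # False: the query copy lags behind the write
drain()
print("A" in query_db)  # True
```

The window between `write_async` and `drain` is precisely the stale-read window discussed in the table below.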
Comparing the three trigger-logic diagrams above, did you notice anything? Their advantages and disadvantages are compared below:

  • Modify business code, build query data synchronously — Advantages: (1) query data stays real-time and consistent; (2) the business logic is flexible and controllable. Disadvantages: (1) intrudes on the business code; (2) slows down write operations.
  • Modify business code, build query data asynchronously — Advantage: does not affect the main flow. Disadvantage: users may query stale data before the query data is updated.
  • Monitor database logs — Advantages: (1) does not affect the main flow; (2) zero intrusion into the business code. Disadvantages: (1) users may query stale data before the query data is updated; (2) the architecture is more complex.

To make the comparison easier to understand, let's unpack a few of its terms.

What does "business logic is flexible and controllable" mean? Generally, whoever writes the business code can quickly determine from the business logic when the query data should be updated, whereas whoever monitors the database log cannot enumerate every change branch and map each one to the right update logic; in the end, any data change forces a full rebuild of the query data.

What does "slows down write operations" mean? How much can one extra step of building query data slow a write down? A lot. For example, a simple update of an order flag might originally take 2 ms, while building the query data may involve rebuilding (with ES, for instance, it touches indexing, sharding, and master-slave replication, each of which breaks down into many sub-actions — more on this later in the series) and take 1 s. Going from 2 ms to 1 s — is that slowdown large enough?

As for "users may query stale data before the query data is updated", consider the second trigger logic. Suppose an operation updates an order's status and the query data is built asynchronously afterwards: the order moves from "pending review" to "reviewed". If updating the query data takes 1 second and a user queries the order status within that second, then although the main data already says "reviewed", the query result still shows "pending review".

Based on the comparison above, the applicable scenario for each trigger logic is:

  • Modify business code, build query data synchronously — the business code is relatively simple and write-response time is not critical.
  • Modify business code, build query data asynchronously — the business code is relatively simple, but writes must respond quickly.
  • Monitor database logs — the business code is complex, or changing it is too costly.

To tie this back to the real case: in our actual business scenario, although we were familiar with the business code, the business required a fast response every time a work order was modified, so we ultimately chose to modify the business code and build the query data asynchronously.

(2) How to implement query separation?

Of the three trigger logics above, the first — building query data synchronously — is relatively straightforward, so I won't elaborate on it here. The third, monitoring database logs, will be explained in detail in article 13. This section therefore focuses on the second.

Regarding the second trigger logic — modifying the business code to build query data asynchronously — the most basic implementation is to spawn a separate thread to build the query data. However, this approach runs into the following problems:

  • With too many write operations, threads pile up and eventually exhaust the JVM.
  • If the thread building the query data fails, how do we retry automatically?
  • With many threads running concurrently, many concurrency scenarios must be handled.

Faced with these three problems, what should we do? Using MQ to manage these threads solves them.

The idea is: every time a main-data write request is handled, a notification is sent to MQ; on receiving it, MQ wakes a thread to update the query data. The schematic is as follows:

[Figure: each write sends a notification to MQ; an MQ consumer thread then updates the query data]

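A rough sketch of this flow, using the standard-library `queue.Queue` as a stand-in for the MQ topic (all store and field names are made up for illustration). The key point is that the number of consumer threads, not the write rate, caps the load on the query-data update:

```python
import queue
import threading

signal_queue = queue.Queue()            # stands in for the MQ topic
main_db = {1: {"status": "reviewed"}}   # hypothetical main store
query_db = {}
lock = threading.Lock()

def on_write(order_id):
    """Called after every main-data write: publish a simple signal."""
    signal_queue.put(order_id)

def consumer():
    """One MQ consumer; the consumer count caps the update concurrency."""
    while True:
        order_id = signal_queue.get()
        if order_id is None:            # shutdown sentinel
            break
        with lock:
            query_db[order_id] = dict(main_db[order_id])
        signal_queue.task_done()

workers = [threading.Thread(target=consumer) for _ in range(2)]
for w in workers:
    w.start()

on_write(1)
signal_queue.join()                     # wait until the copy is done
for _ in workers:                       # stop the consumers
    signal_queue.put(None)
for w in workers:
    w.join()

print(query_db)  # {1: {'status': 'reviewed'}}
```

A real MQ additionally gives persistence, redelivery, and cross-process decoupling, which an in-process queue cannot.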
With the MQ approach understood, we still need to consider the following five questions.

Question 1: How to choose MQ?

If the company already uses an MQ, the selection problem does not exist — after all, the technical department will not maintain two MQ middlewares at the same time. If the company has never used MQ, selection needs thought.

Here I'll share two selection principles that I hope will help.

(1) Gather everyone in the technology center who can make technical decisions and vote on the choice.

(2) Whichever MQ we choose, we can ultimately achieve the desired functionality; the real difference lies in ease of use and how much business code must be written. So evaluate from the perspective of usability and code workload.

Question 2: What if MQ goes down?

If MQ goes down, we only need to ensure that the main flow proceeds normally and that the data is processed correctly once MQ recovers. The plan has three steps.

  • On every write, set a flag on the main data: NeedUpdateQueryData=true. The message sent to MQ is then extremely simple — just a signal that data needs updating, carrying no record IDs.

  • On receiving the signal, the MQ consumer first batch-queries the main data that needs updating, then batch-updates the query data, and finally sets NeedUpdateQueryData back to false on the main data it processed.

  • Of course, multiple consumers may act at the same time, which raises concurrency issues. This problem is similar to the concurrency handling in the hot/cold separation article, which you can refer back to.

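The flag-based scheme above can be sketched as follows (field names such as `need_update_query_data` are illustrative). Because writes only set a flag, they succeed even while MQ is down; once MQ recovers, a single content-free signal is enough for a consumer to batch-sync everything still flagged:

```python
main_db = {
    101: {"status": "reviewed", "need_update_query_data": True},
    102: {"status": "closed",   "need_update_query_data": True},
}
query_db = {}

def consume_signal():
    """Batch: find flagged rows, copy them, then clear the flags."""
    flagged = [oid for oid, row in main_db.items()
               if row["need_update_query_data"]]
    for oid in flagged:
        query_db[oid] = {"status": main_db[oid]["status"]}
        main_db[oid]["need_update_query_data"] = False
    return flagged

print(consume_signal())  # [101, 102]
```

Note the flag is cleared only after the query copy is written, which is what makes the retry behavior in Question 3 work.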
Question 3: What should I do if the thread that updates the query data fails?

If the update thread fails, the NeedUpdateQueryData flag is not cleared, so a later consumer will pick up the flagged data again. If it keeps failing, we can add an attempt counter to the main data — incremented on every attempt and cleared on success — so we can monitor records whose attempt count grows too high.

Question 4: Idempotent message consumption

In programming, an idempotent operation is one that has the same effect whether executed once or many times.

For example, after order A in the main data is updated, we insert A into the query data; but then the system hits a problem, mistakenly believes the query data was not updated, and inserts/updates order A again.

Idempotence here means that no matter how many times the update-query-data logic runs, the result is the one we want. So when handling consumer-side concurrency, we must make the query-data update idempotent.

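A minimal sketch of what idempotence means for this consumer: instead of blindly inserting, it upserts keyed by order ID, so a redelivered message leaves the query store unchanged (names are illustrative, not from the article):

```python
query_db = {}

def sync_order(order_id, row):
    """Keyed upsert: safe to execute any number of times."""
    query_db[order_id] = dict(row)

sync_order(7, {"status": "reviewed"})
sync_order(7, {"status": "reviewed"})   # redelivered message, same result
print(len(query_db))  # 1
```

A naive `INSERT`-style append would instead produce a duplicate row on redelivery, which is exactly the failure described above.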
Question 5: Message timing

For example, order A is updated once and becomes A1, and thread A moves A1 into the query data. A moment later, order A is updated again and becomes A2, and thread B starts moving A2 into the query data.

The timing issue is this: if thread A starts earlier than thread B but finishes moving its data later, the query data may end up as the stale A1. As shown below (the number before each action indicates the actual order of execution):

[Figure: timing issue — thread A starts before thread B but finishes after it, leaving stale A1 in the query data]

The solution: update a last_update_time field on the main data with every write. After a thread finishes updating the query data, it checks whether order A's current last_update_time still matches the timestamp it read at the start, and whether NeedUpdateQueryData is false. If the timestamp has changed while the flag is false, the thread sets NeedUpdateQueryData back to true, triggering another move that fixes the stale copy.

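The timestamp check can be sketched like this (single-threaded for determinism; the "threads" are simulated by the order of calls, and all field names are illustrative). The stale A1 write still lands, but the check detects it and schedules a corrective re-sync:

```python
import itertools

clock = itertools.count(1)   # monotonically increasing "timestamps"

main_db = {"A": {"payload": "A1",
                 "last_update_time": next(clock),
                 "need_update_query_data": True}}
query_db = {}

def move_to_query(snapshot):
    """One worker's move, based on the snapshot it read earlier."""
    query_db["A"] = snapshot["payload"]
    row = main_db["A"]
    # Stale-write check: the row changed while we were copying,
    # and no other sync is pending -- flag it for another pass.
    if (row["last_update_time"] != snapshot["last_update_time"]
            and not row["need_update_query_data"]):
        row["need_update_query_data"] = True

snap_a = dict(main_db["A"])                       # thread A reads A1
main_db["A"].update(payload="A2", last_update_time=next(clock))
snap_b = dict(main_db["A"])                       # thread B reads A2

move_to_query(snap_b)                             # B finishes first
main_db["A"]["need_update_query_data"] = False    # B clears the flag
move_to_query(snap_a)                             # stale A1 lands last

print(query_db["A"])                              # A1 -- stale for now
print(main_db["A"]["need_update_query_data"])     # True: re-sync queued
```

The query data is briefly wrong, but the re-raised flag guarantees a later consumer converges it back to A2.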
Seeing this, you may be wondering: MQ here is just a signaling tool — couldn't we do without it? Not quite. MQ buys us more than that; if you don't believe me, read on.

  • Service decoupling: the main business logic does not depend on the update-query-data service.
  • Controlling the concurrency of the update-query-data service: if we called the service directly, fast writes against slow query-data updates would overload it whenever write concurrency spikes. Triggering updates through messages lets us control the load by controlling the number of consumer threads.

(3) How to store query data?

What technology should we use to store the query data? Currently the market mostly uses Elasticsearch for search over large data volumes, though technologies such as MongoDB and HBase are also options. This requires us to know the characteristics of each technology and then make a selection.

On technology selection, I believe we often cannot consider only the functional requirements; we also need to consider the organizational structure. The middleware the team knows best carries the lowest cost, and that should take priority.

(4) How to use query data?

Since ES provides its own API, the query-side business code can call the ES API directly when reading query data.

However, this approach has a problem: what if a user queries before the query data has been updated and sees inconsistent data? Two solutions are shared here.

  1. Block queries until the query data is up to date. (We have never used this design ourselves, but I have seen it in the wild.)
  2. Remind the user: "The data you are seeing may be up to 1 second old; if it looks wrong, try refreshing." Users generally accept this more readily.

Overall solution

Having discussed all four questions, let's look at the overall query separation solution, shown in the figure below:

[Figure: overall query separation solution — trigger, implementation, storage, and usage combined]

To sum up, the query separation architecture in this article breaks into four parts: how to trigger query separation, how to implement it, how to store the query data, and how to use the query data.

Historical data migration

After the new architecture goes live, how do we bring the old data into it? This is something we must consider in real business.

In this solution, we simply set the flag NeedUpdateQueryData=true on all historical data, and the program handles the rest automatically.

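The migration step can be sketched in a couple of lines (a real migration would be a batched SQL `UPDATE`; this in-memory version with illustrative names just shows the idea). Once every legacy row carries the flag, the ordinary consumer from Question 2 syncs them as if they were fresh writes, so no separate migration pipeline is needed:

```python
main_db = {
    1: {"status": "closed"},          # legacy rows, no flag yet
    2: {"status": "in litigation"},
}

def backfill_flags(db):
    for row in db.values():
        row["need_update_query_data"] = True  # picked up by normal sync

backfill_flags(main_db)
print(sorted(main_db))  # [1, 2]
```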
Shortcomings of the query separation solution

Although query separation solves some problems, we must also be aware of its shortcomings.

Shortcoming 1: when using Elasticsearch to store the query data, what should we watch out for? (Not expanded in this article.)

Shortcoming 2: as the main data keeps growing, write operations remain slow, and problems will resurface.

Shortcoming 3: what if the main data and the query data diverge, but the business logic requires them to be consistent?

The next article will discuss what to watch out for when using Elasticsearch as the query-data store — a question you will meet in interviews and in real work alike. Using a technology is not hard; what's hard is knowing what problems you'll encounter when using it, and how to solve them.



Origin blog.51cto.com/11996285/2644385