Database performance optimization: master-slave read-write separation

Article from: Alibaba's billion-level concurrent system design (2021 version)


In the last lesson, we used pooling to solve the problem of reusing database connections. Although the overall architecture of your vertical e-commerce system has not changed, the way it interacts with the database has: a connection pool now sits between your web application and the database, eliminating the cost of frequently creating connections. As the test in the previous lesson showed, this alone can improve performance by roughly 80%. The current architecture diagram is as follows:

At this point your database is still deployed on a single machine. According to benchmarks published by some cloud vendors, MySQL 5.7 on a 4-core 8 GB machine can sustain roughly 500 TPS and 10,000 QPS. Now the head of operations tells you that the company is preparing for the Double Eleven event and will keep investing in omni-channel promotion, which will inevitably cause a surge in query traffic. So in this lesson, let's look at how master-slave separation can absorb a growing volume of read requests.

Master-slave read-write separation

Most systems follow a read-heavy, write-light access pattern, and the gap between read and write volume can reach several orders of magnitude. This is easy to understand: the number of requests to refresh a Moments feed is far larger than the number of Moments posted, and the page views of a product on Taobao are far greater than its orders. So we first focus on how the database can withstand more query requests, and the first step is to distinguish read traffic from write traffic, because that makes it possible to scale the read side independently. This is what we call master-slave read-write separation.

It is essentially a traffic-splitting problem, much like road traffic control: on a four-lane road, three lanes are reserved so that officials can pass, leaving one lane for the rest of us; the priority traffic gets the capacity. The approach itself is commonplace, and even in large projects it remains an effective way to handle bursts of database read traffic.

In my current project, a sudden increase in front-end traffic once pushed the load on the slave databases too high. The DBAs' first move was to add another slave, so that read traffic was spread across more replicas and the load on each one dropped; only then did the developers consider what scheme to use to intercept that traffic above the database layer.

Two key technical points of master-slave read-write separation

In general, under master-slave read-write separation we copy the data of one database into one or more replicas on other database servers. The original database, called the master, is mainly responsible for writes; the copied databases, called slaves, are mainly responsible for serving queries. From this you can see the two key technical points of master-slave read-write separation:

  1. Copying the data, which we call master-slave replication;
  2. Hiding the change in database access that the separation introduces, so that developers can write code as if they were still using a single database.

Next, let's take a look at each.

1. Master-slave replication

Let me first take MySQL as an example to introduce master-slave replication.

MySQL's master-slave replication relies on the binlog: every change on the master is recorded, in binary form, in a log file on disk. Replication means shipping the binlog data from the master to the slaves, and this process is generally asynchronous, that is, operations on the master do not wait for binlog synchronization to complete.

The process works as follows: when a slave connects to the master, it creates an IO thread that requests the master's new binlog entries and writes whatever it receives into a local log file called the relay log; the master, for its part, creates a log dump thread to send the binlog to that slave. Meanwhile the slave also creates a SQL thread that reads the relay log and replays its contents locally, finally bringing the slave into consistency with the master.

This is the most common form of replication. In this scheme, the dedicated log dump thread makes shipping asynchronous, so it does not hold up the master's own update path; and the slave writes what it receives into a relay log first, rather than applying it directly to storage, because applying changes synchronously on receipt would be slow and would ultimately lengthen the lag between slave and master.
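To make the three-thread flow concrete, here is a minimal Python sketch of the pipeline described above: write, binlog, IO thread, relay log, SQL thread. It is a toy model of the flow, not of real MySQL internals; all class and method names are made up.

```python
from collections import deque

class Master:
    """Toy master: commits locally, then records each change in the binlog."""
    def __init__(self):
        self.data = {}
        self.binlog = []                  # ordered change events

    def write(self, key, value):
        self.data[key] = value            # the write returns immediately;
        self.binlog.append((key, value))  # shipping the binlog is asynchronous

class Slave:
    """Toy slave: an IO thread fills the relay log, a SQL thread replays it."""
    def __init__(self):
        self.data = {}
        self.relay_log = deque()
        self.pos = 0                      # replication position in the binlog

    def io_thread_fetch(self, master):
        # Append new binlog events to the relay log only -- a cheap write,
        # deferring the real (slow) table updates to the SQL thread.
        for event in master.binlog[self.pos:]:
            self.relay_log.append(event)
        self.pos = len(master.binlog)

    def sql_thread_replay(self):
        # Replay relay-log events against local storage.
        while self.relay_log:
            key, value = self.relay_log.popleft()
            self.data[key] = value

master, slave = Master(), Slave()
master.write("order:1", "paid")   # master does not wait for any slave
slave.io_thread_fetch(master)     # IO thread: master binlog -> relay log
slave.sql_thread_replay()         # SQL thread: relay log -> slave tables
print(slave.data["order:1"])      # prints: paid
```

The gap between `write` returning and `sql_thread_replay` finishing is exactly the replication lag discussed later in this lesson.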

You will notice that, for performance reasons, a write on the master does not wait for master-slave synchronization before returning. In extreme cases this can lose data: for example, if the binlog on the master has not yet been flushed to disk when the disk fails or the machine loses power, the binlog entries are lost and master and slaves end up inconsistent. The probability of this is very low, however, and for Internet projects it is tolerable.

With replication in place, we write only to the master and read only from the slaves, so even if a write request locks a table or a record, it does not block read requests. And when read traffic is heavy, we can deploy multiple slaves to share it. This is the so-called "one master, many slaves" deployment, and it is how your vertical e-commerce project can absorb higher concurrent read traffic. In addition, a slave can serve as a standby, protecting against data loss if the master fails.
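The "write to the master, read from the slaves" routing rule can be sketched in a few lines of Python. The DSN strings and the `route()` interface are illustrative only, not a real driver API:

```python
import itertools

class ReadWriteSplitter:
    """Sketch: send writes to the master, spread reads across the slaves."""
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = itertools.cycle(slaves)   # round-robin over slaves

    def route(self, sql):
        # Look at the first SQL keyword to classify the statement.
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master        # all writes go to the single master
        return next(self.slaves)      # reads are shared among the slaves

db = ReadWriteSplitter("master:3306", ["slave1:3306", "slave2:3306"])
print(db.route("INSERT INTO orders VALUES (1, 'paid')"))  # master:3306
print(db.route("SELECT * FROM orders"))                   # slave1:3306
print(db.route("SELECT * FROM orders"))                   # slave2:3306
```

This is essentially what the middleware discussed later in this lesson does for you, with much more care around transactions and failover.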

You might then ask: can I handle arbitrarily high concurrency just by adding slaves without limit? Not really. As the number of slaves grows, so does the number of IO threads connecting to the master, and the master must create a matching number of log dump threads to serve replication; the master's resource consumption rises, and it is also constrained by its network bandwidth. In practice, one master generally supports at most 3-5 slaves.

Of course, master-slave replication also has shortcomings. Besides the added deployment complexity, there is replication lag between master and slave, and this lag can sometimes affect the business. An example will make this clear.

When a Weibo post is published, some operations are synchronous, such as updating the database, while others are asynchronous, such as forwarding the post's information to the audit system. So after updating the master database we write the post's ID into a message queue, and a queue processor fetches the post from the database by that ID and sends it to the audit system. If there is master-slave lag at that moment, the post cannot yet be read from the slave, and the whole flow fails.

There are many ways to solve this problem, but the core idea is to avoid querying the slave for this information. For the example above, I have three solutions:

  • The first is data redundancy: when publishing to the message queue, send not just the post's ID but all the post data the queue processor needs, so it never has to re-query the database.
  • The second is to use a cache: while writing the database, also write the post into a Memcached cache, so that the queue processor checks the cache first and can still get consistent data.
  • The last is to query the master: have the queue processor read from the master instead of the slave. Use this with caution, and only when you are sure the query volume is small and within what the master can tolerate; otherwise it puts considerable pressure on the master.
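The first option can be sketched concretely: instead of an ID-only message, the producer publishes a message carrying the full payload. The queue here is just a list, and all field names are made up for illustration:

```python
import json

def publish_post_event(queue, post):
    # Option 1, data redundancy: the message carries everything the queue
    # processor needs, so it never has to read a possibly-lagging slave.
    queue.append(json.dumps({
        "event": "post.created",
        "post_id": post["id"],
        # Redundant fields -- a bigger message, but no DB round trip:
        "author_id": post["author_id"],
        "content": post["content"],
    }))

queue = []
publish_post_event(queue, {"id": 42, "author_id": 7, "content": "hello"})
msg = json.loads(queue[0])
print(msg["content"])   # prints: hello -- no database query needed
```

The cost is visible in the code: every field added to the message makes it larger on the wire, which is exactly the bandwidth trade-off discussed below.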

I would prefer the first option, because it is simple enough, although it makes each message larger, increasing the bandwidth and latency of message delivery.

The cache option fits best when new data is being inserted. When data is being updated, updating the cache first can cause inconsistency. For example, two threads update the same record concurrently: thread A sets the cached value to 1, then thread B sets the cached value to 2, then B writes 2 to the database, and finally A writes 1 to the database. Now the database holds 1 while the cache holds 2, and the two are inconsistent.
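The race above is easiest to see replayed step by step. This snippet simply executes that exact interleaving in order, with dicts standing in for the cache and the database:

```python
cache, db = {}, {}

# Replay the interleaving from the text deterministically:
cache["v"] = 1   # thread A updates the cache to 1
cache["v"] = 2   # thread B then updates the cache to 2
db["v"] = 2      # thread B updates the database to 2
db["v"] = 1      # thread A, running late, updates the database to 1

print(cache["v"], db["v"])   # prints: 2 1 -- cache and database disagree
```

Last write wins in each store separately, and because the two threads' writes land in a different order on each store, the stores diverge.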

Finally, I would avoid the third option unless there is no alternative. It requires exposing an interface that queries the master, and in team development you can hardly guarantee that other engineers will not abuse it; once the master takes on a large volume of reads and goes down, the impact on the whole system is severe.

So among these three options, you have to choose based on the realities of your project.

In addition, replication lag is a problem that is easy to overlook when troubleshooting. Faced with the strange symptom of data missing from the database, we sometimes agonize over whether some code path deleted what was just written, only to find that the data becomes readable again after a short while. That is almost always replication lag at work. For this reason, a slave's lag behind the master is usually monitored and alerted on as a key database metric: normal lag is at the millisecond level, and once it reaches the second level, it should trigger an alert.
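A monitoring check along those lines might look like this sketch. The `Seconds_Behind_Master` column does come from MySQL's `SHOW SLAVE STATUS` output (where NULL means replication is not running); the threshold and the function itself are our own illustrative choices:

```python
ALERT_THRESHOLD_SECONDS = 1   # normal lag is milliseconds; alert at seconds

def check_replication_lag(status):
    """status mimics one row of SHOW SLAVE STATUS as a dict."""
    lag = status["Seconds_Behind_Master"]
    if lag is None:                       # NULL: IO/SQL thread has stopped
        return "critical: replication is not running"
    if lag >= ALERT_THRESHOLD_SECONDS:
        return f"alert: slave is {lag}s behind the master"
    return "ok"

print(check_replication_lag({"Seconds_Behind_Master": 0}))     # ok
print(check_replication_lag({"Seconds_Behind_Master": 5}))     # alert
print(check_replication_lag({"Seconds_Behind_Master": None}))  # critical
```

In production you would sample this on a schedule (or use heartbeat-table tools) rather than trust `Seconds_Behind_Master` alone, but the alerting shape is the same.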

2. How to access the database

We have used master-slave replication to copy data to multiple nodes and thereby separated database reads from writes. But the way we use the database has changed: where one database address used to suffice, we now need one master address plus several slave addresses, and must route write operations and queries differently. Combined with the sharding (splitting databases and tables) to be covered in the next lesson, the complexity grows further. To contain this implementation complexity, the industry has produced many database middleware solutions to the access problem, and they fall into two categories.

The first category, represented by Taobao's TDDL (Taobao Distributed Data Layer), runs as code embedded inside the application. You can think of it as a proxy over data sources: its configuration manages multiple data sources, each corresponding to one database, which may be the master or a slave. When a database request arrives, the middleware sends the SQL statement to the appropriate data source and returns the result. The advantage of this kind of middleware is that it is simple and easy to use, with no extra deployment cost, since it is embedded in and runs with the application; that makes it a good fit for small teams with limited operations capability. The drawback is the lack of multi-language support: besides TDDL, mainstream options such as NetEase's early DDB are all written in Java and cannot serve other languages, and version upgrades depend on every user updating, which is hard to drive.

The other category is the separately deployed proxy layer. It has many representatives: Alibaba's early open-source Cobar, Mycat (developed from Cobar), 360's open-source Atlas, Meituan's DBProxy (based on Atlas), and so on. This kind of middleware is deployed on its own servers; business code talks to it as if it were a single database, while internally it manages many data sources, rewriting SQL statements where necessary and dispatching them to the right one. Because it generally speaks the standard MySQL wire protocol, it supports multiple languages well, and being independently deployed it is easier to maintain and upgrade, which suits medium and large teams that have some operations capability. Its drawback is that every SQL statement crosses the network twice, from the application to the proxy layer and from the proxy layer to the data source, so there is some performance loss.

These middleware solutions are probably not unfamiliar to you, but I want to stress one thing: whichever you use, make sure you understand it in sufficient depth, because not being able to fix it quickly when something breaks is a tragedy. In one of my past projects, we had long used a home-grown component for sharding. After discovering that it occasionally created redundant database connections, the team decided to replace it with Sharding-JDBC. We thought it would be a simple component swap, but two problems surfaced after launch: first, because we were using it incorrectly, sharding occasionally failed to take effect and queries scanned all databases and tables; second, query latency occasionally spiked to the second level. Lacking a deep enough understanding of Sharding-JDBC, we could not resolve these problems quickly, and in the end we had to switch back to the old component, find the root causes, and only then switch over.

Course summary

In this lesson, I showed you how, as query volume grows, master-slave separation with a one-master-many-slaves deployment can absorb the increased database read traffic. Beyond mastering the mechanics of replication, you also need to understand the problems master-slave separation introduces and their solutions. The main points I want you to take away are:

  1. Master-slave read-write separation with a one-master-many-slaves deployment can handle sudden surges in database read traffic; it is a form of horizontal database scaling.
  2. After reads and writes are separated, replication lag becomes a key monitoring metric; it can make data unreadable immediately after it is written.
  3. Many industry solutions hide the access details introduced by master-slave separation, letting developers code against what looks like a single database. They include application-embedded solutions such as TDDL and Sharding-JDBC, and independently deployed proxy solutions such as Mycat.

In fact, we can generalize master-slave replication to the broader technique of replicating data between storage nodes, which provides redundancy for backup and improves horizontal scalability. When applying it, you will generally weigh two issues:

  1. The trade-off between master-slave consistency and write performance. If you require all slave nodes to acknowledge a write, write performance inevitably suffers; if you return success after writing only the master, a slave may fail to synchronize and drift out of consistency with the master. In Internet projects, we generally favor performance over strong data consistency.
  2. Replication lag: many puzzling "cannot read the data" problems trace back to it, so when you hit such a problem, check the replication-lag metrics first.

Many components we use apply this same technique. Redis achieves read-write separation through master-slave replication; Elasticsearch replicates index shards across multiple nodes; files written to HDFS are copied to multiple DataNodes. Different components have different requirements for replication consistency and lag and therefore adopt different schemes, but the design idea is universal. Understanding it lets you reason by analogy when learning other storage components.

Origin blog.csdn.net/sanmi8276/article/details/113093562