Practical Insights | The Practice of Database Resource Scheduling

Introduction: Exclusive, autonomous, controllable, and most cost-effective on the cloud

Author: Chen still recruit (Winner)

 

 

1. Questions to consider when self-building databases

 

Questions to consider when building a database:

First, ensure database high availability.

Second, improve database resource utilization.

Third, resolve resource bottlenecks while improving utilization.

Fourth, O&M happiness. Running databases day to day should not wear the operations team down.

Fifth, save costs for the company. As the business grows, the demands for cost savings grow with it.

 

Summary: Managing databases well is a basic job requirement for a DBA. The work is like invisible makeup: we put in the effort, but when everything runs smoothly nobody notices it. Keeping the databases problem-free and saving the company money year after year is, in the eyes of management, simply the DBA's baseline skill.

 

However, many factors affect the normal operation of a database, and when something goes wrong the DBA bears the pressure. Most application bottlenecks are solved by adding machines, but sometimes adding machines cannot solve a database problem. There is also an imbalance between the number of DBAs and the number of R&D engineers: one DBA typically supports 100 to 200 engineers. A balance has to be struck between supporting the business and handling operations; the more time spent supporting the business, the less is left for operations work, and the two constantly constrain each other.

 

There are many issues to consider when self-building databases. For a long time we kept facing trade-off decisions, for example whether to defer heavy processing to the night in order to keep the business experience good during the day.

 

 

 

2. The deployment history of Alibaba's internal database

 

Looking back at the deployment history of Alibaba's internal databases: ten years ago the MySQL version in use was relatively old and had not yet been widely adopted in large-scale systems.

 

1) The first deployment was very simple, as shown in the figure: all the primary instances were on one machine, and all the standby instances were on another.

 

1.png

 

In this period, with the company's business developing rapidly, the investment in back-end operations staff and IT capital was relatively large. From the DBA team's perspective, stability overrode everything, so this deployment method was chosen: for example, three instances were deployed on one machine and all three served traffic at the same time, as long as the machine had enough resources.

 

This worked fine in normal times, but when business pressure suddenly spiked on Double 11, a connection problem appeared: the number of connections exploded, and once the database could not keep up, response times became very long.

 

For example, consider a database instance serving 200 application servers, each of which normally holds only 5 connections to the database, so roughly 1,000 connections in total. In the Double 11 high-pressure scenario, the application tier is scaled from 200 to 1,000 servers, and to absorb the pressure the pool size per server is raised to 20-40 connections, so the database easily ends up with well over 10,000 connections.
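To make the arithmetic concrete, here is a minimal Python sketch of the connection-count calculation; the numbers are the ones from the example above, and the helper function is ours.

```python
# Connection arithmetic from the example above (illustrative helper, not a real API).

def total_connections(app_servers, pool_size_per_server):
    """Connections one database instance must hold if every application server
    keeps a fixed-size connection pool open to it."""
    return app_servers * pool_size_per_server

print(total_connections(200, 5))     # normal day: 1,000 connections
print(total_connections(1000, 20))   # Double 11, low end: 20,000 connections
print(total_connections(1000, 40))   # Double 11, high end: 40,000 connections
```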

 

The situation we faced was that nothing about the database itself looked abnormal, yet application response time was extremely long. On MySQL 5.1, the number of connections was a serious problem whenever business pressure rose.

 

The second problem was host failure. If a host went down, all three instances on it had to be switched to a new machine, and a hardware switch had a fairly large impact: those three instances might carry a third of the company's business, so a third of the business would be affected. To reduce that impact, the standby database had to be able to take 100% of the load when the primary host went down, which meant the standby's configuration had to match the primary's, so the standby cost as much as the primary.

 

2) The second deployment

Because the business was growing so fast that hardware investment was rising exponentially, the company asked the DBAs to optimize cost. As shown in the figure below, primaries and standbys were cross-deployed: the primary instances used more resources on a host, while the standby instances, used only for replication and backup, used fewer.

 

2.png

 

Under this deployment, the first problem we faced was partial overselling, which amounts to betting on luck. For example, a host with 64 physical cores might have instances allocated as if it had 96 cores. The DBA has oversold capacity and is gambling that nothing goes wrong: if the host on the left fails and its primaries all switch over to the host on the right, the surviving host does not actually have enough physical cores to run everything, since it only has 64. That was the first problem.
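The Python sketch below illustrates this bet with assumed instance sizes (16 cores per primary, 8 per standby, values invented for illustration): the allocation looks tolerable on paper while the standbys are idle, but a failover promotes them and the host is suddenly far beyond its physical capacity.

```python
# Illustrative numbers for the "partial oversell" above: a 64-core host carrying
# 4 primary instances sized at 16 cores and 4 standby instances sized at 8 cores
# looks like 96 allocated cores. The instance sizes are assumptions.

PHYSICAL_CORES = 64

def allocated_cores(primaries, standbys, primary_cores=16, standby_cores=8):
    return primaries * primary_cores + standbys * standby_cores

on_paper = allocated_cores(4, 4)
print(on_paper, on_paper > PHYSICAL_CORES)              # 96 True: oversold, but standbys are idle

# The peer host fails and the 4 standbys here are promoted to full primaries:
after_failover = allocated_cores(8, 0)
print(after_failover, after_failover > PHYSICAL_CORES)  # 128 True: real CPU contention
```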

 

The second problem was read/write splitting. To raise resource utilization with primaries and standbys cross-deployed, the natural idea is to use the standbys to split reads from writes and route some read requests to them. Then the host on the left carries traffic and the host on the right carries traffic too, so utilization is higher. Of course, this is also betting on luck: in large promotion scenarios automatic primary/standby switchover is still required, and we were trusting that machines would not fail during peak hours. The business was also spread across many machines, so any single failure had a relatively small impact; that was how we coped at the time.

 

3) The third deployment

We began building clusters, taking both cost and stability into account, as shown in the following figure:

 

3.png

 

The first improvement was to spread instances fully across different machines and introduce an internal system, still in use today, called Yishan ("moving mountains") to perform resource scheduling. It monitors the resource utilization of every machine, fires when its trigger conditions are met, and reschedules instances across the whole cluster, keeping overall instance utilization in a relatively balanced state.
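As a rough illustration of this kind of utilization-triggered scheduling (not the actual Yishan implementation), here is a Python sketch; the watermark, data shapes, and move-selection logic are all assumptions.

```python
# Minimal sketch of a utilization-triggered rebalance, in the spirit of the
# scheduling described above. Thresholds and the pick/move logic are illustrative.

HIGH_WATERMARK = 0.80   # a host above this total CPU utilization sheds instances

def rebalance(hosts):
    """hosts maps host name -> list of per-instance CPU utilization,
    expressed as fractions of the whole host. Returns (source, target) moves."""
    moves = []
    for src, insts in hosts.items():
        while sum(insts) > HIGH_WATERMARK and insts:
            inst = max(insts)                              # move the hottest instance
            dst = min(hosts, key=lambda h: sum(hosts[h]))  # onto the emptiest host
            if dst == src or sum(hosts[dst]) + inst > HIGH_WATERMARK:
                break                                      # nowhere safe to move it
            insts.remove(inst)
            hosts[dst].append(inst)
            moves.append((src, dst))
    return moves

print(rebalance({"host-a": [0.5, 0.3, 0.2], "host-b": [0.1], "host-c": [0.15]}))
# -> [('host-a', 'host-b')]
```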

 

The second improvement was warming up the standby database. If the standby carries no reads, it starts cold when a sudden primary/standby switchover happens; without a warm-up phase, heavy traffic can overwhelm the standby the moment it takes over. So we added a warm-up process for the standby.
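One simple way to warm a standby is sketched below in Python with the pymysql client: replay a sample of read-only queries against it before cutover so hot pages are already cached. The connection parameters, table names, and queries are placeholders, and this is only an illustration of the idea rather than the internal warm-up mechanism.

```python
# Replay read-only statements against the standby before switchover so the hot
# pages are already in its buffer pool when traffic arrives. All connection
# details and queries below are placeholders.
import pymysql

WARMUP_QUERIES = [
    "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY",
    "SELECT * FROM users ORDER BY id DESC LIMIT 1000",
]

def warm_standby(host, port=3306):
    conn = pymysql.connect(host=host, port=port, user="warmup",
                           password="***", database="app", read_timeout=30)
    try:
        with conn.cursor() as cur:
            for sql in WARMUP_QUERIES:
                cur.execute(sql)
                cur.fetchall()   # actually pull the rows so the pages get read
    finally:
        conn.close()

# warm_standby("standby-host.example.internal")
```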

 

The third improvement was read/write splitting, so that read traffic is also taken into account when scheduling.

 

4) The fourth deployment

Important systems and non-critical systems are cross-deployed, as shown in the following figure:

 

4.png

 

In the first three modes, core systems and non-core systems were split into two clusters: the core cluster was deployed with a bias toward stability, and the non-core cluster with a bias toward cost optimization. But a situation arises: we know which system is core and which is not, yet a non-core system can suddenly become core, and a system treated as non-core may then need stronger resource guarantees. To handle this, we added the ability to lock resources for systems that become important, preventing contention; locked resources cannot be preempted.

 

Non-critical systems: large-scale mixed deployment, elastic resources at any time, and timely rebalancing via Yishan. For example, during large activities with increased traffic, other instances can be moved off a machine according to CPU and IOPS utilization, keeping overall resource utilization balanced.

 

 

 

3. Experience sharing

 

Experience 1: Primary/standby cross deployment, a cost-reduction scheme based on pseudo-overprovisioning

 

As shown in the figure below, a host with 64 physical cores has instances allocated as if it had 96 cores. Because the primary instances are staggered across hosts, actual utilization does not reach full load.

 

5.png

 

Benefits:

  • The overall efficiency of CPU resource utilization reaches 70%.
  • Connection resources are effectively used.
  • Lower cost.

 

Key points:

  • The HA mechanism should be flexible about when to switch automatically and when to switch manually. For example, handle primary/standby switching manually at peak times to cope with high-traffic requests.
  • Monitor host problems and spread out core resources. For example, for Double 11, back-end resource guarantees are spread across hosts. Do not let a host run at full capacity: if a fully loaded host fails, an avalanche follows, because the standby it switches to will be overwhelmed as well. That is one lesson we learned.

 

Experience 2: Application grading to differentiate guarantee granularity

 

Applications are graded, and different grades get different guarantee granularity, such as general business versus important business. General business can use cross deployment to obtain higher resource utilization; important business leans toward the stability plan and must have an external SLA guarantee. Because many businesses depend on small and medium services, application grading improves overall utilization efficiency as well as availability. As shown below:

 

6.png

 

Two database resource models for different levels of business

 

The two database resource models for different business levels are implemented through three stages of isolation:

  • MySQL really does eat memory: give it 64 GB and it will basically use all 64 GB, so memory cannot be oversold. We therefore start with memory isolation, so that instances on the same machine do not compete for memory.
  • If business traffic in a non-critical cluster suddenly rises and a non-critical system becomes important, bind its CPUs to achieve CPU isolation so that it is not contended with (see the sketch after the figure below).
  • If its importance rises to the level of the company's core business, it may be split out onto its own hardware: physical machine isolation, the third stage.

 

7.png
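As a sketch of the CPU-binding stage mentioned in the list above, the snippet below pins a database process to a fixed set of cores using Linux CPU affinity. The pid and core range are hypothetical, and a production system would more likely use cgroups/cpusets, but the isolation idea is the same.

```python
# Sketch of the "bind CPUs" isolation stage using Linux CPU affinity.
import os

def pin_instance_to_cores(pid, cores):
    """Restrict a running database process to a fixed set of CPU cores so a
    suddenly hot tenant cannot steal CPU from its neighbours (Linux only)."""
    os.sched_setaffinity(pid, cores)
    print(f"pid {pid} limited to cores {sorted(os.sched_getaffinity(pid))}")

# Hypothetical usage: the instance that just became "important" gets cores 0-15,
# the other instances on the host keep the remaining cores.
# pin_instance_to_cores(mysqld_pid, set(range(16)))
```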

 

Experience 3: Two abstract resource deployment schemes

 

Whether for databases or applications, at the lowest level of resource scheduling there are only two allocation strategies: balanced allocation and compact allocation.

 

Balanced allocation, as shown in the figure below: database instances are spread out, and new instances are placed on the emptier hosts first. Host utilization stays relatively even and overall performance is relatively stable.

 

8.png

 

Compact allocation, as shown in the figure below: database instances are stacked, and new instances are placed on the fuller hosts first, as long as they fit. Host utilization is higher, performance differs from host to host, and the cost is better.

 

9.png
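The two strategies map naturally onto classic bin-packing heuristics. The Python sketch below is illustrative only; host capacities and instance sizes are made-up numbers.

```python
def balanced_place(hosts, need, capacity):
    """Balanced allocation: put the new instance on the emptiest host."""
    host = min(hosts, key=hosts.get)
    if hosts[host] + need > capacity:
        raise RuntimeError("no host has room")
    hosts[host] += need
    return host

def compact_place(hosts, need, capacity):
    """Compact allocation: keep filling the fullest host that still fits."""
    candidates = [h for h in hosts if hosts[h] + need <= capacity]
    if not candidates:
        raise RuntimeError("no host has room")
    host = max(candidates, key=hosts.get)
    hosts[host] += need
    return host

hosts = {"h1": 40, "h2": 10, "h3": 0}        # cores already allocated per 64-core host
print(balanced_place(dict(hosts), 16, 64))   # -> h3: spread the load out
print(compact_place(dict(hosts), 16, 64))    # -> h1: pack hosts full first
```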

 

Application scenarios for switching between compact and balanced allocation

 

A typical scenario: start with compact allocation, which emphasizes cost. During a big promotion, machines are added and the cluster shifts to balanced allocation, giving balanced, cluster-wide scheduling of resources. After the promotion, it returns to compact allocation and the freed machines are used wherever they are needed. This shifting of resources can be automated or done manually. In normal times compact allocation keeps utilization high and cost low; during big promotions balanced allocation gives higher stability. As shown below:

 

10.png

 

Experience 4: Resource elasticity to eliminate business traffic assessment anxiety

 

Business traffic assessment anxiety:

  • Evaluation process: business indicators → business QPS/TPS → full-link stress test → database preparation.
  • Standard launch process: resource reporting in advance, detailed deployment monitoring, SQL review, and late-night on-duty observation of performance.
  • Three possible outcomes: if the system cannot take the load, the evaluation was wrong; if it takes the load, that is merely expected; and if resource utilization ends up low, the evaluation was also wrong.

 

11.png

 

As shown in the figure above, in the mind of the business side every database is important, because that business is everything to its owner. In practice, however, the DBA distinguishes the core businesses, shown in red on the right, from the non-core ones, some of which will go offline after running for a while.

 

So, from the DBA's perspective, databases do differ in status. To handle this difference we built a hybrid cluster internally. In this mode the DBA does not need to evaluate traffic up front and can place the instance directly into the cluster. If the business suddenly grows, the instance can be scaled up quickly and then immediately have its CPUs locked so it cannot keep expanding, which guarantees its importance for a period of time. If the business keeps growing, it is migrated to a cluster set up specifically for it. This process keeps the database stable and available, and reduces problems caused by DBA evaluation errors.

 

 

 

4. Introduction to Cloud Database Exclusive Cluster Products

 

Cloud database dedicated cluster deployment architecture diagram

 

Building on Alibaba's internal practice, a dedicated database cluster product has been launched on the cloud. How does this dedicated cluster differ from the PaaS platform?

 

On the PaaS platform customers buy instances; with the dedicated cluster they buy hosts. Customers can build a complete dedicated cluster of their own and an exclusive set of technical management services on top of it. The benefits include the ability to oversell, self-management by the customer, primary/standby cross deployment to improve resource utilization, resource scheduling across machines, resource rebalancing, and rapid elastic scaling of resources.

 

To summarize the product: first, it is a dedicated data center on the cloud; second, it is autonomous and controllable; third, because the DBA understands the business best, it enables cost optimization.

 

12.png

 

Progress in capacity building of dedicated clusters for cloud databases

 

The dedicated cluster on the cloud provides the same functions as the PaaS service. Hybrid deployment is still under development, while the other capabilities are already supported, and the original PaaS capabilities of the database, including high availability and high reliability, are retained.

 

In addition, we have added new resource scheduling functions, including compact and balanced allocation, Yishan-style resource rebalancing, elastic scaling of resources, and so on. The cloud database dedicated cluster manages resources at the host level, rather than at the instance level as the PaaS platform does.

 

13.png


Origin blog.csdn.net/weixin_43970890/article/details/114987387