The Road to Stability for vivo Account Services - Platform Product Series 06

Author: Vivo Internet Platform Product R&D Team - Shi Jianhua, Sun Song

The account is a core foundational service, and stability is its lifeline. In this article, we share our experience and exploration in building stability into the account system.

1. Introduction

The vivo account is the pass that users need to enjoy services across the entire vivo ecosystem, and it is also the cornerstone on which the ecosystem's businesses are built. With the rapid growth of the company's business, the account system now serves 270 million online users, with an average daily call volume exceeding 10 billion. As a typical "three-high" system (high performance, high concurrency, and high availability), the account system's stability is particularly important. Ensuring stability requires weighing many factors comprehensively. This article shares our experience in building stability into the account server from three dimensions: application services, data architecture, and monitoring.

2. Application Service Governance

The book "The Way of Clean Architecture" summarizes the value of software into two dimensions: "behavior" and "architecture".

Behavioral value: making the machine operate in a specified way so that it creates or increases value for the users of the system.

Architectural value: keeping the software flexible at all times so that we can easily change the machine's working behavior.

Behavioral value describes the present; what users perceive most directly is ease of use and richness of features. Good behavioral value attracts users, which in turn brings positive returns to the service provider.

Architectural value describes the future: the internal structure, technical system, and stability of the service. Although this value is invisible to users, it determines whether the service can be sustained.

The purpose of application service governance is to maintain the system's "architectural value" and thereby sustain its "behavioral value". In this chapter we focus on two topics: "service splitting" and "relationship governance".

2.1 Service Splitting

Service splitting refers to breaking a service into multiple small, relatively independent microservices. It brings many benefits, including better scalability, maintainability, and stability. The following sections introduce the splitting scenarios we encountered while building the system.

2.1.1 Splitting driven by organizational structure adjustments

Conway's Law was proposed by Melvin Conway in 1967: "Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." In other words, a system's design essentially mirrors the organizational structure of the enterprise, and the relationships between the system's modules reflect the information flow and collaboration patterns between departments. This is illustrated in the figure below (Figure 1):

Figure 1 (Image source: WORK LIFE)

Organizational structure adjustments are a challenge that enterprises frequently face as they grow; the reasons are usually market demand, business changes, or collaboration efficiency. If service splitting does not follow up in time, problems such as poor cross-team collaboration and communication difficulties follow. In essence, these problems stem from differences in team division of labor and core goals.

Case introduction

vivo launched its game joint-operation business in the early days of its Internet services. Game joint operation (a co-publishing model) refers to game developers bringing their products to the vivo platform and operating them under a cooperation and revenue-sharing arrangement. In the beginning, the vivo Internet team was small, and account-related business belonged to the system account team. For the joint-operation business, we provided the capability of creating a corresponding sub-account (i.e., a game account) for each game, and the sub-account carried related information such as game characters.

With the rapid development of the game business, a dedicated game business department was established, whose core goal is to serve game users well. The goal of the system account, by contrast, is to provide our phone users with a simple and safe experience from the perspective of the entire vivo ecosystem. Soon after the organizational change, the two teams reached a consensus on the business boundary and completed the split of the corresponding services.

Figure 2 (Splitting off the game account)

2.1.2 Splitting based on stability considerations

Splitting caused by organizational structure adjustments is driven by external factors, and its scope and timing are relatively easy to determine. Splitting based on stability considerations is driven by internal factors, and must be done at the right time to avoid disrupting normal business iteration. In practice, our splitting strategy is mostly organized around the core processes.

(1) Splitting of core behaviors

Every business system has core processes that carry its core work. Take the account as an example: registration, login, and credential verification are undoubtedly its core processes. We separate the core processes into independent services mainly to achieve the following two goals:

Service isolation

Avoid interference between different processes. Take credential verification as an example: its verification logic is fixed and its architecture depends only on the distributed cache. Once it is coupled with other processes, it not only takes on more external dependency risks, but the modification and release of those other processes also affects the stability of credential verification.

Resource isolation

Service splitting enables isolation of server resources, which provides more flexibility for horizontal scaling. For example, resources for core process services can be kept appropriately redundant, and dynamic scaling strategies can be customized for them.

How to identify core behaviors?

Some core processes are obvious, such as registration and login in the account system, but others need to be identified and judged. Our practice is to judge along two dimensions: "business value" and "call frequency". For "business value" we select the processes associated with core business indicators, and "call frequency" corresponds to how many times a process is executed. Combining the two dimensions gives a four-quadrant matrix; the figure below is a schematic of the account business matrix (Figure 3). The core processes sit in the upper-right corner (high value, high call volume). One principle here is that processes located on the diagonal should be isolated from each other as much as possible.

Figure 3 (matrix diagram)

(2) Minimum element aggregation

Services should not be split as finely as possible. Splitting too finely leads to too many services, which increases system complexity and maintenance cost. To avoid over-splitting, we can analyze the business elements that each process depends on and aggregate processes appropriately. Take registration as an example: in the simplest case it only needs to be completed around the four account elements (username, password, email, and mobile phone number). The process of changing the mobile phone number depends on verifying the password or the original mobile phone number (two of the four elements). Therefore, we can combine registration and mobile phone number change into the same service to reduce maintenance costs.

Figure 4 (minimum element closed loop)

(3) Schematic representation of the overall split 

The early account main service included processes such as login, registration, credential verification, and user profile query/modification. To split this service, we first sort out the core processes. Following the idea illustrated in Figure 4 above, we first split out login, registration, credential verification, and user profile data. User profile data mainly consists of extended information such as nicknames and avatars, and does not include the four elements of the main account (username, password, email, and mobile phone number).

Among the three behaviors of login, registration, and credential verification, as the number of registered users grows, the frequency of login and credential verification far exceeds that of registration. We therefore performed a second split, placing login and credential verification in one service and registration in another. The structure after splitting is shown in the figure below (Figure 5).

Figure 5

(4) Changes in business value

Business value changes dynamically, so we need to adjust the service splitting structure in time as the business evolves. A practical case is the splitting of the real-name module out of the account information service. Early on, real-name information was only used in comment scenarios, so its value was not much different from information such as nicknames and avatars. However, with the deepening of the game business and national anti-addiction requirements, related services cannot be provided if a user has not completed real-name authentication. The importance of real-name information to the game business thus became equivalent to that of credential verification. We therefore split the real-name module into an independent service to better support business development and change.

2.1.3 Split implementation plan

Stability is the key when splitting services for a mature business: there must be no impact on the business, and users must not perceive the change. To reduce the difficulty of implementation, we adopt the approach of splitting the service first (Figure 6) and then splitting the data. When splitting the service, the following two practices can further reduce risk:

  • In the service splitting phase, only migrate the code; do not refactor it

  • Introduce grayscale capability and verify gradually with controllable traffic

Figure 6

It is worth emphasizing grayscale release again: use controllable traffic to verify the newly split service. Here are two grayscale implementation ideas (a minimal sketch of the first one follows the list):

  • Forward at the application layer. Specifically: apply for an intranet domain name for the new service, and intercept requests in the original service to forward them to it.

  • Distribute traffic at an earlier point in the architecture, for example by configuring traffic forwarding at the ingress gateway layer or the reverse proxy layer (such as Nginx).
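As a minimal sketch of the first idea (class names, the intranet domain, and the default percentage are hypothetical), the original service can decide per request whether to hand it to the newly split service and then forward with any HTTP client; keeping the percentage in a config center lets it be tuned without a release:

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Minimal grayscale routing decision used inside the original service.
 * The new service's intranet domain and the default percentage are assumptions;
 * in practice the percentage would come from a config center so it can be
 * adjusted without a release.
 */
public class GrayscaleRouter {

    private static final String NEW_SERVICE = "http://account-auth.internal.example.com";
    private static final String OLD_LOCAL   = "LOCAL"; // handle in the original code path

    /** Percentage of traffic (0-100) routed to the newly split service. */
    private volatile int grayPercent = 5;

    public void setGrayPercent(int grayPercent) {
        this.grayPercent = grayPercent;
    }

    /**
     * Route by user ID so that a given user consistently hits the same
     * implementation during verification; fall back to random sampling
     * when no user ID is available.
     */
    public String route(String userId) {
        int bucket = (userId != null)
                ? Math.floorMod(userId.hashCode(), 100)
                : ThreadLocalRandom.current().nextInt(100);
        return bucket < grayPercent ? NEW_SERVICE : OLD_LOCAL;
    }
}
```

Routing by a stable key (such as the user ID) keeps each user's experience consistent during the verification window, which makes comparing the old and new services much easier.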

2.2 Relationship Governance

Dependencies between services are critical to the service architecture. To keep dependencies clear, we adopt the following measures. First, dependencies between services should be layered: each service sits at a specific level, and dependencies should follow the layering to avoid cross-level dependencies. Second, dependencies should be one-way and must comply with the ADP (Acyclic Dependencies Principle), i.e., no dependency cycles.

2.2.1 The ADP principle

ADP (Acyclic Dependencies Principle) is the principle of having no dependency cycles. The dependencies marked with red lines in the figure below (Figure 7) all violate it. Such relationships defeat the goal of independent deployment. Imagine a scenario where services A and B depend on each other and a requirement needs to modify their interdependent interfaces at the same time: since each side must wait for the other to be deployed first, the release order becomes an endless loop.

Figure 7

2.2.2 Relationship processing

In a service architecture, relationships between services can be divided into weak and strong dependencies according to their strength. When service A depends on service B: if a failure of B does not affect A's business process, the dependency is called weak; conversely, if A cannot work properly when B fails, the dependency is called strong.

(1) Redundancy for strong dependencies

For strong dependencies, we adopt a redundancy strategy to ensure the stability of core service processes. In the account system, both "one-click login" and "real-name authentication" adopt this scheme. The premise is that multiple services providing the same capability can be found. The services themselves also need some adaptation work, as shown in the figure below (Figure 8): a traffic distribution module is added to monitor the quality of the dependent services and dynamically adjust the traffic distribution ratio.

Figure 8

In addition to dynamic traffic allocation, you can also choose a relatively simple primary/secondary scheme: depend on one of the services by default, and switch to the other when the primary is abnormal or its circuit breaker opens. This primary/secondary scheme improves availability to a certain extent and is simple and easy to implement.
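A minimal sketch of the primary/secondary idea, assuming two interchangeable providers behind hypothetical Supplier callbacks; a production version would add a circuit breaker and health statistics rather than a bare try/catch:

```java
import java.util.function.Supplier;

/**
 * Minimal primary/secondary wrapper for a strongly depended-on capability
 * (e.g. real-name verification backed by two providers). The provider
 * interfaces are hypothetical.
 */
public class RedundantCaller<T> {

    private final Supplier<T> primary;
    private final Supplier<T> secondary;

    public RedundantCaller(Supplier<T> primary, Supplier<T> secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    public T call() {
        try {
            return primary.get();       // normal path: the primary provider
        } catch (RuntimeException primaryFailure) {
            return secondary.get();     // fall back to the redundant provider
        }
    }
}

// Usage sketch (providerA/providerB are hypothetical clients):
// RedundantCaller<Boolean> realName = new RedundantCaller<>(
//         () -> providerA.verify(idNumber, name),
//         () -> providerB.verify(idNumber, name));
// boolean verified = realName.call();
```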

(2) Asynchronization for weak dependencies

A common asynchronization solution is to rely on an independent message component (Figure 9) and change the original synchronous call into message sending. This not only decouples the dependency but also increases system throughput. Looking back at the circular dependency mentioned under the ADP principle, it can also be decoupled and avoided through a message component.

Figure 9

One caveat: introducing a message component increases system complexity. Asynchrony is inherently more complex than synchrony, and issues such as message disorder, delay, and loss need to be considered. For these problems, you can try the following approach: instead of sending messages directly inside the service process, produce messages from the data generated by the process, as shown in the figure below (Figure 10). In the account system, this is used for service notifications after account registration and cancellation.

Figure 10

Kafka was chosen as the message component because it provides ordered messages (ordering is guaranteed within a partition). In this solution, the path from binlog collection to message pushing can be understood as a Data Transmission Service (DTS). vivo has a self-developed "Luban platform" that provides the DTS capability; readers can achieve the same effect with a similar open-source project such as Canal.
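To illustrate the ordering point with the plain Kafka producer API (the topic name and payload format are assumptions), keying each registration/cancellation event by account ID routes all events of the same account to one partition, where Kafka guarantees order:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AccountEventProducer {

    private final KafkaProducer<String, String> producer;

    public AccountEventProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");   // reduce the chance of losing events on broker failover
        this.producer = new KafkaProducer<>(props);
    }

    /** Publish a register/cancel event; the account ID is used as the partition key. */
    public void publish(String accountId, String eventJson) {
        producer.send(new ProducerRecord<>("account-events", accountId, eventJson));
    }
}
```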

3. Data Architecture Governance

3.1 Caching

In a high-concurrency architecture, caching is one of the most effective ways to improve system performance. Caches can be divided into two types: local caches and distributed caches. In the account system, we use a combination of the two to cope with different scenarios.

3.1.1 Local cache

A local cache stores data in the service's local memory. Its advantages are fast response times and independence from external factors such as cross-process communication. Its drawbacks are the limited memory available to the service and the difficulty of keeping multiple nodes consistent, so in the account system we only use it for relatively fixed data.
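For illustration only (the article does not prescribe a library), a local cache for relatively fixed data might look like the following sketch using Caffeine, with a bounded size and a short TTL to limit memory usage and cross-node staleness; the country-code lookup is a hypothetical example:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

public class CountryCodeCache {

    // Bounded size and a short TTL keep memory usage and cross-node staleness under control.
    private final LoadingCache<String, String> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build(this::loadFromDb);

    public String getDisplayName(String countryCode) {
        return cache.get(countryCode);
    }

    private String loadFromDb(String countryCode) {
        // Placeholder: load the relatively fixed data from the database.
        return "+86".equals(countryCode) ? "China" : "Unknown";
    }
}
```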

3.1.2 Distributed cache

A distributed cache avoids the service memory limitation and provides better read/write performance than the database. However, it also introduces extra problems, the most prominent of which is data consistency.

(1) Data consistency

There are many options for handling data consistency. Based on the account's business scenarios, the solution we chose is the Cache Aside Pattern. Its logic is as follows:

  • Data query: read from the cache; return directly on a hit; on a miss, read from the database and write the result into the cache.

  • Data update: first update the database, then delete the cache entry directly.

Figure 11 (Schematic diagram of Cache Aside Pattern)

The key point is that when data is updated we delete the cache entry instead of refreshing it. This avoids the inconsistency that concurrent modifications could otherwise cause. Of course, the Cache Aside Pattern does not eliminate consistency problems entirely.

There are mainly two scenarios:

The first scenario is a failure when deleting the cache entry. This can be handled by retrying the deletion, or simply mitigated by setting a reasonable expiration time so the impact is bounded.

The second scenario is a theoretical possibility with a very low probability.

A read misses the cache and fetches data from the database. At that moment a write occurs: it updates the database and then deletes the cache, after which the earlier read writes the old data back into the cache. We say this exists only in theory because the conditions are very harsh: the cache entry must expire exactly when it is read, a write must be happening concurrently, the read must hit the database before the write does, and yet its cache update must land after the write's cache deletion. Since database writes are usually much slower than reads, this combination remains only a theoretical possibility.

Weighing the above, we chose the Cache Aside Pattern to reduce the probability of concurrent dirty data as much as possible, rather than guaranteeing strong consistency through more complex protocols such as 2PC or Paxos.
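A minimal sketch of the read and update paths described above, assuming Spring Data Redis's StringRedisTemplate and a hypothetical DAO; serialization details and error handling are omitted:

```java
import org.springframework.data.redis.core.StringRedisTemplate;
import java.util.concurrent.TimeUnit;

public class UserProfileCache {

    private final StringRedisTemplate redis;
    private final UserProfileDao dao;   // hypothetical DAO

    public UserProfileCache(StringRedisTemplate redis, UserProfileDao dao) {
        this.redis = redis;
        this.dao = dao;
    }

    /** Query path: cache hit returns directly; a miss loads from the DB and back-fills the cache. */
    public String getProfileJson(long userId) {
        String key = "account:profile:" + userId;
        String cached = redis.opsForValue().get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = dao.queryProfileJson(userId);
        if (fromDb != null) {
            // Expiration bounds the impact of any residual inconsistency.
            redis.opsForValue().set(key, fromDb, 30, TimeUnit.MINUTES);
        }
        return fromDb;
    }

    /** Update path: write the database first, then delete (not refresh) the cache entry. */
    public void updateProfile(long userId, String profileJson) {
        dao.updateProfileJson(userId, profileJson);
        redis.delete("account:profile:" + userId);
    }

    public interface UserProfileDao {
        String queryProfileJson(long userId);
        void updateProfileJson(long userId, String profileJson);
    }
}
```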

(2) Batch read operation optimization

Although caching significantly improves system performance, it cannot solve every performance problem. The account service provides a user profile query capability that returns a user's nickname, avatar, signature, and other information by user ID. To improve interface performance, we cache this information in Redis. However, with the rapid growth of users and call volume, plus a new requirement for batch queries, both Redis capacity and interface performance came under pressure.

In order to solve these problems, we have adopted a series of targeted optimization measures:

First, we compress cached data before writing it to Redis. This reduces the size of the cached data and thus the cost of network transmission and storage.

Next, we replaced the default serialization method with protostuff. Protostuff is an efficient serialization framework with the following advantages over others:

  • High performance: protostuff uses a zero-copy approach, serializing objects directly into byte arrays and avoiding the creation and copying of intermediate objects, which greatly improves serialization and deserialization performance.

  • Space efficiency: thanks to its compact binary format, protostuff serializes objects into smaller byte arrays, saving storage space.

  • Ease of use: protostuff builds on protobuf but supports Java better; you only need to define the Java object structure and annotations to serialize and deserialize.

There are many other serialization options, such as Thrift. For a performance comparison, see the figure below (Figure 12); readers can choose based on their own projects.

Figure 12 (Image source: Google Code)

Finally, we applied Redis pipelining. A pipeline packages multiple Redis commands into one request and sends them to the Redis server at once, reducing network round trips and server load. Pipelining mainly improves Redis throughput and reduces latency, and the effect is most noticeable when a large number of similar commands must be executed.
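A condensed sketch of how protostuff serialization and pipelined batch reads fit together (Jedis client; class and key names are illustrative, and the compression step is omitted for brevity):

```java
import io.protostuff.LinkedBuffer;
import io.protostuff.ProtostuffIOUtil;
import io.protostuff.Schema;
import io.protostuff.runtime.RuntimeSchema;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ProfileBatchReader {

    private static final Schema<UserProfile> SCHEMA = RuntimeSchema.getSchema(UserProfile.class);

    public static byte[] serialize(UserProfile profile) {
        LinkedBuffer buffer = LinkedBuffer.allocate(512);
        try {
            return ProtostuffIOUtil.toByteArray(profile, SCHEMA, buffer);
        } finally {
            buffer.clear();
        }
    }

    public static UserProfile deserialize(byte[] bytes) {
        UserProfile profile = new UserProfile();
        ProtostuffIOUtil.mergeFrom(bytes, profile, SCHEMA);
        return profile;
    }

    /** Batch read: queue all GETs in one pipeline, then sync once (a single network round trip). */
    public static List<UserProfile> batchGet(Jedis jedis, List<Long> userIds) {
        Pipeline pipeline = jedis.pipelined();
        List<Response<byte[]>> responses = new ArrayList<>(userIds.size());
        for (Long id : userIds) {
            byte[] key = ("account:profile:" + id).getBytes(StandardCharsets.UTF_8);
            responses.add(pipeline.get(key));
        }
        pipeline.sync();

        List<UserProfile> result = new ArrayList<>(userIds.size());
        for (Response<byte[]> response : responses) {
            byte[] bytes = response.get();
            result.add(bytes == null ? null : deserialize(bytes));
        }
        return result;
    }

    public static class UserProfile {   // illustrative profile object
        public long userId;
        public String nickname;
        public String avatarUrl;
    }
}
```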

These optimizations ultimately halved our Redis capacity usage and improved interface performance by roughly 5x, at the cost of about 10% additional CPU consumption.

3.2 Database

Compared with application services, the database is more likely to become the bottleneck in a high-concurrency system. It cannot be scaled horizontally as conveniently as the application layer, so database planning must be done in advance.

3.2.1 Read and write separation

The account business has far more reads than writes, so the first pressure we encountered was database read pressure. A read/write separation architecture (Figure 13) effectively reduces the load on the primary: the primary handles all write traffic, while the primary and its replicas share the read traffic. Multiple replicas can be configured so that high-concurrency query traffic is spread across them.

Figure 13

We keep some read capacity on the primary because of master-slave replication lag: scenarios that cannot tolerate data lag continue to query the primary. The advantage of read/write separation is its simplicity; there is almost no code change, and only the master-slave relationship needs to be set up. It also has clear limitations: it does not solve high write TPS, and replicas cannot be added without restraint, since too many replicas aggravate replication lag.
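On the application side, one common way to implement the routing is Spring's AbstractRoutingDataSource, as in the sketch below; the data source keys and the ThreadLocal switch are illustrative, and many teams use a database proxy or middleware instead:

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

/**
 * Routes queries to the replica by default and to the primary when the
 * caller explicitly opts in (e.g. for lag-sensitive reads or for writes).
 */
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {

    private static final ThreadLocal<String> KEY = ThreadLocal.withInitial(() -> "replica");

    public static void useMaster()  { KEY.set("master"); }
    public static void useReplica() { KEY.set("replica"); }
    public static void clear()      { KEY.remove(); }

    @Override
    protected Object determineCurrentLookupKey() {
        return KEY.get();
    }
}

// Wiring sketch (e.g. in a Spring @Configuration class):
// ReadWriteRoutingDataSource ds = new ReadWriteRoutingDataSource();
// ds.setTargetDataSources(Map.of("master", masterDs, "replica", replicaDs));
// ds.setDefaultTargetDataSource(replicaDs);
```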

3.2.2 Table and database splitting

Read/write separation certainly cannot solve every problem; some scenarios also require splitting tables and databases. There are two approaches: vertical splitting and horizontal splitting. The vivo Internet technology official account has already published a detailed explanation of database and table splitting, so we will not repeat it here; interested readers can read the detailed discussion on horizontal database and table splitting. Here we focus on the motivation for splitting and some experience that helps with the decision.

(1) What problem does table splitting solve?

The usual answer is that it solves the performance problems caused by large tables. But where exactly is the impact? And how do we judge whether a table should be split?

① Query efficiency

The most direct effect of a large table is on query efficiency. Let's take MySQL's InnoDB engine as an example to analyze the specific impact. InnoDB organizes indexes as B+Trees. Taking the primary key index (clustered index) as an example, its leaf nodes store complete rows, while non-leaf nodes store key values plus page address pointers. Each node corresponds to a data page, the smallest storage unit in InnoDB, with a default size of 16 KB. A schematic of a clustered index is shown below (Figure 14):

Figure 14

A query on the clustered index tree starts from the root node, performs a binary search within the node to locate the data page at the next level, and after reaching the leaf node performs a binary search to locate the row. From this process we can see that query cost depends mainly on the height of the index tree: each extra level means loading one more data page (if it is not already in memory) and one more binary search within that page.

To evaluate the impact of data volume on queries, we can estimate the relationship between the index tree height and the number of rows. As mentioned above, non-leaf nodes store key values plus page address pointers, and a page pointer is fixed at 6 bytes, so the fan-out of a non-leaf node is roughly pagesize / (index_size + 6). Leaf nodes store the actual rows, so each holds roughly pagesize / row_size rows. The relationship between tree height h and row count is therefore approximately: rows ≈ (pagesize / (index_size + 6))^(h-1) × (pagesize / row_size).

Using this formula with an auto-increment BIGINT primary key (8 bytes), a single-row size of 1 KB, and the default 16 KB page size: each non-leaf node holds about 16384 / (8 + 6) ≈ 1170 pointers and each leaf page holds 16384 / 1024 = 16 rows, so a three-level tree can hold roughly 1170 × 1170 × 16 ≈ 20 million rows. This method is only meant for estimation; to determine the real value you can inspect the data pages directly with dedicated tools.

With this understanding, the logic behind the splitting schemes becomes clear. Horizontal splitting actively controls the number of rows per table and thereby the tree height. Vertical splitting of a table shrinks the row size, increasing the capacity of each leaf page so that a tree of the same height can hold more rows.

② Efficiency of table structure changes

Business changes occasionally require table structure adjustments, such as adding fields, resizing fields, or adding indexes. You will find that the larger the table, the longer some DDL statements take; adding a field to some large online tables can take days. Which DDL operations are the most time-consuming? Refer to the online DDL operation notes on the MySQL official website (details) and check whether an operation rebuilds the table; if it does, the larger the data volume, the longer it takes.

Beyond the impact on structure changes and queries, the larger the data volume, the lower the tolerance for mistakes, which is a hidden risk for stability.

For the reasons above, in our business we try to keep the index tree height at three levels, which corresponds to table sizes in the tens of millions of rows. If data growth is expected to exceed this, we evaluate the table's importance to the business, its usage scenarios, and so on, and split it at an appropriate time.

(2) What problem does database splitting solve?

Database splitting is usually understood as solving resource bottlenecks. No matter how powerful the hardware, a single database instance has upper limits on connections, disk space, and so on. After splitting, different instances can be deployed on different physical machines, breaking through disk and connection bottlenecks while providing better performance.

Beyond resource limits, in the account system we also split databases based on reliability requirements, so that core modules are isolated from non-core modules and their mutual impact is reduced. The current account system after splitting is shown below (Figure 15).

Figure 15

The account main library after splitting is the core business library. It organizes data around the four account elements (username, password, email, and mobile phone number), so that the data relied on by the core login and registration processes is free from interference by other data. This kind of split is vertical splitting: tables are divided into different databases according to certain rules.

(3) Data Migration Practice

Data migration is the most costly part of implementing a database/table splitting scheme. In our vertical database splitting practice, the account system mainly used MySQL's master-slave replication mechanism to reduce the migration cost: the DBA first attaches a new replica to the original primary and replicates the table data to the new library. To ensure data consistency, the online cut-over is done in three steps (Figure 16):

  • Step 1: Disable writes on the primary to ensure that the primary and the replica are fully synchronized;

  • Step 2: Break the master-slave relationship so that the new library becomes an independent primary;

  • Step 3: The application switches its routing to the new library.

Figure 16

With the DBA's cooperation, these operations keep the impact on the business at the minute level, and the impact is relatively controllable. The code-level changes required by the whole solution are also very small. The only caveat is that you must rehearse the procedure before going live.

Besides the vertical splitting scenario above, the account system also went through horizontal splitting after a single core business table exceeded 100 million rows, so the copy-based migration described above does not apply in that case. That split was implemented at the end of 2018, using the open-source Canal to migrate the data. The overall scheme is shown below (Figure 17).

Figure 17 
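For readers who want to prototype a similar Canal-based synchronization, the loop below follows Canal's standard client usage; the connection parameters, destination, and table filter are placeholders, and applying row changes to the sharded target tables is left as a stub:

```java
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;

import java.net.InetSocketAddress;

public class AccountTableSyncer {

    public void run() throws Exception {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        connector.subscribe("account\\.user_account");     // only the table being split

        while (!Thread.currentThread().isInterrupted()) {
            Message message = connector.getWithoutAck(1024);
            long batchId = message.getId();
            if (batchId == -1 || message.getEntries().isEmpty()) {
                Thread.sleep(200);                          // nothing to consume yet
                continue;
            }
            for (CanalEntry.Entry entry : message.getEntries()) {
                if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) {
                    continue;
                }
                CanalEntry.RowChange rowChange = CanalEntry.RowChange.parseFrom(entry.getStoreValue());
                for (CanalEntry.RowData rowData : rowChange.getRowDatasList()) {
                    applyToShard(rowChange.getEventType(), rowData);
                }
            }
            connector.ack(batchId);                         // acknowledge only after applying
        }
    }

    private void applyToShard(CanalEntry.EventType eventType, CanalEntry.RowData rowData) {
        // Stub: route the row to the correct shard by user ID and write it there.
    }
}
```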

4. Monitoring and Governance

The purpose of monitoring governance is to let us understand the system's status in real time, give early warning of faults, and help locate problems quickly. In the early days, the account system went through a period when alarm coverage was incomplete and R&D did not receive alarms in time; sometimes an alarm was received but, because the cause was unknown, troubleshooting was difficult and took too long. After continuous governance and repeated online verification, we can now perceive problems sensitively and handle them quickly.

4.1 Monitoring content

We summarize the monitoring content into three dimensions (Figure 18), from top to bottom:

  • Upper layer, application service monitoring: monitors the state of the application layer, such as service throughput, return codes (failures), response time, and business exceptions;

  • Middle layer, independent component monitoring: covers the middleware the service runs on, such as Redis (cache), MQ (messaging), MySQL (storage), Tomcat (container), and the JVM;

  • Bottom layer, system resource monitoring: monitors the host and underlying resources, such as CPU, memory, disk I/O, and network throughput.

The monitoring content covers three dimensions for a reason. If you only watch the application services, then when a problem occurs you only know the result and cannot quickly locate the cause; you can only check possibilities one by one based on experience, and that fault-handling speed is unacceptable. Alarms from upper-layer applications are often caused by abnormalities in some component or in the underlying system resources. For example, when we receive a service response-time alarm, if there is a corresponding JVM full-GC duration alarm or a MySQL slow query alarm at the same time, it is easy to determine the direction to investigate first and decide on the follow-up measures.

Figure 18

Besides supporting problem location, component monitoring and underlying resource monitoring also help eliminate hidden risks in advance. Many risks have limited impact on application services at first, but the impact grows gradually as external factors such as call volume change.

On maintaining the monitoring content: among the three dimensions, the underlying system resources and the middle-layer components are relatively fixed and do not require frequent maintenance. Business exceptions in the upper-layer application monitoring, however, need to be continuously added and removed as feature versions iterate.

4.2 Aggregation of associated indicators

Because of the division of labor within the company among R&D, application operations, and system operations, the three dimensions of monitoring tend to be managed separately, and the metrics may be scattered across different systems, which is very unfavorable for problem analysis. A good monitoring system connects the metrics of all three dimensions so that analysis and handling become more efficient. The following is our experience tracking a "sporadic dubbo thread pool exhaustion" problem. The difficulty with sporadic problems is that a single analysis cannot settle the conclusion. With the help of the company's business monitoring system, after ruling out intermediate components such as Redis, we focused on host metrics: the key metrics (CPU, IO, NET) of all virtual machines on the same host were aggregated, with the effect shown below (Figure 19). After several rounds of verification, we determined that the disk IO of an individual application on the host was abnormally high.

Figure 19

4.3 Caller distinction

In application service monitoring, the top-N interfaces by call volume are key monitoring objects. For a mid-platform service, however, interface granularity is not enough: it needs to be refined to the caller dimension so that the growth trend of the top-N callers of a specific interface can be monitored. The benefit is twofold: the finer the monitoring granularity, the earlier risks can be perceived; and once traffic is confirmed to be unreasonable, targeted measures such as rate limiting can be applied.
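As an illustration of caller-level granularity (our monitoring is built on vivo's internal platform, so this sketch uses the open-source Micrometer library instead; metric and tag names are hypothetical), tagging each call with its caller makes it straightforward to chart the top-N callers of any interface:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class CallerMetrics {

    private final MeterRegistry registry;

    public CallerMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Record one invocation of an interface, tagged by caller, together with its latency. */
    public void record(String interfaceName, String callerId, long elapsedMillis) {
        Timer.builder("account.api.calls")
                .tag("interface", interfaceName)
                .tag("caller", callerId)
                .register(registry)
                .record(Duration.ofMillis(elapsedMillis));
    }
}
```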

5. Summary

This article has introduced some of the account system's experience in stability construction from the dimensions of service splitting, relationship governance, caching, databases, and monitoring governance. But that alone is not enough: stability construction requires a rigorous, scientific engineering management system that covers not only R&D design, development, and maintenance, but also the work of every role on the project team. In short, stability construction requires careful planning and practice throughout the project lifecycle. We hope the experience and ideas described here can serve as a guide for readers in practice.
