High concurrency and high availability in game servers: supporting millions of concurrent players without problems

To put it plainly, the most basic and important requirement of a good server architecture is to support millions of players online at the same time, without problems. These two points correspond to high concurrency and high availability respectively.

This article systematically introduces high concurrency and high availability in game servers.

High concurrency and high availability are complementary goals. If we can support millions of concurrent players but cannot keep the service stably available, the high-concurrency support is meaningless; conversely, if the server frequently breaks down whenever the player count is large, it cannot be called highly available either.

1. Horizontal scaling

Horizontal scaling is the foundation of both high concurrency and high availability. With horizontal scaling we can, in theory, obtain unlimited capacity simply by adding machines, which supports high concurrency; on top of that, if one process fails, other processes can take over and continue to provide the service, which gives us high availability.

The following figure shows an example. In an architecture that does not support horizontal scaling, there is only one battle process in the game server providing battle services for all players. This has two problems: 1. A single process can use at most the computing resources of one machine, so there is a hard ceiling on performance. 2. If that process, its machine, or its network fails, the entire system becomes unavailable.

[Figure: an architecture that does not support horizontal scaling]

There are two common implementation models for horizontal scaling:

  • The lobby servers are fully connected to all battle processes. When a lobby server needs the battle service, it looks up the address of the target process in a manager and then sends the request directly to that process (left picture).
  • A router sits in front of the battle processes. The router records which battle process owns each battle and forwards related requests to the corresponding process (right picture).

[Figure: two models of horizontal scaling]
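As a rough illustration of the second (router) model, here is a minimal sketch of a router that keeps a battle_id → process mapping and forwards requests. The class, the send_to helper, and the address strings are illustrative names only, not part of any particular framework.

```python
# Minimal sketch of the "router in front of the battle processes" model.
# BattleRouter, send_to, and the address strings are illustrative only.

def send_to(addr, request):
    # Stand-in for a real RPC/socket send.
    print(f"forwarding {request} to {addr}")

class BattleRouter:
    def __init__(self):
        self.processes = []            # known battle process addresses
        self.battle_to_process = {}    # battle_id -> owning process

    def register_process(self, addr):
        self.processes.append(addr)

    def create_battle(self, battle_id):
        # Round-robin assignment; a real router would pick the least-loaded process.
        addr = self.processes[len(self.battle_to_process) % len(self.processes)]
        self.battle_to_process[battle_id] = addr
        return addr

    def forward(self, battle_id, request):
        send_to(self.battle_to_process[battle_id], request)

router = BattleRouter()
router.register_process("battle-1")
router.register_process("battle-2")
router.create_battle(42)
router.forward(42, {"op": "player_move"})
```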

1.1 Stateful vs. Stateless

Depending on whether state is kept in process memory, services can be divided into stateful and stateless:

  • Stateful service: state is kept in process memory. For example, a battle service keeps battle information (player character state, mob state, etc.) in memory, and player operations or battle logic mutate that information. Because game state tends to be complex and business logic changes frequently, most game features are implemented as stateful services.
  • Stateless service: the service only runs the processing logic and does not hold data; data is normally stored in a back-end database, so the logic typically involves many database operations. This type of service is widely used in the internet and web industries, and in games it fits features such as recharge and login. (A stateless service may still allocate temporary variables in memory while handling a single request.)

Since a stateless service keeps no state of its own, a process crash loses no information. In addition, as described below, stateless services use random routing and therefore tolerate failures better. From the high-availability perspective, stateless services are preferable.

However, because a stateless service keeps no state, every state operation becomes a database operation, which raises development cost (the code is more complicated to write) and puts more pressure on the database. Statelessness is therefore not suitable for every service; in general, services whose state is simple and clear-cut, such as a friend service, are good candidates for a stateless implementation. A minimal sketch of such a handler follows.
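For illustration, here is a minimal stateless "add friend" handler in the spirit described above; the FakeStore class, table layout, and function names are assumptions for the example, not a real project's API.

```python
class FakeStore:
    """Stand-in for a real database client; illustrative only."""
    def __init__(self):
        self.tables = {}

    def get(self, table, key):
        return self.tables.get((table, key))

    def put(self, table, key, value):
        self.tables[(table, key)] = value

def add_friend(db, player_id, friend_id):
    # Stateless: every call reads and writes the store; the process keeps nothing.
    friends = db.get("friends", player_id) or set()
    if friend_id in friends:
        return False                  # already friends
    friends.add(friend_id)
    db.put("friends", player_id, friends)
    return True

db = FakeStore()
print(add_friend(db, 1001, 2002))   # True
print(add_friend(db, 1001, 2002))   # False: the second call sees the persisted state
```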

1.2 Routing strategy

Stateful and stateless services use different routing methods.

Stateless services generally use random routing, which has a big advantage: if a process crashes or its network fails, we only need to remove that process from the routing table. Subsequent requests are unaffected; only the requests currently being processed by that process are lost.

For a stateful service, the routing must pin each request to a specific process, because other processes do not hold the relevant state. For the battle service mentioned above, for example, the router can hash the battle ID and send the request to the corresponding battle process. Such routing usually uses modulo or consistent hashing, and consistent hashing is generally preferred over modulo to avoid the churn caused by node failures.
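Here is a minimal consistent-hash ring sketch in the spirit of the routing described above; the node names and the number of virtual nodes are arbitrary choices for the example.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring; 100 virtual nodes per process is arbitrary."""
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    def route(self, key) -> str:
        idx = bisect.bisect(self._hashes, _hash(str(key))) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["battle-1", "battle-2", "battle-3"])
print(ring.route("battle:42"))   # the same battle ID always routes to the same process
# If battle-2 dies, rebuilding the ring without it only remaps the keys it owned.
```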

1.3 An example

The following figure is a simplified model of our game architecture; the real server is considerably more complicated, but it serves as an example.

[Figure: simplified cluster model. The cluster can be divided into three categories: stateful services that support horizontal scaling, stateless services that support horizontal scaling, and single-point services that do not.]

Processes belonging to horizontally scalable stateful and stateless services can account for about 90% of the project, while single-point services are few. Under this architecture the load ceiling of the game lies in the single-point services, but their logic is relatively simple and their individual capacity is very high. Moreover, a failure in a horizontally scalable service process only affects the players served by that process, so availability is high.

Stateful service

In our game's server cluster, roughly two-thirds of the processes handle per-player logic (the player cluster; many projects call these lobby servers). Each process handles the business logic of a subset of players, and players are distributed across the player processes by sharding.

Server capacity can be increased by adding more player-logic processes. We support adding or removing processes on the fly, i.e. dynamic scaling out and in. These processes are peers with no strong dependencies between them, so when one process crashes it does not affect players on other processes.

In addition to player processes, battle processes, family (guild) processes, and similar processes can be designed the same way.

All of the above are stateful services: we must record which process each player/battle/family lives in. If a process fails, the players/battles/families on other processes are unaffected, but those on the failed process become unavailable and some in-memory data may be lost.

Stateless service

Some services, such as login, payment, friends, and some leaderboards, are implemented statelessly. Since stateless services tolerate failures better and have a simpler dynamic scaling model than stateful ones, we implement a new service statelessly by default and only fall back to a stateful design when its state is too complex.

Single-point service

Some single-point services are unavoidable in a game server, such as the player manager, cluster manager, and family manager. These services cannot scale and are the bottleneck of the game server. They are also not highly available: if one of them fails, the entire game cluster becomes unavailable.

Single-point service logic is usually simple (anything complex must support horizontal scaling), so its capacity is usually high. For example, our current load testing and production estimates put the conservative ceiling at around 500,000 concurrent players; at that point we expect some of our single-point services to saturate, preventing the game from scaling further.

In addition, because single-point services are few, the probability of one failing is low. Our game has been live for nearly two years and we have encountered only two machine outages, both of which hit non-single-point processes and did not affect the availability of the cluster as a whole.

Of course, a single-point service can also be reworked to support horizontal scaling; it is just a matter of effort. In theory every single point can be eliminated, but for most projects the cost-benefit ratio is poor and it is not worth doing.

2. High concurrency

Horizontal scaling, introduced above, is the main means of supporting high concurrency (also referred to as scalability).

The following sections cover other approaches to high concurrency beyond horizontal scaling, along with points that need attention.

2.1 Vertical scaling and performance optimization

There are generally two ways to increase capacity:

  • Horizontal scaling: increase capacity by adding more machines.
  • Vertical scaling: increase capacity by upgrading machine specifications, letting one machine/thread carry more players.

Vertical scaling is still useful in certain scenarios. In particular, for the single points mentioned above, if they are hard or expensive to eliminate, their logic can be placed on a higher-spec machine so the single-point logic can carry more load.

We also often optimize the performance of battle servers, for example by rewriting expensive modules in C++, but we generally do not treat this as the primary way to raise the lobby servers' capacity, and we will not discuss it in depth here. On the one hand there is always an upper limit and it is hard to achieve a qualitative change; on the other hand, optimization approaches vary greatly between games, and they are all code-level optimizations.

The goal of server-side optimization is similar to that of vertical scaling: to let one machine carry more players/logic.

2.2 Eliminating system single points and logical single points

The single-point elimination introduced above is mainly about system single points, i.e. using multiple processes instead of one to provide a service.

The prerequisite for eliminating a system single point is eliminating the logical single point.

For example, when a weapon drops we must generate a unique ID to identify it. If that ID comes from an auto-increment counter, the counter becomes a logical single point.

If weapons drop very rarely, this design can work, but if they drop frequently, every piece of logic in the game has to go to one place to request an ID, which can become a bottleneck. In such cases we generally use a UUID instead of an auto-increment ID. (The same issue applies to auto-increment columns in the database, which is why we generally recommend using auto-increment sparingly.)
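As a concrete illustration, IDs can be generated locally in each process with no shared counter at all. The snowflake-style variant below is only a sketch, with arbitrarily chosen epoch and bit widths.

```python
import os
import time
import uuid

# Simplest option: a random UUID generated locally, no shared counter at all.
weapon_id = uuid.uuid4().hex

# Snowflake-style sketch: timestamp + worker id + sequence, again no shared counter.
# The epoch and bit widths are arbitrary, and this toy version is not thread-safe
# and does not handle clock rollback or sequence overflow within a millisecond.
_EPOCH_MS = 1_600_000_000_000
_seq = 0

def next_id(worker_id: int = os.getpid() % 1024) -> int:
    global _seq
    now_ms = int(time.time() * 1000) - _EPOCH_MS
    _seq = (_seq + 1) % 4096
    return (now_ms << 22) | (worker_id << 12) | _seq

print(weapon_id, next_id())
```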

2.3 Database load

When the number of online players reaches a certain level, it often puts great pressure on the back-end database.

Generally speaking, databases can themselves scale horizontally, and with sharding and table-splitting schemes it is relatively easy to raise their capacity. However, when designing the schema you also need to think about indexes and shard keys, otherwise database performance suffers badly. You should also consider the database's concurrent read/write capability. In MongoDB, for example, the old MMAPv1 storage engine locks at the collection level while WiredTiger provides document-level concurrency, so the two differ enormously in concurrent capability.
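For illustration, here is a minimal sketch of creating an index and a hashed shard key with pymongo; the connection string and the database/collection names are placeholders, and the sharding commands assume a sharded deployment reached through mongos.

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
db = client["game"]

# Index the fields that hot queries filter on, e.g. player lookups by player_id.
db.players.create_index([("player_id", ASCENDING)], unique=True)

# On a sharded cluster (connected through mongos), choose a shard key that spreads
# writes evenly; a hashed key on battle_id is one reasonable choice for battle logs.
client.admin.command("enableSharding", "game")
client.admin.command("shardCollection", "game.battle_log",
                     key={"battle_id": "hashed"})
```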

Game logic is generally complex and reads and writes a lot of data. If we hit the database every time player information changes, the database pressure becomes enormous. That is why player services are generally stateful: a player's data is loaded from the database into memory at login, all reads and writes during the session operate directly on memory, and the data is persisted when the player logs out or periodically while online. This greatly reduces database reads and writes and relieves a lot of database pressure.
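Below is a minimal sketch of this load-on-login, flush-on-logout-or-periodically pattern; the PlayerCache and FakeDB classes, the flush interval, and the data layout are illustrative assumptions.

```python
import time

FLUSH_INTERVAL = 300  # seconds; arbitrary value for the sketch

class FakeDB:
    """Stand-in for the real database; illustrative only."""
    def __init__(self):
        self.rows = {}
    def load(self, player_id):
        return self.rows.get(player_id)
    def save(self, player_id, data):
        self.rows[player_id] = dict(data)

class PlayerCache:
    """Load a player's data into memory at login, flush on logout or periodically."""
    def __init__(self, db):
        self.db = db
        self.players = {}       # player_id -> in-memory data
        self.last_flush = {}    # player_id -> last persist timestamp

    def on_login(self, player_id):
        self.players[player_id] = self.db.load(player_id) or {"gold": 0}
        self.last_flush[player_id] = time.time()

    def add_gold(self, player_id, amount):
        self.players[player_id]["gold"] += amount   # memory only, no DB round trip

    def tick(self):
        now = time.time()
        for pid, data in self.players.items():
            if now - self.last_flush[pid] >= FLUSH_INTERVAL:
                self.db.save(pid, data)             # periodic flush while online
                self.last_flush[pid] = now

    def on_logout(self, player_id):
        self.db.save(player_id, self.players.pop(player_id))
        self.last_flush.pop(player_id, None)

cache = PlayerCache(FakeDB())
cache.on_login(1001)
cache.add_gold(1001, 50)
cache.on_logout(1001)
```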

For services whose data is read and written infrequently, you can consider making them stateless and hitting the database on every operation.

2.4 Multi-cluster and cross-cluster

Once a game server reaches a certain scale, it often has to be deployed as multiple clusters. Typical scenarios that call for multiple clusters include:

  • A single cluster has a capacity ceiling. For example, skynet only supports 256 processes per cluster.
  • Multi-region requirements: each cluster corresponds to one game region (server). If the game uses separate regions that are completely isolated with no cross-server communication, achieving high concurrency becomes much easier.
  • Global deployment: some clusters should be deployed close to the players. For example, battle servers for American players are deployed in the United States and battle servers for Southeast Asian players in Southeast Asia, while they share lobby servers deployed in one location.

One problem multi-cluster deployment must solve is cross-cluster communication. Within a cluster, processes are generally fully connected, but fully connecting processes across clusters would make the topology chaotic and the number of connections explode. Cross-cluster communication therefore usually goes through a message bus: clusters talk to each other via the bus.
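As a rough illustration of the idea (not the author's exact implementation), here is a minimal cross-cluster message bus sketch using Redis pub/sub via redis-py; the channel naming scheme and payload format are assumptions.

```python
import json
import redis

bus = redis.Redis(host="localhost", port=6379)   # placeholder bus endpoint

def send_cross_cluster(target_cluster: str, msg: dict) -> None:
    # Each cluster subscribes to its own channel; the payload format is illustrative.
    bus.publish(f"cluster:{target_cluster}", json.dumps(msg))

def serve_cluster(cluster_name: str) -> None:
    sub = bus.pubsub()
    sub.subscribe(f"cluster:{cluster_name}")
    for item in sub.listen():
        if item["type"] == "message":
            msg = json.loads(item["data"])
            print(f"[{cluster_name}] got cross-cluster message: {msg}")
```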


2.5 Transient high concurrency

In games, the online count is closely tied to the time of day and to in-game events, and concurrency at different times can differ by several times or even dozens of times.

Expected traffic peaks can be handled by scaling out in advance; refer to the "dynamic scaling" section of "Ninzo's Server Optimization".

For unexpected momentary spikes, a login queue can hold the extra traffic outside the system and let players in gradually after dynamic scale-out.
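A minimal login-queue sketch in this spirit is shown below; the admission rate, capacity check, and queue structure are illustrative assumptions.

```python
from collections import deque

class LoginQueue:
    """Admit at most `rate_per_tick` players per tick while capacity allows."""
    def __init__(self, rate_per_tick: int):
        self.rate_per_tick = rate_per_tick
        self.waiting = deque()

    def enqueue(self, player_id):
        self.waiting.append(player_id)
        return len(self.waiting)          # queue position shown to the player

    def tick(self, has_capacity) -> list:
        admitted = []
        while self.waiting and len(admitted) < self.rate_per_tick and has_capacity():
            admitted.append(self.waiting.popleft())
        return admitted

# Usage sketch: call tick() once per second from the gateway.
queue = LoginQueue(rate_per_tick=50)
queue.enqueue("player-123")
print(queue.tick(has_capacity=lambda: True))
```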

2.6 High concurrency in battle scenarios

Games also have a special kind of high-concurrency scenario: large numbers of MMO players crowding into a single scene, such as a nation war.

There is no perfect solution here; you can only raise the capacity as far as possible. Common techniques include the following (a small cell-partition sketch follows the list):

  • Split a scene into cells and place different cells in different processes, as in bigworld/kbengine and the more recent SpatialOS.
  • Raise single-process capacity, e.g. logic optimization, vertical scaling, or writing the game logic in C++; C++ and Python differ by an order of magnitude in performance.
  • Service degradation: simplify the game logic. During a nation war, for example, it is usually enough that players feel the scene is lively, and much of the battle logic is in fact simplified.
  • Split servers / lines / instances so that players are isolated at the business level.
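Here is a minimal sketch of the cell idea from the first bullet; the cell size and the cell-to-process mapping are arbitrary choices for the example, not how any of the named engines actually implement it.

```python
CELL_SIZE = 100.0          # world units per cell; arbitrary for the sketch
BATTLE_PROCESSES = ["battle-1", "battle-2", "battle-3", "battle-4"]

def cell_of(x: float, y: float) -> tuple:
    # Which grid cell a world position falls into.
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))

def process_for_cell(cell: tuple) -> str:
    # Spread cells across battle processes; a real engine must also handle
    # cell migration and entities interacting near cell borders.
    return BATTLE_PROCESSES[hash(cell) % len(BATTLE_PROCESSES)]

print(process_for_cell(cell_of(1250.0, 430.0)))
```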


3. High availability

A high-availability system aims to minimize the time its services are unavailable during operation.

The usual metric is the service's available time over a period (the SLA, Service Level Agreement): service availability = (total minutes in the period − minutes of unavailability) / total minutes in the period × 100%.
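As a quick worked example of what these percentages allow, the snippet below converts an availability target into permitted downtime over a 30-day period; the listed targets are just common reference points.

```python
MONTH_MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day period

for availability in (0.999, 0.9999, 0.99975):
    allowed_downtime = MONTH_MINUTES * (1 - availability)
    print(f"{availability:.4%} availability -> "
          f"{allowed_downtime:.1f} minutes of downtime per month")

# 99.9%   -> 43.2 minutes/month
# 99.99%  ->  4.3 minutes/month
# 99.975% -> 10.8 minutes/month
```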

Availability is generally evaluated along two dimensions: 1. Full availability: every service is available to every user. 2. Overall availability: some services or some users are affected, but the system as a whole remains available.

The goal of high availability is to strive for full availability while guaranteeing overall availability.

Failures in large clusters

Because low-probability failures such as machine faults and network congestion or disconnection objectively exist, the server must take them into account. In a large cluster, small-probability events accumulate into high-probability events, so for large clusters high availability is a problem we must address.

High availability is essentially the isolation and handling of all kinds of failures, so that low-probability failure events do not affect the game's service as a whole.

Common failures include the following:

  • Machine/process/network failures: Alibaba Cloud promises 99.975% availability for its machines, which works out to roughly one to two hours of downtime per machine per year. With a cluster of a hundred machines, the theoretical worst case is one machine being down for an hour every few days. In practice availability is much better than the promise: our game runs on about 100 ECS instances, and in a year two machines restarted automatically due to hardware faults, with no prolonged unavailability.
  • SaaS/DB failures: we rely heavily on cloud services such as Alibaba Cloud MySQL and Redis, which have their own availability issues and can trigger master-slave failovers. The most common problem we hit is that a Redis failover drops connections and forces a reconnect.
  • Business bugs: try to limit the blast radius of bugs so that a small bug cannot make the whole system unavailable.
  • Sudden performance hotspots: sudden load caused by normal or abnormal player behavior, which is mostly the high-concurrency problem discussed above. For abnormal behavior (cheating tools, DDoS attacks, etc.) we must also ensure the overall availability of the system; I remember that long ago, in the Legend era, some cheat tools could crash and restart the server directly.

3.1 High availability based on horizontal scaling

As mentioned above, horizontal scaling improves both concurrent capacity and availability, but with a different emphasis. For high concurrency, horizontal scaling means we can add capacity by adding machines/processes; for high availability, it means that when a machine or process fails or crashes, the cluster as a whole remains available.

As noted in the horizontal-scaling section, for a service that scales horizontally, a failed process only affects the service it was providing, while other processes keep running; for a stateless service the impact is smaller still, limited to the requests being processed at the time.

Of course, this requires writing some supporting logic (a minimal heartbeat-monitoring sketch follows the list), including:

  • Failure detection: detect failures quickly, generally through heartbeats or message timeouts.
  • Failure handling: for example, what to do when a message times out, whether to act on it or ignore it; if a process is deemed unavailable, it must be kicked out of the cluster.
  • Service recovery: a stateless service just needs a restart; for a stateful service the state can be migrated to another process, which has many pitfalls. A common recovery scheme is simply to kick the affected players offline and have them log in again.
  • Service degradation: common measures are the login queue and disabling specific features.
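Below is a minimal heartbeat-monitor sketch for the failure-detection point above; the timeout value and the on_node_down callback are illustrative assumptions.

```python
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without a heartbeat before a node is considered down

class HeartbeatMonitor:
    def __init__(self, on_node_down):
        self.last_seen = {}          # node -> timestamp of last heartbeat
        self.on_node_down = on_node_down

    def heartbeat(self, node: str) -> None:
        self.last_seen[node] = time.time()

    def check(self) -> None:
        now = time.time()
        for node, ts in list(self.last_seen.items()):
            if now - ts > HEARTBEAT_TIMEOUT:
                del self.last_seen[node]     # kick the node out of the cluster view
                self.on_node_down(node)

monitor = HeartbeatMonitor(on_node_down=lambda n: print(f"{n} removed from cluster"))
monitor.heartbeat("battle-7")
monitor.check()   # call periodically, e.g. once per second
```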

Implementing all of this properly is actually quite involved, so I will not go into the details here.

Service isolation and gray release

During development we should split large features into smaller services as much as possible, each responsible for one small piece of functionality. Skynet provides a good service model for this: different services can live in the same process or in different processes.

A previous article covered service isolation and gray releases; the idea is the same, isolating high-risk services so that even when they go wrong, the overall availability of the system is unaffected.

3.2 Master-slave replication

For a stateful service, horizontal scaling ensures that a failure in one process does not stop other processes from serving, but the crashed process's own service still becomes unavailable, and the state held in its memory is lost.

The usual remedy is master-slave replication, which is very common in databases as a standard way to achieve database high availability.

In master-slave replication, one or more slave nodes hang off a master node, and the master replicates its state/data to the slaves in near real time. Normally the master provides the service; when the master fails, a slave is promoted to master and continues serving. Because the master replicates data to the slaves almost in real time, data loss is approximately prevented.
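The sketch below shows the idea in miniature: a master applies each write locally and forwards it to its replicas, one of which can be promoted on failure. The classes and method names are purely illustrative and ignore real-world concerns such as replication lag and failover consensus.

```python
class ReplicaNode:
    """Keeps a copy of the master's state and can be promoted on failure."""
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value

class MasterNode(ReplicaNode):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def write(self, key, value):
        self.apply(key, value)            # apply locally first
        for replica in self.replicas:     # then push the change to every slave
            replica.apply(key, value)

slaves = [ReplicaNode(), ReplicaNode()]
master = MasterNode(slaves)
master.write("battle:42", {"round": 3})

# If the master dies, promote a slave: its state is (almost) up to date.
promoted = MasterNode(replicas=slaves[1:])
promoted.state = slaves[0].state
print(promoted.state)
```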

So if you want to further improve the availability of a stateful service or a single-point service, master-slave replication is an option.

Game servers rarely use this scheme for business logic; it is more often used for cluster-management nodes (non-business logic).

In addition, since common databases (MySQL/Mongo/Redis) come with their own master-slave replication, a stateless service effectively lets the database manage the state for us and thereby inherits the no-data-loss guarantee of the database's replication.

3.3 Handling cloud service failures

Besides ECS machines, we make extensive use of Alibaba Cloud SaaS offerings, such as Redis/MySQL/Mongo databases and an ELK-like log service.

Most of these services support high-availability mechanisms such as master-slave failover, but we still have to consider how a failover affects our own system.

For MySQL and Redis, a master-slave failover or a gateway/proxy failure drops the network connection, so our logic must detect the disconnect and reconnect. During the disconnect-and-reconnect window some database requests will inevitably fail, and we must handle those failures as well.

When persisting data, we must check whether each database request succeeded, retry on failure, and make the request idempotent so that a retried request is not applied twice.
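Below is a minimal retry-with-idempotency sketch in this spirit; the FlakyDB stand-in, the backoff values, and the idea of writing the whole document keyed by player_id (so repeating the write is harmless) are assumptions for the example.

```python
import time

class FlakyDB:
    """Stand-in store whose put() could raise ConnectionError; illustrative only."""
    def __init__(self):
        self.rows = {}
    def put(self, table, key, value):
        self.rows[(table, key)] = value   # a real client might raise ConnectionError here

def save_player(db, player_id, data, retries=3, backoff=0.5):
    """Persist a player's data with retries.

    The write is an idempotent "put whole document by player_id", so applying it
    more than once leaves the database in the same final state.
    """
    for attempt in range(1, retries + 1):
        try:
            db.put("players", player_id, data)
            return True
        except ConnectionError:
            if attempt == retries:
                raise                       # give up: alert, or queue for a later retry
            time.sleep(backoff * attempt)   # simple linear backoff before retrying
    return False

print(save_player(FlakyDB(), 1001, {"gold": 50}))
```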

4. The goals of high concurrency and high availability

Achieving high concurrency and high availability carries a considerable development cost, and most projects do not have unlimited manpower. So when doing this kind of work, avoid over-engineering: weigh business needs, expected load, and development cost together.

By development cost I mean not only the larger amount of code, but also the fact that a more complex system is more prone to problems. If there is not enough manpower to test, maintain, and iterate on it, a simpler implementation is often the better choice, because it is less likely to go wrong.

Of course, if your project is something like Honor of Kings or Peacekeeper Elite, where the requirements for high concurrency and high availability are extreme and manpower is nearly unlimited, feel free to ignore this paragraph.

One million concurrent players

In the game industry, one million concurrent players is generally taken as the concurrency target for a game server architecture, and that magnitude is also the practical ceiling for most games (apart from outliers like Honor of Kings and the big battle-royale titles).

So during early architecture design, planning, and stress testing, we can use one million concurrent players as the benchmark to estimate the capacity each system needs; reaching that capacity is enough.

Take our game as an example: although it has a number of single points and performance bottlenecks, our estimates show that even with them we can still reach 1,000,000 concurrent players by adding machines. Those single points and bottlenecks are therefore within expectations, and we do not optimize them further.

If one day we need to support tens of millions of concurrent players, we could in theory keep eliminating single points and optimizing bottlenecks, but the cost would rise significantly.

Highly available != fully available

Pursuing high availability for the server cluster does not mean tolerating every possible failure; that is neither possible nor realistic.

In skynet's philosophy, if a node fails to respond in time, requests to it simply block without a timeout, and if the node is down the call raises an error (a traceback) directly. In effect skynet treats the cluster as a single whole and provides no fault-tolerance mechanism: if a core node fails, the whole cluster is expected to be unavailable.

On the whole I agree with skynet's approach, since it greatly reduces the mental burden during business development. On that basis, for the more common failures we should minimize the impact and avoid avalanche effects.

High concurrency and high availability beyond technology

High concurrency and high availability also depend heavily on non-technical factors, such as operations capability, hardware conditions, staff quality, and management.

High concurrency requires deploying large clusters, which raises the bar for operations. With a small user base and a small cluster, manual operations may be acceptable, but as cluster size and complexity grow, manual operations become harder and harder and must be replaced by tooling and automation.

Many problems in game systems are actually caused by failures of operations, process, and standards, such as the mistaken reward delivery after Zhan Shuang (Punishing: Gray Raven) launched.

So to achieve high concurrency and high availability, operations tooling, procedures, and standards must be done well; manual operations should be eliminated in favor of tooling and automation. In addition, monitoring, alerting, and fast human response are necessary conditions for running a large system stably.

I hope this article is helpful for learning about and applying high concurrency. Thank you!

Origin blog.csdn.net/Ppikaqiu/article/details/112474255