Guanying Interactive Entertainment upgrades its games to a cloud-native architecture with OpenKruiseGame

Author: Liming

About Guanying Interactive Entertainment

Guanying Interactive Entertainment is a game company that integrates the research, development, and distribution of mobile, online, and VR games. Its officially licensed "Legend of Ragnarok" series of mobile games is popular with players. Building on years of in-house development and operations experience with MMORPGs, the company launched its 2D MMO game development engine, Thousand, and applied it to the recently released mobile game "Legend of Ragnarok - Meng Hui Zero 3". The cloud-native architecture behind the game has greatly improved the efficiency of opening and updating game servers while reducing server resource costs. It also lays a solid foundation for developing better products and for growing the game ecosystem more quickly.

MMORPG mobile game "Legend of Ragnarok - Meng Hui Zero Three"

The original intention of enabling cloud-native architecture

At the beginning of the Thousand engine project, the R&D team decided to adopt a cloud-native architecture based on the characteristics of traditional regional server games. The main considerations are as follows:

  1. District servers need strong isolation, and resource preemption should be avoided as much as possible. In the past, different district servers on the same host machine would interfere with each other's resources, which increased the number of players affected by a single problem. Container technology enables fine-grained resource control, avoids interference between district servers, and effectively limits the impact of failures.

  2. Managing game servers declaratively brings efficiency advantages. Moving from operating individual machines and running a series of scripts to a service-oriented, batch, and automated management approach not only makes server creation much more efficient, but also reduces the probability of errors during game maintenance.

  3. More refined fault location and faster service recovery are required. When district servers share computing nodes and a fault occurs, it is impossible to determine in time whether the root cause lies in district server A, district server B, or the host machine. Moreover, when a machine fails, migrating the business is very inefficient. A cloud-native architecture decouples infrastructure resources from the business to a certain degree, which makes business faults quicker to locate, while the lightweight nature and environmental consistency of containers make business recovery efficient. Both problem location and recovery efficiency improve dramatically.

  4. The cloud native ecosystem is growing stronger and stronger. Cloud native technology can not only highly integrate computing, network, storage and other infrastructure resources, but also make use of observability, scheduling, application delivery and other capabilities very easily.
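The declarative management described in item 2 can be pictured with a small sketch: the operator declares how many district servers should exist, and a control loop works out what to create or delete. This is only an illustration of the idea, not Guanying's or Kubernetes' actual code; the `Plan` type, the `reconcile` function, and the convention of numbering servers from 0 are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Actions needed to move the actual state to the declared state."""
    create: tuple
    delete: tuple


def reconcile(desired_replicas: int, running_ids: set) -> Plan:
    """Compare the declared server count with the servers that exist.

    District servers are numbered 0..desired_replicas-1, a hypothetical
    convention used only in this sketch.
    """
    desired_ids = set(range(desired_replicas))
    return Plan(
        create=tuple(sorted(desired_ids - running_ids)),
        delete=tuple(sorted(running_ids - desired_ids)),
    )


if __name__ == "__main__":
    # Server 1 was lost, and the operator raised the declared count to 4.
    print(reconcile(desired_replicas=4, running_ids={0, 2}))
```

The operator never lists the individual steps; the plan is derived from the difference between the declared and the observed state, which is what removes most manual scripting errors.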

Challenges of implementing game servers on Kubernetes

However, Kubernetes, the container orchestration standard that any move to cloud native has to adopt, offers very limited support for games. Guanying also encountered many challenges while adopting Kubernetes:

How to manage many regional servers on Kubernetes

  1. Each regional server needs to expose its own public network address so that players can connect directly to the regional server they select. Managing an additional access-layer network increases the operation and maintenance cost of managing district servers in batches. At the same time, binding one EIP to each district server pod consumes a large number of EIPs and wastes money.

  2. A single zone server is composed of multiple services and exists in the form of a "rich container" after containerization. Native Kubernetes only manages business status at the container level and cannot finely perceive the status of specific processes in the container, making it difficult to locate and handle faults or exceptions. Splitting services and deploying them separately increases the complexity of the architecture and makes transformation more difficult.

  3. A complete game server consists of an engine side and a script side. The game server engine supports hot update scripts to avoid player loss caused by frequent server shutdowns. The R&D team has designed a variety of hot update solutions for game servers after they are implemented on Kubernetes, including pulling the latest hot update files from public servers or dynamically mounting hot update files through cloud storage. But no matter which way, you will encounter various problems, including:

1) Version management of hot update files is not supported. After frequent updates, the many existing versions cannot form a corresponding relationship with the files, resulting in complicated rollback after update failure;

2) The update status is difficult to determine. Even if the files in the container have been updated and replaced, it is hard to tell whether the current hot-update files have been mounted when the reload command is executed. Operators have to track separately whether each update succeeded, which increases operation and maintenance complexity;

3) When a container becomes abnormal, the pod is rebuilt from the old version of the image, and the hot-update files are not preserved;

4) The update speed is always unsatisfactory.

OpenKruiseGame helps the cloud-native implementation of game servers

Guanying uses the community's open source project OpenKruiseGame to solve the above problems and to run the 2D MMO game development engine Thousand smoothly on Kubernetes. OpenKruiseGame (OKG for short) is the game-focused sub-project of OpenKruise, a CNCF incubating project. It is built specifically for games and helps game developers achieve a more agile elastic game architecture, unified and standardized operation and maintenance actions, consistent multi-cloud delivery, and self-built game operation and maintenance platforms.

Faced with the above challenges, Guanying used the following capabilities of OKG:

  1. OKG manages the access-layer network automatically. Users do not need to manually set up or tear down the network for each district server, and OKG supports different network models for different scenarios. Guanying uses the NATGW model based on its own business characteristics. A DNAT entry is generated automatically when a district server is opened and deleted automatically when the district server is merged (deleted). Multiple district servers share one EIP, making full use of the EIP's bandwidth.

  2. OKG holds that the service quality of game servers should be defined by users. Users can set game server states according to their own business and handle each state in a fine-grained way. For the "rich container" scenario, Guanying uses OKG's "custom service quality" feature to detect abnormal states of specific processes and expose them to Kubernetes, and then uses event notification components such as kube-event to send alerts to the operation and maintenance group. This helps operation and maintenance engineers find problems quickly, locating faults within seconds and handling them within minutes.

  3. OKG provides an in-place hot update solution based on container images. The hot-update scripts are packaged as a sidecar container that is deployed in the same game server as the main container. The two containers share the hot-update files through an emptyDir volume, so an update only needs to replace the sidecar container. In this way, hot updates of game servers are carried out in a cloud-native way:

1) The sidecar container image has version attributes, which solves the problem of version management;

2) Once the sidecar container has been updated successfully, Kubernetes reports it as Ready, so it is possible to tell whether the script update succeeded;

3) Even if a container restarts abnormally, the hot-update files are preserved because they are baked into the image;

4) The hot update process can be quickly completed through the image preheating mechanism.
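The sidecar-based hot update in item 3 can be modeled in a few lines. The sketch below is a toy simulation, not OKG code: the `GameServerPod` class, the `SCRIPT_IMAGES` registry, and the image tags are all hypothetical, and a plain dictionary stands in for the emptyDir volume. It illustrates why versioning, status reporting, and persistence across restarts follow naturally once the scripts live in an image.

```python
from dataclasses import dataclass, field

# Hypothetical registry: sidecar image tag -> script files baked into it.
SCRIPT_IMAGES = {
    "scripts:v1": {"quest.lua": "reward=10"},
    "scripts:v2": {"quest.lua": "reward=20"},
}


@dataclass
class GameServerPod:
    engine_image: str
    sidecar_image: str
    shared_dir: dict = field(default_factory=dict)  # stands in for emptyDir
    history: list = field(default_factory=list)     # previously used tags
    sidecar_ready: bool = False

    def start_sidecar(self) -> None:
        """(Re)start the sidecar: copy its baked-in scripts to the shared dir."""
        self.shared_dir = dict(SCRIPT_IMAGES[self.sidecar_image])
        self.sidecar_ready = True

    def hot_update(self, new_image: str) -> None:
        """Swap only the sidecar image; the engine container is untouched."""
        if new_image not in SCRIPT_IMAGES:
            raise ValueError(f"unknown image {new_image}")
        self.history.append(self.sidecar_image)
        self.sidecar_image = new_image
        self.sidecar_ready = False
        self.start_sidecar()

    def rollback(self) -> None:
        """Return to the previous sidecar image tag."""
        self.sidecar_image = self.history.pop()
        self.sidecar_ready = False
        self.start_sidecar()


if __name__ == "__main__":
    pod = GameServerPod("engine:1.0", "scripts:v1")
    pod.start_sidecar()
    pod.hot_update("scripts:v2")
    pod.start_sidecar()          # simulate an abnormal restart
    print(pod.sidecar_image, pod.shared_dir)
    pod.rollback()
    print(pod.sidecar_image, pod.shared_dir)
```

Because the image tag is the version, rolling back is just another update to an older tag, and a restart always reproduces the files of whichever tag is currently declared.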

Cloud-native achievements

The overall cloud native architecture diagram of Thousand engine is shown below. Guanying has implemented a platform project based on OKG:

  1. When a game engineer uploads a new script, the CI process is triggered, the image is packaged automatically, and a new GameServer is deployed to the Kubernetes cluster. Editing an existing script also triggers CI/CD, and the sidecar image of the corresponding GameServer is updated through OKG's in-place upgrade to hot-update the game server. The entire process requires no involvement from game operation and maintenance engineers. Cloud-native technology hands the ability to deploy and update game servers to game developers, which improves game production efficiency.

  2. The automatically generated GameServer is based on OKG's network function and has an independent public network access address (EIP: unique port). The Thousand engine platform provides a service discovery mechanism to allow players to directly connect to the corresponding regional server to play games.

  3. When a game server occasionally becomes abnormal, the Thousand engine platform uses the custom service quality function provided by OKG to detect the specific abnormality and notify the operation and maintenance engineers. They can then locate and respond to the problem quickly, protecting players' game experience as much as possible.
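The "EIP: unique port" addressing in item 2 can be sketched as a small port allocator. This is a simplified model of a NAT gateway sharing one EIP through DNAT entries, not the OKG network plugin itself; the `NatGateway` class, its method names, and the addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NatGateway:
    """Toy model of one EIP shared by many regional servers via DNAT."""
    eip: str
    port_range: range
    # external port -> (server name, internal "ip:port")
    entries: dict = field(default_factory=dict)

    def open_server(self, name: str, internal_addr: str) -> str:
        """Create a DNAT entry and return the public 'EIP:port' address."""
        for port in self.port_range:
            if port not in self.entries:
                self.entries[port] = (name, internal_addr)
                return f"{self.eip}:{port}"
        raise RuntimeError("no free external port on this EIP")

    def close_server(self, name: str) -> None:
        """Delete the DNAT entry when a server is merged or deleted."""
        for port, (owner, _) in list(self.entries.items()):
            if owner == name:
                del self.entries[port]

    def lookup(self, name: str) -> str:
        """Service discovery: the public address a player should connect to."""
        for port, (owner, _) in self.entries.items():
            if owner == name:
                return f"{self.eip}:{port}"
        raise KeyError(name)


if __name__ == "__main__":
    gw = NatGateway("203.0.113.10", range(30000, 30003))
    print(gw.open_server("s1", "10.0.0.5:7000"))
    print(gw.open_server("s2", "10.0.0.6:7000"))
    gw.close_server("s1")        # s1 is merged into another server
    print(gw.open_server("s3", "10.0.0.7:7000"))
```

Each server gets its own public address without its own EIP, and a port freed by a merged server becomes available to the next one that opens.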

The birth of the Thousand engine marks Guanying Interactive Entertainment’s upgrade of its game cloud-native architecture. Proven in production, cloud-native architecture brings the following advantages:

  • In terms of server opening efficiency: Traditionally, opening a server required manually configuring and associating the IP addresses and ports of all servers. Manual configuration had a relatively high failure rate, so opening a new zone took a long time. After containerization, all parameters are standardized and visualized, and servers can be opened quickly to meet traffic peaks while keeping the configuration complete. The time to open a new zone dropped from 30 minutes to 15 seconds, and the time to open a new server dropped from 2 minutes to 10 seconds, which greatly improves server opening efficiency.
  • In terms of update efficiency: The traditional update process overwrites the executable files in each directory, which is slow and error-prone. After containerization, the engine and scripts are split into two containers that can be updated separately, so update granularity is finer and more controllable and the update error rate drops. Image preheating also brings updates down to the level of seconds, and update efficiency increases fivefold.
  • In terms of cost saving: Traditionally, game server configurations were purchased in advance based on the estimated number of players in order to reserve resources. The resources each service needed could not be isolated precisely, which led to heavy redundancy, and the resource configuration could not be adjusted in time, causing a lot of waste. After containerization, game servers are isolated from each other in terms of resources, and with fine-grained scheduling, host resources can be fully utilized, saving at least 10% of resource costs.
  • In terms of problem location: In the traditional manual deployment environment, problems such as district server crashes often could not be discovered in time. After containerization, the specific failing process of a district server is exposed directly, service problems can be located and solved quickly, and problem response efficiency increases fivefold.
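To put the server opening figures on a common scale, the quoted before and after times can be converted to seconds and divided. The helper name `speedup` is introduced here only for illustration; the input numbers are the ones stated in the list above.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the new process is than the old one."""
    return before_seconds / after_seconds


if __name__ == "__main__":
    new_zone = speedup(30 * 60, 15)    # 30 minutes -> 15 seconds
    new_server = speedup(2 * 60, 10)   # 2 minutes -> 10 seconds
    print(f"new zone: {new_zone:.0f}x faster")      # 120x
    print(f"new server: {new_server:.0f}x faster")  # 12x
```

On these figures, opening a new zone is roughly two orders of magnitude faster and opening a new server about one order of magnitude faster.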

Outlook for cloud native games

Although moving its games to cloud native has already achieved fruitful results, Guanying's cloud-native journey is not over. Sheng Hao, technical director of Guanying Cloud Platform, said: "Cloud native technology is booming. In the future, Guanying will embrace cloud native more comprehensively and work hand in hand with the OKG community. We plan to introduce chaos engineering and establish a fault self-healing system to further strengthen the platform's automated operation and maintenance capabilities. Through vertical scaling and dynamic allocation, we can further save resource costs while protecting the game experience of players on each regional server."

In fact, that future is not far away. OKG already supports custom fault definitions and can automatically execute operation and maintenance scripts in containers that are in specific states. Kubernetes 1.27 also introduced in-place vertical scaling of pod resources as an alpha feature, which is of great significance for resource allocation in regional server games. Perhaps the cloud-native era of gaming is right in front of us.
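The combination of user-defined faults and automatically executed scripts can be sketched as a list of rules evaluated against the processes inside a "rich container". This is an illustrative model only, not the OKG API: `QualityRule`, `evaluate`, the process names, the state label "Maintaining", and the script name are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QualityRule:
    """A user-defined fault: a probe plus what to do when it fails."""
    name: str
    probe: Callable[[dict], bool]   # returns True when healthy
    ops_state: str                  # state to expose when the probe fails
    action: Optional[str] = None    # hypothetical remediation script name


def evaluate(processes: dict, rules: list) -> list:
    """Run each rule against a snapshot of in-container process states.

    Returns one (rule name, ops state, action) tuple per failed rule.
    """
    findings = []
    for rule in rules:
        if not rule.probe(processes):
            findings.append((rule.name, rule.ops_state, rule.action))
    return findings


if __name__ == "__main__":
    rules = [
        QualityRule("gateway-alive",
                    lambda p: p.get("gateway") == "running",
                    "Maintaining", "restart_gateway.sh"),
        QualityRule("dbproxy-alive",
                    lambda p: p.get("dbproxy") == "running",
                    "Maintaining"),
    ]
    snapshot = {"gateway": "crashed", "dbproxy": "running"}
    for name, state, action in evaluate(snapshot, rules):
        print(f"{name}: mark server as {state}, run {action or 'alert only'}")
```

The point of the design is that the business, not the orchestrator, decides what counts as a fault and which response it triggers, so a process-level failure inside a single container becomes visible and actionable.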

Related Links:

Click here to view the official OpenKruiseGame documentation.


Origin my.oschina.net/u/3874284/blog/10467190