Dual-machine hot backup solution for social games to prevent single point of failure

 One night, the hard disk of a server with a single-disk configuration failed, and the service running on that server went down. That incident gave me the idea of developing a dual-machine hot-backup service, which then took shape gradually. For various reasons the source code cannot be provided here; this article covers only the design, the basic implementation approach, and the problems and challenges encountered during implementation. Thank you for your understanding!

  1. Stability thinking

  The servers discussed in this article are application servers written in C/C++ that play a middleware role, such as a DB Proxy (Cache) Server or the central server in a conventional MMORPG architecture. Such a server exists as a single point in the overall application, yet its role is extremely important. The question is how to cope with a crash caused by program bugs or uncontrollable factors, and how to minimize the resulting loss. This article introduces a hot-standby solution that has been implemented and shown to be feasible.

  Explanation of terms:

  HS: short for the dual-machine hot-standby technical architecture described in this article

  Master: the master server, that is, the server currently online

  Slave: the slave server, that is, the backup (offline) server

  Server: either a Master or a Slave

  In master-slave mode, only one server provides service at any given time. By default, the Slave does not process external requests; it only processes synchronization requests sent by the Master.

  The servers present a unified interface to the outside through LVS. If LVS detects that the Master is unresponsive or has stopped working, it sends a command that switches the Slave to server (online) mode and redirects all client requests to the Slave. At that point the Slave is promoted to Master. The working method is shown in Figure.1:









  Figure.1
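  The role switch described above can be sketched as a small state machine. This is only an illustration under assumptions: the names Role, Source, HotStandbyNode and on_promote_command are hypothetical and do not come from the HS source, and the real promote command sent by LVS is not described in the article.

```cpp
#include <cassert>

// Hypothetical names, for illustration only.
enum class Role { Master, Slave };
enum class Source { Client, MasterSync };

class HotStandbyNode {
public:
    explicit HotStandbyNode(Role initial) : role_(initial) {}

    Role role() const { return role_; }

    // Called when the load balancer (LVS in the article) orders this node online.
    void on_promote_command() { role_ = Role::Master; }

    // A Master serves client requests; a Slave only handles
    // synchronization requests coming from the Master.
    bool accepts(Source src) const {
        if (role_ == Role::Master) return src == Source::Client;
        return src == Source::MasterSync;
    }

private:
    Role role_;
};
```

  A node constructed as Role::Slave rejects client requests until on_promote_command() is called, after which it behaves as the Master.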

  3. The software architecture

  HS is designed as a general-purpose module that, in theory, any application server can integrate, as shown in Figure.2:



  Figure.2

  Since the HS module itself has no network transmission capability, integrating it also requires connecting the application server's network I/O interface to the HS module (under normal circumstances, the application server's network I/O module provides a separate interface for this).

  The overall architecture is shown in Figure.3. HS runs its own thread. All data passed asynchronously to the HS module is first queued and cached; the HS thread then transmits the queued data to the configured Slave in one batch. The same data is also written to the module's third-party storage (HS currently implements only in-memory storage).
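  A minimal sketch of the Master-side flow just described, assuming hypothetical names (Record, HsQueue, SendBatch, flush_once) that are not taken from the HS source. The application calls enqueue() from its own threads; the HS thread would call flush_once() in a loop. The transmit callback stands in for the application server's network I/O interface.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record layout: a key plus a serialized payload.
struct Record {
    std::string key;
    std::string value;
};

class HsQueue {
public:
    using SendBatch = std::function<void(const std::vector<Record>&)>;

    explicit HsQueue(SendBatch send) : send_(std::move(send)) {
        assert(send_ && "a transmit callback is required");
    }

    // Hook called by application threads; it only touches the in-memory queue.
    void enqueue(Record r) {
        std::lock_guard<std::mutex> lock(mu_);
        pending_.push_back(std::move(r));
    }

    // Meant to be called repeatedly by the HS thread. Sends everything queued
    // so far in one transmission and records it in the memory store.
    // Returns the number of records sent.
    std::size_t flush_once() {
        std::vector<Record> batch;
        {
            std::lock_guard<std::mutex> lock(mu_);
            batch.assign(pending_.begin(), pending_.end());
            pending_.clear();
            store_.insert(store_.end(), batch.begin(), batch.end());
        }
        if (!batch.empty()) send_(batch);  // network I/O happens outside the lock
        return batch.size();
    }

    std::size_t stored_count() const {
        std::lock_guard<std::mutex> lock(mu_);
        return store_.size();
    }

private:
    SendBatch send_;
    mutable std::mutex mu_;
    std::deque<Record> pending_;  // queue of data not yet transmitted
    std::vector<Record> store_;   // stand-in for the in-memory "third-party storage"
};
```

  Keeping the send call outside the lock means a slow network does not block the application threads that call enqueue().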

  After receiving the data transmitted by the Master, the Slave updates the relevant memory of the application server (the specific memory type can be determined according to the KEY sent by the Master), and also writes the received data into its own data buffer queue and third-party storage.
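  The Slave-side handling can be sketched in the same spirit. The names SyncRecord, SlaveApplier, register_handler and on_sync are hypothetical, and dispatching on an exact key match is an assumption; the article only says that the memory type is determined from the KEY sent by the Master.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct SyncRecord {
    std::string key;    // selects which application memory to update
    std::string value;  // serialized payload
};

class SlaveApplier {
public:
    using Handler = std::function<void(const std::string& value)>;

    // The application registers one update routine per key.
    void register_handler(const std::string& key, Handler h) {
        assert(h && "handler must be callable");
        handlers_[key] = std::move(h);
    }

    // Update application memory, then retain the record so that this node
    // can replicate it onward if it is ever promoted to Master.
    // Returns false for keys nobody registered (ignored in this sketch).
    bool on_sync(const SyncRecord& r) {
        auto it = handlers_.find(r.key);
        if (it == handlers_.end()) return false;
        it->second(r.value);
        queue_.push_back(r);
        store_.push_back(r);
        return true;
    }

    std::size_t retained_count() const { return store_.size(); }

private:
    std::map<std::string, Handler> handlers_;
    std::vector<SyncRecord> queue_;  // the Slave's own buffer queue
    std::vector<SyncRecord> store_;  // the Slave's own storage
};
```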

  Master and Slave can implement a complete hot-standby server by writing very little code (HOOK when sending, update when receiving).



  Figure.3

  This raises a question: why does the Slave also need to maintain a queue and third-party storage? Because the Slave may one day become a Master, and it will then need to synchronize data to a future Slave. The flowchart in Figure.4 illustrates the server startup process.



  Figure.4

  4. Modular

  HS can be compiled into dynamic library form (.so) separately for use.

  Some libraries of boost are used in the specific implementation, mainly io_service, thread, bind, shared_ptr, etc.

  However, Boost can be linked statically at build time, so no separate Boost dependency is needed at deployment.

  It should be noted that, in order to better support the operation of HS, a remote procedure call module has been developed, which will not be described in detail here.

  5. How to deal with the sudden large amount of data in network transmission

  In some extreme cases, the Master has already been running for a long time when the Slave starts, and the Master may have accumulated a large amount of data (more than 4 GB). Transmitting it without any limit takes up a lot of network card resources (although with dual network cards this generally does not affect the external service) and, most importantly, a lot of network thread resources, because HS shares network threads with the application server.

  In actual testing, transmitting 4 GB of data made up of small pieces of varying size took more than 1 minute. To avoid congesting the application server's network threads, flow control was added to the data transmission: only a specified amount of data is transmitted in each interval. In our tests this resolved the problems described above.
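  The flow control idea (a fixed amount of data per interval) might look like the sketch below. ThrottledSender, bytes_per_tick and on_tick are hypothetical names, and counting the budget in bytes of whole chunks is an assumption; the article does not say how the "specified amount" is measured.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

class ThrottledSender {
public:
    using Send = std::function<void(const std::string&)>;

    ThrottledSender(std::size_t bytes_per_tick, Send send)
        : budget_(bytes_per_tick), send_(std::move(send)) {
        assert(budget_ > 0 && send_);
    }

    void enqueue(std::string chunk) { backlog_.push_back(std::move(chunk)); }

    // Call once per timer interval. Sends whole chunks until the next one
    // would exceed the per-interval budget; returns the bytes sent this tick.
    std::size_t on_tick() {
        std::size_t sent = 0;
        while (!backlog_.empty()) {
            const std::string& next = backlog_.front();
            // The first chunk of a tick always goes out, so a chunk larger
            // than the budget cannot stall the backlog forever.
            if (sent > 0 && sent + next.size() > budget_) break;
            send_(next);
            sent += next.size();
            backlog_.pop_front();
        }
        return sent;
    }

    bool idle() const { return backlog_.empty(); }

private:
    std::size_t budget_;
    Send send_;
    std::deque<std::string> backlog_;
};
```

  With a budget of 10 bytes and chunks of 6, 6 and 3 bytes, the first tick sends 6 bytes and the second sends 9, so the backlog drains over several intervals instead of in one burst.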

  What is the use of third-party storage?

  It stores the historical backup data that accumulates while no Slave is connected, either because the Slave has disconnected or because the Master ran for some time before the Slave connected. In other words, when a Slave connects after a long absence, all the data produced in the meantime is held in third-party storage. After the Slave connects, the data in third-party storage must be synchronized first; once that is complete, the data in the queue is synchronized in real time.
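  The ordering rule (stored history first, live queue afterwards) can be sketched as follows. CatchUpSync and its methods are hypothetical names. As a simplification, this sketch routes updates to the history only while no Slave is connected and drops entries once they are sent; the article does not specify either detail.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class CatchUpSync {
public:
    using Send = std::function<void(const std::string&)>;

    void on_slave_connected() { slave_connected_ = true; }
    void on_slave_disconnected() { slave_connected_ = false; }

    // Updates produced while no Slave is connected go to the history store;
    // updates produced while one is connected go to the live queue.
    void on_update(std::string u) {
        std::deque<std::string>& dst = slave_connected_ ? live_queue_ : history_;
        dst.push_back(std::move(u));
    }

    // One step of the HS thread: the history is drained completely before
    // anything from the live queue is sent. Returns false when idle.
    bool send_next(const Send& send) {
        std::deque<std::string>& src = history_.empty() ? live_queue_ : history_;
        if (!slave_connected_ || src.empty()) return false;
        send(src.front());
        src.pop_front();
        return true;
    }

private:
    bool slave_connected_ = false;
    std::deque<std::string> history_;     // stand-in for third-party storage
    std::deque<std::string> live_queue_;  // real-time synchronization queue
};
```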

  6. The socket's ghostly keepalive property

  To detect an unplugged network cable or other unpredictable network anomalies, HS needs a mechanism that can tell whether a connection is really broken or only appears to be broken (for example, when the network is congested for a short time). The obvious candidate is the socket KEEPALIVE option, but testing showed that it is not a good checking mechanism, or at least not the one we need, because KEEPALIVE relies on three system properties, as follows:

  system settings

  # cat /proc/sys/net/ipv4/tcp_keepalive_time

  7200

  # cat /proc/sys/net/ipv4/tcp_keepalive_intvl

  75

  # cat /proc/sys/net/ipv4/tcp_keepalive_probes

  9

  In brief, these three settings mean that if nothing is received from the client for 7200 seconds, the TCP/IP stack starts sending keepalive probes at 75-second intervals; if 9 consecutive probes go unanswered, the connection is terminated.
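  As a worked example of what the defaults shown above imply: on Linux, a dead peer is noticed only after the idle time plus one probe interval per allowed probe. The helper name below is hypothetical and the result is an approximation of kernel behaviour.

```cpp
#include <cassert>

// Approximate seconds between the last data received and the moment the
// kernel gives up on an unresponsive peer.
constexpr long keepalive_detection_seconds(long idle_time, long interval, long probes) {
    return idle_time + interval * probes;
}

// 7200 + 75 * 9 = 7875 seconds, roughly 2 hours 11 minutes.
static_assert(keepalive_detection_seconds(7200, 75, 9) == 7875,
              "detection delay with the default settings");
```

  A delay of more than two hours is far too long for a hot-standby switch, which is consistent with the decision not to rely on the defaults.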

  There are two reasons for not using this option: first, these values are system-level settings, so changing them affects the whole system; second, it proved very unstable in actual testing.

  The following is the code for setting keepalive:

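  #include <sys/socket.h>

  int optval = 1;  /* non-zero enables keepalive on this socket */
  socklen_t optlen = sizeof(optval);
  setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &optval, optlen);  /* sockfd: the connected TCP socket */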
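  For reference, Linux also allows the three values to be overridden for a single socket through the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options. The sketch below is Linux-specific and is not part of the HS implementation; the function name enable_keepalive is hypothetical, and the article does not say whether per-socket tuning would have avoided the instability seen in testing.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enables keepalive on sockfd and overrides the three system defaults for
// this socket only. Returns true when every setsockopt call succeeds.
bool enable_keepalive(int sockfd, int idle_secs, int interval_secs, int probes) {
    int on = 1;
    return setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
           setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) == 0 &&
           setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) == 0 &&
           setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) == 0;
}
```

  For example, enable_keepalive(fd, 60, 10, 3) asks the kernel to start probing after 60 idle seconds and to give up after 3 unanswered probes sent 10 seconds apart.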

  A more effective heartbeat scheme:

  We implemented our own. The scheme is fairly crude; a brief description follows:

  The server maintains two variables, A and B, and a timer for each client connection. Every time a request packet is received from the client, A is incremented by 1. When the timer expires, the server checks whether A and B are equal. If they differ, the client is active: the server copies A to B and starts the next timer round. If a connection shows no activity for N consecutive timer rounds, the client is considered disconnected and the connection is closed.
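  The counter scheme above can be sketched per connection as follows. HeartbeatMonitor and its method names are hypothetical; the timer itself is left to the caller (for example, the application server's existing timer facility), which calls on_timer() at each expiry.

```cpp
#include <cassert>
#include <cstdint>

class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(int max_idle_rounds)
        : max_idle_rounds_(max_idle_rounds) {
        assert(max_idle_rounds_ > 0);
    }

    // Call for every request packet received on the connection (variable A).
    void on_packet() { ++received_; }

    // Call each time the timer expires. Returns false once the connection
    // has been idle for N consecutive rounds and should be closed.
    bool on_timer() {
        if (received_ != last_seen_) {  // A != B: traffic since the last check
            last_seen_ = received_;     // synchronize A to B
            idle_rounds_ = 0;
        } else {
            ++idle_rounds_;
        }
        return idle_rounds_ < max_idle_rounds_;
    }

private:
    std::uint64_t received_ = 0;   // A
    std::uint64_t last_seen_ = 0;  // B
    int idle_rounds_ = 0;
    int max_idle_rounds_;          // N
};
```

  Because any packet resets the idle count, a short period of congestion does not close the connection; only N silent rounds in a row do.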

  7. OVER

  [Editor's note]

  We focus on the development and operation of social games. As we have become more familiar with the social game business, we have continually summarized and analyzed our experience. To speed up the development of social game products, our technical team has done a great deal of research, analysis and experimentation, and has developed the following products:

  1) Data middleware: handles high user concurrency and large data volumes, so that game development engineers do not need to care about data access;

  2) Task server: handles the organization of the many activities and reward events in game products, as well as the configuration and management of the game's own tasks;

  3) Statistics server: the volume of player operation logs is too large to store on the server, so we selectively store the relevant player behavior logs and carry out the data analysis;

  4) Friends list server: social game products connect to many social networking platforms, which requires a general-purpose friends list server;

  5) Other game engines, and so on.

  With these general-purpose game engines in place, we have gradually gained more time and energy for more meaningful work. For example, we can calmly study and solve the problem of these servers running as single points, and build a general-purpose master/slave hot-standby component, so that other game engine components can eliminate their single point of failure by adopting it. This general-purpose component may also offer some technical reference value for hot standby of server-side programs outside the social game industry. Thank you @ZEROV17 for your support; technical friends are welcome to leave a message on the site or to contact us directly on Sina Weibo. We are also very grateful to ZEROVF for its contribution to this project!

