The "bloody case" caused by changing 300 documents: a post-mortem of a billion-document core MongoDB cluster becoming partially unavailable

        The online data volume of this core MongoDB cluster is not huge: the single largest collection holds about one billion documents, but the cluster is business-critical and directly affects the company's revenue stream. This article shares the entire failure process, a classic MongoDB sharded-cluster pitfall involving missing change notifications, deployment-architecture problems, and insufficient care when making changes.

About the author

      Former expert engineer at Didi Chuxing, now responsible for MongoDB and document-database R&D at OPPO. Future posts will continue the series "MongoDB kernel source code design, performance optimization, and best operation and maintenance practices". GitHub: https://github.com/y123456yz

Preamble

       This article is the 20th in the oschina column "mongodb source code implementation, tuning, best practice series"; the other articles in the series are listed below:

QCon: Tens-fold performance improvement of a trillion-level MongoDB database cluster and multi-datacenter active-active disaster recovery practice

QCon Modern Data Architecture: detailed answers to the 17 core questions from "Tens-fold performance improvement and optimization practice of a trillion-level MongoDB database cluster"

Optimization practice: tens-fold performance improvement of a million-level high-concurrency MongoDB cluster (Part 1)

Optimization practice: tens-fold performance improvement of a million-level high-concurrency MongoDB cluster (Part 2)

Tens-fold MongoDB performance improvement in specific scenarios: optimization practice (a post-mortem of a core MongoDB cluster avalanche failure)

Common high-concurrency network thread model designs and MongoDB thread model optimization practice

Why do secondary development on the open-source MongoDB database kernel

2020 in review | Doing my part to promote the adoption of the distributed database MongoDB in China

Experience sharing: reading the million-line MongoDB kernel source code

Topic discussion | MongoDB has ten core advantages; why is its popularity in China far lower than MySQL's?

MongoDB network module source code implementation and extreme performance design experience

Network transport layer module source code implementation (Part 2)

Network transport layer module source code implementation (Part 3)

Network transport layer module source code implementation (Part 4)

Command processing module source code implementation (Part 1)

Command processing module source code implementation (Part 2)

Implementation principles of MongoDB detailed table-level operation and latency statistics (for quickly locating table-level latency jitter)

[Illustrated analysis with diagrams, text and code] MongoDB write (insert, delete, update) module design and implementation

Everything you need to build a MongoDB cluster: replica-set mode, sharded mode, with and without authentication (with detailed step-by-step instructions)

The "bloody case" caused by changing 300 documents: a post-mortem of a billion-document core MongoDB cluster becoming partially unavailable

  1. Problem background

     While sorting out the risks of all our current MongoDB clusters, we found that a core legacy cluster (one that already existed before I joined the company) carried some potential jitter risks. Its architecture and traffic/latency curves are shown below:

      As the figure above shows, the sharded cluster consists of 3 shards. The read/write traffic of the cluster is not high: peak QPS is roughly 40,000-60,000/s and average latency is about 1 ms. Each shard is a MongoDB replica set for high availability. The inspection revealed the following issues:

  1. The cluster contains only two user databases, userbucket and feeds_content (referred to below as feeds_xxxxxxx). Of the two, only the feeds_xxxxxxx.collection1 collection has sharding enabled; the userbucket database stores the business's routing information, and the feeds_xxxxxxx database stores about one billion documents;
  2. The workload is read-heavy, and the reads target data in the feeds_xxxxxxx database. Because the client separates reads from writes, almost all of the read traffic lands on shard 1, while shard 2 and shard 3 hold only a small amount of data.

      The database and collection layout is shown in the following table:

    The above description can be summed up in the following figure:

       As the figure shows, shard 2 and shard 3 play almost no role. In addition, two of shard 3's nodes use low-IO SATA disks, which could affect reads and writes to the userbucket database, so we decided to remove shard 3 and shard 2 from the cluster with removeShard.
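Before removing anything, the shard and database layout can be confirmed from any mongos. A minimal sketch (the example output is illustrative, not the cluster's real metadata):

    // run in the mongo shell connected to a mongos
    sh.status()            // lists shards, databases, each database's primary shard,
                           // and the chunk distribution of feeds_xxxxxxx.collection1
    use config
    db.databases.find()    // e.g. { "_id" : "userbucket", "primary" : "shard_8D5370B4", "partitioned" : false }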

2. Operation process

       Since shard 3 runs on low-IO servers, it is a potential source of cluster jitter; shard 2 and shard 3 are also essentially wasted capacity. They can therefore be removed with the removeShard command to free up the idle server resources, as shown in the following figure:

  • Step 1: Log in to any mongos proxy, say mongos1.
  • Step 2: Running removeShard reports an error because shard 3 (the shard_8D5370B4 shard) is the primary shard of the userbucket database; the prompt "you need to drop or movePrimary these databases" means the database's primary shard must first be migrated to another shard.
  • Step 3: Use the movePrimary command to migrate the primary shard of the userbucket database from shard 3 to shard 1.
  • Step 4: Log in to the other two proxies in the monitoring list, mongos2 and mongos3, and force a routing refresh with db.adminCommand({"flushRouterConfig":1}). (The commands for these steps are sketched just below this list.)
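A minimal sketch of steps 1-3 as mongo shell commands, run on mongos1. The name of shard 1 is a placeholder, since the article only gives the name of shard 3 (shard_8D5370B4):

    use admin
    db.adminCommand({ removeShard: "shard_8D5370B4" })       // steps 1-2: fails at first with
                                                              // "you need to drop or movePrimary these databases"
    db.adminCommand({ movePrimary: "userbucket", to: "<shard1-name>" })   // step 3
    db.adminCommand({ removeShard: "shard_8D5370B4" })       // re-issue: the drain can now proceed,
                                                              // poll until "state" is "completed"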

 Note: during movePrimary, the other mongos proxies do not detect that the database's primary shard has changed, so their routing information must be forcibly refreshed, or those mongos restarted; refer to the following:
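A minimal sketch of that refresh (step 4), run against each of the other proxies one by one; restarting the mongos process has the same effect:

    // connect to mongos2, mongos3, ... in turn
    db.adminCommand({ flushRouterConfig: 1 })   // drop the cached routing table; it is reloaded
                                                // from the config servers on the next request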

3. Users report that a large portion of business requests fail

       Shortly after this change to the userbucket database, which holds only 300 documents, while I was casually working on performance tuning of another cluster, the user suddenly called me in a panic: access to the entire core cluster had become unavailable (note: they reported the whole cluster holding the 1 billion documents as unavailable).

      The call was completely unexpected. After going through the details with the business team, we basically confirmed that the failure was caused by the change to these 300 documents: when the business reads these 300 documents, some requests succeed and some fail, which pointed squarely at movePrimary.

      So, in addition to forcing a routing refresh with flushRouterConfig on every proxy in the monitoring list, all of those proxies were also restarted, yet the business still reported that some requests could not retrieve the data. This was puzzling: querying the 300 documents in the userbucket database through every one of those mongos proxies returned the data in full.

      I then suspected that some mongos proxy had never refreshed its routing table, so I logged in to one of the proxies and queried the config.mongos collection; the results were as follows:

The config.mongos collection records every mongos proxy known to the cluster, together with the timestamp of each proxy's last ping against the cluster. Clearly, the proxies recorded in this collection were not just those in the cluster monitoring list; there were more of them than the monitoring list contained.
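A sketch of how to list every proxy the cluster has ever seen, with its last ping time (the hostnames in the comments are purely illustrative):

    use config
    db.mongos.find({}, { _id: 1, ping: 1 }).sort({ ping: -1 })
    // { "_id" : "mongos1.example.net:20000", "ping" : ISODate("...") }
    // { "_id" : "mongos-old-07:20000", "ping" : ISODate("...") }   <- alive, but missing from the monitoring list
    // every entry with a recent "ping" is a live proxy that must also be
    // flushRouterConfig'ed or restarted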

Finally, after forcing a routing refresh with flushRouterConfig on every currently online proxy listed in the config.mongos collection, the business recovered.

4. Summary of the problem

       From the analysis above we can conclude that some proxies had been missed in the early cluster monitoring setup. Those proxies still held the userbucket routing information from before movePrimary, i.e. it pointed at the wrong shard, so data read through them could not be found, as shown in the following figure:

  • Why did some reads of the collection in the userbucket database succeed while others failed?

    Because some proxies never had their routing information for that collection forcibly refreshed after movePrimary, requests that happened to go through those proxies were routed to the wrong shard.

  • Why did incorrect routing for just 300 documents make part of the entire 1-billion-document cluster unavailable?

This comes down to the business logic: before reading any of the 1 billion documents, the application must first look up its own routing information, which happens to be stored in a collection of the userbucket database. If that userbucket lookup fails, the application cannot determine which feeds_xxxxxxx data to read.
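A minimal sketch of that two-step read path, with hypothetical collection and field names (the real schema is not given in the article):

    // step 1: look up the business routing entry in userbucket (only ~300 documents,
    //         but every read depends on it)
    var route = db.getSiblingDB("userbucket").bucket_route.findOne({ userId: "u123" })   // hypothetical names
    // step 2: only with that entry can the application locate the corresponding
    //         document among the ~1 billion in feeds_xxxxxxx
    var doc = db.getSiblingDB("feeds_xxxxxxx").collection1.findOne({ _id: route.feedId })
    // if step 1 goes through a mongos with a stale (pre-movePrimary) routing cache,
    // it fails and step 2 never happens -- which is why a 300-document collection
    // took down access to part of the whole data set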

  • Why were some proxies missed when restarting or forcing the routing refresh?

For historical reasons, some proxies were configured in the business code but their metadata was never added to the server-side cluster monitoring, so those proxies were simply not being monitored. It is also possible that mongos proxies were added later during a capacity expansion without registering their metadata in the cluster monitoring list.

  • What is the safest way to perform a movePrimary operation?

The official recommendation is that after movePrimary completes, the routing information must be forcibly refreshed or the mongos restarted. However, there is a window between movePrimary completing and the mongos being refreshed or restarted; if the business reads or writes unsharded collections in the database being migrated during that window, requests may still be misrouted. Therefore, the safest movePrimary can be done in one of two ways:

Method 1: Shut down all proxies except one, run movePrimary through that remaining proxy, and restart the other mongos proxies only after it succeeds. Make sure no proxy is missed; to avoid the pitfall described in this article, check the config.mongos collection in advance.

Method 2: Enable sharding on every collection of the database whose primary shard needs to be removed with removeShard. Once sharding is enabled, the collections have chunk metadata, and MongoDB automatically migrates the chunks off the drained shard to the other shards; this process keeps the routing information consistent (a command sketch follows).
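A minimal sketch of method 2, assuming the userbucket collection is called collection0 (a hypothetical name) and that a hashed _id shard key is acceptable for it:

    sh.enableSharding("userbucket")
    sh.shardCollection("userbucket.collection0", { _id: "hashed" })   // hypothetical collection name
    // with all collections sharded, their routing is chunk-based rather than
    // primary-shard-based, so the balancer migrates the data and every mongos
    // sees consistent routing metadata
    db.adminCommand({ removeShard: "shard_8D5370B4" })                // drain shard 3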
