Data expiration performance optimization practice in a ten-billion-scale MongoDB cluster

      A ten-billion-document MongoDB business keeps only the most recent 7 days of data. Because the data volume and traffic are both large, the expired-data deletions are highly concentrated in time and cannot be handled simply by shifting them off-peak. How to meet the business requirements at the lowest possible hardware cost therefore became the main difficulty in optimizing this cluster's performance.

      After several rounds of tuning together with the business team, covering storage engine tuning, the data deletion strategy, and off-peak reads and writes, the business pain points were fully resolved and millisecond-level read and write latency was achieved.

About the author

       Former expert engineer at Didi Chuxing, currently responsible for MongoDB and document database research and development at OPPO. Follow-up articles will continue to share "MongoDB kernel source code design, performance optimization, best operation and maintenance practices". GitHub: https://github.com/y123456yz

Preamble

       This article is the 24th in the OSChina column "mongodb source code implementation, tuning, best practice series". The other articles in the series are listed below:

Qcon-Trillion-level database MongoDB cluster performance has been improved by dozens of times and the practice of multi-active disaster recovery in the computer room

Qcon Modern Data Architecture - "Trillion-level database MongoDB cluster performance improvement and optimization practice dozens of times" core 17 detailed answers

The optimization practice of improving the performance of millions of high-concurrency mongodb clusters by dozens of times (Part 1)

The optimization practice of improving the performance of a million-level high-concurrency mongodb cluster by dozens of times (Part 2)

The performance of Mongodb in specific scenarios has been improved by dozens of times. Optimization practice (remember a mongodb core cluster avalanche failure)

Commonly used high concurrency network thread model design and mongodb thread model optimization practice

Why do secondary development of the open source mongodb database kernel

Inventory 2020 | I want to do something for the promotion and adoption of the distributed database MongoDB in China

 Million-level code amount mongodb kernel source code reading experience sharing

Topic Discussion | mongodb has ten core advantages, why is the domestic popularity far less than mysql?

Mongodb network module source code implementation and performance extreme design experience

Network transport layer module source code implementation (Part 2)

Network transport layer module source code implementation (Part 3)

Network transport layer module source code implementation (Part 4)

Command processing module source code implementation (Part 1)

Command processing module source code implementation (Part 2)

Implementation principle of mongodb detailed table-level operations and detailed delay statistics (quick positioning table-level delay jitter)

[Analysis of the coordination of pictures, text and codes]-Mongodb write (add, delete, modify) module design and implementation

It is enough to build a Mongodb cluster - replica set mode, sharding mode, with authentication, without authentication, etc. (with detailed step instructions)

The bloody case caused by 300 data changes - a record of a billion-level core mongodb cluster where some requests are unavailable and a fault to step on the pit

Remember billion-level Es data migration mongodb cost saving and performance optimization practice

Cost saving and performance optimization practice of 100 billion-level data migration mongodb

Remember the cost saving and performance optimization practice of migrating mongodb for a 100 billion-level IOT business (with performance comparison questions and answers)

Data expiration performance optimization practice in a ten-billion-scale MongoDB cluster

  1. Business background

      The data volume of an online business is about 10 billion documents. The daytime peak write traffic is about 14W/s (140,000 writes per second), as shown in the following figure:

      The business writes data throughout the day and, in the early morning, pulls the last few days of data in batches for big data analysis. The cluster keeps only the last seven days of data. A single document is about 800 bytes, for example:

{
    "_id" : ObjectId("608592008bd3dad61675b491"),
    "deviceSn" : "xxxxxxxxxxxxx",
    "itemType" : 0,
    "module" : "xxxxxx",
    "userId" : "xxxxx",
    "callTimes" : NumberLong(2),
    "capacityAdd" : NumberLong(0),
    "capacityDelete" : "xxxxxxxx",
    "capacityDownload" : "xxxxxxxxxxxx",
    "capacityModify" : "xxxxxxxxxxxx",
    "createTime" : NumberLong("1619366400003"),
    "expireAt" : ISODate("2021-05-02T22:53:45.497Z"),
    "numAdd" : NumberLong(2),
    "numDelete" : NumberLong(345),
    "numDownload" : NumberLong(43),
    "numModify" : NumberLong(3),
    "osVersion" : "xxxx",
    "reversedUserId" : "xxxxx",
    "updateTime" : NumberLong("1619366402106")
}
  2. MongoDB resource evaluation and deployment architecture

      Working through the requirements with the business team, the cluster scale and business requirements were summarized as follows:

  • Total data volume of about 10 billion documents
  • A single document is about 800 bytes, so 10 billion documents are expected to take about 7.5TB
  • Read/write separation
  • All data is kept for only seven days

2.1 MongoDB resource evaluation

      The selection and evaluation process for the number of shards and the storage node specifications is as follows:

  • Memory evaluation

      Our company deploys everything in containers. Based on operational experience, MongoDB memory consumption is not high; for historical clusters above the 10-billion scale, the maximum memory of a single container is basically 64GB, so the memory specification was set to 64GB.

  • Shard count evaluation

      The number of shards was sized according to the peak business write traffic.

  • Disk evaluation

    10 billion documents at roughly 800 bytes each come to about 7.5TB. Since MongoDB compresses data on disk by default with a fairly high compression ratio, the real disk usage is expected to be about 2.5TB to 3TB, which works out to roughly 1TB per shard across three shards.

  • CPU specification evaluation

      Due to the limitations of the container scheduling packages, the CPU can only be limited to 16 cores (in practice not that many cores are actually used).

  • Mongos proxy and config server specification evaluation

       A sharded cluster also needs mongos proxies and a config server replica set, so their node specifications must be evaluated as well. The config server mainly stores routing-related metadata, so it consumes very little disk, CPU, and memory; the mongos proxy only consumes CPU for routing and forwarding, so its memory and disk consumption are also low. To maximize cost savings, we decided to have one mongos proxy and one config server node share the same container, with the following specifications:

       8 CPU / 8GB memory / 50GB disk, with one mongos proxy and one config server node sharing the same container.

Summary of shard storage node specifications: 4 shards, each with 16 CPU / 64GB memory / 1TB disk.

Summary of mongos and config server specifications: 8 CPU / 8GB memory / 50GB disk.

2.2 Cluster deployment architecture

      The business data is not highly important, so to save costs we adopted a 2+1 deployment mode for each shard replica set, that is, 2 mongod + 1 arbiter, deployed in the same-city data center. The deployment architecture is shown in the following figure:

       Given the low importance of the data, the 2 mongod + 1 arbiter mode meets the user's requirements while maximizing cost savings.

3. Performance optimization process

       The cluster was optimized in two phases: performance optimization before the business started using the cluster, and performance optimization while the business was using it.

       The business created the optimal indexes for its queries in advance, and also created a TTL index for data expiration:

db.xxx.createIndex( { "createTime": 1 }, { expireAfterSeconds: 604800} )
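       As background (a general MongoDB behavior, not something stated in the original article): the TTL monitor that acts on this index wakes up every 60 seconds by default, and it can be switched on and off at runtime, which is one possible way to move expiration work off-peak:

db.adminCommand({ setParameter: 1, ttlMonitorEnabled: false })  // pause TTL deletions
db.adminCommand({ setParameter: 1, ttlMonitorEnabled: true })   // resume TTL deletions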

3.1 Performance optimization before the business uses the cluster

       Communication with the business confirmed that every document carries a device/user identifier userId, and queries and updates fetch or modify a single document or a batch of documents under that userId, so userId was chosen as the shard key.

  • Sharding method

       To distribute the data evenly across the 3 shards, hashed sharding was chosen. This hashes the data as evenly as possible while still ensuring that all documents with the same userId land on the same shard, which preserves query efficiency.

  • Pre-sharding

       When a collection uses hashed sharding, its chunks can be pre-split in advance, which ensures data is written to all shards in a balanced way from the start. Pre-splitting avoids the chunk migrations that would otherwise occur and maximizes write performance.

sh.shardCollection("xxx_xx.xxx", {userId:"hashed"}, false, { numInitialChunks: 8192} )
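       One detail worth noting (an assumption about the setup, since the article only shows the shardCollection command): sharding must be enabled on the database before the collection can be sharded. The database name "xxx_xx" below is taken from the masked example above:

sh.enableSharding("xxx_xx")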

  • Secondary reads (read nearby)

       The client adds the secondaryPreferred read preference so that reads go to secondary nodes first.
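       A minimal illustration of this setting (the hostnames and database name below are placeholders, not the business's real configuration):

// In the mongo shell, for the current connection:
db.getMongo().setReadPref("secondaryPreferred")

// Or in the connection string used by the client driver:
// mongodb://mongos1:27017,mongos2:27017/xxx_xx?readPreference=secondaryPreferred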

  • Disable enableMajorityReadConcern

       After this feature is disabled, read operations that request ReadConcern "majority" will report an error. ReadConcern "majority" exists to avoid dirty reads; after confirming with the business that it does not need this guarantee, the feature can simply be turned off.

MongoDB enables enableMajorityReadConcern by default, which has a certain performance impact. For details, refer to:

MongoDB readConcern principle analysis

OPPO million-level high-concurrency MongoDB cluster performance dozens of times improved optimization practice
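       For illustration, disabling it is typically done in the mongod configuration file of the shard replica set members (a sketch, applicable to MongoDB versions of the 3.6/4.0 era where the option is still configurable):

replication:
  enableMajorityReadConcern: false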

  • Storage engine cacheSize specification selection

       Single container specification: 16 CPU / 64GB memory / 7TB disk. Considering the memory pressure and memory fragmentation during the full data migration, cacheSize was set to 42GB to avoid OOM.
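       For reference, the corresponding setting in the mongod configuration file would look like the sketch below (the actual deployment files are not shown in the article):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 42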

3.2 Problems encountered during business use and performance optimization

3.2.1 First round of optimization: storage engine tuning

       During the business peak the workload is mainly data writes and updates, which leaves a large amount of dirty data in the cache. When the dirty-data ratio reaches a certain threshold, the threads serving user reads and writes are blocked and made to help evict dirty pages from the cache themselves, and write performance ultimately drops significantly.

       The wiredtiger storage engine configuration options related to the cache eviction strategy are as follows:
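       For reference, the four eviction thresholds (the values shown are the usual WiredTiger defaults, given here as an assumption rather than taken from the article) can also be adjusted at runtime in the same way as the thread settings shown later:

// eviction_target        = 80 : % of cache used at which background eviction threads start working
// eviction_trigger       = 95 : % of cache used at which application threads also start evicting
// eviction_dirty_target  = 5  : % of dirty data at which background eviction of dirty pages starts
// eviction_dirty_trigger = 20 : % of dirty data at which application threads also evict dirty pages
db.adminCommand({
  setParameter: 1,
  wiredTigerEngineRuntimeConfig:
    "eviction_target=80,eviction_trigger=95,eviction_dirty_target=5,eviction_dirty_trigger=20"
})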

      Since the business's full data migration is a continuous, high-traffic write load rather than a short burst, tuning eviction_target, eviction_trigger, eviction_dirty_target, and eviction_dirty_trigger is of limited use here; adjusting these thresholds only helps for short bursts of traffic.

       For a continuous, long-lasting high-traffic write load, however, the blocking of user requests caused by a high dirty-data ratio can be addressed by increasing the number of background threads of the wiredtiger storage engine, so that the work of evicting dirty pages is ultimately handled by the evict module's background threads.

       The storage engine tuning for this full, continuous high-traffic write load is as follows:

db.adminCommand( { setParameter : 1, "wiredTigerEngineRuntimeConfig" : "eviction=(threads_min=4, threads_max=20)"})

3.2.2 Problems after the first round of optimization

       After the number of storage engine background threads was tuned, the write and update bottleneck was resolved and write/update latency became smooth. However, once data started to expire after one week, business writes began to jitter heavily, as shown below:

        As can be seen from the figure above, the average latency even reached several hundred ms in extreme cases, which is completely unacceptable for this business. mongostat monitoring revealed the following:

  • Primary node mongostat statistics

       As can be seen from the monitoring above, the primary node of each of the three shards shows only about 4,000 write/update operations per second, so the reported write traffic looks very low (note: there are actually about 40,000 delete operations per second as well. Expired-data deletions on the primary are not counted as delete commands and can only be observed on the secondaries; see the analysis below for details. The real write plus delete load on a single shard's primary is about 4.4W/s). At the same time, the dirty-data ratio in the monitoring stays above 20%; once it exceeds 20%, business request threads have to evict dirty pages themselves, which ultimately blocks business requests and causes the jitter.

       From the previous analysis, the normal business peak is about 10W/s of writes across the 3 shards. One week after go-live, the data written seven days earlier needs to expire, so expired-data deletes also have to run at about 10W/s. Since new data is still being written at 10W/s at the same time, the cluster has to sustain about 20W/s of operations in total.

       Obviously, 3 shards cannot sustain a continuous load of about 20W/s, so how to support the business requirements without scaling out became the main difficulty.

  • Why TTL expiration on the primary does not show up in the delete statistics

        The figure above shows the MongoDB single-node module architecture. The primary node starts a TTLMonitor thread by default, which uses the query module to scan the TTL index in real time and then deletes the documents that meet the expiration condition. This deletion path does not go through the command processing module, and command counters are only incremented inside that module, so TTL deletions on the primary never appear as delete operations.
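      As a supplementary check (not from the original article), the primary's real TTL deletion volume can be observed through serverStatus rather than through the command counters:

// Number of documents removed by the TTL monitor and number of TTL passes so far
db.serverStatus().metrics.ttl
// { "deletedDocuments" : NumberLong(...), "passes" : NumberLong(...) }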

     For details on the processing flow and counting logic of the command processing module, refer to the mongodb kernel source code modular design and implementation column.

      For more on MongoDB's modular source code design and implementation, see:

https://github.com/y123456yz/reading-and-annotate-mongodb-3.6

 

3.3 Second round of optimization (shift expired deletions to the early-morning low-peak period): not feasible

       As shown by the previous analysis, the three shards cannot sustain a continuous 20W/s of update and delete operations, so we considered changing the way the business uses the cluster and moving the expired-data deletions to the early-morning low-peak period.

       However, other problems appeared after this went live. The business continuously pulls and analyzes the last few days of data in batches in the early morning; when the expiration deletions and these batch reads are superimposed, business query efficiency suffers badly. This solution therefore turned out to be infeasible, and another solution was needed that still did not require scaling out.

3.4 Third round of optimization (one collection per day, drop the collection when it expires)

       To avoid increasing cost, and given that the 3 shards cannot support roughly 20W/s of reads, writes, and deletes, we changed the approach so the requirements could be met without expanding beyond the 3 shards: instead of letting individual documents expire, whole collections are dropped. The specific implementation is as follows (see the sketch after this list):

  • The business code was changed to create one collection per day and store each day's data in the corresponding collection.
  • After each collection is created, the business enables pre-splitting so that the data is distributed evenly across the three shards.
  • The business keeps 8 days of data and drops the first day's collection in the early morning of the ninth day.
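     A minimal sketch of this per-day collection scheme (the database name, collection name pattern, and dates below are assumptions for illustration, not the business's real code):

// Create and pre-split today's collection, hashed on userId as in section 3.1
var dbName = "xxx_xx";
var today = "20210501";      // collection suffix, e.g. formatted as YYYYMMDD
var expired = "20210423";    // data is kept 8 days; this day's collection is dropped on day 9
sh.shardCollection(dbName + ".xxx_" + today, { userId: "hashed" }, false, { numInitialChunks: 8192 });

// Dropping a whole collection is far cheaper than deleting tens of billions of documents via TTL
db.getSiblingDB(dbName).getCollection("xxx_" + expired).drop();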

     The corresponding service latency statistics after this optimization are shown below:

        As shown in the figure above, by changing the expiration method the problem was completely solved, and read/write latency is kept within 0.5ms to 2ms.

4. Optimization summary

      Through the series of optimizations above, the latency jitter caused by data expiration on top of a large volume of data writes and updates was solved without any capacity expansion in the end. The overall benefits are as follows:

  • Latency before optimization: frequent jitter of several hundred ms
  • Latency after optimization: 0.5ms to 2ms

 

 
