Best Practices for Migrating Large MongoDB Databases to Amazon DocumentDB Elastic Cluster


01

Background

Document databases are now widely used thanks to their flexible schemas and access patterns close to those of relational databases. Customers in gaming, internet finance, and other industries have built a large number of applications on MongoDB: game companies use it to store player attribute information, and stock-trading apps use it to store time-series market data. As the business grows over time, the MongoDB database grows with it, and managing a very large database becomes an unavoidable problem.

Generally speaking, there are several options for governing a large database:

▪ Hot/cold separation. Classify data into hot, warm, cold, and frozen tiers by access frequency; move cold data older than a threshold to a separate cold or low-cost storage database, and keep only recently and frequently accessed data in hot storage.

▪ Vertical splitting. A large system usually has multiple collections; split them vertically by module, so that the collections of different modules live in different databases, separating both data volume and traffic.

▪ Horizontal splitting. For example, hash a field such as userid and split a large collection horizontally across multiple databases, scaling out overall storage and compute capacity.

▪ Deletion. Some historical data has fulfilled its mission and reached the end of its life cycle, so it can simply be deleted.

These four solutions each have advantages and disadvantages, and the choice depends on the actual business scenario. In many scenarios customers choose horizontal sharding, mainly for the following reasons:

▪ Many businesses need to query historical data frequently, and horizontal sharding does not require deleting or separating historical data;

▪ In the long run, horizontal sharding offers better scalability and can support a larger business scale.

Amazon DocumentDB Elastic Cluster is a cloud database service from Amazon that supports horizontal sharding. This article focuses on how to migrate massive amounts of data when moving from a MongoDB replica set architecture to a DocumentDB Elastic Cluster, and presents best practices based on our research.

02

Migration options

Migrating a database that holds a large amount of data is a well-known challenge. The database is being read and written continuously, so we must not only load the current full data set into the target database, but also synchronize to the new database the changes that occur during that initialization. The following diagram illustrates the migration plan:

[Figure: schematic of the full-load plus incremental-sync migration plan]

We know that MongoDB records document changes in two ways: oplog and change stream. Since the storage space of the oplog or change stream is limited, the migration speed in the full initialization phase must be considered. In addition, the speed of the incremental synchronization phase must also be greater than the change speed of the source database, so as to achieve data consistency between the old and new databases. We need to rely on stable and efficient tools to complete these two stages. Especially when migrating large databases, it is even necessary to cooperate with certain data migration strategies (such as parallelism and compression; cold and hot data are migrated separately; different collections are migrated separately, etc.).
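For example, before choosing a plan you can measure the source oplog window and write rate to judge how much headroom the migration has. A quick sketch against the source replica set (host name and credentials are placeholders):

# Check how long the oplog window is; it must cover the entire full-load phase
mongo "mongodb://<YourUser>:<YourPassword>@<source-host>:27017" --eval "rs.printReplicationInfo()"

# Sample the write rate: opcounters are cumulative, so diff two samples taken 60s apart
mongo "mongodb://<YourUser>:<YourPassword>@<source-host>:27017" --eval "
  var a = db.serverStatus().opcounters; sleep(60000);
  var b = db.serverStatus().opcounters;
  print('writes/sec: ' + ((b.insert + b.update + b.delete) - (a.insert + a.update + a.delete)) / 60);"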

There are three candidate migration options:

▪ Amazon DMS full + incremental migration;

▪ MongoShake full + incremental migration;

▪ Mongodump/mongorestore + DMS incremental migration.

Option 1: Amazon DMS full + incremental

DMS is an Amazon cloud service that can migrate relational databases, MongoDB databases, and other types of data stores. We can use DMS to perform a one-time migration, or to replicate ongoing changes from the source to keep source and target in sync. In the full-load phase, DMS provides Auto Segmentation and Range Segmentation to parallelize and accelerate migration; in the CDC (incremental) phase, version 3.5 (beta) also supports concurrent writes to DocumentDB.

Configuration reference:

https://docs.aws.amazon.com/zh_cn/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.TargetMetadata.html

Option 2: MongoShake full + incremental

The open-source MongoShake also supports migrating and writing to DocumentDB. Being open source, its advantages are an active community, the ability to customize the code to solve problems, and fast migration speed; the disadvantage is that there is relatively little technical support available when problems occur, so users must debug issues themselves or seek help from the community.

Option 3: Mongodump/mongorestore + DMS incremental

Mongodump is an official backup tool from MongoDB. It reads data from a MongoDB database and produces BSON files, which can then be restored through the mongorestore tool. It also supports backing up data from DocumentDB, and version 100.6.1 of mongodb-database-tools supports restoring into a DocumentDB Elastic Cluster. The advantage of this option is that it is stable and fast; the disadvantage is that it lacks incremental synchronization capability. However, DMS's incremental synchronization can cover that gap; the key point is choosing the correct starting position for incremental synchronization to prevent data loss.

The above three options have their own advantages and disadvantages, as shown in the table below.

[Table: comparison of the three migration options]

With the DMS managed service, configuring migration tasks is the most convenient: throughout the migration, the logs are clear, the speed is easy to track, and observability is good. MongoShake is somewhat slower at incremental writes to DocumentDB and is not suitable for high-TPS scenarios, while mongodump/mongorestore is faster than a DMS full load when migrating large MongoDB databases. Since migration speed is a decisive factor in the success of a large database migration, I recommend option 3. The rest of this article walks through the detailed steps of the mongodump/mongorestore + DMS incremental solution (the steps of the other options and related migration performance tuning will be covered in separate articles).

03

Detailed steps of the mongodump/mongorestore + DMS incremental solution

Environment description:

[Table: test environment configuration]

3.1  EC2 environment deployment

The EC2 instance is mainly used to run the mongo tools and DocumentDB tools, and to store the BSON files exported from the source database. mongodump exports data as BSON files, and the BSON files of large collections take up a lot of space, so a fairly large data disk is needed; in this case, a 3 TB data disk is configured.
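To size the disk, you can estimate the dump volume up front. mongodump writes uncompressed BSON, so dataSize (rather than the compressed on-disk storageSize) is the closer proxy. A sketch, with host and credentials as placeholders:

# Estimate the size of the BSON dump for the demodb database
mongo "mongodb://<YourUser>:<YourPassword>@<source-host>:27017/demodb" --eval "
  var s = db.stats();
  print('estimated BSON dump size, GiB: ' + (s.dataSize / Math.pow(1024, 3)).toFixed(1));"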

(1) Select the appropriate operating system


Select Amazon Linux 2 as the operating system, as shown in the figure below.

[Screenshot: choosing Amazon Linux 2 as the AMI]

(2) Choose the right instance type

Since mongorestore supports concurrent import, users can customize the degree of parallelism. It is recommended that the number of vCPUs be equal to, or double, the number of parallel workers. This article uses 16 concurrent import workers, so the m5a.4xlarge instance type is chosen.

[Screenshot: selecting the m5a.4xlarge instance type]

(3) Select the same AZ, VPC and Subnet as DocDB

To avoid increased latency from crossing Availability Zones during import, it is recommended to deploy the EC2 instance running the mongo tools in the same AZ and VPC as the DocDB Elastic Cluster. You can pick the appropriate AZ by editing the network settings and selecting the subnet.

[Screenshot: network settings with subnet selection]

(4) Add a suitably sized disk

Choose an EBS volume sized according to the BSON files that mongodump will export. For example, the database tested here is about 2.3 TB, so a 3000 GiB EBS volume is chosen, as shown below. Then click Launch instance to deploy the EC2 instance.

[Screenshot: adding a 3000 GiB EBS data volume]

(5) Create a file system for storing dump files and mount the directory

Step (4) only attaches an additional data disk. After the EC2 instance is deployed, you still need to create a file system on the volume, create the directory that will actually hold the dump files, and mount it (a minimal sketch follows the reference links below). Specific reference:

https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/add-instance-store-volumes.html#making-instance-stores-available-on-your-instances

If the EC2 instance already exists, you can also attach an EBS volume as a data disk. For details, see:

https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/ebs-attaching-volume.html
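A minimal sketch of the file system setup, assuming the 3000 GiB EBS volume appears as /dev/nvme1n1 (verify the device name with lsblk on your instance):

lsblk                                   # confirm the device name of the new volume
sudo mkfs -t xfs /dev/nvme1n1           # create an XFS file system on the data volume
sudo mkdir /backup                      # directory that will hold the dump files
sudo mount /dev/nvme1n1 /backup
sudo chown ec2-user:ec2-user /backup
# To persist the mount across reboots, add an entry to /etc/fstab using the UUID from 'sudo blkid'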

3.2 mongo tools installation 

Note: In this case a data disk has been added and the directory /backup created to store dump files. If the source MongoDB is not in the cloud, perform steps 3.2 and 3.3 in the IDC where the source MongoDB is located, then transfer the dump files to the EC2 instance deployed in step 3.1. The following steps use a customer's self-managed MongoDB on the Amazon cloud as the example.

(1) Log in to EC2 to download the tool

# Enter the tool installation directory (any directory of your choice)
cd /data/mongo
wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-amazon2-x86_64-100.6.1.tgz


Note: The mongo tools version must be mongodb-database-tools-amazon2-x86_64-100.6.1; other versions of mongorestore do not support importing into a DocumentDB Elastic Cluster.

(2) Unzip and install

tar zxvf mongodb-database-tools-amazon2-x86_64-100.6.1.tgz
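Optionally add the bin directory to PATH and verify the tools are usable:

export PATH=$PATH:/data/mongo/mongodb-database-tools-amazon2-x86_64-100.6.1/bin
mongodump --version
mongorestore --version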


3.3 Export source database data

Run mongodump to export the data: -d specifies the database, -c the collection name, and -o the dump output directory. Run mongodump --help for the full list of parameters.

cd /data/mongo/mongodb-database-tools-amazon2-x86_64-100.6.1/bin
nohup ./mongodump -h ip-172-xxxx.ec2.internal:27017 -u <YourUser> -p <YourPassword> -d demodb -c usertable -o /backup > dump.out 2>&1 &


# Check export progress
tail -f dump.out


2023-04-12T04:27:01 writing demodb.usertable to /backup/demodb/usertable.bson
2023-04-12T08:48:57.031+0000    done dumping demodb.usertable (2000000000 documents)
# List the backup files
ls -l /backup/*
-rw-rw-r-- 1 ec2-user ec2-user 2335759066463 Apr 12 08:48 usertable.bson
-rw-rw-r-- 1 ec2-user ec2-user           204 Apr 12 04:27 usertable.metadata.json


When you see the "done dumping xxxxdb.xxxxtable" message, the export is complete.

Note: Record the UTC time at which the export started (2023-04-12T04:27:01 here). When configuring the DMS incremental task, the start time will be set from this timestamp.

3.4 Create the database and sharded collection in the target DocDB

Since we are writing a non-sharded collection into a sharded collection, we must design the shard key in advance and create the sharded collection in the DocumentDB Elastic Cluster beforehand; otherwise mongorestore will import everything into a single shard, defeating the purpose of horizontal sharding.

(1) Create the target database

Sharding must be enabled on the target database (enablesharding). Create a database named tempdb and enable sharding:

db.runCommand( { enablesharding :"tempdb"});
{ "ok" : 1 }


(2) Create the sharded collection

For the shard key we select the default _id, sharded by hash; currently Elastic Cluster supports only hashed sharding. The creation syntax is as follows:

sh.shardCollection( "demodb.usertable", { "_id": "hashed" } )
{ "ok" : 1 }


3.5 Import data into the target database

Note: If the source MongoDB is in the customer's IDC, or not in the same AZ as DocDB, transfer the dump files in advance to the EC2 instance co-located with the target DocDB (the instance deployed in step 3.1). The following command imports the data with mongorestore:

nohup ./mongorestore -h docdb-cluster1-xxxx.us-east-1.docdb-elastic.amazonaws.com \
--ssl -u <YourUser> -p <YourPassword> -c usertable -d tempdb \
--dir=/backup/demodb/usertable.bson \
--numInsertionWorkersPerCollection=16 > mongorestore_log.out 2>&1 &


Parameter Description:

--numInsertionWorkersPerCollection specifies the number of concurrent import workers and is not directly related to the number of shards in DocDB; -d specifies the target database name, which may differ from the source database; --dir specifies the absolute path of the BSON file.

2023-04-12T10:23:53.924+0000    checking for collection data in /backup/demodb/usertable.bson
2023-04-12T10:23:53.924+0000    reading metadata for tempdb.usertable from /backup/demodb/usertable.metadata.json
2023-04-12T10:23:53.982+0000    restoring to existing collection tempdb.usertable without dropping
2023-04-12T10:23:53.982+0000    restoring tempdb.usertable from /backup/demodb/usertable.bson
2023-04-12T10:23:56.921+0000    [........................]  tempdb.usertable  141MB/2175GB  (0.0%)
2023-04-12T10:23:59.920+0000    [........................]  tempdb.usertable  322MB/2175GB  (0.0%)
2023-04-12T10:24:02.921+0000    [........................]  tempdb.usertable  555MB/2175GB  (0.0%)
2023-04-12T10:24:05.920+0000    [........................]  tempdb.usertable  772MB/2175GB  (0.0%)
2023-04-12T10:24:08.921+0000    [........................]  tempdb.usertable  1018MB/2175GB  (0.0%)
2023-04-12T10:24:11.921+0000    [........................]  tempdb.usertable  1.21GB/2175GB  (0.1%)
2023-04-13T07:04:32.920+0000    [#######################.]  tempdb.usertable  2175GB/2175GB  (100.0%)
2023-04-13T07:04:35.920+0000    [#######################.]  tempdb.usertable  2175GB/2175GB  (100.0%)
2023-04-13T07:04:38.117+0000    [########################]  tempdb.usertable  2175GB/2175GB  (100.0%)
2023-04-13T07:04:38.117+0000    finished restoring tempdb.usertable (2000000000 documents, 0 failures)
2023-04-13T07:04:38.117+0000    restoring indexes for collection tempdb.usertable from metadata
 2023-04-13T07:05:43.131+0000    index: &idx.IndexDocument{Options:primitive.M{"name":"idx_name", "ns":"tempdb.usertable", "v":2}, Key:primitive.D{primitive.E{Key:"name", Value:1}}, PartialFilterExpression:primitive.D(nil)}
2023-04-13T07:05:43.132+0000    2000000000 document(s) restored successfully. 0 document(s) failed to restore.


When you see the "document(s) restored successfully" message, the import is complete.

Migrating indexes

By default, mongorestore migrates indexes automatically after migrating the data. You can also choose not to migrate indexes during mongorestore (add the --noIndexRestore option), and instead use the Amazon DocumentDB index tool to export the indexes from the source cluster and restore them to the target. For details, see the Amazon developer guide: https://docs.aws.amazon.com/zh_cn/documentdb/latest/developerguide/docdb-migration.versions.html#docdb-migration.versions-step3.
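A sketch of the data-only restore, reusing the endpoint and credentials from section 3.5; the only change is the added --noIndexRestore flag:

nohup ./mongorestore -h docdb-cluster1-xxxx.us-east-1.docdb-elastic.amazonaws.com \
--ssl -u <YourUser> -p <YourPassword> -c usertable -d tempdb \
--dir=/backup/demodb/usertable.bson \
--noIndexRestore \
--numInsertionWorkersPerCollection=16 > restore_noindex.out 2>&1 &
# Indexes are then exported from the source and recreated with the Amazon DocumentDB index tool (see the guide linked above)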

For indexes, note that the following index types are not supported in Elastic Cluster:

▪ Sparse indexes

▪ TTL indexes

▪ Geospatial indexes

▪ Background index creation

3.6 Monitor write metrics

You can view the import speed of each shard in CloudWatch. Go to CloudWatch → All metrics → DocDB Elastic → search for "DocumentsInserted", select each shard, and build a dashboard, as shown below:

[Screenshot: CloudWatch dashboard of DocumentsInserted per shard]

As the dashboard shows, this case has 3 shards, and each shard ingests data at about 220,000 documents per second.

3.7  Incremental synchronization

Configure the incremental synchronization task through the DMS console as follows:

(1) Create the endpoints

As with other migration tasks, a source endpoint and a target endpoint must be created separately. The source endpoint is the MongoDB endpoint; for self-managed MongoDB, choose to enter the connection information manually, as shown below, paying attention to the highlighted selections.

[Screenshot: source endpoint configuration for MongoDB]

The target endpoint is the DocumentDB endpoint, configured much like the source endpoint above.
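The same endpoints can also be created with the Amazon CLI; a sketch with hypothetical identifiers (server names, credentials, and settings must match your environment):

# Source endpoint: self-managed MongoDB
aws dms create-endpoint \
  --endpoint-identifier mongodb-source \
  --endpoint-type source \
  --engine-name mongodb \
  --server-name ip-172-xxxx.ec2.internal --port 27017 \
  --username <YourUser> --password <YourPassword> \
  --mongo-db-settings '{"AuthType":"password","AuthSource":"admin","NestingLevel":"none"}'

# Target endpoint: DocumentDB
aws dms create-endpoint \
  --endpoint-identifier docdb-target \
  --endpoint-type target \
  --engine-name docdb \
  --server-name docdb-cluster1-xxxx.us-east-1.docdb-elastic.amazonaws.com --port 27017 \
  --username <YourUser> --password <YourPassword> \
  --database-name tempdb \
  --ssl-mode require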

(2) Create a replication instance

CDC speed is related to the replication instance class, and you can test different classes. Since the instance is used only for the migration and not for long-term synchronization, it is recommended to use a single-AZ instance (released once the migration completes), so that the replication instance and DocumentDB can be deployed in the same Availability Zone to improve CDC write speed. Also, to use DMS's parallel CDC writes to DocumentDB, select engine version 3.5.0 (beta) or above.
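A CLI sketch of the replication instance creation under those recommendations (instance class, Availability Zone, and subnet group are example values):

# Single-AZ instance used only during the migration; engine version 3.5.0 enables parallel apply
aws dms create-replication-instance \
  --replication-instance-identifier docdb-migration \
  --replication-instance-class dms.c5.2xlarge \
  --engine-version 3.5.0 \
  --no-multi-az \
  --availability-zone us-east-1a \
  --replication-subnet-group-identifier <your-subnet-group>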

[Screenshot: replication instance configuration]

(3) Create an incremental synchronization task

When creating the task, enter an identifier and select the replication instance and endpoints configured above; then set the migration type to "Replicate data changes only" to perform CDC-only synchronization.

[Screenshot: task creation with migration type "Replicate data changes only"]

In the task settings that follow, enable the custom start mode, i.e., declare the timestamp from which to start capturing the change stream and writing to the target DocumentDB. Enter the correct start time: to guarantee data consistency, it must be the time at which mongodump started, recorded earlier.

[Screenshot: setting a custom CDC start time]

As shown above, in Task settings the Target table preparation mode must be set to Do nothing, which tells DMS to leave the target collection alone if it already exists. The reason is that we have already created a sharded collection; if DMS recreated the table, it would create an ordinary non-sharded collection, defeating our sharding design. For easier monitoring, we recommend enabling the Turn on CloudWatch logs option.

Then configure the mapping rules for the collections, declaring the database and collection names to synchronize. If you are running the incremental task against every collection in a database, there is no need to list collection names individually. If the schema name differs between source and target, the mapping rules can also handle the rename, as in the sketch below.
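A hypothetical mapping that replicates only demodb.usertable and renames the database to tempdb on the target:

cat > table-mappings.json <<'EOF'
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-usertable",
      "object-locator": { "schema-name": "demodb", "table-name": "usertable" },
      "rule-action": "include"
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "rename-to-tempdb",
      "rule-target": "schema",
      "object-locator": { "schema-name": "demodb" },
      "rule-action": "rename",
      "value": "tempdb"
    }
  ]
}
EOF
# The same JSON can be pasted into the console, or passed to the CLI via:
#   aws dms create-replication-task --table-mappings file://table-mappings.json ...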

[Screenshot: table mapping rule configuration]

At the bottom of the creation page there is an option controlling whether the migration task starts immediately after creation. Since we still need to modify the parallel CDC parameters, choose "Manually later". Finally, click "Create task" to finish.

[Screenshot: choosing to start the task manually later]

(4) Modify parallel CDC parameters

After creation, the task should be in the Ready state (not started). Now enable DMS CDC's parallel apply for DocumentDB: select the task, click Modify, find Task settings, and switch to the JSON editor (as shown below).

[Screenshot: task settings JSON editor]

# Modify the following 3 parameters in TargetMetadata (the values below are an example configuration)
"ParallelApplyBufferSize": 1000,
"ParallelApplyQueuesPerThread": 200,
"ParallelApplyThreads": 16,


Then, click Save.

These three parameters, introduced in DMS 3.5 (beta), are very important: they increase the speed of CDC writes to DocumentDB and improve the success rate of migrating large document databases. Parameter explanations:

  • ParallelApplyThreads: the number of concurrent threads Amazon DMS uses to push data records to the target database during CDC load.

  • ParallelApplyBufferSize: the maximum number of records stored in each buffer queue during CDC load.

  • ParallelApplyQueuesPerThread: the number of queues each thread accesses during CDC.

Specific reference:

https://docs.aws.amazon.com/zh_cn/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.TargetMetadata.html

(5) Modify CDC start time

As mentioned above, the CDC task must start from the point at which the full data initialization began. If you did not set the correct time in step (3), you can reset it with the Amazon CLI in this step. For data consistency, DMS needs to pull the change stream starting from the checkpoint corresponding to the start of the data initialization.

Prerequisite: the Amazon CLI is configured.

Step 1. Confirm the oplog time range on the source:

rs0:PRIMARY> rs.printReplicationInfo()
configured oplog size:   8961.375MB
log length start to end: 304secs (0.08hrs)
oplog first event time:  Wed Apr 12 2023 10:06:24 GMT+0000 (UTC)
oplog last event time:   Wed Apr 12 2023 10:11:28 GMT+0000 (UTC)
now:                     Wed Apr 12 2023 10:11:30 GMT+0000 (UTC)


Note: The value passed to --cdc-start-time in Step 2 must fall within the range shown above; otherwise the task will fail at startup because the corresponding oplog entry cannot be found. Although DMS uses the change stream for CDC migration by default, the change stream's resume token is directly tied to oplog timestamps.

Step 2. Apply the modification:

aws dms modify-replication-task \
--replication-task-arn "arn:aws:dms:us-east-1:7301234567:task:xoxxxo" \
--cdc-start-time "2023-04-12T10:06:50"


Note: The DMS task must be in the stopped state for the above command to succeed.

--replication-task-arn is the ARN of the DMS migration task.

--cdc-start-time is the time at which mongodump was started, recorded earlier when the export began; the specific value shown here is for reference only.

(6) Start the CDC task

Once the task starts, CDC data synchronization from MongoDB to DocumentDB officially begins. Specifically: select the CDC task to start, click "Actions", and choose Restart/Resume.

[Screenshot: starting the task via Actions → Restart/Resume]

Note: If the CDC task has previously been stopped or has failed, and cdc-start-time has since been modified, choose Restart rather than Resume when starting the task again. If Resume is chosen by mistake, CDC will pull the change stream from the point of the last stop or failure, and the manually set cdc-start-time will not take effect. (A CLI sketch follows the screenshot below.)

[Screenshot: Restart/Resume dialog]
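On the CLI side, the console's Restart/Resume choice roughly corresponds to the start type of start-replication-task; verify this mapping for your DMS version before relying on it:

# Restart, honoring the configured cdc-start-time
aws dms start-replication-task \
  --replication-task-arn "arn:aws:dms:us-east-1:7301234567:task:xoxxxo" \
  --start-replication-task-type reload-target

# Resume would instead continue from the last checkpoint, ignoring a newly set cdc-start-time:
#   --start-replication-task-type resume-processing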

(7) Monitor CDC tasks

After the task starts, you can monitor the progress of incremental synchronization through DMS's "CloudWatch metrics" or "View CloudWatch Logs".

  • CloudWatch metrics

As shown in the figure below, open the task, click the CloudWatch metrics tab, and select the CDC category to view the CDC latency source and CDC latency target metrics. If the latency is not decreasing, there may be a data synchronization throughput problem; if it is gradually shrinking, the incremental load is catching up. (A CLI sketch for pulling these metrics follows the screenshot.)

[Screenshot: CDC latency source/target metrics]
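The same latency figures can be pulled programmatically from the AWS/DMS CloudWatch namespace; a sketch with placeholder identifiers (the task dimension takes the task's resource ID, not its ARN):

aws cloudwatch get-metric-statistics \
  --namespace "AWS/DMS" \
  --metric-name CDCLatencyTarget \
  --dimensions Name=ReplicationInstanceIdentifier,Value=<instance-id> \
               Name=ReplicationTaskIdentifier,Value=<task-resource-id> \
  --start-time 2023-04-13T07:00:00Z --end-time 2023-04-13T08:00:00Z \
  --period 300 --statistics Average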

  • View CloudWatch Logs

CloudWatch Logs show the detailed execution logs of DMS. For example, you can use the logs to confirm whether parallel apply has taken effect and whether the target DocumentDB is applying changes normally.

[Screenshot: DMS task logs in CloudWatch]

If the log contains "[TARGET_APPLY]I: Working in bulk apply mode" (highlighted in the screenshot above), parallel apply has been enabled successfully. The log below shows normal writes: 200 records are taken from a queue and applied to the target DocumentDB in one batch.

[Screenshot: bulk apply log entries]

04

Summary

This article started from the scaling difficulties customers encounter when using MongoDB to hold massive document data and proposed a path to better scalability. For the difficulty of migrating a large MongoDB replica set, it presented three options for migrating to the horizontally sharded, highly scalable DocumentDB Elastic Cluster, and then walked through the mongodump/mongorestore full + DMS incremental solution in detail. This migration solution is very efficient and is an effective way to migrate MongoDB to Amazon; it also supports migrating an instance-based DocumentDB cluster to an Elastic Cluster. Note that no database retains its change stream or oplog indefinitely: if the incremental log is purged during the full migration, the migration can no longer guarantee eventual data consistency, and the whole migration fails. For example, the change stream of a DocumentDB instance-based cluster can be retained for at most 7 days (it is recommended to test the full-load speed in your own environment and estimate the maximum data volume that can be migrated within the change stream retention window). Therefore, if the storage of a MongoDB or DocumentDB workload has reached the TB level and is still growing, please design and implement sharding as soon as possible: the earlier you detect and optimize, the less painful architecture optimization and data migration will be.

05

References

  • DMS:

    Creating a task:

    https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.Creating.html

    Target metadata task settings:

    https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.TargetMetadata.html

  • DocumentDB:

    Migrating to Amazon DocumentDB:

    https://docs.aws.amazon.com/documentdb/latest/developerguide/docdb-migration.html

    Using Amazon DocumentDB elastic clusters:

    https://docs.aws.amazon.com/documentdb/latest/developerguide/docdb-using-elastic-clusters.html

About the author


Jinchuan

Database solutions architect at Amazon cloud technology, responsible for database solution consulting and architecture design on the Amazon cloud. Before joining Amazon cloud technology, he worked for many years at Huawei, Alibaba Cloud, and other companies. He has rich technical experience in database selection and architecture design, database optimization, data migration, and big data and data warehouse construction, with design and implementation experience across many industries.
