How does MongoDB ensure oplog order?

In a MongoDB replica set, the primary and secondary nodes synchronize data through the oplog. When data is written on the Primary, an oplog entry is recorded; each Secondary pulls the oplog from the Primary and replays it, so that all nodes end up storing the same data set.

Key features of the oplog

  • Idempotent: replaying an oplog entry once or many times gives the same result. To achieve idempotency, MongoDB converts many operations, for example an insert is converted to an upsert and a $inc operation is converted to a $set.
  • Fixed size (capped collection): the oplog has a fixed amount of storage space; when the space is full, the oldest entries are deleted automatically.
  • Ordered by timestamp: oplog entries are sorted by timestamp, and the order is consistent on all nodes.
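To see why the $inc to $set conversion matters, here is a minimal sketch. It models a document as a flat map of integer fields, which is an assumption made only for this illustration; the names Document, applyInc, applySet and incToSetValue are hypothetical and are not MongoDB APIs.

```cpp
#include <map>
#include <string>

// A document modelled as a flat map of integer fields (illustration only).
using Document = std::map<std::string, int>;

// Non-idempotent form: every replay adds `delta` again.
void applyInc(Document& doc, const std::string& field, int delta) {
    doc[field] += delta;
}

// Idempotent form: replaying any number of times leaves the same value.
void applySet(Document& doc, const std::string& field, int value) {
    doc[field] = value;
}

// Compute the value a $set would carry for an $inc applied to `doc`.
// A missing field is treated as 0 before the increment.
int incToSetValue(const Document& doc, const std::string& field, int delta) {
    auto it = doc.find(field);
    int current = (it == doc.end()) ? 0 : it->second;
    return current + delta;
}
```

If a document holds x = 5 and the client sends an increment of 3, logging the result of incToSetValue (8) and replaying it with applySet twice still leaves x = 8, whereas replaying applyInc twice would leave x = 11.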

This article describes how MongoDB ensures that the oplog is stored and read in order.

How are locks taken when the oplog is written concurrently?

When a document is written on the Primary, MongoDB first takes a write intent lock on the DB, then a write intent lock on the collection, and then calls the underlying storage engine interface to write the document. Next it takes a write intent lock on the local database and a write intent lock on the oplog.rs collection, and writes the oplog entry. For details of MongoDB's multi-level intent lock mechanism, refer to the official documentation.

Write1

DBLock("db1", MODE_IX);
CollectionLock("collection1", MODE_IX);
storageEngine.writeDocument(...);    
DBLock("local", MODE_IX);
CollectionLock("oplog.rs", MODE_IX);
storageEngine.writeOplog(...);

Write2

DBLock("db2", MODE_IX);
CollectionLock("collection2", MODE_IX);
storageEngine.writeDocument(...);    
DBLock("local", MODE_IX);
CollectionLock("oplog.rs", MODE_IX);
storageEngine.writeOplog(...);
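Write1 and Write2 can both hold their locks on local and oplog.rs at the same time because intent locks are compatible with each other. The sketch below encodes the usual compatibility table for the four modes (IS, IX, S, X). The enum and the compatible function are illustrative stand-ins, not MongoDB's actual lock manager code.

```cpp
// Lock modes, named after the MODE_IX constant used in the pseudocode above.
enum LockMode { MODE_IS, MODE_IX, MODE_S, MODE_X };

// Returns true when a lock requested in mode `requested` can be granted
// while another operation already holds the same resource in mode `held`.
bool compatible(LockMode held, LockMode requested) {
    static const bool table[4][4] = {
        //          IS     IX     S      X
        /* IS */ {true,  true,  true,  false},
        /* IX */ {true,  true,  false, false},
        /* S  */ {true,  false, true,  false},
        /* X  */ {false, false, false, false},
    };
    return table[held][requested];
}
```

Since compatible(MODE_IX, MODE_IX) is true, the locks alone do not serialize the two oplog writes, which is why ordering has to be guaranteed by another mechanism.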

How is oplog order ensured on the Primary?

Given the concurrency strategy above, how is the oplog kept in order when multiple writes run concurrently?

The oplog is a special capped collection. Its documents have no _id field but contain a ts (timestamp) field, and all oplog documents are stored in ts order. Here are two example oplog entries.

{ "ts" : Timestamp(1472117563, 1), "h" : NumberLong("2379337421696916806"), "v" : 2, "op" : "c", "ns" : "test.$cmd", "o" : { "create" : "sbtest" } }
{ "ts" : Timestamp(1472117563, 2), "h" : NumberLong("-3720974615875977602"), "v" : 2, "op" : "i", "ns" : "test.sbtest", "o" : { "_id" : ObjectId("57bebb3b082625de06020505"), "x" : "xkfjakfjdksakjf" } }
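Each ts value above is a (seconds, increment) pair, so the two entries share the same second and differ only in the increment. The following sketch shows how such pairs compare and how they might be generated from the current time plus a counter. The struct Timestamp and the class OpTimeGenerator are hypothetical names for this illustration, and the clock is passed in as a parameter to keep the example deterministic.

```cpp
#include <cstdint>
#include <tuple>

// Mirrors the Timestamp(seconds, increment) pairs in the entries above.
struct Timestamp {
    std::uint32_t secs;
    std::uint32_t inc;
};

// Order by seconds first, then by the increment within the same second.
inline bool operator<(const Timestamp& a, const Timestamp& b) {
    return std::tie(a.secs, a.inc) < std::tie(b.secs, b.inc);
}

// Hypothetical generator: current time in seconds plus a counter.
// Not thread-safe; a caller would hold a lock while calling next().
class OpTimeGenerator {
public:
    Timestamp next(std::uint32_t nowSecs) {
        if (nowSecs > last_.secs) {
            last_ = Timestamp{nowSecs, 1};  // new second: restart the counter
        } else {
            ++last_.inc;  // same second (or clock moved back): bump the counter
        }
        return last_;
    }

private:
    Timestamp last_{0, 0};
};
```

Two calls within second 1472117563 would yield increments 1 and 2, matching the shape of the two entries shown above.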

Take WiredTiger as an example. When an oplog document is written, the ts field of the entry is used as the key and the document content as the value, and the pair is written as one KV record. WiredTiger guarantees that stored records are sorted by key (through its btree or LSM structure), which solves the problem of storing documents in ts order. A concurrency-related ordering problem still remains, for example:

Suppose several oplog entries are written concurrently with timestamps ts1, ts2, ts3 (ts1 < ts2 < ts3). The writes of ts1 and ts3 succeed first, and at that moment a Secondary pulls these two entries. Then the write of ts2 succeeds and the Secondary pulls ts2. The order seen by the Secondary is ts1, ts3, ts2, so the oplog is out of order.
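The scenario can be reproduced with a small single-threaded simulation. Here a std::map stands in for the sorted KV table, and NaiveReader is a hypothetical reader that, on every pull, returns each entry it has not seen yet without checking for writes still in flight. Both are simplifications for illustration, not MongoDB or WiredTiger code.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using Ts = std::uint64_t;

// Stand-in for the storage engine table: std::map keeps keys sorted.
using OplogTable = std::map<Ts, std::string>;

// Hypothetical naive reader: each pull returns, in key order,
// every entry that this reader has not returned before.
class NaiveReader {
public:
    std::vector<Ts> pull(const OplogTable& oplog) {
        std::vector<Ts> fresh;
        for (const auto& kv : oplog) {
            if (seen_.insert(kv.first).second) {
                fresh.push_back(kv.first);
            }
        }
        observed_.insert(observed_.end(), fresh.begin(), fresh.end());
        return fresh;
    }

    // Every timestamp returned so far, in the order the reader saw it.
    const std::vector<Ts>& observed() const { return observed_; }

private:
    std::set<Ts> seen_;
    std::vector<Ts> observed_;
};
```

Inserting keys 1 and 3, pulling, then inserting key 2 and pulling again leaves the table itself sorted as 1, 2, 3, while the reader has observed 1, 3, 2.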

MongoDB's solution (with the WiredTiger engine) is to restrict what can be read, so that a Secondary is guaranteed to see the oplog in order. The mechanism is implemented as follows:

  1. Before writing an oplog entry, take a lock, assign the entry a timestamp, and register the timestamp in the uncommitted list

    lock();
    ts = getNextOpTime(); // generated from the current timestamp plus a counter
    _uncommittedRecordIds.insert(ts);
    unlock();
    
  2. Write the oplog entry itself; once the write completes, remove its timestamp from the uncommitted list

    writeOplog(ts, oplogDocument);
    lock();
    _uncommittedRecordIds.erase(ts);
    unlock();
    
  3. When pulling the oplog

    if (_uncommittedRecordIds.empty()) {
        // all oplog entries are readable
    } else {
        // only oplog entries before the smallest value in the uncommitted list are readable
    }
    

With these rules, the Primary's oplog is stored in ts order, and a Secondary reads all oplog entries in order.
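The three steps can be put together in one small class. This is a sketch under stated assumptions rather than MongoDB's implementation: the class Oplog and its methods reserve, write and readAfter are hypothetical names, a bare counter stands in for getNextOpTime(), and a std::map stands in for the storage engine. Because std::map is not thread-safe, the sketch performs the write under the same mutex, whereas the steps above take the lock only around the list updates.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <vector>

using Ts = std::uint64_t;

class Oplog {
public:
    // Step 1: assign a timestamp and register it as uncommitted.
    Ts reserve() {
        std::lock_guard<std::mutex> guard(mutex_);
        Ts ts = ++counter_;  // simplification: stands in for getNextOpTime()
        uncommitted_.insert(ts);
        return ts;
    }

    // Step 2: write the entry, then remove it from the uncommitted list.
    void write(Ts ts, const std::string& doc) {
        std::lock_guard<std::mutex> guard(mutex_);
        entries_[ts] = doc;
        uncommitted_.erase(ts);
    }

    // Step 3: return timestamps greater than `after` that are safe to read,
    // that is, strictly below the smallest uncommitted timestamp.
    std::vector<Ts> readAfter(Ts after) const {
        std::lock_guard<std::mutex> guard(mutex_);
        Ts limit = uncommitted_.empty() ? std::numeric_limits<Ts>::max()
                                        : *uncommitted_.begin();
        std::vector<Ts> out;
        for (auto it = entries_.upper_bound(after);
             it != entries_.end() && it->first < limit; ++it) {
            out.push_back(it->first);
        }
        return out;
    }

private:
    mutable std::mutex mutex_;
    Ts counter_ = 0;
    std::set<Ts> uncommitted_;
    std::map<Ts, std::string> entries_;
};
```

In the ts1, ts2, ts3 scenario, a reader that pulls after ts1 and ts3 are written receives only ts1, because ts2 is still in the uncommitted list. Once ts2 is written, the next pull returns ts2 and ts3 in order.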

How is the oplog order on a Secondary kept consistent with the Primary?

After a Secondary pulls oplog entries to the local node, it replays them with multiple threads. Finally, a single thread writes the pulled entries unchanged into the local local.oplog.rs collection, which ensures that the Secondary's oplog ends up identical to the Primary's.
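The following single-threaded sketch illustrates the shape of that flow. How entries are divided among replay threads is not described here, so the partitioning by a hash of the document id is an assumption made for this illustration, and OplogEntry, partition and appendBatch are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct OplogEntry {
    std::uint64_t ts;
    std::string docId;  // hypothetical: id of the document the entry touches
};

// Split a pulled batch into one queue per replay worker (workers must be > 0).
// Entries for the same document land in the same queue, so their relative
// order is preserved when each worker replays its queue front to back.
std::vector<std::vector<OplogEntry>> partition(
        const std::vector<OplogEntry>& batch, std::size_t workers) {
    std::vector<std::vector<OplogEntry>> queues(workers);
    std::hash<std::string> hasher;
    for (const auto& entry : batch) {
        queues[hasher(entry.docId) % workers].push_back(entry);
    }
    return queues;
}

// After replay, one thread appends the pulled batch to the local oplog
// unchanged, so the local order equals the order received from the Primary.
void appendBatch(std::vector<OplogEntry>& localOplog,
                 const std::vector<OplogEntry>& batch) {
    localOplog.insert(localOplog.end(), batch.begin(), batch.end());
}
```

The replay order across workers may differ from the Primary's order, but the local oplog is written from the original batch by one thread, so its order does not depend on how the replay was scheduled.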

About the Author

Zhangyou Dong is a technical expert at Alibaba who focuses on distributed storage and NoSQL databases. He has worked on TFS (Taobao Distributed File System) and the Redis cloud database, and now works mainly on MongoDB cloud database development, committed to providing developers with the best MongoDB cloud service.

Reprinted from: http://www.mongoing.com/archives/3177

Origin www.cnblogs.com/xibuhaohao/p/11226337.html