Concept and practice analysis of WAL

Part 1 - Write-Ahead Logging

In computer science, write-ahead logging [1] (WAL for short) is a family of techniques used in relational database systems to provide atomicity and durability (the A and D in the ACID properties of a transaction). In systems using WAL, all changes are written to a log file before they take effect. The log file usually includes both redo and undo information. Suppose a program loses power in the middle of some operation. On restart, the program needs to know whether the operation in progress at that time succeeded, partially succeeded, or failed. With WAL, the program can examine the log file and compare what was planned with what was actually performed at the moment of the power loss. On the basis of this comparison, the program can decide whether to undo the work, complete it, or leave it as it is.

WAL allows the database to be updated in place. Another method used to achieve atomic updates is shadow paging. The main advantage of in-place updates is that they reduce modifications to indexes and block lists. ARIES [2] is a commonly used algorithm in the WAL family. Its core strategies are: 1) Write-ahead logging: any change to an object must be logged first, and the log must be written to disk before the object itself; 2) Repeating history during redo: when restarting after a crash, ARIES restores the database to the state it was in at the moment of the crash by replaying the database's pre-crash actions, and then undoes the transactions that were still executing at crash time; 3) Logging changes during undo: the changes made to the database while undoing a transaction are themselves logged, ensuring that they will not be redone across repeated restarts. In file systems, WAL is often referred to as journaling [3].

Every update operation in RocksDB [4] is written to two places: 1) the WAL log on disk; 2) an in-memory data structure called the memtable, which is flushed to an SST file later. The WAL serializes the memtable operations and stores them on persistent media as log files. In the event of a crash, the WAL files can be used to rebuild the memtable and restore the database to a consistent state. Once a memtable has been safely flushed to persistent media, its related WAL logs become obsolete and are archived. Archived logs are eventually deleted from disk after a certain period of time.

 

1.1 WAL Manager

WAL files are generated in the WAL directory with incrementing sequence numbers. To rebuild the state of the database, these files are read in sequential order. The WAL manager provides an abstraction for reading the WAL files as a single unit; internally, it opens and reads the files through the Writer and Reader abstract interfaces.

Writer provides an abstract interface for appending data to the end of a log file; the storage-media details are handled through the WritableFile interface. Similarly, Reader provides an abstract interface for sequentially reading log records from a log file, with the storage-media details handled by the SequentialFile interface.

 

1.2 WAL file format

A WAL file consists of a series of variable-length records. Records are grouped into blocks of kBlockSize (32 KB by default). If a record cannot fit into the remaining space of a block, the remainder is filled with empty data. Both the writer and the reader operate on the file in chunks of kBlockSize.

The records are arranged in the following format, a 7-byte header followed by the payload:

    CRC (4 bytes) | Size (2 bytes) | Type (1 byte) | Payload

A WAL file is a sequence of 32 KB blocks, with the sole exception that the tail of the file may contain a partial block. Each block consists of a series of records, and a record never starts within the last 6 bytes of a block (since the 7-byte header cannot fit). Any leftover space forms the trailer, which is filled with zeros and must be skipped by readers. If exactly 7 bytes are left in the current block and a new non-zero-length record is to be added, the writer must emit a FIRST record containing no user data to fill those 7 bytes, and then write the user data in subsequent blocks.

FULL records hold complete user data. FIRST, MIDDLE, and LAST are used when user data has to be split into multiple fragments (usually because of block boundaries): FIRST is the type of the first fragment of user data, LAST is the type of the final fragment, and MIDDLE is the type of all interior fragments.

Example: Consider a sequence of user records: A, of size 1000 bytes; B, of size 97270 bytes; C, of size 8000 bytes.

A is stored as a FULL record in the first block. B is split into three fragments: the first occupies the remaining space of the first block, the second fills the entire second block, and the third occupies the beginning of the third block. This leaves 6 bytes free in the third block, which are left empty as the trailer. C is stored as a FULL record in the fourth block.

 

1.3 Life cycle of WAL

Let's use an example to illustrate the life cycle of a WAL. A RocksDB instance is opened with two column families: "new_cf" and "default". Once the db is opened, a new WAL is created on disk to guarantee write durability.

Some data is then added to the two column families: key1 and key3 to "new_cf", and key2 and key4 to "default".

At this point the WAL records all write operations. It remains open and keeps receiving subsequent writes until its size reaches the threshold set by DBOptions::max_total_wal_size.

If the user then decides to flush the column family "new_cf", the following things happen:

  • The data of new_cf (key1 and key3) is flushed to a new SST file;

  • A new WAL is created, and all subsequent writes go to the new WAL;

  • The old WAL no longer accepts new writes, but its deletion is deferred.

At this point there are two WAL files: the old one holds key1 through key4, and the new one holds key5 and key6. Because the old one still contains live data, namely that of the "default" column family, it cannot be deleted yet. Only when the user finally flushes the "default" column family's data to disk can the old WAL be archived and then automatically deleted from disk.

In general, a WAL file is created when:

  • the DB is opened;

  • a column family is flushed.

A WAL is deleted (or archived, if archiving is enabled) once every column family has flushed data beyond the largest sequence number the WAL holds; in other words, once all data in the WAL is pinned down in SST files. Archived WALs are moved to a separate location and later purged from storage. The actual delete action may be delayed because of copying.

 

1.4 WAL configuration

 

These configurations can be found in options.h.

  • DBOptions::wal_dir: sets the directory where RocksDB stores WAL files, which lets users store the WAL and the actual data separately (the Yunxi database stores them in the same directory by default);

  • DBOptions::WAL_ttl_seconds, DBOptions::WAL_size_limit_MB: these two options affect when archived WAL files are deleted. Non-zero values set thresholds of time and disk space; exceeding either threshold triggers deletion of archived WAL files (both default to 0 in the Yunxi database);

  • DBOptions::max_total_wal_size: limits the total size of the WALs. Once the WALs exceed this size, RocksDB starts forcing column families to flush so that the oldest WAL files can be deleted. This option is useful when column families are updated at very different frequencies: without a size limit, a rarely updated column family with unflushed data could force a very old WAL file to be kept around (default 0 in the Yunxi database);

  • DBOptions::avoid_flush_during_recovery: the name explains its purpose, namely to avoid flushing memtables during recovery (default false in the Yunxi database);

  • DBOptions::manual_wal_flush: determines whether the WAL is flushed automatically after each write or only manually, in which case the user must call FlushWAL to trigger a WAL flush (default false in the Yunxi database);

  • DBOptions::wal_filter: lets the user provide a filter object that is invoked while WAL files are processed during recovery (disabled by default in the Yunxi database);

  • WriteOptions::disableWAL: useful when the user relies on some other logging mechanism, or is not worried about data loss (default false in the Yunxi database).

     

1.5 WAL filter

Transaction log iterators provide a way to replicate data between RocksDB instances. Once a WAL is archived because its data has been flushed, it is not deleted immediately; this allows the transaction log iterator to continue reading the WAL file before sending it to a replica.

 

Part 2 - Implementation of WAL

 

2.1 etcd [5]

The overall architecture of etcd is shown in the following figure:

The etcd write request path can be roughly summarized as follows: 1) the client selects an etcd node through a load-balancing algorithm and issues a gRPC call; 2) the etcd node receives the request, passes it through the gRPC interceptors and the Quota module, and hands it to the KVServer module; 3) the KVServer module submits a proposal to the Raft module; 4) the proposal is forwarded to each node in the cluster through the RaftHTTP network module; 5) after a majority of nodes in the cluster have persisted it, the proposal's status becomes committed; 6) etcdserver obtains the committed log entries from the Raft module and 7) passes them to the Apply module; 8) the Apply module executes the proposal content through the MVCC module and 9) updates the state machine.

In step 5, after the Raft module receives the proposal, if the current node is a follower it forwards the proposal to the leader; only the leader can process write requests. After the leader receives the proposal, its Raft module outputs both the messages to be forwarded to the follower nodes and the log entry to be persisted; the log entry encapsulates the proposal content. Once etcdserver obtains these messages and log entries from the Raft module, it broadcasts the put proposal to each node in the cluster as the leader, and at the same time persists the cluster leader's term number, voting information, committed index, and proposal content to a WAL (write-ahead log) file to guarantee the cluster's consistency and recoverability.

The figure above shows the WAL structure: a sequence of WAL records of various types appended in order, each consisting of a type, data, and a cyclic redundancy check code. The Type field distinguishes the record kinds, Data is the content of the record, and CRC is the checksum. Five WAL record types are currently supported: file metadata records, log entry records, state records, CRC records, and snapshot records. A file metadata record contains the node ID and cluster ID and is written when the WAL file is created. A log entry record contains Raft log information, such as the content of a put proposal. A state record contains the cluster term number, node voting information, and so on; a log file may contain several, and the last one prevails. A CRC record carries the CRC value accumulated over the previous WAL file; it is written as the first record when a WAL file is created or cut, and is used to verify the integrity and accuracy of the data files. A snapshot record contains the snapshot's term number and log index, used to check the accuracy of snapshot files.

How does the WAL module persist the log entry record of a put proposal? First, look at how a put request is encapsulated in a Raft log entry. The Raft log entry data structure consists of the following fields: Term, the leader term number, which increases with each leader election; Index, the index of the log entry, which increases monotonically; Type, the log type, such as an ordinary command log (EntryNormal) or a cluster configuration change log (EntryConfChange); and Data, which holds the content of the put proposal.

With the Raft log entry structure in mind, consider how the WAL module persists it. The module first serializes the content of the Raft log entry (term number, index, and proposal content) into the Data field of a WAL record, computes the CRC of the Data, and sets the Type to the entry type; together these fields form a complete WAL record. It then computes the record's length, writes the length (Len) field, writes the record content, and calls fsync to persist everything to disk, completing the persistence of the log entry. Once more than half of the nodes have persisted the log entry, the Raft module notifies the etcdserver module through a channel that the put proposal has been confirmed by a majority of the cluster; the proposal's status becomes committed and its content can be executed. This leads into step 6: the etcdserver module takes the proposal content from the channel, adds it to a first-in, first-out (FIFO) scheduling queue, and the Apply module then executes the proposals asynchronously, in order of entry.

 


[1] https://zh.wikipedia.org/wiki/%E9%A2%84%E5%86%99%E5%BC%8F%E6%97%A5%E5%BF%97

[2] https://en.wikipedia.org/wiki/Algorithms_for_Recovery_and_Isolation_Exploiting_Semantics

[3] https://en.wikipedia.org/wiki/Journaling_file_system

[4] https://rocksdb.org.cn/doc/Write-Ahead-Log-File-Format.html

[5] https://github.com/etcd-io/etcd/blob/release-3.4/wal/doc.go
