MyRocks Write Analysis

Original link: http://click.aliyun.com/m/26466/

title: MySQL · myrocks · myrocks writing analysis

author: Zhang Yuan

Writing process
The MyRocks write path can be divided into three steps:

1. Write the parsed record (kTypeValue/kTypeDeletion) into the WriteBatch.
2. Write the WAL log to the log file.
3. Write the contents of the WriteBatch to the memtable, which completes the transaction.

Steps 2 and 3 are performed at commit time.
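
The three steps above can be sketched with a toy model. This is not RocksDB code: `ToyDB` and its methods are hypothetical names, the WAL is a plain list, and the memtable is a dict rather than a skip list. It only illustrates that records stay in the batch until commit, and that commit writes the WAL before the memtable.

```python
# Toy model of the three-step write path; not RocksDB code.
# ToyDB, put, delete, and commit are hypothetical names for illustration.

K_TYPE_DELETION = 0x0  # mirrors RocksDB's kTypeDeletion tag
K_TYPE_VALUE = 0x1     # mirrors RocksDB's kTypeValue tag


class ToyDB:
    def __init__(self):
        self.wal = []       # stands in for the WAL log file
        self.memtable = {}  # stands in for the memtable
        self.batch = []     # stands in for the transaction's WriteBatch

    def put(self, key, value):
        # Step 1: buffer the record in the batch; nothing is visible yet.
        self.batch.append((K_TYPE_VALUE, key, value))

    def delete(self, key):
        # Step 1 for a delete: buffer a deletion record.
        self.batch.append((K_TYPE_DELETION, key, None))

    def commit(self):
        # Step 2: append the whole batch to the WAL first.
        self.wal.append(list(self.batch))
        # Step 3: apply the batch to the memtable.
        for rec_type, key, value in self.batch:
            # A deletion is kept as a tombstone entry, not removed in place.
            self.memtable[key] = (rec_type, value)
        self.batch.clear()


db = ToyDB()
db.put("k1", "v1")
db.delete("k2")
visible_before_commit = dict(db.memtable)  # still empty at this point
db.commit()
```

Before `commit()` the memtable is untouched, which is why steps 2 and 3 can be deferred to commit time.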

WriteBatch is closely tied to MyRocks transaction processing. Before commit, the records of a transaction are stored as a string in WriteBatch->rep_, so they either all commit or all roll back. The rollback logic is relatively simple: it only needs to clear WriteBatch->rep_. See TransactionImpl::Rollback for details.

The stack for a simple insert writing to the WriteBatch is as follows:

#0  rocksdb::WriteBatchInternal::Put
#1  rocksdb::WriteBatch::Put
#2  myrocks::ha_rocksdb::update_pk
#3  myrocks::ha_rocksdb::update_indexes
#4  myrocks::ha_rocksdb::update_write_row
#5  myrocks::ha_rocksdb::write_row
#6  handler::ha_write_row
#7  write_record
#8  mysql_insert
#9  mysql_execute_command
#10 mysql_parse
#11 dispatch_command
#12 do_command
#13 do_handle_one_connection
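
To make the "records stored as a string in rep_" idea concrete, here is a simplified sketch. It assumes the layout I understand RocksDB to use for the default column family (a 12-byte header holding an 8-byte sequence number and a 4-byte little-endian count, followed by tagged, length-prefixed records). `ToyWriteBatch` is a hypothetical class, and details such as column family IDs are left out.

```python
import struct

K_TYPE_DELETION = 0x0  # record tag for a delete
K_TYPE_VALUE = 0x1     # record tag for a put
HEADER_SIZE = 12       # assumed: 8-byte sequence number + 4-byte record count


def encode_varint32(n):
    """Encode a non-negative integer as a little-endian base-128 varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


class ToyWriteBatch:
    """Hypothetical stand-in for WriteBatch; rep plays the role of rep_."""

    def __init__(self):
        self.rep = bytearray(HEADER_SIZE)

    def count(self):
        return struct.unpack_from("<I", self.rep, 8)[0]

    def _bump_count(self):
        struct.pack_into("<I", self.rep, 8, self.count() + 1)

    def put(self, key, value):
        self._bump_count()
        self.rep.append(K_TYPE_VALUE)
        self.rep += encode_varint32(len(key)) + key
        self.rep += encode_varint32(len(value)) + value

    def delete(self, key):
        self._bump_count()
        self.rep.append(K_TYPE_DELETION)
        self.rep += encode_varint32(len(key)) + key

    def clear(self):
        # Rollback: drop every buffered record and keep an empty header.
        self.rep = bytearray(HEADER_SIZE)


batch = ToyWriteBatch()
batch.put(b"pk1", b"row")
batch.delete(b"pk2")
records = bytes(batch.rep[HEADER_SIZE:])
count_before_rollback = batch.count()
batch.clear()
```

Because nothing outside the buffer has been touched before commit, clearing the buffer is all a rollback needs to do.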
The stack for a simple insert commit is as follows:

#0  rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert
#1  rocksdb::(anonymous namespace)::SkipListRep::Insert
#2  rocksdb::MemTable::Add
#3  rocksdb::MemTableInserter::PutCF
#4  rocksdb::WriteBatch::Iterate
#5  rocksdb::WriteBatch::Iterate
#6  rocksdb::WriteBatchInternal::InsertInto
#7  rocksdb::DBImpl::WriteImpl
#8  rocksdb::DBImpl::Write
#9  rocksdb::TransactionImpl::Commit
#10 myrocks::Rdb_transaction_impl::commit_no_binlog
#11 myrocks::Rdb_transaction::commit
#12 myrocks::rocksdb_commit
#13 ha_commit_low
#14 TC_LOG_MMAP::commit
#15 ha_commit_trans
#16 trans_commit_stmt
#17 mysql_execute_command
#18 mysql_parse
#19 dispatch_command
#20 do_command
#21 do_handle_one_connection

Commit process and optimization
Only the commit process of the RocksDB engine is analyzed here. In an actual MyRocks commit, the binlog must be written first (when binlog is enabled).

When the RocksDB engine commits, it does two things:
1. Write the WAL log (when the WAL is enabled, i.e. rocksdb_write_disable_wal=off)
2. Write the previously built WriteBatch to the memtable

However, writing the WAL is a serial operation. To make commits more efficient, RocksDB introduces a group commit mechanism.

Transactions waiting to commit are appended in order to the writer queue. The queue is divided into groups; each group has one leader, and the remaining writers are followers. The leader is responsible for writing the WAL in batch. Each group is linked through a doubly linked list (link_older, link_newer), as shown in the figure below.

Screenshot 2017-07-11 7.46.22.png

The possible states of each writer are as follows:

Init: the initial state of the writer
Leader: the writer has been selected as the leader
Follower: the writer has been selected as a follower
LockedWaiting: the writer is waiting to be moved to a specified state
Completed: the writer's operation is complete

The state transitions of a writer depend on whether the group writes the memtable concurrently.
When concurrent memtable writes are enabled (rocksdb_allow_concurrent_memtable_write=on) and the group has at least two writers, the group writes concurrently.
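
The rule just stated can be written as a small predicate. This is only a restatement of the two conditions named in the article; the function name is hypothetical, and the real check in RocksDB may include conditions the article does not mention.

```python
def group_writes_concurrently(allow_concurrent_memtable_write, group_size):
    """Return True when a group would write the memtable concurrently.

    Hypothetical helper: encodes only the two conditions given in the text
    (the option is on, and the group has at least two writers).
    """
    return bool(allow_concurrent_memtable_write) and group_size >= 2


single_writer_group = group_writes_concurrently(True, 1)
multi_writer_group = group_writes_concurrently(True, 3)
option_disabled = group_writes_concurrently(False, 3)
```

A group with a single writer never takes the concurrent path, since the leader has no followers to run in parallel with.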

The state transition diagram of the writer when the group writes concurrently is as follows:

Screenshot 2017-07-14 pm 1.25.27.png

The state transition diagram of the writer when the group is not concurrently writing is as follows:

Screenshot 2017-07-11 PM 7.46.50.png

The code structure diagram is as follows (figure from Lin Qing):
Screenshot 2017-07-14 PM 1.44.46.png

The figure above shows the case where the writers in a group write to the memtable concurrently.

When the memtable is not written concurrently, LaunchParallelFollowers/CompleteParallelWorker are not involved, and the leader inserts into the memtable serially.

Group commit has the following key points:
1. There is only one leader at any time; the next leader is set only after the current leader finishes its operation.
2. One group must complete before the next group can proceed.
3. The last writer in the group to finish is responsible for completing the commit and setting the next leader.
4. The leader is responsible for writing the WAL in batch.
5. Only the leader adjusts the doubly linked list (link_older, link_newer).
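
The points above can be illustrated with a single-threaded simulation. It is a sketch under strong assumptions: `run_group_commit` and `max_group_size` are hypothetical (the article does not say how a group's size is bounded), real writers are threads, and the memtable step here is always serial.

```python
from collections import deque


def run_group_commit(pending, max_group_size):
    """Simulate group commit over (writer_name, batch) pairs in arrival order.

    Hypothetical model: groups are cut by a fixed size, which is an
    assumption made only for this illustration.
    """
    queue = deque(pending)
    leaders = []     # who led each group, in order
    wal_writes = []  # one entry per physical WAL write
    memtable = []    # records in the order they were applied

    while queue:
        # The writer at the head of the queue becomes the only leader.
        size = min(max_group_size, len(queue))
        group = [queue.popleft() for _ in range(size)]
        leaders.append(group[0][0])

        # The leader writes the WAL once for the whole group.
        wal_writes.append([rec for _, batch in group for rec in batch])

        # Memtable inserts for the group (done serially in this sketch).
        for _, batch in group:
            memtable.extend(batch)
        # Only now, with the group complete, does the loop pick the next leader.

    return leaders, wal_writes, memtable


pending = [("w1", ["a"]), ("w2", ["b"]), ("w3", ["c"])]
leaders, wal_writes, memtable = run_group_commit(pending, max_group_size=2)
```

Three writers produce only two WAL writes here, which is the saving group commit is after.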

Note that points 2 and 3 could be improved: there is no need to wait for one group to complete before proceeding to the next group; the followers of a group can execute concurrently; and only the leader needs to be responsible for completing the commit and setting the next leader.

Write control
RocksDB must consider the following situations when committing a write. For details, see PreprocessWrite.

1. The WAL log is full. When the WAL log exceeds rocksdb_max_total_wal_size, RocksDB finds, among all column families, the one holding the oldest log (the earliest log containing a prepared section) and flushes it to release WAL space.
2. The write buffer is full. When the global write buffer exceeds rocksdb_db_write_buffer_size, RocksDB finds the earliest-created memtable among all column families and switches it. For details, see HandleWriteBufferFull.
3. Delayed writes are triggered when either of these holds: max_write_buffer_number > 3 and the number of unflushed immutable memtables >= max_write_buffer_number - 1; or automatic compaction is enabled and the number of level0 files >= level0_slowdown_writes_trigger.
4. A write stop is triggered when either of these holds: the number of unflushed immutable memtables >= max_write_buffer_number; or automatic compaction is enabled and the number of level0 files >= level0_stop_writes_trigger.

For details on items 3 and 4, see RecalculateWriteStallConditions.
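
The delay and stop conditions can be collected into one function. This is a sketch covering only the conditions listed in this article: the function name and its return strings are hypothetical, checking the stop conditions before the delay conditions is my assumption, and RocksDB may apply further conditions not described here.

```python
def write_stall_state(num_unflushed_imm, num_l0_files,
                      max_write_buffer_number,
                      level0_slowdown_writes_trigger,
                      level0_stop_writes_trigger,
                      auto_compaction_enabled=True):
    """Classify the write state from the conditions listed in the article.

    Hypothetical helper; returns "stopped", "delayed", or "normal".
    """
    # Stop conditions are checked first (an assumption of this sketch).
    if num_unflushed_imm >= max_write_buffer_number:
        return "stopped"
    if auto_compaction_enabled and num_l0_files >= level0_stop_writes_trigger:
        return "stopped"

    # Delay conditions.
    if (max_write_buffer_number > 3
            and num_unflushed_imm >= max_write_buffer_number - 1):
        return "delayed"
    if auto_compaction_enabled and num_l0_files >= level0_slowdown_writes_trigger:
        return "delayed"

    return "normal"


# Example values below are arbitrary and chosen only for illustration.
state_full_memtables = write_stall_state(4, 0, 4, 20, 36)
state_one_slot_left = write_stall_state(3, 0, 4, 20, 36)
state_small_buffer_count = write_stall_state(1, 0, 2, 20, 36)
state_l0_slowdown = write_stall_state(0, 20, 4, 20, 36)
state_l0_stop = write_stall_state(0, 36, 4, 20, 36)
state_no_auto_compaction = write_stall_state(0, 40, 4, 20, 36,
                                             auto_compaction_enabled=False)
```

Note that with max_write_buffer_number of 3 or less, the memtable-count delay condition never applies, so writes go straight from normal to stopped on that axis.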

Summary
The RocksDB write path still has room for optimization, and Facebook has related optimizations as well.
