MyRocks and its usage scenario analysis

Source: https://zhuanlan.zhihu.com/p/45652076

Author: Wen Zhenghu (database kernel developer at NetEase Hangzhou Research Institute)

MyRocks is a MySQL variant optimized for storage space and write performance, and a credible option when selecting a database for your business. This article introduces what MyRocks is and what features it offers, focuses on its advantages over InnoDB, and analyzes in detail the scenarios where MyRocks is a good fit.

RocksDB was built by Facebook on top of Google's open-source LevelDB and stores data in an LSM (Log-Structured Merge) tree. Facebook engineers did substantial work on RocksDB to make it fit MySQL's pluggable storage engine framework; the result of porting it into MySQL is called MyRocks. MyRocks supports most MySQL features, such as SQL-based reads and writes, locking, MVCC, transactions, and master-slave replication. In day-to-day use there is little difference between MyRocks and MySQL/InnoDB.

After more than four years of development, MyRocks has matured. The open-source MySQL branches Percona Server and MariaDB have both integrated MyRocks. InnoSQL, NetEase's MySQL branch, also supports MyRocks as of InnoSQL 5.7.20-v4; on top of the open-source MyRocks code we added functional enhancements, bug fixes, and support for local and remote online physical backups. Below we briefly introduce MyRocks' features to give everyone a basic understanding. Since MyRocks only replaces InnoDB with RocksDB, the MySQL server layer is largely unchanged, including SQL parsing, execution plans, and binlog-based multi-threaded replication. Our discussion therefore focuses on the storage engine layer, i.e. on RocksDB.

This article has three parts. First, RocksDB's overall framework, storage backend, and functional characteristics are introduced by walking through its read and write paths; next, it is compared with InnoDB along several dimensions, together with the benefits of those differences; finally, we analyze which business scenarios can exploit these advantages of RocksDB. The article is long, so feel free to jump to the parts that interest you.

RocksDB read and write process

Writing process

The above figure sketches a RocksDB write request. Before commit, a transaction's modifications are written to the transaction thread's own WriteBatch (in this example the transaction performs only one Put, so the WriteBatch contains just that Put). At commit time, the batch is written to RocksDB's in-memory MemTable. A MemTable is essentially a SkipList, so the records it caches are ordered. As in InnoDB, the data changed by the transaction (the WriteBatch) is also written to the Write-Ahead Log (WAL) before commit. Once the transaction commits, only the WAL needs to be persisted; the data in the MemTable does not have to be written to data files on disk. When a MemTable reaches a size threshold (for example, 32MB), RocksDB creates a new MemTable and the old one becomes read-only (immutable), accepting no further writes. Immutable MemTables are dumped to sst files on disk by a background Flush thread. On disk, RocksDB stores data in sst files, while log files hold the WAL. The sst files are organized in levels; each level holds one or more sst files of roughly fixed size, and the deeper the level, the more files it holds, i.e. the larger the total size allowed for that level, as shown in the figure below.
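The write path above can be condensed into a minimal Python sketch. This is an illustration of the concepts only, not the real RocksDB API: class and method names (`LSMWriter`, `commit`, `_rotate`, `flush`) are invented for the example, and the MemTable is a plain dict rather than a skip list.

```python
# Illustrative sketch of the LSM write path: a transaction buffers changes in
# a WriteBatch; commit appends the batch to the WAL and applies it to the
# active MemTable; a full MemTable is frozen and later flushed to a sorted
# L0 "sst file". Names here are hypothetical, not RocksDB's API.

class LSMWriter:
    def __init__(self, memtable_limit=4):
        self.wal = []              # write-ahead log: persisted before commit returns
        self.memtable = {}         # active MemTable (the real one is a skip list)
        self.immutable = []        # frozen MemTables awaiting flush
        self.level0 = []           # flushed SST files (each a sorted list of kv pairs)
        self.memtable_limit = memtable_limit

    def commit(self, write_batch):
        # 1. durability: the batch goes to the WAL first
        self.wal.append(write_batch)
        # 2. apply to the in-memory MemTable; no data-file write is needed yet
        self.memtable.update(write_batch)
        if len(self.memtable) >= self.memtable_limit:
            self._rotate()

    def _rotate(self):
        # the active MemTable becomes read-only; a fresh one takes new writes
        self.immutable.append(self.memtable)
        self.memtable = {}

    def flush(self):
        # background Flush thread: dump immutable MemTables to sorted L0 SSTs
        for imm in self.immutable:
            self.level0.append(sorted(imm.items()))
        self.immutable = []

db = LSMWriter()
db.commit({"k3": "v3", "k1": "v1"})  # one transaction's WriteBatch
db.commit({"k2": "v2", "k9": "v9"})  # fills the MemTable and triggers rotation
db.flush()
print(db.level0)  # [[('k1', 'v1'), ('k2', 'v2'), ('k3', 'v3'), ('k9', 'v9')]]
```

Note that the flushed file is sorted even though the writes arrived out of order: ordering is established cheaply in memory, which is what lets the on-disk write stay sequential.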

In general, files dumped from memory land in Level0, where the key ranges of different sst files may overlap: for example, sst1 may hold keys 1, 4, 6, 9 while sst2 holds 5, 6, 10, 20. Because data is stored with LSM-tree techniques, a record can have multiple versions; both sst1 and sst2 contain record 6, but the version in sst2 is newer. Likewise, different versions of the same record can exist across different levels. Unlike Level0, at Level1 and deeper the sst files within a single level never contain overlapping records.

Compaction mechanism

Since there are multiple different record versions, there needs to be a mechanism for version merging, and this mechanism is Compaction.

The figure above shows a Level0 compaction: one or more Level0 files are compacted with the Level1 files. Whether dumping an in-memory MemTable to an sst file or compacting sst files, the I/O consists of sequential reads and writes. That benefits both SSDs and HDDs: an HDD's sequential throughput is far higher than its random throughput, and on an SSD it avoids the flash write amplification caused by random writes.
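The merge step can be sketched in a few lines of Python. This is a simplified model, assuming each input run fits in memory; real compaction streams the inputs with a multi-way merge, but the version-resolution rule is the same: when a key appears in several inputs, the newest version wins.

```python
# Hedged sketch of one leveled-compaction step: sorted input runs are merged
# into a new sorted output, with later (newer) inputs overriding older ones.
# Every input is scanned front to back and the output is written once,
# which is why compaction is sequential I/O.

def compact(inputs_old_to_new):
    """Merge sorted runs; inputs are ordered oldest first, so a later
    occurrence of a key overwrites the earlier version."""
    merged = {}
    for run in inputs_old_to_new:      # sequential scan of each input file
        for key, value in run:
            merged[key] = value        # newer version replaces older
    return sorted(merged.items())      # one sequential write of the output

level1 = [(1, "a@1"), (4, "b@1"), (6, "c@1")]            # older
level0 = [(5, "d@2"), (6, "c@2"), (20, "e@2")]           # newer; key 6 duplicated

print(compact([level1, level0]))
# [(1, 'a@1'), (4, 'b@1'), (5, 'd@2'), (6, 'c@2'), (20, 'e@2')]
```

Key 6 appears in both inputs; only the newer `c@2` survives, which is exactly the version-merging role the article assigns to compaction.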

Reading process

Having covered the RocksDB write path, let's look at the components involved in reads, as follows:

Database reads can be divided into current reads and snapshot reads. A current read returns the latest version of a record; a snapshot read returns a specified version. Here we discuss only current reads; snapshot reads can be analyzed similarly. Because of the LSM-tree storage structure, RocksDB's read path differs substantially from InnoDB's: a record may exist in multiple versions in the LSM tree (and, unlike InnoDB, there are no pointers linking earlier and later versions), so a strict binary search is not possible. RocksDB therefore introduces Bloom filters to optimize the read path. A Bloom filter can be maintained at three granularities, per data block, per partition, or per sst file, and is used to determine that a sought key is definitely not in a given block/partition/file. By default RocksDB uses per-data-block filters, the finest granularity.

Next, based on the two figures above, we briefly walk through the RocksDB read path. A Get(key=bbb) request first probes the current MemTable; on a miss it probes the read-only MemTables. If those also miss, the key-value is either in an sst file on disk or does not exist, so the metadata of the sst files is consulted to find every file whose key range contains the requested key, querying levels from shallow to deep. For each candidate sst file, the Bloom filter is checked first; on a hit, the relevant data block is read into the BlockCache and binary-searched, finally returning the matching record or NotFound, as shown in the figure below.
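The lookup order above can be sketched as follows. This is a toy model, not RocksDB code: the Bloom filter here is per sst file for brevity (RocksDB defaults to per-data-block filters), the filter parameters are arbitrary, and the final in-file lookup uses a dict where real code binary-searches a data block.

```python
# Sketch of the read path: MemTables first, then sst files from shallow to
# deep levels, with a Bloom filter pruning files that definitely lack the key.
# A filter can only prove absence; a hit still requires the real block search.

import hashlib

class BloomFilter:
    def __init__(self, nbits=64, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def may_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

def get(key, memtables, sst_files):
    # 1. active MemTable, then immutable MemTables
    for mt in memtables:
        if key in mt:
            return mt[key]
    # 2. candidate sst files, shallower levels first
    for bloom, data in sst_files:
        if not bloom.may_contain(key):
            continue                   # definitely not in this file: skip the I/O
        if key in data:                # real code reads + binary-searches a block
            return data[key]
    return "NotFound"

sst_data = {"aaa": 1, "bbb": 2}
bloom = BloomFilter()
for k in sst_data:
    bloom.add(k)

print(get("bbb", [{}], [(bloom, sst_data)]))  # 2
print(get("zzz", [{}], [(bloom, sst_data)]))  # NotFound
```

Even if "zzz" false-positives through the filter, the block search still returns NotFound; the filter only saves I/O, it never changes the answer.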

RocksDB column family

In RocksDB, a column family is a logically independent LSM tree. Each column family has its own MemTables, while all column families share a single WAL. Sst file compaction is performed per column family.

By default, a MyRocks instance has two column families: __system__ for system metadata and default for the data of all user-created tables. When defining a table, the user can assign an index to a named column family by adding a comment after the index. The following example puts the primary key and the unique index of rdbtable on the dedicated column families cf_pk and cf_uid.

CREATE TABLE `rdbtable` (
  `id` bigint(11) NOT NULL COMMENT 'primary key',
  `userId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'user ID',
  PRIMARY KEY (`id`) COMMENT 'cf_pk',
  UNIQUE KEY `uid` (`userId`) COMMENT 'cf_uid'
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8

Main features of MyRocks

Concurrency control

MyRocks implements transaction concurrency control with row locks, and lock information is kept in memory. MyRocks supports shared and exclusive row locks, using the RocksDB library for lock management when a transaction performs updates. Setting unique_checks=0 skips row locking and unique-key checks, which improves performance for bulk data imports, but you must then ensure yourself that keys are not duplicated. For this reason, the slaves of high-availability instances commonly disable unique checks to speed up binlog replay. MyRocks has not yet implemented gap locks, so phantom reads are possible; this matches the standard RR isolation level but is weaker than InnoDB's RR.

Transaction isolation level

MyRocks currently supports two transaction isolation levels: read committed (RC) and repeatable read (RR), both implemented with snapshots. Under repeatable read, one snapshot is held for the whole transaction, so all statements in the transaction see consistent data. Under read committed, a snapshot is taken for each statement, so a statement sees all modifications committed before it started executing. As in most database implementations, under RR the snapshot is taken when the transaction executes its first SQL statement, not when it begins (begin/start).

Like InnoDB, MyRocks supports MVCC-based snapshot reading, and snapshot reading does not require locking. MVCC is implemented through RocksDB snapshots, similar to InnoDB's read view.

Backup and restore

Like InnoDB, MyRocks supports both online physical backup and logical backup. Logical backup uses existing MySQL tools such as mysqldump or mydumper. Physical backup is done remotely with the myrocks_hotbackup tool shipped with MyRocks, or locally with MariaDB's mariabackup tool.

Comparative advantages with InnoDB

Anyone familiar with MySQL knows that InnoDB is the dominant storage engine on MySQL. It offers most of the features a relational storage engine should have, such as a powerful and complete transaction mechanism, and the MySQL team has made InnoDB an inseparable part of MySQL: newly added MySQL system tables use InnoDB rather than MyISAM. So why did Facebook build MyRocks on RocksDB instead of using InnoDB? Clearly, RocksDB must have advantages of its own. The following compares the two along several dimensions.

Smaller storage space

First, consider InnoDB's problems with storage-space utilization. InnoDB is based on B+ trees and cannot avoid structure modification operations (SMOs) on tree nodes. The figure below illustrates a leaf node split.

After user_id=31 is inserted, leaf node Block1 meets the split condition and is split down the middle into two blocks, each occupying about half of Block1's original space. The fill rate of each block is then below 50%, meaning that at this moment half of the space is internal fragmentation.

For sequential-insert workloads the block fill rate stays high, but for random workloads the space utilization of each block drops sharply. Overall, the storage space occupied by a table ends up much larger than the actual data requires.

RocksDB, based on the LSM tree, does not have this problem. Every insert, update, and delete is appended into a new sst file; data only needs to be ordered within each file, with no need to locate and modify a position in a globally ordered B+ tree. This sidesteps the B+ tree node fill-rate problem and improves space utilization.

Furthermore, RocksDB's sst files are leveled, with a total-size ratio between adjacent levels of up to 10. With large data volumes, the worst-case space amplification is only about 10%, a big improvement over InnoDB.
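The ~10% figure can be sanity-checked with simple arithmetic. This is a back-of-envelope model under the stated assumptions (leveled compaction, size ratio 10, live data concentrated in the bottom level), not a measurement: obsolete versions can live only in the non-bottom levels, which together hold at most a geometric fraction of the bottom level's size.

```python
# Back-of-envelope check of the worst-case space amplification for leveled
# compaction with a size ratio of 10 between adjacent levels: the levels
# above the bottom sum to at most 1/10 + 1/100 + ... of the live data size.

ratio = 10
levels_above_bottom = 5
overhead = sum(ratio ** -i for i in range(1, levels_above_bottom + 1))
print(f"worst-case extra space: {overhead:.1%} of live data")  # 11.1%
```

The series converges to 1/9 ≈ 11.1% regardless of how many levels are added, consistent with the article's "about 10%" claim.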

In addition, as shown in the figure above, RocksDB applies prefix encoding to record keys when storing them, and takes a similar approach to each row's metadata, further reducing the required storage space.

More efficient compression method

Earlier we introduced InnoDB's page-based compression mechanism. Roughly, it compresses the records of a 16KB page and then stores the result in a page of the configured size. For example, with key_block_size=8 the compressed page is stored as 8KB; if a page compresses to 5KB, 3KB of storage is wasted. InnoDB added transparent page compression in MySQL 5.7, but the problem above remains.

RocksDB's compression is not page-based and needs no key_block_size alignment. Each sst file only needs to be aligned to the filesystem block size (typically 4KB) after compression, so for an sst file of several megabytes the alignment overhead is under 4KB, far less than InnoDB's.
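The difference in alignment waste is easy to quantify. The sketch below just restates the article's own numbers as arithmetic; the 4KB filesystem block and the ~64MB sst file size are illustrative assumptions, and the function names are invented for the example.

```python
# Rough comparison of compression alignment waste, using the article's numbers:
# InnoDB pads every compressed 16KB page up to key_block_size, while an sst
# file pays padding once, up to the filesystem block size (assumed 4KB).

def innodb_waste_kb(page_compressed_kb, key_block_size_kb=8):
    # every page pays the padding: e.g. 5KB compressed -> stored as 8KB
    return key_block_size_kb - page_compressed_kb

def sst_waste_kb(file_size_kb, fs_block_kb=4):
    # the whole multi-MB file pays at most one block of padding
    remainder = file_size_kb % fs_block_kb
    return 0 if remainder == 0 else fs_block_kb - remainder

print(innodb_waste_kb(5))           # 3  -> 3KB wasted per 16KB page
print(sst_waste_kb(64 * 1024 + 1))  # 3  -> at most 3KB wasted per ~64MB file
```

Per unit of data, InnoDB's waste recurs on every page while the sst file's waste is a one-off tail, which is why the article calls it "far less".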

All told, MyRocks can save more than half the storage space compared with InnoDB.

Old version recycling optimization

In workloads with frequently updated records, a long-lived consistent snapshot read prevents InnoDB from purging old record versions, causing undo space to balloon. RocksDB effectively mitigates this problem, as the following example shows.

Suppose a consistent logical backup is running against MySQL: the backup transaction starts, and before it issues its SELECT on table t, the record with primary key 1 and value 0 undergoes one million increment operations. By the consistency rules, the backup must read the original record with value 0.

In InnoDB, because the backup transaction's id is smaller than those of the million increments, the million old record versions (i.e. undo records) cannot be purged. Reading the record during backup then requires walking back through a million versions, each step a random read of an undo page via the record's undo pointer, which is very inefficient.

RocksDB improves on InnoDB's purge behavior here. Suppose the original record has sequence number 2, the version visible to the backup transaction. When dumping a MemTable to an sst file, or when compacting sst files, RocksDB deletes the intermediate versions, retaining only the versions visible to currently active transactions plus the latest version of the record. This satisfies MVCC, speeds up snapshot reads, and reduces the storage required.
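The retention rule can be sketched as a small function. This is a simplified model of the behavior described above, not RocksDB code; the function name and data layout are invented for the illustration.

```python
# Sketch of the version-retention rule applied during flush/compaction:
# for each key, keep the newest version plus any version still visible to
# an active snapshot; every intermediate version can be dropped.

def retain_versions(versions, snapshots):
    """versions: list of (seq, value) for one key, ascending by seq.
    snapshots: sequence numbers of active snapshots."""
    keep = {versions[-1]}                      # the newest version always survives
    for snap in snapshots:
        visible = [v for v in versions if v[0] <= snap]
        if visible:
            keep.add(visible[-1])              # the version this snapshot sees
    return sorted(keep)

# key updated a million times while a backup holds a snapshot at seq=2
versions = [(2, 0)] + [(2 + i, i) for i in range(1, 1_000_001)]
kept = retain_versions(versions, snapshots=[2])
print(kept)  # [(2, 0), (1000002, 1000000)] -- two versions instead of a million
```

The backup's snapshot still reads value 0 directly, with no million-step version walk, which is the efficiency win the article describes.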

Smaller write magnification

In InnoDB, one record update must write the current record version to the undo log for rollback and MVCC (writing redo for the undo page before the undo page itself), then write redo for the updated record for crash recovery, and only then apply the update to the data page (possibly triggering a B+ tree node split). To guard against pages being corrupted by a crash mid-flush, the page must additionally be written once more to the doublewrite buffer on disk.

So a single update writes a lot of data. In random-update scenarios especially, when writing data pages and the doublewrite buffer, the write amplification ratio is page size / record size, which can be substantial.

RocksDB's write amplification depends on the number of sst levels: the worst case is roughly (n-2)*10, where n is the total number of levels. This is clearly much better than InnoDB.
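Putting numbers to both models makes the gap concrete. The figures below come from the article's own simplified formulas, with an assumed 200-byte record and a 6-level tree; they are illustrative arithmetic, not benchmark results.

```python
# Illustrative write-amplification arithmetic using the article's models.
# Record size (200B) and level count (6) are assumptions for the example.

def innodb_write_amp(page_size=16 * 1024, record_size=200):
    # random update: a whole page is written to the data file and again to
    # the doublewrite buffer to persist one small record
    return 2 * page_size / record_size

def rocksdb_write_amp(num_levels=6, fanout=10):
    # worst case from the article: roughly (n - 2) * fanout for n levels
    return (num_levels - 2) * fanout

print(innodb_write_amp())   # 163.84
print(rocksdb_write_amp())  # 40
```

Under these assumptions InnoDB rewrites roughly 164 bytes per byte of record, versus about 40 for RocksDB, matching the article's "much better" conclusion.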

Lower write amplification means the limited storage write capacity is used more efficiently; put differently, RocksDB can write more records before hitting the storage I/O performance bottleneck.

Moreover, every RocksDB insert, update, and delete is an append rather than an in-place update, so the storage backend sees only sequential writes, never random ones. For NAND-flash SSDs, setting aside the SSD's internal write-amplification optimizations, the same SSD will last longer under RocksDB than under InnoDB.

Faster write performance

As mentioned earlier, InnoDB updates records in place, so in random DML scenarios every record operation is a random write (even a secondary-index update, implemented as a delete followed by an insert of the new record, is random), as shown below.

RocksDB, by contrast, converts random writes into sequential writes, and the background multi-threaded compaction that merges old and new record versions is also batched sequential I/O. For bulk-insert scenarios, RocksDB can additionally turn off unique-key checking to further accelerate data import.

On HDDs, this optimization exploits the fact that a mechanical disk's sequential throughput far exceeds its random throughput. Even on SSDs, it still helps database performance.

Smaller master-slave delay

Compared with InnoDB, RocksDB also offers more DML optimization options on the slave side.

Transactions that can be replayed in parallel on a slave are guaranteed to be conflict-free, i.e. there are no lock-wait relationships between them. RocksDB therefore introduced the rpl_skip_tx_api option, which skips the record-locking operations normally needed for transaction isolation and thereby speeds up transaction replay.

Similarly, exploiting the properties of slave-side transactions, the unique-key check on inserts can be skipped, and for updates and deletes the lookup of the existing record can be skipped, because, barring implementation bugs, the affected record is guaranteed to satisfy the transaction's constraints.

Other features that InnoDB does not have

MyRocks implements descending indexes on MySQL 5.6/5.7, built on reverse-ordered column families; obviously such indexes cannot use the default column family. Exploiting LSM properties, MyRocks also implements TTL indexes, similar to HBase's, at very low cost. Compared with MongoDB's TTL implementation, which scans records and deletes them in batches, TTL under LSM storage costs nothing beyond storing a timestamp: expired records are simply dropped during compaction.

MyRocks applicable scenarios

Based on the above description, the business scenarios applicable to MyRocks can be summarized, including:

Big data business

Compared with InnoDB, RocksDB occupies less storage space and compresses more efficiently, making it well suited to large-data-volume businesses. The figure below shows Facebook's published comparison of RocksDB and InnoDB space usage.

The figure below shows publicly available compression data for RocksDB, InnoDB, and TokuDB.

Combining the figures above, the storage needed by RocksDB is much smaller than InnoDB's, and even beats TokuDB, which is known for its high compression ratio.

This has also been verified in NetEase's internal business tests. The DDB instance of a popular service was growing so fast that DBAs had to expand its tables frequently. After replacing InnoDB with MyRocks, a 165GB InnoDB table with compression enabled (key_block_size=8) shrank to just 51GB under MyRocks compression. The DDB instance comprises 8 MySQL high-availability instances, each DBN holding 10 InnoDB tables; after the switch to MyRocks, the instance's storage footprint dropped from 26TB to under 9TB. That saves about two-thirds (roughly 17TB) of the storage cost and stretches out the table-expansion cycle: if a DBA previously had to expand every quarter, once every three quarters now suffices.

Write-intensive business

MyRocks records DML operations append-only, turning random writes into sequential writes, which suits businesses with bulk inserts and frequent updates. The figure below is a bulk-insert performance comparison published by Alibaba Cloud: MyRocks achieved nearly double the performance of the InnoDB-based AliSQL.

An update-intensive business scenario inside NetEase also showed good results: besides write performance no weaker than a KV storage system's, MyRocks held an edge in read performance as well. The comparison follows:

The figure above shows the results of a 10-minute test under read-only, 1:1, and 2:1 mixed read/write loads; MyRocks does well on both throughput and latency.

The figure above shows write throughput and latency under 1:1 mixed read/write and write-only loads; MyRocks also performs well at 20 concurrent writers.

Cache persistence scheme

Because MyRocks uses space efficiently, the same amount of memory can cache more data than with InnoDB. Compared with Redis persistence alternatives such as Pika, it has a mature crash-recovery mechanism and master-slave replication architecture, and its low replication delay makes it easy to scale out read capacity. MyRocks is therefore also a good persistent alternative to a Redis cache.

Replace TokuDB

Compared with TokuDB, RocksDB/LevelDB offers write performance and compression ratios that are no worse, plus better read performance. As a storage engine it is used by mainstream database systems such as MySQL, MongoDB, Kudu, and TiDB, and it enjoys better open-source community support, faster problem diagnosis and bug fixing, and more readable source code. With TokuDB's prospects dimming, MyRocks can be used to replace existing online TokuDB instances.

Low cost and low latency slave library

MyRocks' better write performance, combined with slave-targeted parameter tuning, achieves lower replication latency than InnoDB. Together with its smaller storage footprint, this makes it well suited to special-purpose slaves, such as delayed slaves that protect against accidental deletion of online data, or slaves dedicated to big-data statistics and analysis.

Summary

In summary, compared with InnoDB, MyRocks occupies less storage space, lowering storage costs and improving hot-data cache efficiency; it has a smaller write-amplification ratio, using storage I/O bandwidth more efficiently; it turns random writes into sequential writes, improving write performance and extending SSD lifespan; and with parameter tuning it reduces master-slave replication delay. It is therefore a strong fit for large-data-volume and write-intensive business scenarios. In addition, as a comparable write- and space-optimized MySQL solution with a healthier community ecosystem, MyRocks is well suited to replacing TokuDB instances. Its efficient cache utilization and mature crash recovery and master-slave replication also make it a viable Redis persistence solution.

Reference materials:

1. RocksDB implementation analysis: http://ks.netease.com/blog?id=10818

2. RocksDB wiki: https://github.com/facebook/rocksdb/wiki

3. RocksDB-related documents and slides published by Facebook, Percona, and Alibaba

The full text is over.

Enjoy Linux & MySQL :)


Origin blog.csdn.net/n88Lpo/article/details/108970551