ClickHouse and Friends (10): MergeTree Write-Ahead Log

Original source: https://bohutang.me/2020/08/18/clickhouse-and-friends-merge-tree-wal/

Last Update: 2020-09-18

To improve write performance, a database system typically writes data to memory first and flushes it to disk once enough has accumulated, as MySQL does with its buffer pool.

Because the data lands in memory first, a Write-Ahead Log (WAL) is needed to keep that in-memory data safe if the server crashes.

Today we will look at the new MergeTreeWriteAheadLog (https://github.com/ClickHouse/ClickHouse/pull/8290) module added to ClickHouse, and the problem it solves.

High-frequency writing problem

With the ClickHouse MergeTree engine, every write (even a single row) creates a part directory on disk, which then waits for the merge thread to merge it.

If many clients each write small amounts of data at high frequency, the server raises the error DB::Exception: Too many parts.

This places certain requirements on the client, such as batch writing.

Alternatively, write to a Buffer engine table that periodically flushes to MergeTree. The drawback is that data can be lost if the server goes down.
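
The client-side batching mentioned above can be sketched as follows. This is only an illustration: BatchingWriter, its max_rows parameter and the flush callback are hypothetical names, not part of any ClickHouse client, and a plain list stands in for the INSERT statements that would be sent to the server.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int, int]  # (a, b, c), matching the example table's columns


class BatchingWriter:
    """Hypothetical client-side helper: buffers rows and sends them in bulk."""

    def __init__(self, flush: Callable[[List[Row]], None], max_rows: int = 1000):
        self._flush = flush        # called with one batch per bulk insert
        self._max_rows = max_rows  # flush threshold (an assumption, tune as needed)
        self._rows: List[Row] = []

    def write(self, row: Row) -> None:
        self._rows.append(row)
        if len(self._rows) >= self._max_rows:
            self.flush()

    def flush(self) -> None:
        if self._rows:
            self._flush(self._rows)
            self._rows = []  # start a fresh buffer; the sent batch is not mutated


inserts: List[List[Row]] = []  # stands in for INSERTs sent to the server
writer = BatchingWriter(inserts.append, max_rows=3)
for i in range(7):
    writer.write((1, i, i))
writer.flush()  # send whatever is left

batch_sizes = [len(batch) for batch in inserts]  # [3, 3, 1]
```

Seven single-row writes become three inserts, so the server creates three parts instead of seven. Rows still sitting in the client buffer are lost if the client dies, which is the same trade-off as the Buffer engine.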

MergeTree WAL

1. Default mode

Let's take a look at how MergeTree is written without WAL:

Each write to MergeTree creates a part directory directly on disk and writes the part's data into it. In effect, this mode fuses the WAL and the data into a single step.

Obviously, this mode is not suitable for frequent writes: it generates a large number of part directories and files and leads to Too many parts errors.

2. WAL mode

Set SETTINGS min_rows_for_compact_part=2 and execute two INSERT statements one after the other; the data is written to the wal.bin file first:

Once min_rows_for_compact_part=2 is satisfied, the merge thread triggers a merge that produces part 1_1_2_1; that is, it merges the two parts 1_1_1_0 and 1_2_2_0 held in wal.bin. When we then execute a third INSERT:

insert into default.mt(a,b,c) values(1,3,3)

The new data block (part) is appended to the end of wal.bin:

At this point, the three rows live in two places: part 1_1_2_1 on disk, and part 1_3_3_0 in wal.bin.

This raises a question: when we run a query, how is the data from the two places combined?
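
The part names above appear to follow a partition_minblock_maxblock_level pattern. The sketch below is a minimal illustration under that assumption: PartInfo and merged are hypothetical helpers, not ClickHouse code, and real part names may carry extra fields that are ignored here. It shows how merging 1_1_1_0 and 1_2_2_0 yields 1_1_2_1.

```python
from typing import List, NamedTuple


class PartInfo(NamedTuple):
    """Illustrative model of a part name: partition_minblock_maxblock_level."""
    partition_id: str
    min_block: int
    max_block: int
    level: int

    @classmethod
    def parse(cls, name: str) -> "PartInfo":
        # Split from the right so a partition ID containing "_" still parses.
        partition_id, min_block, max_block, level = name.rsplit("_", 3)
        return cls(partition_id, int(min_block), int(max_block), int(level))

    def name(self) -> str:
        return f"{self.partition_id}_{self.min_block}_{self.max_block}_{self.level}"


def merged(parts: List[PartInfo]) -> PartInfo:
    """Name of the part produced by merging `parts` (same partition assumed)."""
    if len({p.partition_id for p in parts}) != 1:
        raise ValueError("parts must belong to one partition")
    return PartInfo(
        parts[0].partition_id,
        min(p.min_block for p in parts),   # covers the lowest block number
        max(p.max_block for p in parts),   # through the highest block number
        max(p.level for p in parts) + 1,   # one merge level deeper
    )


result = merged([PartInfo.parse("1_1_1_0"), PartInfo.parse("1_2_2_0")])
merged_name = result.name()  # "1_1_2_1"
```

Read this way, 1_3_3_0 is a fresh, never-merged part holding block 3 of partition 1, which is why it is not covered by 1_1_2_1.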

MergeTree uses a global structure, data_parts_indexes, to maintain part information. When the server starts, the MergeTreeData::loadDataParts method does the following:

1. data_parts_indexes.insert(1_1_2_1)
2. Read wal.bin and use getActiveContainingPart to check whether each part has already been merged to disk: 1_1_1_0 already exists, 1_2_2_0 already exists, so only data_parts_indexes.insert(1_3_3_0)
3. data_parts_indexes:{1_1_2_1,1_3_3_0}

In this way, MergeTree always maintains a complete, global view of its parts.
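
The startup steps above can be simulated in a few lines. This is a rough sketch and not the actual C++ implementation: load_data_parts and PartInfo.contains are hypothetical stand-ins for MergeTreeData::loadDataParts and the check done through getActiveContainingPart, and the containment rule (same partition, wider block range, level at least as high) is an assumption that matches the example.

```python
from typing import Iterable, List, NamedTuple


class PartInfo(NamedTuple):
    partition_id: str
    min_block: int
    max_block: int
    level: int

    @classmethod
    def parse(cls, name: str) -> "PartInfo":
        partition_id, min_block, max_block, level = name.rsplit("_", 3)
        return cls(partition_id, int(min_block), int(max_block), int(level))

    def name(self) -> str:
        return f"{self.partition_id}_{self.min_block}_{self.max_block}_{self.level}"

    def contains(self, other: "PartInfo") -> bool:
        # Assumed rule: `other` is covered if it was already merged into self.
        return (
            self.partition_id == other.partition_id
            and self.min_block <= other.min_block
            and self.max_block >= other.max_block
            and self.level >= other.level
        )


def load_data_parts(disk_parts: Iterable[str], wal_parts: Iterable[str]) -> List[str]:
    # Step 1: parts found on disk go straight into the index.
    index = [PartInfo.parse(name) for name in disk_parts]
    # Step 2: replay the WAL, skipping parts already covered by an indexed part.
    for name in wal_parts:
        part = PartInfo.parse(name)
        if any(active.contains(part) for active in index):
            continue
        index.append(part)
    # Step 3: the resulting index is the set of active parts.
    return [p.name() for p in sorted(index)]


active = load_data_parts(
    disk_parts=["1_1_2_1"],
    wal_parts=["1_1_1_0", "1_2_2_0", "1_3_3_0"],
)
# active == ["1_1_2_1", "1_3_3_0"]
```

Because 1_1_1_0 and 1_2_2_0 both fall inside the block range of 1_1_2_1, replaying the WAL adds only 1_3_3_0, so no row is counted twice at query time.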

Summary

The WAL feature was implemented in PR#8290 (https://github.com/ClickHouse/ClickHouse/pull/8290), and it is already enabled by default on the master branch.

MergeTree uses the WAL to protect high-frequency, low-volume writes from clients, reducing the number of directories and files on the server and keeping client-side operations as simple and efficient as possible.

Enjoy ClickHouse :)