【ClickHouse source code】Introduction to ReadIndirectBufferFromRemoteFS

ReadIndirectBufferFromRemoteFS is, as the name suggests, an indirect ReadBuffer created for remote file systems. Because a remote file system cannot operate on files the way a local file system does, the necessary interface is abstracted through ReadBuffer, allowing various concrete ReadBuffers to be implemented. ReadIndirectBufferFromRemoteFS is one of them.

The header file is as follows:

class ReadIndirectBufferFromRemoteFS : public ReadBufferFromFileBase
{

public:
    explicit ReadIndirectBufferFromRemoteFS(std::shared_ptr<ReadBufferFromRemoteFSGather> impl_);

    off_t seek(off_t offset_, int whence) override;

    off_t getPosition() override;

    String getFileName() const override;

    void setReadUntilPosition(size_t position) override;

    void setReadUntilEnd() override;

private:
    bool nextImpl() override;

    std::shared_ptr<ReadBufferFromRemoteFSGather> impl;

    size_t file_offset_of_buffer_end = 0;
};

The functions marked with override are the ones that must be overridden from the base class.

As this kind of wrapper suggests, the ReadBuffer contains an inner ReadBuffer that can actually read remote files: the member variable impl, of type ReadBufferFromRemoteFSGather. Let's look at the main nextImpl function:

bool ReadIndirectBufferFromRemoteFS::nextImpl()
{
    /// Transfer current position and working_buffer to actual ReadBuffer
    swap(*impl);

    assert(!impl->hasPendingData());
    /// Position and working_buffer will be updated in next() call
    auto result = impl->next();
    /// and assigned to current buffer.
    swap(*impl);

    if (result)
    {
        file_offset_of_buffer_end += available();
        BufferBase::set(working_buffer.begin() + offset(), available(), 0);
    }

    assert(file_offset_of_buffer_end == impl->file_offset_of_buffer_end);

    return result;
}

First, swap hands the current position and working_buffer over to impl, so that impl operates on the up-to-date buffer state. Then impl->next() reads data into that buffer. After the read completes, swap is called again so the freshly filled buffer state comes back to ReadIndirectBufferFromRemoteFS, and file_offset_of_buffer_end is updated, completing one next() read. In fact, most next() implementations follow roughly this pattern.
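The swap-based delegation above can be sketched with a pair of toy classes. This is not ClickHouse code: ToyBuffer stands in for BufferBase, ToyRemoteSource for the inner gather buffer, and ToyIndirect for the outer indirect buffer. The point is that the outer buffer has no reading logic of its own; it swaps state in, lets the inner buffer refill it, and swaps it back.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimal sketch (hypothetical types, not ClickHouse code) of the swap-based
// delegation in ReadIndirectBufferFromRemoteFS::nextImpl().
struct ToyBuffer
{
    std::vector<char> working;   // stands in for working_buffer
    size_t pos = 0;              // read position inside `working`

    size_t available() const { return working.size() - pos; }

    void swapState(ToyBuffer & other)
    {
        std::swap(working, other.working);
        std::swap(pos, other.pos);
    }
};

// Stands in for the inner ReadBufferFromRemoteFSGather: each next() call
// "reads" one chunk from a pre-canned source into its own working buffer.
struct ToyRemoteSource : ToyBuffer
{
    std::vector<std::string> chunks;
    size_t next_chunk = 0;

    bool next()
    {
        if (next_chunk >= chunks.size())
            return false;
        const std::string & c = chunks[next_chunk++];
        working.assign(c.begin(), c.end());
        pos = 0;
        return true;
    }
};

// Stands in for the outer indirect buffer: swap in, delegate, swap back.
struct ToyIndirect : ToyBuffer
{
    ToyRemoteSource impl;
    size_t file_offset_of_buffer_end = 0;

    bool next()
    {
        swapState(impl);            // hand current state to the inner buffer
        bool result = impl.next();  // inner buffer refills its working buffer
        swapState(impl);            // take the refilled state back
        if (result)
            file_offset_of_buffer_end += available();
        return result;
    }

    std::string read()              // drain the current working buffer
    {
        std::string s(working.begin() + pos, working.end());
        pos = working.size();
        return s;
    }
};
```

Each ToyIndirect::next() advances file_offset_of_buffer_end by exactly the number of bytes the inner buffer produced, mirroring the real nextImpl shown above.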

The next() of ReadIndirectBufferFromRemoteFS thus actually calls impl's next(). ReadBufferFromRemoteFSGather is a proxy layer over remote files: a ClickHouse data file may be very large, and on a remote file system that one file may correspond to multiple remote objects. ReadBufferFromRemoteFSGather therefore encapsulates the reading of one ClickHouse data file, hiding the actual read details so that the upper layer feels like it is reading a single remote file.

ReadBufferFromRemoteFSGather

The purpose of ReadBufferFromRemoteFSGather is to read the corresponding remote files one by one.

Constructor

ReadBufferFromRemoteFSGather(
    const std::string & common_path_prefix_,
    const BlobsPathToSize & blobs_to_read_,
    const ReadSettings & settings_);

Looking at its constructor, the parameters are common_path_prefix_, blobs_to_read_ and settings_. common_path_prefix_ is the path prefix on the remote file system; for S3, common_path_prefix_ is {bucket}/{path_pre}/. BlobsPathToSize is an ordered vector storing all the remote files and their sizes. In S3 terms, these are multiple objects and their object sizes.

createImplementationBufferImpl

This is a virtual function of ReadBufferFromRemoteFSGather. Since the remote file system is not limited to S3 (there are also HDFS, Blob storage, etc.), the ImplementationBuffer — the buffer that interacts directly with the remote file system — needs a different implementation for each backend; for S3 the implementation class is ReadBufferFromS3Gather. Look at the code:

SeekableReadBufferPtr ReadBufferFromS3Gather::createImplementationBufferImpl(const String & path, size_t file_size)
{
    auto remote_path = fs::path(common_path_prefix) / path;
    auto remote_file_reader_creator = [=, this]()
    {
        return std::make_unique<ReadBufferFromS3>(
            client_ptr, bucket, remote_path, version_id, max_single_read_retries,
            settings, /* use_external_buffer */true, /* offset */ 0, read_until_position, /* restricted_seek */true);
    };

    if (with_cache)
    {
        return std::make_shared<CachedReadBufferFromRemoteFS>(
            remote_path, settings.remote_fs_cache, remote_file_reader_creator, settings, query_id, read_until_position ? read_until_position : file_size);
    }

    return remote_file_reader_creator();
}

There are two main parts of logic:

If the cache is enabled, CachedReadBufferFromRemoteFS is used; otherwise a bare ReadBufferFromS3 is used directly. Note that even when CachedReadBufferFromRemoteFS is used, ReadBufferFromS3 is still needed to fetch remote file data when there is no local cache hit — that is what the remote_file_reader_creator lambda is for.
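The creator-lambda pattern above can be illustrated with a small sketch. The types here (Reader, createBuffer) are hypothetical stand-ins, not ClickHouse classes: the lambda captures everything needed to build the bare remote reader, so the cached path can defer (and, on a cache hit, skip entirely) the construction of the remote reader.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Sketch (hypothetical types, not ClickHouse code) of the creator-lambda
// pattern in createImplementationBufferImpl.
struct Reader { std::string source; };
using ReaderCreator = std::function<std::shared_ptr<Reader>()>;

std::shared_ptr<Reader> createBuffer(const std::string & path, bool with_cache, int & remote_constructions)
{
    ReaderCreator creator = [path, &remote_constructions]()
    {
        ++remote_constructions;                 // a real remote reader is built here
        return std::make_shared<Reader>(Reader{"remote:" + path});
    };

    if (with_cache)
    {
        // Stand-in for CachedReadBufferFromRemoteFS: in this sketch the cache
        // "hits", so the creator is never invoked and no remote reader exists.
        // On a miss, the real cached buffer would call creator() internally.
        return std::make_shared<Reader>(Reader{"cache:" + path});
    }

    return creator();                           // no cache: build it eagerly
}
```

The counter makes the deferral visible: with the cache, no remote reader is constructed at all.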

The nextImpl implementation of ReadBufferFromS3 is itself somewhat involved. For now it is enough to know that ReadBufferFromS3 initializes a stream over an S3 object via initialize(), and each next() call pulls data from that stream into the buffer, reading at most buffer_size bytes for the upper layer to consume.

nextImpl

Now look at the nextImpl function of ReadBufferFromRemoteFSGather. The next() of ReadIndirectBufferFromRemoteFS mentioned above actually ends up calling this nextImpl. Inside nextImpl, another layer, readImpl, is factored out for convenience. The code is as follows:

bool ReadBufferFromRemoteFSGather::readImpl()
{
    // step 1
    swap(*current_buf);

    bool result = false;

    /**
     * Lazy seek is performed here.
     * In asynchronous buffer when seeking to offset in range [pos, pos + min_bytes_for_seek]
     * we save how many bytes need to be ignored (new_offset - position() bytes).
     */
    // step 2
    if (bytes_to_ignore)
    {
        total_bytes_read_from_current_file += bytes_to_ignore;
        current_buf->ignore(bytes_to_ignore);
        result = current_buf->hasPendingData();
        file_offset_of_buffer_end += bytes_to_ignore;
        bytes_to_ignore = 0;
    }

    // step 3
    if (!result)
        result = current_buf->next();

    if (blobs_to_read.size() == 1)
    {
        file_offset_of_buffer_end = current_buf->getFileOffsetOfBufferEnd();
    }
    else
    {
        /// For log family engines there are multiple s3 files for the same clickhouse file
        file_offset_of_buffer_end += current_buf->available();
    }

    // step 4
    swap(*current_buf);

    /// Required for non-async reads.
  
    // step 5
    if (result)
    {
        assert(available());
        nextimpl_working_buffer_offset = offset();
        total_bytes_read_from_current_file += available();
    }

    return result;
}

Step1

Here S3 is taken as the example for the whole flow, so current_buf is the buffer created by ReadBufferFromS3Gather, set up by a call to initialize(). initialize() walks the sizes of all remote files according to file_offset_of_buffer_end (since the sizes are stored in BlobsPathToSize, this requires no access to the remote file system) to determine which file to start reading from, as shown in the figure:

clickhouse file:  [------------------------------------------------]
                      file1          file2        file3      file4
remote files:     [------------][------------][-----------][-------]

need_to_read:                        [___________________________]
                                     ^
                                     file_offset_of_buffer_end
need_to_seek:                    [___]

As shown in the figure above, one bin file corresponds to multiple S3 objects, and each object records its path and size. Given file_offset_of_buffer_end, a for loop traverses from the first object with a temporary variable offset = file_offset_of_buffer_end. If the first object's size is greater than offset, the read should start at offset inside that first object. Otherwise the first object must be skipped: offset has the first object's size subtracted from it, becoming a relative offset into the second object, and the same check is applied to the second object, and so on. In the figure, once the traversal reaches file2 the starting point has been found: file2 is selected, need_to_seek bytes are skipped, and current_buf is constructed and returned.
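The blob-selection loop described above can be sketched as follows. The names findBlob and BlobPosition are hypothetical, not ClickHouse identifiers; the logic is the traversal just described: walk the ordered (path, size) list, subtracting each skipped object's size from the offset until an object contains it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch (hypothetical names, not the actual ClickHouse code) of the
// blob-selection traversal described above.
struct BlobPathToSize
{
    std::string path;
    size_t size;
};

struct BlobPosition
{
    size_t blob_index;     // which remote object to open
    size_t offset_in_blob; // bytes to seek inside that object (need_to_seek)
    bool found;
};

BlobPosition findBlob(const std::vector<BlobPathToSize> & blobs, size_t file_offset)
{
    size_t offset = file_offset;
    for (size_t i = 0; i < blobs.size(); ++i)
    {
        if (offset < blobs[i].size)          // this object contains the offset
            return {i, offset, true};
        offset -= blobs[i].size;             // skip it: offset becomes relative
    }
    return {0, 0, false};                    // offset is past the end of the file
}
```

For example, with objects of sizes 100, 200 and 50, an absolute offset of 150 lands in the second object at relative offset 50.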

Step2

Now look at bytes_to_ignore, which implements a lazy seek. ClickHouse supports prefetching: prefetch reads ahead into an extra external buffer, so when next() is called here, the required data may already have been prefetched and can be consumed directly. However, the external prefetch buffer cannot read ahead from exactly the required file offset, so bytes_to_ignore is used to record how far file_offset_of_buffer_end must be adjusted — hence the lazy seek.
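The lazy-seek idea can be sketched in isolation. LazySeekBuffer and its members are hypothetical names, not ClickHouse code: a small forward seek only records a skip distance instead of touching the remote stream, and the skip is applied on the next read, just as readImpl applies bytes_to_ignore before calling next().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch (hypothetical names, not ClickHouse code) of lazy seek via a
// pending bytes_to_ignore counter.
struct LazySeekBuffer
{
    std::string data;           // bytes already fetched (e.g. prefetched)
    size_t pos = 0;             // current read position
    size_t bytes_to_ignore = 0; // pending skip, applied lazily

    // A small forward seek just records the distance; no remote request
    // is issued and no data is discarded yet.
    void seekForward(size_t new_offset) { bytes_to_ignore = new_offset - pos; }

    char readByte()
    {
        if (bytes_to_ignore)
        {
            pos += bytes_to_ignore;   // the actual "seek" happens here
            bytes_to_ignore = 0;
        }
        return data[pos++];
    }
};
```

The benefit is that seeks inside already-fetched data cost nothing until a read actually happens.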

Step3

Call current_buf->next() to fetch remote file data. The concrete implementation is either CachedReadBufferFromRemoteFS or ReadBufferFromS3, depending on whether the cache is used; both were introduced above.

Step4

Swap the buffer state back from current_buf, the counterpart of the swap in step 1.

Step5

Update nextimpl_working_buffer_offset and total_bytes_read_from_current_file (file_offset_of_buffer_end was already advanced in step 3).

In fact, you will notice that readImpl only reads from a single remote file: nowhere in the function is a new remote-file buffer constructed. That is exactly why readImpl is factored out as a separate function. In nextImpl, if readImpl returns and there are still files left to read, moveToNextBuffer is called; this function constructs a new current_buf, and reading continues until all files have been read.
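The nextImpl/readImpl split just described can be sketched like this. GatherSketch and its members are hypothetical, not ClickHouse identifiers: readImpl only reads from the current blob, and next() advances to the next blob (standing in for moveToNextBuffer constructing a new current_buf) whenever the current one is exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch (hypothetical names, not ClickHouse code) of the
// nextImpl -> readImpl -> moveToNextBuffer flow.
struct GatherSketch
{
    std::vector<std::string> blobs; // contents of each remote object
    size_t current = 0;             // index of the blob being read
    size_t pos_in_blob = 0;
    std::string last_chunk;

    // Read whatever remains in the current blob; never switches blobs.
    bool readImpl()
    {
        if (current >= blobs.size() || pos_in_blob >= blobs[current].size())
            return false;
        last_chunk = blobs[current].substr(pos_in_blob);
        pos_in_blob = blobs[current].size();
        return true;
    }

    // Stands in for moveToNextBuffer: construct the "buffer" for the next blob.
    bool moveToNextBuffer()
    {
        if (current + 1 >= blobs.size())
            return false;
        ++current;
        pos_in_blob = 0;
        return true;
    }

    // Stands in for nextImpl: retry readImpl across blobs until data arrives.
    bool next()
    {
        if (readImpl())
            return true;
        while (moveToNextBuffer())
            if (readImpl())
                return true;
        return false;
    }
};
```

Successive next() calls thus yield the blobs in order, and return false only when every blob is exhausted.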

At this point we know where current_buf is constructed and used: for S3, current_buf is constructed by the ReadBufferFromS3Gather::createImplementationBufferImpl function shown earlier.

So far, the main process of ReadIndirectBufferFromRemoteFS has been introduced.


Origin blog.csdn.net/weixin_39992480/article/details/124871572