A detailed explanation of the JuiceFS data read and write process

For a file system, read and write efficiency has a decisive impact on overall system performance. In this article we walk through how JuiceFS processes read and write requests, so that readers can gain a deeper understanding of JuiceFS's characteristics.

Write process

JuiceFS splits large files at multiple levels (see how JuiceFS stores files) to improve read and write efficiency. When handling a write request, JuiceFS first writes the data into the client's memory buffer, where it is managed in the form of Chunks and Slices. A Chunk is a contiguous logical unit of 64 MiB, determined by the offset within the file, and different Chunks are completely isolated from each other. Each Chunk is further divided into Slices according to the actual pattern of the application's write requests: when a new write is contiguous with or overlaps an existing Slice, it is applied directly to that Slice; otherwise, a new Slice is created.
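
To make the mapping concrete: a Chunk is simply determined by the 64 MiB-aligned offset, so (with illustrative numbers)

    chunk_index  = offset / 64 MiB
    chunk_offset = offset % 64 MiB

    e.g. a 10 MiB write starting at offset 60 MiB spans two Chunks:
    it covers the last 4 MiB of Chunk 0 and the first 6 MiB of Chunk 1,
    and therefore produces one Slice in each of the two Chunks.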

A Slice is the logical unit that initiates data persistence. On flush, it first splits its data into one or more consecutive Blocks according to the default size of 4 MiB and uploads them to the object storage, with each Block corresponding to one object; it then updates the metadata once, writing the new Slice information. Obviously, for sequential application writes only one continuously growing Slice is needed, and only one final flush is required; this maximizes the write performance of the object storage.

Take a simple JuiceFS benchmark as an example: its first stage writes a 1 GiB file sequentially with 1 MiB IOs. The form of the data in each component is shown in the following figure:

Note: Compression and encryption in the figure are not enabled by default. To enable these features, add the --compress value or --encrypt-rsa-key value option when formatting the file system.
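
For example, both features might be enabled at format time roughly as follows (a sketch only; the compression algorithm, key file, metadata URL and file system name are placeholders):

    juicefs format --compress lz4 --encrypt-rsa-key my-private.pem \
        redis://127.0.0.1/1 myjfs   # URL, key path and name for illustration only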

Here is a metrics chart recorded with the juicefs stats command, which shows the relevant information more intuitively:
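
Such a chart can be reproduced by watching the mount point while the benchmark runs (the mount point path is an assumption):

    juicefs stats /mnt/jfs   # prints real-time metrics such as fuse, meta, blockcache and object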

Stage 1 in the chart above:

  • The average IO size of object storage writes is object.put / object.put_c = 4 MiB, equal to the default Block size
  • The ratio of metadata transactions to object storage writes is roughly meta.txn : object.put_c ~= 1 : 16, corresponding to 1 metadata modification and 16 object storage uploads per Slice flush. It also shows that each flush writes 4 MiB * 16 = 64 MiB of data, the default Chunk size
  • The average request size at the FUSE layer is fuse.write / fuse.ops ~= 128 KiB, in line with its default request size limit

Compared with sequential writes, random writes within large files are much more complicated: there may be multiple discontinuous Slices in each Chunk, which makes it hard for data objects to reach the full 4 MiB size and also requires more metadata updates. Moreover, when too many Slices have been written to a Chunk, Compaction is triggered to try to merge and clean them up, which further increases the load on the system. JuiceFS therefore shows noticeably lower performance in such scenarios than for sequential writes.

Small file writes are usually uploaded to the object storage when the file is closed, with an IO size generally equal to the file size. This can also be seen in stage 3 of the metrics chart above (creating 128 KiB small files):

  • The size of the object storage PUT is 128 KiB
  • The number of metadata transactions is roughly twice the PUT count, corresponding to one Create and one Write per file

It is worth mentioning that for objects smaller than one Block, JuiceFS will also try to write them to the local Cache (the directory specified by --cache-dir) to speed up possible subsequent reads. This can also be seen in the metrics chart: when creating small files, the blockcache tier shows the same write bandwidth, and when the files are read (stage 4) most requests hit the Cache, which makes small file reads look exceptionally fast.
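
For reference, the cache location is a mount option; a sketch with placeholder paths:

    juicefs mount --cache-dir /var/jfsCache \
        redis://127.0.0.1/1 /mnt/jfs   # directory, URL and mount point for illustration only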

Since a write request can return as soon as the data is written to the client's memory buffer, JuiceFS Write latency is usually very low (tens of microseconds). The actual upload to the object storage is triggered automatically by internal conditions (a single Slice growing too large, too many Slices, data buffered for too long, etc.) or actively by the application (closing a file, calling fsync, etc.). Data in the buffer can only be released after it has been persisted, so when write concurrency is high or object storage performance is insufficient, the buffer may fill up and block writes.

Specifically, the buffer size is specified by --buffer-size and defaults to 300 MiB; its real-time value can be seen in the usage.buf column of the metrics chart. When usage exceeds the threshold, JuiceFS Client actively adds a wait of about 10 ms to each Write to slow down the write speed; if usage exceeds twice the threshold, new writes are suspended entirely until the buffer is released. Therefore, when Write latency is observed to rise and the buffer stays above the threshold for a long time, a larger --buffer-size is usually worth trying. In addition, increasing --max-uploads (the maximum concurrency of uploads to object storage, default 20) can raise the write bandwidth to the object storage and thus speed up the release of the buffer.
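
Both knobs are mount options and might be tuned roughly like this (the values shown are illustrative, not recommendations):

    juicefs mount --buffer-size 600 --max-uploads 40 \
        redis://127.0.0.1/1 /mnt/jfs   # 600 MiB buffer, 40 concurrent uploads; example values only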

Writeback mode

When requirements for data consistency and reliability are not high, adding --writeback at mount time can further improve system performance. With write-back mode enabled, a Slice flush returns as soon as the data is written to the local Staging directory (which shares space with the Cache), and the data is then uploaded to the object storage asynchronously by a background thread. Note that JuiceFS's write-back mode differs from the commonly understood write-to-memory-first approach: it still requires the data to be written to the local Cache directory (the exact behavior depends on the hardware and the local file system hosting that directory). Viewed from another angle, the local directory acts as a cache layer in front of the object storage.
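
Enabling it is a single additional mount option (placeholders as in the earlier sketches):

    juicefs mount --writeback redis://127.0.0.1/1 /mnt/jfs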

After write-back mode is enabled, the size check on uploaded objects is skipped by default, and all data is kept in the Cache directory as aggressively as possible. This is especially useful in scenarios that generate a lot of intermediate files (such as software compilation).

In addition, JuiceFS v0.17 added an --upload-delay parameter to delay uploading data to the object storage and cache it locally even more aggressively. If the data is deleted by the application within the waiting window, it never needs to be uploaded at all, which improves performance and saves costs. Meanwhile, compared with a local disk, JuiceFS provides a back-end guarantee: when the Cache directory runs short of capacity, data is still uploaded automatically, so the application never sees errors because of this. This feature is very effective for workloads that need temporary storage, such as Spark shuffle.
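
Combined with write-back mode, this might look roughly as follows (the one-hour delay is an illustrative value):

    juicefs mount --writeback --upload-delay 1h \
        redis://127.0.0.1/1 /mnt/jfs   # data deleted within 1h never reaches object storage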

Read process

When JuiceFS processes a read request, it generally reads from the object storage aligned to 4 MiB Blocks, which provides a degree of read-ahead. The data read is also written to the local Cache directory for later use (as in stage 2 of the metrics chart, where blockcache shows high write bandwidth). During sequential reads, these prefetched data are naturally accessed by subsequent requests, the cache hit rate is very high, and the read performance of the object storage can be fully exploited. At this point, the flow of data through each component is shown in the following figure:

Note: An object read back from storage is first decrypted and then decompressed after it reaches the JuiceFS Client, the reverse of the order on the write path. Of course, if these features are not enabled, the corresponding steps are skipped.
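
Because reads are aligned to 4 MiB Blocks, even a small read downloads the whole Block that contains it (illustrative numbers):

    block_index = offset / 4 MiB

    e.g. a 128 KiB read at offset 10 MiB falls into Block 2 (covering 8-12 MiB),
    so a full 4 MiB object is fetched: a 32x read amplification for that request.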

When doing random small IO reads in large files, this strategy is not efficient; read amplification and the frequent writes to and evictions from the local Cache actually lower the effective utilization of system resources. Unfortunately, few general-purpose caching strategies pay off in such scenarios. One direction worth considering is to increase the total cache capacity as much as possible, hoping to cache nearly all of the data the application needs; the other is to disable the cache outright (set --cache-size 0) and let the object storage's read performance be used to the fullest.
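
Disabling the cache for such workloads is itself a mount option (placeholders as before):

    juicefs mount --cache-size 0 redis://127.0.0.1/1 /mnt/jfs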

Reading small files is much simpler: the whole file is usually read in a single request. Since small files are cached directly as they are written, an access pattern like that of JuiceFS bench, where files are read shortly after being written, basically always hits the local Cache directory, so the observed performance is very impressive.
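
For reference, the built-in benchmark that exercises these big-file and small-file patterns can be run like this (the mount point path is an assumption):

    juicefs bench /mnt/jfs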

Summary

This article has briefly described how JuiceFS processes read and write requests. Because large files and small files have different characteristics, JuiceFS applies different read and write strategies to files of different sizes, greatly improving overall performance and usability and better meeting the needs of different scenarios.

Recommended reading: How to play with Fluid + JuiceFS in a Kubernetes cluster

If you find this helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)
