Think OSS queries are too slow? See how we made them 10x faster!

Background

HDFS is the default storage system in the Hadoop ecosystem, and many data analysis and management tools are designed and implemented around its API. However, HDFS was designed for traditional data centers, and maintaining it in the cloud is not easy at all: it takes a lot of manpower to monitor, tune, scale, and recover from failures, and it is also expensive, costing more than ten times as much as object storage.

Under the general trend of separating storage and compute, many teams try to build data lake solutions on object storage, and object storage vendors provide connectors for the Hadoop ecosystem. However, due to the inherent limitations of object storage, both its functionality and its performance are very limited, and these problems become more prominent once data grows to a certain scale.

JuiceFS is designed to solve these problems. While retaining the cloud-native characteristics of object storage, it is far more compatible with the semantics and functionality of HDFS, and it significantly improves overall performance. This article takes Alibaba Cloud OSS as an example to show how JuiceFS comprehensively improves the performance of object storage in cloud big data scenarios.

Metadata performance

In order to be fully compatible with HDFS and provide the best possible metadata performance, JuiceFS manages all metadata in memory and uses OSS only as data storage. No metadata operation needs to access OSS, which guarantees both performance and consistency. Most metadata operations respond within 1ms, while the same operations on OSS usually take tens to hundreds of milliseconds. Here are the results of metadata stress testing with NNBench:

The rename operation in the figure above is for a single file only, because rename on OSS has to copy the data and is therefore very slow. In real big data jobs, it is usually directories that get renamed. On OSS that is an O(N) operation which gets noticeably slower as the number of files in the directory grows, while rename in JuiceFS is O(1): a single atomic operation on the server side that stays just as fast no matter how large the directory is.
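
Such metadata timings can be approximated with the standard Hadoop FileSystem API. Below is a minimal sketch (the URI argument, paths, and counts are all illustrative) that times empty-file creates and renames against whichever file system it is pointed at:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetaBench {
        public static void main(String[] args) throws Exception {
            // args[0] is the file system to measure, e.g. a JuiceFS or OSS URI
            FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());
            Path dir = new Path("/meta-bench");
            fs.mkdirs(dir);

            int n = 1000;
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                fs.create(new Path(dir, "f" + i)).close(); // empty file: pure metadata op
            }
            System.out.printf("create: %.2f ms/op%n", (System.nanoTime() - start) / 1e6 / n);

            start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                fs.rename(new Path(dir, "f" + i), new Path(dir, "g" + i));
            }
            System.out.printf("rename: %.2f ms/op%n", (System.nanoTime() - start) / 1e6 / n);

            fs.delete(dir, true);
        }
    }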

The du operation is similar: it reports the total size of all files in a directory, which is very useful when managing capacity or sizing up a dataset. The figure below compares du on a directory holding 100GB of data (3,949 subdirectories and files): JuiceFS is 76 times faster than OSS. This is because du in JuiceFS is answered from real-time statistics kept in server-side memory, while OSS has to list all the files in the directory through the client and sum their sizes. The more files in the directory, the larger the performance gap.
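
In the Hadoop API, du corresponds to FileSystem.getContentSummary (roughly what `hadoop fs -du -s` calls under the hood). A minimal sketch, assuming only the standard Hadoop client, that times it on a directory:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DuBench {
        public static void main(String[] args) throws Exception {
            // args[0]: file system URI, args[1]: directory to measure
            FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());
            long start = System.currentTimeMillis();
            // A single call; the file system may answer from server-side statistics
            // (JuiceFS) or by recursively listing every object (OSS connector).
            ContentSummary cs = fs.getContentSummary(new Path(args[1]));
            System.out.printf("%d dirs, %d files, %d bytes in %d ms%n",
                    cs.getDirectoryCount(), cs.getFileCount(), cs.getLength(),
                    System.currentTimeMillis() - start);
        }
    }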

Sequential read and write performance

In big data scenarios, a lot of raw data is stored in text format, written in append mode, and read mostly sequentially (or one block of the file is read sequentially). Throughput is the key metric when accessing such files. To support these scenarios better, JuiceFS first splits files into 64MB logical chunks and then into 4MB (configurable) data blocks written to the object storage, so that multiple data blocks can be read and written concurrently to improve throughput. OSS also supports multipart upload, but with restrictions on block size and count; JuiceFS has no such restrictions, and a single file can reach 256PB.
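
The chunk/block split boils down to simple offset arithmetic. A sketch with the sizes mentioned above (purely illustrative; this is not JuiceFS's actual storage layout):

    public class BlockMapping {
        static final long CHUNK_SIZE = 64L << 20; // 64MB logical chunk
        static final long BLOCK_SIZE = 4L << 20;  // 4MB data block (configurable)

        public static void main(String[] args) {
            long offset = 200L << 20;                        // byte offset 200MB into a file
            long chunk = offset / CHUNK_SIZE;                // -> chunk 3
            long block = (offset % CHUNK_SIZE) / BLOCK_SIZE; // -> block 2 within that chunk
            // Independent blocks can now be fetched or uploaded concurrently.
            System.out.printf("offset %d -> chunk %d, block %d%n", offset, chunk, block);
        }
    }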

At the same time, files in text format also compress very well. JuiceFS's built-in LZ4 and Zstandard compression algorithms compress and decompress in parallel with reads and writes, which not only reduces storage cost but also cuts network traffic, further improving sequential read and write performance. For data that is already compressed, both algorithms automatically detect this and avoid compressing it again.
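
As an illustration of block-level compression (not JuiceFS's actual code path), here is a sketch using the lz4-java library: compress each 4MB block before upload, and fall back to the raw bytes when compression does not help, which is how already-compressed data avoids double work:

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;

    public class CompressBlock {
        public static void main(String[] args) {
            byte[] block = new byte[4 << 20]; // one 4MB data block of file data
            LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();
            byte[] compressed = compressor.compress(block);
            // Upload whichever buffer is smaller; incompressible blocks stay raw.
            byte[] toUpload = compressed.length < block.length ? compressed : block;
            System.out.printf("raw=%d compressed=%d uploading=%d bytes%n",
                    block.length, compressed.length, toUpload.length);
        }
    }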

Combined with JuiceFS's intelligent read-ahead and write-back algorithms, it is easy to saturate the network bandwidth and multi-core CPUs, pushing text-file processing performance to the limit. The figure below shows the single-threaded sequential I/O test results: JuiceFS's speedup for reading and writing large files (filled with incompressible random data) is very significant.
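
The idea behind read-ahead can be shown in a few lines: while the caller consumes block N, a background thread is already fetching block N+1, hiding object-storage latency behind computation (fetchBlock below is a stand-in for an object-storage GET):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ReadAhead {
        static final ExecutorService pool = Executors.newFixedThreadPool(2);

        static byte[] fetchBlock(long index) { // stand-in for a 4MB object GET
            return new byte[4 << 20];
        }

        public static void main(String[] args) throws Exception {
            Future<byte[]> next = pool.submit(() -> fetchBlock(0));
            for (long i = 0; i < 16; i++) {
                byte[] current = next.get();                 // block i, usually prefetched
                final long ahead = i + 1;
                next = pool.submit(() -> fetchBlock(ahead)); // start fetching block i+1 early
                // ... process `current` sequentially ...
            }
            pool.shutdown();
        }
    }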

Random read performance

For analytical data warehouses, raw data is usually cleaned and stored in a more efficient columnar format (Parquet or ORC), which greatly saves storage space and significantly speeds up analysis. Columnar data has a very different access pattern from text: it is mostly read randomly, which places much higher demands on the overall performance of the storage system.

JuiceFS makes many optimizations for the access pattern of these columnar files, the core of which is caching data blocks on the SSDs of the compute nodes. To guarantee the correctness of cached data, JuiceFS identifies every written data block in OSS with a unique ID and never modifies it, so cached data never needs to be invalidated; blocks are only evicted by an LRU policy when space runs low. Parquet and ORC files usually have only a few hot columns, so caching the whole file or a whole 64MB chunk would waste space; JuiceFS therefore caches in 1MB blocks (configurable).
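
Because blocks are immutable and identified by a unique ID, a correct local cache needs nothing more than LRU eviction. A minimal sketch using Java's LinkedHashMap (an in-memory stand-in for the on-disk SSD cache):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // LRU cache of immutable data blocks keyed by their unique block ID.
    // Entries can never go stale (blocks are never rewritten), so eviction
    // only happens when capacity is exceeded.
    public class BlockCache extends LinkedHashMap<String, byte[]> {
        private final int maxBlocks;

        public BlockCache(int maxBlocks) {
            super(16, 0.75f, true); // accessOrder=true keeps LRU ordering
            this.maxBlocks = maxBlocks;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > maxBlocks; // drop the least recently used block
        }
    }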

A compute cluster usually holds only one copy of each cached block. Its location is determined by a consistent hashing algorithm, and the locality mechanism of the scheduling framework is used to place compute tasks on the nodes that hold the cached data, achieving data locality as good as or better than HDFS. Because HDFS's three replicas are usually scheduled at random, operating system page cache utilization is low; JuiceFS schedules reads of the same data to the same node whenever possible, so page cache utilization is higher.
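
A sketch of the consistent-hashing idea (not JuiceFS's actual implementation): nodes are placed at many points on a hash ring, and a block lives on the first node clockwise from its own hash, so adding or removing a node only relocates a small fraction of the blocks:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class CacheRing {
        private final TreeMap<Integer, String> ring = new TreeMap<>();

        void addNode(String node) throws Exception {
            for (int i = 0; i < 100; i++) {      // 100 virtual nodes smooth the load
                ring.put(hash(node + "#" + i), node);
            }
        }

        // The node that should cache (and serve) the given block.
        String nodeFor(String blockId) throws Exception {
            SortedMap<Integer, String> tail = ring.tailMap(hash(blockId));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static int hash(String key) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                 | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        }
    }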

When the scheduler cannot achieve locality, for example when SparkSQL randomly merges multiple small files into the same task, the locality benefit is lost even on HDFS. JuiceFS's distributed cache solves this problem well: when a compute task cannot be scheduled to the node holding the cache, the JuiceFS client fetches the cached data through an internal P2P mechanism, greatly improving the cache hit rate and performance.

We selected q2, whose query time is fairly representative, to test the acceleration effect of different block sizes and cache settings:

With the cache disabled, 1MB blocks perform better than 4MB blocks, because 4MB blocks cause more read amplification, which slows down random reads, wastes a lot of network bandwidth, and can cause network congestion.

With the local cache enabled, Spark can serve random reads directly from cached data blocks, which greatly improves random read performance. But because SparkSQL randomly merges small files into tasks, most files cannot be scheduled to the node holding their cache: the hit rate stays very low, and the read requests that miss the cache must go to the object storage, seriously slowing down the whole job.

With the distributed cache enabled, JuiceFS clients can read the cache from a fixed node no matter where the compute task is scheduled. The cache hit rate becomes very high and access is fast (usually the second run of a query already shows a significant speedup).

JuiceFS also supports random writes, but big data scenarios do not need this capability and OSS does not support it at all, so we make no comparison here.

Comprehensive performance

TPC-DS is the typical benchmark for big data analysis scenarios. We used it to measure how much JuiceFS improves OSS performance across different data formats and different analysis engines.

Test environment

We built a cluster on Alibaba Cloud using CDH 5.16 (probably the most widely deployed version). The detailed configuration and software versions are as follows:

    Apache Spark 2.4.0.cloudera2
    Apache Impala 2.12
    Presto 0.234
    OSS Java SDK 3.4.1
    JuiceFS Hadoop SDK 0.6-beta

    Master: 4 CPUs, 32GB memory, 1 node
    Slave:  4 CPUs, 16GB memory, 200GB efficient cloud disk x 2, 3 nodes

    Spark parameters:
        master                          yarn
        driver-memory                   3g
        executor-memory                 9g
        executor-cores                  3
        num-executors                   3
        spark.locality.wait             100
        spark.dynamicAllocation.enabled false

The test data is the 100GB TPC-DS dataset, generated in several storage formats and with different parameters. Running all 99 test queries would take too long, so we selected the first 10 as representatives; they cover several types of queries.

Write performance

Write performance is tested by reading from and writing back to the same table, using the following SQL statement:

    INSERT OVERWRITE store_sales SELECT * FROM store_sales;

We compared the unpartitioned text format with the date-partitioned Parquet format. JuiceFS shows a significant performance improvement, especially for the partitioned Parquet format. Analysis shows that OSS spends a lot of time on rename, which has to copy the data and cannot be parallelized, while rename in JuiceFS is an atomic operation that completes in an instant.

SparkSQL query performance

Apache Spark is very widely used. We used SparkSQL to test JuiceFS's speedup in three file formats: text, Parquet, and ORC. The text format is unpartitioned; Parquet and ORC are partitioned by date.

For the unpartitioned text format, all of the text data must be scanned and the main bottleneck is the CPU, so JuiceFS's speedup is limited, up to 3 times. Note that when OSS is accessed over HTTPS, the Java TLS library is much slower than the Go TLS library that JuiceFS uses, and JuiceFS also compresses the data so its network traffic is much smaller; as a result, when both access OSS with HTTPS enabled, JuiceFS's advantage is even bigger.

The graph above shows that with HTTPS enabled, JuiceFS performance is almost unchanged, while OSS performance drops significantly.

Interactive queries often hit the same hot data repeatedly. The figure above shows the result of running the same query 3 times. Relying on cached hot data, JuiceFS improves performance dramatically: 8 of the 10 queries are several times faster, and even q4, which improved the least, is 30% faster.

The speedup on the ORC dataset is similar to that on Parquet, with a maximum of 11 times and a minimum of 40%.

For all data formats, JuiceFS can significantly improve the query performance of OSS by up to 10 times.

Impala query performance

Impala is an interactive analysis engine with excellent performance, heavily optimized for I/O locality and I/O scheduling. It achieves great results even without JuiceFS's distributed cache: 42 times faster than OSS!

Presto is a query engine similar to Impala, but the OSS configured in the test environment could not work with Presto (the reason is unknown), so we were unable to compare JuiceFS with OSS on it.

Summary

Summarizing the above results: JuiceFS can significantly accelerate OSS in all scenarios, especially with columnar formats such as Parquet and ORC, where writes are 8 times faster and queries more than 10 times faster. This significant performance improvement not only saves data analysts valuable time but also greatly reduces compute resource usage and cost.

The above uses Alibaba Cloud OSS only as an example. JuiceFS's acceleration applies to all cloud object stores, including Amazon S3, Google Cloud Storage (GCS), and Tencent Cloud COS, and it can significantly improve their performance in data lake scenarios. In addition, JuiceFS provides better Hadoop compatibility (permission control, snapshots, etc.) and full POSIX access, making it an ideal choice for data lakes in the cloud.

If you found this helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)
