How to use JuiceFS to optimize the storage performance of Kylin 4.0 on the cloud?

Apache Kylin 4.0 uses Spark as its build engine and Parquet as its storage format, which makes deployment and scaling on the cloud easier. However, compared with HDFS on local disks, object storage on the cloud can introduce compatibility and performance problems. This article presents a JuiceFS-based solution to those problems: combining the powerful query engine of Kylin 4.0 with the efficient local cache of JuiceFS delivers both compatibility and performance. The article was jointly produced by Kylin and Juicedata.

What is Apache Kylin?

Apache Kylin is an open source, distributed analytics engine designed for extremely large datasets. It provides a SQL query interface and multi-dimensional online analytical processing (OLAP) capabilities on Hadoop/Spark. It was originally developed by eBay and contributed to the Apache Software Foundation. Even on massive datasets, Kylin can answer queries in under a second.

[Apache Kylin architecture diagram]

As an ultra-high-performance query engine, Kylin connects downstream to various data sources, such as Hive and Kafka, and upstream to various BI systems, such as Tableau and Superset. It also provides JDBC, ODBC, and REST APIs for application integration. Since it was open-sourced, Kylin has been widely adopted by companies such as Meituan, Xiaomi, 58.com, Beike Zhaofang, Huawei, Autohome, Ctrip, Tongcheng, Vivo, Yahoo Japan, and OLX Group, with daily query volumes ranging from tens of thousands to tens of millions. Most queries complete within 1-3 seconds.

If your product or business team asks you for flexible aggregate queries over billions or even hundreds of billions of records, with fast responses, high concurrency, and low resource usage, and if application development also requires full SQL support and seamless BI integration, then Apache Kylin is your best choice.

Kylin's core idea is precomputation: based on the specified dimensions and measures, it calculates all possible query results in advance (the multi-dimensional Cube), trading space for time to accelerate OLAP queries with fixed query patterns. Each combination of dimensions is called a Cuboid, and the collection of all Cuboids is a Cube. The Cuboid composed of all dimensions is called the Base Cuboid, and every other Cuboid can be aggregated from it. At query time, Kylin automatically selects the most suitable Cuboid that satisfies the query. Compared with computing from the user's original table, reading from a Cuboid greatly reduces both the amount of data scanned and the computation required.

[A four-dimensional Cube example]
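
The Cuboid idea described above can be sketched in a few lines of Python. This is only an illustration, not Kylin's actual planner: the dimension names, their cardinalities, the source row count, and the row estimate (product of cardinalities, capped by the source size) are all assumptions made up for the example.

```python
from itertools import combinations
from math import prod

# Hypothetical dimensions and cardinalities, chosen only for illustration.
CARDINALITY = {"time": 365, "item": 1000, "location": 50, "supplier": 20}
SOURCE_ROWS = 1_000_000  # assumed size of the source table


def enumerate_cuboids(dimensions):
    """Return all 2^n dimension combinations, Base Cuboid first."""
    dims = tuple(dimensions)
    return [
        frozenset(combo)
        for size in range(len(dims), -1, -1)
        for combo in combinations(dims, size)
    ]


def estimate_rows(cuboid):
    """Crude upper bound: product of cardinalities, capped by the source row count."""
    return min(prod(CARDINALITY[d] for d in cuboid), SOURCE_ROWS)


def pick_cuboid(cuboids, query_dims):
    """Choose the smallest Cuboid that contains every dimension the query uses."""
    needed = frozenset(query_dims)
    candidates = [c for c in cuboids if needed <= c]
    return min(candidates, key=estimate_rows)


cuboids = enumerate_cuboids(CARDINALITY)
best = pick_cuboid(cuboids, ["time", "location"])
print(len(cuboids), sorted(best), estimate_rows(best))
```

With these assumed numbers, four dimensions yield 16 Cuboids, and a query grouped by time and location is answered from a Cuboid of at most 18,250 rows instead of the 1,000,000-row source table.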

Kylin chose HBase as its storage engine from the very beginning, and HBase basically met the query performance requirements. However, the HBase-based solution has a series of pain points: HBase is complex to operate and maintain, the query node is a single point, and because HBase is not a pure columnar store, its I/O efficiency is not high. Apache Kylin 4 replaces HBase with a combination of Parquet and Spark, which separates computing from storage. This is a major architecture upgrade that is better suited to the cloud-native trend.

Challenges of Kylin on Parquet on the cloud

With the new generation of Kylin 4, users can quickly and easily deploy high-performance, low-TCO data analysis services on the cloud. The separation of computing and storage, together with the reduced architectural complexity, makes Kylin one of the best choices for data analysis on the cloud. However, a file system abstracted on top of cloud object storage differs greatly from traditional HDFS, and the differences raise a series of issues: loss of data locality, rate limits on object storage API calls, and the difficulty of guaranteeing consistency for data move operations. These bring stability and performance challenges to Kylin's builds and queries. Several solutions have succeeded in mitigating these issues and even in approaching the performance of native HDFS; JuiceFS is one of them.

What is JuiceFS?

JuiceFS is a distributed file system designed for cloud-native environments. It is fully compatible with POSIX and HDFS, and it is suitable for big data, machine learning training, Kubernetes shared storage, and massive data archiving scenarios. It supports all public cloud service providers in the world and is offered as a fully managed service. Customers get an elastically scalable file system, expandable to 100 PB, without investing any operation and maintenance effort.

As the architecture diagram below shows, JuiceFS supports the object storage products of the various public clouds as well as open source object storage such as Ceph, MinIO, and Swift. A FUSE client is provided on Linux and macOS, and a native client is provided on Windows. Both mount the JuiceFS file system into the operating system, where it behaves just like a local disk. In Hadoop environments, JuiceFS provides a Java SDK that behaves the same as HDFS. The JuiceFS metadata service is deployed as a fully managed service on all public clouds, so customers do not need to maintain any services themselves, and the barrier to learning and using it is extremely low.

[JuiceFS architecture diagram]

Why should Kylin and JuiceFS be used together?

If a customer uses Kylin on a public cloud and wants to store data on object storage, they will encounter two problems:

The first issue is compatibility. Kylin supports HDFS and Amazon S3 by default, and other public clouds also provide "S3 compatible" object storage. However, in actual tests we found that, apart from AWS and Azure, object storage on other public clouds is not compatible. For example, when we ran Kylin on Alibaba Cloud with OSS as the storage, the job failed during the Cube build stage, both on Alibaba Cloud EMR and on a self-built CDH cluster.

The second issue is performance. From a user's point of view, replacing HDFS with object storage in a big data scenario causes a noticeable performance drop. There are several reasons for this:

1. Increased network overhead: HDFS storage offers data locality. After switching to object storage, all data transfer goes over the network, which adds overhead and degrades performance;

2. Metadata performance degradation: The Cube build process involves a lot of file metadata operations, especially List and Rename. These two operations perform very poorly on object storage compared with HDFS, which increases the running time of the entire job;

3. Performance degradation caused by read amplification: Once Kylin's data is stored in the Parquet file format, a query often does not need to read a complete Parquet file; reading only the header or footer is enough. This requires a storage system with good random read capability, which is precisely where object storage is weak. The result is read amplification, which increases the I/O of the entire query task and degrades performance.
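
To make the Rename point in item 2 concrete, here is a toy model. It is a sketch under simple assumptions, not the behavior of any specific product: `FlatObjectStore` and `TreeNamespace` are hypothetical classes, the paths are made up, and the object store is modeled as a flat key space in which a "directory rename" has to be emulated with one LIST plus a COPY and a DELETE for every object.

```python
class FlatObjectStore:
    """Toy flat key/value store: 'directories' are only key prefixes."""

    def __init__(self):
        self.objects = {}
        self.api_calls = 0

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Emulate a directory rename: one LIST, then COPY + DELETE per object."""
        self.api_calls += 1  # LIST
        for key in [k for k in self.objects if k.startswith(src)]:
            self.objects[dst + key[len(src):]] = self.objects[key]  # COPY
            del self.objects[key]  # DELETE
            self.api_calls += 2


class TreeNamespace:
    """Toy hierarchical namespace: a directory is a node that can be relinked."""

    def __init__(self):
        self.root = {}
        self.metadata_ops = 0

    def put(self, dirname, filename, data):
        self.root.setdefault(dirname, {})[filename] = data

    def rename_dir(self, src, dst):
        """Relink the directory node; the files underneath are not touched."""
        self.root[dst] = self.root.pop(src)
        self.metadata_ops += 1


flat = FlatObjectStore()
tree = TreeNamespace()
for i in range(1000):
    flat.put(f"tmp/job/part-{i}.parquet", b"")
    tree.put("tmp/job", f"part-{i}.parquet", b"")

flat.rename_prefix("tmp/job/", "cube/segment/")
tree.rename_dir("tmp/job", "cube/segment")
print(flat.api_calls, tree.metadata_ops)
```

In this model, renaming a directory of 1,000 files costs 2,001 calls on the flat store and a single metadata operation in the hierarchical namespace, which is why rename-heavy build jobs slow down on object storage.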

JuiceFS can solve both the compatibility and the performance issues in big data scenarios. Let's look at how.

First of all, let's talk about compatibility. JuiceFS provides a Java SDK whose function is equivalent to the Java SDK of HDFS. It implements all of the HDFS file system interfaces, and its behavior is guaranteed to be consistent with HDFS, so any computing engine that supports HDFS can use JuiceFS without compatibility issues. Moreover, JuiceFS supports all public cloud services around the world and provides a consistent experience, so users no longer need to care about the differences between the object storage products of different cloud vendors.

Next, let’s talk about performance, and explain how JuiceFS solves the performance degradation caused by the above three aspects:

1. A computing cluster that uses JuiceFS also has a storage and computing separation architecture, so it also loses the data locality of HDFS. However, JuiceFS provides data caching on the client. All data read from JuiceFS is automatically cached on the node where the client runs (on the local storage of the virtual machine or container). The next time this data is accessed, it is read directly from local storage and no longer goes over the network. In big data query and analysis scenarios, data usually has hot spots, so the JuiceFS cache can improve performance significantly (see the test results below). You may also be concerned about cache management, expiration, and consistency. JuiceFS has a complete set of mechanisms for these, which deserve a separate article and are not covered here.

2. Metadata performance. JuiceFS has its own independent metadata service, and List and Rename operations are all answered by it. Performance is dozens of times faster than object storage and more than 50% better than HDFS. See the JuiceFS test cases for details.

3. The JuiceFS cache can effectively reduce the latency of random reads and reduce read amplification. It has obvious performance advantages in query analysis scenarios based on the Parquet and ORC data formats.
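
The caching behavior described in items 1 and 3 can be illustrated with a minimal read-through block cache. This is a sketch only, not JuiceFS's implementation: `CountingRemote`, `BlockCache`, `read_footer`, and the 4 KiB block size are all assumptions. The fake file only mimics the Parquet trailer layout (footer bytes, then a 4-byte little-endian footer length, then the magic bytes `PAR1`); it is not a real Parquet file.

```python
import struct

BLOCK_SIZE = 4096  # assumed cache block size for this sketch


class CountingRemote:
    """Stands in for object storage; counts requests and bytes sent over the network."""

    def __init__(self, blob):
        self.blob = blob
        self.requests = 0
        self.bytes_sent = 0

    def get_range(self, start, end):
        self.requests += 1
        chunk = self.blob[start:end]
        self.bytes_sent += len(chunk)
        return chunk


class BlockCache:
    """Read-through cache: fetch each block from the remote once, then serve it locally."""

    def __init__(self, remote):
        self.remote = remote
        self.blocks = {}

    def read(self, offset, length):
        out = bytearray()
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for index in range(first, last + 1):
            if index not in self.blocks:
                start = index * BLOCK_SIZE
                self.blocks[index] = self.remote.get_range(start, start + BLOCK_SIZE)
            out += self.blocks[index]
        skip = offset - first * BLOCK_SIZE
        return bytes(out[skip:skip + length])


def read_footer(reader, file_size):
    """Read a Parquet-style trailer: footer, 4-byte little-endian length, b'PAR1'."""
    tail = reader.read(file_size - 8, 8)
    (footer_len,) = struct.unpack("<I", tail[:4])
    assert tail[4:] == b"PAR1"
    return reader.read(file_size - 8 - footer_len, footer_len)


# Build a fake file that only mimics the Parquet trailer layout.
footer = b"schema-and-row-group-metadata"
blob = b"PAR1" + bytes(1_000_000) + footer + struct.pack("<I", len(footer)) + b"PAR1"
remote = CountingRemote(blob)
cache = BlockCache(remote)

first_read = read_footer(cache, len(blob))
requests_after_first = remote.requests
second_read = read_footer(cache, len(blob))
print(requests_after_first, remote.requests, remote.bytes_sent, len(blob))
```

In this sketch, the first footer read triggers a single remote request for one block near the end of the file, and the second read is served entirely from the local cache, so only a small fraction of the file ever crosses the network.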

In summary, JuiceFS can deliver performance equivalent to HDFS while providing full compatibility with Hadoop ecosystem products. More importantly, no matter which public cloud customers use, JuiceFS gives them a consistent experience.

Performance comparison

The sections above explained the benefits of using Kylin on Parquet together with JuiceFS. Now let's take a look at the results of the performance test.

As mentioned above, Cube construction on OSS has compatibility issues, and the Cube cannot be built correctly. However, it is possible to copy the Cube data built on JuiceFS to OSS and run queries there, so we tested Query1 to Query22 on the TPC-H 10GB data set. In total execution time, JuiceFS took about 38% less time than OSS:

• JuiceFS: 70,423 ms

• OSS: 113,670 ms
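
The 38% figure follows directly from the two totals above. The short calculation below only restates the article's numbers; the variable names are ours.

```python
JUICEFS_MS = 70_423  # total for Query1 to Query22 on JuiceFS (from the article)
OSS_MS = 113_670     # total for the same queries on OSS (from the article)

time_saved = (OSS_MS - JUICEFS_MS) / OSS_MS  # fraction of the OSS time eliminated
speedup = OSS_MS / JUICEFS_MS                # how many times faster JuiceFS is

print(f"time reduced by {time_saved:.1%}, speedup {speedup:.2f}x")
```

This prints a 38.0% reduction in total execution time, which is the same as saying the query set ran about 1.61 times faster on JuiceFS.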

The detailed test environment configuration and the execution time of each test query are given below:

Machine configuration

The cluster was built on Alibaba Cloud with CDH 5.16. The detailed configuration and software versions are as follows:

All test query execution time

Summary

Kylin 4.0 introduces an architecture that separates computing and storage, which makes it easier to deploy and scale Kylin on the cloud. However, compared with HDFS on local disks, object storage on the cloud raises integration and compatibility problems on the one hand and reduces performance on the other. By using JuiceFS with Kylin, you can use cloud storage services for big data computation in EMR or in self-built Hadoop clusters, on all public clouds and without special adaptation. JuiceFS gives your cluster a storage and computing separation architecture while reducing the cost of network I/O through efficient local caching. In query analysis scenarios based on the Parquet format, it effectively reduces the latency of random reads and reduces read amplification, achieving performance close to HDFS. In our test scenario, using JuiceFS reduced total query time by 38% compared with using object storage directly.

If you plan to use Kylin on a public cloud for data analysis, running JuiceFS on top of object storage lets you achieve both compatibility and performance.

Origin blog.csdn.net/ZabeNbRdit36243qNJX1/article/details/111055364