Mobile Cloud's exploration of using JuiceFS to support Apache HBase: improving efficiency and reducing costs

About the author: Chen Haifeng is a developer of the Mobile Cloud Apache HBase database service, with a strong interest in Apache HBase, RBF, and Apache Spark.

Background

Apache HBase is a large-scale, scalable, distributed data storage service in the Apache Hadoop ecosystem, and also a NoSQL database. It is designed to provide random, strongly consistent, real-time queries over billions of rows containing millions of columns. By default HBase stores its data on HDFS, and HBase has made many optimizations for HDFS to ensure stability and performance. However, maintaining HDFS itself is not easy: it requires continuous monitoring, operations, tuning, capacity expansion, disaster recovery, and a series of other chores, and building HDFS on the public cloud is quite expensive. To save money and reduce maintenance costs, some users store HBase data on S3 (or another object storage). Using S3 removes the burden of monitoring and operations, and also separates storage from compute, making it much easier to scale HBase in and out.

However, getting HBase to work with object storage is not easy. On the one hand, object storage has limited functionality and performance due to its own characteristics: once data is written, the object cannot be changed. On the other hand, accessing object storage through file system semantics has natural limitations. When Hadoop's native AWS client is used to access object storage, renaming a directory traverses the whole directory, copying and then deleting its files, so performance is very poor. The rename also loses atomicity: the original rename is decomposed into a copy plus a delete, which in extreme cases can leave users with an inconsistent view of their data. Querying the total size of all files in a directory behaves similarly: it iterates over every file in the directory, so if the directory contains a large number of subdirectories and files, the query is expensive and performs badly.
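A quick way to observe this behavior from the command line (the bucket and paths below are made up for illustration): behind the rename, the S3A client copies every object under the source prefix and then deletes the originals, and the directory-size query is a full listing walk rather than a single metadata lookup.

# Rename on S3A: cost grows with the amount of data under the source prefix
time hadoop fs -mv s3a://mybucket/hbase/archive s3a://mybucket/hbase/archive-old
# Directory size on S3A: iterates over every object under the prefix
time hadoop fs -du -s -h s3a://mybucket/hbase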

Solution selection

After extensive research into possible solutions and tracking of community issues, there are currently three ways for Cloud HBase to access object storage.

The first is for HBase to access object storage through Hadoop's native AWS client, namely S3AFileSystem. The HBase kernel code needs only minor changes to use S3AFileSystem. The pain point of this direct-to-object-storage approach is directory renaming: HBase renames directories in HLog file management, MemStore flush, table creation, region compaction, and region split. The community has optimized StoreFile handling to address part of the rename performance problem, but completely solving directory operation performance would require drastic changes to the HBase kernel source code.

The second solution is to introduce Alluxio as a cache acceleration layer, which not only greatly improves read and write performance but also adds file metadata management, completely solving the poor performance of directory operations. Behind this seemingly happy ending, however, there are many constraints. When Alluxio is configured to use memory only, directory operations take milliseconds. But once Alluxio's UFS is configured, a metadata operation in Alluxio has two steps: first modify the state of the Alluxio master, then send a request to the UFS. The metadata operation is therefore still not atomic, and its state is unpredictable while the operation is executing or when a failure occurs. Alluxio relies on the UFS for metadata operations, so renaming a file still becomes a copy plus a delete. Since HBase data must ultimately be persisted to disk, Alluxio cannot solve the performance problem of directory operations.

The third option is to introduce the JuiceFS shared file system between HBase and the object storage. With JuiceFS, the data itself is persisted in object storage (for example, Mobile Cloud EOS), and the corresponding metadata can be persisted in Redis, MySQL, or other databases as needed. In this solution, directory operations are completed entirely in the metadata engine without touching the object storage, and they finish at the millisecond level, which resolves the pain points of HBase on object storage. However, since the JuiceFS core is written in Go, it brings certain challenges for later performance tuning and routine maintenance.

After weighing the pros and cons of the three solutions above, JuiceFS was finally adopted as the way for Cloud HBase to support object storage. The rest of this article focuses on the practice and performance tuning of JuiceFS in supporting object storage for Cloud HBase.

Introduction to JuiceFS

First, a brief introduction to the JuiceFS architecture. JuiceFS consists of two main parts: the JuiceFS metadata service and the object storage. The JuiceFS Java SDK is fully compatible with the HDFS API, and a FUSE-based client mount is also provided that is fully POSIX-compatible. As a file system, JuiceFS handles data and its corresponding metadata separately: the data is stored in object storage and the metadata in a metadata engine. For data storage, JuiceFS supports almost all public cloud object storage services, as well as OpenStack Swift, Ceph, MinIO and other open-source object stores that can be deployed privately. For metadata storage, JuiceFS adopts a multi-engine design and currently supports Redis, TiKV, MySQL/MariaDB, PostgreSQL, SQLite and others as the metadata engine.

Any file stored in JuiceFS is split into fixed-size "chunks", with a default size cap of 64 MiB. Each chunk consists of one or more "slices"; slice length is not fixed and depends on how the file was written. Each slice is further split into fixed-size "blocks", 4 MiB by default. Finally, these blocks are stored in the object storage, while JuiceFS stores each file and its chunk, slice, and block metadata in the metadata engine.

With JuiceFS, files are ultimately split into chunks, slices, and blocks and stored in the object storage. As a result, the source files stored in JuiceFS cannot be found as-is on the object storage platform; the bucket contains only a chunks directory and a pile of numerically numbered directories and files.
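To make this concrete, here is a hedged sketch of how the layout can be inspected, assuming the same volume is also mounted through FUSE at /mnt/jfs and using a hypothetical file path:

# Show the chunks, slices and block objects that back one file (path is illustrative)
juicefs info /mnt/jfs/hbase/data/some-hfile
# With the defaults above, a 100 MiB file maps to two chunks (64 MiB + 36 MiB),
# stored in the bucket as 25 block objects of at most 4 MiB each under chunks/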

The following configuration is required for HBase to use JuiceFS. First, put the compiled client SDK on the HBase classpath. Next, write the JuiceFS-related configuration into core-site.xml, as shown in the table below. Finally, format the file system with the juicefs client.

Configuration item               Default                      Description
fs.jfs.impl                      io.juicefs.JuiceFileSystem   Specifies the file system implementation to use; the scheme defaults to jfs://
fs.AbstractFileSystem.jfs.impl   io.juicefs.JuiceFS           Specifies the AbstractFileSystem implementation for the jfs:// scheme
juicefs.meta                     (none)                       Specifies the metadata engine address of the pre-created JuiceFS file system
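For illustration, a minimal configuration fragment matching the table (plus the HBase root directory setting) might look like the following; the volume name myjfs, the MySQL address, and the hbase.rootdir value are assumptions for this sketch, not values given in the article.

<!-- JuiceFS Hadoop SDK settings in core-site.xml (illustrative values) -->
<property>
  <name>fs.jfs.impl</name>
  <value>io.juicefs.JuiceFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.jfs.impl</name>
  <value>io.juicefs.JuiceFS</value>
</property>
<property>
  <name>juicefs.meta</name>
  <value>mysql://username:password@(ip:port)/database</value>
</property>
<!-- In hbase-site.xml, point HBase at the JuiceFS volume (assumed name: myjfs) -->
<property>
  <name>hbase.rootdir</name>
  <value>jfs://myjfs/hbase</value>
</property>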

MySQL is used here as the metadata engine. The command to format the file system is shown below; formatting requires the following information:

  • --storage: sets the storage type, such as Mobile Cloud EOS;
  • --bucket: sets the endpoint address of the object storage bucket;
  • --access-key: sets the object storage API Access Key ID;
  • --secret-key: sets the object storage API Secret Access Key.
juicefs format --storage eos \
--bucket https://myjfs.eos-wuxi-1.cmecloud.cn \
--access-key ABCDEFGHIJKLMNopqXYZ \
--secret-key ZYXwvutsrqpoNMLkJiHgfeDCBA \
mysql://username:password@(ip:port)/database NAME
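The last two arguments are the metadata engine URL and the file system name. As a quick sanity check (the volume name myjfs below is an assumption), the formatted file system can then be inspected and listed through Hadoop:

# Show the status of the newly formatted volume
juicefs status mysql://username:password@(ip:port)/database
# List the volume through the Hadoop SDK once core-site.xml is in place
hadoop fs -ls jfs://myjfs/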

Solution verification and optimization

Having covered how to use JuiceFS, we move on to testing. The test environment uses servers with 48 cores and 187 GB of memory. The HBase cluster has one HMaster, one RegionServer, and three ZooKeeper nodes. The metadata engine is a three-node MySQL setup with master-slave replication. The object storage is Mobile Cloud EOS, accessed over the public network. JuiceFS is configured with a 64 MiB chunk size, a 4 MiB physical block size, no local cache, and 300 MiB of memory buffer. We built two HBase clusters: one stores HBase data directly on Mobile Cloud object storage, the other places JuiceFS between HBase and the object storage. Sequential write and random read are two key performance indicators of an HBase cluster, and the PE (PerformanceEvaluation) testing tool is used to measure both. The measured read and write performance is shown in the table below.

Workload           HBase-JuiceFS-EOS (rows/s)   HBase-EOS (rows/s)
Sequential write   79465                        33343
Random read        6698                         6476
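For reference, a hedged sketch of how such numbers can be collected with the HBase PerformanceEvaluation (PE) tool; the row count and client count are illustrative, since the article does not state the exact parameters.

# Sequential write with 10 client threads, run in-process rather than via MapReduce
hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=1000000 sequentialWrite 10
# Random read against the table created by the write run
hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=1000000 randomRead 10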

According to the test results, with the JuiceFS solution the cluster's sequential write performance improves significantly, while random read performance does not. The reason is that a write request can return as soon as the data is written to the client's memory buffer, so the write latency of JuiceFS is generally very low (tens of microseconds). When JuiceFS handles read requests, it generally reads from the object storage aligned to 4 MiB blocks, which provides a degree of read-ahead, and the data read is also written to the local cache directory for later use. During sequential reads, data fetched in advance is hit by subsequent requests, the cache hit rate is very high, and the read performance of the object storage can be fully exploited. During random reads, however, JuiceFS's prefetching and caching are inefficient; read amplification and the frequent writes and evictions of the local cache actually lower the effective utilization of system resources.

There are two directions for improving random read performance. One is to increase the overall cache capacity as much as possible, so that nearly all of the required data ends up cached; with massive data sets, this direction is not feasible. The other is to dig into the JuiceFS core and optimize the data read logic.

The optimizations we have made so far include: 1) turning off the read-ahead mechanism and the cache function to simplify the read path; 2) avoiding caching whole blocks where possible, and fetching data with HTTP Range requests instead; 3) setting a smaller block size; 4) improving the read performance of the object storage itself as much as possible. After testing in the R&D environment, random read performance improved by about 70% with these optimizations.
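As a hedged illustration of where these knobs live (the values are ours for the sketch, not the ones used in the article): read-ahead and the local cache are controlled through JuiceFS Hadoop SDK properties in core-site.xml, while the block size is fixed when the file system is formatted (the --block-size option of juicefs format, in KiB).

<!-- Illustrative tuning values, matching optimizations 1) and 2) above -->
<property>
  <name>juicefs.prefetch</name>
  <value>0</value>        <!-- disable read-ahead threads -->
</property>
<property>
  <name>juicefs.cache-size</name>
  <value>0</value>        <!-- disable the local block cache -->
</property>
<property>
  <name>juicefs.cache-full-block</name>
  <value>false</value>    <!-- fetch the requested range instead of whole blocks -->
</property>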

Combining the test results above: with object storage as the underlying data store, Cloud HBase achieves read and write performance on par with storing the data in HDFS, while costing users less than half as much. HBase support for object storage is an R&D practice that lets users have it both ways.

If it is helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)
