Fluid 0.5 released: paving the way for online elastic scaling of dataset caches


Author | Gu Rong, PASALab, Nanjing University; co-founder of the Fluid project
Source | Alibaba Cloud Native Official Account

Introduction: To address the pain points that data-intensive applications such as big data and AI face in cloud-native scenarios, namely complicated access to heterogeneous data sources, slow I/O caused by the separation of storage and compute, weak scenario awareness, and inefficient scheduling, PASALab of Nanjing University, Alibaba, and Alluxio jointly launched the open source project Fluid in June 2020.

Fluid is an efficient support platform for data-intensive applications in cloud-native environments. Since its open source release, the project has attracted the attention of many experts and engineers in related fields, and the community has continued to evolve with their positive feedback. Recently, Fluid 0.5 was officially released. This version mainly adds and improves the following three aspects:

  • Richer dataset operations, including online elastic cache scaling and metadata backup and recovery.

  • Support for deployment and configuration in diverse environments, meeting users' individual deployment and configuration requirements.

  • A new data caching engine implementation, giving users more engine choices on the public cloud.

Fluid open source project address: https://github.com/fluid-cloudnative/fluid

The requirements for these three main features come from the production feedback of many community users. In addition, Fluid v0.5 also includes bug fixes and documentation updates. You are welcome to try Fluid v0.5!

Fluid v0.5 download link: https://github.com/fluid-cloudnative/fluid/releases

The following sections introduce the features of this new release in more detail.

Richer dataset operations

In this version, Fluid focuses on enriching the operations related to its core abstraction, the Dataset, so that data-intensive applications can better leverage basic cloud-native capabilities such as elasticity and observability, and users gain more flexibility in dataset management.

1. Online elastic scaling of dataset caches

This is a feature that community users have long been looking forward to. Before Fluid v0.5, adjusting the cache capacity of a dataset required uninstalling the entire cache engine and redeploying it, which was time-consuming and labor-intensive and also meant paying the high cost of losing all cached data. In the new version, we therefore provide elastic scaling of the dataset cache: users can, on the fly, increase the cache capacity of a dataset to accelerate data access (scale out), or reduce the cache capacity of an infrequently used dataset (scale in), achieving finer-grained, elastic resource allocation and improving resource utilization. Fluid's built-in controller selects suitable nodes to scale according to its strategy; for example, when scaling in, it uses the tasks running on a node and the node's share of the cache as filter conditions.

To elastically scale a dataset's cache capacity, the user only needs to run the following command:

kubectl scale alluxioruntimes.data.fluid.io {datasetName}  --replicas={num}

The datasetName corresponds to the name of the dataset, and replicas specifies the number of cache nodes.
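
For example, assuming a dataset named hbase (the name is illustrative), the following commands scale its cache to 4 nodes and then watch the runtime until the change takes effect:

kubectl scale alluxioruntimes.data.fluid.io hbase --replicas=4
kubectl get alluxioruntimes.data.fluid.io hbase -w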

A demonstration video of manual dataset scaling and its effect: http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/302459823704.mp4

For more details on manual dataset scaling, please refer to the sample documentation on GitHub.

2. Metadata backup and recovery

This feature enhances the flexibility of Fluid's dataset metadata management. Fluid v0.4 already supported loading a dataset's metadata (for example, the file system inode tree) locally and recording key statistics of the dataset (for example, total data size and number of files). However, once the user destroyed the local dataset, this metadata was lost as well and had to be fetched again from the underlying storage system when the dataset was rebuilt.

Therefore, in Fluid v0.5 we added a new Kubernetes custom resource, DataBackup, which gives users a declarative API to control backup behavior. A simple example of a DataBackup custom resource object is as follows:

apiVersion: data.fluid.io/v1alpha1
kind: DataBackup
metadata:
  name: hbase-backup
spec:
  dataset: hbase
  backupPath: pvc://<pvcName>/subpath1/subpath2/
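
Assuming the example above is saved as backup.yaml (the file name and PVC path are illustrative), the backup can be triggered and tracked with standard kubectl commands, provided the DataBackup CRD is installed with its usual resource name:

kubectl create -f backup.yaml
kubectl get databackup hbase-backup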

When creating the dataset again, you only need to add a new field specifying the location of the backup file:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  dataRestoreLocation:
    path: pvc://pvc-local/subpath1/
  mounts:
    - mountPoint:  https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.6/

Fluid will then load the metadata and dataset statistics from the backup file first, greatly improving metadata loading speed.
 
For more details on dataset metadata backup and recovery, please refer to the sample documentation on GitHub.

3. Dataset observability improvements

Fluid v0.5 further enhances dataset observability in two ways:

1) Integration with Prometheus

This feature supports collecting availability and performance metrics for datasets, which can be visualized with Grafana. The AlluxioRuntime implementation is currently supported. Users can easily see metrics such as available cache nodes, cache capacity, current cache ratio, remote reads, and short-circuit reads. The configuration process is very simple, so the dataset monitoring system works out of the box.

For specific usage, please refer to the sample documentation on GitHub.

2) New dataset cache hit ratio metrics

This feature shows how many of all accesses to a dataset in the past minute were served by the distributed cache. On the one hand, this metric helps users analyze performance bottlenecks in their data-intensive applications and quantify the effect of Fluid on the overall application workflow; on the other hand, it helps users make trade-offs between application performance and cache resource usage and arrive at reasonable scaling decisions.

In Fluid v0.5, these metrics are added to Dataset.Status.CacheStates in the Dataset CRD, and specifically include:

  • Cache Hit Ratio: The percentage of accesses hit by the distributed cache in the past minute.

  • Local Hit Ratio: The percentage of local cache hits in the past minute.

  • Remote Hit Ratio: The percentage of remote cache hits in the past minute.

Note: With a distributed cache, there are two kinds of cache hits. A local cache hit means the access initiator can read the cached data directly on the same node; a remote cache hit means the access initiator has to read cached data on other nodes over the network.
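
As a hypothetical illustration: if, during the past minute, 70.0% of all accesses to the dataset were served by the cache on the same node and 16.2% by caches on other nodes, then Local Hit Ratio = 70.0%, Remote Hit Ratio = 16.2%, and the overall Cache Hit Ratio = 70.0% + 16.2% = 86.2%.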

In Fluid v0.5, users can conveniently view the cache hit ratio metrics with the following command:

kubectl get dataset <dataset-name> -o wide
NAME        ...  CACHE HIT RATIO   AGE
<dataset-name> ...  86.2%           16m

Support for deployment and configuration in diverse environments

Since the release of Fluid 0.4, based on problems and needs reported by community users in real deployments, we have added more support for deploying and configuring Fluid in diverse environments.

1. Support for Fuse global mode

In Fluid, the remote files defined in a Dataset resource object are schedulable, which means you can manage where remote files are cached in the Kubernetes cluster just as you manage Pods. Compute Pods access the data files through a Fuse client. In previous versions of Fluid, the Fuse client was always scheduled onto the nodes where the cache was located, and users could not control Fuse scheduling freely.

In Fluid v0.5 we added a global deployment mode for Fuse. In this mode, Fuse is deployed to all nodes by default, and users can influence Fuse scheduling by specifying a nodeSelector for Fuse. Meanwhile, the cache is preferentially placed on nodes running a large number of compute Pods.
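
Below is a minimal sketch of what such a configuration might look like, assuming the global switch and node selector are exposed under the fuse section of the runtime spec; the exact field names and the node label used here are illustrative and should be verified against the sample documentation referenced below:

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: hbase
spec:
  ...
  fuse:
    global: true              # deploy the Fuse client to all eligible nodes
    nodeSelector:
      fuse-node: "true"       # illustrative node label restricting where Fuse runs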

Usage is very simple; please refer to the sample documentation on GitHub.

2. Support for user-level HDFS configuration

Many community users use the distributed caching system Alluxio as the caching engine for Fluid datasets. When a dataset is persisted in an HDFS file system, the Alluxio cluster needs the HDFS configuration in advance in order to access the underlying HDFS correctly.

In Fluid v0.5, we use native Kubernetes resources to support this scenario. Users first create the HDFS-related configuration files (e.g. hdfs-site.xml and core-site.xml) as a ConfigMap in the Kubernetes environment, and then reference that ConfigMap in the AlluxioRuntime resource object they create.
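
For example, assuming hdfs-site.xml and core-site.xml are in the current directory, the ConfigMap can be created directly from the files (the ConfigMap name hdfs-configmap is illustrative):

kubectl create configmap hdfs-configmap \
  --from-file=hdfs-site.xml \
  --from-file=core-site.xml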

An example AlluxioRuntime resource object is as follows:

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: my-hdfs
spec:
  ...
  hadoopConfig: <configmap-name>
  ...

The Alluxio cluster created this way will be able to access data in the HDFS cluster normally. For more information, please refer to the sample documentation on GitHub.

New data caching engine implementation

Fluid's default distributed cache Runtime is AlluxioRuntime. To meet the caching needs of users in different environments, Fluid made the distributed cache Runtime framework pluggable in earlier versions. In Fluid v0.5, community contributors from Alibaba Cloud built JindoRuntime on top of this framework, adding a new engine implementation for Fluid Dataset data management and caching. With JindoRuntime, users can use JindoFS's cache mode in Fluid to access and cache remote files. Using and deploying JindoRuntime on Fluid is simple, compatible with the native Kubernetes environment, and works out of the box.
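
As a rough sketch, deploying JindoRuntime follows the same pattern as AlluxioRuntime: a Dataset is bound to a JindoRuntime resource object of the same name. The example below is illustrative only; field names and values (for instance the tiered store settings) should be checked against the JindoRuntime documentation:

apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hbase                # must match the name of the Dataset it accelerates
spec:
  replicas: 2                # number of cache worker nodes
  tieredstore:
    levels:
      - mediumtype: MEM      # cache data in memory
        path: /dev/shm
        quota: 2Gi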

Summary

In Fluid v0.5, we have enriched and enhanced Fluid's features and user experience.

First, Fluid v0.5 further expands dataset operations:

  • Online elastic scaling of dataset caches, enabling more flexible and fine-grained control of cluster resource allocation.

  • A new DataBackup CRD for backing up and restoring dataset metadata and related information, helping the dataset caching system restart quickly.

  • New cache hit ratio metrics to help users better quantify and analyze the acceleration Fluid provides.

Second, Fluid supports more environment modes and configurations, meeting the deployment requirements of more real-world scenarios.

Finally, Fluid adds a JindoFS-based distributed cache Runtime, JindoRuntime, giving users more cache engine choices across diverse deployment environments.

We will continue to follow and adopt community suggestions to promote the long-term development of the Fluid project, and we look forward to more feedback from you.

Acknowledgements

Thanks to the community members who contributed to this version, including Wang Tao from Alibaba Cloud, Xie Yuandong from Tencent Cloud, Qiu Lingwei from China Telecom, and Xu Zhihao, Hou Haojun, Chen Guowang, and Chen Yuquan from PASALab, Nanjing University.

About the Author

Dr. Gu Rong is an associate researcher in the Computer Science Department of Nanjing University, a co-founder of the Fluid open source project, and a PMC member of the Alluxio open source project. His research focuses on big data processing systems, and he has published more than 30 papers in leading journals and conferences such as TPDS, ICDE, JPDC, IPDPS, and ICPP. He has led several general and youth projects of the National Natural Science Foundation of China as well as special funding projects of the China Postdoctoral Science Foundation. His research results have been applied at companies such as Alibaba, Baidu, ByteDance, Sinopec, and Huatai Securities, and in the open source projects Apache Spark and Alluxio. He won the first prize of the Jiangsu Science and Technology Award in 2018 and the Youth Science and Technology Award of the Jiangsu Computer Society in 2019. He serves as a communication member of the System Software Technical Committee and the Big Data Technical Committee of the China Computer Federation, and as secretary-general of the Big Data Special Committee of the Jiangsu Computer Society.
