Fluid 0.4 officially released: data preheating support and optimizations for massive-small-file scenarios


Author | Rong Gu; Photo Credit @ 轻零

Introduction: To address the high data-access latency, difficulty of joint analysis, and complex multi-dimensional management faced by data-intensive applications such as big data and AI under cloud-native compute-storage separation, PASALab of Nanjing University, Alibaba, and Alluxio jointly launched the open source project Fluid in September 2020.

Recently, Fluid 0.4 was officially released, adding four important features:

  • A new DataLoad custom resource that provides easy-to-use, customizable data preheating (warm-up)

  • Enhanced support for datasets with massive numbers of small files, expanding the AI scenarios Fluid can serve

  • An HDFS-compatible file system interface (HCFS), enabling data access from frameworks such as Spark

  • Mixed deployment of multiple datasets on a single node, adapting Fluid to shared clusters in production environments

Fluid project address : https://github.com/fluid-cloudnative/fluid

As with Fluid 0.3, the requirements for these features came from the production feedback of many community users. Fluid 0.4 also includes a number of bug fixes and documentation updates. Welcome to try Fluid 0.4, and thanks to the community partners who contributed to this release! In the next iteration we will continue to follow and adopt community suggestions to advance the Fluid project, and we look forward to more feedback. The new features are introduced in more detail below.

Support for active data warm-up

Data preheating (warm-up) is a common optimization when training AI models. It means pulling the data an application needs from a remote storage system into the local compute cluster before the application runs, so the data is ready when the job starts. Because preheating reads data in a sequential, regular, parallel pattern, it avoids the large, unnecessary communication overhead that random reads incur when data-intensive applications consume data directly from remote storage.

Therefore, Fluid 0.4 introduces a new Kubernetes custom resource, DataLoad, which gives users a declarative, Kubernetes-native API to control data preheating behavior. A minimal DataLoad example looks like this:

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-dataload
spec:
  dataset:
    name: imagenet
    namespace: default

In addition, with a small amount of extra configuration, DataLoad supports customizations such as loading specific subdirectories, controlling the number of cache replicas, and synchronizing metadata. For more details on using DataLoad, see the example documentation on GitHub.
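As a sketch of those options (field names follow the v1alpha1 DataLoad API as we understand it; the dataset name and target paths are illustrative):

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-dataload
spec:
  dataset:
    name: imagenet
    namespace: default
  # Synchronize metadata from the remote storage before loading.
  loadMetadata: true
  # Only preheat the listed subdirectories, with a chosen number
  # of cache replicas for each.
  target:
    - path: /train
      replicas: 2
    - path: /val
      replicas: 1
```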

The demonstration video about the use and optimization of DataLoad is as follows: http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/287213603893.mp4

Enhanced support for datasets with massive numbers of small files

Fluid is an efficient support platform for data-intensive applications in cloud-native environments, so we pay close attention to how well its dataset capabilities apply across different scenarios. Before 0.4, Fluid already provided dataset abstraction, management, acceleration, observability, and related capabilities. However, community feedback showed that these capabilities were still rudimentary in scenarios with massive numbers of small files.

Given how common massive-small-file datasets are in real production environments, especially in AI scenarios, we explored the problems they cause in depth and developed solutions such as asynchronous metadata loading and querying and streaming data processing. These solutions are integrated into Fluid 0.4 to strengthen its support for such datasets.

The following performance comparison shows some of the optimizations, evaluated with Fluid's Alluxio Runtime on a dataset of 4 million small files:

[Figure: performance comparison results on the 4-million-small-file dataset]

Managing massive numbers of small files is a thorny problem for many storage systems. In subsequent releases, we will continue to focus on this scenario and the problems it brings.

Data access support for big data computing frameworks such as Spark

Beyond AI applications, Fluid 0.4 also supports running big data applications such as Spark on top of it. By exposing the Hadoop Compatible File System (HCFS) interface of the Alluxio distributed cache engine, data analysis applications written with frameworks such as Hadoop MapReduce and Apache Spark can run directly on Fluid without any code changes, while enjoying Fluid's distributed cache acceleration and other capabilities.

For more details on accessing data through the HCFS interface, see the example documentation on GitHub.
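As an illustrative sketch only (the container image, endpoint address, and port below are assumptions; Fluid exposes the actual Alluxio master endpoint for each dataset, which you should look up in your cluster), a Spark job could read a Fluid dataset through an `alluxio://` HCFS path like this:

```yaml
# Hypothetical Pod that submits a Spark example job against the
# "imagenet" dataset's assumed Alluxio HCFS endpoint.
apiVersion: v1
kind: Pod
metadata:
  name: spark-wordcount
spec:
  restartPolicy: Never
  containers:
    - name: spark
      image: apache/spark:v3.1.1   # assumed image
      args:
        - /opt/spark/bin/spark-submit
        - --class
        - org.apache.spark.examples.JavaWordCount
        - local:///opt/spark/examples/jars/spark-examples.jar
        # Assumed HCFS endpoint form: <dataset>-master-0.<namespace>:<rpc-port>
        - alluxio://imagenet-master-0.default:19998/train/labels.txt
```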

Mixed deployment of multiple datasets on a single node

In real production environments, users run multiple training tasks on the GPU nodes of a Kubernetes cluster, using multiple datasets. Before Fluid 0.4, a single node could not host multiple datasets at the same time: if several users expected to access their own datasets on the same node, some of those datasets could not be created.

Fluid 0.4 adds the ability to deploy multiple datasets on a single node: as long as node resources suffice, deployment conflicts between datasets from different users no longer occur. This makes Fluid better fit the needs of real production environments. Hybrid deployment also makes effective use of idle resources, increasing the resource utilization of each node in the cluster and further improving Fluid's cost-effectiveness.
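As an illustrative sketch (dataset names and storage paths are assumptions), two Dataset resources like the following can, from Fluid 0.4 on, have their caches co-located on the same node when resources allow:

```yaml
# Two independent datasets, possibly owned by different users.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  mounts:
    - mountPoint: oss://imagenet-bucket/   # assumed remote storage path
      name: imagenet
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: cifar10
spec:
  mounts:
    - mountPoint: oss://cifar10-bucket/    # assumed remote storage path
      name: cifar10
```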

For a brief introduction to multi-dataset single-node mixed deployment, see the example documentation on GitHub.

Thanks

  • Zhihao Xu (PASALab, Nanjing University), for contributions to small-file scenario support and the data warm-up feature

  • Yuandong Xie (Yun Zhisheng), for feature development and scenario verification of multi-dataset single-node mixed deployment

  • Lingwei Qiu (China Telecom), for contributions to separating the Fluid architecture: he split the runtime and dataset controllers so the two components can evolve in parallel in the future

Summary

The Fluid 0.4 release continues to address the problems and needs that community users report from production, expanding Fluid's applicability across scenarios and improving the user experience:

  • First, optimized support for massive-small-file datasets lets Fluid better handle different usage scenarios;

  • Second, the new DataLoad custom resource gives users a simple data warm-up solution;

  • Third, data access support for big data applications such as Spark lets Fluid serve more types of data-intensive applications;

  • Finally, mixed deployment of multiple datasets makes Fluid better fit the needs of real production environments.

If you have any questions or suggestions, please join the DingTalk group to discuss: https://img.alicdn.com/tfs/TB1Cm4ciNvbeK8jSZPfXXariXXa-452-550.png

About the Author

Rong Gu, Ph.D., is an associate professor in the Department of Computer Science at Nanjing University. His research focuses on big data processing systems; he has published more than 20 papers in leading journals and conferences in the field, including TPDS, ICDE, JPDC, IPDPS, and ICPP, and has led several projects funded by the National Natural Science Foundation of China (including the Youth Program) as well as a special grant from the China Postdoctoral Science Foundation. His research results have been applied at Alibaba, Baidu, ByteDance, Sinopec, Huatai Securities, and other companies, as well as in the open source projects Apache Spark and Alluxio. He won a Jiangsu Science and Technology First Prize in 2018 and the Jiangsu Computer Society Young Science and Technology Award in 2019. He serves as a member of the CCF System Software Technical Committee and the CCF Big Data Technical Committee, secretary-general of the Big Data Committee of the Jiangsu Computer Society, a co-founder of the Fluid open source project, and a PMC member of the Alluxio open source project.

Origin www.oschina.net/news/120890/fluid-0-4-released