How to use Fluid + JuiceFS in a Kubernetes cluster

The Yunzhisheng Atlas team began evaluating and following JuiceFS storage in early 2021, and had already accumulated rich experience with Fluid. Recently, the Yunzhisheng team and the Juicedata team jointly developed the Fluid JuiceFS acceleration engine, which enables users to make better use of JuiceFS cache management capabilities in a Kubernetes environment. This article explains how to use Fluid + JuiceFS in a Kubernetes cluster.

Background introduction

Introduction to Fluid

CNCF Fluid is an open source Kubernetes-native distributed dataset orchestration and acceleration engine, mainly serving data-intensive applications in cloud-native scenarios, such as big data and AI applications. For more information about Fluid, please refer to the official documentation.

Instead of accelerating and managing the full storage system, Fluid accelerates and manages the datasets used by applications, providing a more cloud-native way to manage data. Its cache acceleration engine caches data from the underlying storage system in the memory or on the disks of compute nodes, which addresses the low I/O efficiency caused by limited data transmission bandwidth in compute-storage separation architectures, as well as the bandwidth and IOPS limits of the underlying storage itself. Fluid also provides cache-aware data scheduling: the cache is exposed as a Kubernetes extended resource, so Kubernetes can take it into account when scheduling tasks.

Fluid has 2 important concepts: Dataset and Runtime

  • Dataset: A dataset is a logically related set of data with consistent file characteristics that will be used by the same computing engine.
  • Runtime: The interface to the execution engine that implements dataset security, version management, and data acceleration capabilities, and defines a series of lifecycle methods.

Fluid's Runtime defines a standardized interface. The cache runtime engine can connect to multiple cache engines, giving users more flexible choices: for different scenarios and needs, users can make full use of the appropriate cache engine to accelerate the corresponding applications.

Introduction to JuiceFS

JuiceFS is a high-performance open source distributed file system designed for cloud environments. It is fully compatible with the POSIX, HDFS, and S3 interfaces, and is suitable for scenarios such as big data, AI model training, Kubernetes shared storage, and massive data archive management.

When data is stored with JuiceFS, the data itself is persisted in object storage (for example, Amazon S3), while the corresponding metadata can be persisted in Redis, MySQL, TiKV and other database engines, depending on the scenario. The JuiceFS client has data caching capability: when data is read through the client, it is intelligently cached in the local cache path configured by the application (which can be memory or disk), and the metadata is also cached in the local memory of the client node.
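
To make this division of labor concrete, the following is a minimal sketch of creating a JuiceFS volume directly with the JuiceFS client, outside of Kubernetes; the addresses, bucket and volume names are placeholders for illustration, and this step is not part of the Fluid walkthrough later in this article:

# Sketch only: object storage holds the data, Redis holds the metadata;
# addresses, <bucket> and the volume name jfsdemo are placeholders
$ juicefs format \
    --storage minio \
    --bucket http://127.0.0.1:9000/<bucket> \
    --access-key minioadmin \
    --secret-key minioadmin \
    redis://127.0.0.1:6379/1 \
    jfsdemo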

For AI model training scenarios, computation after the first epoch can read training data directly from the cache, which greatly improves training efficiency. JuiceFS also supports read-ahead and concurrent reads, which in AI training scenarios helps ensure the generation efficiency of each mini-batch by preparing data in advance. Data preheating can transfer data from the public cloud to local nodes ahead of time, so that once GPU resources are allocated there is already warmed data to compute on, saving time on precious GPUs.

Why use JuiceFSRuntime

As the underlying infrastructure, the Yunzhisheng Atlas supercomputing platform supports the company's model training and inference services across various AI fields. Yunzhisheng started building the industry-leading GPU/CPU heterogeneous Atlas computing platform and distributed file storage system early on; the cluster provides high-performance computing and storage access to massive data for AI workloads. The Yunzhisheng Atlas team began evaluating and following JuiceFS storage in early 2021 and ran a series of POC tests, which showed that its data reliability and fit with our business scenarios met our needs.

In training scenarios, we make full use of the caching capabilities of the JuiceFS client to accelerate data access for AI model training, but we found some problems during use:

  • When training Pods mount JuiceFS through hostPath, the JuiceFS client must be mounted on every compute node. Mounting requires administrator operation, and the mount parameters are fixed and not flexible enough.
  • Users cannot manage the cache on the compute node client; the cache cannot be manually cleaned or scaled.
  • Cached datasets cannot be scheduled by Kubernetes the way Kubernetes custom resources can.

Since we have accumulated a certain amount of experience in using Fluid in the production environment, we cooperated with the Juicedata team to design and develop JuiceFSRuntime, which combines Fluid's data orchestration and management capabilities with JuiceFS's caching capabilities.

What is Fluid + JuiceFS (JuiceFSRuntime)

JuiceFSRuntime is a Runtime customized for Fluid, in which the JuiceFS worker and fuse images and the corresponding cache parameters can be specified. It is built in the same way as other Fluid Runtimes, that is, through a CRD; the JuiceFSRuntime controller watches JuiceFSRuntime resources to manage the cache Pods.

JuiceFSRuntime supports data affinity scheduling (nodeAffinity) to select appropriate cache nodes, supports lazy startup of FUSE Pods, and lets users access data through the POSIX interface. It currently supports only one mount point.
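
For example, cache nodes can be selected with node affinity. In Fluid this is typically expressed on the Dataset; the sketch below assumes a hypothetical node label fluid-cache=true, and the exact field layout should be checked against the Fluid version in use:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: jfsdemo
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: fluid-cache   # hypothetical label marking cache nodes
              operator: In
              values:
                - "true"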

As shown in the architecture diagram above, JuiceFSRuntime consists of a FUSE Pod and a Worker Pod. The Worker Pod mainly implements cache management, such as cleaning the cache when the Runtime exits; the FUSE Pod is mainly responsible for setting the parameters of the JuiceFS client and mounting it.

How to use JuiceFSRuntime

Let's take a look at how to use JuiceFSRuntime for cache acceleration.

Preliminary preparation

To use JuiceFSRuntime you first need to prepare the metadata engine and object storage.

Building a metadata engine

Users can easily purchase cloud Redis databases in various configurations from cloud providers. For evaluation and testing, you can use Docker to quickly run a Redis instance on the server:

$ sudo docker run -d --name redis \
	-v redis-data:/data \
	-p 6379:6379 \
	--restart unless-stopped \
	redis redis-server --appendonly yes

Prepare object storage

Like Redis, object storage services are provided by almost all public cloud platforms. Since JuiceFS supports the object storage services of almost all mainstream platforms, users can choose one according to their own situation.

For evaluation and testing, here a MinIO instance is run using Docker:

$ sudo docker run -d --name minio \
    -p 9000:9000 \
    -p 9900:9900 \
    -v $PWD/minio-data:/data \
    --restart unless-stopped \
    minio/minio server /data --console-address ":9900"

The initial Access Key and Secret Key of the object store are both minioadmin.
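
JuiceFS will store its data in a bucket in this MinIO instance, so one needs to exist. Below is a minimal sketch of creating it with the MinIO client mc; the alias name local and the placeholder <bucket> are illustrative, and $IP is the address of the node running MinIO:

# Assumes the MinIO client `mc` is installed; replace <bucket> with the bucket
# name you will reference in the Dataset later
$ mc alias set local http://$IP:9000 minioadmin minioadmin
$ mc mb local/<bucket>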

Download and install Fluid

Follow the documentation steps to install Fluid: in the chart's values.yaml, set runtime.juicefs.enable to true, then install Fluid.
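
For reference, a Helm-based installation might look like the sketch below; the chart repository URL and release name are assumptions, so check the Fluid installation documentation for the exact commands for your version:

# Sketch only: the chart repository URL and release name are assumptions
$ helm repo add fluid https://fluid-cloudnative.github.io/charts
$ helm repo update
$ helm install fluid fluid/fluid --set runtime.juicefs.enable=true

After installation, make sure the Fluid cluster is up and running: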

$ kubectl get po -n fluid-system
NAME                                         READY   STATUS              RESTARTS   AGE
csi-nodeplugin-fluid-ctc4l                   2/2     Running             0          113s
csi-nodeplugin-fluid-k7cqt                   2/2     Running             0          113s
csi-nodeplugin-fluid-x9dfd                   2/2     Running             0          113s
dataset-controller-57ddd56b54-9vd86          1/1     Running             0          113s
fluid-webhook-84467465f8-t65mr               1/1     Running             0          113s
juicefsruntime-controller-56df96b75f-qzq8x   1/1     Running             0          113s

Make sure that the juicefsruntime-controller, dataset-controller, and fluid-webhook Pods, as well as the several csi-nodeplugin Pods, are all running properly.

Create Dataset

Before using JuiceFS, you need to provide the parameters of the metadata service (such as Redis) and the object storage service (such as MinIO), and create the corresponding secret:

# $IP is the IP of the node where Redis is running; minioadmin is the
# access key (ak) and secret key (sk) of the object storage
kubectl create secret generic jfs-secret \
    --from-literal=metaurl=redis://$IP:6379/1 \
    --from-literal=access-key=minioadmin \
    --from-literal=secret-key=minioadmin

Create the Dataset YAML file

cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: jfsdemo
spec:
  mounts:
    - name: minio
      mountPoint: "juicefs:///demo"
      options:
        bucket: "<bucket>"
        storage: "minio"
      encryptOptions:
        - name: metaurl
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: metaurl
        - name: access-key
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: access-key
        - name: secret-key
          valueFrom:
            secretKeyRef:
              name: jfs-secret
              key: secret-key
EOF

Since JuiceFS uses a local cache, the corresponding Dataset supports only one mount, and there is no separate UFS (under file system) for JuiceFS. The subdirectory to be mounted can be specified in the mountPoint ("juicefs:///" is the root path), and it will be mounted into the container as the root directory.

Create a Dataset and view the Dataset status

$ kubectl create -f dataset.yaml
dataset.data.fluid.io/jfsdemo created
 
$ kubectl get dataset jfsdemo
NAME      UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE      AGE
jfsdemo                                                                  NotBound   44s

As shown above, the value of the phase property in status is NotBound, which means that the Dataset resource object is not currently bound to any JuiceFSRuntime resource object. Next, we will create a JuiceFSRuntime resource object.

Create JuiceFSRuntime

Create the JuiceFSRuntime YAML file

$ cat<<EOF >runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: jfsdemo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /cache
        quota: 40960   # the minimum unit of quota in JuiceFS is MiB, so this is 40GiB
        low: "0.1"
EOF

Create and view JuiceFSRuntime

$ kubectl create -f runtime.yaml
juicefsruntime.data.fluid.io/jfsdemo created

$ kubectl get juicefsruntime
NAME      WORKER PHASE   FUSE PHASE   AGE
jfsdemo   Ready          Ready        72s

View the status of the JuiceFS-related component Pods

$ kubectl get po | grep jfs
jfsdemo-worker-mjplw                                           1/1     Running   0          4m2s

JuiceFSRuntime does not have a master component, and the FUSE component uses lazy startup: the FUSE Pod is only created when a Pod that uses the dataset is created.

Create a cache acceleration job

Create an application that needs to be accelerated; the Pod uses the Dataset created above by referencing the PVC with the same name:

$ cat<<EOF >sample.yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo
      persistentVolumeClaim:
        claimName: jfsdemo
EOF

Create Pod

$ kubectl create -f sample.yaml
pod/demo-app created

View pod status

$ kubectl get po |grep demo
demo-app                                                       1/1     Running   0          31s
jfsdemo-fuse-fx7np                                             1/1     Running   0          31s
jfsdemo-worker-mjplw                                           1/1     Running   0          10m

You can see that the pod has been created successfully, and the Fuse component of JuiceFS has also been successfully started.

Enter the Pod and run df -h to check whether the JuiceFS file system is mounted:

$ kubectl exec -it demo-app -- df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          20G   14G  5.9G  71% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
JuiceFS:minio   1.0P  7.9M  1.0P   1% /data

You can see that the JuiceFS file system has been successfully mounted at /data.

Next, let's test the write function in the demo-app pod:

$ kubectl exec -it demo-app -- bash
[root@demo-app /]# df
Filesystem         1K-blocks     Used     Available Use% Mounted on
overlay             20751360 14585944       6165416  71% /
tmpfs                  65536        0         65536   0% /dev
tmpfs                3995028        0       3995028   0% /sys/fs/cgroup
JuiceFS:minio  1099511627776     8000 1099511619776   1% /data
/dev/sda2           20751360 14585944       6165416  71% /etc/hosts
shm                    65536        0         65536   0% /dev/shm
tmpfs                3995028       12       3995016   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                3995028        0       3995028   0% /proc/acpi
tmpfs                3995028        0       3995028   0% /proc/scsi
tmpfs                3995028        0       3995028   0% /sys/firmware
[root@demo-app /]#
[root@demo-app /]# cd /data
[root@demo-app data]# echo "hello fluid" > hello.txt
[root@demo-app data]# cat hello.txt
hello fluid

Finally, let's look at the caching function: create a 1 GiB file in the mount directory /data, and then copy it out:

$ kubectl exec -it demo-app -- bash
root@demo-app:~# dd if=/dev/zero of=/data/test.txt count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.55431 s, 164 MB/s
root@demo-app:~# time cp /data/test.txt ./test.txt
real	0m5.014s
user	0m0.003s
sys	0m0.702s
root@demo-app:~# time cp /data/test.txt ./test.txt
real	0m0.602s
user	0m0.004s
sys	0m0.584s

Judging from the results, the first cp took about 5 s, during which the cache was being built; the second cp took only 0.6 s because the cache already existed. With the caching capability provided by JuiceFS, a file only needs to be accessed once to be cached in the local cache path, and all subsequent accesses are served directly from the cache rather than from object storage.
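
To confirm that the data actually landed in the cache, you can check the cache directory configured in the tieredstore on the Worker Pod. This is a sketch only; it assumes the worker image ships the standard du utility, and the layout under /cache is managed by JuiceFS itself:

# The Worker Pod name comes from `kubectl get po | grep jfs`; /cache is the
# tieredstore path configured in runtime.yaml
$ kubectl exec -it jfsdemo-worker-mjplw -- du -sh /cache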

Follow-up planning

At present, JuiceFSRuntime does not yet support many functions, and we will continue to improve it, for example running the FUSE Pod in non-root mode and supporting the DataLoad data preheating function.

Recommended reading: Zhihu x JuiceFS: Using JuiceFS to Accelerate Flink Container Startup

If this is helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)
