Nydus container image acceleration in practice on the Yuemiao platform

Text | Xiang Shen
Operations engineer on the Yuemiao platform, focused on the cloud-native field

The number of words in this article is 9574 and the reading time is 24 minutes

In this article, Xiang Shen shares his experience deploying Nydus in a production K8s cluster.

Nydus is an open source container image acceleration project jointly established by Ant Group, Alibaba Cloud, and ByteDance, and a sub-project of CNCF Dragonfly. Building on the OCI Image Spec, Nydus redesigns the image format and the underlying file system to speed up container startup and improve the container startup success rate in large-scale clusters. For detailed documentation, see:

Nydus official website: https://nydus.dev/
Nydus Github: https://github.com/dragonflyoss/image-service

PART.1
Container image concepts

  1. Container images

The official analogy for a container image is the shipping container seen in everyday life. Shipping containers come in different specifications, but the box itself is immutable; only the contents differ.

For an image, the immutable part contains everything needed to run an application (such as MySQL). Developers can use tools (such as a Dockerfile) to build their own container images, sign them, and upload them to a registry. Anyone who needs to run the software can then download, verify, and run these containers by specifying a name (such as example.com/my-app).

  2. OCI standard image specification

Before the OCI standard image specification was introduced, two image specifications were in wide use: Appc and Docker v2.2. The two gradually converged, so the OCI organization launched the OCI Image Format Spec on the basis of Docker v2.2. For images that conform to the specification, developers can package and sign a container once and then run it on any container engine.

This specification gives the definition of OCI Image:

This specification defines an OCI Image, consisting of a manifest, an Image Index (optional), a set of filesystem layers, and a Configuration.

  3. Container workflow

A typical container workflow starts with developers building a container image (Build), then uploading it to an image registry (Ship), and finally deploying it in a cluster (Run).

PART.2
OCI image format

The so-called image file is really a "package" containing multiple files. Together these files provide all the information needed to start a container, including but not limited to the data files that make up the file system, configuration such as the platform the image targets, and data integrity verification information. When we use docker pull or nerdctl pull to pull an image from a registry, we are actually pulling the files contained in the image one after another.

For example, nerdctl pulls, in order, an Index file, a Manifest file, a Config file, and several Layer data files. A standard OCI image is usually made up of exactly these parts.

The Layer files are generally tar archives or compressed tar archives that contain the actual data files of the image. Together, the Layer files form a complete file system (the file system seen inside the container after it is started from the image).

The Config file is a JSON file. It holds configuration information for the image, such as its creation time, build history, environment variables, and startup command.

The Manifest file is also a JSON file. It can be regarded as the file list of the image: it records which Layer files and which Config file the image contains.

The following is a typical example of a Manifest file:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
   "mediaType": "application/vnd.oci.image.config.v1+json",
   "digest": "sha256:0584b370e957bf9d09e10f424859a02ab0fda255103f75b3f8c7d410a4e96ed5",
   "size": 7636
 },
  "layers": [
 {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:214ca5fb90323fe769c63a12af092f2572bf1c6b300263e09883909fc865d260",
    "size": 31379476
 },
 {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:50836501937ff210a4ee8eedcb17b49b3b7627c5b7104397b2a6198c569d9231",
    "size": 25338790
 },
 {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:d838e0361e8efc1fb3ec2b7aed16ba935ee9b62b6631c304256b0326c048a330",
    "size": 600
 },
 {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:fcc7a415e354b2e1a2fcf80005278d0439a2f87556e683bb98891414339f9bee",
    "size": 893
 },
 {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:dc73b4533047ea21262e7d35b3b2598e3d2c00b6d63426f47698fe2adac5b1d6",
    "size": 664
 },
 {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:e8750203e98541223fb970b2b04058aae5ca11833a93b9f3df26bd835f66d223",
    "size": 1394
  }
 ]
}
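To make the role of the Manifest concrete, here is a minimal Python sketch that reads a Manifest and reports which Config it points to and how many compressed bytes its layers add up to. The helper name `summarize_manifest` is hypothetical, and only the first two layers of the example above are included to keep the snippet short.

```python
import json

# Config and the first two layers copied from the Manifest example above.
# `summarize_manifest` is a hypothetical helper, not part of any OCI tooling.
MANIFEST_JSON = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:0584b370e957bf9d09e10f424859a02ab0fda255103f75b3f8c7d410a4e96ed5",
    "size": 7636
  },
  "layers": [
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
     "digest": "sha256:214ca5fb90323fe769c63a12af092f2572bf1c6b300263e09883909fc865d260",
     "size": 31379476},
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
     "digest": "sha256:50836501937ff210a4ee8eedcb17b49b3b7627c5b7104397b2a6198c569d9231",
     "size": 25338790}
  ]
}
"""


def summarize_manifest(manifest: dict) -> dict:
    """Return the Config digest, the layer count, and the total compressed layer size."""
    layers = manifest["layers"]
    return {
        "config_digest": manifest["config"]["digest"],
        "layer_count": len(layers),
        "compressed_layer_bytes": sum(layer["size"] for layer in layers),
    }


summary = summarize_manifest(json.loads(MANIFEST_JSON))
print(summary)
```

With a traditional OCI image, all of the bytes counted in `compressed_layer_bytes` have to be downloaded and decompressed before the container can start.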

The Index file is also a JSON file. It is optional and can be thought of as a Manifest of Manifests. Consider an image identified by a tag, such as docker.io/library/nginx:1.20: it has different image files for different platforms (such as linux/amd64 and linux/arm64). All the files of the image for each platform are described by one Manifest file, so a higher-level file is needed to index these Manifest files.

For example, the Index file of docker.io/library/nginx:1.20 contains a manifests array that records basic information about the Manifests for the different platforms:

{
 "manifests": [
 {
   "digest": "sha256:a76df3b4f1478766631c794de7ff466aca466f995fd5bb216bb9643a3dd2a6bb",
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "platform": {
     "architecture": "amd64",
     "os": "linux"
  },
   "size": 1570
 },
 {
    "digest": "sha256:f46bffd1049ef89d01841ba45bb02880addbbe6d1587726b9979dbe2f6b556a4",
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "platform": {
      "architecture": "arm",
      "os": "linux",
      "variant": "v5"
   },
   "size": 1570
 },
 {
    "digest": "sha256:d9a32c8a3049313fb16427b6e64a4a1f12b60a4a240bf4fbf9502013fcdf621c",
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "platform": {
       "architecture": "arm",
       "os": "linux",
       "variant": "v7"
   },
   "size": 1570
 },
 {
    "digest": "sha256:acd1b78ac05eedcef5f205406468616e83a6a712f76d068a45cf76803d821d0b",
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "platform": {
       "architecture": "arm64",
       "os": "linux",
       "variant": "v8"
   },
   "size": 1570
 },
 {
    "digest": "sha256:d972eee4f12250a62a8dc076560acc1903fc463ee9cb84f9762b50deed855ed6",
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "platform": {
       "architecture": "386",
       "os": "linux"
   },
   "size": 1570
 },
 {
    "digest": "sha256:b187079b65b3eff95d1ea02acbc0abed172ba8e1433190b97d0acfddd5477640",
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "platform": {
       "architecture": "mips64le",
       "os": "linux"
   },
    "size": 1570
 },
 {
    "digest": "sha256:ae93c7f72dc47dbd984348240c02484b95650b8b328464c62559ef173b64ce0d",
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "platform": {
      "architecture": "ppc64le",
      "os": "linux"
   },
    "size": 1570
 },
 {
    "digest": "sha256:51f45f5871a8d25b65cecf570c6b079995a16c7aef559261d7fd949e32d44822",
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "platform": {
       "architecture": "s390x",
       "os": "linux"
  },
   "size": 1570
  }
 ],
 "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
 "schemaVersion": 2
}
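The lookup a client performs against such an Index can be sketched in a few lines of Python. This is a simplified illustration: `select_manifest` is a hypothetical helper, only three of the entries above are reproduced (without their mediaType and size fields), and real container runtimes apply more nuanced platform matching rules.

```python
from typing import Optional

# Three entries copied from the Index example above (fields trimmed for brevity).
INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"digest": "sha256:a76df3b4f1478766631c794de7ff466aca466f995fd5bb216bb9643a3dd2a6bb",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:d9a32c8a3049313fb16427b6e64a4a1f12b60a4a240bf4fbf9502013fcdf621c",
         "platform": {"architecture": "arm", "os": "linux", "variant": "v7"}},
        {"digest": "sha256:acd1b78ac05eedcef5f205406468616e83a6a712f76d068a45cf76803d821d0b",
         "platform": {"architecture": "arm64", "os": "linux", "variant": "v8"}},
    ],
}


def select_manifest(index: dict, os_name: str, arch: str,
                    variant: Optional[str] = None) -> Optional[str]:
    """Return the digest of the first Manifest matching the platform, or None."""
    for entry in index["manifests"]:
        platform = entry["platform"]
        if platform["os"] != os_name or platform["architecture"] != arch:
            continue
        # Only compare the variant when the caller asked for a specific one.
        if variant is not None and platform.get("variant") != variant:
            continue
        return entry["digest"]
    return None


print(select_manifest(INDEX, "linux", "amd64"))
print(select_manifest(INDEX, "linux", "arm", "v7"))
print(select_manifest(INDEX, "linux", "riscv64"))
```

The returned digest is what the client then uses to fetch the platform-specific Manifest.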

PART.3
Problems faced by OCI images

  1. Containers are slow to start

A container can only see the complete file system view after all image layers have been stacked, so it has to wait until every layer of the image has been downloaded and decompressed before it can start. According to a FAST paper [1], pulling the image accounts for about 76% of container startup time, yet only 6.4% of the data is actually read by the container. This result suggests that container startup can be sped up by loading data on demand. In addition, when an image has many layers, stacking them with Overlay adds overhead at runtime.

Generally speaking, container startup is divided into three steps:

Download the image;
Decompress the image;
Use OverlayFS to combine the container's writable layer with the read-only layers of the image to provide the container's runtime environment.

  2. High local storage cost

Each image layer consists of metadata and data. As a result, any change to the metadata of a single file in a layer, such as modifying its permission bits, changes the hash of that layer, and the entire layer then has to be stored again or downloaded again.
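The effect described above can be reproduced with a small Python sketch that builds two uncompressed tar layers in memory. The file name `app/run.sh`, its contents, and the helper names are made up for illustration; real layers are usually compressed and contain many files.

```python
import hashlib
import io
import tarfile


def build_layer(mode: int) -> bytes:
    """Build a one-file tar layer in memory; only the permission bits vary."""
    payload = b"hello from the application binary\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="app/run.sh")  # hypothetical file name
        info.size = len(payload)
        info.mtime = 0  # fixed timestamp so the archive bytes are reproducible
        info.mode = mode
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


layer_a = build_layer(0o644)
layer_b = build_layer(0o755)  # same file data, only the permission bits differ

print(digest(layer_a))
print(digest(layer_b))
print("digests differ:", digest(layer_a) != digest(layer_b))
```

Although the file data is byte-for-byte identical, the two layers get different digests, so a registry or node that deduplicates by layer digest treats them as unrelated blobs.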

  3. Large numbers of similar images

Images use the layer as the basic storage unit, and data is deduplicated by layer hash, so deduplication is coarse-grained. Across the whole registry, a large amount of data is duplicated between the layers of an image and between images, which adds to storage and transfer costs.
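The difference between deduplicating by layer and deduplicating by smaller units can be illustrated with a toy Python model. It assumes fixed-size chunking over raw byte strings with an unrealistically small `CHUNK_SIZE`; all names are hypothetical, and this is not how any particular registry or image format stores data.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # tiny chunk size, for illustration only


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def layer_level_bytes(layers: List[bytes]) -> int:
    """Bytes stored when whole layers are deduplicated by their digest."""
    unique: Dict[str, int] = {sha256(layer): len(layer) for layer in layers}
    return sum(unique.values())


def chunk_level_bytes(layers: List[bytes]) -> int:
    """Bytes stored when every chunk is deduplicated by its digest."""
    unique: Dict[str, int] = {}
    for layer in layers:
        for chunk in split_chunks(layer):
            unique[sha256(chunk)] = len(chunk)
    return sum(unique.values())


layer_v1 = b"AAAABBBBCCCCDDDD"
layer_v2 = b"AAAABBBBCCCCEEEE"  # only the last 4 bytes differ

print(layer_level_bytes([layer_v1, layer_v2]))  # 32: both layers stored in full
print(chunk_level_bytes([layer_v1, layer_v2]))  # 20: shared chunks stored once
```

In this toy case the two nearly identical layers cost 32 bytes under layer-level deduplication but only 20 bytes when the three shared chunks are stored once.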

PART.4
The Nydus image solution

The Nydus image acceleration framework is a sub-project of Dragonfly [2] (a CNCF incubating project). It is compatible with the existing OCI image build, distribution, and runtime ecosystem. The Nydus runtime is written in Rust, which brings language-level safety together with good performance and low memory and CPU overhead, and it is highly scalable.

  1. Nydus architecture

Nydus mainly consists of a new image format and a FUSE user-space file system process responsible for parsing container images.

  2. Nydus workflow

The Nydus image format does not change the overall architecture of the OCI image format; it mainly optimizes the data structure of the Layer.

Nydus separates the file data and the metadata (the directory structure of the file system, file metadata, and so on) that were originally stored together in a Layer, and stores them in a "Blob Layer" and a "Bootstrap Layer" respectively. The file data in the Blob Layer is divided into chunks to make lazy loading possible: when a file needs to be accessed, only the corresponding chunks are pulled rather than the entire Blob Layer.

The chunk information, including the location of each chunk within the Blob Layer, is also stored in the Bootstrap Layer as metadata. When a container starts, it therefore only needs to pull the Bootstrap Layer. When the container later accesses a file, it pulls the corresponding chunks from the corresponding Blob Layer according to the metadata in the Bootstrap Layer.
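The workflow above can be modeled with a toy Python sketch: a metadata object that maps each file to chunk locations, a remote blob that counts how many bytes are fetched, and a reader that pulls chunks only on first access. The class names `Bootstrap`, `RemoteBlob`, and `LazyImage` and the sample file paths are assumptions made for illustration; the real Nydus on-disk format and nydusd implementation are considerably more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Chunk = Tuple[int, int]  # (offset, length) of a chunk inside the blob


@dataclass
class Bootstrap:
    """Toy metadata layer: file path -> chunk locations in the blob."""
    files: Dict[str, List[Chunk]]


@dataclass
class RemoteBlob:
    """Stands in for a Blob Layer in a registry; counts bytes actually fetched."""
    data: bytes
    fetched_bytes: int = 0

    def fetch_range(self, offset: int, length: int) -> bytes:
        self.fetched_bytes += length
        return self.data[offset:offset + length]


@dataclass
class LazyImage:
    bootstrap: Bootstrap
    blob: RemoteBlob
    cache: Dict[Chunk, bytes] = field(default_factory=dict)

    def read_file(self, path: str) -> bytes:
        parts = []
        for chunk in self.bootstrap.files[path]:
            if chunk not in self.cache:  # pull a chunk only on first access
                self.cache[chunk] = self.blob.fetch_range(*chunk)
            parts.append(self.cache[chunk])
        return b"".join(parts)


blob = RemoteBlob(data=b"#!/bin/sh\n" + b"x" * 1000)
image = LazyImage(
    bootstrap=Bootstrap(files={
        "/entrypoint.sh": [(0, 10)],
        "/big.dat": [(10, 500), (510, 500)],
    }),
    blob=blob,
)

print(image.read_file("/entrypoint.sh"))
print(blob.fetched_bytes, "of", len(blob.data), "bytes fetched")
```

Reading only `/entrypoint.sh` fetches 10 of the 1010 bytes in the blob; the chunks of `/big.dat` stay remote until something actually reads that file.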

  3. Advantages of Nydus

Container images are downloaded on demand, so users no longer need to download a complete image to start a container.
Image data is deduplicated at the chunk level, which saves as much storage as possible.
An image contains only the data that is finally usable; outdated data no longer needs to be stored or downloaded.
End-to-end data consistency verification gives users better data protection.
Compatible with the OCI distribution and artifacts standards, so it works out of the box.
Different image storage backends are supported: image data can be stored in an image registry, on NAS, or in object storage such as S3.
Good integration with Dragonfly.

PART.5
Nydus in production at Yuemiao

As a leading disease prevention information and service platform in China, the Yuemiao platform is built around vaccine appointment services and provides professional, comprehensive disease prevention information and services.

As of February 2023, the Yuemiao platform had more than 37 million registered users, covered 28 provinces and municipalities directly under the Central Government and more than 200 prefecture-level cities, was connected to more than 4,000 community public health service agencies across the country, and had provided vaccine appointment and subscription services more than 110 million times.

All Yuemiao services are built as microservices on Kubernetes. They have been running smoothly on Kubernetes for more than 4 years, and the clusters have been upgraded promptly as new Kubernetes versions were released. The Yuemiao cluster has more than 60 nodes. There are currently more than 1,000 service Pods, and tens of thousands of short-lived CronJob Pods are created and destroyed every day, which places high demands on the efficiency of platform operations and releases.

  1. Slow image pulls

Kubernetes pulls images very slowly. With OCI images, we observed image pull times of up to 30s.

  2. Containers are slow to start

In production we observed that a Pod takes 30s or more to go from creation to Ready, and even longer when the node has no cached copy of the image.

  3. Update iteration bottleneck

During each iteration, multiple services are updated in batches. Iteration cycles are short and frequent, and the image registry comes under heavy pressure when many services are updated at once.

In response to these problems, and after various investigations and tests, the company decided to use the open source project Nydus to optimize the current services.

PART.6
Nydus deployment practice

Nydus image acceleration works directly with OCI images, and Containerd supports a Nydus plug-in that recognizes Nydus images. In a typical microservice setup with CI/CD, a Nydus image conversion service has to be deployed where the Docker images are built. After conversion, the Nydus image is generated directly in the Harbor registry. We use Jenkins for CI/CD, so here I deploy the service directly on the physical machine that runs Jenkins.

  1. Download related components

Download link: https://github.com/dragonflyoss/image-service/releases

cd /nydus-static
sudo install -D -m 755 nydusd nydus-image nydusify nydusctl nydus-overlayfs /usr/bin

  2. Convert an OCI image to a Nydus image

nydusify convert --source dockerharboar/nginx:1.2 --target dockerharboar/nginx:1.2-nydus

Note:

--source is the image in the source Harbor repository; this image must already exist in the private repository.
--target is the Nydus image that the source image is converted into.

After this command runs, the registry holds two images at the same directory level: the source OCI image and the Nydus image.
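In a CI job the conversion command is usually assembled from the image reference that was just built. The following Python sketch shows one way to do that; the helper names and the `-nydus` tag suffix convention are assumptions taken from the example above, and only the `--source` and `--target` flags shown in this article are used. It assumes the source reference ends with a tag rather than a digest.

```python
import shlex
from typing import List


def nydus_target(source: str, suffix: str = "-nydus") -> str:
    """Derive the target reference by appending a suffix to the source tag."""
    return source + suffix


def build_convert_command(source: str) -> List[str]:
    """Build the nydusify argv for converting `source` into a Nydus image."""
    return ["nydusify", "convert",
            "--source", source,
            "--target", nydus_target(source)]


cmd = build_convert_command("dockerharboar/nginx:1.2")
print(shlex.join(cmd))
# A CI step could then execute it with: subprocess.run(cmd, check=True)
```

Passing the command as an argument list rather than a single shell string avoids quoting problems if an image reference ever contains unusual characters.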

PART.7
Integrating Nydus with the K8s cluster

The K8s cluster uses Containerd as its runtime, and Containerd supports the Nydus Snapshotter plug-in for recognizing Nydus images. With Nydus enabled, native OCI images are still supported, but they do not benefit from on-demand loading.

  1. Deploy Nydus on K8s cluster nodes

Official description: https://github.com/dragonflyoss/image-service/blob/master/docs/containerd-env-setup.md

Note: To use Nydus, Nydus Snapshotter must be deployed on every K8s worker node; it is not needed on the K8s master nodes.

Download the installation package:

https://github.com/dragonflyoss/image-service/releases
https://github.com/containerd/nydus-snapshotter/releases

tar -xf nydus-snapshotter-v0.5.1-x86_64.tgz
tar -xf nydus-static-v2.1.4-linux-amd64.tgz

Install related software

sudo install -D -m 755  nydusd nydus-image nydusify nydusctl nydus-overlayfs /usr/bin
sudo install -D -m 755 containerd-nydus-grpc /usr/bin

Create the necessary directories

mkdir -p /etc/nydus  && mkdir -p /data/nydus/cache  && mkdir -p $HOME/.docker/

Create nydus configuration file

sudo tee /etc/nydus/nydusd-config.fusedev.json > /dev/null << EOF
{
  "device": {
    "backend": {
      "type": "registry",
      "config": {
        "scheme": "",
        "skip_verify": true,
        "timeout": 5,
        "connect_timeout": 5,
        "retry_limit": 4
      }
    },
    "cache": {
      "type": "blobcache",
      "config": {
        "work_dir": "/data/nydus/cache"
      }
    }
  },
  "mode": "direct",
  "digest_validate": false,
  "iostats_files": false,
  "enable_xattr": true,
  "fs_prefetch": {
    "enable": true,
    "threads_count": 4
  }
}
EOF


Add Harbor authentication

sudo tee $HOME/.docker/config.json << EOF
{
  "auths": {
    "docker-harboarxxx": {
      "auth": "xxxxxx"
    }
  }
}
EOF
chmod 600 $HOME/.docker/config.json

docker-harboarxxx is the address of the private registry.
auth holds the base64 encoding of user:pass.
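The auth value can be generated with a few lines of Python. This is a minimal sketch: `docker_auth_entry` is a hypothetical helper, and the registry name and the `user`/`pass` credentials are the placeholders used above, not real values.

```python
import base64
import json


def docker_auth_entry(registry: str, user: str, password: str) -> dict:
    """Build the config.json structure with auth = base64("user:password")."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"auths": {registry: {"auth": token}}}


config = docker_auth_entry("docker-harboarxxx", "user", "pass")
print(json.dumps(config, indent=2))
```

Because base64 is an encoding and not encryption, the resulting file should stay readable only by its owner, which is what the chmod 600 above is for.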

  2. Start the Nydus service

cd /data/nydus
nohup /usr/bin/containerd-nydus-grpc --config-path /etc/nydus/nydusd-config.fusedev.json --log-to-stdout &

  3. Verify that Containerd supports Nydus

ctr -a /run/containerd/containerd.sock plugin ls | grep nydus

  4. Modify the Containerd configuration to support Nydus

Add the following to the Containerd configuration file:

[proxy_plugins]
  [proxy_plugins.nydus]
    type = "snapshot"
    address = "/run/containerd-nydus/containerd-nydus-grpc.sock"
[plugins."io.containerd.grpc.v1.cri".containerd]
   snapshotter = "nydus"
   disable_snapshot_annotations = false

  5. Restart Containerd

sudo systemctl restart containerd

PART.8
Final Data Test Results

Pod time from Create to Ready:

Native OCI image: 20s
Nydus image: 13s

Our business images are currently not large, about 200MB each, and Nydus already cuts Pod startup time from 20s to 13s, a reduction of about 35%. For very large images, such as those used in AI computing, the acceleration from Nydus should be even more noticeable.

PART.9
Summary and Future Expectations

Nydus is an excellent open source project from CNCF. Going forward, Yuemiao will continue to invest in the project and work closely with the community to make the Yuemiao platform more powerful and sustainable. Cloud-native technology is a revolution in infrastructure, especially in elasticity and serverless, and we believe Nydus will play an important role in the cloud-native ecosystem.

Related Links

[1] "Slacker: Fast Distribution with Lazy Docker Containers"
https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter

[2] Dragonfly
https://github.com/dragonflyoss/Dragonfly2

learn more…

Nydus Star ✨:
https://github.com/dragonflyoss/image-service
