Breaking Through Large Models | Alluxio Helps AI Large Model Training - Success Stories (1)

For more details, please refer to "Alluxio Helps AI Large Model Training Winning Guide"

[Case 1: Zhihu] Exploring multi-cloud caching at Zhihu: from UnionStore to Alluxio

Author: Hu Mengyu - Zhihu Big Data Infrastructure Development Engineer (content reprinted from InfoQ)

1. Background

With the rapid development of cloud-native technology, the cloud services offered by the major public cloud vendors have become increasingly standardized, reliable, and easy to use. Backed by cloud-native technology, users can not only deploy their business on different clouds at low cost, but also take advantage of each vendor's strengths in specific technical areas, which is why multi-cloud architectures have become so popular.
Zhihu currently adopts a multi-cloud architecture, mainly based on the following considerations:


  • Multi-active services:  deploy the same service in different data centers so that if one data center cannot provide service due to force majeure, the business is not taken down all at once;
  • Capacity expansion:  generally, once a company's server fleet reaches around 10,000 machines, a single data center can hardly support further business growth;
  • Cost reduction and efficiency improvement:  different cloud vendors price and operate the same service differently; ideally, we want to pay the lowest possible price as long as the cloud service meets our needs.
    Zhihu currently has multiple data centers; the main ones are:
  • Online data center:  mainly hosts the user-facing services of the Zhihu main site (such as comments and answers), which are latency-sensitive;
  • Offline data center:  mainly hosts offline storage and computing services, which are not latency-sensitive but have high throughput requirements.


The two data centers are connected by a dedicated line, and many important services rely on it for cross-data-center calls, so keeping the dedicated line stable is critical. Dedicated line traffic is one of the key indicators of its stability: if the traffic reaches the line's rated bandwidth, cross-data-center calls between services will time out or fail in large numbers.
Generally speaking, service throughput is not particularly high and stays well below the dedicated line's bandwidth limit, usually less than half of it. However, our algorithm scenario is special: model training runs in the offline data center, relying on massive datasets on HDFS plus Spark clusters and the machine learning platform for large-scale distributed training. The trained models are stored on HDFS, and a single model can reach tens or even hundreds of GB. When a model goes online, the algorithm service reads the model files from offline HDFS across the dedicated line from the online data center, and an algorithm service typically runs dozens or even hundreds of containers. When these containers read HDFS concurrently, they can easily saturate the dedicated line bandwidth and affect other cross-line services.

2. Multiple HDFS clusters

In the early days, we solved the cross-data-center model reading problem in a simple and crude way: we deployed a separate HDFS cluster in the online data center for the algorithm business. The model workflow is as follows:
1) Producing the model: the model is produced by training on the Spark cluster or the machine learning platform and stored in the offline HDFS cluster;
2) Copying the model: after the model is produced, an offline scheduled task periodically copies the models that need to go online to the online HDFS cluster;
3) Reading the model: the algorithm containers read the model from the online HDFS cluster.

Although the multi-HDFS cluster architecture solves the dedicated line traffic problem, several issues remain:

  1. Multiple HDFS clusters are hard to maintain and increase the burden on the operations team;
  2. The copy scripts have to be implemented by the business itself and must be updated every time a new model is launched, which is inconvenient to maintain;
  3. Files on the online HDFS cluster must be deleted manually and regularly by the business to control costs, which is a risky operation;
  4. The file views of online HDFS and offline HDFS are inconsistent; users need to know which HDFS they are using and keep track of multiple addresses, a significant mental burden;
  5. Under ultra-high-concurrency reads, for example when the algorithm launches hundreds of containers at once to read one model file, the DataNode load becomes too high. Adding replicas helps, but it also raises storage costs.

Based on the above pain points, we built our own multi-cloud caching service, UnionStore.

3. Self-developed component UnionStore

3.1 Introduction
UnionStore, as its name suggests, means joint storage. It exposes a standard S3 protocol for accessing data on HDFS and uses object storage as a cross-data-center cache. UnionStore currently has two usage scenarios at Zhihu:
Model online scenario:  deployed in the online data center as a cross-data-center cache. When a user requests to read a file from UnionStore, it first checks whether the file has already been uploaded to object storage:

  • If the file already exists in object storage, UnionStore reads it directly from object storage and returns it to the user;
  • If the file does not exist in object storage, UnionStore first uploads the file from offline HDFS to the object storage in the online data center, then reads it back from object storage and returns it to the user; the user's request is blocked while the file is being cached. This effectively uses object storage as a layer of cross-data-center cache (see the sketch after this list).
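To make the read path above concrete, here is a minimal sketch of such a read-through cache, written with boto3 against object storage and a generic HDFS client; the endpoint, bucket name, and client calls are illustrative assumptions, not UnionStore's actual code.

import boto3

s3 = boto3.client("s3", endpoint_url="https://oss.example.internal")  # hypothetical endpoint
BUCKET = "unionstore-cache"  # hypothetical cache bucket

def read_file(hdfs_client, hdfs_path):
    """Return file bytes, caching them in object storage on first access."""
    key = hdfs_path.lstrip("/")
    try:
        # Cache hit: serve directly from object storage.
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except s3.exceptions.NoSuchKey:
        # Cache miss: pull the file from offline HDFS, upload it to the online
        # object storage, then serve it; the caller is blocked during caching.
        with hdfs_client.read(hdfs_path) as reader:
            data = reader.read()
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)
        return data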


Model training scenario:  deployed in the offline data center as an HDFS proxy, to give the business an S3-protocol way of accessing HDFS. Through s3fs-fuse, the business can mount HDFS to a local directory and read training data for model training.
The model training scenario is an extension we added after UnionStore went live. We had tried many ways to mount HDFS via POSIX before, but the results were not ideal, mainly because of retry behavior. UnionStore happens to provide the S3 protocol, and s3fs-fuse handles retries well, so we finally settled on UnionStore + s3fs-fuse to mount HDFS as a local directory.
Its workflow is as follows:

Compared with the previous multi-HDFS-cluster solution, UnionStore has the following advantages:
1) UnionStore provides the S3 protocol, which is better supported than the HDFS protocol in every programming language and has a richer tooling ecosystem;
2) UnionStore caches files automatically, so users no longer copy models by hand, eliminating the development and maintenance of copy scripts;
3) It provides a unified file view; because metadata is requested from HDFS in real time, the file view is strongly consistent with HDFS;
4) Only one offline HDFS cluster is needed; file storage is provided by object storage, saving a lot of server cost;
5) File expiration can rely on the lifecycle capabilities of object storage itself and does not need to be implemented by us;
6) UnionStore is provided as a cloud-native service deployed on k8s; every container is a stateless node that can be scaled up and down easily. In high-concurrency scenarios, since storage is offloaded to object storage, as long as the object storage performance is sufficient we do not run into problems like overloaded DataNodes.
3.2 Implementation Details
The complete architecture diagram of UnionStore is as follows:


When used as a cache backed by object storage, UnionStore has three core components:
UnionStore Server:  stateless nodes; each node can serve requests independently, and multiple instances are usually deployed to share the traffic;
Object Storage:  used to cache data from HDFS; we generally use the object storage offered by the cloud vendor of the corresponding data center, so traffic cost is almost negligible;
Task Manager:  the task manager, used to store cache tasks; it can be implemented with MySQL and Redis.
Based on these three components, we implemented a series of useful features in UnionStore.
File verification:  after a file has been cached to object storage, if the file on HDFS is modified, UnionStore needs to detect the change to make sure users never read a stale file. When uploading an HDFS file to object storage, we store its size, last modification time, checksum, and other metadata in the UserMetadata of the object. When a user reads the file, UnionStore checks this metadata and only returns the object-storage copy if the check passes; if it fails, the file is re-cached and the copy on object storage is updated.
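As an illustration of the check described above, the sketch below compares an HDFS file's attributes with the UserMetadata stored on the cached object; the field names and the shape of hdfs_status are assumptions for the example, not UnionStore's actual schema.

def cache_is_valid(s3, bucket, key, hdfs_status):
    """Serve the cached object only if its stored metadata still matches HDFS."""
    head = s3.head_object(Bucket=bucket, Key=key)
    meta = head["Metadata"]  # user metadata written when the file was uploaded
    return (
        meta.get("hdfs-length") == str(hdfs_status["length"])
        and meta.get("hdfs-mtime") == str(hdfs_status["modification_time"])
        and meta.get("hdfs-checksum") == hdfs_status["checksum"]
    )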
Read and write acceleration:  single-threaded read and write speed against object storage is about 30-60MB/sec, far below HDFS throughput, and without special handling it cannot meet the business's needs. For reads, we use the object storage RangeRead interface to fetch data with multiple threads and return it to the user, reaching the same read speed as HDFS. For writes, we use the object storage MultiPartUpload interface to upload HDFS files with multiple threads, which likewise reaches HDFS-level write speed.
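The read-side idea is roughly the following: split the object into ranges and fetch them in parallel with the S3 Range header. This is a simplified sketch with an illustrative chunk size, not the actual UnionStore implementation.

from concurrent.futures import ThreadPoolExecutor

CHUNK = 32 * 1024 * 1024  # 32 MB per range request (illustrative)

def parallel_read(s3, bucket, key, size, workers=8):
    def fetch(offset):
        end = min(offset + CHUNK, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-{end}")
        return resp["Body"].read()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(fetch, range(0, size, CHUNK))  # ranges fetched concurrently
    return b"".join(parts)  # reassembled in offset order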
Caching each file only once:  because UnionStore Server is designed as a stateless node, the servers are unaware of each other. If multiple requests for the same uncached file land on different server nodes at the same time, the file may be cached multiple times by different servers, putting extra pressure on the dedicated line. We introduced the Task Manager component to solve this:


  1. When a server node receives a read request for an uncached file, it asynchronously blocks the user's request, generates a cache task, and submits it to the Task Manager's waiting queue;
  2. All server nodes continuously compete for tasks in the waiting queue, and only one node wins each task. The winner moves the cache task into the running queue and starts executing it, reporting heartbeats to the task queue while it runs;
  3. Each server node periodically checks its blocked user requests against the corresponding tasks in the Task Manager: if a task has completed successfully, it wakes up the user request and returns the cached file. Each server also periodically scans the running tasks in the Task Manager; if a task has not updated its heartbeat for a long time, it is moved out of the running queue, put back into the waiting queue, and executed again.


All state-change operations happen on the server nodes; the Task Manager is only responsible for storing task information and providing atomic operations on the queues.
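A toy sketch of this competition, assuming the Task Manager's queues live in Redis (the article only says MySQL and Redis can be used); queue names, fields, and the heartbeat handling are simplified illustrations.

import time
import redis

r = redis.Redis()
WAITING, RUNNING = "cache:waiting", "cache:running"

def submit_task(path):
    # Called by the server that received a read for an uncached file;
    # the user request stays parked until the task is done.
    r.rpush(WAITING, path)

def compete_and_run(cache_fn):
    # Every server keeps popping the waiting queue; LPOP is atomic,
    # so only one server wins each task.
    path = r.lpop(WAITING)
    if path is None:
        return
    r.hset(RUNNING, path, time.time())  # record a heartbeat timestamp
    try:
        cache_fn(path)                       # copy the file from HDFS into object storage
        r.set(b"cache:done:" + path, 1)      # lets parked user requests wake up
    finally:
        r.hdel(RUNNING, path)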


3.3 Limitations
The UnionStore project has been running at Zhihu for two years. There were no problems in the early days, but as the algorithm business kept growing, the following issues appeared:
1) There is no metadata cache; metadata depends strongly on HDFS. When HDFS jitters, model files that need frequent updates are affected and cannot be updated in time. Online services should not depend strongly on offline HDFS;
2) Read and write acceleration uses multi-threading, which consumes a lot of CPU. When the business was small, UnionStore needed only a few hundred cores to support data reads for the company's entire algorithm team, but as volume grew, the requirement rose to thousands of cores;
3) Object storage throughput has an upper limit; thousands of concurrent reads of a single file still hit performance bottlenecks;
4) UnionStore only provides caching, not high-performance caching. Large models on the business side often take more than ten minutes to read, which slows down model updates and holds back business development;
5) Files cannot be returned to the reader while they are still being cached, so the first read of a file takes far too long.
There is one more key point. To ensure multi-active deployment, the machine learning platform also adopts a multi-cloud architecture and supports deployment across multiple data centers. When it reads training data, it uses UnionStore purely as a proxy to HDFS without the caching flow, because most training data consists of huge numbers of small files; pushing small files through the cache would leave cache tasks queued in the task queue for a long time, making read latency hard to guarantee, so we proxy HDFS directly. With this usage pattern, dedicated line bandwidth will again become a bottleneck as the training data grows.

These pain points left us with two choices: keep iterating on UnionStore to give it high-performance caching, such as local SSD and memory caches; or find a suitable open-source solution that could fully replace UnionStore's usage scenarios. Given how scarce engineering resources are, we chose the second option.

4. Use Alluxio to replace UnionStore

1. Research
We surveyed the mainstream file systems in the industry and found Alluxio to be the best fit for our scenario, for the following reasons:
1) Transparent caching: unlike other file systems, Alluxio can act purely as a cache that orchestrates data; the business side does not need to write model files to yet another file system and can simply keep writing to HDFS as before;
2) Metadata and data caching: Alluxio supports configurable caching of both metadata and data, so reads of cached files are completely unaffected by HDFS. Our UnionStore QPS is currently around 20K-30K; caching metadata greatly reduces the pressure on the NameNode, which in turn benefits the offline side;
3) Rich UFS support: besides HDFS, Alluxio supports many kinds of UFS, such as object storage, which also gives strong support to our data lake scenarios;
4) Ad hoc query acceleration: Zhihu's ad hoc engines are Spark and Presto, and Alluxio supports both well;
5) Rich access interfaces: the S3 Proxy component provided by Alluxio is fully compatible with the S3 protocol, so the cost of migrating our model online scenario from UnionStore to Alluxio is almost negligible; in addition, Alluxio fuse has local metadata and data caches and performs better than the s3fs-fuse the business used before, which fits our model training scenario nicely;
6) Active community: the Alluxio community is very active. During our research, questions in the community chat group were almost always answered promptly; it was rare for a question to go unanswered for more than half a day.
Our research into Alluxio was a pleasant surprise: it not only met our needs but brought many extra capabilities. We tested Alluxio internally with a single-threaded read of a 100G file, averaging over multiple runs. The results are as follows:

HDFS shows the largest fluctuation, ranging from 200MB/sec to 500MB/sec, because OS-level caching is involved, while UnionStore and Alluxio are very stable when the cache is hit.
2. Cluster planning
In our plan, one Alluxio cluster is deployed in each data center, using high-performance NVMe disks to cache data from HDFS and object storage and providing massive data acceleration services to the business.
Based on business usage scenarios, we divide the Alluxio clusters into two types:
Model online acceleration clusters: the Alluxio cluster caches the models themselves and exposes a read-only service through S3 Proxy to accelerate model launches;
Model training acceleration clusters: the Alluxio cluster caches model training data, and Alluxio fuse caches HDFS data and metadata locally to speed up training; the trained models are written straight through Alluxio fuse to HDFS for persistent storage.

3. Model online scenario adaptation
3.1 Scenario characteristics
Our model online scenario has the following characteristics:
1) The user uses the S3 protocol to read the model file;
2) After the user writes model data to HDFS, it needs to be read almost immediately; the interval between data being produced and read is on the order of seconds, so warming up in advance is basically impossible and cache penetration is a real issue;
3) A single model file is read by hundreds or even thousands of containers at the same time, so traffic amplification is significant; for the largest single model, peak read traffic can reach 1Tb/sec;
4) A model file is only used for a short period and can be considered expired once the high-concurrency read is done;
5) Tens of thousands of containers are spread across thousands of k8s nodes, and each individual container has limited resources.
For the model online scenario, we chose S3 Proxy to provide the caching service. We did not use the Alluxio Client or Alluxio fuse, mainly for the following reasons:

  1. Users originally used the S3 protocol to read files, and it costs almost nothing to switch to S3 Proxy;
  2. The business side uses Python, Golang, and Java; the Alluxio Client is implemented in Java, and using it from other languages is troublesome;
  3. Given the resource limits of a single container, starting Alluxio fuse inside the container via CSI is not a good fit, because fuse performance depends on disk and memory caches.

3.2 Cluster deployment
First, the deployment mode. In this scenario our Alluxio cluster follows a "large cluster, light client" approach: we provide enough Workers and S3 Proxy instances to handle the business's highly concurrent S3-protocol requests. The architecture diagram is as follows:

Our cluster runs version 2.9.2. In this version S3 Proxy has two implementations, v1 and v2, switchable via alluxio.proxy.s3.v2.version.enabled. The v2 implementation has one very important feature: it separates IO operations from metadata operations and hands them to different thread pools. The benefit is that metadata operations execute quickly and are not stuck behind IO threads, since metadata request QPS is generally much higher than file read/write QPS. This feature is very useful for us: our UnionStore QPS is around 25K, and 90% of the operations are metadata accesses.
We deployed the entire Alluxio cluster on bare metal. Alluxio also offers a k8s deployment mode, but after weighing the options we stuck with bare metal for the following reasons:

1) In our tests, an Alluxio Worker running at full throttle can easily saturate dual 10GbE NICs, at which point the NICs are the bottleneck; if Workers were co-located with containers on the same k8s node, those containers would be squeezed by the Alluxio Worker and unable to get enough NIC bandwidth;

2) Alluxio Worker relies on high-performance disks as local cache, and it is easily affected by disk IO of other processes when it is mixed with other services, so it cannot achieve the best performance;

3) Because Alluxio Workers depend heavily on physical resources such as NICs and disks, these resources are not suitable for sharing with other services; forcing a k8s deployment would likely mean running one Alluxio Worker per k8s node as a DaemonSet anyway, which is unnecessary. Moreover, in our past experience, storage inside containers can hit all sorts of strange problems, and debugging them wastes time and delays going live.

In addition to co-locating the Master with the Job Master and the Worker with the Job Worker, as recommended by the community documentation, we also co-located S3 Proxy with the Workers. Although S3 Proxy looks like a server to users, it is still a client to the Alluxio cluster, and Alluxio has a very important client optimization: short-circuit reads when the client and the Worker are on the same node. With short-circuit reads enabled, the client no longer reads data via network RPC calls to the Worker but reads the local disk directly, which saves a lot of NIC bandwidth. When accessing Alluxio through S3 Proxy, the traffic breaks down roughly as follows:

  1. The file is not yet cached in Alluxio: the Worker reads the data from UFS; once any Worker has cached the file, this traffic disappears;
  2. The file is cached on a remote Worker: the local Worker reads the data from another Worker and caches it locally while S3 Proxy temporarily reads from the remote Worker; once the local Worker has the cache, this traffic disappears;
  3. Files are cached in the local Worker: the traffic read by the S3 Proxy from the local Worker, this part of the traffic will not exist after short-circuit reading is enabled;
  4. The traffic read by the business side from the S3 Proxy, this part of the traffic cannot be avoided.

The traffic in 1 and 2 is much smaller than that in 3 and 4, and short-circuit reads eliminate the traffic in 3, saving roughly 30%-50% of total traffic.

Second, the cluster size. In the model reading scenario, although total reads can reach several PB per day, model files expire quickly, so Worker capacity does not need to be large; what matters is that the total Worker NIC bandwidth can carry the read traffic. The number of Workers can be calculated as peak traffic / (2/3 * NIC bandwidth), where 1/3 of the NIC is reserved as headroom for Workers reading from UFS and Workers replicating data to each other.
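A quick back-of-the-envelope example of this formula, using illustrative numbers (the 1 Tb/sec peak mentioned earlier and the dual 10GbE NICs per Worker):

import math

peak_traffic_gbps = 1000      # ~1 Tb/sec peak model-read traffic
nic_bandwidth_gbps = 2 * 10   # dual 10GbE NICs per Worker

# 1/3 of the NIC is reserved for UFS reads and Worker-to-Worker replication.
workers = math.ceil(peak_traffic_gbps / (2 / 3 * nic_bandwidth_gbps))
print(workers)  # -> 75 Workers for this hypothetical load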
Finally, Master HA: we chose Raft. In our tests, with hundreds of millions of metadata entries and hundreds of GB of heap, a Master failover completes within about 10 seconds, which is extremely fast, and the business barely notices.
3.3 Go-live and tuning
Our go-live process was also a tuning process.
At first we switched only the read requests for one small model from UnionStore to Alluxio S3 Proxy; the effect is as follows:

Each line segment in it represents a read request for a model, and the length of the line segment represents the time it takes to read data.
Stage one is our in-house UnionStore service, and stage two is the state right after switching to S3 Proxy. It is clear that after the switch the average model read speed increased, but there are sharp spikes, i.e. occasional requests that read very slowly. The root cause is that model reads are always cold reads: the model data has never been warmed up. Without warming, reading through Alluxio is at best as fast as reading HDFS directly and cannot exploit the cache at all. Worse, our tests showed that when Alluxio receives concurrent requests for the same un-warmed file, performance degrades badly and can even fall below a direct HDFS read. So we had to find a way to warm up the files.
There are generally two ways to warm up files:
1) After the user finishes writing a file, manually call the Alluxio load command to cache the data in advance, ensuring the file is already cached when it is read;
2) Follow the HDFS audit log or subscribe to file changes with HDFS inotify, and load any changed file in the algorithm directories into the Alluxio cache as soon as the change is detected.

The problem with method 1 is that it requires deep involvement from users, adding mental burden and development cost; moreover, users' load calls are uncontrolled, and loading an extremely large directory can invalidate all existing caches.
Method 2 also requires users to tell us which paths to watch. If the path is a file, listening for the close request is straightforward; if it is a directory, temporary files, renames, and so on make it very complicated. Every time a user adds a new model we have to add new paths to the watch list, which adds communication overhead. On top of that, in our scenario the gap between data being produced and read is on the order of seconds; the file-change monitoring chain is long and can introduce delays that would defeat the warm-up scheme.

Based on the above shortcomings, we designed a set of caching strategies:
The root cause of slow cold reads is that when an uncached file is read through Alluxio, each block is only cached as it is read; blocks are not cached concurrently. So we added logic to S3 Proxy: when a file is read, it is split into blocks and cache-block tasks are generated and distributed evenly across the Workers for asynchronous caching. The benefit is that once the client has read the first few uncached blocks, all later blocks are already cached and reads become very fast. Because blocks are cached ahead of the reader, cache penetration is also alleviated, and HDFS traffic drops by more than 2x.

This caching strategy needs to pay attention to the following points:

1) The cache block needs to be asynchronous, and all exceptions must be handled without affecting normal read requests;

2) When caching a block, it is best to bind the block id to a Worker id in some deterministic way (such as hashing), so that concurrent requests for the same file send the cache request for a given block to the same Worker, preventing different Workers from each reading the same block from UFS and amplifying UFS traffic (see the sketch after this list);

3) S3 Proxy needs to count the cache-block tasks it has submitted, to avoid flooding the Workers and disturbing their normal caching logic; it is best to stay under half of alluxio.worker.network.async.cache.manager.threads.max, the setting that caps the number of threads a Worker uses to handle asynchronous cache requests and defaults to twice the number of CPUs;

4) S3 Proxy needs to deduplicate the blocks it has already submitted for caching, so that high-concurrency reads of the same file do not submit the same block to the Worker multiple times and fill up the Worker's asynchronous cache queue. The queue size is controlled by alluxio.worker.network.async.cache.manager.queue.max and defaults to 512; we recommend using a bitmap keyed by block id for the deduplication check;

5) When the Worker's asynchronous cache queue is not full, the asynchronous cache uses only 4 threads; the code needs to be modified to raise the minimum number of asynchronous cache threads on the Worker to avoid low efficiency. See #17179.
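Points 2) and 4) above can be pictured with the small sketch below: hash the block id to pick a fixed Worker, and keep a per-file bitmap so the same block is never submitted twice. This is an illustration of the idea, not Alluxio's or our patched S3 Proxy's code.

def pick_worker(block_id, workers):
    # The same block id always maps to the same Worker, so concurrent readers
    # of one file do not ask different Workers to pull the same block from UFS.
    return workers[hash(block_id) % len(workers)]

class SubmittedBlocks:
    """Bitmap keyed by block index within a file, used to deduplicate submissions."""

    def __init__(self, num_blocks):
        self.bits = bytearray((num_blocks + 7) // 8)

    def try_mark(self, idx):
        byte, bit = divmod(idx, 8)
        if self.bits[byte] & (1 << bit):
            return False          # already submitted, skip
        self.bits[byte] |= 1 << bit
        return True               # first submission wins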

After launching this caching strategy we entered stage three: all the spikes disappeared and overall speed improved slightly. Because the models cached here are small (about 1GB), the improvement is modest; in our tests this strategy speeds up reads of large files (10GB and above) by 3-5x, and the larger the file, the bigger the gain.
After solving the caching problem, we continued switching more model reads to S3 Proxy; the effect is as follows:

This time we switched the read requests of three models to S3 Proxy. The orange model is the one we had already switched earlier; the new models are up to 10GB, with peak read traffic of 500Gb/sec.
This run also divides into three stages. In stage one, the orange model is already on S3 Proxy while the other models still use UnionStore; because the orange model is small and accelerated by Alluxio, it reads dozens of times faster than the others.
Stage two is the state after switching the other models to S3 Proxy. Their read speed clearly improves, but the orange model's reads are slowed down by the other models, which was a very strange phenomenon. We eventually traced it to the metadata cache not being enabled: without it, Alluxio forwards every client request to HDFS, and S3 Proxy also frequently checks some system directories, putting a very heavy metadata-sync burden on the Master; performance can drop by a factor of thousands.
We had originally planned not to enable metadata caching in this scenario, worried that the business might modify files that were already cached and end up reading the wrong files, affecting model launches. In practice, however, the metadata cache must be enabled to keep the Master performant.
After communicating with the business side, we formulated the metadata consistency specification:
1) The metadata cache is set to 1min;

2) Newly added files should be written into the new directory as much as possible, and managed by version number, do not modify or overwrite the old files;

3) For legacy jobs that must overwrite existing files, and for jobs with strict metadata consistency requirements, we provide a special command on S3 Proxy to synchronize metadata; after updating the data, the business side calls this command to refresh the metadata.

After enabling the metadata cache we reached stage three in the figure, where the read speed of all models improved dramatically, more than 10x compared with the original speed before S3 Proxy. Note that the 10x+ figure is what can be achieved when Alluxio machines and NIC bandwidth are plentiful; in actual use, we matched UnionStore's performance with half of UnionStore's resources.
3.4 S3 Proxy speed limit
The original motivation for bringing Alluxio into the model reading scenario was to speed up the business's model reads, but reading through Alluxio turned out to be so fast that we had to limit it instead, which is rather dramatic. Without a limit we face a serious problem: when algorithm containers read a large model, they not only saturate the NIC of the physical machine hosting S3 Proxy, but also keep the NIC of the k8s host running the container saturated for long stretches, affecting other containers on that node.

At present, there are mainly the following schemes for the implementation of speed limit:
Worker-side rate limiting:  the advantage is that it applies to all clients; the disadvantage is that it does not apply to short-circuit reads by clients on the same node. In our scenario S3 Proxy uses short-circuit reads, so this does not meet our needs.
Client speed limit:  The advantage is that it can take effect on both Alluxio fuse and S3 Proxy at the same time. The disadvantage is that the client can modify the configuration to bypass the limit. At the same time, there may be inconsistencies between the server version and the client version, resulting in the failure of the rate limit.
S3 Proxy rate limit:  only valid for S3 Proxy, not valid for other clients and Workers.
Since our current goal is to replace UnionStore and the only entry point for the business to access Alluxio is S3 Proxy, both client-side and S3 Proxy-side rate limiting would meet our needs; considering implementation difficulty, we finally chose to rate-limit at the S3 Proxy level.
We support two rate-limiting strategies: a global limit on the S3 Proxy process to protect the Worker NICs from saturation, and a per-connection limit to protect the k8s nodes hosting business containers. We have contributed the rate-limiting strategy to the community; if you are interested, see #16866.

4. Model training scenario adaptation
4.1 Scenario characteristics
Our model training scenarios have the following characteristics:
1) Since most open-source training frameworks work best with local directories, we should ideally provide POSIX access to the business;

2) During model training the main bottleneck is the GPU; physical resources such as memory, disk, NIC, and CPU are relatively plentiful;

3) GPU machines run nothing but training tasks, so there is no co-location with other services;

4) Data is managed as snapshots and there is no metadata consistency requirement, but we need a way to detect new snapshots produced on HDFS.

For model training scenarios, there is no doubt that we should choose Alluxio fuse to provide cache services:

1. Alluxio fuse provides POSIX access method;

2. Alluxio fuse can use memory and disk for metadata cache and data cache, and can maximize the use of idle physical resources on GPU machines.

4.2 Performance test
Before going online, we conducted a pressure test on the fuse using fio.
Alluxio fuse configuration:


The test results are as follows:

All of the results above are for the case where the data is already cached on the fuse's local disk. Reads of 1G and 10G files are about twice as fast as 100G files: the container has 40G of memory, which provides enough page cache for the 1G and 10G files but not for the 100G file, so performance drops there while still remaining good. Overall the behavior matches expectations.

4.3 Cluster Deployment
Alluxio fuse is deployed as a DaemonSet and mapped via host path rather than via CSI, mainly for the following reasons:
1) The key to Alluxio fuse's performance is its data cache and metadata cache; the data cache consumes a lot of disk and the metadata cache consumes a lot of memory. Deployed via CSI, each container could dedicate only a small amount of disk and memory to the Alluxio fuse process;

2) When the model is being trained, the training data read is highly repetitive. If each container starts a fuse process, it may cause the same machine to cache multiple copies of the same file, wasting disk;

3) The GPU machine only runs training tasks, so the fuse process can run long without considering the issue of resource release;

4) Host path mapping makes it easy to recover the mount point.

A word on mount point recovery. Normally, if the Alluxio fuse container dies for whatever reason, then even after the fuse process restarts and remounts the directory, the mount point inside the business container stays broken and the business cannot read data. With mount point recovery, once the Alluxio fuse container restarts, the mount point inside the business container recovers automatically, and if the business has its own retry logic it is unaffected. Recovery has two parts: the mount point itself, meaning the fuse process must mount to the same mount point after every restart; and the client cache data, meaning the cache data directory must be the same after every restart so that files already cached locally are not pulled again from the Alluxio cluster. Supporting mount point recovery under CSI requires some extra development, whereas with host path mapping it works as long as HostToContainer is configured in the business container, with no extra development needed.
The deployment architecture diagram of our fuse process is as follows:

In this scenario our Alluxio cluster uses a "small cluster, heavy client" approach: a small Alluxio cluster handles only data distribution, while performance and caching are handled by Alluxio fuse itself. The cluster only needs one well-provisioned Master and a small number of Workers. The overall deployment architecture is as follows:

According to this deployment mode, three Raft HA Masters and a small number of Workers can support large-scale deployment of fuse processes.

4.4 Alluxio fuse tuning
The first is metadata caching. Alluxio fuse can enable metadata caching. This is easy to confuse with Master’s UFS metadata caching. Let’s briefly explain:

1) The Alluxio Master caches UFS metadata; whether to refresh it is decided by alluxio.user.file.metadata.sync.interval configured on the client. If this is set to 10 minutes, then when a client queries the Master, the Master returns its cached metadata directly if it has synced within the last 10 minutes; otherwise it fetches the latest metadata from UFS, returns it, and updates the Master's cache.

2) When users access Alluxio through Alluxio fuse, the kernel metadata cache is consulted first (controlled by the fuse mount options attr_timeout and entry_timeout), then the user-space metadata cache (controlled by alluxio.user.metadata.cache.expiration.time), and finally the Master cache (controlled by alluxio.user.file.metadata.sync.interval); as long as any one of these layers has not expired, the latest HDFS metadata cannot be obtained.

Therefore, after enabling the fuse metadata cache, we recommend setting alluxio.user.file.metadata.sync.interval=0, so that whenever the fuse's local metadata cache expires it fetches the latest UFS metadata.
The fuse metadata cache can also be refreshed through special commands (which require alluxio.fuse.special.command.enabled=true):
Assuming our mount directory is /mnt/alluxio, the following command forces a refresh of all metadata caches:

 ls -l /mnt/alluxio/.alluxiocli.metadatacache.dropAll

Use the following command to refresh the metadata cache of the specified directory (here /user/test is taken as an example):

ls -l /mnt/alluxio/user/test/.alluxiocli.metadatacache.drop

In code (using python as an example), metadata can be cleaned up like this:

import os
print(os.path.getsize("/mnt/alluxio/user/test/.alluxiocli.metadatacache.drop"))

However, it should be noted that the kernel metadata cache cannot be cleared, so it is recommended to set a smaller value for the kernel metadata cache, such as one minute, and a larger value for the user space metadata cache, such as one hour. When the metadata has consistency requirements, manually refresh the user space metadata cache and wait for the kernel metadata cache to become invalid.
When the metadata cache and data cache are enabled at the same time, there will be some problems in the use of the command to clear the metadata cache. We have fixed it, reference: #17029.
Next is the data cache. Because our Alluxio fuse is deployed as a DaemonSet, we can use essentially the whole disk of the physical machine for data caching, which greatly reduces traffic to the Alluxio Workers.
The last is resource allocation. Because each machine only has one fuse process, more CPU and memory can be properly allocated to the fuse process, and the CPU can be properly oversold to handle the sudden surge of requests.
For memory, start with the heap: if the user-space metadata cache is enabled, set Xmx to roughly the number of cached paths * 2KB * 2. DirectMemory can also be set larger; 8G is generally enough. If the kernel data cache is enabled, you also need to leave room in the container for the page cache, because kubernetes counts page cache toward the container's memory usage. As for whether page cache can cause a container OOM, we searched many documents without reaching a definitive conclusion, but a stress test with the following configuration showed no container OOM and very stable fuse performance:
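For a feel of the heap-sizing rule above, a rough calculation with an illustrative path count:

cached_paths = 10_000_000                     # user-space metadata entries (illustrative)
xmx_bytes = cached_paths * 2 * 1024 * 2       # ~2KB per path, doubled for headroom
print(f"Xmx ~= {xmx_bytes / 2**30:.0f} GiB")  # -> about 38 GiB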

4.5 Launch results
After switching our algorithm model training to Alluxio fuse, training reaches about 90% of local-disk performance, an improvement of roughly 250% over the original UnionStore + s3fs-fuse mount.

5. Application of S3 Proxy in big data scenarios

Looking back at the model online scenario, we not only gave the algorithm business faster model reads, but also ended up with a component that speaks the object storage protocol yet downloads far faster than ordinary object storage: Alluxio S3 Proxy. So now we can go looking for nails with this hammer in hand.
Here is an introduction to the release and online process of our big data components. The flow chart is roughly as follows:

The following is a brief description in words:

1) After the developer modifies the code, merge the code into the master branch of the corresponding component. At this time, Gitlab will call the Web Hook of CI, and CI will run the packaging and compilation logic of the corresponding component;

2) After the components are packaged into a binary package, CI will register the metadata of the binary package with Kosmos and upload the binary package to Kosmos. After Kosmos receives the binary package, it will upload it to the object storage;

3) A developer selects the component and version to launch on the big data operations platform, which then automatically runs the deployment logic on the production servers;

4) During the running of the deployment logic, Kosmos will be requested to download the binary package of the component, and Kosmos will directly return the read-only link of the object storage for the production environment server to download.

Kosmos here is our self-developed package management system; for the background of its creation, see "The Evolution of Zhihu's Flink Real-Time Computing Platform". Our big data operations platform also has its own write-up; if you are interested, see "The Practice of Ansible in Zhihu Big Data".
On the one hand, the biggest problem with this process is that downloading binary packages from object storage is too slow when rolling out to nodes at scale. For example, when we change all DataNode and NodeManager nodes, each machine has to download a binary package of hundreds of MB or even over a GB; at object storage download speeds of 20-30MB/sec, each machine spends roughly 30 seconds downloading, about 2/3 of the whole deployment step. At a scale of 10,000 DataNodes, rolling-restarting two nodes at a time (to ensure at least one of the three replicas stays available), the time spent downloading binary packages alone adds up to 40+ hours, hurting deployment efficiency.
On the other hand, using object storage across data centers incurs external network traffic, which is relatively expensive. Kosmos was therefore extended to multiple data centers and supports uploading binary packages to different object storages; requests to Kosmos must carry a data center parameter so that Kosmos returns a download link for the object storage in the same data center. If the user picks the wrong data center, external traffic is still incurred.
These problems could also be solved by reworking the big data operations platform, for example by decoupling download from deployment, downloading the binary packages to all nodes with high concurrency first and then rolling out the deployment, but that rework is time-consuming and laborious, and now that we have a far more efficient way to download files, Alluxio S3 Proxy, there is even less incentive to do it.
We mount the object storage of Kosmos on Alluxio. When Kosmos is requested to download, it returns the read-only link of Alluxio S3 Proxy, allowing users to read data from S3 Proxy. The modified flow chart is as follows:

After the change, almost all Kosmos download requests complete within 1-2 seconds, more than 90% faster than downloading from object storage. The figure below compares download speeds in our production environment with Kosmos backed by object storage versus Alluxio, where we have capped Alluxio S3 Proxy at 600MB/sec:

In addition, Alluxio is deployed in multiple data centers as well, supporting Kosmos's multi-data-center scheme: even if a user picks the wrong data center, no extra external network traffic is incurred; the request simply goes to the Alluxio cluster in the other data center and consumes some dedicated line bandwidth.

6. Permission related

When Alluxio connects to HDFS it inherits HDFS's file permission system, and the users on HDFS and Alluxio may not match, which easily causes permission problems. Permissions matter, so we devote a separate section to them.
Based on Alluxio 2.9.2 (with both HDFS and Alluxio using SIMPLE authentication), we worked out the mapping between users and permissions by reading the code and testing. The overview is as follows:

First, the user of the Alluxio Java Client: when the Alluxio Java Client talks to Alluxio, if alluxio.security.login.username is configured, the client accesses the Alluxio cluster as that user; otherwise it accesses Alluxio as the user that started the Alluxio Java Client process.
When the Alluxio Master/Worker talks to HDFS, if the environment variable HADOOP_USER_NAME is set at startup (configurable in alluxio-env.sh), the Master/Worker accesses HDFS as that user; otherwise it accesses HDFS as the user that started the Master/Worker process. Note that the Master and Workers should be configured with the same HDFS user as far as possible, otherwise permission problems are guaranteed.
When writing a file to HDFS, Alluxio first writes the file as the HDFS user configured for the Master/Worker, then calls HDFS chown to change the file's owner to the Alluxio Java Client user. For example: suppose Alluxio is started as user alluxio and the Alluxio Java Client user is test. When writing to HDFS, Alluxio first writes the file as alluxio and then chowns it to test. If alluxio is not an HDFS superuser, the chown fails (and, annoyingly, Alluxio does not surface this error to the client), so the file owner seen in Alluxio is test while the owner on HDFS is alluxio, leaving the two inconsistent.
Second, the user of S3 Proxy. S3 Proxy is a special Alluxio Java Client, but it is also a server. What matters here is how the AK/SK a user presents to S3 Proxy maps to an HDFS user. By default, S3 Proxy maps the user's AK to the user that accesses the Alluxio cluster; you can also implement the mapping yourself, for example mapping an AK to a specific user, and S3 Proxy has related plugins for this.
Finally, the user of Alluxio fuse. Because Alluxio fuse touches the Linux file system and has many implementation details tied to the local file system, it is more complicated than the cases above; here we only discuss the default case, i.e. alluxio.fuse.auth.policy.class=alluxio.fuse.auth.LaunchUserGroupAuthPolicy. Users access the mount directory as their current Linux user; every file in the mount directory appears to be owned by the user that started the fuse process; writes to the local cache directory are performed by the user that started the fuse process. Beyond that, when the fuse process talks to the Alluxio cluster, it follows the Alluxio Java Client logic described above.
To sum up, the recommended user setting method is:

1) The Alluxio cluster is started with the alluxio account, and the alluxio account is set as the HDFS super user;

2) S3 Proxy is started as the alluxio account; when users access it, the AK should be their HDFS account;

3) Alluxio fuse is started as root (to avoid permission problems writing local cache data), with the allow_other mount option added and alluxio.security.login.username configured as the HDFS user.

7. Other issues

During the rollout we ran into many problems, most related to configuration tuning. The root cause is that Alluxio is a general-purpose cache system while user scenarios vary widely, so the default configuration can hardly fit everyone perfectly. For example, we run several Alluxio clusters, each solving a different problem, so their configurations differ slightly. Thanks to the many flexible configuration options Alluxio provides, most problems could be solved by changing configuration, so here we only cover a few "representative" issues that left an impression on us.
Maximum number of replicas: in the model online scenario we set no upper limit on the number of cached replicas, because a model read is typically one large model read by dozens or even hundreds of containers at once; it does not occupy much storage, but the read count is huge and highly concurrent, and the model is rarely read a second time. So we do not cap the replica count of each cached file, letting every Worker cache a copy to achieve maximum throughput and best performance. In the model training scenario we set the replica count to 3: on one hand, training data is large and we need to save storage; on the other, the local cache of Alluxio fuse absorbs most of the traffic, so the throughput demanded of the Workers is much lower.
S3 Proxy ListObjects issue: we found that S3 Proxy ignores the maxkeys parameter when handling ListObjects requests, listing a large number of unnecessary directories. For example, if the requested prefix is /tmp/b and maxkeys is 1, S3 Proxy recursively lists every file under /tmp and then picks the first entry matching the prefix /tmp/b, which not only performs badly but can also lead to OOM. We applied a temporary fix; if you are interested, see #16926. A proper fix is more involved and needs changes in both the Master and S3 Proxy; we look forward to progress on #16132.
Monitoring address conflict: we use Prometheus for monitoring. Alluxio exposes some metrics, but JVM metrics require adding an agent and port to the Master's or Worker's startup parameters. After adding the agent, because the monitor process inherits the Master's and Worker's startup parameters, it also tries to use the same metrics port, which triggers an "Address already in use" error and prevents the monitor from starting. See #16657 for details.
Master abnormally loads full UFS metadata: if a path contains a UFS mount point underneath it, calling getStatus on that path causes the Alluxio Master to recursively sync the metadata of all files under it. For example, if /a/b under /a is a UFS mount path, calling getStatus("/a") loads all metadata under /a; if /a is a huge path, loading that much metadata can cause frequent GC or even hang the Master. See #16922 for details.
Master frequently updates access time: during our use we found the Master occasionally got stuck. With help from the Alluxio community we traced it to the Master frequently updating the last access time of files, and we fixed it by merging #16981.

8. Summary and Outlook

We actually started evaluating Alluxio in the second half of 2022, but for various reasons it was shelved for a while, so Alluxio was not rolled out until this year. Throughout the research and rollout, the Alluxio community was our strongest external support and gave us a great deal of help.
This time we put Alluxio to the test in the algorithm scenarios, and the results were a pleasant surprise.
In terms of performance, in the model online scenario, replacing UnionStore with Alluxio improved performance by up to dozens of times; in the model training scenario, with the help of Alluxio fuse's local data cache, we reached speeds close to a local NVMe disk, a 2-3x improvement over the UnionStore + s3fs-fuse solution.
In terms of stability, when HDFS jitters or a NameNode failover happens during an upgrade, Alluxio's data and metadata caches allow it to keep serving normally for a period of time without being affected.
In terms of cost, Alluxio saves us hundreds of thousands in hard cash every year compared to UnionStore, and still has performance headroom to spare.
From the perspective of long-term development, Alluxio has strong scalability, especially Alluxio's new-generation architecture Dora, which can support our demand for massive small file caching, which makes us more confident in supporting the algorithm team, facing the upcoming artificial intelligence wave.
Finally, I would like to thank the Alluxio team again for providing us with a lot of help and suggestions during our launch. I also hope that we can continue to cooperate and communicate in depth in the field of big data OLAP query acceleration scenarios and distributed dataset orchestration.

[Case 2: Ants] Application of Alluxio in the large-scale training of Ant Group

1. Background introduction

The first is why we introduced Alluxio. In fact, the problems we face are basically the same as those in the industry:

  • The first is storage IO performance. GPU model training keeps getting faster, which inevitably puts pressure on the underlying storage; if the underlying storage cannot keep up with the GPUs' training speed, it severely constrains training efficiency.
  • The second is single-machine storage capacity. Our models keep getting larger and will inevitably outgrow what a single machine can hold, so how do we support training of such large models?
  • The third is network latency. We have many storage solutions today, but none of them combines high throughput, high concurrency, and low latency. Alluxio gives us such a solution: it is relatively lightweight, works out of the box, and can be deployed in the same data center as the compute, minimizing network latency and performance loss. This is the main reason we decided to bring Alluxio into Ant Group.

The core of this talk is divided into three parts, covering the work we did after introducing Alluxio at Ant Group: the first part is stability construction, the second is performance optimization, and the third is scaling up.

2. Stability Construction

First, why do we need stability construction? Our resources are scheduled by Kubernetes, and nodes are frequently restarted or migrated, so the cluster has to go through frequent failovers (FO), and failover time is directly reflected in the user experience. If the cluster is unavailable for two minutes during a failover, users may see a flood of errors; if it is unavailable for several hours, their model training jobs may simply be killed. Stability construction is therefore the top priority. Our optimizations fall into two parts: one is worker register follower, the other is master migration.
1.  Worker Register Follower

First, some background on this problem. The figure above shows Alluxio in its steady state: the masters provide the metadata service and keep metadata consistent internally through Raft, the primary serves metadata externally, and the workers serve data externally. Workers discover and register with the primary, so the whole system runs stably. If the primary is restarted, a failover has to happen: the standbys re-elect a leader through Raft, and until the election completes the workers are disconnected from the metadata service. Once a new primary is elected, the workers have to re-discover it and re-register; only then does the new primary serve metadata and the workers serve data again, completing the failover. The problem lies in this re-registration step, which faces three issues:
The first is that the cluster is unavailable before the first worker registers. The newly elected primary has leadership but no data nodes yet, so clients can access metadata but not data.
The second is the performance impact of cold data while the workers are registering. Once the first worker registers, the cluster can serve requests again, but while the remaining workers are still registering, a client that accesses a block cached on, say, worker2 will get a miss; the data has to be re-read from the UFS by an already-registered worker such as worker1, which is effectively a warm-up and slows performance down.
The third is redundant data left over after registration completes. The data re-cached during the registration window becomes redundant once all workers have registered, and has to be cleaned up afterwards. In terms of severity, the unavailability before the first worker registers is the worst: if the workers are small, registration may take 2-5 minutes, and the cluster is unavailable and throwing errors for that long; if the workers are large, for example several terabytes of cached data each, full registration can take hours, and the whole cluster cannot serve requests for hours. From the user's point of view the cluster is unstable, so this had to be optimized.
Our optimization is to have every worker register with all masters in advance: as soon as a worker starts, it registers with every master, and it then keeps its status up to date through real-time heartbeats. The effect of this optimization is shown in the figure below:

Now, if the primary is restarted, Raft elects a new primary internally and that primary can serve requests right away. The failover only needs three steps: Raft self-discovery after the restart, the election itself, and catching up on the journal. Because the workers are already registered with the standbys, there is no need to wait for workers to re-register, which saves the bulk of the failover time.
This solution is very efficient: failover completes within 30 seconds, which greatly shortens the FO time. It does have a downside. If a standby master is restarted while the primary keeps serving normally, the workers have to re-register their block metadata with that standby, and this block-metadata traffic is substantial; it puts some load on the workers and on the masters receiving the registrations. If the cluster is not heavily loaded, the impact is negligible, so we judged the optimization worthwhile.
2.  Master Migration

As shown in the figure, three masters initially provide service and reach a steady state, and the workers register with the primary. Then the masters are replaced one by one by resource scheduling: standby3 replaces standby1, standby4 replaces standby2, and finally a new primary replaces the old primary, so the master cluster now consists of standby3, standby4, and the new primary. Normally the workers should connect to this new cluster, keep up a heartbeat, and continue serving, but they do not. The reason is that the master list known to the workers was statically injected by configuration at startup; it is never updated dynamically, so the workers keep recognizing the three old nodes and never learn about the three new ones. In this scenario the whole cluster is effectively down: with no data nodes attached, it cannot serve requests. Recovery requires manually writing the three new master nodes into the configuration and restarting every worker so they pick it up. If the cluster is large and has many workers, the operational cost is very high. This is the master migration problem we faced. Next, let's look at how we addressed it for stability:
Our solution is to maintain a master heartbeat between the primary and the workers: whenever the master membership changes, the primary pushes the current master list to the workers through this heartbeat, so the workers always see the up-to-date set of masters. For example, when standby3 replaces standby1, the primary synchronizes the current three nodes, primary, standby2, and standby3, to the workers; when standby4 replaces standby2, the updated membership is synchronized again. When a new primary node is added, the workers are told about all four nodes; after the old primary restarts and a new election completes, the workers already hold those four nodes, so they can find the current leader among them, recognize the new primary, and receive the three new masters from it. This makes master iteration safe: the cluster can be moved around by resource scheduling without losing stability. These two parts, worker register follower and master migration, make up our stability construction work; a minimal sketch of the mechanism follows.
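The sketch below is illustrative only: the class and method names are invented and this is not Alluxio code. It shows the two stability ideas above, assuming that workers register with every known master at startup and that the primary's heartbeat response carries the current master list so membership changes reach workers without restarts.

```python
# Minimal sketch of the stability ideas above: workers register with all
# masters up front, and heartbeats carry the live master list.
from dataclasses import dataclass, field

@dataclass
class MasterCluster:
    masters: dict = field(default_factory=dict)   # address -> is_primary

    def replace(self, old: str, new: str):
        """A master node is replaced by resource scheduling."""
        is_primary = self.masters.pop(old)
        self.masters[new] = is_primary

    def membership(self) -> list:
        return sorted(self.masters)

@dataclass
class Worker:
    worker_id: str
    known_masters: list = field(default_factory=list)

    def start(self, cluster: MasterCluster):
        # Register with every master (primary and standbys) at startup, so a
        # newly elected primary already knows this worker's blocks and
        # failover skips the lengthy re-registration step.
        self.known_masters = cluster.membership()
        for address in self.known_masters:
            print(f"{self.worker_id} registers with {address}")

    def heartbeat(self, cluster: MasterCluster):
        # The heartbeat response piggybacks the current master membership,
        # so configuration changes propagate without restarting workers.
        self.known_masters = cluster.membership()

cluster = MasterCluster({"primary": True, "standby1": False, "standby2": False})
w = Worker("worker-1")
w.start(cluster)
cluster.replace("standby1", "standby3")   # standby replaced by rescheduling
w.heartbeat(cluster)
print(w.known_masters)                    # ['primary', 'standby2', 'standby3']
```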

3. Performance optimization

For performance optimization we mainly implemented follower read-only. First, some background, as shown in the figure:

This is Alluxio's current overall architecture. The client fetches metadata from the leader and then accesses the appropriate workers based on that metadata. The leader and the standbys keep metadata consistent through Raft; metadata updates can only be initiated by the leader and then replicated to the standbys, so the order matters. A standby cannot push new updates to the leader, as that would violate data consistency.
On the other hand, after the worker-register-follower optimization described earlier, the standbys also maintain connections with the workers and collect block information from them, so in terms of data completeness a standby already looks much like the leader; its state is essentially consistent with the leader's.
Using the standbys purely as backups for stability is a waste of resources, but we cannot break Raft's consistency rules to use them for writes. So we explored whether they could serve read-only requests: read-only operations do not append Raft journal entries and have no effect on consistency, so the standbys' capacity can be fully utilized. This also matches our business scenario: for model training and file cache acceleration, data is written only during the initial warm-up and is read-only afterwards. With such a read-heavy workload, the performance gain the standbys bring to the whole cluster is considerable.

The following is a detailed optimization scheme, as shown in the figure:

Here is the detailed scheme, building on the previous work. All workers register with all standbys, so the standbys' state is essentially the same as the primary's, and the primary maintains the master heartbeat with the workers. When a client issues a read-only request, the request is hashed to any of the current masters, processed there, and returned to the client; write requests still go to the primary. Without breaking Raft consistency, read-only performance therefore scales with the number of masters: with three masters, read-only throughput should roughly triple, and our actual tests of follower read confirm this, with even more pronounced gains. This is the performance optimization we made after introducing Alluxio.
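A minimal client-side sketch of the routing rule described above (names are invented for illustration and this is not Alluxio's client API): read-only metadata calls are hashed across all masters, while writes always go to the Raft leader.

```python
# Minimal sketch of follower read routing: reads fan out across all masters,
# writes go only to the Raft leader. Names are invented for illustration.
import zlib

class MetadataRouter:
    def __init__(self, masters, primary):
        self.masters = masters      # e.g. ["master-1", "master-2", "master-3"]
        self.primary = primary      # current Raft leader

    def pick_for_read(self, path: str) -> str:
        # Hash the path so read-only load spreads evenly over primary + standbys.
        idx = zlib.crc32(path.encode()) % len(self.masters)
        return self.masters[idx]

    def pick_for_write(self) -> str:
        # Writes append Raft journal entries, so they must hit the leader.
        return self.primary

router = MetadataRouter(["master-1", "master-2", "master-3"], primary="master-1")
print(router.pick_for_read("/datasets/train/part-00001"))  # any of the 3 masters
print(router.pick_for_write())                             # always master-1
```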

4. Scale up

Scaling up is mainly about horizontal expansion. First, the background of the problem, as shown in the figure:

This is again the Alluxio architecture. The master contains several components: the block master, the file master, Raft, and snapshots. The main limiting factors come from these four aspects:

  • Block master: in a large cluster its bottleneck is memory, since the worker block information it stores consumes a large amount of master memory;
  • File master: it mainly stores inode information, and at large scale the pressure on local storage is heavy;
  • Raft synchronization efficiency;
  • Snapshot efficiency: if snapshots cannot keep up, journal entries accumulate in the background, which also holds back performance.

Our tests show that with moderate machine specs a single cluster can support roughly 300-600 million files. To reach 1 billion or even tens of billions, simply buying bigger storage machines is unrealistic: the scale of model training can grow without bound, but machine specs cannot. So how do we optimize for this?

For this we mainly borrowed from Redis's approach: shard the metadata at the bottom layer and let multiple clusters serve requests together, while still presenting a single entry point to the outside. Other strategies are possible, such as letting users manage multiple fully independent clusters and decide which data goes to which one, but that shifts the burden onto users. In our design, when the user's overall dataset is too large for a single cluster, the metadata is partitioned by hashing the keys, and each hash range is mapped to one shard, so each small cluster only needs to cache the files whose keys fall into its range and caching can be targeted per cluster.
The remaining data goes to the other clusters: the full hash space is divided across a fixed number of clusters, so the complete set of files for a large training job can be cached across several shards and served for large-scale model training. In front of this we add a proxy, which maintains the hash mapping table internally; a user request is looked up in this table by the proxy and routed to the cluster responsible for it. For example, for a file request the proxy computes its hash and determines that it maps to cluster1, so cluster1 handles it, while keys that map elsewhere go to other clusters, spreading the data out. This brings benefits in several respects (a sketch of the routing follows the list below):

  • First, the total metadata capacity becomes larger;
  • Second, request pressure is spread across multiple clusters, so overall QPS capacity and cluster throughput improve accordingly;
  • Third, the scheme can in theory be expanded to many clusters: if a single cluster supports 300-600 million files, three clusters support 0.9-1.8 billion, and with further expansion it can also cover the tens-of-billions scale.
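A minimal sketch of the proxy's hash mapping mentioned above. The modulo-based mapping, the hash function, and the names here are assumptions for illustration; the production scheme may use a different hash or a configurable mapping table.

```python
# Minimal sketch: a proxy that routes file paths to metadata shards (clusters)
# by hashing the path. The mapping scheme and names are illustrative only.
import hashlib

class ShardingProxy:
    def __init__(self, clusters):
        self.clusters = clusters            # e.g. ["cluster1", "cluster2", "cluster3"]

    def route(self, path: str) -> str:
        # A stable hash of the path decides which shard owns (and caches) it.
        digest = hashlib.md5(path.encode()).hexdigest()
        shard = int(digest, 16) % len(self.clusters)
        return self.clusters[shard]

proxy = ShardingProxy(["cluster1", "cluster2", "cluster3"])
for p in ["/model/ckpt-0001", "/data/part-042", "/data/part-043"]:
    print(p, "->", proxy.route(p))
# Each cluster only caches the keys that hash to it, so total metadata
# capacity and QPS scale roughly linearly with the number of shards.
```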

Those are the optimizations we made; the whole effort covers stability construction, performance optimization, and scale improvement.

  • In terms of stability construction: we can keep the failover time of the entire cluster within 30 seconds, and combined with other mechanisms, such as client-side metadata caching, failover can be made invisible to users. That is what users care about most: everything recovers underneath without their noticing, their training is not interrupted, and they see no errors, which makes the system much friendlier to use.
  • In terms of performance optimization: single-cluster throughput improved by more than three times, which lifts overall performance and supports more concurrent model training tasks.
  • In terms of scale: as model training sets keep growing, the sharded architecture lets us keep supporting them.

Since introducing Alluxio at Ant Group with these optimizations, the benefit to every business line has been clear. We also cooperate closely with the open source community, which has given us a lot of help, for example providing solutions and support on urgent issues. We would like to express our gratitude here!

[Case 3: Microsoft] Cache optimization practice for large-scale deep learning training

Speaker: Zhang Qianxi - Microsoft Senior R&D Engineer

Introduction

In recent years, with the rise of deep learning, Alluxio's distributed caching technology has gradually become a mainstream industry solution to IO performance problems on the cloud. Beyond that, Alluxio naturally provides the unified management and access capabilities that data lakes require. This article shares cache optimization practices for large-scale deep learning training: it analyzes the storage status quo and challenges of large-scale training today, explains how cached data orchestration is applied in deep learning training, and introduces resource allocation and scheduling for large-scale cache systems.

1. Project background and caching strategy

First, let me share the relevant background.

In recent years, AI training has become more and more widespread. From the infrastructure point of view, most big data and AI training clusters use an architecture that separates storage and compute. For example, GPU arrays sit in a large compute cluster, while the storage is a separate cluster or cloud storage such as Microsoft Azure or Amazon S3. The characteristics of such infrastructure are that the compute cluster holds many very expensive GPUs, each GPU machine usually has some local storage such as tens of terabytes of SSD, and the machines are connected to the remote storage over high-speed networks, through which very large training datasets such as COCO, ImageNet, and YouTube-8M are read.

As shown in the figure above, data may become the bottleneck for the next generation of AI training. We observe that datasets keep getting larger and more training data accumulates as AI applications spread. At the same time the GPU market is fiercely competitive: vendors such as AMD, and Google with its TPU, put a lot of effort into optimizing hardware and software, making accelerators such as GPUs and TPUs ever faster, and as accelerators are widely adopted, cluster deployments keep growing. The two tables here show how dataset sizes and GPU speeds have evolved: from the earlier K80 to the P100, V100, and A100, the speedup is dramatic, but faster GPUs are also more expensive. Whether data IO can keep up with GPU speed is a big challenge.

As shown in the figure above, in many large companies we observe the same phenomenon: while reading remote data, the GPU sits idle waiting for it. This means IO has become the bottleneck and expensive GPU time is being wasted. A lot of work goes into alleviating this bottleneck, and caching is one of the most important optimization directions. Here are two approaches.

First, in many scenarios, especially in basic AI training stacks built on Kubernetes and Docker, local disks are heavily used. As mentioned above, each GPU machine has some local storage, so the local disk can be used as a cache: after a GPU container starts, instead of launching training immediately, it first downloads the data from the remote end into the container (or mounts it), and training starts only after the download finishes. In this way, as many subsequent training reads as possible become local reads, and local IO performance is currently sufficient to feed GPU training. At VLDB 2020 there is a paper, CoorDL, which builds data caching on top of DALI along these lines. This approach has several problems, though. Local space is limited, so the amount of data that can be cached is limited; as datasets grow it becomes impossible to cache everything. Another difference between AI and big data scenarios is that AI datasets are relatively few: unlike big data, where there are many tables across many businesses with very different contents, the number and variety of datasets in AI is far smaller. As a result, many jobs submitted within a company read the same data; if everyone downloads it to their own machine, it cannot be shared and many copies of the same data are stored redundantly. This method clearly has many problems and is not efficient enough.

The second approach follows. Since local storage is not ideal, can a distributed cache such as Alluxio alleviate the problem? A distributed cache has far more capacity for loading data, and Alluxio as a distributed cache is easy to share: once data is downloaded into Alluxio, other clients can read it from the cache too. From this point of view, Alluxio can easily address the problems above and greatly improve AI training performance. A paper named Quiver, published by Microsoft Research India at FAST 2020, describes such a solution. However, our analysis found that this seemingly perfect allocation scheme is rather static and not efficient enough, and which cache eviction algorithm to use is also a question worth discussing.

The figure above shows an application that uses Alluxio as the cache for AI training. Kubernetes schedules the cluster's tasks and manages resources such as GPU, CPU, and memory. When a user submits a job to Kubernetes, a plugin first notifies the Alluxio master to download the data the job will need, i.e. to do some warm-up and cache as much of it as possible in advance. It does not have to be fully cached; Alluxio serves whatever it has, and anything not yet cached is read from the remote end. The Alluxio master then directs its workers to download the data from the remote end, which may be cloud storage or a Hadoop cluster. Meanwhile Kubernetes schedules the job onto the GPU cluster, for example onto the first and third nodes in the figure, and the training task starts reading data. Mainstream frameworks such as PyTorch and TensorFlow have prefetching built in, so the data already cached in Alluxio is read ahead to feed training IO, and anything missing can still be read remotely through Alluxio. Alluxio works well as a unified interface, and it also allows data to be shared across jobs.
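As an illustration only: assuming the Alluxio namespace is exposed to the training container through an alluxio-fuse mount (the mount point /mnt/alluxio and the dataset layout below are hypothetical), a training job can consume the cached data with a standard PyTorch DataLoader, whose worker processes provide the prefetching mentioned above.

```python
# Minimal sketch: reading training data through a hypothetical Alluxio FUSE mount.
# The mount point and dataset layout are assumptions for illustration only.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

ALLUXIO_FUSE_ROOT = "/mnt/alluxio/datasets/imagenet/train"  # hypothetical path

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# ImageFolder reads through the FUSE mount; hot files are served from the
# Alluxio cache, cold files fall through to the remote UFS.
train_set = datasets.ImageFolder(ALLUXIO_FUSE_ROOT, transform=transform)

# num_workers + prefetch_factor overlap data loading with GPU compute,
# which is the built-in prefetching the text refers to.
train_loader = DataLoader(
    train_set,
    batch_size=256,
    shuffle=True,          # reshuffled every epoch, as described later
    num_workers=8,
    prefetch_factor=4,
    pin_memory=True,
)

for images, labels in train_loader:
    pass  # forward/backward pass would go here
```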

As shown in the figure above, suppose another person submits a job that consumes the same dataset. When that job is submitted to Kubernetes, Alluxio knows this data is already cached. If Alluxio wants to do even better, it can even learn which machines the job will be scheduled to; for example, if it is scheduled to nodes 1, 3, and 4, the data needed on node 4 can be replicated there. That way no data needs to be read across machines, even inside Alluxio; everything is read locally. So Alluxio appears to alleviate and optimize the IO problem in AI training very well. But looking more closely, there are two problems.

The first problem is that the cache eviction algorithm is very inefficient, because the data access pattern in AI scenarios is very different from traditional workloads. The second is that cache, as a resource, trades off against remote bandwidth (the read speed of remote storage): with a large cache, little data is read from the remote end; with a small cache, a lot of data must be read remotely. How to schedule and allocate these resources well also needs to be considered.

Before discussing the eviction algorithm, look at how AI training accesses data. Training is divided into many epochs and iterates over the data; in each epoch every data item is read exactly once, and to prevent overfitting the read order is reshuffled for the next epoch. So every epoch reads all the data once, but in a different order. Alluxio's default LRU eviction algorithm clearly does not fit this pattern, because LRU exploits locality. Locality has two aspects. The first is temporal locality: data accessed now is likely to be accessed again soon. That does not hold in AI training, since data accessed now will only be accessed again in the next epoch, with no higher probability than any other item. The other aspect is spatial locality: Alluxio caches data in fairly large blocks because when one piece of data is read, neighboring data is likely to be read too. In big data scenarios, OLAP applications often scan tables, so nearby data is accessed immediately, but this does not hold in AI training either, because every epoch is shuffled and the read order differs every time. Therefore LRU eviction is not suitable for AI training scenarios.

Not just LRU: mainstream eviction algorithms such as LFU have the same problem, because AI training accesses every item equally. So the simplest possible caching policy works: cache a fixed subset of the data and never evict it. Once a job arrives, always keep the same subset cached and never evict; no eviction algorithm is needed. That is arguably the best eviction policy here. Take the example above: the upper row is LRU, the lower row the uniform policy. To keep it simple, the cache holds only two items, D and B are cached initially, and the access sequence is shown in the middle. The first access is B, which hits under LRU. The next access is C, which is not in {D, B}, so LRU evicts D and keeps C; the cache is now {C, B}. The next access is A, also a miss, so B is evicted and the cache becomes {C, A}. The next is D, again a miss, so the cache becomes {D, A}. Continuing this way, none of the subsequent accesses hit the cache. The reason is that LRU keeps caching items that, within an epoch, have already been read once and will not be read again in that epoch; caching them does not help and actually makes things worse. The uniform policy below simply keeps D and B cached and never replaces them, and in this case it achieves at least a 50% hit rate. So the cache algorithm does not need to be complicated: just use the uniform policy instead of algorithms such as LRU and LFU.
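A minimal simulation of the point above, under assumed parameters (the dataset size, cache capacity, and epoch count below are made up for illustration): with per-epoch shuffled access, a fixed ("uniform") cache beats LRU.

```python
# Minimal sketch: hit rate of LRU vs. a fixed ("uniform") cache under
# epoch-shuffled access. Dataset size, cache size, and epoch count are
# illustrative assumptions, not measurements from the talk.
import random
from collections import OrderedDict

N_ITEMS, CACHE_SIZE, EPOCHS = 1000, 200, 5
items = list(range(N_ITEMS))

def lru_hit_rate():
    cache = OrderedDict()
    hits = total = 0
    for _ in range(EPOCHS):
        random.shuffle(items)                  # every epoch reads all items, new order
        for x in items:
            total += 1
            if x in cache:
                hits += 1
                cache.move_to_end(x)           # refresh recency
            else:
                cache[x] = True
                if len(cache) > CACHE_SIZE:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total

def uniform_hit_rate():
    pinned = set(items[:CACHE_SIZE])           # cache a fixed subset, never evict
    hits = total = 0
    for _ in range(EPOCHS):
        random.shuffle(items)
        for x in items:
            total += 1
            hits += x in pinned
    return hits / total

print(f"LRU hit rate:     {lru_hit_rate():.2%}")      # typically only a few percent
print(f"Uniform hit rate: {uniform_hit_rate():.2%}")  # ~ CACHE_SIZE / N_ITEMS = 20%
```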

The second question concerns the relationship between caching and remote bandwidth. All mainstream AI frameworks now have data prefetching built in to keep the GPU from waiting: while the GPU is training, the CPU prefetches the data that may be needed in the next round, so that GPU compute can be fully utilized. But once remote storage IO becomes the bottleneck, the GPU has to wait for the CPU, leaving a lot of idle GPU time and wasting resources. We hope a better scheduling and management approach can alleviate this IO problem.

Caching and remote IO both have a large impact on overall job throughput, so in addition to GPU, CPU, and memory, cache and network also need to be scheduled. Historically, big data schedulers such as Hadoop YARN and Kubernetes mainly scheduled CPU, memory, and GPU, and did not handle the network, and especially the cache, very well. We therefore believe that in AI scenarios these resources need to be scheduled and allocated properly to reach the optimum for the whole cluster.

2. SiloD framework

This work was published at EuroSys 2023: a unified framework that schedules compute resources and storage resources together.

The overall structure is shown in the figure above. The lower left corner shows the cluster's CPU and GPU compute resources as well as its storage resources, such as NFS, cloud storage, and HDFS. On the upper layer are AI training frameworks such as TensorFlow and PyTorch. We believe a layer is needed in between to uniformly manage and allocate compute and storage resources, and that is our proposed SiloD.

As shown in the figure above, the throughput a job can achieve is determined by the minimum of its GPU-bound speed and its IO-bound speed. The remote network traffic a job generates equals its processing speed multiplied by its cache miss ratio, i.e. (1 - c/d), where c is the cache size allocated to the job and d is its dataset size. Therefore, when only IO is considered as the potential bottleneck, the achievable throughput is approximately b / (1 - c/d), where b is the remote bandwidth allocated to the job. Combining these relations gives the formula on the right for the performance a job can ultimately achieve: compute the throughput without an IO bottleneck and the throughput with an IO bottleneck, and take the minimum of the two.

Differentiating this formula with respect to the cache size yields the cache effectiveness, or cache efficiency, of a job. In other words, even with many jobs, they should not be treated equally when allocating cache: depending on each job's processing speed and dataset size, the right amount of cache to allocate varies a great deal. As an example from the formula, if a job trains very fast and its dataset is small, allocating it a larger cache yields a bigger payoff.
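A minimal numeric sketch of this model, with made-up job parameters (compute speed, dataset size, remote bandwidth, and the greedy step are illustrative assumptions, not SiloD's actual policy), showing the min(·) throughput formula and the idea of giving cache to the job whose marginal benefit is highest.

```python
# Minimal sketch of the throughput model T(c) = min(T_compute, b / (1 - c/d)),
# with a greedy cache allocation by marginal benefit. All numbers are
# illustrative assumptions in arbitrary but consistent units.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    t_compute: float  # throughput when compute-bound
    d: float          # dataset size
    b: float          # remote bandwidth share (throughput when fully remote)
    c: float = 0.0    # cache currently allocated

    def throughput(self) -> float:
        miss = max(1e-9, 1.0 - self.c / self.d)    # cache miss ratio (1 - c/d)
        return min(self.t_compute, self.b / miss)  # min of compute-bound and IO-bound

def marginal_gain(job: Job, delta: float) -> float:
    """Cache efficiency: throughput gained per extra unit of cache."""
    before = job.throughput()
    job.c += delta
    after = job.throughput()
    job.c -= delta
    return (after - before) / delta

# Hand out cache in small chunks to whichever job benefits most right now.
jobs = [Job("fast-small", t_compute=900, d=100, b=200),
        Job("slow-large", t_compute=300, d=1000, b=200)]
total_cache, step = 400.0, 10.0
remaining = total_cache
while remaining > 0:
    best = max(jobs, key=lambda j: marginal_gain(j, step))
    best.c += step
    remaining -= step

for j in jobs:
    print(f"{j.name}: cache={j.c:.0f}, throughput={j.throughput():.0f}")
```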

Based on these observations, SiloD allocates cache and network: the cache size is allocated according to each job's speed and the overall size of its dataset, and the same goes for the network. The overall architecture works like this: besides a mainstream job scheduler such as Kubernetes, there is a data-management layer, shown on the left of the figure, such as cache management, which tracks or monitors the total cache allocated across the cluster, each job's cache size, and the remote IO each job uses. The jobs themselves then work much as they do with Alluxio, using the API to read training data, with a cache on each worker supporting local jobs that can also be shared across nodes in the cluster.

Preliminary tests and experiments show that this allocation method can significantly improve cluster utilization and throughput, with up to an 8x performance improvement, and it clearly alleviates the state of jobs waiting while GPUs sit idle.

To summarize the above introduction:
First, in AI or deep learning training scenarios, traditional caching strategies such as LRU and LFU are not suitable; it is better to use the uniform policy directly.
Second, cache and remote bandwidth are a coupled pair of resources that play a very large role in overall performance.
Third, SiloD can easily integrate with mainstream scheduling frameworks such as Kubernetes and YARN.
Finally, the experiments in the paper show that different scheduling strategies can bring a significant improvement in throughput.

3. Distributed cache strategy and replica management

We have also done some open source work. Our work on distributed cache policy and replica management has been submitted to the community and is now in the PR stage. The Alluxio master mainly manages metadata and the worker cluster; it is the workers that actually cache the data, in units of blocks. One problem is that the current cache policies are local to a single worker: when deciding whether to evict a block, a worker only looks at its own state.

In the example above, if worker 1 holds block A, block B, and block C, and LRU determines that block C has gone unused the longest, worker 1 evicts block C. Globally, though, that is a bad choice, because block C has only one copy in the entire cluster; once it is evicted, anyone who later wants block C can only pull it from the remote end, at a cost in performance and money. We propose a global eviction policy instead: in this case block C should not be evicted; a block with more replicas should be. Here block A should be evicted, since it still has two copies on other nodes, which is better in both cost and performance.

As shown in the figure above, what we do is maintain replica information on each worker. When a worker adds or removes a replica, it reports to the master, and the master distributes this information to the other relevant workers as part of the heartbeat response, so every worker knows the global replica counts of its blocks in near real time and keeps them updated. When a worker evicts, it therefore knows how many replicas of each block exist cluster-wide and can factor that in, for example by keeping LRU but weighting it with the replica count to decide which block to evict and replace. Our preliminary tests show this brings a big improvement in many settings, both big data and AI training. The goal is not just to optimize cache hits for one worker on one machine, but to raise the cache hit ratio of the entire cluster.
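An illustrative sketch only: the scoring rule below, which combines local idle time with the global replica count, is a made-up weighting, not the exact heuristic in the community PR. Each worker picks an eviction victim by preferring blocks that are both cold locally and well replicated elsewhere.

```python
# Minimal sketch: replica-aware eviction on a single worker. The weighting
# (idle time multiplied by replica count) is an illustrative assumption.
import time
from dataclasses import dataclass, field

@dataclass
class CachedBlock:
    block_id: str
    last_access: float            # unix timestamp of last local access
    cluster_replicas: int = 1     # global copy count, learned via master heartbeats

@dataclass
class WorkerCache:
    blocks: dict = field(default_factory=dict)

    def on_heartbeat(self, replica_counts: dict):
        """Master piggybacks global replica counts on the heartbeat response."""
        for block_id, count in replica_counts.items():
            if block_id in self.blocks:
                self.blocks[block_id].cluster_replicas = max(1, count)

    def pick_victim(self) -> str:
        """Evict the block that is old locally AND redundant globally."""
        now = time.time()

        def score(b: CachedBlock) -> float:
            idle = now - b.last_access
            # Higher score = better eviction candidate: a long idle time is
            # amplified when many other copies exist in the cluster.
            return idle * b.cluster_replicas

        return max(self.blocks.values(), key=score).block_id

# Example: block C is the coldest but has a single copy, so plain LRU would
# evict C; with replica weighting, the well-replicated block A is chosen.
cache = WorkerCache()
cache.blocks = {
    "A": CachedBlock("A", last_access=time.time() - 200),
    "B": CachedBlock("B", last_access=time.time() - 30),
    "C": CachedBlock("C", last_access=time.time() - 300),
}
cache.on_heartbeat({"A": 3, "B": 2, "C": 1})
print(cache.pick_victim())   # -> "A"
```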

Finally, to summarize: first, in AI training scenarios the uniform cache eviction policy beats traditional LRU and LFU; second, cache and remote network are also resources that need to be allocated and scheduled; third, when optimizing caching, do not restrict yourself to a single job or a single worker, but optimize globally, end to end, to improve the efficiency and performance of the entire cluster.

For more [event information], [technical articles], and [expert insights], please follow [Alluxio Think Tank]:
