Building a unified data access layer based on Alluxio

01 Background

First, let's introduce how our data centers are laid out. For cost and disaster-recovery reasons, Zhihu runs a hybrid multi-cloud architecture, shown in the diagram below:

[Figure: multi-cloud hybrid architecture]

Offline data center: a computing center built for big-data business teams. It hosts services such as offline scheduling, offline storage, and the scheduling platform, whose goal is to provide efficient offline data processing and computing. Here, big-data teams can safely run batch processing and computing jobs that meet their data processing, storage, and scheduling needs.

Online data center: designed for Zhihu's main site, serving users directly with core services such as comments, answers, search, and recommendation. Its focus is real-time responsiveness, ensuring users get a stable, fast experience. For a knowledge community like Zhihu, this data center underpins the continuous, high-quality exchange and sharing of knowledge and information.

GPU data center: dedicated to the machine learning platform, mainly serving algorithm users. It provides a one-stop solution for GPU resource management, model training, and data set import and export. Its core mission is to give algorithm users high-performance computing resources for model training and inference, so they can focus on model development and optimization rather than on procuring compute.

A multi-cloud architecture brings new challenges for storage. Compared with a single data center, we must additionally account for dedicated-line capacity and network latency. In algorithm scenarios, both model training and model serving must avoid pulling too much data across the dedicated lines; saturating them would affect other cross-cloud services. Some time ago we adopted Alluxio for large language model training and for launching recommendation/search models, using its caching to solve the performance and dedicated-line traffic problems of cross-cloud data access, with good results. If you are interested, see our earlier write-up on multi-cloud caching at Zhihu (https://zhuanlan.zhihu.com/p/622005118).


With the rapid iteration of our large language model project, we gained a deeper understanding of Alluxio and found that its value goes far beyond caching. With its two core capabilities, caching and a unified namespace, we successfully built a unified data access layer, further improving data processing efficiency and ease of management.

02 Large language model training and data management

As we all know, the data sets used to train large language models, and the models they produce, are extremely valuable. We therefore want algorithm users to be subject to strict permission control when using them, to prevent models and data sets from being leaked and to protect the company's assets.

The first concern is securing the underlying data. All of our data is stored on HDFS, and HDFS authentication and authorization are currently based on group accounts and HDFS ACLs, a coarse-grained control. Predictably, group-account authentication leads to credentials being passed around by word of mouth within a team. People also differ in security awareness: a colleague with weaker awareness might write HDFS credentials into the configuration or code of a GitLab project, leaking them. Even if the security department detects such an incident in time, it is hard to determine whether the credentials were actually compromised. So we worked with the security department and the algorithm team to design a new security scheme.

Here is our plan:

1. Build a new HDFS cluster dedicated to large language model training, instead of sharing HDFS with other users;

2. Use the security-group policies provided by the cloud vendor to restrict network-level access to this independent HDFS cluster. We call the network policy applied to the HDFS machines the black box policy, and these machines black box machines;

3. Only specific machines may access black box machines, and the black box policy is contagious: any machine that can access the black box is itself pulled into the black box and restricted by the same policy;

4. A small number of machines outside the black box may access specific services inside it to export data sets or models. These are called gray box machines. Gray box machines are strictly monitored and restricted; any abnormal behavior triggers an alarm immediately and notifies all relevant colleagues in the company WeChat group.

In short, the core idea of our solution is to deploy every service required for large language model training on machines governed by the black box policy. This adds some operational burden, but it protects the models and data sets to the greatest extent possible.

       The final architecture diagram of our large language model training is as follows:

[Figure: large language model training architecture]

The entire model training process is as follows:

1. The raw, unprocessed data sets are stored on offline HDFS and are cleaned and processed by the offline Spark cluster;

2. The cleaned, high-quality data sets are transferred to the black box HDFS to support model training;

3. Users read and write model data sets through Alluxio Fuse; network policies prevent them from accessing the black box HDFS or the offline HDFS directly;

4. When a model training container starts, the relevant data set directories are mounted according to the data sets declared by the task, ensuring data isolation between tasks;

5. Alluxio mounts the offline HDFS data in read-only mode, so training containers can read offline HDFS data but cannot write to it;

6. The audit logs generated by the black box HDFS are shipped to Kafka, where Flink consumes them and computes behavior statistics; abnormal behavior triggers an alarm immediately (a sketch of such a monitoring job follows).
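As a rough illustration of step 6, the monitoring job might look like the Flink sketch below. The Kafka topic, the audit-log parsing, the one-minute window, the alert threshold, and the print sink are all illustrative assumptions, not our production setup:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class AuditMonitor {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")        // placeholder broker address
        .setTopics("hdfs-audit")                  // placeholder audit topic
        .setGroupId("blackbox-audit-monitor")
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

    env.fromSource(source, WatermarkStrategy.noWatermarks(), "hdfs-audit")
        // HDFS audit lines carry a "ugi=<user>" field; count operations per user.
        .map(line -> Tuple2.of(extractUser(line), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(t -> t.f0)
        .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
        .sum(1)
        .filter(t -> t.f1 > 10_000)               // threshold is an assumption
        .map(t -> "ALERT: user " + t.f0 + " issued " + t.f1 + " ops in 1 min")
        .print();                                 // swap in a real alerting sink

    env.execute("blackbox-hdfs-audit-monitor");
  }

  private static String extractUser(String auditLine) {
    int i = auditLine.indexOf("ugi=");
    if (i < 0) return "unknown";
    int end = auditLine.indexOf(' ', i);
    return auditLine.substring(i + 4, end < 0 ? auditLine.length() : end);
  }
}
```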

Throughout the training process, Alluxio serves as the bridge between the large language models and HDFS data, providing efficient, secure data access for model training:

First, we use Alluxio's high-performance caching to improve the efficiency of cross-cloud access to training data sets; efficient IO greatly improves GPU utilization;

Second, we use Alluxio's unified namespace to mount multiple HDFS clusters into Alluxio, giving model training flexible, unified data access. Because Alluxio lets us mark a UFS mount as read-only, we can ensure the data cannot be leaked (see the mount sketch after these three points);

Finally, thanks to Alluxio Fuse's flexible mounting into the local file system, we achieved isolation and control of data sets.
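As a minimal sketch of the second point, the mounts can be set up with the Alluxio 2.x Java client roughly as follows (the `alluxio fs mount --readonly` shell command does the same); the cluster addresses and paths here are placeholders, not our real layout:

```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;
import alluxio.grpc.MountPOptions;

public class TrainingMounts {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.Factory.get();

    // Writable mount for the black box cluster holding training data and models.
    fs.mount(new AlluxioURI("/blackbox"),
             new AlluxioURI("hdfs://blackbox-ns/llm"));

    // Read-only mount for the offline cluster: training containers can read the
    // cleaned data sets but can never write back across the dedicated line.
    fs.mount(new AlluxioURI("/offline"),
             new AlluxioURI("hdfs://offline-ns/warehouse"),
             MountPOptions.newBuilder().setReadOnly(true).build());
  }
}
```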

Although Alluxio is powerful, it is not omnipotent; some problems remain unresolved:

On the one hand, Alluxio currently does not support random writes. In random-write scenarios we can only write data to a local disk or to a file system that supports random writes, after which the algorithm platform synchronizes the data to HDFS.

On the other hand, Alluxio does not provide a pipeline write similar to that of HDFS DataNodes, so we cannot write multiple replicas into Alluxio in one pass. Worried about losing data with a single cached replica, we avoid asynchronous writes and instead write through Alluxio Fuse to HDFS synchronously, even though this is less efficient than asynchronous writing.

03 Recommendation/search model training

This section briefly introduces the training of our recommendation/search models, which has long been the main consumer of dedicated-line bandwidth. Although we provide UnionStore, a self-developed component, as a data access layer for algorithm users, only a small number of users adopted it because of its performance; for a long time, most users read and wrote models and data directly against HDFS. The model training architecture is shown below:

[Figure: model training architecture with direct HDFS access and UnionStore]

Although connecting directly to HDFS meets users' needs in the short term, it carries hidden risks. As training scales up, the traffic from reading training data off HDFS keeps growing, and offline HDFS data must traverse two dedicated lines (offline data center → online data center → GPU data center). It is hard to guarantee that the dedicated lines between data centers never saturate, and expanding them is very expensive, so we had to find a reliable caching component to relieve the dedicated-line traffic. Having accumulated experience with Alluxio in large language model training, and knowing it meets our needs well, we again chose Alluxio for recommendation/search model training. The process is basically the same as for large language models; the only difference is that training does not happen inside a black box. The architecture is as follows:

[Figure: recommendation/search model training architecture with Alluxio]

In both large language model training and recommendation/search model training, we found the performance of Alluxio Fuse's synchronous writes to HDFS inadequate. Natively, a synchronous write through Alluxio to HDFS is only as fast as writing to HDFS directly, more than ten times slower than an Alluxio deployment that writes asynchronously to NVMe disks. We therefore developed a new scheme for Alluxio Fuse to write synchronously to HDFS. Here is our acceleration scheme:

1. Maintain a memory pool inside Alluxio Fuse containing multiple memory blocks;

2. When Alluxio Fuse accepts a user write, it does not write to HDFS directly but requests a free memory block from the pool. If a block is obtained, the data is written into it; if not, the write falls back to the normal write logic;

3. When a memory block fills up, a thread pool inside Alluxio Fuse asynchronously uploads its data to a temporary file on HDFS; once the upload completes, the block is returned to the pool;

4. Repeat step 2 until the user finishes writing. When the file is closed, Alluxio Fuse waits for all memory blocks to finish uploading, uses the HDFS concat command to splice the temporary files into one complete file, and finally renames that file to the target path.
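A condensed sketch of this write path follows. All class and variable names are hypothetical, and retries, error handling, and the fallback to normal write logic are omitted; the real logic lives inside Alluxio Fuse rather than in a standalone class:

```java
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of the memory-pool writer; not Alluxio's actual code. */
public class PooledHdfsWriter implements AutoCloseable {
  // Align memory blocks with the HDFS block size (see the notes below).
  private static final int BLOCK_SIZE = 128 * 1024 * 1024;

  private final FileSystem fs;
  private final Path target;  // final destination path
  private final Path tmpDir;  // must be under the same NameService as target
  private final BlockingQueue<byte[]> pool = new LinkedBlockingQueue<>();
  private final ExecutorService uploaders = Executors.newFixedThreadPool(5);
  private final List<Future<Path>> parts = new ArrayList<>();

  private byte[] current;  // memory block currently being filled
  private int offset;
  private int partId;

  public PooledHdfsWriter(FileSystem fs, Path target, Path tmpDir, int poolBlocks) {
    this.fs = fs;
    this.target = target;
    this.tmpDir = tmpDir;
    for (int i = 0; i < poolBlocks; i++) pool.add(new byte[BLOCK_SIZE]);
  }

  public void write(byte[] data, int off, int len) throws Exception {
    while (len > 0) {
      if (current == null) {
        current = pool.take();  // step 2: grab a free block (the real path falls
        offset = 0;             // back to normal write logic if none is free)
      }
      int n = Math.min(len, BLOCK_SIZE - offset);
      System.arraycopy(data, off, current, offset, n);
      offset += n; off += n; len -= n;
      if (offset == BLOCK_SIZE) flushCurrent();  // step 3: block full, upload it
    }
  }

  private void flushCurrent() {
    final byte[] block = current;
    final int size = offset;
    final Path part = new Path(tmpDir, target.getName() + ".part" + partId++);
    current = null;
    // Upload the block to a temporary file in the background, then return the
    // block to the pool. A production version would retry failed uploads here.
    parts.add(uploaders.submit(() -> {
      try (FSDataOutputStream out = fs.create(part, true)) {
        out.write(block, 0, size);
      }
      pool.add(block);
      return part;
    }));
  }

  @Override
  public void close() throws Exception {
    if (current != null && offset > 0) flushCurrent();
    List<Path> done = new ArrayList<>();
    for (Future<Path> f : parts) done.add(f.get());  // step 4: wait for uploads
    uploaders.shutdown();
    if (done.isEmpty()) {  // nothing was written; publish an empty file
      fs.create(target, true).close();
      return;
    }
    Path first = done.get(0);
    if (done.size() > 1) {
      // Splice the remaining temporary files onto the first one.
      fs.concat(first, done.subList(1, done.size()).toArray(new Path[0]));
    }
    fs.rename(first, target);  // atomic publish: visible only once complete
  }
}
```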

Writing this way has the following benefits:

1. The write is atomic, with no intermediate state: if the write succeeds the file is visible; if it fails, the file is not;

2. Uploading memory blocks to HDFS temporary files can be retried multiple times, raising the success rate of writes to HDFS. This helped greatly with a problem we hit some time ago: our DataNodes use storage-dense models and had hit a block-lock bottleneck (since solved by splitting the lock into read/write locks, among other fixes), which caused a high failure rate when the algorithm business wrote large files to HDFS;

3. Uploading memory-block data to HDFS is handled by a background thread pool and supports very high concurrency. From the user's perspective, writing a file into Alluxio Fuse is effectively writing into memory, which yields extremely high write speeds. In our test, a user writing single-threaded to Alluxio Fuse reached a write speed of 1 GB/s while Alluxio Fuse uploaded data to HDFS with 5 background threads.

The following points need to be noted:

1. With HDFS Federation, the temporary file path must take the NameService into account: the temporary files and the target path must be under the same NameService, or the rename will fail;

2. Each memory block in the pool should be aligned with the HDFS block size (usually 64 MB or 128 MB) as much as possible. Blocks that are too small create too many blocks per file and put extra pressure on the NameNode;

3. Although this scheme applies only to HDFS, the idea is general: for other UFSes such as object storage, the same effect can be achieved with MultipartUpload, as sketched below.
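The object-storage variant could be sketched with the AWS SDK for Java v1 as below; the bucket, key, part size, and number of parts are placeholders, and in practice the parts would be uploaded concurrently by the same kind of background thread pool:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class MultipartUploadSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "my-bucket", key = "model/output.bin";  // placeholders

    InitiateMultipartUploadResult init =
        s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));

    List<PartETag> etags = new ArrayList<>();
    byte[] block = new byte[8 * 1024 * 1024];  // one "memory block" of data
    for (int part = 1; part <= 3; part++) {
      etags.add(s3.uploadPart(new UploadPartRequest()
          .withBucketName(bucket).withKey(key)
          .withUploadId(init.getUploadId())
          .withPartNumber(part)
          .withInputStream(new ByteArrayInputStream(block))
          .withPartSize(block.length)).getPartETag());
    }

    // Completion publishes the object atomically, mirroring concat-then-rename.
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, init.getUploadId(), etags));
  }
}
```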

After adopting Alluxio for recommendation/search model training, dedicated-line traffic dropped significantly, especially for retrospective training tasks, which read several months of data, tens or even hundreds of TB, at extremely high concurrency. Alluxio caches this data in the cluster for reuse, avoiding repeated reads from HDFS across the dedicated lines.

04 Object storage unified management and acceleration

        First, let's explain why unified management of object storage is necessary.

On the one hand, cloud services are requested and procured internally on a per-team basis: each team has its own cloud-vendor sub-account shared by its members. Object storage, one of our most-used cloud services, is likewise allocated per sub-account. Although we can apply for different buckets, all buckets under the same sub-account share one access key (AK) and secret key (SK). This creates a security problem: a member with low security awareness may write the AK/SK into the configuration file of a public repository, exposing the data of every bucket under the sub-account to leakage.

On the other hand, since Zhihu uses a hybrid multi-cloud architecture and object storage is tied to a particular cloud vendor, a little carelessness when using object storage across clouds can incur expensive public-network traffic charges.

These two problems troubled us for a long time; they were finally solved completely once we introduced Alluxio.

        Before we introduced Alluxio, object storage was used as follows:

[Figure: object storage usage before Alluxio]

As can be seen:

• All users use the same AK/SK to access different object storage buckets;

• Some users (user4) access object storage across data centers, which may generate public-network traffic.

After the introduction of Alluxio, object storage is used as follows:

• Each object store is mounted to a different first-level directory in Alluxio, with directory names mapping one-to-one to bucket names, so users can access them easily through the S3 Proxy;

• Using the Alluxio S3 Proxy user-authentication plug-in, each user has an independent AK/SK pair that does not interfere with anyone else's; even if a key leaks, we can rotate it promptly (see the access sketch after this list);

• We map AKs to different users, so Alluxio's directory-permission mechanism lets each AK carry different permissions and access only its designated directories, achieving isolation between buckets;

• When users access data across data centers, the object storage traffic passes through the Alluxio proxy, so public-network traffic becomes dedicated-line traffic between data centers. The more recommended approach, though, is to deploy another Alluxio cluster in the data center that needs the data and serve it from the local cache, which saves both public-network and dedicated-line traffic and performs better.
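From a user's perspective, bucket access through the proxy might look like the sketch below, using the AWS SDK for Java v1. The endpoint follows Alluxio's documented S3 Proxy layout (`/api/v1/s3` on port 39999), while the host name, credentials, and paths are placeholders:

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3ProxyExample {
  public static void main(String[] args) {
    // Each user holds an individual AK/SK; the authentication plug-in maps it
    // to an Alluxio user, and directory permissions scope what it may access.
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(
            new BasicAWSCredentials("user4-ak", "user4-sk")))      // placeholders
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
            "http://alluxio-proxy:39999/api/v1/s3", "us-east-1"))
        .withPathStyleAccessEnabled(true)
        .build();

    // "bucket-a" is a first-level Alluxio directory backed by the real bucket.
    System.out.println(s3.getObjectAsString("bucket-a", "path/to/object"));
  }
}
```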

With Alluxio proxying object storage, users need almost no changes: through the Alluxio S3 Proxy they access Alluxio over the object-storage protocol while enjoying Alluxio's high-performance data and metadata caching. Accessing object storage through the S3 Proxy can raise single-thread download speeds by nearly a hundred times; meanwhile, Alluxio's metadata cache cuts the number of requests users send to object storage, giving better API performance and reducing cost while increasing efficiency.

05 Summary and outlook

In this article, we described in detail how, under Zhihu's hybrid multi-cloud architecture, we use Alluxio to optimize large language model training and data management, recommendation/search model training, and the unified management and acceleration of object storage. Leveraging Alluxio's core capabilities of caching and a unified namespace, we built a unified data access layer that significantly improved data processing efficiency and ease of management.

For large language model training and data management, we built an independent HDFS cluster and used black box policies, gray box machines, and network policies to keep the data secure. Through Alluxio's high-performance cache and unified namespace, we achieved efficient, secure data access while solving cross-cloud, cross-data-center data reading.

For recommendation/search model training, Alluxio relieved the dedicated-line traffic problem and significantly reduced the cost of reading data across data centers. By introducing the memory pool and the concurrent HDFS upload scheme, we further improved write speeds.

Finally, for the unified management and acceleration of object storage, we mounted the various object stores into Alluxio and implemented user authentication and permission control through the Alluxio S3 Proxy, solving both the security and the multi-cloud usage problems while also delivering high-performance data access.

Once introduced internally, Alluxio won unanimous praise from users, which in turn drove its rapid growth. So far we have deployed 5 large Alluxio clusters totaling more than 300 nodes, with cache capacity reaching the PB level. These clusters are spread across data centers and support several key areas, including large language model training, recommendation/search model training and serving, the real-time computing platform, and the data integration platform.

Overall, Alluxio plays an important role in Zhihu's multi-cloud architecture, solving a series of problems around data security, cross-cloud access, and dedicated-line traffic, and providing Zhihu with an efficient, secure, and convenient solution for data processing and model training. Going forward, we will continue to explore Alluxio's potential and more application scenarios, contributing further to Zhihu's technical development.

Excerpted from: Building a unified data access layer based on Alluxio


Origin blog.csdn.net/iamonlyme/article/details/132844810