AIGC data processing and storage solutions

In 2023, the AIGC Shanghai special session of AIGC Technology Week was held in Putuo District. The event unfolded in three chapters: "Intelligence Emergence", "Computing Power Breakthrough", and "Beyond Reality". With AI as the engine and computing power as the cornerstone, the digital base is built; the third chapter, "Beyond Reality", ushers in a new era of "spatial computing" and a new stage in which the real and the virtual coexist.

Yang Guanjun, Expert Architect of Tencent Cloud Storage Solution

Addressing how to solve data storage and data processing problems in the AIGC training process, Yang Guanjun covered three topics: first, the new demands AIGC places on storage; second, the overall storage solution Tencent Cloud can provide to users; and third, the overall data processing solution provided by Tencent Cloud.

AIGC's New Requirements: Model Training and Inference Applications

The amount of data generated in our country each year is growing very quickly, and that trend was measured before AIGC scenarios appeared over the past two years. Now, with the shift from UGC to AIGC, I believe the amount of data generated across the industry will be even larger. How should this data be processed, and how should it be applied in systems? All of this brings greater demands and challenges to data storage.

Starting from the raw data collected at the beginning, data processing generates the corresponding preprocessed data, which is then passed to the subsequent model training stage. Throughout the model training process the volume of data keeps growing, which also creates demand for unified data storage.

Tencent Cloud summarizes three requirements for this training scenario. The first is unified data lake storage: throughout the AIGC process the amount of data stored is very large, and a data lake is needed to meet the storage requirements and avoid data silos. The second is data flow between the processing stages of each business: if the data is kept in separate traditional file systems, data silos appear, so a unified store is needed to serve all stages. The third is high throughput and low latency: in AIGC scenarios GPU computing power is scarce and expensive, so customers want training to run as fast as possible and the GPUs to be used as fully as possible. This places a requirement on the underlying storage: the faster data can be read out and supplied to the training layer above, the more value it delivers.
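The third requirement can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the cluster size, per-GPU sample rate, and sample size are hypothetical numbers, not figures from the talk, and the model assumes storage is the only bottleneck.

```python
def required_read_throughput_gbps(num_gpus, samples_per_sec_per_gpu, avg_sample_bytes):
    """Aggregate read throughput (GB/s) needed so that no GPU waits for data."""
    bytes_per_sec = num_gpus * samples_per_sec_per_gpu * avg_sample_bytes
    return bytes_per_sec / 1e9


def gpu_utilization_bound(storage_gbps, needed_gbps):
    """Upper bound on GPU utilization when storage is the only bottleneck."""
    if needed_gbps <= 0:
        return 1.0
    return min(1.0, storage_gbps / needed_gbps)


# Hypothetical cluster: 64 GPUs, each consuming 2,000 samples/s of 150 KB each.
needed = required_read_throughput_gbps(64, 2000, 150_000)
print(f"needed: {needed:.1f} GB/s")  # needed: 19.2 GB/s
print(f"bound at 10 GB/s: {gpu_utilization_bound(10.0, needed):.0%}")  # 52%
```

Under these assumed numbers, storage that delivers only 10 GB/s would leave the GPUs idle for almost half of the time, which is why read throughput translates directly into training cost.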

In the AIGC business process, the core requirements of inference application scenarios consist mainly of two parts: content review and data intelligence. After a trained model is deployed and offered to users as a service, the usual flow is that the user supplies a prompt and the model generates data from it. Whether the output is text-to-text, text-to-image, or video, it requires massive data storage, which on Tencent Cloud is provided by object storage.

While this data is being generated, national security and compliance regulations require the use of the content review and data processing capabilities provided by Tencent Cloud. Users also want data intelligence functions for this data. Based on our analysis of AIGC businesses, intelligent retrieval is used here; this requirement is described in detail later.

Tencent Cloud Storage Solution

AIGC's overall storage solution uses three Tencent Cloud products: COS object storage, GooseFS, and GooseFSx. From uploading the raw datasets to the cloud, through model training and inference applications, to data storage in content governance, Tencent Cloud provides a one-stop overall storage solution.

On the far left is the dedicated cloud data migration service provided by Tencent Cloud, which can import data collected by users, or data held with other cloud vendors, into Tencent Cloud COS object storage. The middle part is the one-stop storage solution mentioned above. The bottom layer is COS object storage, the base of Tencent Cloud's massive storage. The two products above it, GooseFS and GooseFSx, accelerate data preprocessing in AIGC scenarios and meet the POSIX access requirements of model training.

In an era of explosive data growth, object storage remains the most reasonable storage base. The figure above shows the overall service framework of Tencent Cloud Object Storage COS. At the bottom of this architecture is Yotta, Tencent Cloud's self-developed distributed object storage engine, which can support 10,000 servers and EB-level storage in a single cluster. It is well suited to serve as unified data lake storage for raw data and for data generated by AIGC. In addition, COS provides several storage classes, including standard, low-frequency, archive, and deep archive, and supports cost reduction through lifecycle management, so customers can have a massive storage system without paying excessive storage costs.
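The idea behind lifecycle management can be sketched as a rule table that moves data to colder classes as it ages. This is an illustration only: the class labels follow the four tiers named above, but the day thresholds are invented for the example and the function is not part of any COS SDK.

```python
# Hypothetical lifecycle rules: (minimum age in days, storage class label).
# Ordered from coldest to hottest so the first matching rule wins.
LIFECYCLE_RULES = [
    (365, "deep archive"),
    (180, "archive"),
    (30, "low-frequency"),
    (0, "standard"),
]


def storage_class_for(age_days):
    """Pick the storage class for an object that was last modified age_days ago."""
    for min_days, storage_class in LIFECYCLE_RULES:
        if age_days >= min_days:
            return storage_class
    raise ValueError("age_days must be non-negative")


for age in (3, 45, 200, 400):
    print(age, "->", storage_class_for(age))
```

In a real deployment the equivalent rules would be configured on the bucket and applied by the storage service, so applications do not need to move the data themselves.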

While working with customers on their data preprocessing requirements, we found that compute nodes usually have plenty of idle local disk. Tencent Cloud GooseFS is a distributed cache system that can put these disks on compute nodes to good use, accelerating access to the underlying object storage and providing higher read performance to upper-layer applications. GooseFS also supports several commonly used protocols, including HDFS, FUSE, and S3. Across different application scenarios, GooseFS improves the performance of upper-layer applications accessing COS, typically by 2 to 10 times.
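The general caching idea can be sketched as a read-through cache with least-recently-used eviction placed in front of a slow backend. This is a simplified, single-process illustration and not GooseFS's actual implementation; the class name `ReadThroughCache` and the dictionary that stands in for object storage are hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve reads from a local cache and fall back to the backend on a miss."""

    def __init__(self, fetch_from_backend, capacity):
        self._fetch = fetch_from_backend  # callable: key -> bytes
        self._capacity = capacity         # maximum number of cached objects
        self._entries = OrderedDict()     # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        data = self._fetch(key)             # slow path: go to the backend
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


# A plain dictionary stands in for the object storage backend.
backend = {"img/0.jpg": b"a", "img/1.jpg": b"b", "img/2.jpg": b"c"}
cache = ReadThroughCache(backend.__getitem__, capacity=2)
for key in ["img/0.jpg", "img/1.jpg", "img/0.jpg", "img/2.jpg", "img/1.jpg"]:
    cache.read(key)
print(cache.hits, cache.misses)
```

Repeated reads of the same samples, which are common across training epochs, are served locally; only the first read of each object pays the cost of going to remote storage.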

The following describes best practice for GooseFS in data preprocessing scenarios. The deployment scheme has three main characteristics: low cost, high performance, and high reliability. Low cost: GooseFS Workers are deployed on the compute nodes and use their NVMe SSDs as the cache medium, providing PB-level cache space. High performance: data flows over the VPC network, and multiple nodes together can deliver TB/s-level throughput. High reliability: the GooseFS Masters are deployed separately, and three nodes use the Raft protocol to keep the GooseFS cluster highly reliable.
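Why three Master nodes are enough for high reliability follows from Raft's majority rule: a cluster keeps working as long as a majority of its nodes is available. A small sketch of that arithmetic (the function names are illustrative):

```python
def raft_quorum(num_nodes):
    """Smallest number of nodes that forms a majority."""
    return num_nodes // 2 + 1


def tolerated_failures(num_nodes):
    """How many nodes can fail while a majority is still reachable."""
    return num_nodes - raft_quorum(num_nodes)


for n in (1, 3, 5):
    print(n, "nodes: quorum", raft_quorum(n), "tolerates", tolerated_failures(n))
```

With three Masters the quorum is two, so the cluster survives the loss of any single Master.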

In AIGC training scenarios, much data access goes through file interfaces, consistent with the POSIX-semantics access used in traditional HPC and AI scenarios. Our GooseFSx product is fully compatible with POSIX-semantics access.

Compared with customers deploying a distributed file storage service themselves in the traditional way, GooseFSx has the following overall advantages:

1. Fully managed cloud service with one-click purchase and delivery, eliminating deployment, commissioning, and other operation and maintenance work;

2. Fully compatible with POSIX file semantics, so workloads need no changes;

3. Billing based on the capacity created, with pay-as-you-go pricing and flexible expansion to avoid idle resources;

4. Client software is deployed automatically and GooseFSx is mounted to a local directory on the host;

5. A distributed architecture, so performance increases linearly as nodes are added.
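Because the file system is mounted as a local directory with POSIX semantics, ordinary file I/O works without a storage-specific SDK. The sketch below uses a temporary directory as a stand-in for the mount point; the helper names and file layout are hypothetical, and on a real host only the root path would change.

```python
import tempfile
from pathlib import Path


def iter_dataset_files(mount_root, suffixes=(".jsonl", ".bin")):
    """Yield dataset files under a POSIX mount point, in sorted order."""
    for path in sorted(Path(mount_root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            yield path


def read_head(path, max_bytes=64):
    """Read the first bytes of a file with ordinary file I/O."""
    with open(path, "rb") as handle:
        return handle.read(max_bytes)


# The temporary directory plays the role of the mounted file system.
with tempfile.TemporaryDirectory() as mount_root:
    shard_dir = Path(mount_root) / "train"
    shard_dir.mkdir()
    (shard_dir / "shard-0001.jsonl").write_bytes(b'{"text": "hello"}\n')
    (shard_dir / "notes.md").write_bytes(b"not a dataset file")

    found = [p.name for p in iter_dataset_files(mount_root)]
    head = read_head(shard_dir / "shard-0001.jsonl")

print(found, head)
```

This is what "no changes to the workload" means in practice: a data loader written against local files keeps working when the directory is backed by a shared file system.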

Next, I will focus on the ability of data to flow freely between GooseFSx and COS. This matters a great deal when COS provides the unified data lake storage while upper-layer applications require POSIX file access.

1. Objects on COS are projected into GooseFSx with the same directory structure, derived from their keys;

2. Multiple bucket association: the data accelerator can accelerate multiple buckets at the same time;

3. Two-way flow: data can be loaded from COS, and newly produced files can be settled (written back) to COS;

4. Custom flow policy: load or settle by entire bucket or by a custom prefix;

5. Incremental synchronization: on repeated loads or settles, only incremental data is synchronized;

6. Data flow tasks: manage data flows, output task reports, and ensure the integrity of data flows, all in an easy-to-use form.
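Points 1, 4, and 5 above can be sketched in a few lines: object keys map to paths under the mount point, and a repeated load copies only objects that are new or have changed. This is a conceptual illustration, not the product's implementation; the function names, the mount path, and the use of an ETag-like fingerprint to detect changes are assumptions made for the example.

```python
from pathlib import PurePosixPath


def project_key(mount_root, key):
    """Map an object key such as 'train/a.bin' to a path under the mount root."""
    return str(PurePosixPath(mount_root) / key.lstrip("/"))


def plan_incremental_load(bucket_index, fs_index, prefix=""):
    """Return the keys under prefix that are missing or changed on the file system.

    Both indexes map key -> fingerprint (for example an ETag-like string).
    """
    return sorted(
        key
        for key, fingerprint in bucket_index.items()
        if key.startswith(prefix) and fs_index.get(key) != fingerprint
    )


bucket_index = {"train/a.bin": "e1", "train/b.bin": "e2", "eval/c.bin": "e3"}
fs_index = {"train/a.bin": "e1", "train/b.bin": "old"}

print(project_key("/mnt/goosefsx", "train/a.bin"))
print(plan_incremental_load(bucket_index, fs_index, prefix="train/"))
print(plan_incremental_load(bucket_index, fs_index))
```

Settling newly produced files back to object storage is the same comparison run in the opposite direction.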

Tencent Cloud Data Processing Solution

Data Vientiane (数据万象, offered in English as Cloud Infinite, CI) is a one-stop intelligent platform provided by Tencent Cloud. It integrates Tencent's leading AI technology into a toolbox for data processing, providing processing capabilities across all categories of multimedia data: image processing, media processing, content review, file processing, AI content recognition, document services, and more.

Tencent Cloud has multiple laboratories, and Data Vientiane integrates the technical capabilities of Tencent's cutting-edge labs: AI Lab (basic algorithms), Youtu Lab (image recognition), the Multimedia Lab (codec research), and Tianyu Lab (security risk-control algorithms). It combines these with best practices from Tencent's industry-leading businesses: Tencent Music (noise reduction, source separation, and similar scenarios), Tencent Video (video fingerprinting, codecs, and similar scenarios), Tencent News (image and text review), and Tencent National K Song (全民K歌; singing scoring, music tagging, and similar scenarios).

In AIGC scenarios, text is what people currently pay the most attention to. As multimodal models develop, there will be more and more text-to-image, text-to-audio, and text-to-video scenarios, and even generating video from images. Data Vientiane covers all of these, with image, audio, and video processing capabilities.

The country has long had content compliance and review requirements, and Data Vientiane's functions include content review. Whether for text, audio, or video, Data Vientiane provides a complete set of content review solutions and capabilities. For data stored on COS, it is easy to integrate content review into your business.

In summary, the one-stop data processing provided by Tencent Cloud has the following three advantages:

First, convenient integration: object storage and Data Vientiane together form an integrated platform that provides one-stop storage and content review;

Second, accurate models: drawing on the many customers Tencent Cloud works with, we have built dedicated review models and made specific optimizations for AIGC scenarios;

Third, higher performance: the data is stored in object storage, and review and processing are invoked within the same campus, so the latency of loading and processing is very low.

Another powerful function of Data Vientiane is the intelligent retrieval service. In the AIGC era, as data keeps growing, so does the demand for data retrieval. For example, with just dozens of gigabytes on a personal computer it is already hard to find the right data. As AIGC develops, the data users hold will reach TB and PB scale, and finding the right data becomes even harder. With the emergence of large models, intelligent retrieval works by extracting features from text, images, and videos, storing those features, and then matching the stored features against the input text, which makes search results both richer and more accurate.
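The extract, store, and match flow described above can be sketched with a tiny in-memory feature library and cosine similarity. Everything here is a toy assumption: the three-dimensional vectors are hand-written stand-ins for features a real model would extract, and cosine similarity is one common matching measure, not necessarily the one the service uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(feature_library, query_vector, top_k=2):
    """Return the names of the top_k stored items most similar to the query."""
    scored = [
        (cosine_similarity(vector, query_vector), name)
        for name, vector in feature_library.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical stored features for two images and one video.
feature_library = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.mp4": [0.0, 0.1, 0.9],
}

# Hypothetical features extracted from the query text.
query_vector = [0.8, 0.2, 0.0]
print(search(feature_library, query_vector))
```

Because text, images, and videos are all reduced to vectors in the same space, one matching routine can serve text-to-image, image-to-image, and similar search forms.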

Currently, the intelligent retrieval services supported by Tencent Cloud Data Vientiane include text-to-image search, image-to-image search, image-to-video search, and video-to-video search. Underneath the intelligent retrieval service is Data Vientiane's large language model, a vertical-domain large model that Tencent Cloud trained on authorized business data and its own business data after preprocessing and extraction, machine translation, model-based cleaning, image-text matching, manual proofreading, and other steps.

In practice, the intelligent retrieval service can be applied effectively in a variety of image retrieval scenarios. To sum up, it has three advantages:

First, building the feature library through intelligent image matting makes matching more accurate;

Second, it supports multiple retrieval forms across text and images, with convenient access through APIs and SDKs;

Third, the underlying layer is the large language model self-developed by Tencent Cloud Data Vientiane, which can return retrieval results in seconds.

Summary review

Focusing on AIGC, Tencent Cloud provides storage and data processing solutions covering the full life cycle of generation, review, and intelligence, in three parts:

The first is data generation: COS object storage, GooseFS, and GooseFSx support large language model training and the construction of inference platforms;

The second is content review: content review in Data Vientiane performs compliance review to keep the whole platform secure;

The third is data intelligence: intelligent retrieval services perform feature matching and queries to meet upper-layer business needs quickly.

Origin blog.csdn.net/Tencent_COS/article/details/132499239