Machine Learning Applications and Optimization of ByteDance's EB-Level Iceberg Data Lake

Deep learning models keep getting larger, and the volume of their training data is growing exponentially. This places higher demands on the storage layer for massive training data: how to read training samples fast enough that data loading does not become the bottleneck of model training, and how to support feature engineering more efficiently so that features can be added, deleted, and backfilled more conveniently. This article introduces how ByteDance supports exabyte-scale machine learning sample storage on an Iceberg data lake, achieving high-performance feature reading, efficient feature investigation, and feature engineering that accelerates model iteration.
 

Machine Learning Sample Storage: Background and Trends

Machine learning models are used very widely at ByteDance. To support model training, we have built two training platforms: a recommendation/advertising training platform and a general CV/NLP training platform. The recommendation/advertising platform trains tens of thousands of models per week, while the CV/NLP platform trains as many as 200,000 models per week. Such a large training scale would be impossible without massive training samples. ByteDance's offline training sample storage currently holds exabytes of data and is still growing by petabytes every day. These data support the training of advertising, search, recommendation, and other models across multiple business areas; they also support the algorithm teams' feature investigation and feature engineering, providing the basis for model iteration and optimization. Some current trends at ByteDance and across the industry in machine learning and training samples are as follows:
 
First, models and samples keep getting bigger. As model parameter counts grow, more and richer training data is needed to train these huge models and to ensure their accuracy and generalization ability.
Second, training compute keeps getting stronger. In the past, training a machine learning model could take weeks or even months. Today, with better model architectures and faster GPUs, we can complete training and A/B test validation in a relatively short time.
In addition, feature engineering is becoming more automated and end-to-end. In traditional machine learning, feature engineering is a critical step that usually requires a lot of manpower, time, and effort to process data and features. With the development of deep learning, we can rely on the feature extraction ability of deep networks to learn features automatically from simple data processing steps; the process can even be simplified to appending the raw candidate features as columns to a sample table and letting the deep learning framework learn and extract the information automatically.
 
Overall, machine learning and training samples play an important role in ByteDance's business. By building powerful training platforms and accumulating massive training samples, ByteDance can support large-scale model training and optimization. Industry trends also show that the growth of model and sample size and the improvement in training compute are driving machine learning forward, while increasingly automated, end-to-end feature engineering is making model training more convenient and efficient.
Machine Learning and Training Samples - Language Model Trends
Take language models as an example of the trend in parameter and sample counts. First is BERT, a language model released in 2018. BERT is based on the Transformer architecture and has only 340 million parameters, which at the time was already considered a major breakthrough. However, language models have grown in size and capability over time. A standout is GPT-3, a powerful language model developed by OpenAI: compared with BERT's 340 million parameters, GPT-3's parameter count soared to 175 billion. This huge increase attracted a lot of attention and led to GPT-3's impressive results on natural language processing tasks.
 
However, as parameter counts grow, model size itself becomes a problem, so people have started exploring smaller models. Chinchilla is one such attempt: compared with its predecessor, its parameter count is reduced by a factor of four while its training sample size is increased fourfold. The idea is to keep the model relatively small while improving performance with more data. The recently launched GPT-4 and Google's recently released second-generation PaLM have not disclosed specific model details, but it is reasonable to guess that these models may have reached trillions of parameters. These advances bring new opportunities and challenges for researchers in natural language processing and related fields.
 
These trends also point to several problems that need to be solved, and to the adjustments needed to reduce cost and improve efficiency.
 
First, the storage footprint of training samples needs to be optimized to reduce storage costs. As datasets grow, storage requirements and costs rise accordingly, which is a challenge for large-scale model training.
Second, the read speed of training samples also needs to be optimized. As chip technology iterates and compute grows, the computing resources used to train models keep increasing. If sample reading cannot keep up with this growth in compute, it becomes a bottleneck in the training process and limits how effectively compute resources can be used. We therefore need ways to increase sample read throughput and make sure the existing compute resources are fully utilized.
Finally, with the support of deep learning, feature engineering has become more automated and simpler, and we can ride this trend to further improve the efficiency of feature investigation and feature engineering. Accelerating the feature engineering and investigation process shortens the model iteration cycle and improves algorithm development efficiency.
 

Evolution of Sample Storage Solutions

Traditional Sample Storage Solutions

 
The first traditional approach is to store samples directly on HDFS, object storage, or Hive. This approach hits performance bottlenecks when processing massive numbers of samples: the single-point List operation becomes very slow when scanning a large number of sample files. In addition, adding columns or features requires a copy-on-write (COW) rewrite, which doubles the storage footprint, greatly increases cost, and wastes computing resources because of read and write amplification.
The second approach is to store samples in a traditional database. This works well for small numbers of samples but runs into difficulties when the data reaches the PB or EB scale. Moreover, because training code cannot read the database's underlying files directly, read throughput is limited; even in scenarios where features and labels are joined in real time, training throughput drops.
 

Data Lake Sample Storage Solutions

 
Among the emerging data lake based sample storage solutions, two have attracted the most attention: Apache Hudi and Apache Iceberg.
  • Apache Hudi provides MOR (Merge-On-Read) updates and column additions, which greatly reduce the cost of feature investigation and feature import compared with the traditional COW approach. However, Hudi's merge performance at read time is not ideal: it involves converting between multiple formats and spilling to disk, which causes additional IO. In addition, Hudi has no native Python API and can only be used through PySpark, which is not very friendly to algorithm engineers.
  • Apache Iceberg is an open table format that records a table's metadata: its schema, files, partitions, statistics, and so on. This metadata computation is highly scalable, providing better support for data lake management and faster file scanning. However, Iceberg's MOR approach also has some problems; for example, the community version does not support updating only some columns (partial update). It is worth mentioning that Iceberg provides a Python API, which is an important advantage for algorithm engineers (a brief Python sketch follows this list).
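As a concrete illustration of the Python access mentioned above, here is a minimal sketch using the open-source pyiceberg library; the catalog and table names are hypothetical, and the catalog configuration must be set up separately:

    # Minimal sketch of native Python access to an Iceberg table via pyiceberg.
    # "default" catalog and "ml.training_samples" are placeholder names.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default")                 # reads catalog config (e.g. ~/.pyiceberg.yaml)
    table = catalog.load_table("ml.training_samples")

    # Scan with a filter and column projection, returning an Arrow table that
    # can feed pandas or a training data loader without going through Spark.
    arrow_table = table.scan(
        row_filter="label >= 0",
        selected_fields=("sample_id", "label"),
    ).to_arrow()
    print(arrow_table.num_rows)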
 
To sum up, Apache Hudi and Apache Iceberg are both emerging data lake based sample storage solutions, each with its own characteristics and advantages. Although Hudi has performance issues in some areas and lacks Python support, its MOR approach is excellent for adding investigation features. Iceberg provides an open table format and highly scalable metadata computation, and supports a Python API, which makes it friendlier for algorithm engineers, but its MOR capability needs strengthening. In short, the common solutions all have shortcomings and none is ideal. After evaluating them from multiple angles, we decided to build on and enhance the Iceberg data lake ourselves to fill the gaps and meet the business needs for sample storage and feature engineering.
 

ByteDance's Enhanced Iceberg Data Lake: Magnus (Mammoth)

Overall Architecture

The overall architecture of Magnus ("Mammoth"), our in-house enhancement built on Apache Iceberg, is as follows:
The top layer is the compute layer, which follows the design principle of separating compute and storage. It naturally supports the Flink and Spark engines for data analysis and ETL processing, as well as a variety of training frameworks, including Primus, the distributed training scheduling framework our team recently open-sourced, and traditional frameworks such as PyTorch and TensorFlow. Users can choose the compute or training framework that suits their needs.
The second layer is the core layer of Magnus. Externally it provides an SDK for self-service use and a metadata service; internally the platform runs various operation and maintenance tasks such as data import and maintenance. Notably, this layer introduces an Arrow-based high-speed vectorized read-time merge engine that merges data efficiently and improves read performance. The foundation of Magnus is an enhanced version of Iceberg metadata, which supports functions such as version management and file scanning and gives users more comprehensive data management capabilities.
The bottom storage layer is the foundation of the whole architecture and is responsible for the actual data storage. It supports multiple file formats, including the open-source columnar format Parquet, the row-oriented format TFRecord, and other in-house formats. The platform encourages businesses to migrate to columnar formats, which on average saves about 30% to 50% of storage cost and improves read performance. Finally, the files are stored on HDFS or object storage to ensure data safety and reliability.
 

Core feature optimization and practice

Core Feature 1: Data Updates and Write Branches

 
As discussed above, lightweight updates based on MOR (merge-on-read) are the key to accelerating the feature investigation and engineering iteration cycle. So the first core feature we developed on Iceberg is lightweight data updates and branch management.
The Iceberg data lake manages the following file types: data files, which express newly added row records; delete files, which express row deletions; and, built on top of these, update files, which express column updates. When writing, updating, or adding columns, users only need to provide the row identifier, the primary key, and the data for the backfilled columns, which largely avoids read and write amplification and makes updates lightweight. At read time, the data files and update files are read together, and the merge is applied on the fly to produce the updated or newly added columns.
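The following is a minimal sketch (not the Magnus implementation) of the idea behind a merge-on-read column update: the update file carries only the primary key plus the backfilled column, and is joined onto the base data at read time. It uses plain PyArrow (recent versions provide Table.join).

    import pyarrow as pa

    # Base data file: existing sample rows (primary key + existing feature).
    data_file = pa.table({
        "sample_id": [1, 2, 3, 4],
        "feature_a": [0.1, 0.2, 0.3, 0.4],
    })

    # Update file: only the key and the newly backfilled column are written,
    # so adding a feature does not rewrite the existing data files.
    update_file = pa.table({
        "sample_id": [1, 2, 3, 4],
        "feature_b": [10, 20, 30, 40],
    })

    # Read-time merge: join on the primary key to materialize the widened sample.
    merged = data_file.join(update_file, keys="sample_id")
    print(merged.to_pandas())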
Iceberg's tree-structured metadata is expressive enough to support data branches well. During feature investigation, update files are written to an investigation branch that directly references the data files on the trunk, so branches stay isolated without affecting baseline model training on the trunk, and unnecessary data duplication is avoided. We have also developed the corresponding branch operations, so data can be manipulated as conveniently as with Git: merge, delete, and rebase (replaying a branch onto the trunk). These branch operations act only on Iceberg metadata, which is orders of magnitude lighter than operating on the data itself.
This feature is widely used to shorten the iteration cycle of feature investigation and to share features across multiple training targets.
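A toy, self-contained model (not the real Magnus metadata structures) of why branches are cheap: branch metadata only references the trunk's data files plus its own update files, so no sample data is copied, and merging is a metadata-only operation.

    # Toy metadata model: a table version is just lists of file references.
    trunk = {"data_files": ["part-000.parquet", "part-001.parquet"], "update_files": []}

    def create_branch(base):
        # Share the data-file references; only update files diverge.
        return {"data_files": list(base["data_files"]), "update_files": []}

    def merge_branch(base, branch):
        # Merging is a metadata operation: adopt the branch's update files.
        base["update_files"].extend(branch["update_files"])

    branch = create_branch(trunk)
    branch["update_files"].append("feature_xyz_update-000.parquet")  # backfilled column
    merge_branch(trunk, branch)   # promote the investigated feature to the trunk
    print(trunk)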
 
  • Application 1: Large-Scale Feature Investigation and Engineering
Building on the core update and branch capabilities, we use branches widely in feature engineering to speed up the feature investigation cycle. Some businesses have multiple high-potential feature sets, and algorithm engineers can backfill, investigate, and train them in parallel on their own branches. When the investigated model metrics meet expectations, the user submits a ticket for branch-merge review and for the feature to be written going forward; if there is a gap between the branch merge and the start of online writing, it can be backfilled offline onto the trunk.
For a mature model, most investigated features turn out not to be effective. In that case, after a branch is deleted, the data maintenance task removes the branch's files to reclaim space. Algorithm engineers can also keep rebasing a branch onto the trunk for further verification and investigation. This application has its own difficulties, such as the small-file problem caused by a large number of updates and merges, so we monitor the number of files on each branch and run compaction of small files only when necessary.
 
  • Application 2: Multiple Training Targets Sharing Features
Another application scenario is letting multiple training targets reuse the same features through data branches. When a new recommendation project introduces a new recommendation target, the algorithm engineer only needs to backfill the label for that target to reuse the existing features on the trunk directly; after a few hours of training, A/B experiments can start and the model's effect can be inspected. New features successfully investigated on the trunk can also be reused for all recommendation targets as soon as possible, with zero data copying. In the end, branching and feature-data reuse save about 90% of the sample storage space on some recommendation projects and greatly speed up the investigation cycle for new recommendation targets.
 

Core Feature 2: High-Speed Read-Time Merge Engine

Magnus Dataset is a read-time merge engine built on Apache Arrow. Apache Arrow is an open-source columnar in-memory format that supports multiple languages, zero-copy sharing within a process, extremely low serialization overhead, and vectorized computation. The Iceberg community also supports Arrow vectorized reads, but not for complex nested types, which is very unfriendly to training samples that contain nested data; Magnus Dataset supports them well.
On Primus, ByteDance's open-source training scheduling framework, read throughput is about 2x higher than with general vectorized reads. As a result, we can support high-performance read-time merging of samples without relying on compaction to merge files, so that data reading is no longer a bottleneck in GPU training. The output is in Arrow format and can be connected to interfaces such as Spark Dataset and Pandas in a zero-copy way.
During read-time merging, push-down filtering lets many samples be skipped or sampled for certain training models and data-processing jobs; we also use push-down filtering to reduce the amount of training sample computation and speed things up. On top of high-speed read-time merging, we also support memory unification and shuffle optimization for massive samples, described in the next two sections.
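As an open-source analogue of these two ideas (not the internal Magnus Dataset), the sketch below uses the PyArrow dataset API: the predicate is pushed down to the scan so filtered samples are skipped, and the resulting Arrow batches are handed to downstream consumers with cheap conversion. The sample path and column names are hypothetical.

    import pyarrow.dataset as ds

    dataset = ds.dataset("/path/to/samples", format="parquet")   # hypothetical path

    # Push the filter and column projection down to the scan: only matching
    # rows/row groups and the selected columns are actually read.
    scanner = dataset.scanner(
        columns=["sample_id", "label", "feature_a"],
        filter=ds.field("label") >= 0,          # e.g. skip unlabeled samples
    )

    for batch in scanner.to_batches():          # columnar Arrow RecordBatches
        df = batch.to_pandas()                  # cheap hand-off for analysis
        ...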
 
  • Application 1: Unifying Memory on Arrow
For this feature, we worked with other internal teams to switch the end-to-end in-memory format of offline training to Arrow for the head models, which greatly reduced memory and compute usage, avoided a lot of unnecessary format conversion and serialization overhead, and achieved significant gains. We also made some Arrow-related changes in the Spark engine commonly used for data analysis and processing. Note that we are also trying to switch online streaming training to Arrow, but the overhead there is still high; the likely reason is that streaming samples arrive one at a time, which does not fit Arrow's batch-oriented format and causes extra overhead.
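A minimal sketch of the general idea of keeping a Spark processing path in Arrow format, assuming Spark >= 3.3 (which provides mapInArrow); this is an illustrative example, not ByteDance's internal changes. Batches stay columnar inside the user function, avoiding per-row conversion.

    import pyarrow as pa
    import pyarrow.compute as pc
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumnRenamed("id", "sample_id")

    def normalize(batches):
        # Each element is a pyarrow.RecordBatch; operate on whole columns at once.
        for batch in batches:
            ids = batch.column(0)                               # int64 "sample_id"
            scaled = pc.divide(pc.cast(ids, pa.float64()), 1000.0)
            yield pa.RecordBatch.from_arrays([ids, scaled], ["sample_id", "scaled"])

    out = df.mapInArrow(normalize, schema="sample_id long, scaled double")
    out.show(3)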
 
  • Application 2: Shuffle Optimization for Massive Samples
When processing massive samples, algorithm engineers spend a lot of time cleaning data so that models perform better. Data cleaning usually requires shuffle operations, and the common problems are shuffle failures and slowness. Here we optimized shuffles based on updates and push-down filtering.
For example, a user needs to join a PB-scale sample table with a medium-sized table. Their bucketing schemes differ, so the usual bucket join cannot be used, and memory is insufficient for the usual broadcast join. In this case we can use the Update operation to write the small table into a temporary branch of the large table, laying it out the same way as the large table, and then read the joined samples at high throughput with push-down filtering.
Similarly, if a user needs to shuffle and deduplicate PB-scale samples to improve model performance, the smaller portion of data can be written into the large table via Update in the same way and then retrieved with push-down filtering.
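A toy, self-contained sketch (not the production code) of the idea behind this Update-based "join": only the small table is re-bucketed into update files that follow the large table's layout, so the large table itself is never shuffled and each bucket can be merged locally at read time. All names and the bucketing function are illustrative.

    NUM_BUCKETS = 4                      # assumed to match the large table's bucketing

    def bucket_of(sample_id: str) -> int:
        return sum(sample_id.encode()) % NUM_BUCKETS     # stand-in for the real hash

    # Medium-sized table: extra columns per sample, originally laid out differently.
    small_table = {"s3": {"extra": 0.7}, "s9": {"extra": 0.1}}

    # Step 1: rewrite the small table as per-bucket "update files" on a temp branch.
    update_files = {b: {} for b in range(NUM_BUCKETS)}
    for sample_id, cols in small_table.items():
        update_files[bucket_of(sample_id)][sample_id] = cols

    # Step 2: each bucket of the large table consults only its own update file at
    # read time; rows without a match are dropped, as a push-down filter would do.
    def read_joined(bucket_id, bucket_rows):
        updates = update_files[bucket_id]
        for sample_id, row in bucket_rows:
            if sample_id in updates:
                yield sample_id, {**row, **updates[sample_id]}

    some_bucket = [("s3", {"feature_a": 1.0}), ("s12", {"feature_a": 2.0})]
    print(list(read_joined(bucket_of("s3"), some_bucket)))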
 

Core Feature 3: Upsert and Global Index

Updates and high-speed read-time merging are not enough. Some business scenarios also need multiple sample data streams to be written into the lake concurrently, joined, and backfilled. This relies on the third core feature: the global index. With the global index, a writer can tell whether a record has already been written: if not, it is inserted; if it has, it is updated. Here we drew on Apache Hudi's design; besides an HBase global index, we also support an HFile file index, i.e., directly using HBase's underlying file format as the index and hosting it in Iceberg metadata, which improves performance and concurrency.
Compared with other indexes, the HFile file index reduces the components to operate and maintain, reuses storage resources, and avoids read/write problems under spiky traffic. Looking at the whole write path: when writing, the framework checks the global index to determine which partition and bucket a record should go to; when reading, it performs a read-time merge per bucket and finally reconstructs the full sample. In practice this is mainly used in scenarios such as large-window features and label joining.
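A toy sketch (in-memory Python, not the HBase/HFile implementation) of how a global index drives Upsert routing: the index remembers where each primary key lives, so a writer knows whether to insert a new row or emit an update for the key's existing location. Names and the bucketing scheme are illustrative.

    global_index = {}   # primary_key -> (partition, bucket); HFile/HBase in production

    def upsert(primary_key, partition, columns, num_buckets=8):
        if primary_key in global_index:
            # Key already written: emit an update entry for its home location.
            target = global_index[primary_key]
            return ("update", target, columns)
        # First time seen: insert into the current partition and register it.
        bucket = hash(primary_key) % num_buckets
        global_index[primary_key] = (partition, bucket)
        return ("insert", (partition, bucket), columns)

    print(upsert("req_001", "20240101", {"label": None, "feature_a": 0.3}))
    print(upsert("req_001", "20240102", {"label": 1}))   # label backfill -> update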
 
  • Application: Large-Window Features and Label Joining on the Lake
Business scenarios such as risk control are better suited to joining features and labels over large windows (a month or longer). Doing such large-window joins online requires a lot of machine resources, so we use concurrent Upsert support so that sample dumping, label backfilling, and feature investigation can proceed at the same time, and features and labels can be joined directly on the lower-cost offline Magnus lake.
The core is to give users the option to tolerate concurrent write conflicts, with multiple MOR strategies to meet business needs: first-write-win (keep the first write), last-write-win (keep the last write), append-to-list, or a custom read-time merge that tolerates concurrent Upsert conflicts. For businesses that cannot tolerate concurrency, partition-level and bucket-level optimistic conflict detection is also supported. Meanwhile, data that Upsert writes back into earlier partitions is compacted according to hot/cold status, avoiding the performance loss caused by small files.
Beyond these core features, we have made many optimizations to the Iceberg data lake for massive samples and are gradually contributing the useful pieces back to the community. Here are the key points in brief.
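A minimal sketch of the read-time conflict strategies listed above, for the case where several Upsert writers touched the same column of one row; the resolution function and data are illustrative, not the Magnus merge engine.

    def resolve(versions, strategy):
        # versions: list of (commit_time, value) written for the same key/column.
        ordered = sorted(versions, key=lambda v: v[0])
        if strategy == "first-write-win":
            return ordered[0][1]
        if strategy == "last-write-win":
            return ordered[-1][1]
        if strategy == "append-to-list":
            return [value for _, value in ordered]
        raise ValueError(f"unknown strategy: {strategy}")

    writes = [(1001, "clickA"), (1003, "clickC"), (1002, "clickB")]
    print(resolve(writes, "last-write-win"))   # 'clickC'
    print(resolve(writes, "append-to-list"))   # ['clickA', 'clickB', 'clickC']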
 

Other optimizations and practices

  • Feature elimination
In some cases, physically deleting features that have been merged into the trunk may cause omissions or affect downstream tasks. Logical deletion can instead be achieved by renaming the feature column: since the training side reads by feature name, the renamed column is no longer read. If an algorithm engineer finds that this affects the model, the column can simply be renamed back; after a period with no impact, the feature can be physically deleted safely (see the SQL sketch after this list).
  • Rollback & Undo
When there is a problem with a data source or pipeline and bad features land in the lake, model training quality suffers and the online data flow can break. The common fix is to roll back the problematic snapshot, but rollback also discards snapshots that were written normally afterwards, which may affect downstream training and sample processing. We therefore developed an undo function that reverts a specific snapshot's operation at the metadata level without affecting features written normally afterwards, which is friendlier to downstream tasks.
  • Dirty data skip
With massive samples, dirty data such as lost or corrupted records appears frequently; it is inevitable as data volume grows. We therefore support retries for dirty data, such as retrying on a different node, and skipping only up to a configured percentage.
  • Large metadata optimization
With massive samples, metadata itself becomes "big metadata" and, like big data, needs to be trimmed and optimized. For example, in machine learning scenarios most reads are full scans, so we can drop the large number of column statistics recorded in Iceberg metadata and keep only the necessary ones, such as partition values and primary-key min/max. This effectively shrinks the metadata, especially for wide tables, greatly reduces job planning time, and avoids AM and Driver OOMs.
  • Big metadata speed up
For accelerating big metadata, the traditional approach of processing metadata on a single node is powerless at this scale. Instead, we speed things up by trimming IO and processing the big metadata in a distributed way. Jobs that rely on full-table scans, such as compaction, also use distributed planning to speed up and avoid running out of memory.
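The sketch below illustrates two items from this list, the rename-based logical feature deletion and snapshot rollback, using open-source Iceberg's Spark DDL and procedures (the in-house undo operation is not in the community version). Catalog, table, column, and snapshot names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()   # assumes an Iceberg catalog named `ml`

    # Logical deletion: rename the column so trainers (which read by feature name)
    # stop seeing it; rename it back if a model regression is observed.
    spark.sql("ALTER TABLE ml.db.samples RENAME COLUMN feature_xyz TO feature_xyz_deprecated")
    spark.sql("ALTER TABLE ml.db.samples RENAME COLUMN feature_xyz_deprecated TO feature_xyz")

    # Community-style rollback: reverts the table to an earlier snapshot, which
    # also drops any normal snapshots written after it (the motivation for undo).
    spark.sql("CALL ml.system.rollback_to_snapshot('db.samples', 123456789)")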
 

Summary and Outlook

Summary of this article

This article has introduced the overall architecture, core features, optimizations, and practice of our enhanced version of Iceberg. To briefly summarize:
  1. Significantly reduce sample storage space and cost by promoting the switch to columnar storage formats and by reusing feature data;
  2. Accelerate feature read throughput with the vectorized read-time merge engine, so that data reading does not become a bottleneck in GPU training and compute is fully utilized;
  3. Perform large-scale feature engineering and investigation based on update, Upsert, and branch capabilities, making model iteration more efficient.
 

Recent Thoughts on Large Language Models (LLMs)

We have talked a lot about feature investigation and feature engineering. Will feature engineering become unnecessary in the future? Here are some of my thoughts, based on the recently popular large language models and on public information and papers:
 
At present, large language models are basically built on the Transformer architecture. Although their feature learning ability is already very strong, a tokenizer is still needed to convert text into a form the model understands, and tokenization quality affects the model's results to some extent. Today the tokenization algorithms of different large language models are not the same, so we are still some distance from being fully end-to-end, although most of the process can be automated. There is also new research, such as MegaByte, that tries to make tokenization and the training architecture fully end-to-end, with promising results, but larger-scale validation is still needed. So for anyone who needs to build a large language model from scratch in the near term, tokenization and feature engineering remain unavoidable.
Of course, for cost reasons, many companies and institutions will not build a large language model from scratch. They usually fine-tune an existing model and optimize it for downstream and vertical tasks, so feature engineering is still worth considering. For example: using human feedback to rank and score AI answers so that they align with human preferences and legal and social norms, or adding extra features to help the AI understand the current context and give more appropriate answers. New techniques such as Low-Rank Adaptation (LoRA) greatly reduce the number of parameters that need to be fine-tuned and do not require updating the base model's parameters, so fine-tuning completes faster and, compared with long prompts, the number of input tokens can be greatly reduced, lowering compute costs.
For prompt engineering and in-context learning, there is indeed no need to care about the underlying feature engineering, and no training is needed: the AI can combine context information directly to acquire knowledge and answer. Many industry applications already combine this with vector search over embeddings to supply the context the AI needs to answer questions it was never trained on. This is a brand-new direction, applicable to many scenarios that do not require high accuracy, and it greatly lowers the threshold for AI application development, with great potential. However, context must be provided on every call, and some large language models bill by the number of input and output tokens, so usage costs are relatively high; if fine-tuning is possible, it can save a lot of cost and may also produce better results. Moreover, prompt engineering suits models with hundreds of billions of parameters, whose chain-of-thought and emergent abilities are stronger; for smaller models, fine-tuning can achieve equal or even better performance. And wherever fine-tuning is needed, feature engineering still has a chance to improve the results. Overall, it matches the trend mentioned at the beginning: feature engineering will keep getting simpler, and in the future it will no longer require large amounts of manual time and effort.
 

Future Outlook and Planning

Finally, here are some future prospects for data lakes and sample storage.
  • Lake-stream integration
The first is lake-stream integration. In a lake-stream-integrated architecture, the data lake, message queues, and stream computing are connected, and the computing framework can provide unified management and interfaces over both historical batch data and streaming data, serving low-latency online streaming training as well as high-throughput offline batch training; idle compute in the message queue layer can also be used for data lake management tasks, saving resource costs.
  • Next-generation data formats
Common columnar file formats have relatively few encoding algorithms and mostly support primitive types, while the data in training samples is mostly nested or tensor-valued. We can explore richer encoding algorithms to better optimize the storage and cost of machine learning features, and richer index support to speed up training.
  • Cloud native
Finally, adopting a cloud-native architecture has become a trend and a necessary choice for enterprises; it helps them respond to business changes and market challenges and improves competitiveness and innovation. The enhanced version of Iceberg also provides enterprise-grade capabilities that bring compute and storage benefits, reducing costs and increasing efficiency.
 
The relevant capabilities have been implemented in the fully managed streaming compute Flink product.
The streaming compute Flink service is a fully managed real-time big data compute engine that is fully compatible with the Apache Flink protocol. The product incorporates enterprise-grade, ultra-large-scale production experience and offers out-of-the-box use, minimal SQL development, full observability, maintenance-free operation, Serverless elasticity, low TCO, and high SLA guarantees. A single codebase handles unified stream-batch data processing, helping enterprises upgrade their big data platforms toward cloud-native, real-time, and intelligent.
 