How to Train Your Own Large Language Models

How to Train Large Language Models (LLMs) Using Databricks, Hugging Face, and MosaicML

Introduction

Large language models, such as OpenAI's GPT-4 or Google's PaLM, have taken the AI world by storm. Yet most companies don't currently have the ability to train these models themselves, and rely entirely on a handful of large tech firms as providers of the technology.

At Replit, we've invested heavily in the infrastructure needed to train our own large-scale language models from scratch. In this blog post, we outline how we train LLMs, from raw data to deployment in user-facing production environments. We'll discuss the engineering challenges we faced along the way, and how we leverage the vendors we believe make up the modern LLM stack: Databricks, Hugging Face, and MosaicML.

While our model is primarily intended for the use case of code generation, the techniques and lessons discussed are applicable to all types of LLMs, including general language models. We plan to dive into the nuts and bolts of our process in a series of blog posts over the coming weeks and months.

Why train your own LLMs?

One of the most common questions the AI team at Replit gets asked is "Why do you train your own models?" There are plenty of reasons a company might decide to train its own LLMs, ranging from data privacy and security to increased control over updates and improvements.

At Replit, we care primarily about customization, reduced dependency, and cost efficiency.

  • Customization. Training a custom model allows us to tailor it to our specific needs and requirements, including platform-specific capabilities, terminology, and context that are not well covered by general-purpose models such as GPT-4 or even code-specific models such as Codex. For example, our models are trained to do a better job with specific web-based languages that are popular on Replit, including Javascript React (JSX) and Typescript React (TSX).
  • Reduced dependency. While we will always use the right model for the task at hand, we believe there are benefits to being less dependent on only a handful of AI providers. This is true not just for Replit but for the wider developer community as well. That's why we plan to open source some of our models, which we could not do without the means to train them.
  • Cost efficiency. Although costs will continue to go down, LLMs remain prohibitively expensive for use across the global developer community. At Replit, our mission is to bring the next billion software creators online. We believe that a student programming on a phone in India should have access to the same AI as a professional developer in Silicon Valley. To make this possible, we train custom models that are smaller, more efficient, and can be hosted at a drastically reduced cost.

Data Pipelines

LLMs require an immense amount of data to train. Training them requires building robust data pipelines that are highly optimized and yet flexible enough to easily include new sources of both public and proprietary data.

The Stack

We started with The Stack available on Hugging Face as our primary data source. Hugging Face is a great resource for datasets and pretrained models. They also provide various useful tools as part of the Transformers library, including tools for tokenization, model inference, and code evaluation.

The Stack is made available through the BigCode project; details of the dataset's construction are described in Kocetkov et al. (2022). After deduplication, version 1.2 of the dataset contains approximately 2.7 TB of permissively licensed source code written in more than 350 programming languages.
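
As a rough illustration (not our actual pipeline code), streaming a single-language slice of The Stack through the Hugging Face datasets library might look like the sketch below; the data_dir layout and record fields are assumptions based on the public dataset card.

```python
# Minimal sketch: stream one language slice of The Stack from Hugging Face.
# The dataset path, data_dir layout, and field names are assumptions, not our pipeline code.
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte dataset up front.
stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # assumed per-language directory layout
    split="train",
    streaming=True,
)

for example in stack_python.take(3):
    # Each record carries the raw file contents plus licensing/repository metadata.
    print(len(example["content"]))
```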

The Transformers library does a good job of abstracting away many of the challenges associated with model training, including dealing with large-scale data. However, we found it insufficient for our process as we needed additional control over the data and the ability to process it in a distributed fashion.

Data Processing

When more advanced data processing is required, we use Databricks to build our pipelines. This approach also allows us to easily bring other data sources, such as Replit or Stack Overflow, into our process, which we plan to do in future iterations.

The first step is to download the raw data from Hugging Face. We use Apache Spark to parallelize the dataset building process across each programming language. We then repartition the data and rewrite it in Parquet format with optimized settings for downstream processing.
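
A minimal sketch of this step, assuming the raw files have already been mirrored to cloud storage, might look like the following; the bucket paths, partition counts, and compression settings are illustrative rather than our actual configuration.

```python
# Illustrative PySpark job: rebuild each language's slice and rewrite it as Parquet.
# Paths, partition counts, and the compression codec are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stack-ingest").getOrCreate()

languages = ["python", "javascript", "typescript"]  # in practice, hundreds of languages

for lang in languages:
    raw = spark.read.json(f"s3://raw-bucket/the-stack/{lang}/")  # hypothetical location
    (
        raw.repartition(256)                       # even out skew across file sizes
           .write.mode("overwrite")
           .option("compression", "zstd")          # settings tuned for downstream reads
           .parquet(f"s3://curated-bucket/the-stack/{lang}/")
    )
```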

Next, we turn to cleaning and preprocessing our data. Normally it's important to deduplicate the data and fix various encoding issues, but The Stack has already done this for us using the approximate deduplication technique outlined in Kocetkov et al. (2022). However, once we start bringing Replit data into our pipeline, we will have to re-run the deduplication process. That's the beauty of having a tool like Databricks: we can treat The Stack, Stack Overflow, and Replit data as three sources within a larger data lake, and utilize them as needed in our downstream processes.

Another benefit of using Databricks is that we can run scalable and tractable analytics on the underlying data. We run all types of summary statistics on our data sources, examine long-tailed distributions, and diagnose any problems or inconsistencies in the process. All of this is done in Databricks notebooks, which can also be integrated with MLFlow to track and reproduce all our analysis throughout the process. This step amounts to a regular X-ray of our data and also helps inform the various steps we take for preprocessing.
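
For example, a Databricks-notebook-style cell for per-language summary statistics might look roughly like the sketch below; the column names and storage path are assumptions.

```python
# Rough sketch of per-language summary statistics over the curated Parquet tables.
# Column names ("lang", "content") and the storage path are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided inside a Databricks notebook

files = spark.read.parquet("s3://curated-bucket/the-stack/")  # hypothetical path

stats = (
    files.withColumn("n_chars", F.length("content"))
         .withColumn("n_lines", F.size(F.split("content", "\n")))
         .groupBy("lang")
         .agg(
             F.count("*").alias("files"),
             F.avg("n_chars").alias("avg_chars"),
             F.expr("percentile_approx(n_lines, 0.99)").alias("p99_lines"),  # long-tail check
         )
)
stats.orderBy(F.desc("files")).show(20, truncate=False)
```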

For preprocessing, we take the following steps:

  • We anonymize the data by removing any personally identifiable information (PII), including email addresses, IP addresses, and encryption keys.
  • We use several heuristics to detect and remove auto-generated code.
  • For some languages, we remove code that doesn't compile or parse with standard parsers.
  • We filter out files based on average line length, maximum line length, and percentage of alphanumeric characters (a simplified sketch of a couple of these filters follows this list).
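
Here is that simplified sketch; the regexes and thresholds are placeholders for illustration, not the heuristics we actually ship.

```python
# Simplified, illustrative version of the PII scrubbing and file-level filters above.
# The patterns and thresholds are examples only and differ from our production pipeline.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IP_RE = re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b")

def scrub_pii(text: str) -> str:
    """Replace obvious email addresses and IP addresses with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IP_RE.sub("<IP>", text)

def passes_filters(text: str,
                   max_avg_line_len: int = 100,
                   max_line_len: int = 1000,
                   min_alnum_frac: float = 0.25) -> bool:
    """Drop files with pathological line lengths or too few alphanumeric characters."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    longest = max(len(line) for line in lines)
    alnum_frac = sum(c.isalnum() for c in text) / max(len(text), 1)
    return (avg_len <= max_avg_line_len
            and longest <= max_line_len
            and alnum_frac >= min_alnum_frac)
```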

Tokenization and Vocabulary Training

Before tokenization, we train our own custom vocabulary using a random subsample of the same data we used for model training. Custom vocabularies enable our model to better understand and generate code content. This improves model performance and speeds up model training and inference.

This step is one of the most important in the process, since it's used across all three stages of our workflow (data pipelines, model training, inference). It underscores the importance of having a robust and fully integrated infrastructure for the model training process.

We plan to explore tokenization in more depth in a future blog post. At a high level, some of the important things we have to consider are vocabulary size, special tokens, and reserved space for sentinel tokens.
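
To make those ideas concrete, here's a minimal sketch of training a BPE vocabulary with the Hugging Face tokenizers library on a subsample of files; the vocabulary size, special tokens, sentinel names, and file paths are placeholders rather than our production choices.

```python
# Minimal sketch: train a byte-level BPE vocabulary on a small sample of source files.
# Vocabulary size, special/sentinel tokens, and file paths are illustrative placeholders.
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny stand-in sample files so the sketch is self-contained.
Path("sample_0.py").write_text("def add(a, b):\n    return a + b\n")
Path("sample_1.ts").write_text("const add = (a: number, b: number) => a + b;\n")

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_768,
    special_tokens=[
        "<unk>", "<pad>", "<eos>",
        # reserved sentinel tokens, e.g. for fill-in-the-middle style objectives
        "<fim_prefix>", "<fim_middle>", "<fim_suffix>",
    ],
)

tokenizer.train(["sample_0.py", "sample_1.ts"], trainer)
tokenizer.save("code-tokenizer.json")  # hypothetical output name
```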

Once we've trained our custom vocabulary, we tokenize our data. Finally, we construct our training dataset and write it out to a sharded format optimized for feeding into the model training process.
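
As an example of what that sharded write could look like with MosaicML's streaming library (the column layout, output path, and token packing here are assumptions, not our actual format):

```python
# Sketch: write tokenized samples into a sharded, streaming-friendly format.
# Column layout, output directory, and token packing are illustrative assumptions.
import numpy as np
from streaming import MDSWriter

columns = {"tokens": "bytes"}  # each sample is a packed sequence of token ids

def encode(token_ids):
    # Pack token ids into raw bytes so shards stay compact on disk.
    return np.asarray(token_ids, dtype=np.uint32).tobytes()

with MDSWriter(out="./train_shards", columns=columns, compression="zstd") as writer:
    for token_ids in [[1, 5, 9, 2], [3, 7, 7, 2]]:  # stand-in for the real tokenized corpus
        writer.write({"tokens": encode(token_ids)})
```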

Model Training

We train our models using MosaicML. Having previously deployed our own training clusters, we found that the MosaicML platform gives us a few key advantages.

  • Multiple cloud providers. Mosaic allows us to utilize GPUs from different cloud providers without the overhead of setting up accounts and all the required integrations.
  • LLM training configurations. The Composer library has many well-tuned configurations for training a variety of models and for different types of training objectives; a minimal illustrative sketch of a Composer run follows this list.
  • Hosting infrastructure. Their hosting infrastructure provides us with orchestration, efficiency optimization, and fault tolerance (i.e., recovery from node failures).
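
As promised above, here's a hedged, minimal sketch of what a Composer training run can look like. The toy model, dummy dataloader, and run settings are placeholders for illustration and not our actual configuration.

```python
# Minimal, illustrative Composer training run: a tiny causal LM on random token batches.
# Every setting here (model size, batch size, duration, learning rate) is a placeholder.
import torch
from torch.utils.data import DataLoader, Dataset
from composer import Trainer
from composer.models import HuggingFaceModel
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in tokenizer
config = AutoConfig.from_pretrained("gpt2", n_layer=2)   # deliberately tiny model
model = HuggingFaceModel(AutoModelForCausalLM.from_config(config), tokenizer=tokenizer)

class ToyTokens(Dataset):
    """Stand-in for the real streaming dataset of packed token sequences."""
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        ids = torch.randint(0, tokenizer.vocab_size, (128,))
        return {"input_ids": ids, "labels": ids.clone(), "attention_mask": torch.ones_like(ids)}

trainer = Trainer(
    model=model,
    train_dataloader=DataLoader(ToyTokens(), batch_size=8),
    max_duration="10ba",                                  # 10 batches, just for the sketch
    optimizers=torch.optim.AdamW(model.parameters(), lr=3e-4),
    device="gpu" if torch.cuda.is_available() else "cpu",
)
trainer.fit()
```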

When determining the parameters of our model, we considered various trade-offs among model size, context window, inference time, memory footprint, etc. Larger models generally provide better performance and are more capable of transfer learning. However, these models have higher computational requirements for both training and inference. The latter is especially important to us. Replit is a cloud-native IDE that performs like a desktop-native app, so our code completion model needed to be lightning fast. For this reason, we typically choose smaller models with smaller memory footprints and low-latency inference.

In addition to model parameters, we also choose from a variety of training objectives, each with its own unique advantages and drawbacks. The most common training objective is next-token prediction. This usually works well for code completion, but fails to take into account the context further downstream in a document. This can be mitigated by using a "fill-in-the-middle" objective, where a span of tokens in the document is masked out and the model must predict it using the surrounding context. Another approach is UL2 (Unsupervised Latent Language Learning), which frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input.
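
To illustrate the fill-in-the-middle idea, here's a toy sketch of the document transformation involved; the sentinel token names and span-sampling scheme are simplified placeholders, not a specific published recipe.

```python
# Illustrative fill-in-the-middle (FIM) transformation: cut a random span out of the
# document and move it to the end behind sentinel tokens, so a standard next-token model
# learns to predict the "middle" from both the prefix and the suffix.
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def apply_fim(document: str, rng: random.Random) -> str:
    """Rewrite a document into prefix/suffix/middle order for FIM training."""
    if len(document) < 3:
        return document
    start, end = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    # Training then uses the ordinary next-token objective on this rearranged string,
    # which forces the model to generate `middle` conditioned on both surrounding contexts.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(apply_fim("def add(a, b):\n    return a + b\n", random.Random(0)))
```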

Once we decide on our model configuration and training goals, we launch our training run on a multi-node cluster of GPUs. We can adjust the number of nodes allocated for each run based on the size of the model we are training and how quickly we want the training process to complete. Running large GPU clusters is expensive, so it's important to utilize them in the most efficient way possible. We closely monitor GPU utilization and memory to ensure we are getting the maximum possible usage from our computing resources.

We use Weights & Biases to monitor the training process, including resource utilization and training progress. We monitor our loss curves to ensure the model is learning effectively at each step of the training run. We also watch for loss spikes: sudden increases in the loss value that usually indicate a problem with the underlying training data or model architecture. Because these occurrences often require further investigation and potential adjustments, we enforce data determinism within our process so we can more easily reproduce, diagnose, and resolve the potential source of any such loss spike.
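
A toy sketch of the kind of logging and spike-checking involved might look like the following with the wandb client; the project name, spike heuristic, and simulated loss curve are all made up for the example.

```python
# Toy sketch: stream a loss curve to Weights & Biases and flag suspicious loss spikes.
# The project name, spike heuristic, and simulated losses are illustrative only.
import random
import wandb

def is_spike(history, window=50, factor=2.0):
    """Flag a step whose loss exceeds `factor` times the recent moving average."""
    if len(history) <= window:
        return False
    recent = history[-window - 1:-1]
    return history[-1] > factor * (sum(recent) / len(recent))

run = wandb.init(project="llm-training-demo", mode="offline")  # offline so the sketch runs anywhere

losses = []
simulated = [2.5 * (0.999 ** step) + random.random() * 0.05 for step in range(500)]
simulated[300] *= 3  # inject an artificial spike for demonstration

for step, loss in enumerate(simulated):
    losses.append(loss)
    wandb.log({"train/loss": loss, "train/loss_spike": int(is_spike(losses))}, step=step)

run.finish()
```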

Evaluation

To test our model, we use a variant of the HumanEval framework described in Chen et al. (2021). Given a function signature and docstring, we use the model to generate a piece of Python code. We then run test cases against the generated functions to determine whether the generated code blocks work as expected. We run multiple samples and analyze the corresponding Pass@K numbers.
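
For reference, the unbiased pass@k estimator from Chen et al. (2021) can be computed as in the short sketch below; the sample counts in the example are invented for illustration.

```python
# The pass@k estimator from Chen et al. (2021): given n samples per problem, of which c
# pass the tests, estimate the probability that at least one of k drawn samples passes.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k), computed in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 13 of which pass the unit tests (made-up numbers).
print(round(pass_at_k(n=200, c=13, k=1), 4), round(pass_at_k(n=200, c=13, k=10), 4))
```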

This approach works best for Python, where evaluators and test cases are readily available. But because Replit supports many programming languages, we need to evaluate model performance across a wide range of additional languages. We've found that this is hard to do, and there are no widely adopted tools or frameworks that offer a comprehensive solution. Two specific challenges are building reproducible runtime environments in any programming language, and the ambiguity of programming languages that lack widely used standards for test cases (e.g., HTML, CSS, etc.). Fortunately, "reproducible runtime environments in any programming language" is kind of Replit's specialty! We're currently building an evaluation framework that will allow any researcher to plug in and test their multi-language benchmarks.

Deploying to Production

Once we've trained and evaluated our model, it's time to deploy it into production. As we mentioned before, our code completion models should feel fast, with very low latency between requests. We accelerate our inference process using NVIDIA's FasterTransformer and Triton Server. FasterTransformer is a library implementing an accelerated engine for the inference of transformer-based neural networks, and Triton is a stable and fast inference server that is easy to configure. This combination gives us a highly optimized layer between the transformer model and the underlying GPU hardware, and allows for ultra-fast distributed inference of large models.
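
For a sense of what serving looks like from a client's perspective, here's a hedged sketch using Triton's Python HTTP client; the server address, model name, and input/output tensor names depend entirely on the deployed model configuration and are placeholders here.

```python
# Hedged sketch: query a Triton Inference Server over HTTP from Python.
# The model name and input/output tensor names are placeholders that depend on the
# deployed model's configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 2, 3, 4]], dtype=np.int32)                # tokenized prompt (stand-in)
request = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
request.set_data_from_numpy(input_ids)

response = client.infer(model_name="code-completion", inputs=[request])  # hypothetical name
output_ids = response.as_numpy("output_ids")                        # name depends on model config
print(output_ids)
```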

After deploying our model into production, we're able to autoscale it to meet demand using our Kubernetes infrastructure. Although we've discussed autoscaling in previous blog posts, it's worth mentioning that hosting an inference server comes with a unique set of challenges. These include large artifacts (i.e., model weights) and special hardware requirements (i.e., varying GPU sizes/counts). We design our deployment and cluster configurations so that we're able to ship quickly and reliably. For example, our clusters are designed to work around GPU shortages in individual regions and to look for the cheapest available nodes.

We like to test the model ourselves and get a sense of its "vibes" before putting it in front of actual users. The HumanEval test results we computed earlier are useful, but there's nothing like working with a model to get a feel for it, including its latency, the consistency of its suggestions, and its general helpfulness. Putting a model in front of Replit staff is as easy as flipping a switch. Once we're comfortable with it, we flip another switch and roll it out to the rest of our users.

We will continue to monitor model performance and usage metrics. For model performance, we monitor metrics such as request latency and GPU utilization. For usage, we track acceptance rates for code suggestions and break them down into multiple dimensions including programming language. This also allows us to A/B test different models and get a quantitative measure of how one model compares to another.
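
As a simple illustration of that kind of comparison, a two-proportion z-test over acceptance counts might look like the sketch below; the counts are invented for the example.

```python
# Illustrative two-proportion z-test comparing suggestion acceptance rates between two
# models in an A/B test. The counts below are made up for the example.
from math import sqrt
from scipy.stats import norm

def acceptance_z_test(accepted_a, shown_a, accepted_b, shown_b):
    """Return the z statistic and two-sided p-value for a difference in acceptance rates."""
    p_a, p_b = accepted_a / shown_a, accepted_b / shown_b
    pooled = (accepted_a + accepted_b) / (shown_a + shown_b)
    se = sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - norm.cdf(abs(z)))

z, p = acceptance_z_test(accepted_a=1800, shown_a=10000, accepted_b=1925, shown_b=10000)
print(f"z={z:.2f}, p={p:.4f}")
```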

Feedback and Iteration

Our model training platform enables us to turn raw data into production-deployed models in less than a day. But more importantly, it allows us to train and deploy models, gather feedback, and then iterate quickly based on that feedback.

It is also important for our process to remain robust to any changes in the underlying data source, model training target, or server architecture. This allows us to take advantage of new advancements and features in a rapidly evolving field where there seem to be new and exciting announcements every day.

Next, we'll extend our platform so that we can use Replit itself to improve our models. This includes techniques such as Reinforcement Learning from Human Feedback (RLHF) and instruction tuning using data collected from Replit Bounties.

Next Steps

While we've come a long way, we're still in the very early days of training LLMs. We have a lot of improvements to make, and a lot of hard problems left to solve. This will only accelerate as language models continue to improve, bringing with it a new set of challenges around data, algorithms, and model evaluation.

If you're excited about the many engineering challenges of training an LLM, we'd love to talk with you. We love feedback and would love to hear from you about what we're missing and what you might do differently.
