How to train your own Large Language Models
How to train large language models (LLMs) using Databricks, Hugging Face, and MosaicML
Introduction
Large language models such as OpenAI's GPT-4 or Google's PaLM have taken the AI field by storm. However, most companies currently do not have the capacity to train these models and rely entirely on technology provided by only a few large tech companies.
At Replit, we have invested heavily in the infrastructure required to train our own large language models from scratch. In this article, we outline how we train LLMs, from raw data to deployment in a user-facing production environment. We'll discuss the engineering challenges we faced along the way, and how we leverage the vendors that we believe make up the modern LLM stack: Databricks, Hugging Face, and MosaicML.
While our model is primarily intended for the use case of code generation, the techniques and lessons learned are applicable to all types of LLMs, including general language models. In the coming weeks and months, we plan to dive into the details of our process.
Why train your own LLMs?
One of the most common questions the AI team at Replit gets asked is "Why train your own models?" There are many reasons why a company might decide to train its own LLMs, ranging from data privacy and security to more control over updates and improvements.
At Replit, we care primarily about customization, reduced dependence, and cost efficiency.
- Customization. Training a custom model allows us to tailor it to our specific needs and requirements, including platform-specific capabilities, terminology, and context that are not well covered by general-purpose models like GPT-4 or code-only models like Codex. For example, our models are trained to better handle specific web-based languages that are popular on Replit, including Javascript React (JSX) and Typescript React (TSX).
- Reduced dependence. While we will always use the right model for our needs, we believe there are benefits to being less dependent on only a handful of AI providers. This is true not just for Replit but for the broader developer community. It's why we plan to open source some of our models, which we could not do without the means to train them.
- Cost efficiency. Although costs will continue to go down, LLMs are still prohibitively expensive for the global developer community. At Replit, our mission is to bring the next billion software creators online. We believe that a student coding on a mobile phone in India should have access to the same AI as a professional developer in Silicon Valley. To make this possible, we train custom models that are smaller, more efficient, and can be hosted at drastically reduced cost.
Data Pipelines
LLMs require an immense amount of data for training. Training them requires building robust data pipelines that are highly optimized and yet flexible enough to easily include new sources of both public and proprietary data.
Data Sources
We start with The Stack as our primary data source, which is available on Hugging Face. Hugging Face is an excellent resource for datasets and pre-trained models. They also provide a variety of useful tools, including tools for tokenization, model inference, and code evaluation, available as part of the Transformers library.
The Stack is provided by the BigCode project. Details of the dataset's construction can be found in Kocetkov et al. (2022). After deduplication, version 1.2 of the dataset contains about 2.7TB of permissively licensed source code covering more than 350 programming languages.
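For a sense of what working with this data looks like, a single language subset of The Stack can be streamed from the Hugging Face Hub with the datasets library. This is a minimal sketch rather than our pipeline: the dataset id and data_dir below follow the public BigCode release, and access to the dataset on the Hub is gated, so you may need to authenticate first.

```python
from itertools import islice
from datasets import load_dataset

# Stream a single language subset rather than downloading the full ~2.7TB corpus.
# The dataset id and data_dir follow the public BigCode release; adjust the
# language directory to whichever subset you want to inspect.
stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

# Peek at a few records to see the available fields (file contents, repo metadata, etc.).
for example in islice(stack_python, 3):
    print(sorted(example.keys()))
```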
The Transformers library does an excellent job of abstracting away many of the challenges associated with model training, including dealing with large-scale data. However, we found it insufficient for our process as we needed additional control over the data and the ability to process it in a distributed fashion.
Data Processing
When more advanced data processing is required, we use Databricks to build data pipelines. This approach also allows us to easily bring other data sources (such as Replit or Stack Overflow) into our processing, which is something we plan to do in future iterations.
First, we need to download the raw data from Hugging Face. We use Apache Spark to parallelize the dataset builder process across each programming language. We then repartition the data and rewrite it in Parquet format with optimized settings for downstream processing.
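A minimal sketch of that repartitioning step is shown below, assuming the raw files for each language have already been landed in storage; the paths, language list, and partition counts are illustrative rather than our actual configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stack-ingest").getOrCreate()

# Illustrative subset of languages; the real pipeline covers hundreds.
languages = ["python", "javascript", "typescript"]

for lang in languages:
    raw = spark.read.parquet(f"/raw/the-stack/{lang}")  # hypothetical landing path
    (
        raw
        .repartition(256)                    # even out file sizes for parallel downstream reads
        .write
        .mode("overwrite")
        .option("compression", "zstd")       # smaller files, fast decompression
        .parquet(f"/clean/the-stack/{lang}")  # hypothetical output path
    )
```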
Next, we turn to cleaning and preprocessing the data. Typically it's important to deduplicate the data and fix various encoding issues, but The Stack has already done this for us using the near-deduplication technique described in Kocetkov et al. (2022). We will, however, have to re-run the deduplication process once we start ingesting Replit data into our pipeline. That's the beauty of using a tool like Databricks, where we can treat The Stack, Stack Overflow, and Replit data as three sources in a larger data lake and use them in downstream processing as needed.
Another benefit of using Databricks is that we can run scalable and traceable analytics on the underlying data. We run all kinds of summary statistics on our data sources, check long-tail distributions, and diagnose any issues or inconsistencies in the process. All of this happens in Databricks notebooks, which also integrate with MLflow to track and reproduce all of our analyses along the way. This step, which amounts to a regular x-ray of our data, also helps inform the various steps we take for preprocessing.
For preprocessing, we take the following steps (a simplified filtering sketch follows this list):
- We anonymize the data by removing any personally identifiable information (PII), including email addresses, IP addresses, and encryption keys.
- We use some heuristics to detect and remove auto-generated code.
- For some subsets of languages, we remove code that doesn't compile or can't be parsed with a standard parser.
- We filter files based on average line length, maximum line length, and percentage containing alphanumeric characters.
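The last item in the list can be illustrated with a small helper. The thresholds below are placeholders for illustration; in practice they are tuned per language.

```python
# Hypothetical thresholds; production values are tuned per language.
MAX_AVG_LINE_LENGTH = 100
MAX_LINE_LENGTH = 1000
MIN_ALNUM_FRACTION = 0.25

def passes_quality_filters(source: str) -> bool:
    """Apply the average/maximum line-length and alphanumeric-fraction filters."""
    lines = source.splitlines()
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_fraction = sum(ch.isalnum() for ch in source) / max(len(source), 1)
    return (
        avg_len <= MAX_AVG_LINE_LENGTH
        and max_len <= MAX_LINE_LENGTH
        and alnum_fraction >= MIN_ALNUM_FRACTION
    )
```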
Tokenization and vocabulary training
Prior to tokenization, we train our own custom vocabulary using a random subsample of the same data we use for model training. A custom vocabulary enables our model to better understand and generate code content, improves model performance, and speeds up model training and inference.
This step is one of the most important in the whole process, as it is used in all three stages of our process (data pipeline, model training, and inference). This highlights the importance of having a robust and complete infrastructure for the model training process.
We plan to explore tokenization in depth in a future blog post. At a high level, some important things we need to consider are vocabulary size, special tokens, and space reserved for sentinel tokens.
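As a rough sketch of what this step can look like, the Hugging Face tokenizers library can train a byte-level BPE vocabulary over the subsampled shards. The vocabulary size and the sentinel token names below are placeholders, not our production values.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE keeps the vocabulary robust to arbitrary source code bytes.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_768,  # placeholder size
    special_tokens=[
        "<pad>", "<eos>",
        "<fim_prefix>", "<fim_middle>", "<fim_suffix>",  # reserved sentinel tokens
    ],
)

# `files` points at the randomly subsampled text shards from the data pipeline.
files = ["subsample/part-00000.txt", "subsample/part-00001.txt"]
tokenizer.train(files, trainer)
tokenizer.save("custom-code-vocab.json")
```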
Once we have trained our custom vocabulary, we tokenize the data. Finally, we build our training dataset and write it into shards for optimal input into the model training process.
Model Training
We train our model using MosaicML . After previously deploying our own training clusters, we found that the MosaicML platform provided us with several key benefits.
- Multiple cloud providers. Mosaic allows us to leverage GPUs from different cloud providers without the overhead of setting up accounts and all of the required integrations.
- LLM training configurations. The Composer library has a number of well-tuned configurations for training a variety of models and for different types of training objectives (a minimal sketch follows this list).
- Managed infrastructure. Their managed infrastructure provides us with orchestration, efficiency optimizations, and fault tolerance (i.e., recovery from node failures).
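To give a flavor of what a Composer-based training entry point looks like, here is a deliberately tiny sketch. The model, tokenizer, dataset, and duration are all placeholders, and the exact Composer APIs depend on the library version; real runs are launched on the MosaicML platform with far more configuration (parallelism strategy, optimized kernels, checkpointing, and so on).

```python
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer import Trainer
from composer.models import HuggingFaceModel

class ToyCodeDataset(Dataset):
    """Stand-in for the tokenized, sharded training data from the pipeline."""
    def __init__(self, tokenizer, texts, max_len=128):
        self.encoded = [
            tokenizer(t, truncation=True, max_length=max_len,
                      padding="max_length", return_tensors="pt")
            for t in texts
        ]
    def __len__(self):
        return len(self.encoded)
    def __getitem__(self, idx):
        item = {k: v.squeeze(0) for k, v in self.encoded[idx].items()}
        item["labels"] = item["input_ids"].clone()  # next-token prediction target
        return item

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model/tokenizer
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = ToyCodeDataset(tokenizer, ["def add(a, b):\n    return a + b\n"] * 8)
train_dl = DataLoader(dataset, batch_size=4)

trainer = Trainer(
    model=HuggingFaceModel(model, tokenizer=tokenizer),
    train_dataloader=train_dl,
    max_duration="1ep",  # placeholder duration
)
trainer.fit()
```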
When determining the parameters of our model, we consider a variety of trade-offs between model size, context window, inference time, memory footprint, and more. Larger models typically offer better performance and are more capable of transfer learning. Yet these models have higher computational requirements for both training and inference. The latter is especially important to us: Replit is a cloud-native IDE whose performance feels like a desktop-native application, so our code completion model needs to be very fast. For this reason, we typically err on the side of smaller models with a small memory footprint and low-latency inference.
In addition to the model parameters, we also choose from a variety of training objectives, each with its own unique advantages and drawbacks. The most common training objective is next-token prediction. This usually works well for code completion, but it fails to take into account the context further downstream in a document. This can be mitigated by using a "fill-in-the-middle" objective, where a span of tokens in a document is masked out and the model must predict it using the surrounding context. Yet another approach is UL2 (Unifying Language Learning), which frames different language model training objectives as denoising tasks, where the model has to recover missing sub-sequences of a given input.
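As a rough illustration of the fill-in-the-middle idea, a document can be split into prefix, middle, and suffix spans and re-ordered with sentinel tokens so the model learns to generate the occluded middle given both sides. The sentinel names below are placeholders rather than our actual vocabulary.

```python
import random

# Placeholder sentinel strings; in practice these are reserved tokens in the custom vocabulary.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Re-order a document so the model must predict an occluded middle span."""
    if len(document) < 2:
        return document
    a, b = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # Prefix and suffix are provided as context; the middle is the prediction target.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
print(to_fim_example("def add(a, b):\n    return a + b\n", rng))
```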
Once we decide on the model configuration and training goals, we start the training run on the multi-node GPU cluster. We can adjust the number of nodes allocated for each run depending on the size of the model to be trained and how quickly the training process completes. Running large GPU clusters is expensive, so it's important to make sure they're being utilized in the most efficient way possible. We closely monitor GPU utilization and memory to ensure that our computing resources are getting the best possible utilization.
We use Weights & Biases to monitor the training process, including resource utilization and training progress. We monitor our loss curves to ensure the model is learning effectively at each step of the training process. We also watch for loss spikes. These are sudden increases in the loss value that usually indicate a problem with the underlying training data or the model architecture. Because these occurrences often require further investigation and potential adjustments, we enforce data determinism within our process so we can more easily reproduce, diagnose, and resolve the potential source of any such loss spike.
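A toy sketch of what this kind of monitoring can look like: log the loss to Weights & Biases on every step and flag any step whose loss jumps well above a trailing average. The project name, window size, and spike threshold are illustrative assumptions.

```python
import collections
import wandb

run = wandb.init(project="llm-training")  # hypothetical project name

window = collections.deque(maxlen=100)    # trailing window of recent loss values
SPIKE_FACTOR = 2.0                        # illustrative spike threshold

def log_step(step: int, loss: float) -> None:
    """Log the training loss and flag sudden spikes relative to the trailing mean."""
    spike = bool(window) and loss > SPIKE_FACTOR * (sum(window) / len(window))
    wandb.log({"train/loss": loss, "train/loss_spike": int(spike)}, step=step)
    if spike:
        # Spikes are worth investigating against the exact batch that produced them,
        # which is where deterministic data ordering pays off.
        print(f"loss spike at step {step}: {loss:.3f}")
    window.append(loss)
```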
Evaluation
To test our model, we use a variant similar to the HumanEval framework described in Chen et al. (2021). We use the model to generate a block of Python code given a function signature and docstring. We then run a test case against the generated function to determine whether the generated code block works as expected. We run multiple samples and analyze the corresponding Pass@K numbers.
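For reference, the unbiased pass@k estimator from Chen et al. (2021), for n generated samples per problem of which c pass the tests, can be computed as follows.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021): n samples, c of them correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 passing, estimate pass@10.
print(pass_at_k(200, 37, 10))
```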
This approach works best with Python because of the ready-made evaluators and test cases. But since Replit supports many programming languages, we need to evaluate the performance of the model in a wider range of programming languages. We found this difficult to do, and no widely adopted tool or framework provided a comprehensive solution. Two specific challenges include producing reproducible runtime environments in any programming language, and ambiguity for programming languages that do not have widely used standards for test cases (e.g., HTML, CSS, etc.). Fortunately, "generating reproducible runtime environments in any programming language" is our area of expertise at Replit! We are building an evaluation framework that enables any researcher to plug in and test their multilingual benchmarks. We will discuss this issue in a future blog post.
Deploy to production
Once we've trained and evaluated our model, it's time to deploy it to production. As mentioned earlier, our code completion model should feel fast, with very low latency between requests. We use NVIDIA's FasterTransformer and Triton Server to accelerate our inference process. FasterTransformer is a library that implements an accelerator engine for inference of transformer-based neural networks, and Triton is a stable and fast inference server with simple configuration. This combination gives us a highly optimized layer that sits between the Transformer model and the underlying GPU hardware, enabling ultrafast distributed inference of large models.
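From the client side, querying a model served this way looks roughly like the sketch below using Triton's HTTP client. The model name and the input/output tensor names are placeholders; the real names and dtypes are defined by the model's Triton configuration for the FasterTransformer backend.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder tensor name, shape, and dtype; the real ones come from the model config.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)
inputs = [httpclient.InferInput("input_ids", list(input_ids.shape), "UINT32")]
inputs[0].set_data_from_numpy(input_ids)

# "code-completion" is a hypothetical model name registered with Triton.
result = client.infer(model_name="code-completion", inputs=inputs)
print(result.as_numpy("output_ids"))  # placeholder output tensor name
```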
After our model is deployed to production, we can autoscale it with our Kubernetes infrastructure to meet demand. Though we've discussed autoscaling in previous blog posts, it's worth mentioning that hosting an inference server comes with a unique set of challenges. These include large artifacts (i.e., model weights) and special hardware requirements (i.e., varying GPU sizes and counts). We design our deployment and cluster configurations so we can ship quickly and reliably. For example, our clusters are designed to work around GPU shortages in individual regions and to look for the cheapest available nodes.
Before we put a model in front of actual users, we like to test it ourselves and get a feel for the model's "vibes". The HumanEval test results we computed earlier are useful, but there's nothing like working with a model yourself to get a feel for it, including its latency, consistency of suggestions, and overall helpfulness. Putting the model in front of Replit staff is as easy as flipping a switch. Once we're comfortable with it, we flip another switch and roll it out to the rest of our users.
We continue to monitor model performance and usage metrics. For model performance, we monitor metrics such as request latency and GPU utilization. For usage, we track acceptance rates for code suggestions and segment them by several dimensions, including programming language. This also allows us to A/B test different models and get a quantitative measure of how one model compares to another.
Feedback and Iteration
Our model training platform allows us to train a production-deployable model from raw data in less than a day. But more importantly, it allows us to train and deploy models, gather feedback, and iterate quickly based on that feedback.
Our process also needs to be robust to any changes in the underlying data sources, model training objectives, or model server architecture. This allows us to take advantage of the new and exciting advances and capabilities brought to this rapidly evolving field every day.
Next, we will extend our platform to enable us to use Replit itself to improve our models. This includes techniques such as reinforcement learning from human feedback (RLHF), as well as instruction tuning using data collected from Replit Bounties.
What's next
While we've made a lot of progress, we're still in the very early days of training LLMs. We have tons of improvements to make, and many difficult problems left to solve. These challenges will only intensify as language models continue to advance, with new problems around data, algorithms, and model evaluation appearing all the time.
If you're excited by the many engineering challenges of training LLMs, we'd love to speak with you. We welcome feedback and would love to hear your thoughts on what we're missing and what you would do differently.
Our Replit AI team is always looking for talented engineers, researchers and architects. Be sure to check out our careers page for open positions. If you don't find a role that fits you, but think you can contribute, please get in touch; we'd love to hear from you.