How to train large language models (LLMs) using Databricks, Hugging Face, and MosaicML

Introduction

Large language models such as OpenAI's GPT-4 or Google's PaLM have taken the AI field by storm. However, most companies do not currently have the ability to train these models themselves, and rely entirely on technology provided by a small handful of large tech companies.

At Replit, we have invested heavily in the infrastructure required to train our own large language models from scratch. In this article, we outline how we train LLMs, from raw data to deployment in a user-facing production environment. We'll discuss the engineering challenges we faced along the way, and how we leverage the vendors we believe make up the modern LLM stack: Databricks, Hugging Face, and MosaicML.

While our models are primarily intended for the use case of code generation, the techniques and lessons learned are applicable to all types of LLMs, including general-purpose language models. In the coming weeks and months, we plan to dive deeper into the details of our process.

Why train your own LLMs?

One of the most common questions the AI team at Replit gets asked is “Why train your own models?” There are many reasons a company might decide to train its own LLMs, ranging from data privacy and security to greater control over updates and improvements.

At Replit, our primary considerations are customization, reduced dependency, and cost efficiency.

  • Customization. Training a custom model allows us to tailor it to our specific needs and requirements, including platform-specific capabilities, terminology, and context that are not well covered by general-purpose models such as GPT-4 or code-specific models such as Codex. For example, our models are trained to do a better job with specific web-based languages popular on Replit, including JavaScript React (JSX) and TypeScript React (TSX).
  • Reduced dependency. While we will always use the right model for the task at hand, we believe there are benefits to being less dependent on a small number of AI vendors. This is true not only for Replit but for the wider developer community as well. It's why we plan to open source some of our models, which we could not do without the means to train them.
  • Cost efficiency. Although costs will continue to drop, LLMs remain prohibitively expensive for the global developer community. At Replit, our mission is to bring the next 100 million software creators online. We believe that a student coding on a mobile phone in India should have access to the same AI as a professional developer in Silicon Valley. To make this possible, we train custom models that are smaller, more efficient, and can be hosted at a drastically reduced cost.

Data Pipeline

LLMs require a massive amount of data for training. Training them means building robust data pipelines that are highly optimized and yet flexible enough to easily include new sources of both public and proprietary data.

Data Sources

We begin with The Stack as our primary data source, which is available on Hugging Face. Hugging Face is an excellent resource for datasets and pretrained models. They also provide a variety of useful tools, including tools for tokenization, model inference, and code evaluation, available as part of the Transformers library.
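For readers less familiar with the Hugging Face tooling, here is a minimal sketch of loading a pretrained tokenizer from the Hub and applying it to a snippet of source code. The checkpoint name is purely illustrative and is not the tokenizer we use.

```python
# Minimal sketch: load a pretrained tokenizer from the Hugging Face Hub and
# tokenize a snippet of source code. The checkpoint name is illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")  # illustrative checkpoint

code = "def add(a, b):\n    return a + b\n"
encoded = tokenizer(code, return_tensors="pt")
print(encoded["input_ids"].shape)  # shape (1, number_of_tokens)
```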

The "data source" is provided by the BigCode project. Details on dataset construction can be found in Kocetkov et al. (2022) . After deduplication, the version 1.2 dataset contains about 2.7TB of licensable source code, covering more than 350 programming languages.

The Transformers library does an excellent job of abstracting away many of the challenges associated with model training, including dealing with large-scale data. However, we found it insufficient for our process as we needed additional control over the data and the ability to process it in a distributed fashion.
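To make the distributed-processing requirement concrete, here is an illustrative sketch, not our actual pipeline, of the kind of large-scale filtering one might run with PySpark on Databricks before tokenization; the paths, languages, column names, and thresholds are hypothetical.

```python
# Illustrative sketch (not Replit's actual pipeline): filter and deduplicate raw
# source files at scale with PySpark before tokenization. Paths, languages,
# column names, and thresholds below are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("llm-data-prep").getOrCreate()

raw = spark.read.parquet("/mnt/raw/the_stack/")  # hypothetical storage location

filtered = (
    raw
    .filter(F.col("lang").isin("Python", "JavaScript", "TypeScript"))  # keep target languages
    .filter(F.length("content") < 1_000_000)                           # drop extremely large files
    .filter(F.col("content").rlike(r"\S"))                             # drop empty/whitespace-only files
    .dropDuplicates(["content"])                                       # exact-match dedup as a first pass
)

filtered.write.mode("overwrite").parquet("/mnt/processed/the_stack_filtered/")
```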
