Publicly available numerical data for LLM

This repository contains publicly available numerical data used to train OpenAI's large language models, processed into OpenAI's data pipeline format. We also provide a Python script for converting raw tabular data into a format suitable for training.

Data Sources

These data come from the following publicly available sources:

Data Format

Data is stored as JSON files. Each file contains an array named data, whose elements are dictionaries with two keys:

  • input: The input text used to train the model, usually a question or description.
  • output: The expected output of the model, usually a short answer or numeric value.
{
    "data": [
        {
            "input": "What was the average price of a gallon of regular gasoline in the United States in 2019?",
            "output": "2.60"
        },
        {
            "input": "What is the distance from Earth to Mars in kilometers?",
            "output": "225,000,000"
        },
        ...
    ]
}
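
As a quick sanity check, these files can be read with Python's standard json module. A minimal sketch, assuming a file named data.json in the format above:

import json

# Load one data file and iterate over its input/output pairs.
# "data.json" is a placeholder; substitute an actual file from the repository.
with open("data.json", encoding="utf-8") as f:
    records = json.load(f)["data"]

for example in records:
    print(example["input"], "->", example["output"])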

How to Use the Data

To train a model on this data, you need to convert it into a format suitable for your training framework. The provided Python script shows how to convert raw tabular data into the format described above; use it as a reference for processing the data and adapt it to your needs.
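
A minimal sketch of such a conversion, assuming a CSV input with hypothetical "question" and "answer" columns (the repository's actual script and column names may differ):

import csv
import json

# Convert raw tabular data (CSV) into the JSON training format described above.
# "raw.csv", "question", and "answer" are hypothetical names used for illustration.
def convert(csv_path: str, json_path: str) -> None:
    data = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            data.append({"input": row["question"], "output": row["answer"]})
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"data": data}, f, ensure_ascii=False, indent=4)

convert("raw.csv", "data.json")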

Numbers LLM Developers Should Know

At Google, the legendary engineer Jeff Dean put together a document called "Numbers Every Engineer Should Know." For large language model (LLM) developers, it is useful to have a similar set of numbers for rough calculations. Here we share some of the specific numbers used at Anyscale, explain why they matter, and show how to put them to use.

Contents

  • CPU Clock Cycles
  • Memory Access Latency
  • Disk Latency
  • Network Latency
  • FLOPs and AI Training

CPU Clock Cycles

  • One CPU clock cycle takes approximately 0.4 nanoseconds (ns), which corresponds to a clock frequency of about 2.5 GHz.
    The clock cycle is a key metric of CPU performance. Knowing how long a cycle takes helps you understand performance bottlenecks when designing and optimizing algorithms.
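
The cycle time is simply the reciprocal of the clock frequency; a quick sketch of the arithmetic:

# Cycle time is the reciprocal of clock frequency: 2.5 GHz -> 0.4 ns per cycle.
frequency_hz = 2.5e9
cycle_time_ns = 1e9 / frequency_hz
print(f"{cycle_time_ns:.1f} ns per cycle")  # prints: 0.4 ns per cycle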

Memory Access Latency

  • Reading data from L1 cache takes about 0.5 nanoseconds.
  • Reading data from L2 cache takes about 7 nanoseconds.
  • Reading data from L3 cache takes roughly 20-40 nanoseconds.
  • Reading data from main memory takes about 100 nanoseconds.
    When the CPU needs data, it first checks the cache hierarchy (L1, then L2, then L3). If the data is not in any cache, the CPU must go to main memory. Understanding the latency of each cache level and of main memory is critical to identifying and fixing performance bottlenecks.
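
A back-of-the-envelope sketch of how these latencies combine into an average access time; the latencies come from the list above, while the hit-rate split is an illustrative assumption:

# Average memory access time, weighted by where accesses are served.
# Latencies (ns) follow the list above; the hit-rate split is a made-up example.
latency_ns = {"L1": 0.5, "L2": 7, "L3": 30, "RAM": 100}
served_by = {"L1": 0.90, "L2": 0.06, "L3": 0.03, "RAM": 0.01}

average_ns = sum(latency_ns[level] * served_by[level] for level in latency_ns)
print(f"average access time: {average_ns:.2f} ns")  # ~2.77 ns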

Disk Latency

  • Reading data from a solid-state drive (SSD) takes approximately 20-100 microseconds (µs).
  • Reading data from a traditional hard disk drive (HDD) takes approximately 1-10 milliseconds (ms).
    Disk latency is the time it takes to read or write data on disk. Knowing these latencies helps you identify storage bottlenecks when processing large amounts of data.
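
To get a feel for the gap, a rough sketch comparing 10,000 random reads on SSD versus HDD, using midpoints of the ranges above:

# Rough total time for 10,000 random reads, using midpoint latencies from above.
reads = 10_000
ssd_us = 60    # midpoint of the 20-100 microsecond range
hdd_ms = 5.5   # midpoint of the 1-10 millisecond range

print(f"SSD: {reads * ssd_us / 1e6:.1f} s")  # ~0.6 s
print(f"HDD: {reads * hdd_ms / 1e3:.1f} s")  # ~55.0 s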

Network Latency

  • The round-trip latency (RTT) within the same data center is approximately 0.5 milliseconds.
  • A round trip over an intercontinental fiber-optic link takes about 150 milliseconds.
    Network latency is the time it takes for data to travel across a network. Understanding it helps you predict performance when building distributed systems and optimizing network communication.
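
A small sketch showing why sequential round trips dominate chatty protocols, using the RTT numbers above:

# Total latency for a request that requires several sequential round trips.
def total_latency_ms(round_trips: int, rtt_ms: float) -> float:
    return round_trips * rtt_ms

# Same data center (0.5 ms RTT) vs. intercontinental link (150 ms RTT).
for rtt_ms in (0.5, 150.0):
    print(f"RTT {rtt_ms} ms x 5 round trips = {total_latency_ms(5, rtt_ms)} ms")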

FLOPs and AI Training

  • An NVIDIA A100 GPU can deliver approximately 312 TFLOPS (trillion floating-point operations per second) at 16-bit precision.
  • Training a GPT-3 model requires about 3.14 * 10^23 floating point operations.
    FLOPs (a count of floating-point operations) and FLOPS (floating-point operations per second) are common measures of workload size and processor throughput, especially in AI training and high-performance computing. Knowing your processor's FLOPS and the FLOPs required to train a model lets you estimate training time and hardware requirements.
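
These two numbers allow a back-of-the-envelope training-time estimate. A sketch, where the GPU count and utilization are illustrative assumptions (real-world utilization varies widely):

# Training time ~ total FLOPs / (per-GPU FLOPS * GPU count * utilization).
total_flops = 3.14e23   # FLOPs to train GPT-3 (from above)
gpu_flops = 312e12      # A100 peak throughput in FLOPS (from above)
num_gpus = 1024         # illustrative assumption
utilization = 0.3       # illustrative assumption

seconds = total_flops / (gpu_flops * num_gpus * utilization)
print(f"~{seconds / 86400:.0f} days")  # roughly 38 days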

License

This data is released under the CC0 1.0 license. You are free to copy, modify, distribute, and use it without obtaining permission or paying a fee. However, we encourage you to cite this repository so that others can find these resources.

Project Address

https://github.com/ray-project/llm-numbers
