LLM Large Models 1: Basic Knowledge

First of all, after a model is trained, what you have is the original version. At this stage the model is at its largest and can only be run through transformers. It is also the most compatible version: basically any machine that can run transformers can run it. transformers (the Hugging Face library) is the most common model framework in the AI world.
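
As a minimal sketch, this is what running an original-format model through transformers looks like (the model ID below is a made-up placeholder, not a specific recommendation):

    # Minimal sketch: loading an original (unquantized) model with transformers.
    # "some-org/some-7b-model" is a hypothetical repo ID; substitute a real one.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-org/some-7b-model"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # needs the accelerate package; spreads layers over GPU/CPU
        torch_dtype="auto",  # keep the checkpoint's native precision
    )

    prompt = "Hello, how are you?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
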
Secondly, because the original model is large and slow, people have devised ways to shrink it without noticeably affecting its quality: quantization. The most common quantization formats at the moment are GPTQ and GGML. We generally use quantized models because they need far less VRAM or RAM. For example, an unquantized 33B model needs roughly 50-65 GB of VRAM or RAM (33 billion parameters × 2 bytes per fp16 weight is already about 62 GiB), while after quantization 24 GB is enough: the loaded model itself takes up somewhere in the 10-19 GB range, and the remaining space is used for inference, which is entirely sufficient.

  • Model types and loaders (loading sketches follow the table):

    Model format     Loader                                         Identifying features
    original model   transformers                                   The model directory consists of multiple consecutively numbered files, e.g. 001-of-008, 002-of-008, 003-of-008
    GPTQ             AutoGPTQ, ExLlama, ExLlama_HF, GPTQ-for-LLaMa
    GGML             llama.cpp                                      The model name contains GGML and there is only a single file, with a .bin extension
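
    A hedged sketch of the two quantized-loader paths from the table; the repo ID and file path are made-up placeholders, and GPU settings depend on your build:

        # Sketch: loading a GPTQ checkpoint with AutoGPTQ.
        # "some-org/some-33B-GPTQ" is a hypothetical repo ID.
        from auto_gptq import AutoGPTQForCausalLM

        gptq_model = AutoGPTQForCausalLM.from_quantized(
            "some-org/some-33B-GPTQ",
            device="cuda:0",  # pair with a transformers tokenizer for generation
        )

        # Sketch: loading a GGML .bin file with llama-cpp-python (llama.cpp bindings).
        from llama_cpp import Llama

        ggml_model = Llama(
            model_path="./models/example-33b.ggmlv3.q4_0.bin",  # hypothetical path
            n_ctx=2048,        # context window
            n_gpu_layers=32,   # offload layers to GPU if built with CUDA support
        )
        out = ggml_model("Q: What is quantization? A:", max_tokens=32)
        print(out["choices"][0]["text"])
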
  • Naming conventions (a small parsing sketch follows this list):

    7B, 13B, 33B, 65B, 170B   Parameter count; 1B = 1 billion parameters
    fp16                      fp16-precision version, generally a smaller rendition of the original model used as the basis for quantization
    8K                        Model with an 8K context length
    4bit                      4-bit quantized model, generally used to save VRAM or RAM
    128g                      Quantized with a groupsize of 128 (g = groupsize)
    gpt4                      Calibrated with GPT-4 data, i.e. further trained on GPT-4 output, usually to strengthen particular abilities; currently the most common is chat enhancement
    Chat                      Chat-enhanced
    QLoRA                     QLoRA fine-tuned version
    LoRA                      LoRA fine-tuned version
    Uncensored                Uncensored version (built-in censorship removed)
    NSFW                      An uncensored version enhanced for NSFW content
    OPT                       OPT format; not a LLaMA-series model but Meta's OPT family, which the KoboldAI community builds its own series on, originally for writing; their data is uncensored and NSFW-enhanced
    SuperHOT                  Extended-context-length version
    SuperCOT                  A LoRA that strengthens the model's logic, generally used for writing
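
    Illustrative only: these tags can be read straight off the hyphen-separated model name; both the example name and the meanings table below are assembled from the list above:

        # Decode a (made-up) community model name using the tags listed above.
        TAG_MEANINGS = {
            "33B": "33 billion parameters",
            "SuperHOT": "extended context length",
            "8K": "8K context window",
            "GPTQ": "GPTQ-quantized",
            "4bit": "4-bit quantized weights",
            "128g": "groupsize 128",
        }

        name = "Example-33B-SuperHOT-8K-GPTQ-4bit-128g"  # hypothetical name
        for tag in name.split("-")[1:]:
            print(f"{tag:10s} {TAG_MEANINGS.get(tag, 'unknown tag')}")
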
  • Resource occupation (a back-of-envelope check follows the table):

    Model   RAM required (original)   RAM required after 4-bit quantization
    7B      13 GB                     3.9 GB
    13B     24 GB                     7.8 GB
    30B     60 GB                     19.5 GB
    65B     120 GB                    38.5 GB
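
    These figures follow roughly from parameter count × bits per weight; a back-of-envelope sketch (the ~4.8 bits/weight effective rate for 4-bit formats is my approximation for group-scale overhead, not an official number):

        # Back-of-envelope check of the table: bytes = params * bits_per_weight / 8.
        # fp16 uses 16 bits per weight; 4-bit formats land near ~4.8 bits per weight
        # once per-group quantization scales are counted (an approximation).
        def size_gb(params_billions: float, bits_per_weight: float) -> float:
            return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

        for p in (7, 13, 30, 65):
            print(f"{p}B: fp16 ~{size_gb(p, 16):.1f} GB, "
                  f"4-bit ~{size_gb(p, 4.8):.1f} GB")
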

Source: blog.csdn.net/weixin_42452716/article/details/132173959