First of all, after a model is trained, what you have is the original version. The model at this stage is the largest and can only be used through the transformers library. It is also the most compatible: basically any machine that can run transformers can run it. Transformers is the most widely used model framework in the AI world.
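As a minimal sketch of what "using it through transformers" looks like (the directory path is a placeholder, and the imports assume the Hugging Face transformers and torch packages are installed):

```python
def load_original_model(model_dir: str):
    """Load an unquantized ("original") model with transformers.

    `model_dir` is a placeholder path to a directory containing the
    numbered shard files (e.g. pytorch_model-001-of-008.bin).
    Imports are done lazily so the sketch only needs transformers/torch
    when actually called.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,  # fp16 halves memory vs fp32
        device_map="auto",          # spread layers across available GPUs/CPU
    )
    return tokenizer, model
```

`device_map="auto"` lets transformers place layers wherever there is room, which matters for the large original versions discussed here.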
Secondly, because the original model is large and slow, people have devised ways to shrink it without noticeably affecting quality: quantization. The most common quantization formats today are GPTQ and GGML. We generally use quantized models because they require far less VRAM or RAM. For example, an unquantized 33B model needs roughly 50–65 GB of VRAM or RAM, while 24 GB is enough after quantization: loading the model itself takes about 1X GB, and the remaining space is used for inference, which is entirely sufficient.
-
Model types and loaders:
| Model type | Loader | Identifying features |
| --- | --- | --- |
| Original model | transformers | The directory contains multiple consecutively numbered files, e.g. 001-of-008, 002-of-008, 003-of-008 |
| GPTQ | AutoGPTQ, ExLlama, ExLlama_HF, GPTQ-for-LLaMa | The model name contains GPTQ |
| GGML | llama.cpp | The model name contains GGML and the model is a single .bin file |

-
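For the GGML row, a minimal loading sketch using llama.cpp's Python bindings (assumes the llama-cpp-python package; the path is a placeholder for a single GGML .bin file):

```python
def load_ggml_model(model_path: str):
    """Load a GGML .bin model via llama.cpp's Python bindings.

    `model_path` is a placeholder; the import is lazy so the sketch
    only needs llama-cpp-python when actually called.
    """
    from llama_cpp import Llama

    # n_ctx sets the context window; GGML models run on CPU by default.
    return Llama(model_path=model_path, n_ctx=2048)
```

Unlike the original model, a GGML file is a single self-contained .bin, so there is no sharded directory to point at.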
Name description
| Tag | Meaning |
| --- | --- |
| 7B, 13B, 33B, 65B, 170B | Parameter count; 1B = 1 billion parameters |
| fp16 | fp16-precision version, generally a smaller original used as the basis for quantization |
| 8K | Model with 8K context length |
| 4bit | 4-bit quantized model, generally to save VRAM or RAM |
| 128g | groupsize of 128 used during quantization; g = groupsize |
| gpt4 | Calibrated with GPT-4 data, i.e. further trained on GPT-4 output, usually to strengthen certain abilities; currently most commonly chat enhancement |
| Chat | Chat enhancement |
| QLoRA | QLoRA fine-tuned version |
| LoRA | LoRA fine-tuned version |
| Uncensored | Uncensored version (alignment filtering removed) |
| NSFW | An uncensored version enhanced for NSFW content |
| OPT | OPT-format model, not part of the LLaMA series; a series used by KoboldAI, originally for writing, with uncensored and NSFW-enhanced data |
| SuperHOT | Extended context length version |
| SuperCOT | LoRA that strengthens model logic, generally used for writing |

-
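These tags can be pulled out of a model name mechanically. A hypothetical helper (the function name, tag list, and example model name are illustrative, not a real tool):

```python
import re

# Covers only the tags from the table above; not exhaustive.
KNOWN_TAGS = ["GPTQ", "GGML", "Chat", "QLoRA", "LoRA", "Uncensored",
              "NSFW", "SuperHOT", "SuperCOT", "fp16", "gpt4"]


def parse_model_name(name: str) -> dict:
    """Extract size, context, quantization, and tag info from a model name."""
    info = {}
    m = re.search(r"(\d+)[Bb]\b", name)          # e.g. 33B -> 33 billion params
    if m:
        info["params_billions"] = int(m.group(1))
    m = re.search(r"(\d+)[Kk]\b", name)          # e.g. 8K context
    if m:
        info["context_k"] = int(m.group(1))
    m = re.search(r"(\d+)\s*-?bit", name, re.IGNORECASE)  # e.g. 4bit
    if m:
        info["bits"] = int(m.group(1))
    m = re.search(r"(\d+)g\b", name)             # e.g. 128g groupsize
    if m:
        info["groupsize"] = int(m.group(1))
    # Naive substring match: note "QLoRA" in a name also matches "LoRA".
    info["tags"] = [t for t in KNOWN_TAGS if t.lower() in name.lower()]
    return info
```

For example, `parse_model_name("WizardLM-33B-SuperHOT-8K-GPTQ-4bit-128g")` reports 33 billion parameters, 8K context, 4-bit quantization, and a groupsize of 128.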
Resource usage

| Model | RAM/VRAM needed (original) | RAM/VRAM needed (4-bit quantized) |
| --- | --- | --- |
| 7B | 13 GB | 3.9 GB |
| 13B | 24 GB | 7.8 GB |
| 30B | 60 GB | 19.5 GB |
| 65B | 120 GB | 38.5 GB |
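The table values line up with a simple weights-only estimate: parameters × bits per weight ÷ 8 bytes. A sketch of that arithmetic (the quantized columns run a bit above the raw formula because real usage adds inference overhead such as the KV cache):

```python
def estimate_gib(params_billions: float, bits_per_weight: int) -> float:
    """Rough weight-storage estimate in GiB.

    Counts weights only; actual requirements are somewhat higher
    because inference needs extra space for activations and cache.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30
```

For instance, a 7B model at fp16 (16 bits per weight) comes out to about 13 GiB, matching the table's first row, while the same model at 4 bits needs roughly a quarter of that.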