LLaMA Large Language Model Deployment Tutorial by Meta

The LLaMA model released by Meta (formerly Facebook)

Introduction:

LLaMA (Large Language Model from Meta) is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens of text, showing that it is possible to train state-of-the-art models using only publicly available datasets, without resorting to proprietary and inaccessible ones. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

The goal of the LLaMA project is to train smaller models for longer on more data and still reach the same or better accuracy. Because smaller models are cheaper to run at inference time, they also need less hardware to deploy, which lets individuals and institutions without high-end hardware study LLMs.

Dataset:

The model was trained on the following data sources: CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], and Stack Exchange [2%]. The Wikipedia and Books domains include data in the following languages: Bulgarian, Catalan, Czech, Danish, German, English, Spanish, French, Croatian, Hungarian, Italian, Dutch, Polish, Portuguese, Romanian, Russian, Slovenian, Serbian, Swedish, and Ukrainian. More details on the training set and the corresponding preprocessing can be found in the paper.

Hyperparameter settings for the model:
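For reference (these values are reported in the LLaMA paper, not in the original post), the four model sizes use roughly the following architecture and training hyperparameters:

  7B:  dimension 4096, 32 attention heads, 32 layers, learning rate 3.0e-4, batch size 4M tokens, trained on 1.0T tokens
  13B: dimension 5120, 40 attention heads, 40 layers, learning rate 3.0e-4, batch size 4M tokens, trained on 1.0T tokens
  33B: dimension 6656, 52 attention heads, 60 layers, learning rate 1.5e-4, batch size 4M tokens, trained on 1.4T tokens
  65B: dimension 8192, 64 attention heads, 80 layers, learning rate 1.5e-4, batch size 4M tokens, trained on 1.4T tokens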

Model usage:

Main purpose: The main purpose of LLaMA is research on large language models, including: exploring potential applications such as question answering, natural language understanding, or reading comprehension; understanding the capabilities and limitations of current language models and developing techniques to improve them; and evaluating and mitigating bias, risk, the generation of toxic and harmful content, and hallucinations.

Primary target users: The primary target users of this model are researchers in the fields of natural language processing, machine learning, and artificial intelligence.

Use cases out of scope: LLaMA is a base model. Therefore, it should not be used in downstream applications without further risk assessment. In particular, the model was not trained with human feedback and can therefore generate toxic or offensive content, incorrect information, or generally unhelpful answers.

Model use cases:

LLaMA was not trained to be a chatbot; all it does is predict the next token in a sequence. ChatGPT also relies on hidden prompts and examples, you just don't see them. So if you want LLaMA's answers to look the way you expect, give it a few example questions and answers first, as in the sketch below.
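A minimal sketch of such a few-shot prompt, assuming you edit the prompts list in example.py (the question/answer pairs here are purely illustrative):

prompts = [
    """Answer each question in one short sentence.

Question: What is the capital of France?
Answer: Paris.

Question: Who wrote the play Hamlet?
Answer: William Shakespeare.

Question: What is the largest planet in the solar system?
Answer:"""
]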

 

As shown in the screenshot above, apart from requiring long and cumbersome prompts, the model is not Chinese-friendly: asking questions in Chinese gives even worse results.

As shown in the screenshot above, if the model is given no guidance, its answers are very confused. With a little guidance it can answer some questions correctly, but it will still generate a pile of irrelevant text beyond the question, which needs some post-processing (see the sketch below).
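One simple form of that post-processing, sketched here as an assumption rather than something from the original workflow, is to truncate the generated text at the next "Question:" marker so that only the answer to the final question is kept (strip_extra_text is a hypothetical helper, not part of the repository):

# Hypothetical helper: cut the completion off at the next "Question:" marker
# so only the answer to the final prompt question remains.
def strip_extra_text(completion: str, stop: str = "\nQuestion:") -> str:
    idx = completion.find(stop)
    return completion[:idx] if idx != -1 else completion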

Significance of the LLaMA model:

LLaMA will be useful in natural language research and potentially in applications such as "question answering, natural language understanding or reading comprehension", as well as for gaining insight into the capabilities and limitations of current language models.

While the highest-end LLaMA model (LLaMA-65B, with 65 billion parameters) goes head-to-head with similar offerings from AI competitors such as DeepMind, Google, and OpenAI, arguably the most noteworthy development is the LLaMA-13B model: as mentioned, it is reported to run on a single GPU while outperforming GPT-3.

Unlike the data center requirements of GPT-3 derivatives, LLaMA-13B opens the door to ChatGPT-like performance on consumer-grade hardware in the near future.

The number of parameters is a very important indicator in AI. Parameters are the variables a machine learning model uses to make predictions or classifications from input data. The number of parameters in a language model is a key determinant of performance: larger models are generally able to handle more complex tasks and generate more coherent output. However, more parameters take up more space and require more computing resources to run. Therefore, if a model can achieve the same results as another model while using fewer parameters, it represents a significant gain in efficiency.
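As a rough back-of-the-envelope illustration (not from the original post), the memory needed just to hold the weights in 16-bit precision is about 2 bytes per parameter, which is why the 13B model sits near the limit of what a single GPU can hold:

# Approximate fp16 footprint of the weights alone (ignores activations and the KV cache).
for name, n_params in [("LLaMA-7B", 7e9), ("LLaMA-13B", 13e9), ("LLaMA-65B", 65e9)]:
    gb = n_params * 2 / 1e9  # 2 bytes per parameter in fp16
    print(f"{name}: ~{gb:.0f} GB")
# LLaMA-7B: ~14 GB, LLaMA-13B: ~26 GB, LLaMA-65B: ~130 GB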

Independent AI researcher Simon Willison analyzed the impact of Meta's new AI model in a Mastodon post, writing that he now thinks that within a year or two we will be running language models with most of the capabilities of ChatGPT on our own (high-end) phones and laptops.

Download and deployment of the LLaMA model:

Download:

Model code acquisition:

GitHub - facebookresearch/llama: Inference code for LLaMA models

Pre-trained model weights download:

LLaMA open-source language model 7B/13B/30B/65B leaked weights, complete 568 GB, network-disk download address and free magnet link - openAI

Deployment:

Create the model hyperparameter file super_params.json in the checkpoint directory, filling it in according to the size of the model being deployed:
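A minimal sketch of super_params.json for the 7B model, assuming the same fields as the params.json that ships with the official weights (adjust dim, n_heads, and n_layers for the larger models):

{
    "dim": 4096,
    "multiple_of": 256,
    "n_heads": 32,
    "n_layers": 32,
    "norm_eps": 1e-06,
    "vocab_size": -1
}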

Modify the example.py file so that it can read the .bin parameter files:

(1) Modify the load function:

def load(
    ckpt_dir: str,
    tokenizer_path: str,
    local_rank: int,
    world_size: int,
    max_seq_len: int,
    max_batch_size: int,
) -> LLaMA:
    start_time = time.time()
    print("Loading")
    # checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    # assert world_size == len(
    #     checkpoints
    # ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
    # ckpt_path = checkpoints[local_rank]
    # print("Loading")
    # checkpoint = torch.load(ckpt_path, map_location="cpu")
    # Load the hyperparameters from super_params.json
    with open(Path(ckpt_dir) / "super_params.json", "r", encoding='utf-8') as f:
        super_params = json.loads(f.read())
    model_args: ModelArgs = ModelArgs(
        max_seq_len=max_seq_len, max_batch_size=max_batch_size,**super_params
    )
    tokenizer = Tokenizer(model_path=tokenizer_path)
    model_args.vocab_size = tokenizer.n_words
    torch.set_default_tensor_type(torch.cuda.HalfTensor)
    model = Transformer(model_args)
    # print(model.layers)
    torch.set_default_tensor_type(torch.FloatTensor)

    # Load the model weights from the .bin files
    checkpoints = sorted(Path(ckpt_dir).glob("*.bin"))
    weights = {}
    for i in checkpoints:
        weights.update(torch.load(i))
    # Rename the parameter keys loaded from the .bin files so that each one
    # matches the corresponding layer name in the model
    keys = list(weights.keys())
    for key in keys:
        if key.find('model.decoder.') != -1:
            keyNew = key.split('model.decoder.')[1]
            if keyNew.find('q_')>0:
                temp = keyNew.split('self_attn.q_proj')
                keyNew = temp[0] + 'attention.wq' + temp[1]
            elif keyNew.find('k_')>0:
                temp = keyNew.split('self_attn.k_proj')
                keyNew = temp[0] + 'attention.wk' + temp[1]
            elif keyNew.find('v_')>0:
                temp = keyNew.split('self_attn.v_proj')
                keyNew = temp[0] + 'attention.wv' + temp[1]
            elif keyNew.find('o_')>0:
                temp = keyNew.split('self_attn.o_proj')
                keyNew = temp[0] + 'attention.wo' + temp[1]
            elif keyNew.find('embed_tokens') != -1:
                keyNew = 'tok_embeddings.weight'
            weights.update({ keyNew: weights.pop(key)})
        elif key.find('lm_head.weight') != -1:
            weights.update({ 'output.weight': weights.pop(key)})
    model.load_state_dict(weights, strict=False)
    # print(model.state_dict())
    generator = LLaMA(model, tokenizer)
    print(f"Loaded in {time.time() - start_time:.2f} seconds")
    return generator

(2) Modify the parameters accepted by the main function, giving ckpt_dir and tokenizer_path default values:

def main(
    ckpt_dir: str = '7b',  # directory containing the model weight files
    tokenizer_path: str = '7b/tokenizer.model',
    temperature: float = 0.8,
    top_p: float = 0.95,
    max_seq_len: int = 512,
    max_batch_size: int = 32,
):

(3) Run the commands in the terminal:

The first run requires the following commands:

  1. pip install -r requirements.txt
  2. pip install -e .

Start the model:

torchrun example.py
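If you did not give ckpt_dir and tokenizer_path defaults in main, the repository's usual invocation passes them explicitly (the paths below are illustrative):

torchrun --nproc_per_node 1 example.py --ckpt_dir 7b --tokenizer_path 7b/tokenizer.model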
