[AI Combat] Setting up baichuan-7B, an open-source and commercially available Chinese-English large language model, from scratch


Introduction to baichuan-7B

baichuan-7B is an open-source, commercially usable large-scale pre-trained language model developed by Baichuan Intelligence. Built on the Transformer architecture, this 7-billion-parameter model was trained on roughly 1.2 trillion tokens, supports both Chinese and English, and has a context window of 4,096 tokens. Among models of the same size, it achieves the best results on the standard, authoritative Chinese and English benchmarks (C-Eval/MMLU).
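As a quick sanity check of these figures, the published configuration can be inspected with the standard transformers API. The snippet below is a minimal sketch, assuming the weights have already been downloaded (see the setup steps below); the field names are assumptions based on common transformers conventions and may differ in the custom baichuan config, in which case the script simply prints None.

    # Hedged sanity check of the model-card figures. Field names such as
    # max_position_embeddings and vocab_size follow common transformers
    # conventions and are assumptions here; the custom config may differ.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
    print(getattr(config, "max_position_embeddings", None))  # expected context window: 4096
    print(getattr(config, "vocab_size", None))                # tokenizer vocabulary size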

baichuan-7B Chinese evaluation

  • C-Eval (benchmark results chart omitted)

  • Gaokao (benchmark results chart omitted)

Setting up baichuan-7B

  • 1. Pull the Docker image

    docker pull nvcr.io/nvidia/pytorch:21.08-py3
    

    Note: requires CUDA 11.1 or later.

  • 2. Create the Docker container

    nvidia-docker run -it -d \
        --name baichuan_llm \
        -v /llm:/notebooks \
        -e TZ='Asia/Shanghai' \
        --shm-size 16G \
        nvcr.io/nvidia/pytorch:21.08-py3
    

    Enter the container:

    docker exec -it baichuan_llm env LANG=C.UTF-8 /bin/bash
    
  • 3. Download the code

    cd /notebooks/
    git clone https://github.com/baichuan-inc/baichuan-7B.git
    
  • 4. Download the model weights

    cd baichuan-7B/
    git clone https://huggingface.co/baichuan-inc/baichuan-7B
    
  • 5. Install the dependencies

    pip install -r requirements.txt
    
  • 6. Inference

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the tokenizer and model from the locally downloaded weights
    tokenizer = AutoTokenizer.from_pretrained("baichuan-7B", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("baichuan-7B", device_map="auto", trust_remote_code=True)

    # Prompt: complete the "poem title -> author" pattern, then move the inputs to the GPU
    inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
    inputs = inputs.to('cuda:0')

    # Generate up to 64 new tokens with a mild repetition penalty, then decode
    pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
    print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
    
    • Output (screenshot of the generated continuation omitted)

  • 7. Training

    • Data preparation
      Split the training corpus evenly into multiple UTF-8 text files, with the file count being a multiple of the total number of ranks, and place them in the corpus directory (data_dir by default). Each rank process reads a different subset of the files in the corpus directory, loads them fully into memory, and then starts training. This is a simplified demonstration flow; for real training jobs, users should adapt the data-production logic to their own needs. A minimal corpus-sharding sketch is given after this list.

    • Configure DeepSpeed
      Modify config/hostfile. For multi-machine, multi-GPU training, add the IP address of each node and make sure the nodes can reach one another over SSH. An example hostfile is sketched after this list.

    • Train

      sh scripts/train.sh
      
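As promised in the data-preparation step, here is a minimal corpus-sharding sketch. It is a hypothetical helper, not part of the official baichuan-7B scripts: it splits one UTF-8 text file into a number of shards (which should be a multiple of the total rank count) and writes them into the corpus directory (data_dir by default); the paths and file names are assumptions.

    # Hypothetical corpus-sharding helper for the data-preparation step
    # (not part of the official baichuan-7B repository). It splits a single
    # UTF-8 corpus file into `num_shards` roughly equal text files under
    # `data_dir`; `num_shards` should be a multiple of the total rank count.
    from pathlib import Path

    def shard_corpus(corpus_path: str, data_dir: str, num_shards: int) -> None:
        lines = Path(corpus_path).read_text(encoding="utf-8").splitlines(keepends=True)
        out_dir = Path(data_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        per_shard = (len(lines) + num_shards - 1) // num_shards  # ceiling division
        for i in range(num_shards):
            shard = lines[i * per_shard:(i + 1) * per_shard]
            (out_dir / f"part_{i:05d}.txt").write_text("".join(shard), encoding="utf-8")

    # Example: a single node with 8 GPUs (8 ranks) -> use a multiple of 8 shards.
    shard_corpus("corpus.txt", "data_dir", num_shards=8)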

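For the DeepSpeed configuration step, the hostfile lists one worker node per line in DeepSpeed's standard "hostname slots=N" format. The snippet below simply writes an example config/hostfile for two 8-GPU nodes; the IP addresses and slot counts are placeholders to be replaced with your own cluster's values.

    # Write an example DeepSpeed hostfile (config/hostfile) for two 8-GPU nodes.
    # The IP addresses and slot counts are placeholders; DeepSpeed expects one
    # "hostname slots=N" entry per worker node, reachable over passwordless SSH.
    example_nodes = ["192.168.0.1 slots=8", "192.168.0.2 slots=8"]
    with open("config/hostfile", "w", encoding="utf-8") as f:
        f.write("\n".join(example_nodes) + "\n")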
References

https://huggingface.co/baichuan-inc/baichuan-7B/tree/main
https://github.com/baichuan-inc/baichuan-7B
