How to use the Huawei Cloud ModelArts platform to play with Llama 2

This article is shared from the Huawei Cloud Community post "How to Play with Llama 2 on the Huawei Cloud ModelArts Platform" by Ma Shanghua_Lancer.

Oh my god~~ The Llama 2 model is open source!!

Meta not only open-sourced the pre-trained Llama 2 model, but also the Llama 2-Chat model, which applies SFT on conversation data, and gave a detailed introduction to how Llama 2-Chat was fine-tuned.

The open-source release currently comes in three sizes: 7B, 13B, and 70B. Pre-training used 2 trillion tokens, the SFT stage used more than 100,000 examples, and more than 1 million human preference annotations were used.

Llama 2, released less than a week ago, has already become popular in the research community, and a series of performance evaluations and online demos have been published.

Even OpenAI co-founder Andrej Karpathy implemented inference for a baby Llama 2 model in pure C (llama2.c).

Now that Llama 2 is available to everyone, how can we fine-tune it on Huawei Cloud to achieve more possible applications?

Open Huawei Cloud's ModelArts and create a notebook. First, download the dataset and upload it to an OBS object storage bucket, then copy it into the notebook's local storage with a command (see the sketch below).

Dataset address: https://huggingface.co/datasets/samsum
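The copy from OBS into the notebook can be done with the MoXing library that ships with ModelArts notebooks. A minimal sketch, assuming a hypothetical bucket path obs://your-bucket/samsum/:

import moxing as mox

# copy_parallel copies a whole directory tree between OBS and the local file system
# (the bucket path below is a placeholder; replace it with your own OBS path)
mox.file.copy_parallel('obs://your-bucket/samsum/', '/home/ma-user/work/samsum/')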

1. Download the model

Clone Meta's Llama inference repository (download script included):

!git clone https://github.com/facebookresearch/llama.git

Then run the download script:

!bash download.sh

Here, you just need to download the 7B model.

2. Convert the model to a format supported by Hugging Face

!pip install git+https://github.com/huggingface/transformers
cd transformers
python convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir models_hf/7B


Now, we have a Hugging Face model that can be fine-tuned using the Hugging Face library!
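As an optional sanity check (a sketch, not part of the original walkthrough), you can try loading the converted checkpoint with Transformers to confirm the conversion worked:

# Load the converted checkpoint to verify it is a valid Hugging Face model
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("models_hf/7B")
model = LlamaForCausalLM.from_pretrained("models_hf/7B")
print(model.config)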

3. Run the fine-tuning notebook:

Clone the llama-recipes repository:

!git clone https://github.com/facebookresearch/llama-recipes.git


Then, open the quickstart.ipynb file in your favorite notebook interface and run the entire notebook.

(Here, JupyterLab is used):

!pip install jupyterlab
jupyter lab  # run this in the repo you want to work in


To match the actual model path after conversion, make sure to change the model_id line in the notebook to:

model_id="./models_hf/7B"


After the notebook finishes running, you have a model fine-tuned with LoRA.
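For reference, the LoRA setup inside the notebook looks roughly like the sketch below; the hyperparameter values here are illustrative, not necessarily the notebook's exact ones:

# Sketch of attaching a LoRA adapter with PEFT (illustrative hyperparameters)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapter weights are trained
model.print_trainable_parameters()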

4. Perform inference on the fine-tuned model

The catch is that the Hugging Face PEFT workflow only saves the adapter weights, not the complete model, so we need to load the adapter weights on top of the full base model.

Import library:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel, PeftConfig


Load the tokenizer and model:

model_id="./models_hf/7B"tokenizer = LlamaTokenizer.from_pretrained(model_id)model =LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16)


Load the adapter from the location saved after training:

model = PeftModel.from_pretrained(model, "/root/llama-recipes/samsungsumarizercheckpoint")
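If you would rather have a standalone checkpoint than a base model plus adapter, one option (a sketch, not part of the original walkthrough) is to merge the LoRA weights back into the base model; note that merging generally requires loading the base model without 8-bit quantization:

# Merge the LoRA adapter into the base weights and save a standalone model
# (reload the base model without load_in_8bit=True before merging;
#  the output directory name is just a placeholder)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./models_hf/7B-samsum-merged")
tokenizer.save_pretrained("./models_hf/7B-samsum-merged")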


Run inference:

eval_prompt = """Summarize this dialog:A: Hi Tom, are you busy tomorrow’s afternoon?B: I’m pretty sure I am. What’s up?A: Can you go with me to the animal shelter?.B: What do you want to do?A: I want to get a puppy for my son.B: That will make him so happy.A: Yeah, we’ve discussed it many times. I think he’s ready now.B: That’s good. Raising a dog is a tough issue. Like having a baby ;-)A: I'll get him one of those little dogs.B: One that won't grow up too big;-)A: And eat too much;-))B: Do you know which one he would like?A: Oh, yes, I took him there last Monday. He showed me one that he really liked.B: I bet you had to drag him away.A: He wanted to take it home right away ;-).B: I wonder what he'll name it.A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))---Summary:"""
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
model.eval()with torch.no_grad(): print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

Fine-tuning with LLM Engine is even more convenient

What if you want to fine-tune Llama 2 with your own data?

Alexandr Wang, founder and CEO of the startup Scale AI, says that his company's open-source LLM Engine can fine-tune Llama 2 in the simplest way.


The team of Scale AI introduced the fine-tuning method of Llama 2 in detail in a blog post.

from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file="s3://my-bucket/path/to/training-file.csv",
)
print(response.json())

Dataset


In the following example, Scale uses the Science QA dataset.

This is a popular dataset of multiple-choice questions; each question may come with textual and/or image context, along with thorough explanations and walkthroughs that support the solution.


(Figure: examples from Science QA)

Currently, LLM Engine supports fine-tuning on prompt-completion pairs. First, the Science QA dataset needs to be converted into the supported format: a CSV with two columns, prompt and response.

Before you begin, install the required dependencies.

!pip install datasets==2.13.1 smart_open[s3]==5.2.1 pandas==1.4.4


You can load the dataset from Hugging Face and inspect its features.

from datasets import load_dataset
from smart_open import smart_open
import pandas as pd

dataset = load_dataset('derek-thomas/ScienceQA')
dataset['train'].features


A common format for providing Science QA examples is:

Context: A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.
Question: Which type of force from the baby's hand opens the cabinet door?
Options: (A) pull (B) push
Answer: A.


Since the options in the Hugging Face dataset are stored as a list of possible answers, the list needs to be converted into the format above by adding letter prefixes.

choice_prefixes = [chr(ord('A') + i) for i in range(26)]  # A-Z

def format_options(options, choice_prefixes):
    return ' '.join([f'({c}) {o}' for c, o in zip(choice_prefixes, options)])

Now, write a formatting function that converts individual samples in this dataset into prompts and responses for the input model.

def format_prompt(r, choice_prefixes):
    options = format_options(r['choices'], choice_prefixes)
    return f'''Context: {r["hint"]}\nQuestion: {r["question"]}\nOptions:{options}\nAnswer:'''

def format_response(r, choice_prefixes):
    return choice_prefixes[r['answer']]


Finally, build the dataset.

Note that some examples in Science QA have only image context. (These examples are skipped in the demo below, because Llama 2 is a pure language model and cannot accept image input.)

def convert_dataset(ds):
    prompts = [format_prompt(i, choice_prefixes) for i in ds if i['hint'] != '']
    labels = [format_response(i, choice_prefixes) for i in ds if i['hint'] != '']
    df = pd.DataFrame.from_dict({'prompt': prompts, 'response': labels})
    return df


LLM Engine supports training with separate training and validation datasets. If you only provide a training set, LLM Engine randomly splits off 10% of it for validation.

Holding out a validation set matters because it lets you detect whether the model is overfitting the training data, which would lead to poor generalization on unseen data at inference time.

Additionally, these dataset files must be stored at publicly accessible URLs so that LLM Engine can read them. For this example, Scale saves the dataset to S3.

The preprocessed training and validation datasets are also published as GitHub Gists; you can simply point train_url and val_url at those links.

train_url = 's3://...'
val_url = 's3://...'

df_train = convert_dataset(dataset['train'])
with smart_open(train_url, 'wb') as f:
    df_train.to_csv(f)

df_val = convert_dataset(dataset['validation'])
with smart_open(val_url, 'wb') as f:
    df_val.to_csv(f)


Now, you can start fine-tuning via the LLM Engine API.

Fine-tuning

First, you need to install LLM Engine.

!pip install scale-llm-engine


Next, you need to set up the Scale API key. Follow the instructions in the README to obtain your unique API key.

Advanced users can also follow the self-hosted LLM Engine guide, which eliminates the need for a Scale API key.

import os
os.environ['SCALE_API_KEY'] = 'xxx'


Once you have everything set up, fine-tuning your model only requires one API call.

Here, Scale chose the 7 billion parameter version of Llama-2 because it is powerful enough for most use cases.

from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file=train_url,
    validation_file=val_url,
    hyperparameters={
        'lr': 2e-4,
    },
    suffix='science-qa-llama',
)
run_id = response.fine_tune_id


With run_id, you can monitor the job status and get per-epoch metrics in real time, such as training and validation losses.

Science QA is a large dataset, so training may take an hour or two to complete.

import time

while True:
    job_status = FineTune.get(run_id).status
    # Returns one of `PENDING`, `STARTED`, `SUCCESS`, `RUNNING`,
    # `FAILURE`, `CANCELLED`, `UNDEFINED` or `TIMEOUT`
    print(job_status)
    if job_status == 'SUCCESS':
        break
    time.sleep(60)

# Logs for completed or running jobs can be fetched with
logs = FineTune.get_events(run_id)


Inference and Evaluation

Once you've finished fine-tuning, you can start generating responses to any input. However, before doing so, make sure the model exists and is ready to accept input.

ft_model = FineTune.get(run_id).fine_tuned_model
Note that it may take several minutes for the first inference result to come back; after that, inference speeds up.
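As a quick sanity check (a sketch that reuses the same Completion API as the evaluation code below, with a placeholder prompt), you can send a single prompt to the fine-tuned model:

from llmengine import Completion

# Single-prompt check against the fine-tuned model; the prompt text is a placeholder
response = Completion.create(
    model=ft_model,
    prompt="Context: ...\nQuestion: ...\nOptions: (A) ... (B) ...\nAnswer:",
    max_new_tokens=1,
    temperature=0.01,
)
print(response.output.text)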

Let’s evaluate the performance of the Llama-2 model fine-tuned on Science QA.

import pandas as pd
from llmengine import Completion

# Helper function to get outputs from the fine-tuned model, with retries
def get_output(prompt: str, num_retry: int = 5):
    for _ in range(num_retry):
        try:
            response = Completion.create(
                model=ft_model, prompt=prompt, max_new_tokens=1, temperature=0.01
            )
            return response.output.text.strip()
        except Exception as e:
            print(e)
    return ""

# Read the test data
test = pd.read_csv(val_url)
test["prediction"] = test["prompt"].apply(get_output)
print(f"Accuracy: {(test['response'] == test['prediction']).mean() * 100:.2f}%")


After fine-tuning, Llama-2 can achieve an accuracy of 82.15%, which is quite good.

So, how does this result compare to the Llama-2 base model?

Since the pre-trained model was not fine-tuned on this dataset, a solved example needs to be provided in the prompt so that the model learns to follow the expected reply format.
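A sketch of what such a one-shot prompt might look like, reusing the solved example shown earlier (the construction itself is illustrative, not Scale's exact code):

# Prepend one solved example so the base model imitates the expected answer format
few_shot_example = (
    "Context: A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.\n"
    "Question: Which type of force from the baby's hand opens the cabinet door?\n"
    "Options: (A) pull (B) push\n"
    "Answer: A\n\n"
)
base_prompt = few_shot_example + format_prompt(dataset['validation'][0], choice_prefixes)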

Additionally, we can see how it compares to a fine-tuned model of similar size, the MPT-7B.


Fine-tuning Llama-2 on Science QA yields an absolute performance gain of 26.59%!

Furthermore, because prompts are shorter, running inference with the fine-tuned model is cheaper than using few-shot prompting. This fine-tuned Llama-2 7B model also outperforms the 175-billion-parameter GPT-3.5.

The Llama-2 model outperforms MPT in both the fine-tuned and the few-shot prompting settings, fully demonstrating its strengths both as a base model and as a model to fine-tune.

In addition, Scale used LLM Engine to fine-tune and evaluate Llama-2 on several tasks from GLUE, a commonly used NLP benchmark.


Now, anyone can unlock the true potential of fine-tuned models and witness the magic of powerful AI-generated replies.

I've found that while Hugging Face has built an excellent library around Transformers, its guides are often too complex for the average user.

References:

  • https://twitter.com/MetaAI/status/1683581366758428672
  • https://brev.dev/blog/fine-tuning-llama-2
  • https://scale.com/blog/fine-tune-llama-2


 
