Beat ChatGPT? OpenChat dominates the Stanford AlpacaEval open-source leaderboard, with performance reaching 105.7% of ChatGPT

Source: Xinzhiyuan (ID: AI-era)

Overnight, news that the new open-source model "OpenLLM" had beaten ChatGPT caused an uproar on the Internet.

According to the official introduction, OpenLLM:

- On Stanford AlpacaEval, it ranks first among open-source models with a win rate of 80.9%

- On the Vicuna GPT-4 evaluation, its performance reaches 105.7% of ChatGPT's

Most importantly, this performance requires fine-tuning on only 6K GPT-4 dialogues.

Project address: https://github.com/imoneoi/openchat

However, the maintainers of Chatbot Arena cautioned that the old Vicuna eval benchmark has some biases, and recommended migrating to the newly proposed MT-Bench to better evaluate more aspects of an LLM's capabilities.

OpenLLM: fine-tuned on just 6K GPT-4 dialogues

OpenLLM is a family of open-source language models fine-tuned on a diverse, high-quality dataset of multi-turn dialogues.

Specifically, the researchers filtered roughly 90K ShareGPT conversations down to about 6K GPT-4 conversations.

Surprisingly, after fine-tuning on just these 6K dialogues, OpenLLM achieves high performance despite the limited data.

OpenLLM includes two general-purpose models: OpenChat and OpenChat-8192.

OpenChat: fine-tuned from LLaMA-13B, with a context length of 2048

- Achieves 105.7% of ChatGPT's score on the Vicuna GPT-4 evaluation

- Achieves a striking 80.9% win rate on AlpacaEval

OpenChat-8192: fine-tuned from LLaMA-13B, with a context length of 8192

- Achieves 106.6% of ChatGPT's score on the Vicuna GPT-4 evaluation

- Achieves a 79.5% win rate on AlpacaEval

In addition, OpenLLM includes a code model:

OpenCoderPlus: fine-tuned from StarCoderPlus, with a native context length of 8192

- Achieves 102.5% of ChatGPT's score on the Vicuna GPT-4 evaluation

- Achieves a 78.7% win rate on AlpacaEval

Model evaluation

The researchers evaluated the latest models using the Vicuna GPT-4 and AlpacaEval benchmarks; the results are shown in the figures below:

Vicuna GPT-4 evaluation (vs gpt-3.5-turbo)

Vicuna GPT-3.5-Turbo evaluation (vs gpt-3.5-turbo)

It is also worth noting that the evaluation scheme used by the researchers differs slightly from Vicuna's: Evidence Calibration (EC) and Balanced Position Calibration (BPC) are also applied to reduce potential bias.
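
For intuition, here is a minimal Python sketch of Balanced Position Calibration, assuming a generic judge function (hypothetical, not from the project) that scores two answers presented in a given order; querying the judge twice with the order swapped and averaging cancels out position bias:

def balanced_position_calibration(judge, question, answer_a, answer_b):
    # judge(question, first, second) -> (score_first, score_second); hypothetical signature
    s_ab = judge(question, answer_a, answer_b)
    s_ba = judge(question, answer_b, answer_a)
    # Average each answer's score across both presentation orders
    score_a = (s_ab[0] + s_ba[1]) / 2
    score_b = (s_ab[1] + s_ba[0]) / 2
    return score_a, score_b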

Installation and weights

To use OpenLLM, CUDA and PyTorch need to be installed. Users can clone the repository and install the remaining dependencies via pip:

git clone git@github.com:imoneoi/openchat.git
pip install -r requirements.txt

Currently, the researchers provide the full weights of all models as Hugging Face repositories.

Users can start a local API server at http://localhost:18888 using the serving command provided in the repository.

The server is compatible with the openai package and the ChatCompletions protocol (note that some features may not be fully supported).

Users can point the openai package at the server by setting:

openai.api_base = "http://localhost:18888/v1"

Only a subset of the ChatCompletions parameters is currently supported.

Recommendation: run the server on a GPU with at least 40 GB of memory (1x A100).
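
As an illustration, a minimal sketch of calling the local server with the (pre-1.0) openai Python client; the model name "openchat" and the dummy API key are assumptions, not values documented here:

import openai

openai.api_base = "http://localhost:18888/v1"
openai.api_key = "none"  # assumption: the local server does not require a real key

response = openai.ChatCompletion.create(
    model="openchat",  # hypothetical model identifier; use the one your server expects
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(response["choices"][0]["message"]["content"])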

Dataset

The converted dataset is available at openchat_sharegpt4_dataset.

The dataset used in the project is a cleaned and filtered version of ShareGPT.

The original ShareGPT dataset contains about 90,000 dialogues, of which only 6,000 cleaned GPT-4 dialogues are kept for fine-tuning.

Each cleaned GPT-4 dialogue is combined with the dialogue template and the end-of-turn token, and then truncated to the model's context limit (content beyond the limit is discarded).
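
A minimal sketch of this step, assuming a hypothetical template and end-of-turn string (the project's actual template and tokenizer calls may differ):

EOT = "<|end_of_turn|>"  # hypothetical end-of-turn token string

def render_and_truncate(turns, tokenizer, max_context_len):
    # Join (role, content) turns with a simple illustrative template plus the
    # end-of-turn token, tokenize, and truncate to the model's context limit;
    # anything beyond the limit is discarded.
    text = "".join(f"{role}: {content}{EOT}" for role, content in turns)
    token_ids = tokenizer.encode(text)
    return token_ids[:max_context_len]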

To run the data processing pipeline, execute the following command:

./ochat/data/run_data_pipeline.sh INPUT_FOLDER OUTPUT_FOLDER

The input folder should contain a ShareGPT subfolder with one .html file for each ShareGPT conversation page.

The data processing flow consists of three steps:

- Cleaning: clean up the HTML and convert it to Markdown, remove malformed dialogues, remove dialogues containing blocked words, and perform exact hash-based deduplication

- Filtering: keep only conversations tagged "Model: GPT-4" (see the sketch after this list)

- Conversion: convert and tokenize all dialogues for model fine-tuning
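
A minimal sketch of the deduplication and filtering steps above; the record fields ("markdown", "model") are assumptions about the cleaned data, not the project's actual schema:

import hashlib

def dedup_and_keep_gpt4(dialogues):
    seen, kept = set(), []
    for d in dialogues:
        # Exact deduplication based on a hash of the cleaned Markdown text
        h = hashlib.sha256(d["markdown"].encode("utf-8")).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        # Keep only conversations tagged "Model: GPT-4"
        if d.get("model") == "GPT-4":
            kept.append(d)
    return kept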

The final converted dataset has the following format:

MODEL_TYPE.train.json / .eval.json

[
    [token_id_list, supervise_mask_list],
    [token_id_list, supervise_mask_list],
    ...
]

MODEL_TYPE.train.text.json / .eval.text.json: the plain text decoded from token_id_list
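
For example, the converted file can be read back like this (a sketch assuming the layout exactly as described above; replace MODEL_TYPE with the actual model type):

import json

with open("MODEL_TYPE.train.json") as f:
    samples = json.load(f)

token_ids, supervise_mask = samples[0]
# The mask marks which tokens are supervised (receive loss) during fine-tuning
assert len(token_ids) == len(supervise_mask)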

In addition to this, the researchers also provided a tool for visualizing dialogue embeddings.

Just open ochat/visualization/ui/visualizer.html in a browser and drag and drop MODEL_TYPE.visualizer.json onto the page. Click a point in the 3D map to display the corresponding dialogue.

The embeddings are created with openai_embeddings.py and then reduced with UMAP and colored by K-Means in dim_reduction.ipynb.
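
A minimal sketch of that preprocessing, assuming the per-dialogue embeddings are available as a NumPy array (the file name and cluster count are placeholders, and whether clustering runs in the original or reduced space is also an assumption):

import numpy as np
import umap                      # umap-learn
from sklearn.cluster import KMeans

embeddings = np.load("embeddings.npy")          # hypothetical file of dialogue embeddings

coords_3d = umap.UMAP(n_components=3).fit_transform(embeddings)  # 3D coordinates for the visualizer
cluster_ids = KMeans(n_clusters=8).fit_predict(embeddings)       # cluster labels used as colors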

Model modifications

The researchers added an EOT (end-of-turn) token to each base model.

For the LLaMA models, the EOT embedding is initialized as the average of all existing token embeddings. For the StarCoder model, the EOT embedding is randomly initialized with a standard deviation of 0.02.
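
A sketch of the LLaMA-style initialization using Hugging Face transformers (not the project's exact code; the token string and model path are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BASE_MODEL_PATH")
model = AutoModelForCausalLM.from_pretrained("BASE_MODEL_PATH")

# Add the EOT token and grow the embedding matrix by one row
tokenizer.add_special_tokens({"additional_special_tokens": ["<|end_of_turn|>"]})
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-1] = emb[:-1].mean(dim=0)   # new EOT row = average of all existing embeddings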

For LLaMA-based models with 8192-token contexts, max_position_embeddings is set to 8192 and the RoPE (Rotary Position Embedding) positions are extrapolated.
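
A sketch of that configuration change with transformers (placeholder model path; the project's actual training code may differ):

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("BASE_MODEL_PATH")
config.max_position_embeddings = 8192   # extend the context window to 8192 positions
model = AutoModelForCausalLM.from_pretrained("BASE_MODEL_PATH", config=config)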

Training

The same training hyperparameters are used for all models.

Training uses 8x A100 80GB GPUs:

NUM_GPUS=8
deepspeed --num_gpus=$NUM_GPUS --module ochat.training_deepspeed.train \
    --model_type MODEL_TYPE \
    --model_path BASE_MODEL_PATH \
    --save_path TARGET_FOLDER \
    --length_grouping \
    --epochs 5 \
    --data_path DATASET_PATH \
    --deepspeed \
    --deepspeed_config ochat/training_deepspeed/deepspeed_config.json

Evaluation

To run the Vicuna GPT-4 evaluation, follow these steps:

1. Generate model answers

python -m ochat.evaluation.get_model_answer --model_type MODEL_TYPE --models_path PATH_CONTAINING_ALL_MODELS_SAME_TYPE --data_path ./ochat/evaluation/vicuna --output_path ./eval_results

2. Generate baseline (GPT-3.5) answers

OPENAI_API_KEY=sk-XXX python -m ochat.evaluation.get_openai_answer --data_path ./ochat/evaluation/vicuna --output_path ./eval_baselines --model_types gpt-3.5-turbo

3. Run the GPT-4 evaluation

OPENAI_API_KEY=sk-XXX python -m ochat.evaluation.openai_eval --data_path ./ochat/evaluation/vicuna --baseline_path ./eval_baselines/vicuna_gpt-3.5-turbo.jsonl --input_path ./eval_results

4. Visualize the results

To visualize and plot evaluation results, open ochat/visualization/eval_result_ui/eval_result_visualizer.html with a browser and select all files in the ./eval_results/eval_result_YYYYMMDD folder to display the results.

Limitations

Base model limitations

Despite its excellent performance, OpenLLM is still constrained by the inherent limitations of its base models, which can affect performance in the following areas:

- Complex reasoning

- Math and arithmetic tasks

- Programming and coding challenges

Hallucination of non-existent information

OpenLLM may sometimes produce non-existent or inaccurate information, known as "hallucination". Users should be aware of this possibility and verify any critical information obtained from the model.

References:

https://github.com/imoneoi/openchat

https://tatsu-lab.github.io/alpaca_eval/
