Revealing how WeChat trains large models: the low-key WeLM | Its official site was last updated a year ago

"  By introducing the design ideas, data sets, model structure, training methods, diversified evaluation results, etc. of the large-scale Chinese pre-trained language model WeLM created by WeChat, we comprehensively analyze the technical principles and application value of this model. "



01

WeChat's "ChatGPT" is called WeLM (WeChat Language Model).

Official introduction: WeLM is a general language model that is very good at understanding and generating text.

WeLM imposes no task-specific constraints or presets for natural-language tasks. Simply by asking the model to continue the text you input, you can carry out a wide range of natural language tasks, such as dialogue, question answering, copywriting, text rewriting, reading comprehension, translation, and article continuation.

Users can solve a variety of text-related tasks by calling WeLM's API.

Official address:

https://welm.weixin.qq.com/docs/tutorial/

Note: WeLM does not provide a chat interface. It can only be used through API calls, and you must fill in a form to apply for an access token.

Limits and Quotas on API Requests:

  • Up to 30 requests every 1 minute for each token.

  • Up to 1000000 characters can be generated every 24 hours.

  • The quota is reset every 24 hours (starting from the first request, within the next 24 hours).

In short: up to 30 requests per minute and up to 1 million generated characters per day.

The current documentation does not state a context-length limit. The technical report discussed later mentions a context window size of 2048 tokens, which is presumably the effective limit.
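For reference, a minimal Python sketch of what such an API call might look like. The endpoint path, header scheme, and parameter names below are assumptions for illustration only; check the tutorial linked above for the actual interface.

```python
# Hypothetical sketch of a WeLM API call. The endpoint path, header name, and
# JSON fields are assumptions; consult https://welm.weixin.qq.com/docs/tutorial/
# for the real interface. Remember the quota: 30 requests/min, 1M chars/day.
import requests

WELM_ENDPOINT = "https://welm.weixin.qq.com/v1/completions"  # assumed URL
TOKEN = "YOUR_APPLIED_TOKEN"  # obtained by filling in the application form

payload = {
    "prompt": "请把下面的句子翻译成英文：我想了解大规模预训练语言模型。",
    "max_tokens": 64,       # assumed parameter name
    "temperature": 0.85,    # assumed parameter name
}

resp = requests.post(
    WELM_ENDPOINT,
    json=payload,
    headers={"Authorization": TOKEN},  # assumed auth scheme
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```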

02

Data

WeLM is pre-trained on a curated dataset, which comes from diverse sources and aims to cover multiple domains.

(1) It covers a wide range of topics and language usage from Chinese communities;

(2) The data undergoes rigorous deduplication, noise reduction, and filtering of harmful content to ensure high quality;

(3) All data that significantly overlaps with downstream evaluation tasks is filtered out to ensure fair evaluation.

WeLM builds its general-domain web subset from the monthly snapshots released by Common Crawl.

All WET files between August 2020 and January 2022 were downloaded, and langdetect was used to filter out non-Chinese pages. For domain-specific corpora, data from a variety of sources is mixed in, including news, books, popular online forums, and academic works.

As with the general-domain data, WeLM uses langdetect to keep only the Chinese text from these sources. In addition, approximately 750GB of English data collected from the sources mentioned above was added so that the model can learn bilingual knowledge. The complete dataset contains over 10TB of raw text.

The data contains a lot of noise, such as meaningless text, offensive language, placeholder text, and source code, especially in the web data crawled for the general domain.

To reduce this noise, a set of rule-based filters following Raffel et al. is applied first. From the remaining data, a balanced labeled dataset of 80k samples was manually constructed, with positive and negative examples in a 1:1 ratio. Positive samples are valid, clean text; negative samples are text with various types of noise.

WeLM trained a binary classifier on this labeled data using fastText. Only samples classified as positive with probability greater than 0.9 are kept. This rule-based + fastText filtering reduced the total data volume by 87.5%.
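As a rough illustration of this two-stage idea (not WeLM's actual pipeline), the sketch below combines a couple of made-up rule-based filters with a fastText quality classifier; only the 0.9 threshold and the "rules + classifier" structure come from the description above.

```python
# Sketch of rule-based + fastText quality filtering (illustrative, not WeLM's
# actual code). A tiny labeled file is written inline so the example runs;
# the real classifier was trained on 80k hand-labeled samples.
import re
import fasttext

with open("quality.train", "w", encoding="utf-8") as f:
    f.write("__label__pos 这是一段通顺、包含具体信息的正文，适合用于训练语言模型。\n"
            "__label__neg html body 404 not found 点击 登录 注册 广告 推广 链接\n" * 100)

def passes_rules(doc: str) -> bool:
    """A few illustrative rule-based filters in the spirit of Raffel et al."""
    if len(doc) < 20:                                  # too short to be useful
        return False
    if doc.count("{") + doc.count(";") > 20:           # likely source code / markup
        return False
    if re.search(r"lorem ipsum", doc, re.IGNORECASE):  # placeholder text
        return False
    return True

model = fasttext.train_supervised(input="quality.train", epoch=5, wordNgrams=2)

def keep(doc: str, threshold: float = 0.9) -> bool:
    """Keep a document only if the rules pass and the classifier says 'pos' with p > 0.9."""
    if not passes_rules(doc):
        return False
    labels, probs = model.predict(doc.replace("\n", " "))
    return labels[0] == "__label__pos" and probs[0] > threshold

docs = ["这是一段通顺、包含具体信息的正文，适合用于训练语言模型。",
        "html body 404 not found 点击 登录 注册 广告 推广 链接"]
print([keep(d) for d in docs])
```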

Then, to deduplicate the training data, WeLM adopts a two-step method: MD5 hashing to filter exact duplicate paragraphs, and the SimHash algorithm to remove documents with near-duplicate content. In total, 40.02% of duplicate content was removed.
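A toy sketch of the same two-step idea, using exact MD5 matching on paragraphs and a small hand-rolled SimHash for near-duplicate documents; the n-gram size and Hamming threshold are illustrative choices, not values from the paper.

```python
# Illustrative two-step deduplication: exact MD5 on paragraphs, SimHash on documents.
import hashlib

def md5_dedup_paragraphs(paragraphs):
    """Drop exact duplicate paragraphs by MD5 fingerprint."""
    seen, kept = set(), []
    for p in paragraphs:
        h = hashlib.md5(p.strip().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(p)
    return kept

def simhash(text, bits=64):
    """A minimal 64-bit SimHash over character 3-grams."""
    votes = [0] * bits
    grams = [text[i:i + 3] for i in range(max(len(text) - 2, 1))]
    for g in grams:
        h = int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def near_duplicates(a, b, max_hamming=3):
    """Two documents are near-duplicates if their SimHashes differ in few bits."""
    return bin(simhash(a) ^ simhash(b)).count("1") <= max_hamming

print(len(md5_dedup_paragraphs(["同一段落", "同一段落", "不同段落"])))  # 2 paragraphs kept
print(near_duplicates("今天天气很好，适合出门散步。", "今天天气很好，适合出门走走。"))
```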

To eliminate data contamination and ensure fairness of evaluation, text that overlaps with WeLM’s development and test data is filtered out using a method similar to that used in GPT-3.

How it works: compute 17-gram matches between each document and the development and test sets. If a document contains ≥ 2 repeated 17-grams or ≥ 1 repeated 34-gram, it is deleted from the corpus. This removes a further 0.15% of the remaining data.
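A sketch of this n-gram overlap check, treating the text as a character sequence (an assumption; the paper does not spell out the tokenization details here).

```python
# Illustrative 17-gram contamination filter (GPT-3-style decontamination).
def char_ngrams(text, n=17):
    text = "".join(text.split())          # treat the text as a character sequence
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def is_contaminated(doc, eval_17grams, eval_34grams):
    """Drop a doc if it shares >= 2 17-grams or any 34-gram with the eval data."""
    hits_17 = len(char_ngrams(doc, 17) & eval_17grams)
    hits_34 = len(char_ngrams(doc, 34) & eval_34grams)
    return hits_17 >= 2 or hits_34 >= 1

eval_texts = ["……开发集和测试集中的一段示例文本，用来构造重叠检查……"]
eval_17 = set().union(*(char_ngrams(t, 17) for t in eval_texts))
eval_34 = set().union(*(char_ngrams(t, 34) for t in eval_texts))

corpus = ["一篇普通的训练文档。", eval_texts[0] + " 外加一些额外内容。"]
clean = [d for d in corpus if not is_contaminated(d, eval_17, eval_34)]
print(len(clean))  # the second document overlaps the eval text and is dropped
```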

After filtering and balancing the data, WeLM’s corpus contains 262B tokens. Due to the uneven distribution of data, the data is resampled during the pre-training process to balance data from different sources.

In this way, the training data is diverse and representative, covering different domains. Before balancing, Common Crawl data accounts for more than 75% of the corpus; after balancing, only about 50% of the training data comes from Common Crawl.

[Figure: topic distribution of the Common Crawl data before and after balancing]

As can be seen from the chart, the topic distribution of Common Crawl is very uneven, with most documents concentrated on a few topics. After data balancing, the topic distribution becomes smoother.

03

Model and implementation

WeLM's training and evaluation codebase is based on Megatron-LM and DeepSpeed to support efficient training of large language models.

The team trained language models of four different sizes, from 1.3B to 10B parameters. They adopt the same autoregressive Transformer decoder architecture as GPT-3, with a few notable differences.

Position encoding: WeLM uses rotary position embedding, a relative position encoding based on rotations. Compared with the absolute position encoding used in the original GPT, relative encoding handles the semantics of long texts better and is especially helpful for tasks that require modeling complete articles or books.
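For intuition, a minimal numpy sketch of rotary position embedding applied to query/key matrices; the dimensions and the base of 10000 follow the common RoPE formulation rather than any WeLM-specific values.

```python
# Minimal rotary position embedding (RoPE) sketch with numpy.
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    assert dim % 2 == 0
    half = dim // 2
    # Rotation frequency for each pair of dimensions.
    inv_freq = base ** (-np.arange(half) / half)          # (half,)
    angles = np.outer(np.arange(seq_len), inv_freq)       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)   # 8 positions, head dimension 64
k = np.random.randn(8, 64)
q_rot, k_rot = rope(q), rope(k)
# The rotation makes attention scores depend on relative position (m - n),
# not on absolute positions.
print(q_rot.shape, float(q_rot[2] @ k_rot[5]))
```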


The tokenizer uses SentencePiece with a vocabulary of 62k tokens. Besides 30k Chinese tokens, it also includes common words from other frequently used languages such as English, Japanese, and Korean. All spaces and tabs are preserved to aid downstream tasks.
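A toy SentencePiece example showing the mechanics of training and applying such a tokenizer; the corpus, vocabulary size, and flags are illustrative stand-ins, since WeLM's actual 62k vocabulary and training data are not public.

```python
# Toy SentencePiece example (illustrative settings; WeLM's real 62k vocabulary
# and training corpus are not public). Trains a tiny model on an inline corpus.
import sentencepiece as spm

with open("tiny_corpus.txt", "w", encoding="utf-8") as f:
    f.write("微信团队训练了大规模中文预训练语言模型 WeLM。\n"
            "The model also covers English, 日本語 and 한국어 tokens.\n" * 50)

# hard_vocab_limit=false lets the trainer shrink the vocabulary on tiny data.
spm.SentencePieceTrainer.train(
    "--input=tiny_corpus.txt --model_prefix=toy_sp "
    "--vocab_size=200 --character_coverage=0.9995 --hard_vocab_limit=false"
)

sp = spm.SentencePieceProcessor()
sp.load("toy_sp.model")
print(sp.encode_as_pieces("WeLM 是一个很擅长理解和生成文本的通用语言模型。"))
```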

[Table: per-model configuration (layers, bottleneck activation size, maximum learning rate, batch size, context window size)]

WeLM pre-trained three models with different numbers of parameters. The table lists the number of layers, bottleneck activation size, maximum learning rate, training batch size, and context window size (in tokens) for each model.

(Note: the original paper mentions four models, but the table lists details for only three.)

WeLM uses the AdamW optimizer for model training and adopts the cosine learning rate scheduler.

DeepSpeed ZeRO stage 1 optimization is used to reduce GPU memory consumption. When a model does not fit on a single GPU, tensor parallelism is used.

All models are trained using FP16 mixed precision to avoid underflow.

Training batch sizes are 1024 and 2048, and the context window size is 2048 tokens. Each model has its own maximum learning rate; the learning rate is warmed up gradually early in training and then decays, stopping once it reaches the minimum learning rate, which is set to 10% of the maximum.
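A small sketch of such a warmup-then-cosine-decay schedule with a 10% floor; the warmup length, step counts, and the 1e-4 maximum learning rate are made up for illustration.

```python
# Illustrative warmup + cosine decay learning-rate schedule with a 10% floor.
import math

def lr_at(step, max_lr, warmup_steps, decay_steps):
    min_lr = 0.1 * max_lr                      # floor = 10% of the max LR
    if step < warmup_steps:                    # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step >= warmup_steps + decay_steps:     # stop decaying at the floor
        return min_lr
    progress = (step - warmup_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

# E.g. max LR 1e-4, 2k warmup steps, 100k decay steps (all illustrative numbers).
for s in [0, 1000, 2000, 50000, 102000, 200000]:
    print(s, f"{lr_at(s, 1e-4, 2000, 100000):.2e}")
```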

According to the analysis of Hoffmann et al., as the compute budget increases, model size and training data should grow in roughly equal proportions. Under its compute budget, WeLM therefore chose to train a 10B-parameter model on 128 A100-SXM4-40GB GPUs, with more than 300B tokens of training data. This is similar in scale to the training of GPT-3 and Gopher. The largest model took approximately 24 days to train.

When training the 10B model, instability problems occurred: the loss would suddenly spike, damaging the model weights and slowing convergence.

To solve this, training is restarted from the checkpoint saved before the loss spike, and the next 200 data batches are skipped. Lowering the learning rate and resetting the dynamic loss scale also help. Similar strategies have been used by other researchers.
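A toy control-flow sketch of that recovery recipe (reload an earlier checkpoint, skip a window of batches, lower the learning rate); the dummy training step, thresholds, and step counts are purely illustrative.

```python
# Toy sketch of the loss-spike recovery recipe: reload an earlier checkpoint,
# skip ~200 batches around the spike, and lower the learning rate.
# The dummy train_step, thresholds, and step counts are purely illustrative.
import random

def train_step(batch_id, lr):
    """Stand-in for a real training step; returns a fake loss."""
    loss = 2.0 / (1 + 0.001 * batch_id) + random.random() * 0.05
    if 5000 <= batch_id < 5003:              # simulate a sudden loss spike
        loss += 10.0
    return loss

SKIP_WINDOW, SPIKE_THRESHOLD = 200, 5.0
lr, batch_id, last_ckpt = 1e-4, 0, 0
skip_range = None                            # (start, end) of batches to skip

while batch_id < 10000:
    if skip_range and skip_range[0] <= batch_id < skip_range[1]:
        batch_id += 1                        # drop the batches around the spike
        continue
    loss = train_step(batch_id, lr)
    if loss > SPIKE_THRESHOLD:
        skip_range = (batch_id, batch_id + SKIP_WINDOW)
        batch_id = last_ckpt                 # "reload" the checkpoint before the spike
        lr *= 0.5                            # lower the LR (and reset the fp16 loss scale)
        continue
    if batch_id % 1000 == 0:
        last_ckpt = batch_id                 # pretend a checkpoint was saved here
    batch_id += 1

print("finished with lr =", lr)
```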

[Figure 3: (a) training loss curves; (b) average CLUE performance during training]

Figure 3a shows the training loss curves. Figure 3b shows the average model performance on the CLUE benchmark over the course of training.

As you can see from the chart above, both training loss and average model performance improve over time. Larger models perform significantly better than smaller models.

04

Model evaluation

To test WeLM's performance on a range of natural language processing (NLP) tasks, the in-context learning approach was adopted: a task-related prompt is fed in, and the model continues it word by word to produce the output.

For generation tasks, WeLM decoding is used directly to generate answers.

For classification tasks, a predefined verbalizer maps each label to certain words; at inference time WeLM computes the perplexity of each verbalized option, and the label whose words have the lowest perplexity is taken as the prediction.
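Since WeLM itself is only reachable through its API, the sketch below illustrates the same verbalizer-plus-perplexity trick with a generic Hugging Face causal LM; the model name and verbalizer words are placeholders, not anything used in the paper.

```python
# Verbalizer + perplexity classification sketch using a generic causal LM
# (WeLM is API-only; "uer/gpt2-chinese-cluecorpussmall" is just a stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "uer/gpt2-chinese-cluecorpussmall"   # placeholder Chinese GPT-2 model
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def sequence_loss(text: str) -> float:
    """Average token negative log-likelihood (lower means less 'perplexing')."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()

def classify(prompt: str, verbalizer: dict) -> str:
    """Pick the label whose verbalized continuation has the lowest loss."""
    scores = {label: sequence_loss(prompt + word) for label, word in verbalizer.items()}
    return min(scores, key=scores.get)

prompt = "这部电影剧情紧凑，演员表现出色。这条评论的情感是"
print(classify(prompt, {"positive": "正面", "negative": "负面"}))
```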

To evaluate the different capabilities of WeLM, four kinds of evaluation were used: monolingual evaluation, cross-/mixed-language evaluation, multi-prompt training evaluation, and other exploratory evaluations.

The last category includes studies of WeLM's explainability, self-calibration, and memorization abilities.

Monolingual assessment (Chinese)

For the evaluation of WeLM on Chinese NLP tasks, experiments were conducted in two settings: zero-shot and few-shot. The evaluation covers 18 Chinese NLP tasks. Compared with Chinese pre-trained language models of similar size, such as CPM, PanGu, and Ernie 3.0, WeLM performs best on most tasks.

WeLM performs well on Chinese machine reading comprehension tasks, including CMRC2018, DRCD, and DuReader. The task is treated as generation: the passage and question are given as input, and the answer is generated. For DuReader, the Zhidao subset was used for evaluation. WeLM significantly outperforms the other models on these tasks.

WeLM is evaluated on four Chinese fill-in-the-blank and completion tasks: People_daily (PD), Children_fairy_tale (CFT), CHID and CMRC2017.

The PD and CFT tasks require the model to predict masked words in sentences from the PD news dataset and CFT dataset.

CHID provides 10 candidate Chinese idioms and asks the model to choose the correct one from them.

The CMRC2017 task masks common nouns and named entities in queries and requires the model to predict the masked words.

Ernie 3.0 performs best on PD and CMRC2017, while WeLM performs best on the other tasks. This is expected, because PD and CMRC2017 are masked-word prediction tasks, which matches Ernie 3.0's pre-training objective.

The NLI task requires the model to judge the relation between a premise and a hypothesis, divided into three categories: entailment, contradiction, and neutral. On the CMNLI and OCNLI datasets from the Chinese GLUE benchmark, treated as 3-way classification, all models performed similarly. This kind of task rarely appears in raw pre-training text.

WeLM performs well at answering questions without external knowledge sources, improving the F1 score by more than 10% over other models. The evaluation uses the WebQA dataset, which contains questions from Baidu. The task is treated as generation and evaluated by comparing the model's answers with the reference answers.

Sentiment analysis is a classic NLP task that requires a model to determine the sentiment of a given text. WeLM performs well on both the Chinese implicit sentiment analysis dataset and ChnSentiCorp.

WeLM handles the three-way sentiment classification well (ChnSentiCorp has only two classes) and also achieves good performance in the zero-shot setting.

The Winograd task is a sentence-pair disambiguation problem that requires world knowledge and reasoning. WeLM was evaluated on the CLUEWSC2020 dataset, with the task cast as multiple-choice classification. WeLM performed best, although in the few-shot setting the 10B model's performance dropped.

On common-sense reasoning, evaluated on the C3 dataset with perplexity-based prediction, PanGu, Ernie 3.0, and WeLM perform similarly. PanGu is slightly better than WeLM in the zero-shot case but worse in the few-shot case.

In text classification experiments on news-title classification (TNEWS) and app-description classification (IFLYTEK), WeLM performed strongly, significantly outperforming the other models on both tasks.

Text summarization aims to produce a concise summary of a given input text.

Existing pre-trained language models demonstrate their zero-shot summarization skills by using templates like “write a title/summary”.

WeLM's performance was tested on two public Chinese summarization datasets, and the results show that it can produce reasonable summaries. A lightly trained WeLM generates more diverse summaries, but may also receive lower ROUGE scores because of its different vocabulary choices.


A central challenge in the field of artificial intelligence is to develop sufficiently intelligent virtual assistant or chat companion systems.

The study found that WeLM can generate human conversation-like content in different styles based on prompts without any fine-tuning.

For example, in the example, WeLM can play two completely different characters: the famous ancient Chinese poet Li Bai and the modern American entrepreneur Elon Musk.

It even seamlessly incorporates the right background knowledge about each character. For Li Bai, it uses the places Li Bai visited and real historical events of his time to give a fascinating response. For Elon Musk, it draws on knowledge of self-driving and Shakespeare to give reasonable answers.

[Figures: example dialogues in which WeLM role-plays Li Bai and Elon Musk]

Text style transfer is an important task in natural language generation, and pre-trained large language models can perform it zero-shot. Through natural interaction, WeLM can enrich and expand a given passage, change the sentiment of a sentence, rewrite it with antonyms, and so on, according to user needs.

Sentence completion is the task most similar to the language modeling objective used in pre-training. The examples below show how WeLM can complete a given sentence and go on to generate long coherent texts in different styles.

[Figure: sentence completion examples generated by WeLM in different styles]

Multilingual assessment

Multilingual assessments include machine translation, cross-lingual question answering, and cross-lingual summarization. Experimental results show that WeLM performs well in zero-shot and one-shot settings and is not inferior to XGLM.

Machine translation is a classic subfield within NLP that studies how computer software can translate between different languages without human involvement. Although WeLM mainly uses Chinese text for pre-training, a large number of English and Japanese characters are also mixed in.

WeLM was evaluated in four translation directions: ZH2JA, JA2ZH, ZH2EN, and EN2ZH.

JA2ZH and EN2ZH perform significantly better than ZH2JA and ZH2EN, indicating that WeLM is better at understanding foreign languages than at generating them. Compared with XGLM, WeLM performs well in the two translation directions where the target language is Chinese.

WeLM performs poorly on the ZH2JA task due to the scarcity of Japanese text in the pre-training corpus. In practice, WeLM often makes grammatical errors or drifts away from the source sentence when generating long Japanese texts.

However, WeLM performed very well when translating Japanese and English into Chinese. Even the 1.3B version of WeLM significantly outperforms the 7.5B version of XGLM, despite having only about one-sixth as many parameters.

Cross-language Q&A refers to answering questions in different languages, which can help people better obtain information on the Internet. WeLM was tested on XQuAD and MLQA datasets and performed well.

What impact will it have on the pre-trained language model if context, questions, and answers are used in different languages?

The results show that phrasing the prompt in the model's primary language improves performance. WeLM outperforms XGLM in all cases and does better when the questions are in Chinese.

Cross-lingual text summarization aims to summarize input text into a different language. WeLM performs better than XGLM on the NCLS dataset, but performs worse in the zero-shot setting.

Code switching is when a speaker alternates between two or more languages or language varieties.

In modern Chinese, English or Japanese words often appear, so understanding code-switched text is a useful skill for many Chinese natural language processing tasks.

WeLM can correctly understand code-switched text. For example, in the dialogue generation and arbitrary style transfer examples, we modify a Chinese word to the corresponding English word, and WeLM can still understand the utterance and produce the correct response.

WeLM can also correctly handle text that mixes in Japanese and English, combining knowledge across languages. This is probably because the presence of multiple and mixed languages in the pre-training corpus pushed WeLM to exploit cross-language alignment to reduce its training loss.

Ablations show that the language mixing ratio affects performance: models pre-trained with the English ratio adjusted to 1% or 25% did not perform as well as the released model trained with 13% English and 87% Chinese.

On Chinese tasks, the model with 13% English text performed best, probably because English words often appear in Chinese data. On English tasks, too much Chinese data is not needed, and it is better to focus on absorbing English knowledge.


05

Multi-task prompt training

WeLM was further trained on a mixture of labeled datasets converted into manually written prompts, yielding the WePrompt model, which was then tested on tasks not included in the training phase. Explicit multi-task learning like this can adapt the unsupervised WeLM to different tasks.

The result is an improved version of the unsupervised WeLM that behaves more robustly under different prompts.

Training dataset

The creation of the training data set consists of two steps:

(1) Select a different set of labeled Chinese NLP tasks;

(2) Create multiple prompts, each with different wording for each task.

A prompt is a pattern that converts a labeled sample into a natural-language sentence.

Prompts were created by in-house annotators using the BigScience web-based GUI.

Annotators were instructed to write in an open, varied style so that the fine-tuned model becomes more robust to different prompt patterns.

An example of the prompt is shown below.

[Figure: an example hand-written prompt]

For the NLI task, prompts were created as a multiple-choice classification task on all three relations, or as a binary classification task on a single relation.
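For illustration, two hypothetical prompt templates of this kind for a single NLI example; the wording is invented here, not taken from the actual WePrompt templates.

```python
# Illustrative prompt templates for an NLI example (the wording is made up,
# not the actual WePrompt templates). Each template turns a labeled sample into text.
def nli_multiple_choice(premise: str, hypothesis: str) -> str:
    return (f"前提：{premise}\n假设：{hypothesis}\n"
            f"问题：根据前提，假设是成立、矛盾还是无法判断？\n答案：")

def nli_binary(premise: str, hypothesis: str) -> str:
    return (f"“{premise}”\n根据上面这句话，“{hypothesis}”是否成立？请回答“是”或“否”。\n答案：")

sample = {"premise": "一名男子正在公园里遛狗。", "hypothesis": "公园里有一只狗。", "label": "成立"}
print(nli_multiple_choice(sample["premise"], sample["hypothesis"]) + sample["label"])
print(nli_binary(sample["premise"], sample["hypothesis"]) + "是")
```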

A complete overview of all 76 tasks (76 tasks in 14 categories, with 1,227 hand-written prompts) is shown in the figure: the held-out datasets used for evaluation are shown in purple, and the remaining datasets (yellow) are used for training. All 76 tasks were double-checked to ensure they were not included in WeLM's pre-training corpus.

WePrompt excels in both zero-shot and fine-tuned performance, outperforming Ernie 3.0 Titan, a model roughly 23 times its size, on most tasks.

WePrompt is the model obtained by fine-tuning WeLM on this multi-task, multi-prompt training data.

In strong zero-shot evaluation, WePrompt excludes all tasks in the same category as the test data during training to test its generalization ability to new tasks.

The results show that WePrompt outperforms zero-shot WeLM and Ernie 3.0 Titan on most test datasets and produces fewer out-of-range answers. Multi-prompt training helps the model understand the general pattern of prompts.

In the examples, zero-shot WeLM fails to understand what the question is about, while strong zero-shot WePrompt understands the question correctly even when its answer is wrong.

In the weak zero-shot evaluation, only the specific task that the test data belongs to is excluded from WePrompt's training. The results show that weak zero-shot WePrompt performs better than strong zero-shot WePrompt on most tasks, but still does poorly on the PD and IFLYTEK tasks.

Officials believe this may be due to similarities between language modeling and fill-in-the-blank tasks.

In addition, WePrompt brings little improvement on closed-book question answering, because such question-answer content already appears frequently in the pre-training corpus.

06

Other competency assessments

The team also evaluated three other capabilities of WeLM:

  1. Explainability: whether WeLM can explain its decisions by generating explanations, and whether those explanations improve model performance.

  2. Self-calibration: whether WeLM can calibrate its predictions by being asked to judge whether a prediction is correct.

  3. Memorization: how well WeLM remembers content from the pre-training corpus, and how frequency of occurrence affects its memory.

Explainability

Interpretability is a very important property of deep neural networks; a lack of interpretability makes it difficult to trust their predictions.

Recent research has shown that large pre-trained language models can generate predictions and explanations given appropriate instructions.

[Figure: effect of adding explanation instructions on three tasks]

The above figure tests whether WeLM can produce reasonable explanations by adding instructions on three tasks, and finds that adding instructions can generally improve performance, but the degree of improvement is unstable and highly dependent on the task and the instructions provided.

On CMNLI, the 10B WeLM performs even worse when given additional instructions. On OCNLI, the 2.7B WeLM performs worse, but the other versions perform better. In the examples, WeLM can imitate the style given in the prompt and produce reasonable explanations for its predictions.

Self-calibration

Self-calibration means having the model check its own predictions.

For example, after the model gives a prediction, a follow-up prompt can ask it whether that prediction is correct. WeLM responds differently depending on whether the earlier prediction was right, so this approach can be used to check the model's behavior and reasoning.
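A rough sketch of what such a self-calibration prompt could look like; the wording and the stand-in completion function are illustrative, not WeLM's actual prompts or API.

```python
# Illustrative self-calibration prompt: after the model answers, ask it to judge
# whether its own answer is correct. welm_complete is a stand-in, not a real API.
def welm_complete(prompt: str) -> str:
    """Stand-in for a call to the language model (e.g. via the WeLM API)."""
    return "正确"          # canned reply so the sketch runs end-to-end

question = "中国的首都是哪座城市？"
answer = "北京"            # pretend this came from a first model call

calibration_prompt = (
    f"问题：{question}\n"
    f"模型的回答：{answer}\n"
    f"请判断上面的回答是否正确，只回答“正确”或“错误”。\n判断："
)
verdict = welm_complete(calibration_prompt)
print("self-calibration verdict:", verdict)
```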


WeLM can improve its predictions through self-calibration: it is able to distinguish its own correct predictions from incorrect ones, and it performs well at identifying text containing impolite words.

Memorization

Since WeLM is pre-trained on large-scale web content, its memorization was tested; the model can reproduce some training content, but the proportion is not high.

Larger models generally memorize more across all data sources. Common Crawl content makes up more than half of the training data, so WeLM remembers it better.
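One simple way to probe memorization is to prompt with the beginning of a training document and measure how closely the model's continuation matches the true continuation; the sketch below illustrates this with a stand-in generation function.

```python
# Illustrative memorization probe: feed the first part of a training document as
# a prompt and compare the model's continuation with the true continuation.
# model_generate is a stand-in for a real model or API call.
from difflib import SequenceMatcher

def model_generate(prompt: str, max_chars: int = 64) -> str:
    """Stand-in for the language model; returns a canned continuation."""
    return "这是模型生成的续写文本，用于演示比较逻辑。"[:max_chars]

def memorization_score(document: str, prompt_chars: int = 64) -> float:
    prompt = document[:prompt_chars]
    true_continuation = document[prompt_chars:prompt_chars + 64]
    generated = model_generate(prompt, max_chars=len(true_continuation))
    # A ratio of 1.0 would mean the continuation was reproduced verbatim (memorized).
    return SequenceMatcher(None, generated, true_continuation).ratio()

doc = "（某篇训练语料中的文档……）" * 20
print(f"similarity to true continuation: {memorization_score(doc):.2f}")
```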


Academic writing is the most difficult to memorize due to its low frequency and unique style in the training data. At the same time, the model can better remember text that appears more frequently.

For text that appears only once in the training data, the model can remember very little of it.

07

Summary

WeLM is a pre-trained language model for Chinese that can seamlessly perform different types of tasks with zero or only a few demonstrations.

It performs well on both monolingual (Chinese) and cross-lingual (Chinese-English/Japanese) tasks, outperforming existing pre-trained models of similar size.

The WeChat team converted a large collection of supervised Chinese datasets into human-written prompts and fine-tuned WeLM with multi-prompt training. The resulting model achieves strong generalization on unseen task types and outperforms the unsupervised WeLM in zero-shot settings.

WeLM also shows basic abilities to explain and calibrate its own decisions.

Recently, major vendors have all released their own large models, but there has been little news from Tencent. Digging around, I found this WeChat model, WeLM, which is very low-key. Not only is it low-key, it has not been updated for almost a year.

I personally wonder whether the large-model effort was shelved inside the "Goose Factory" (Tencent). Does anyone well-informed know the details?

References:

https://arxiv.org/abs/2209.10372

https://welm.weixin.qq.com/docs/tutorial/




Origin blog.csdn.net/fogdragon/article/details/132820302