Natural Language Processing: An Introduction to Large Language Models

With the development of Natural Language Processing (NLP), the technology has been widely applied to tasks such as text classification, recognition and summarization, machine translation, information extraction, question answering, sentiment analysis, speech recognition, and text generation.
Researchers have found that scaling up a model improves its capabilities, which gave rise to the term Large Language Model (LLM): a large pre-trained language model (PLM) that typically contains hundreds of billions (or more) of parameters. One of the most notable advances built on LLMs is ChatGPT, a chatbot developed by OpenAI. In this blog, I will introduce the historical evolution, basic concepts, core techniques, and future prospects of large language models, and show how ChatGPT-style applications are built by calling the API.

The Historical Evolution of Language Models

A language model (LM) models the generation probability of word sequences in order to predict the probability of future or missing words. Its development can be divided into three main stages:

  • Statistical language model (SLM): builds a word prediction model using statistical learning methods (such as the Markov assumption) and predicts the next word from the most recent context; see the bigram sketch after this list.
  • Neural language model (NLM): describes the probability of a word sequence with a neural network (such as a recurrent neural network, RNN).
  • Large language model (LLM): researchers found that scaling up the model improves its capabilities. LLMs use the Transformer architecture to build large-scale language models and establish the "pre-training and fine-tuning" paradigm: pre-train on a large-scale corpus, then fine-tune the pre-trained language model to adapt it to different downstream tasks and improve performance.
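
To make the Markov assumption concrete, here is a minimal sketch of a bigram statistical language model (the corpus and function names are illustrative, not from any real system): it counts adjacent word pairs in a toy corpus and predicts the next word from only the single most recent word.

from collections import Counter, defaultdict

def train_bigram(corpus):
    # Count how often each word follows each context word.
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    # P(next | word) ~ count(word, next) / count(word, *); return the argmax.
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = ["the cat sat on the mat", "the cat ate the fish"]
counts = train_bigram(corpus)
print(predict_next(counts, "the"))  # -> "cat"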

Large Language Model Basics

Pre-training

Pre-training a model first requires high-quality training data, which often comes from web pages, books, dialogues, scientific literature, code, and so on. After collecting the data, it must be preprocessed, in particular to remove noisy, redundant, irrelevant, and potentially harmful content. A typical preprocessing pipeline is as follows (a minimal code sketch follows the list):

  • Quality filtering: remove low-quality data;
  • Deduplication: remove duplicate data;
  • Privacy redaction: remove data involving private information;
  • Tokenization: split the raw text into token sequences, which then serve as input to the large language model.
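
As a rough illustration of such a pipeline (the heuristics and function names here are illustrative assumptions, not the filters used by any particular LLM), the following sketch chains a toy quality filter, exact deduplication, email redaction, and whitespace tokenization:

import re

def quality_filter(docs, min_words=5):
    # Toy quality rule: keep documents with at least min_words words.
    return [d for d in docs if len(d.split()) >= min_words]

def deduplicate(docs):
    # Drop exact duplicates while preserving order.
    seen, unique = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique

def redact_privacy(docs):
    # Toy privacy rule: mask things that look like email addresses.
    return [re.sub(r"\S+@\S+", "[EMAIL]", d) for d in docs]

def tokenize(doc):
    # Real LLMs use subword tokenizers (e.g. BPE); whitespace is a stand-in.
    return doc.split()

docs = ["contact me at a@b.com about the project",
        "hi",
        "contact me at a@b.com about the project"]
clean = redact_privacy(deduplicate(quality_filter(docs)))
print([tokenize(d) for d in clean])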

At present, mainstream large language model architectures can be divided into three types: encoder-decoder, causal decoder, and prefix decoder, plus hybrid architectures built by extending these three (a small attention-mask sketch follows this list):

  • Encoder-decoder architecture: uses the traditional Transformer design, in which the encoder applies stacked multi-head self-attention layers to encode the input sequence and learn its latent representations, while the decoder performs cross-attention over these representations and autoregressively generates the target sequence. Only a few LLMs are built on this architecture, such as T5 and BART.
  • Causal decoder architecture: employs a unidirectional attention mask to ensure that each input token can only attend to past tokens and itself; input and output tokens are processed in the same way by the decoder. Models such as the GPT series, OPT, BLOOM, and Gopher are built on the causal decoder architecture, which is currently the most widely used.
  • Prefix decoder architecture: also known as the non-causal decoder architecture, it modifies the masking mechanism of the causal decoder to allow bidirectional attention over prefix tokens and unidirectional attention over generated tokens. In this way, like the encoder-decoder architecture, the prefix decoder can bidirectionally encode the prefix sequence and autoregressively predict output tokens one by one, with the same parameters shared between encoding and decoding. Representatives include GLM-130B and U-PaLM.
  • Hybrid architecture: extends the above three architectures with the mixture-of-experts (MoE) strategy, as in Switch Transformer and GLaM.
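
As a minimal sketch of the masking difference (using NumPy; the 0/1 convention and sizes are illustrative assumptions): a causal mask lets position i attend only to positions up to i, while a prefix mask additionally opens full bidirectional attention inside the prefix.

import numpy as np

def causal_mask(n):
    # 1 means "may attend": position i sees positions 0..i only.
    return np.tril(np.ones((n, n), dtype=int))

def prefix_mask(n, prefix_len):
    # Start from the causal mask, then let prefix tokens attend
    # bidirectionally to the whole prefix.
    m = causal_mask(n)
    m[:prefix_len, :prefix_len] = 1
    return m

print(causal_mask(4))
print(prefix_mask(4, prefix_len=2))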

Fine-Tuning

To adapt a large language model to a specific task, techniques such as instruction tuning and alignment tuning can be used. Since a large language model contains a huge number of parameters, full-parameter fine-tuning incurs substantial overhead; parameter-efficient fine-tuning methods include Adapter Tuning, Prefix Tuning, Prompt Tuning, and Low-Rank Adaptation (LoRA). These methods are not covered in detail here; interested readers can consult the relevant literature. A minimal LoRA sketch follows.
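
To give a flavor of one such method, here is a minimal sketch of the LoRA idea in plain NumPy (the shapes, rank, and names are illustrative assumptions, not a real implementation): the pre-trained weight W is frozen, and only a low-rank update B @ A is trained, so the effective weight is W + B @ A with far fewer trainable parameters.

import numpy as np

d, r = 512, 8                      # hidden size and LoRA rank, r << d
W = np.random.randn(d, d)          # frozen pre-trained weight
A = np.random.randn(r, d) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # trainable; zero init keeps W + B @ A == W at the start

def lora_forward(x):
    # During fine-tuning only A and B receive gradients; W stays frozen.
    return x @ (W + B @ A).T

x = np.random.randn(1, d)
print(lora_forward(x).shape)                              # (1, 512)
print(f"trainable params: {A.size + B.size} vs full: {W.size}")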

Instruction Tuning

Instruction tuning is supervised fine-tuning on mixed multi-task datasets described in natural language, enabling large language models to better complete downstream tasks and generalize better. This process involves parameter updates. A sketch of an instruction-formatted training sample follows.
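
For concreteness, an instruction-tuning sample typically pairs a natural-language instruction (plus optional input) with a target output; the field names below are a common convention used for illustration, not a fixed standard:

# One instruction-tuning example; the model is trained to produce "output"
# given "instruction" and "input" rendered into a single prompt.
sample = {
    "instruction": "Translate the sentence into French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}

prompt = f"Instruction: {sample['instruction']}\nInput: {sample['input']}\nResponse:"
print(prompt, sample["output"])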

Alignment Tuning

Alignment tuning aims to align the behavior of LLMs with human values or preferences. It requires collecting high-quality feedback data from human annotators (who may need a certain education level or even meet specific academic requirements), and then using this data to fine-tune the model. The typical technique is Reinforcement Learning from Human Feedback (RLHF).

To make large language models consistent with human values, researchers proposed reinforcement learning from human feedback (RLHF): collected human feedback data is combined with reinforcement learning to fine-tune the LLM, which helps improve the model's helpfulness, honesty, and harmlessness. RLHF learns a reward model from human feedback and then adapts the LLM with a reinforcement learning (RL) algorithm such as Proximal Policy Optimization (PPO). A sketch of the reward model's pairwise loss follows.
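
As a minimal sketch of one RLHF ingredient (the pairwise ranking loss commonly used to train the reward model; the scores are stand-in numbers, not from any real model), the reward model is trained so that the response humans preferred scores higher than the rejected one:

import math

def reward_pair_loss(score_chosen, score_rejected):
    # Pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    # Minimized when the chosen response outscores the rejected one.
    return -math.log(1 / (1 + math.exp(-(score_chosen - score_rejected))))

print(reward_pair_loss(2.0, 0.5))  # small loss: ranking is correct
print(reward_pair_loss(0.5, 2.0))  # large loss: ranking is inverted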

Prompt

To make a language model complete specific tasks, a prompt is added to the model's input so that the model produces the expected result, or is guided toward a better one. Note that, unlike fine-tuning, prompting requires no additional training or parameter updates.

In-context Learning

In-context learning (ICL) was formally introduced with GPT-3. Its key idea is learning from analogy: the query question is combined with a contextual prompt (a few relevant examples) to form a prompted input, which is fed into the language model for prediction. A few-shot sketch follows.
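
A minimal few-shot sketch (the task and examples are made up for illustration): a couple of solved examples are prepended so the model can infer the pattern for the new query without any parameter update.

# Few-shot (in-context) prompt: two demonstrations, then the real query.
prompt = """Classify the sentiment as Positive or Negative.

Review: The food was delicious and the staff were friendly.
Sentiment: Positive

Review: I waited an hour and my order was wrong.
Sentiment: Negative

Review: The movie kept me on the edge of my seat.
Sentiment:"""
# Send `prompt` to the model, e.g. via get_completion(prompt) defined below.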

Chain-of-thought

Chain-of-thought (CoT) is an improved prompting strategy aimed at boosting LLM performance on complex reasoning tasks such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. It works by incorporating intermediate reasoning steps into the prompt, guiding the model toward the correct result. According to related papers, this ability may be acquired through training on code. A CoT prompt sketch follows.
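
A minimal CoT sketch (the worked example is invented for illustration): the demonstration spells out its intermediate steps, so the model imitates step-by-step reasoning before giving the final answer.

# Chain-of-thought prompt: the demonstration shows its reasoning steps.
prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. It used 20 and bought 6 more. How many apples are there now?
A:"""
# The model is expected to reason "23 - 20 = 3, 3 + 6 = 9" before answering 9.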

Prompt Development (Calling the ChatGPT API)

ChatGPT is a chat application built by OpenAI on top of its large language models; in essence, it calls the same models that the ChatGPT API exposes. The following demonstrates using the ChatGPT API to complete a summarization task. It can also handle reasoning, translation, question answering, proofreading, expansion, and other tasks, sometimes using ICL or CoT to obtain better results (provided you have obtained an API key from the OpenAI website).

import openai
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read OPENAI_API_KEY from a local .env file
openai.api_key = os.getenv("OPENAI_API_KEY")

def get_completion(prompt, temperature=0, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # randomness of the model, a float in [0, 1]: higher values mean more random, more "creative" output
    )
    return response.choices[0].message["content"]

text = """
XXXXXXXX
"""
prompt = f"""
Summarize the text delimited by triple backticks into a single sentence.
```{text}```
"""
response = get_completion(prompt)
print(response)

ChatGPT's web interface, or any chatbot built on the API, typically uses messages with three roles: user messages, assistant (ChatGPT/chatbot) messages, and system messages. Take building an "ordering bot" as an example:

  • system messages: set the bot's behavior and persona, acting as a high-level instruction that guides the conversation; generally invisible to the user;
  • user messages: the user's input;
  • assistant messages: the bot's replies.

The code example is as follows:

import openai
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
openai.api_key = os.getenv("OPENAI_API_KEY")

def get_completion_from_messages(messages, temperature=0, model="gpt-3.5-turbo"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # randomness of the model, a float in [0, 1]: higher values mean more random, more "creative" output
    )
    return response.choices[0].message["content"]

messages = [
    {"role": "system",
     "content": "You are an ordering bot; collect the user's order based on the menu. Menu: hamburger, fries, fried chicken, cola, Sprite."},
    {"role": "user",
     "content": "Hi, I'd like a hamburger."},
    {"role": "assistant",
     "content": "Anything else?"},
    {"role": "user",
     "content": "I'll also take a cola."},
]

response = get_completion_from_messages(messages)
print(response)
# Example output:
# OK, one hamburger and one cola. Your order has been placed.

Building on the code above, human-computer interaction can be implemented with a GUI or web interface, and the chatbot's behavior and persona can be changed by modifying the system message.

Future Prospects for Large Language Models

  • Larger scale: model size may continue to increase, improving expressiveness and language understanding.
  • Better pre-training: improved pre-training strategies can help models better understand semantics and context, and transfer better across tasks.
  • Better fine-tuning: more efficient fine-tuning methods can achieve better performance on specific tasks.
  • Multimodality: combining language models with other modalities such as vision and sound enables cross-domain multimodal applications.
  • Tools: using external tools such as search engines, calculators, and compilers improves performance in specific domains.

References

A Survey of Large Language Models
