Representative pre-trained large models:
- ViT: Google
- Swin-Transformer: Microsoft
- V-MoE: Google
- SAM: Meta
- Pangu CV: Huawei
- Wenxin UFO: Baidu
- …
```python
# Import the required libraries and modules
import argparse
import json
import pathlib

# Prompt formats used by Stanford Alpaca
PROMPT_DICT = {
    "prompt_input": "...",  # prompt format containing both instruction and input
    "prompt_no_input": "...",  # prompt format containing only the instruction
}


# Main function
def main(args_param):
    data_path = pathlib.Path(args_param.data_path)
    with data_path.open() as f:
        data = json.load(f)

    # Build the new conversation-format data
    prompt_input = PROMPT_DICT["prompt_input"]
    prompt_no_input = PROMPT_DICT["prompt_no_input"]
    sources = [
        prompt_input.format_map(example)
        if example.get("input", "") != ""
        else prompt_no_input.format_map(example)
        for example in data
    ]
    targets = [example["output"] for example in data]

    new_data = []
    cnt = 1
    for s, t in zip(sources, targets):
        new_data.append(
            {
                "id": str(cnt),
                "conversations": [
                    {"from": "human", "value": s},
                    {"from": "gpt", "value": t},
                ],
            }
        )
        cnt += 1

    # Write the conversation-format data to the output file
    with open(args_param.output_path, "w") as f:
        json.dump(new_data, f, indent=2)


if __name__ == "__main__":
    # Parse command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default="alpaca-data.json")
    parser.add_argument("--output_path", type=str, default="alpaca-data-conversation.json")
    args = parser.parse_args()
    # Run the main function
    main(args)
```
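To make the transformation concrete, here is a minimal, self-contained sketch of what the script does to a single Alpaca record. The real Alpaca prompt templates are elided in the script above, so a shortened placeholder template is assumed here:

```python
# Hypothetical placeholder template; the real Alpaca template text is longer.
PROMPT_NO_INPUT = "Instruction: {instruction}\nResponse:"

# A single Alpaca-style record with an empty "input" field
example = {"instruction": "Name a primary color.", "input": "", "output": "Red."}

# The script turns it into one conversation entry: the formatted prompt
# becomes the human turn, and the original output becomes the gpt turn.
record = {
    "id": "1",
    "conversations": [
        {"from": "human", "value": PROMPT_NO_INPUT.format_map(example)},
        {"from": "gpt", "value": example["output"]},
    ],
}
```

Because `example["input"]` is empty, the `prompt_no_input` branch of the script would apply, exactly as sketched here.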
Process:
- Dataset preparation: First, collect large-scale text and image data as the pre-training set; more data generally yields better results. Commonly used datasets for large models include Wikipedia, book corpora, and online news.
- Data preprocessing: Preprocess the collected text data, e.g., tokenization and vocabulary construction.
- Model construction: Build the model's network structure; Transformer-class models are the most commonly used. Choose an appropriate encoder, such as BERT's encoder.
- Pre-training task design: Select suitable pre-training tasks so the model can learn from large-scale data in a self-supervised way. Common choices are Masked Language Modeling and Next Sentence Prediction.
- Model pre-training: Train the model on the pre-training data with the chosen tasks to obtain good parameters. Training is usually done on a large-scale cluster, and the resulting models can be very large.
- Fine-tuning: On a downstream task dataset, freeze part of the pre-trained parameters and fine-tune only the task-related ones, so the model adapts quickly to the new task.
- Model deployment: Choose an appropriate way to deploy the trained model for production applications such as text generation and text classification.
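The Masked Language Modeling task mentioned in the list above can be sketched in plain Python: randomly replace a fraction of tokens with a `[MASK]` placeholder and keep the originals as prediction targets. This is an illustrative sketch, not any library's actual masking routine (real implementations also use random-token and keep-as-is substitutions):

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK]; return masked tokens and labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels.append(tok)   # the model must predict the original token here
        else:
            masked.append(tok)
            labels.append(None)  # no prediction target at this position
    return masked, labels


tokens = "the model learns general language representations".split()
masked, labels = mask_tokens(tokens, mask_rate=0.5)
```

During pre-training, the model sees `masked` as input and is trained to recover the non-`None` entries of `labels`, which is what makes the task self-supervised: the targets come from the data itself, with no human annotation.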
Through pre-training, the model learns general language representations that can then be transferred to downstream tasks, improving performance and reducing the need for manual annotation. This is the end-to-end workflow of a pre-trained language model.
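The freeze-then-fine-tune idea from the process above can be sketched framework-agnostically: keep the pre-trained parameters fixed and update only the task-specific head. The parameter names and the toy SGD step below are illustrative, not any particular framework's API:

```python
# Toy parameter store: pre-trained "encoder" weights plus a newly added task head.
params = {"encoder.w": 0.5, "encoder.b": 0.1, "head.w": 0.0}

# Freeze the encoder: only head parameters are trainable.
trainable = {name for name in params if name.startswith("head.")}


def sgd_step(params, grads, lr=0.1):
    """Apply one gradient step, updating only trainable parameters."""
    for name, g in grads.items():
        if name in trainable:
            params[name] -= lr * g
    return params


# After one step, the frozen encoder weights are unchanged;
# only the task head moves toward the downstream objective.
grads = {"encoder.w": 1.0, "encoder.b": 1.0, "head.w": 1.0}
sgd_step(params, grads)
```

In real frameworks the same effect is achieved by marking parameters as non-trainable (e.g., disabling gradient tracking on them), but the principle is exactly this: gradients are applied only to the task-related subset.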