General Vision Large Models

ViT: Google
Swin-Transformer: Microsoft
V-MOE: Google
SAM: Meta
Pangu CV: Huawei
Wenxin UFO: Baidu

Pre-trained Large Models

The script below converts Stanford Alpaca instruction data into a two-turn conversation format:

# Import the required libraries
import argparse
import json
import pathlib

# Prompt templates used by Stanford Alpaca (the full template strings are elided here)
PROMPT_DICT = {
    "prompt_input": "...",  # prompt format for examples with both an instruction and an input
    "prompt_no_input": "...",  # prompt format for examples with an instruction only
}

# Main function: convert Alpaca-style records into a conversation format
def main(args_param):
    data_path = pathlib.Path(args_param.data_path)
    with data_path.open() as f:
        data = json.load(f)

    prompt_input = PROMPT_DICT["prompt_input"]
    prompt_no_input = PROMPT_DICT["prompt_no_input"]

    # Build the source prompts and the target outputs
    sources = [
        prompt_input.format_map(example)
        if example.get("input", "") != ""
        else prompt_no_input.format_map(example)
        for example in data
    ]
    targets = [example["output"] for example in data]

    # Repack each (source, target) pair as a two-turn conversation
    new_data = []
    cnt = 1
    for s, t in zip(sources, targets):
        new_data.append(
            {
                "id": str(cnt),
                "conversations": [
                    {"from": "human", "value": s},
                    {"from": "gpt", "value": t},
                ],
            }
        )
        cnt += 1

    # Write the conversation-format data to the output file
    with open(args_param.output_path, "w") as f:
        json.dump(new_data, f, indent=2)

if __name__ == "__main__":
    # Parse command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default="alpaca-data.json")
    parser.add_argument("--output_path", type=str, default="alpaca-data-conversation.json")
    args = parser.parse_args()
    # Run the main function
    main(args)
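A usage note: if the script is saved as convert_alpaca.py (a name assumed here, not given in the post), running python convert_alpaca.py --data_path alpaca-data.json --output_path alpaca-data-conversation.json reads the Alpaca records and writes alpaca-data-conversation.json, where each record carries an "id" and a two-turn "conversations" list with "human" and "gpt" roles.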

The overall process of building a pre-trained language model is as follows:

  1. Dataset preparation: First, collect large-scale text and image data as the pre-training corpus; larger datasets generally give better results. Commonly used corpora for large models include Wikipedia, book collections, and online news.

  2. Data preprocessing: Preprocess the collected text data, for example by tokenizing it and building a vocabulary (a minimal tokenization sketch follows this list).

  3. Model construction: Build the network structure of the model; Transformer-based models are the most commonly used. Choose an appropriate encoder, such as BERT's encoder.

  4. Pre-training task design: Select an appropriate pre-training task so the model can learn from the large dataset in a self-supervised way. Common choices are Masked Language Modeling and Next Sentence Prediction (a Transformer-encoder and masking sketch follows this list).

  5. Model pre-training: Train the model on the pre-training data with the chosen pre-training task to obtain good parameters. Because the models are very large, training is generally run on a large-scale cluster.

  6. Fine-tuning: On the downstream task dataset, freeze part of the pre-trained parameters and fine-tune only the task-related parameters, so the model adapts quickly to the new task (a parameter-freezing sketch follows this list).

  7. Model deployment: Choose an appropriate way to deploy the trained model to production applications such as text generation and text classification (a save-and-serve sketch follows this list).

Through pre-training, the model learns general language representations that can then be transferred to downstream tasks, improving performance and reducing the need for manual annotation. This is the full pipeline of a pre-trained language model.
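For step 2, here is a minimal sketch of whitespace tokenization and vocabulary building in plain Python. The toy corpus, special tokens, and lowercasing rule are illustrative assumptions, not the exact preprocessing used in the post:

from collections import Counter

# Toy corpus standing in for the pre-training text data
corpus = [
    "large models learn general language representations",
    "pre-training uses large scale text data",
]

# Tokenize: here simply lowercase and split on whitespace
tokenized = [sentence.lower().split() for sentence in corpus]

# Build the vocabulary: special tokens first, then words by frequency
counter = Counter(token for sentence in tokenized for token in sentence)
specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
vocab = {tok: idx for idx, tok in enumerate(specials)}
for word, _ in counter.most_common():
    vocab[word] = len(vocab)

# Convert a sentence to token ids, mapping unknown words to [UNK]
def encode(sentence):
    return [vocab.get(tok, vocab["[UNK]"]) for tok in sentence.lower().split()]

print(encode("large models use text data"))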
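For steps 3 to 5, here is a minimal sketch of a small Transformer encoder trained with a Masked Language Modeling objective, written with PyTorch. The model sizes, masking rate, mask token id, and optimizer settings are illustrative assumptions, not the configuration described in the post:

import torch
import torch.nn as nn

vocab_size, d_model, max_len = 30522, 256, 128  # assumed sizes

# A small Transformer encoder with an embedding layer and an MLM head
class TinyMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        h = self.embed(ids) + self.pos(positions)
        return self.mlm_head(self.encoder(h))

model = TinyMLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks unmasked positions

# One self-supervised training step on a random toy batch
ids = torch.randint(5, vocab_size, (8, max_len))            # fake token ids
mask = torch.rand(ids.shape) < 0.15                         # mask about 15% of positions
labels = torch.where(mask, ids, torch.full_like(ids, -100)) # predict only masked tokens
inputs = ids.masked_fill(mask, 4)                           # 4 = assumed [MASK] id

optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))

In a real setup this step would run over many batches on a cluster; the sketch only shows the shape of the objective.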
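For step 6, here is a minimal sketch of freezing pre-trained parameters and fine-tuning only a new task head, continuing from the previous PyTorch sketch (it reuses TinyMLM, model, d_model, vocab_size, and max_len defined there; the class count and learning rate are illustrative assumptions):

import torch
import torch.nn as nn

num_classes = 2  # assumed number of downstream labels

# Wrap the pre-trained encoder with a new classification head
class Classifier(nn.Module):
    def __init__(self, pretrained):
        super().__init__()
        self.backbone = pretrained           # pre-trained TinyMLM from the previous sketch
        self.cls_head = nn.Linear(d_model, num_classes)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        h = self.backbone.embed(ids) + self.backbone.pos(positions)
        h = self.backbone.encoder(h)
        return self.cls_head(h[:, 0])        # use the first token's representation

clf = Classifier(model)

# Freeze all pre-trained parameters; only the new head stays trainable
for p in clf.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in clf.parameters() if p.requires_grad], lr=1e-3
)

# One fine-tuning step on a toy labeled batch
ids = torch.randint(5, vocab_size, (8, max_len))
labels = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = nn.functional.cross_entropy(clf(ids), labels)
loss.backward()
optimizer.step()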
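For step 7, here is a minimal save-and-serve sketch, again continuing from the previous sketches (the file name classifier.pt and the toy input are assumptions; a production deployment would add batching, an API layer, and proper tokenization):

import torch

# Save the fine-tuned weights, then reload them for serving
torch.save(clf.state_dict(), "classifier.pt")

serving_model = Classifier(TinyMLM())
serving_model.load_state_dict(torch.load("classifier.pt"))
serving_model.eval()

# Minimal inference helper for a production-style call
@torch.no_grad()
def predict(token_ids):
    logits = serving_model(torch.tensor([token_ids]))
    return int(logits.argmax(dim=-1))

print(predict(list(range(5, 5 + max_len))))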
    


Origin blog.csdn.net/weixin_44659309/article/details/132161473