Detailed explanation of the latest ChatGPT/GPT-4 text generation technology (with ipynb and Python source code and video explanation): DataWhale's open-source essential user guide for ChatGPT beginners, from 0 to 1 (3)


Foreword

Natural Language Generation (NLG) is an important research direction in the field of natural language processing. It refers to the process by which computers generate natural language text from various forms of input through models or algorithms. Most natural language processing tasks can be described as natural language generation tasks, or text generation tasks, and the applications are wide-ranging, for example text classification, text error correction, and intelligent question answering. This article introduces some common text generation tasks, including some that were not originally text generation tasks but can now be solved using NLG techniques.

Detailed explanation of the latest ChatGPT GPT-4 text generation technology

1 Introduction

  **Natural Language Generation (NLG)** is an important research direction in the field of natural language processing. It refers to the process by which computers generate natural language text from various forms of input, such as text and speech, through models or algorithms.

  We know that almost any knowledge can be described in text or language; that is how the wisdom of our ancestors was recorded in books and passed down from generation to generation. Therefore, the vast majority of natural language processing tasks can be described as natural language generation tasks, or even text generation tasks: take text as input and produce new text as output. This was also the original motivation of the **T5 model (Text-to-Text Transfer Transformer)**. For example, a text classification task can be framed as outputting a category name, such as cat/dog or yes/no; a text error correction task can be framed as taking erroneous text as input and outputting a corrected version; intelligent question answering can be framed as reasoning over background knowledge and a question, and outputting the corresponding answer.

  Text generation tasks therefore have quite wide applications. This article introduces some common text generation tasks, including some that did not originally belong to text generation but can now be solved using NLG technology.

# Install the necessary packages
# !pip install openai
# torch install command: https://pytorch.org/get-started/locally/
# !pip install torch==2.0.0+cpu torchvision==0.15.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu
# !pip install tokenizers==0.13.2
# !pip install transformers==4.27.4
# !pip install --no-binary=protobuf protobuf==3.20.1
# !pip install sentencepiece==0.1.97
# !pip install redlines
# !pip install tenacity==8.2.2
# Configure the OpenAI API key
import openai
OPENAI_API_KEY = "输入你的key"  # TODO: enter your key here
openai.api_key = OPENAI_API_KEY
# List the models supported by the API
models = openai.Model.list()
print([x.id for x in models.data])
['babbage', 'davinci', 'text-davinci-edit-001', 'babbage-code-search-code', 'text-similarity-babbage-001', 'code-davinci-edit-001', 'text-davinci-001', 'ada', 'babbage-code-search-text', 'babbage-similarity', 'code-search-babbage-text-001', 'text-curie-001', 'code-search-babbage-code-001', 'text-ada-001', 'text-embedding-ada-002', 'text-similarity-ada-001', 'curie-instruct-beta', 'ada-code-search-code', 'ada-similarity', 'code-search-ada-text-001', 'text-search-ada-query-001', 'davinci-search-document', 'ada-code-search-text', 'text-search-ada-doc-001', 'davinci-instruct-beta', 'gpt-3.5-turbo', 'text-similarity-curie-001', 'code-search-ada-code-001', 'ada-search-query', 'text-search-davinci-query-001', 'curie-search-query', 'gpt-3.5-turbo-0301', 'davinci-search-query', 'babbage-search-document', 'ada-search-document', 'text-search-curie-query-001', 'whisper-1', 'text-search-babbage-doc-001', 'curie-search-document', 'text-davinci-003', 'text-search-curie-doc-001', 'babbage-search-query', 'text-babbage-001', 'text-search-davinci-doc-001', 'text-search-babbage-query-001', 'curie-similarity', 'curie', 'text-similarity-davinci-001', 'text-davinci-002', 'davinci-similarity', 'cushman:2020-05-03', 'ada:2020-05-03', 'babbage:2020-05-03', 'curie:2020-05-03', 'davinci:2020-05-03', 'if-davinci-v2', 'if-curie-v2', 'if-davinci:3.0.0', 'davinci-if:3.0.0', 'davinci-instruct-beta:2.0.0', 'text-ada:001', 'text-davinci:001', 'text-curie:001', 'text-babbage:001', 'ada:ft-personal-2023-05-07-07-50-50', 'ada:ft-personal-2023-04-15-13-19-25', 'ada:ft-personal-2023-04-15-13-29-50']

  The model list contains more than 60 built-in models, as well as the user's own fine-tuned models, whose ids contain "ft-personal".

# To delete one of your own fine-tuned models, use the openai.Model.delete command.
openai.Model.delete('ada:ft-personal-2023-04-15-12-54-03')
<Model model id=ada:ft-personal-2023-04-15-12-54-03 at 0x20506421090> JSON: {
  "deleted": true,
  "id": "ada:ft-personal-2023-04-15-12-54-03",
  "object": "model"
}
models = openai.Model.list()
print([x.id for x in models.data if x.id.find('ft-personal') != -1])
['ada:ft-personal-2023-04-15-13-19-25', 'ada:ft-personal-2023-04-15-13-29-50']

2 Text Summarization Tasks

2.1 What is Text Summarization?

  The text summarization task refers to condensing the gist of an entire article into a short, refined text, so that users can roughly grasp the article's main content just by reading the summary.

2.2 Common Text Summarization Techniques

  In terms of implementation methods, text summarization tasks are mainly divided into the following three types:

  • Extractive summarization: extract existing sentences from the original document and use them as the summary (a toy sketch follows this list).
  • Compressive summarization: filter out redundant information in the original document and compress the remaining text into a summary.
  • Abstractive summarization: based on NLG technology, the algorithm or model generates a natural language description of the source document's content by itself.
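  As a toy illustration of the extractive approach mentioned in the first bullet, here is a minimal frequency-based sketch. It is purely illustrative and not from the original guide; a real system would use better sentence segmentation and scoring.

from collections import Counter
import re

def extractive_summary(text, num_sentences=1):
    # Split on sentence-ending punctuation (Chinese and English), keeping the delimiters
    sentences = [s for s in re.split(r'(?<=[。!?.!?])', text) if s.strip()]
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score a sentence by the average corpus frequency of its tokens
    def score(sent):
        toks = re.findall(r'\w+', sent.lower())
        return sum(freq[t] for t in toks) / (len(toks) + 1)
    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    # Keep the selected sentences in their original document order
    return ''.join(s for s in sentences if s in top)

print(extractive_summary("The cat sat on the mat. The cat is happy. Dogs bark loudly."))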

  The following is an example of text summarization based on the mT5 model (multilingual version of the T5 model).

**Note:** the downloaded model is relatively large; you can instead test it online via the Hosted Inference API on Hugging Face: https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum

import re
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")

WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

text = """自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望"""
text = WHITESPACE_HANDLER(text)
input_ids = tokenizer([text], return_tensors="pt", padding="max_length", truncation=True, max_length=512)["input_ids"]

# Generate the summary text
output_ids = model.generate(input_ids=input_ids, max_length=84, no_repeat_ngram_size=2, num_beams=4)[0]
output_text = tokenizer.decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print("原始文本: ", text)
print("摘要文本: ", output_text)
C:\Softwares\Programming\Python\Anaconda\envs\chatgpt\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
C:\Softwares\Programming\Python\Anaconda\envs\chatgpt\lib\site-packages\transformers\convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(


原始文本:  自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望
摘要文本:  自动信任协商(AI)是互信关系建立的最新研究工作的一部分。

2.3 Text summary experiment based on OpenAI interface

2.3.1 Easy-to-use version: call the pre-trained model

GPT 3.5

def summarize_text(text):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"请对以下文本进行总结,注意总结的凝炼性,将总结字数控制在20个字以内:\n{text}",
        temperature=0.3,
        max_tokens=500,
    )

    summarized_text = response.choices[0].text.strip()
    return summarized_text

text = """自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望。"""
output_text = summarize_text(text)
print("原始文本: ", text)
print("摘要文本: ", output_text)
print("摘要文本长度: ", len(output_text))
原始文本:  自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望。
摘要文本:  自动信任协商解决跨安全域信任建立问题,但面临多种安全威胁,需要分析攻击方式及防御措施。
摘要文本长度:  43

ChatGPT

def summarize_text(text):
    content = f"请对以下文本进行总结,注意总结的凝炼性,将总结字数控制在20个字以内:\n{text}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", 
        messages=[{"role": "user", "content": content}],
        temperature=0.3
    )
    summarized_text = response.get("choices")[0].get("message").get("content")
    return summarized_text

text = """自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望。"""
output_text = summarize_text(text)

print("原始文本: ", text)
print("摘要文本: ", output_text)
print("摘要文本长度: ", len(output_text))
# Note: ChatGPT cannot strictly enforce the summary length limit
原始文本:  自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望。
摘要文本:  自动信任协商解决跨域信任建立,但面临多方面安全威胁,需分类防御。研究不足,未来展望。
摘要文本长度:  42

2.3.2 Advanced optimized version: fine tune based on custom corpus

For data or tasks in vertical domains, directly using an LLM sometimes does not work well.

Of course, given ChatGPT's powerful built-in understanding ability, in some cases a better prompt can still achieve good results via zero-shot or few-shot prompting.

Here we briefly introduce how to fine-tune a model with a custom corpus. At present, OpenAI only exposes the fine-tune interface for four smaller models: ada, babbage, curie, and davinci; the largest, davinci, has about 175 billion parameters.

https://platform.openai.com/docs/guides/fine-tuning
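The fine-tuning interface consumes training data as a JSONL file, one {"prompt": ..., "completion": ...} object per line. A minimal sketch of writing such a file (the file name and example content are illustrative):

import json

# Each line pairs an input prompt with its target completion
examples = [{"prompt": "论文摘要正文 ->", "completion": " 论文标题\n"}]
with open("dataset/finetune_example.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")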

# List all fine-tune jobs
!openai api fine_tunes.list
{
  "data": [
    {
      "created_at": 1681558899,
      "fine_tuned_model": null,
      "hyperparams": {
        "batch_size": null,
        "learning_rate_multiplier": null,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-dsshfnyndpY14OqgnRya8ExI",
      "model": "davinci",
      "object": "fine-tune",
      "organization_id": "org-U35hu1wdD7w3HnkgJ5fdBW8m",
      "result_files": [],
      "status": "failed",
      "training_files": [
        {
          "bytes": 380384,
          "created_at": 1681558899,
          "filename": "./dataset/csl_summarize_finetune_prepared.jsonl",
          "id": "file-0akQ6d59yrShNHtrCrh93U4w",
          "object": "file",
          "purpose": "fine-tune",
          "status": "processed",
          "status_details": null
        }
      ],
      "updated_at": 1681558911,
      "validation_files": []
    },
    {
      "created_at": 1681559488,
      "fine_tuned_model": "ada:ft-personal-2023-04-15-11-57-25",
      "hyperparams": {
        "batch_size": 1,
        "learning_rate_multiplier": 0.1,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-lcIkh8dG2t1V4GAWWsFC644V",
      "model": "ada",
      "object": "fine-tune",
      "organization_id": "org-U35hu1wdD7w3HnkgJ5fdBW8m",
      "result_files": [
        {
          "bytes": 114680,
          "created_at": 1681559846,
          "filename": "compiled_results.csv",
          "id": "file-CwFfIH8HHXqpk3YCg0x1Yphn",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 380384,
          "created_at": 1681559487,
          "filename": "./dataset/csl_summarize_finetune_prepared.jsonl",
          "id": "file-0d35aGDx0Mn33tZ6x070HzmV",
          "object": "file",
          "purpose": "fine-tune",
          "status": "processed",
          "status_details": null
        }
      ],
      "updated_at": 1681559846,
      "validation_files": []
    },
    {
      "created_at": 1681562888,
      "fine_tuned_model": "ada:ft-personal-2023-04-15-12-54-03",
      "hyperparams": {
        "batch_size": 1,
        "learning_rate_multiplier": 0.1,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-cSvqpGrrohBdPPmE7oxR2Xy3",
      "model": "ada",
      "object": "fine-tune",
      "organization_id": "org-U35hu1wdD7w3HnkgJ5fdBW8m",
      "result_files": [
        {
          "bytes": 114651,
          "created_at": 1681563244,
          "filename": "compiled_results.csv",
          "id": "file-Zfbqegb2TiJztX9R30ikbiG1",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 380384,
          "created_at": 1681562888,
          "filename": "./dataset/csl_summarize_finetune_prepared.jsonl",
          "id": "file-Cet8LADkX8SiGnUtAcktsoZ7",
          "object": "file",
          "purpose": "fine-tune",
          "status": "processed",
          "status_details": null
        }
      ],
      "updated_at": 1681563245,
      "validation_files": []
    },
    {
      "created_at": 1681564395,
      "fine_tuned_model": "ada:ft-personal-2023-04-15-13-19-25",
      "hyperparams": {
        "batch_size": 1,
        "learning_rate_multiplier": 0.1,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-UzytubaVgNI8SAwqLuZba4T9",
      "model": "ada",
      "object": "fine-tune",
      "organization_id": "org-U35hu1wdD7w3HnkgJ5fdBW8m",
      "result_files": [
        {
          "bytes": 114523,
          "created_at": 1681564766,
          "filename": "compiled_results.csv",
          "id": "file-kq3Dhk5R95SI6taHP0ERWqkc",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 380384,
          "created_at": 1681564395,
          "filename": "./dataset/csl_summarize_finetune_prepared.jsonl",
          "id": "file-ajKeL2f93LaU09WRNj4kF6uV",
          "object": "file",
          "purpose": "fine-tune",
          "status": "processed",
          "status_details": null
        }
      ],
      "updated_at": 1681564767,
      "validation_files": []
    },
    {
      "created_at": 1681565036,
      "fine_tuned_model": "ada:ft-personal-2023-04-15-13-29-50",
      "hyperparams": {
        "batch_size": 1,
        "learning_rate_multiplier": 0.1,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-LoKi6mOxlkOtfZcZTrmivKDa",
      "model": "ada",
      "object": "fine-tune",
      "organization_id": "org-U35hu1wdD7w3HnkgJ5fdBW8m",
      "result_files": [
        {
          "bytes": 112280,
          "created_at": 1681565392,
          "filename": "compiled_results.csv",
          "id": "file-TTRfuuyBXuZ4BKwX9I4i2zA4",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 380384,
          "created_at": 1681565036,
          "filename": "./dataset/csl_summarize_finetune_prepared.jsonl",
          "id": "file-S3SIEZoJbqPXTGT16YxPThVO",
          "object": "file",
          "purpose": "fine-tune",
          "status": "processed",
          "status_details": null
        }
      ],
      "updated_at": 1681565392,
      "validation_files": []
    }
  ],
  "object": "list"
}

Dataset source: the CSL abstract dataset, consisting of abstracts and titles of papers in the computer science field, 3,500 records in total:

  • Title: average length 18 characters, standard deviation 4, maximum 41, minimum 6;
  • Text: average length 200 characters, standard deviation 63, maximum 631, minimum 41.

Data source address: the CSL summarization dataset in the project https://github.com/liucongg/GPT2-NewsTitle

import json
with open('dataset/csl_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
data[-1]
{'title': '自动信任协商中的攻击与防范',
 'content': '自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望。'}
import pandas as pd
df = pd.DataFrame(data)
df = df[['content', 'title']]
df.columns = ["prompt", "completion"]
df_train = df.iloc[:500]
df_train.head(5)
prompt completion
0 A new detail-preserving deformation algorithm is proposed, which can make the mesh model deform as rigidly as possible to reduce the distortion of geometric details in the deformation... Mesh Rigid Deformation Algorithm Preserving Detail
1 Real-time clothing animation generation technology can generate realistic clothing dynamic effects for three-dimensional virtual characters in real time, and is widely used in game entertainment, virtual clothing design... A real-time virtual human clothing animation method based on hybrid model
2 A face occlusion detection and removal method based on fuzzy principal component analysis (FPCA) is proposed. First, the occluded face is projected onto... Face Occlusion Area Detection and Reconstruction
3 Image matching technology has a wide range of application backgrounds in the fields of computer vision, remote sensing, and medical image analysis. For traditional correlation matching algorithms... An Image Matching Algorithm Based on Singular Value Decomposition
4 An anisotropic diffusion image denoising method based on slice similarity is proposed. Traditional anisotropic image denoising methods are based on a single pixel... Image Denoising by Slice Similarity Anisotropic Diffusion
df_train.to_json("dataset/csl_summarize_finetune.jsonl", orient='records', lines=True, force_ascii=False)
!openai tools fine_tunes.prepare_data -f dataset/csl_summarize_finetune.jsonl -q
Analyzing...

- Your file contains 500 prompt-completion pairs
- More than a third of your `prompt` column/key is uppercase. Uppercase prompts tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- More than a third of your `completion` column/key is uppercase. Uppercase completions tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples.
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Lowercase all your data in column/key `prompt` [Y/n]: Y
- [Recommended] Lowercase all your data in column/key `completion` [Y/n]: Y
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: Y
- [Recommended] Add a suffix ending `\n` to all completions [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `dataset/csl_summarize_finetune_prepared (1).jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "dataset/csl_summarize_finetune_prepared (1).jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 9.31 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
import os
os.environ.setdefault("OPENAI_API_KEY", OPENAI_API_KEY)
!openai api fine_tunes.create \
    -t "./dataset/csl_summarize_finetune_prepared.jsonl" \
    -m ada \
    --no_check_if_files_exist
Uploaded file from ./dataset/csl_summarize_finetune_prepared.jsonl: file-gPzuOBUizUDCGO7t0oDYoWQB


Upload progress:   0%|          | 0.00/380k [00:00<?, ?it/s]
Upload progress: 100%|██████████| 380k/380k [00:00<00:00, 239Mit/s]



Created fine-tune: ft-px9hve11l6YjizCQ8I6MyLCK
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-05-07 20:27:26] Created fine-tune: ft-px9hve11l6YjizCQ8I6MyLCK
[2023-05-07 20:27:45] Fine-tune costs $0.43
[2023-05-07 20:27:45] Fine-tune enqueued. Queue number: 0
[2023-05-07 20:27:46] Fine-tune started

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-px9hve11l6YjizCQ8I6MyLCK
# From the previous step's output we get the fine-tune job id (here ft-LoKi6mOxlkOtfZcZTrmivKDa);
# we can check the current progress with `get`.
# If the connection to OpenAI drops, we can re-queue and reconnect with `follow`:
# !openai api fine_tunes.follow -i ft-LoKi6mOxlkOtfZcZTrmivKDa
!openai api fine_tunes.get -i ft-LoKi6mOxlkOtfZcZTrmivKDa
{
  "created_at": 1681565036,
  "events": [
    {
      "created_at": 1681565036,
      "level": "info",
      "message": "Created fine-tune: ft-LoKi6mOxlkOtfZcZTrmivKDa",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565045,
      "level": "info",
      "message": "Fine-tune costs $0.43",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565045,
      "level": "info",
      "message": "Fine-tune enqueued. Queue number: 0",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565046,
      "level": "info",
      "message": "Fine-tune started",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565139,
      "level": "info",
      "message": "Completed epoch 1/4",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565216,
      "level": "info",
      "message": "Completed epoch 2/4",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565293,
      "level": "info",
      "message": "Completed epoch 3/4",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565369,
      "level": "info",
      "message": "Completed epoch 4/4",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565391,
      "level": "info",
      "message": "Uploaded model: ada:ft-personal-2023-04-15-13-29-50",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565392,
      "level": "info",
      "message": "Uploaded result file: file-TTRfuuyBXuZ4BKwX9I4i2zA4",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1681565392,
      "level": "info",
      "message": "Fine-tune succeeded",
      "object": "fine-tune-event"
    }
  ],
  "fine_tuned_model": "ada:ft-personal-2023-04-15-13-29-50",
  "hyperparams": {
    "batch_size": 1,
    "learning_rate_multiplier": 0.1,
    "n_epochs": 4,
    "prompt_loss_weight": 0.01
  },
  "id": "ft-LoKi6mOxlkOtfZcZTrmivKDa",
  "model": "ada",
  "object": "fine-tune",
  "organization_id": "org-U35hu1wdD7w3HnkgJ5fdBW8m",
  "result_files": [
    {
      "bytes": 112280,
      "created_at": 1681565392,
      "filename": "compiled_results.csv",
      "id": "file-TTRfuuyBXuZ4BKwX9I4i2zA4",
      "object": "file",
      "purpose": "fine-tune-results",
      "status": "processed",
      "status_details": null
    }
  ],
  "status": "succeeded",
  "training_files": [
    {
      "bytes": 380384,
      "created_at": 1681565036,
      "filename": "./dataset/csl_summarize_finetune_prepared.jsonl",
      "id": "file-S3SIEZoJbqPXTGT16YxPThVO",
      "object": "file",
      "purpose": "fine-tune",
      "status": "processed",
      "status_details": null
    }
  ],
  "updated_at": 1681565392,
  "validation_files": []
}
# Save the metrics recorded during the OpenAI fine-tune run
!openai api fine_tunes.results -i ft-cSvqpGrrohBdPPmE7oxR2Xy3 > dataset/metric.csv
def summarize_text(text, model_name):
    response = openai.Completion.create(
        engine=model_name,
        prompt=f"请对以下文本进行总结,注意总结的凝炼性,将总结字数控制在20个字以内:\n{text}",
        temperature=0.7,
        max_tokens=100,
    )

    summarized_text = response.choices[0].text.strip()
    return summarized_text

text = """自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望。"""
print("原始文本: ", text)
print("ada摘要文本: ", summarize_text(text, model_name='ada'))
print("ada fine-tune摘要文本: ", summarize_text(text, model_name='ada:ft-personal-2023-04-15-13-29-50'))
原始文本:  自动信任协商主要解决跨安全域的信任建立问题,使陌生实体通过反复的、双向的访问控制策略和数字证书的相互披露而逐步建立信任关系。由于信任建立的方式独特和应用环境复杂,自动信任协商面临多方面的安全威胁,针对协商的攻击大多超出常规防范措施所保护的范围,因此有必要对自动信任协商中的攻击手段进行专门分析。按攻击特点对自动信任协商中存在的各种攻击方式进行分类,并介绍了相应的防御措施,总结了当前研究工作的不足,对未来的研究进行了展望。
ada摘要文本:  因此,为了在未来进行研究,本次研究也许能给学术界其他学者带来建议,更多读者本次研究期间的能查
ada fine-tune摘要文本:  -> 分布式防御措施的自动信任协商

面向自动信任协商的防御措施研究

自动信任协商的攻击面临

  Due to cost and efficiency considerations, this experiment fine-tunes the Ada model. As you can see, the original Ada model does not grasp the text summarization task at all and merely generates a new piece of text continuing the input. After a simple fine-tune, although the generated summaries are still far from those of ChatGPT or other large models fine-tuned on this task, the model can already produce fairly good summaries to a certain extent.

If you need to continue fine-tuning an already fine-tuned model, simply set the -m parameter of fine_tunes.create to the fine-tuned model's name. For the case above:

!openai api fine_tunes.create \
    -t "./dataset/csl_summarize_finetune_prepared.jsonl" \
    -m ada:ft-personal-2023-04-15-13-29-50 \
    --no_check_if_files_exist
Uploaded file from ./dataset/csl_summarize_finetune_prepared.jsonl: file-adsjU97Wo9bPNmdAa1LTMkQC
Created fine-tune: ft-d6qvvl7cr6WYvkSOBu7YVO2p
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-05-07 15:44:48] Created fine-tune: ft-d6qvvl7cr6WYvkSOBu7YVO2p
[2023-05-07 15:45:03] Fine-tune costs $0.43
[2023-05-07 15:45:03] Fine-tune enqueued. Queue number: 0
[2023-05-07 15:45:05] Fine-tune started

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-d6qvvl7cr6WYvkSOBu7YVO2p




Upload progress:   0%|          | 0.00/380k [00:00<?, ?it/s]
Upload progress: 100%|██████████| 380k/380k [00:00<?, ?it/s]

3 Text Error Correction Tasks

3.1 What is text error correction?

  In daily life, whether in WeChat chats, Weibo posts, or even published books, we more or less run into typos in text.

  These typos may come from accent deviations during voice input, for example '飞机' (airplane) being recognized as '灰机'; from hitting adjacent keys during pinyin input (typing 'deji' instead of 'feiji') or picking the wrong candidate ('肥鸡', fat chicken); or from writing a visually similar character in handwriting input, such as '战栗' (shudder) written as '战粟'.

  Common error types include (illustrated here with English equivalents; the original examples are Chinese homophone and character confusions, so some distinctions are lost in translation):

  • Spelling errors: "Chinese course" -> "Chinese corse"
  • Grammar errors: "He went to a meeting yesterday" -> "He goed to a meeting yesterday"
  • Punctuation errors: "Hello, please advise!" -> "Hello, please advise???"
  • Knowledge errors: "Shanghai Huangpu (黄浦) District" -> "Shanghai Huangpu (黄埔) District"
  • Repetition errors: "Are you free today?" -> "Are you free today today?"
  • Omission errors: "He went to the meeting yesterday" -> "He went to meeting yesterday"
  • Word-order errors: "He went to the meeting yesterday" -> "Yesterday went he to the meeting"
  • Mixed-language errors: "He went to the conference yesterday" -> "He went to the huiyi yesterday"
  • ……

  In short, text errors can take all kinds of strange forms. For humans, who can rely on common sense and context, understanding the intended meaning is usually not difficult; at worst the errors slightly hurt the reading experience. But for certain downstream tasks, such as named entity recognition or intent recognition, an unprocessed erroneous input can lead to diametrically opposite recognition results.

  The text error correction task refers to detecting and correcting errors in text through natural language processing technology. It has become an important branch of the NLP field and is widely applied in search engines, machine translation, intelligent customer service, and other areas. Because of the sheer diversity of text errors, it is often hard to identify and correct every error, but if we can correctly identify as many errors as possible, the cost of manual review can be greatly reduced, which is already a win.

3.2 Common Text Error Correction Techniques

  Common text error correction technologies mainly include the following:

  1. Rule-based text error correction
  2. Language-model-based text error correction
  3. MLM-based text error correction
  4. NLG-based text error correction

3.2.1 Rule-based text error correction technology

  This technique checks the text for common spelling, grammar, and punctuation errors using predefined rules. For example, if '金字塔' (pyramid) is often mistyped as '金子塔', a mapping between the two is added to the database. Because this traditional approach requires a great deal of manual work and deep linguistic expertise, it is difficult to apply to massive text or more complex language errors.

3.2.2 Text error correction technology based on language model

  Language-model-based text error correction consists of two stages: error detection and error correction. This approach is relatively simple and crude: it is fast and highly scalable, but its accuracy is middling. A commonly used model is Kenlm.

  • Error detection: use the Chinese tokenizer jieba to segment the sentence, then combine suspected errors at both word granularity and character granularity to form a candidate set of suspected error positions.

  • Error correction: traverse the candidate set, replace the words at suspected error positions using sound-alike and shape-alike dictionaries, compute sentence perplexity with the language model, and finally compare and rank the results of all candidates to obtain the optimal correction. (A toy sketch of this loop follows below.)
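  A toy sketch of this detect-replace-rank loop, using a stand-in bigram counter in place of a real language model such as Kenlm (all names and data here are illustrative):

from collections import Counter

# Stand-in "language model": bigram counts from a tiny corpus;
# a real system would rank candidates by Kenlm perplexity instead
corpus = "组队学习课程 组队学习课程 参加课程"
bigrams = Counter(zip(corpus, corpus[1:]))

def lm_score(sentence):
    # Higher is better; a crude substitute for (negative) perplexity
    return sum(bigrams[(a, b)] for a, b in zip(sentence, sentence[1:]))

SOUND_ALIKE = {"乘": ["程", "城"]}  # toy sound-alike confusion set

def correct(sentence):
    best, best_score = sentence, lm_score(sentence)
    for i, ch in enumerate(sentence):
        for cand in SOUND_ALIKE.get(ch, []):
            candidate = sentence[:i] + cand + sentence[i+1:]
            score = lm_score(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best

print(correct("学习课乘"))  # -> 学习课程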

3.2.3 Text error correction technology based on MLM

  We know that BERT uses two tasks in its pre-training stage: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). In the MLM task, 15% of tokens are selected, and 10% of those (i.e., 15% × 10% of all tokens) are replaced with random other words, forcing the model to rely more on contextual information to predict the masked words; this gives the model a certain error correction ability.

  Therefore, by slightly modifying BERT's MLM task, taking erroneous tokens as input and the correct tokens as output, and doing a simple fine-tune, we can easily implement a text error correction function.

  For example, the Soft-Masked BERT model from ACL 2020 designs a dual network for text error correction: a "detection network" uses a Bi-GRU to estimate the probability that each character is erroneous, and a "correction network" tends to mask the characters with higher error probability and predict the true characters.
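  The core soft-masking step can be sketched as follows. This is a simplified illustration of the paper's idea, not the authors' code: the input embedding becomes a probability-weighted mix of the token embedding and the [MASK] embedding.

import torch

# Simplified soft masking from Soft-Masked BERT: e_i' = p_i * e_mask + (1 - p_i) * e_i,
# where p_i is the detection network's error probability for position i
def soft_mask(token_embeddings, mask_embedding, error_probs):
    # token_embeddings: (seq_len, hidden); mask_embedding: (hidden,); error_probs: (seq_len,)
    p = error_probs.unsqueeze(-1)  # (seq_len, 1)
    return p * mask_embedding + (1 - p) * token_embeddings

emb = torch.randn(5, 768)        # embeddings of a 5-character sentence
mask_emb = torch.randn(768)      # embedding of the [MASK] token
probs = torch.tensor([0.01, 0.02, 0.90, 0.05, 0.01])  # detection output: character 3 looks wrong
soft = soft_mask(emb, mask_emb, probs)  # character 3 is almost fully replaced by [MASK]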

3.2.4 NLG-based text error correction technology

  The masking approach above only works when input and output have the same length, but in practice the two often differ in length, for example with missing or extra characters. One possible solution is to attach a Transformer decoder behind the original BERT model, which turns "text error correction" into "translating wrong text into correct text". In this setup we can no longer guarantee that the output is exactly identical to the correct parts of the original text; the model may produce a new phrasing with the same meaning.
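  A hedged sketch of this seq2seq framing with Hugging Face transformers; the checkpoint name below is a placeholder (any encoder-decoder model fine-tuned on wrong-to-correct text pairs would do):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint: substitute a real seq2seq model fine-tuned on (wrong, correct) pairs
model_name = "your-seq2seq-correction-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("大家好,一起来参加组队学习课乘吧!", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))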

3.2.5 A text error correction toolset: pycorrector

  pycorrector is a text error correction toolset with built-in KenLM, MacBERT, Transformer and other text error correction models.

  • pycorrector project address: https://github.com/shibing624/pycorrector
  • An online demo based on MacBERT: https://huggingface.co/spaces/shibing624/pycorrector

  pycorrector can be invoked directly via `import pycorrector`, and it also provides pre-trained models callable through Hugging Face. A sketch of the direct invocation follows, and after it a MacBERT4CSC example based on Hugging Face.
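  For reference, the direct invocation looks roughly like this (pycorrector's classic `correct` interface; the API may differ in newer versions, so check the project README):

import pycorrector

corrected_sent, detail = pycorrector.correct('大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课乘吧!')
print(corrected_sent, detail)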

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the model
tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")

text = "大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课乘吧!"
input_ids = tokenizer([text], padding=True, return_tensors='pt')

# Generate the corrected text
with torch.no_grad():
    outputs = model(**input_ids)
output_ids = torch.argmax(outputs.logits, dim=-1)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True).replace(' ', '')

print("原始文本: ", text)
print("纠错文本: ", output_text)
原始文本:  大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课乘吧!
纠错文本:  大家好,一起来参加datawhale的《chatgpt使用指南》组队学习课程吧!
# Inspect the corrections
import operator
def get_errors(corrected_text, origin_text):
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # add unk word
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # pass english upper char
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details

correct_text, details = get_errors(output_text[:len(text)], text)
print(details)
[('乘', '程', 37, 38)]

3.3 Text error correction experiment based on OpenAI interface

def correct_text(text):
    content = f"请对以下文本进行文本纠错:\n{text}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", 
        messages=[{"role": "user", "content": content}]
    )
    corrected_text = response.get("choices")[0].get("message").get("content")
    return corrected_text

text = "大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课乘吧!"
output_text = correct_text(text)
print("原始文本: ", text)
print("纠错文本: ", output_text)
原始文本:  大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课乘吧!
纠错文本:  大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课程吧!
from redlines import Redlines
from IPython.display import display, Markdown

diff = Redlines(' '.join(list(text)),' '.join(list(output_text)))
display(Markdown(diff.output_markdown))

大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课 ~~乘~~ 程 吧! (rendered diff: 乘 struck through, 程 inserted)

4 Machine Translation Tasks

4.1 What is machine translation?

  Machine translation, also known as automatic translation, is the process of using a computer to convert one natural language (the source language) into another (the target language). According to incomplete statistics, there are about 7,000 languages in the world, giving roughly $7000^2$ language pairs, and these languages are full of phenomena such as polysemy and domain-specific knowledge. Therefore, how to use less labeled data, or even unsupervised methods, to make computers truly understand the input language and render it into the output language with faithfulness (信), expressiveness (达), and elegance (雅) has always been a research focus.

  As we all know, machine translation has long been a research direction attracting much attention in natural language processing, and it is one of the earliest tasks that NLP technology was applied to. Nowadays there is an endless stream of machine translation tools on the market, such as the commonly used Baidu Translate and Google Translate, and even the AI simultaneous interpretation that, when I was a child, appeared only in science fiction movies, such as iFlytek's simultaneous interpretation products. Roughly speaking, offerings can be divided into general domain (multilingual), vertical domain, terminology customization, domain adaptation, human-assisted adaptation, speech translation, and so on.

4.2 Common machine translation techniques

  From the perspective of the development of machine translation, it has mainly gone through the following stages:

  • rule-based approach
  • Statistical methods
  • Neural Network Based Approach

  Rule-based methods need to build various knowledge bases to describe the lexical, syntactic, and semantic knowledge of the source and target languages, and sometimes language-independent world knowledge as well.

  Statistical methods take the view that for a source sentence $R$, any target sentence $T$ may be its translation, only with higher or lower probability. For each word $r_i$ in the source language and each word $t_j$ in the target language, we estimate the probability that they align, and then obtain the maximum-probability alignment via an expectation-maximization procedure (such as the EM algorithm). This is the word-based translation model. Obviously, taking the word as the smallest translation unit ignores grammar, so phrase-based translation methods were later developed, which take a contiguous sequence of words as the smallest translation unit.
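  To make the word-based model concrete, the standard noisy-channel formulation (IBM Model 1 style, textbook background rather than a formula from the original guide) can be written in the notation above:

$$\hat{T} = \arg\max_{T} P(T \mid R) = \arg\max_{T} P(T)\,P(R \mid T)$$

$$P(R \mid T) \approx \frac{\epsilon}{(l_T + 1)^{l_R}} \prod_{i=1}^{l_R} \sum_{j=0}^{l_T} p(r_i \mid t_j)$$

  where $l_R$ and $l_T$ are the sentence lengths, $t_0$ is a special empty word, and the word-translation probabilities $p(r_i \mid t_j)$ are exactly what the EM algorithm estimates from parallel corpora.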

  In 2013, a new end-to-end encoder-decoder architecture for machine translation appeared, using a CNN to mine hidden representations and an RNN to convert the hidden vectors into the target language, marking the beginning of neural machine translation. Later, Attention, the Transformer, BERT, and other techniques were proposed one after another, greatly improving translation quality.

  The following is a simple example of machine translation based on transformers.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")

text = "大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课程吧!"

inputs = tokenizer(text, return_tensors="pt", )
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print('原始文本: ', text)
print('翻译文本: ', translated_sentence)
原始文本:  大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课程吧!
翻译文本:  Hey, guys, let's join the ChatGPT team at DataWhale.

4.3 Machine translation experiment based on OpenAI interface

4.3.1 Easy-to-use version: English translation of short texts

def translate_text(text):
    content = f"请将以下中文文本翻译成英文:\n{text}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", 
        messages=[{"role": "user", "content": content}]
    )
    translated_text = response.get("choices")[0].get("message").get("content")
    return translated_text

text_to_translate = "大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课程吧!"
translated_text = translate_text(text_to_translate)
print("原始文本: ", text_to_translate)
print("输出文本: ", translated_text)
原始文本:  大家好,一起来参加DataWhale的《ChatGPT使用指南》组队学习课程吧!
输出文本:  Hello everyone, let's join the team learning course of "ChatGPT User Guide" organized by DataWhale together!

  It can be seen that ChatGPT is noticeably better than the Helsinki-NLP model at this Chinese-to-English translation, and 《ChatGPT使用指南》 is translated more faithfully.

4.3.2 Advanced in-depth version: English translation of long books

Import the book

Data source: https://github.com/LouisScorpio/datamining/tree/master/tensorflow-program/nlp/word2vec/dataset

with open("dataset/哈利波特1-7英文原版.txt", "r") as f:
    text = f.read()
print('全书字符数: ', len(text))
# The whole series has about 6.35 million characters, but the ChatGPT API is billed by token count.
# A tokenizer splits text into tokens and maps them to numeric inputs, so we can simply use a tokenizer to count tokens.
全书字符数:  6350735
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # GPT-2's tokenizer is the same as GPT-3's
token_counts = len(tokenizer.encode(text))
print('全书token数: ', token_counts)

# The ChatGPT API costs about $0.01 per 1,000 tokens, so we can roughly estimate the cost of translating the whole series
translate_cost = 0.01 / 1000 * token_counts
print(f'翻译全书约需{translate_cost}美元')
Token indices sequence length is longer than the specified maximum sequence length for this model (1673251 > 1024). Running this sequence through the model will result in indexing errors


全书token数:  1673251
翻译全书约需16.73251美元
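As an aside, token counts for the chat models can also be computed with OpenAI's tiktoken library, which uses the models' actual encodings; a minimal sketch, assuming tiktoken is installed (pip install tiktoken):

import tiktoken

# encoding_for_model resolves the encoding actually used by the given model
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(enc.encode("大家好,一起来参加组队学习课程吧!")))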
# About 115.14 RMB to translate the whole series, a bit expensive; let's translate only the first book
end_idx = text.find('2.Harry Potter and The Chamber Of Secrets.txt')
text = text[:end_idx]
print('第一册字符数: ', len(text))

tokenizer = GPT2Tokenizer.from_pretrained("gpt2") 
token_counts = len(tokenizer.encode(text))
print('第一册token数: ', token_counts)

translate_cost = 0.01 / 1000 * token_counts
print(f'翻译第一册约需{translate_cost}美元')
第一册字符数:  442815


Token indices sequence length is longer than the specified maximum sequence length for this model (119873 > 1024). Running this sequence through the model will result in indexing errors


第一册token数:  119873
翻译第一册约需1.19873美元

GPT-3.5's context limit is about 4,096 tokens (GPT-4 is said to accept up to 32,000 tokens), so we cannot feed 120,000 tokens of text in directly.

We can use a simple approach: split the text into several chunks, translate each chunk, and finally stitch the results together.

First, each chunk should ideally be semantically coherent on its own: if a sentence is split down the middle into two chunks, the translation of the surrounding context may become ambiguous.

An intuitive idea is to treat each paragraph as a text block and translate one paragraph at a time.

However, the book has a great many paragraphs, and translating paragraph by paragraph would clearly hurt efficiency; at the same time, each paragraph alone carries little context, so the chance of translation errors rises.

paragraphs = text.split('\n')
print('段落数: ', len(paragraphs))

ntokens = []
for paragraph in paragraphs:
    ntokens.append(len(tokenizer.encode(paragraph)))
print('最长段落的token数: ', max(ntokens))
段落数:  3038
最长段落的token数:  275

Therefore, we choose a threshold, say 500 tokens, and keep appending paragraphs to the current text block; once adding another paragraph would push the total over the threshold, we close the block and start a new one.

def group_paragraphs(paragraphs, ntokens, max_len=1000):
    """
    Merge short paragraphs into text blocks to enrich the context,
    improve coherence, and improve efficiency.
    :param paragraphs: list of paragraphs
    :param ntokens: list of token counts per paragraph
    :param max_len: maximum number of tokens per text block
    :return: the assembled text blocks
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0

    # Process each paragraph in turn
    for paragraph, ntoken in zip(paragraphs, ntokens):
        if ntoken + cur_tokens + 1 > max_len:  # the '1' accounts for the '\n'
            # If adding this paragraph would exceed the threshold, start a new block
            batches.append(cur_batch)
            cur_batch = paragraph
            cur_tokens = ntoken
        else:
            # Otherwise append the paragraph to the current block
            cur_batch += "\n" + paragraph
            cur_tokens += (1 + ntoken)
    batches.append(cur_batch)  # record the last block
    return batches

batchs = group_paragraphs(paragraphs, ntokens, max_len=500)
print('文本块数: ', len(batchs))

new_tokens = []
for batch in batchs:
    new_tokens.append(len(tokenizer.encode(batch)))
print('最长文本块的token数: ', max(new_tokens))
文本块数:  256
最长文本块的token数:  500
# Show the first text block
print(batchs[0])
1.Harry Potter and the Sorcerer's Stone.txt

  Harry Potter and the Sorcerer's Stone
  CHAPTER ONE
  THE BOY WHO LIVED
  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
  Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.
  The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn't met for several years; in fact, Mrs. Dursley pretended she didn't have a sister, because her sister and her good-for-nothing husband were as unDursleyish as it was possible to be. The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street. The Dursleys knew that the Potters had a small son, too, but they had never even seen him. This boy was another good reason for keeping the Potters away; they didn't want Dudley mixing with a child like that.
  When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair.
# In practice, translating long text with ChatGPT is slow, so we switch to Davinci here; feel free to optimize
# Rate limits: https://platform.openai.com/docs/guides/rate-limits/overview
# def translate_text(text):
#     content = f"请将以下英文文本翻译成中文:\n{text}"
#     response = openai.ChatCompletion.create(
#         model="gpt-3.5-turbo", 
#         messages=[{"role": "user", "content": content}]
#     )
#     translated_text = response.get("choices")[0].get("message").get("content")
#     return translated_text
def translate_text(text):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"请将以下英文翻译成中文:\n{text}",
        max_tokens=2048
    )

    translate_text = response.choices[0].text.strip()
    return translate_text
print(translate_text(batchs[0]))
print(translate_text(batchs[0]))
欣欣夫妇对4号普里维特路的房子非常自豪,他们乐呵呵地表示,自己完全是一家正常家庭,也就不沾任何奇怪或神秘的事情。要是提到像这样的废话,他们实在是有点不屑一顾。欣欣先生当时正在格林宁公司任出品部主任,他长得又胖又壮,脖子很粗,只有一撮大胡子。欣欣太太瘦得非常,并且,出乎意料的是,她的脖子竟然站地比正常人都长,这样就非常省事,因为欣欣太太经常会在花园栅栏上俯瞰,窥探邻居的一举一动。欣欣夫妇有一个小儿子,叫达力,他们认为,没有比这孩子更好的了。

欣欣夫妇有很多东西,但其实他们也拥有一个秘密,最怕的是被有人发现。他们实在不敢想象,如果波特家来他们街上,会有什么样的下场……波特太太就是欣欣太太的妹妹,但是他们已经好几年没有见过面了。实际上,欣欣太太压根就假装没有妹妹,因为妹妹和他那没用的丈夫,实在是与欣欣家的一切都格格不入。同时,欣欣夫妇还知道波特家有一个小儿子,但是他们从来都没有见过。这孩子可就是欣欣夫妇珍爱达力的另一个绝好理由,不要让达力和他有接触。

当本故事发生的比较乏味的周二早晨,欣欣夫妇醒来,天空阴沉沉的,毫不暗示即将发生的奇异神秘的事情。欣欣先生嗯嗯哼哼地挑出今天上班穿的最没气质的领带,欣欣太太讨论着什么有趣的闲事,忙着把尖叫的达力抱进高脚高位椅里。

  Next, we translate each text block and combine the results.

from tqdm import tqdm
translated_batchs = []
# Sometimes the connection drops due to VPN issues etc. (a 443 timeout); keep a backup copy so we can resume from the interrupted batch
translated_batchs_bak = translated_batchs.copy()
cur_len = len(translated_batchs)
for i in tqdm(range(cur_len, len(batchs))):
    translated_batchs.append(translate_text(batchs[i]))
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [07:50<00:00, 58.79s/it]
# Another approach, following OpenAI's utility functions, is to add a retry mechanism that retries on failure
from tenacity import retry, stop_after_attempt, wait_random_exponential
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def translate_text(text):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"请将以下英文翻译成中文:\n{text}",
        temperature=0.3,
        max_tokens=2048
    )

    translate_text = response.choices[0].text.strip()
    return translate_text
for i in tqdm(range(len(batchs))):
    translated_batchs.append(translate_text(batchs[i]))
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [25:31<00:00, 76.55s/it]
# Save the results to a txt file
result = '\n'.join(translated_batchs)

with open('dataset/哈利波特1中文版翻译.txt','w', encoding='utf-8') as f:
    f.write(result)


References

ChatGPT User Guide: Text Generation @玉林


Other information download

If you want to continue learning about artificial intelligence learning routes and knowledge systems, you are welcome to read my other blog, "Heavy | Complete artificial intelligence AI learning: basic knowledge learning route, all materials can be downloaded directly from the network disk without paying attention to routines".
This blog references well-known open-source platforms on GitHub, AI technology platforms, and experts in related fields: Datawhale, ApacheCN, AI Youdao, Dr. Huang Haiguang, and others. There are about 100 GB of related materials, and I hope they help all of you.
