Building question-answer data pairs for your own unstructured datasets with large models

Before InstructGPT came out, the input for text generation was only the raw text. After InstructGPT appeared, we need to do a feature-enrichment project: improving the performance of text generation tasks through instruction-style feature engineering. If the goal is just question answering, there is no need for such a large model; a typical question and its answer can be handled within 1024 tokens. Have you ever seen a conversation that goes on forever? I think it is better to view today's generative language models as a series of text generation tasks. The tasks I am working on include a patent text interpretation task built with a sliding window over the source text, and a professional medical interaction task built from drug instructions. Interacting with generative language models, I found that the greatest strength of 5-10B parameter text generation models is that models with fewer parameters produce more accurate, clear, and structured results. But generalization ability and hallucination in generation tasks are a trade-off: with fewer hallucinations, generalization inevitably suffers; with too many hallucinations, accuracy inevitably suffers.
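The sliding-window construction mentioned above can be sketched roughly as follows; the `window` and `stride` values are illustrative assumptions, not the settings used in the actual task.

```python
def sliding_windows(text, window=512, stride=256):
    """Split a long document into overlapping character-level chunks."""
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Overlapping windows keep sentences that straddle a chunk boundary visible in at least one chunk.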

Consider data with the following structure, produced by having a large generative language model interpret and interact with civil aviation patents. The first column is the original patent text, the second column is the questions generated from the patent text plus an instruction, and the third column is the explanation generated from the text and the question. Civil aviation patents are just the example here. I am even wondering whether to build a three-stage, multi-step generation system: no training at all, or training only after the data has been curated. And how to curate that data, and how to bring a group of people together to build an effective instruction dataset of multi-segment samples within 2048 tokens, is the crucial question.
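A minimal sketch of that three-column structure; the column names match the code below, but the sample contents are made up for illustration and are not taken from the real dataset.

```python
# Each record pairs a source passage with a generated question and its answer.
COLUMNS = ["knowledge", "question", "document_answer"]

sample_row = {
    "knowledge": "A wing de-icing system circulates heated bleed air along the leading edge.",
    "question": "How does the described system prevent ice from accumulating on the wing?",
    "document_answer": "It routes heated bleed air through channels along the wing's leading edge.",
}
```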

import pandas as pd

# `model`, `tokenizer`, and `torch_gc` are assumed to be defined by the
# surrounding model-loading code (a ChatGLM-style `chat` interface),
# which is not shown here.
output = []


def main():
    maxlen = 386  # maximum characters per knowledge chunk
    # autodl-tmp/人工整理文本 (manually curated source text)
    dataset = pd.read_excel("../dataset/summary.xlsx", engine='openpyxl')
    out_list = []
    for row in dataset.values.tolist():
        for cell in row:
            if not isinstance(cell, str):
                continue
            out = ""
            for sentence in cell.split("。"):
                if len(sentence) > maxlen:
                    # A single sentence longer than maxlen is dropped;
                    # flush whatever has accumulated so far.
                    out_list.append(out)
                    out = ""
                    continue
                if len(out + sentence + "。") > maxlen:
                    out_list.append(out)
                    out = sentence + "。"
                else:
                    out += sentence + "。"
            if out:
                out_list.append(out)
    for query in list(set(out_list)):
        print(query)
        # Prompt: "Raise a few questions about the following content; do not answer them."
        response, _ = model.chat(tokenizer, "面对以下内容提出几个问题,不需要给出答案," + query, history=[])
        for response_one in response.split("\n"):
            if response_one.endswith("?"):
                # Prompt: "Given the following content ..., answer the question ..."
                response, _ = model.chat(tokenizer, "面对以下内容" + query + "。给出问题" + response_one + "的答案。",
                                         history=[])
                output.append({"knowledge": query, "question": response_one, "document_answer": response})
                pd.DataFrame(output).to_excel("knowledge_question_answer.1111.xlsx")
                torch_gc()
        torch_gc()


if __name__ == "__main__":
    main()

Therefore, what traps the big models is the upper-level product architecture built from instructions. Before large models became popular, I also trained many billion-parameter models, some of which went beyond SOTA. What does it mean to be market-oriented? It means a set of valuable, differentiated text generation scenarios. After training, I also measured some metrics; that is scientific research. But what we need to survive on is an effective market for natural language processing applications. For short input, we need methods to lengthen it effectively; for long input, we need methods to remove short sequences that contribute nothing meaningful. This is really about dataset construction. You can use ChatGPT to construct the dataset, and after it is built, editing the dataset is what adds the differences between a vertical-domain generative language model and a general-domain one. I have always believed that these general-purpose large models are the upper-level embodiment of the same set of instruction product designs. If we work out a new set of instruction product designs, we may dig out a generative language model with more market value.
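One simple way to realize the "remove sequences that contribute nothing" idea is a length filter; the thresholds below are assumptions for illustration, not tuned values.

```python
def filter_segments(segments, min_len=20, max_len=2048):
    """Keep segments long enough to carry content and short enough for the context window."""
    return [s.strip() for s in segments if min_len <= len(s.strip()) <= max_len]
```

In practice a content-based score (e.g. keyword overlap with the domain vocabulary) could replace raw length, but length alone already removes most empty fragments.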

GitHub - ymcui/Chinese-LLaMA-Alpaca: Chinese LLaMA & Alpaca large language models with local CPU/GPU deployment (github.com/ymcui/Chinese-LLaMA-Alpaca)

For example, this work shows how to extend the vocabulary and continue pre-training the model. There is also work that continues pre-training BLOOM with a pruned vocabulary.
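The vocabulary-extension step can be illustrated with a toy version: add new tokens and grow the embedding table to match (in Hugging Face Transformers this corresponds to `tokenizer.add_tokens` followed by `model.resize_token_embeddings`). The zero initialization here is a simplification; in practice new rows are often initialized from the mean of the existing embeddings.

```python
def extend_vocab(vocab, embeddings, new_tokens, dim):
    """Add unseen tokens to the vocab and append a matching embedding row for each."""
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            embeddings.append([0.0] * dim)  # simplified init; often the mean of old rows
    return vocab, embeddings

vocab = {"<unk>": 0, "aircraft": 1}
emb = [[0.1, 0.2], [0.3, 0.4]]
vocab, emb = extend_vocab(vocab, emb, ["机翼", "aircraft"], dim=2)
```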

The value of a professional text generation task depends on its context and target audience. Here are some professional text generation tasks that may be valuable:

1. Academic papers: Academic papers are among the most common targets of professional text generation. These texts require precise, clear, and structured writing, so generating high-quality academic text helps ensure accuracy and readability.

2. Business reports: A business report describes a company's financial performance, market analysis, competitive strategy, and so on. These texts demand a high degree of accuracy and logical rigor, so a specialized text generation task can help ensure the generated reports meet those requirements.

3. Technical documentation: Technical documentation describes the usage and functionality of software, tools, and technologies. These texts usually require thorough, precise writing so that readers can follow and apply them.

4. Legal documents: Legal documents cover legal texts such as contracts, patents, and trademarks. These texts must be written clearly, concisely, and with structure so that they remain readable and comply with legal requirements.

5. Product descriptions: A product description covers the features, functions, and usage of a product. These texts must be detailed, clear, and understandable in order to attract readers and meet the requirements of a product description.

Note that different professional text generation tasks may require different text structures and language styles, so the task and the generation tool should be chosen according to the specific task and target audience.
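The point about matching style to task can be made concrete with per-task prompt templates; the template wording below is a hypothetical sketch, not a recommended set.

```python
# Hypothetical per-task prompt templates, one per task type listed above.
TEMPLATES = {
    "academic": "Write a precise, well-structured summary of the following findings: {text}",
    "business": "Draft a business report section covering the following figures: {text}",
    "technical": "Write step-by-step usage documentation for: {text}",
    "legal": "Draft a clear, concise contract clause stating: {text}",
    "product": "Write a customer-facing product description for: {text}",
}

def build_prompt(task, text):
    return TEMPLATES[task].format(text=text)
```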

The empowerment of large text generation models in the field of academic papers is mainly reflected in the following aspects:

1. Automated writing: Large text generation models can draft academic text automatically, without manual intervention. For researchers, this can save a great deal of time and effort while increasing productivity.

2. Accuracy and readability: Large text generation models can generate accurate, clear, and easy-to-understand academic text, helping it meet the requirements of academic papers. For researchers, this can improve the reliability and acceptance of research results.

3. Customized writing: Large text generation models can generate customized academic text for a specific task and target audience. For example, papers can be produced on a particular research topic, or for a particular readership.

4. Knowledge graphs: Large text generation models can draw on large knowledge bases to generate more informed academic text. For example, they can combine the knowledge base of a research field to generate papers for that field, or combine historical documents, news articles, and the like to generate more comprehensive papers.

Large text generation models can thus bring many advantages to academic writing: automated writing, accuracy and readability, customized writing, and knowledge graphs. These advantages can greatly increase productivity while improving the reliability and acceptance of research results.

The empowerment of large text generation models in the field of business reporting is mainly reflected in the following aspects:

1. Automated writing: Large text generation models can draft business reports automatically, without manual intervention. For report writers, this can save a great deal of time and effort while increasing productivity.

2. Accuracy and readability: Large text generation models can generate accurate, clear, and easy-to-understand business reports, helping them meet the requirements of such reports. For report writers, this can improve the reliability and acceptance of the reported content.

3. Customized writing: Large text generation models can generate customized business reports for a specific task and target audience. For example, a report can be generated for a specific market or a particular customer group.

4. Knowledge graphs: Large text generation models can draw on large knowledge bases to generate more informed business reports. For example, they can combine the knowledge base of a given field to generate a report for that field, or combine historical documents, news articles, and the like to generate a more comprehensive report.

Large text generation models can thus bring many advantages to business reporting: automated writing, accuracy and readability, customized writing, and knowledge graphs. These advantages can greatly increase productivity while improving the reliability and acceptance of the resulting reports.

The empowerment of large text generation models in the field of technical documentation is reflected in the following aspects:

1. Automating technical documentation: Text generation large models can help automate technical documentation. These models can automatically generate a large number of technical documents, including documents introducing how to use the software, summaries and content overviews of documents, etc. This can greatly improve the efficiency and accuracy of documentation, making it easier to understand and use.

2. Personalized documents: The large text generation model can generate personalized documents based on user input. For example, instead of simply generating a large number of documents, a model could generate specific documents based on a user's question or need. This personalized approach can help users better understand and use documents.

3. Improve document quality: Large text generation models can generate high-quality documents. These models learn the structure and rules of language and documents to produce more accurate and natural text. This improves the quality and readability of the documentation, helping users better understand and use the documentation.

4. Combination with other tools: The large text generation model can be combined with other tools, such as natural language processing and machine learning models, to generate more intelligent and personalized documents. These models can be used in applications such as automated document generation, intelligent question answering, and intelligent recommendation.

Large text generation models have a wide range of uses in technical documentation, covering automatic document generation, personalized document generation, intelligent question answering, and intelligent recommendation. They can greatly improve the efficiency and accuracy of documentation and help users better understand and use it.

Legal documents refer to normative documents formulated by legal institutions or individuals to ensure the accuracy, completeness and effectiveness of the document content, including contracts, agreements, legal documents, power of attorney, lawyer letters, etc.

In legal documents, text generation large models can be used to automatically generate the following:

1. Contract text: One of the most important contents of a legal document is the contract. The large-scale text generation model can automatically generate contract text, including contract terms, contract subject, contract time and other details, so that users can quickly generate contracts and conduct legal review.

2. Agreement text: The large text generation model can also be used to generate various types of agreement texts, such as non-disclosure agreements, employment agreements, cooperation agreements, etc., to help users quickly generate various types of agreements and ensure that the contents of the agreements are accurate, complete and efficient.

3. Power of Attorney: The large text generation model can generate various types of power of attorney, including personal authorization, company authorization, lawyer authorization, etc., to ensure that the content of the power of attorney is accurate, complete and valid.

4. Lawyer's letter: The large text generation model can generate various types of lawyer's letters, helping users quickly generate lawyer's letters and conduct legal review to ensure that the content of lawyer's letters is accurate, complete and effective.

The application of large text generation models in the field of legal documents can help users improve the efficiency and accuracy of legal documents, thereby protecting the rights and interests of users and companies.

A product description is an important document that introduces a product, service or solution to potential or existing users, and usually includes basic information, features and functions of the product, as well as instructions for use and user manuals, etc. Product descriptions can help companies or institutions better understand and attract potential or existing users, and improve product sales and user satisfaction.

In a product description, large text generation models can be used to automatically generate the following content:

1. Basic product information: The product description needs to include the basic information of the product, such as product name, product type, product model, etc. The large text generation model can automatically generate basic product information to help users quickly understand the product.

2. Product features and functions: Large text generation models can be used to automatically generate the features-and-functions section. For example, a model can describe product characteristics such as functions and performance, so that users can better understand the product.

3. Instructions for use: The large text generation model can be used to automatically generate instructions for use. For example, a model can generate instructions for a product, including how to install, how to use, and precautions, so that users can better understand the product.

4. User manual: The large text generation model can be used to automatically generate user manuals. For example, a model can generate a user manual, including product instructions, operation guides, and FAQs, so that users can better understand the product.

The large text generation model has a wide range of application scenarios in product descriptions. It can help users quickly generate documents such as basic product information, product features and functions, instructions for use, and user manuals, so as to better introduce products to potential or existing users.

 


Origin: blog.csdn.net/sinat_37574187/article/details/131601640