Automatic Text Generation: Research Progress and Trends in Data-to-Text Generation

Data-to-text generation

1 Current status of international research

Data-to-text generation refers to generating text from given numerical data, for example weather forecasts, sports news, financial reports, or medical reports. The technology has strong application prospects. The field has made substantial research progress, and industry has developed generation systems for a range of domains and applications. Research is concentrated in a small number of institutions, such as the University of Aberdeen, the University of Brighton, and the University of Edinburgh. Results are published mainly at specialist conferences such as INLG and ENLG.

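Ehud Reiter of the University of Aberdeen proposed a general framework for data-to-text generation systems based on the three-stage pipeline model [76]. The framework consists of four modules: signal analysis, data interpretation, document planning, and microplanning and realization.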

The signal analysis module (Signal Analysis) takes numerical data as input, applies data analysis methods to detect the basic patterns in the data, and outputs discrete data patterns. Examples include peaks in stock data and longer-term growth trends. This module depends on the application domain and the data type, so the patterns it outputs differ from one domain and data type to another.
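As an illustration of what signal analysis might look like, here is a minimal Python sketch that turns a numeric series into discrete patterns. It is not taken from any of the systems cited here: the `Pattern` class, the function names, and the `min_change` threshold are all assumptions made for the example, and real systems use far more robust, domain-specific detection methods.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pattern:
    kind: str     # "peak" or "trend" (hypothetical labels)
    start: int    # index of the first sample covered by the pattern
    end: int      # index of the last sample covered by the pattern
    value: float  # peak height, or net change over the trend


def detect_peaks(series: List[float]) -> List[Pattern]:
    """Mark every sample that is strictly greater than both neighbours."""
    peaks = []
    for i in range(1, len(series) - 1):
        if series[i] > series[i - 1] and series[i] > series[i + 1]:
            peaks.append(Pattern("peak", i, i, series[i]))
    return peaks


def detect_trend(series: List[float], min_change: float) -> List[Pattern]:
    """Report one overall trend if the net change is at least min_change."""
    if len(series) < 2:
        return []
    change = series[-1] - series[0]
    if abs(change) < min_change:
        return []
    return [Pattern("trend", 0, len(series) - 1, change)]


def analyse(series: List[float], min_change: float = 1.0) -> List[Pattern]:
    """Run both detectors and return the combined list of patterns."""
    return detect_peaks(series) + detect_trend(series, min_change)


if __name__ == "__main__":
    prices = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0]
    for pattern in analyse(prices):
        print(pattern)
```

The output is deliberately symbolic (a list of `Pattern` records rather than numbers), since later modules reason over patterns and events, not raw samples.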

The data interpretation module (Data Interpretation) takes basic patterns and events as input. By analyzing them, it infers more complex and abstract messages together with the relationships between those messages, and outputs the high-level messages and their relationships. For stock data, for example, a message could be created whenever a drop exceeds a certain value. Relationships between messages, such as causal and temporal relationships, must also be detected. Note that not every text generation system needs a data interpretation module. In weather forecast generation, for example, the basic patterns are sufficient, so the module can be omitted.
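The stock example above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Event` and `Message` classes, the labels, the threshold, and in particular the toy rule that links a news item to a drop one time step later are all hypothetical, and a real system would rely on domain knowledge to infer such relationships.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical labels: "drop" or "news"
    time: int    # time step at which the event occurs
    size: float  # magnitude of the event (0.0 if it has none)


@dataclass(frozen=True)
class Message:
    label: str
    time: int


Relation = Tuple[str, Message, Message]


def interpret(events: List[Event],
              drop_threshold: float) -> Tuple[List[Message], List[Relation]]:
    """Turn low-level events into messages and relations between them."""
    messages = []
    for event in events:
        # Create a message only if the drop exceeds the threshold.
        if event.kind == "drop" and event.size > drop_threshold:
            messages.append(Message("sharp_drop", event.time))
        elif event.kind == "news":
            messages.append(Message("news_item", event.time))
    messages.sort(key=lambda m: m.time)

    relations = []
    for earlier, later in zip(messages, messages[1:]):
        # Temporal relation between consecutive messages.
        relations.append(("before", earlier, later))
        # Toy causal rule (an assumption): news right before a sharp drop.
        if (earlier.label == "news_item" and later.label == "sharp_drop"
                and later.time - earlier.time <= 1):
            relations.append(("possible_cause", earlier, later))
    return messages, relations


if __name__ == "__main__":
    events = [Event("drop", 5, 0.08), Event("news", 4, 0.0),
              Event("drop", 9, 0.01)]
    msgs, rels = interpret(events, drop_threshold=0.05)
    print(msgs)
    print([name for name, _, _ in rels])
```

The small drop at time step 9 produces no message because it stays below the threshold, which is the filtering behaviour the paragraph describes.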

The document planning module (Document Planning) takes messages and relationships as input. It decides which messages and relationships should be mentioned in the text, determines the structure of the text, and outputs the selected messages together with the document structure. Signal analysis and data interpretation usually produce a large number of messages, patterns, and events, but the text is limited in length and can describe only some of them, so the document planning module must choose which messages to express. The choice is generally based on expert knowledge and on the importance and novelty of each message. This module is also strongly domain dependent: the factors considered when selecting messages differ between domains, as does the document structure.
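One simple way to picture content selection is as scoring and truncation. The sketch below assumes that each candidate message already carries an importance score and a novelty score between 0 and 1, for instance assigned by domain experts. The `Candidate` class, the weights, and the time-ordered structure are hypothetical choices for illustration, not the method of any cited system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text_id: str
    importance: float  # assumed expert-assigned score in [0, 1]
    novelty: float     # assumed score in [0, 1]
    time: int          # time step the message refers to


def plan_document(candidates: List[Candidate], max_messages: int,
                  w_importance: float = 0.7,
                  w_novelty: float = 0.3) -> List[Candidate]:
    """Select the highest-scoring messages, then order them by time."""
    def score(c: Candidate) -> float:
        return w_importance * c.importance + w_novelty * c.novelty

    # Content selection: keep only as many messages as the text can hold.
    selected = sorted(candidates, key=score, reverse=True)[:max_messages]
    # Document structuring: a plain chronological order.
    return sorted(selected, key=lambda c: c.time)


if __name__ == "__main__":
    candidates = [
        Candidate("A", importance=0.9, novelty=0.2, time=3),
        Candidate("B", importance=0.4, novelty=0.9, time=1),
        Candidate("C", importance=0.3, novelty=0.1, time=2),
        Candidate("D", importance=0.8, novelty=0.8, time=0),
    ]
    plan = plan_document(candidates, max_messages=2)
    print([c.text_id for c in plan])
```

In practice the weights and the ordering strategy would be tuned per domain, which reflects the domain dependence noted above.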

The microplanning and realization module (Microplanning and Realization) takes the selected messages and the document structure as input and produces the final text using natural language generation techniques. It covers sentence planning and sentence realization, and the realized sentences must be correct in grammar, morphology, and spelling and must use accurate referring expressions. These techniques have been studied extensively in academia; see Section 3, "Meaning-to-Text Generation", of this paper for details.
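To make the requirements on morphology and referring expressions concrete, here is a deliberately tiny Python sketch. It assumes one entity and a list of (verb, complement) facts, uses the full name on first mention and the pronoun "it" afterwards, and inflects regular English verbs for the third person singular. The function names and the rules are assumptions for illustration; real realizers handle irregular verbs, agreement, aggregation, and much more.

```python
from typing import List, Tuple


def third_person_singular(verb: str) -> str:
    """Inflect a regular English verb (irregular verbs are not handled)."""
    if verb.endswith(("s", "sh", "ch", "x", "z", "o")):
        return verb + "es"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"


def realise(entity: str, facts: List[Tuple[str, str]]) -> str:
    """Realise each (verb, complement) fact as one sentence about entity."""
    sentences = []
    for index, (verb, complement) in enumerate(facts):
        # Referring expression: full name first, pronoun afterwards.
        subject = entity if index == 0 else "it"
        sentence = f"{subject} {third_person_singular(verb)} {complement}"
        # Orthography: capitalise the first letter and add a full stop.
        sentences.append(sentence[0].upper() + sentence[1:] + ".")
    return " ".join(sentences)


if __name__ == "__main__":
    print(realise("the index", [("fall", "by 3% on Monday"),
                                ("reach", "a two-month low")]))
```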

Industry has developed data-to-text generation systems for many domains. Their architectures differ little from the general framework described above. Some systems merge two of its modules into one, and others omit one of the modules.

Data-to-text generation has been applied most successfully in weather forecasting. Several systems have been developed to summarize weather forecast data and generate forecast text. For example, the FoG system [78] generates bilingual weather forecasts from data manipulated by the user, and the SumTime system [79] generates marine weather forecasts. An experimental evaluation showed that users sometimes preferred the forecasts generated by SumTime to forecasts written by experts [80]. In addition, Anja Belz of the University of Brighton proposed a probabilistic generation model for weather forecast text [81]. Anja Belz and Eric Kow further compared a variety of data-to-text generation systems on weather forecast data. Their results showed that a higher degree of automation does not reduce the quality of the generated text, and that automatic evaluation methods tend to underestimate the quality of text from handcrafted rule-based systems while overestimating that of automatic systems [82].

Text generation systems have also been developed for other domains, such as air quality [83], financial data [84], and medical diagnostic data, including TOPAZ [85], Suregen [85, 86], and BT-45 [87]. Among them, BT-45 generates text summaries of monitoring data from the neonatal intensive care unit (NICU) to help doctors make decisions. The two figures in the original article show an input sample and the text generated by BT-45. The input figure is an NICU data sample with, from top to bottom, HR, TcPO2, TcPCO2, SaO2, T1 & T2, and Mean BP [Portet et al., 2009].

Because data-to-text generation has great application value, a number of companies have been founded to commercialize it. They generate industry reports and news stories from industry data and thereby save a great deal of manual work. Well-known companies include ARRIA [12], Automated Insights (AI) [13], and NarrativeScience [14]. ARRIA is headquartered in Europe and was formerly known as Data2Text. It was founded by two professors from the University of Aberdeen, Ehud Reiter and Yaji Sripada, and Robert Dale, another scientist in the field of natural language generation, joined the company later. Its core technology is the ARRIA NLG engine. Automated Insights is an American artificial intelligence company founded by Robbie Allen, a former Cisco engineer. It started by generating text summaries from sports data and now generates text reports from data in many areas, including finance, personal fitness, business intelligence, and website analytics. Its core technology is the WordSmith NLG engine. Automated Insights has generated hundreds of millions of news reports for organizations such as the Associated Press and has had a large impact. NarrativeScience grew out of StatsMonkey, a research project at Northwestern University in the United States, and its core technology is the Quill NLG engine. Forbes is a typical NarrativeScience client: the Forbes website has a dedicated NarrativeScience page [15] on which all articles are generated automatically by NarrativeScience.

2 Current status of domestic research

Chinese academia has done little research on data-to-text generation, and few related results have been published at major conferences or in major journals. Some organizations in Chinese industry have developed template-based text generation systems. For example, Xinhua News Agency has developed a system that generates corporate annual reports from financial report data. The system relies on manually written templates and fills the required data into them to produce annual financial reports. Because the templates are relatively fixed, the reports generated for different companies are similar to one another and not very vivid.
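Template filling of this kind can be sketched in a few lines of Python using the standard library's `string.Template`. The template wording, the field names, and the sample record below are invented for illustration and do not describe the Xinhua system.

```python
from string import Template

# Hypothetical hand-written template; $-placeholders are filled from data.
ANNUAL_REPORT_TEMPLATE = Template(
    "In $year, $company reported revenue of $revenue million yuan, "
    "$direction $change% from the previous year."
)


def fill_report(record: dict) -> str:
    """Fill the fixed template with figures from one company's record."""
    change = ((record["revenue"] - record["prev_revenue"])
              / record["prev_revenue"] * 100)
    direction = "up" if change >= 0 else "down"
    return ANNUAL_REPORT_TEMPLATE.substitute(
        year=record["year"],
        company=record["company"],
        revenue=f"{record['revenue']:.1f}",
        direction=direction,
        change=f"{abs(change):.1f}",
    )


if __name__ == "__main__":
    record = {"company": "Company A", "year": 2020,
              "revenue": 120.0, "prev_revenue": 100.0}
    print(fill_report(record))
```

Every company's report comes out of the same sentence frame with only the numbers swapped, which is why template-based output tends to read as repetitive.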

3 Development trends and prospects

Generating Chinese text from data is of great research significance and also highly practical. If Chinese news could be generated from data, it would greatly ease the workload of editors and reporters and help transform the media and publishing industry. Building such a system requires cooperation between research institutes and news publishers. News publishers can provide large amounts of data and expert knowledge, while research institutes contribute theories and methods for natural language understanding and generation.

In addition, developing a general-purpose data-to-text generation system that covers different domains is complicated and difficult. A better approach is to select one or two domains (such as finance and sports) for system development, wait for the system to mature, and then consider migrating it to other domains.


Origin blog.csdn.net/jinhao_2008/article/details/115948554