TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages (Reading Notes)

Paper: TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages
Work by the masters deserves close study, so these notes follow the structure of the original paper: under each section heading I summarize the content of the original text. Most of it is a loose translation of the English paper. I am not chasing translation accuracy; I want to understand how the paper is written. So each section alternates: a translation, a note; a translation, a note...
My own summary is at the end of this article.

0. abstract

Translation: TextTiling is a technique for dividing a text into multi-paragraph subtopic passages. Shifts between subtopics are identified mainly from patterns of lexical co-occurrence and distribution in the sentences before and after. The algorithm is fully implemented and agrees well with human judgments on 12 texts. It should be helpful for other NLP tasks such as information retrieval and summarization.
Notes: 1. What is TextTiling 2. The key to identifying sub-topic transfer 3. The performance and function of the algorithm

1. Introduction

Translation: Most previous work on discourse processing only analyzes the relationships between clauses and sentences. That work is very helpful for reference resolution and dialogue generation. However, important discourse information also appears at the paragraph level. This paper proposes a paragraph-level model and an algorithm that divides expository texts into multi-paragraph units based on subtopic shifts.
Notes: 1. Limitations of previous work 2. Put forward directions for improvement 3. Explain the contribution of this article

Translation: In this work, a text is treated as a sequence of subtopic discussions. For example:
[Figure: example of a text's subtopic structure; image not preserved]
Notes: 1. Introduces the research object: expository text 2. Examples

Translation: Headings and subheadings are common in scientific articles, but many expository texts have few structural boundaries, so subtopic segmentation is very useful for them.
Notes: 1. Builds on the introduction of expository text above and raises a further question, to explain why the direction discussed in this paper matters

Translation: Subtopics appear only under a single main heading, or between several main headings.
Notes: I could not understand this passage.

Translation: TextTiling uses patterns of lexical co-occurrence and distribution. The algorithm has three parts: 1. Divide the text into sentence-sized units 2. Compute a score for each sentence unit 3. Find the subtopic boundaries from the graph obtained by plotting the sentence units against their scores. There are three ways of computing the scores: 1. blocks 2. vocabulary introductions 3. chains. All of them use only the lexical co-occurrence and distribution patterns within the text itself.
Notes: 1. The core of TextTiling 2. The three steps of the algorithm 3. The methods of calculating scores in the algorithm
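To make the three steps concrete, here is a minimal sketch of my own, not the paper's implementation. The function names (sentence_units, gap_scores, boundaries), the regex-based sentence splitter, the raw dot-product score and the "strict local minimum" rule are all simplifying assumptions; stop-word removal and the smoothing and boundary-count steps listed under Section 5 are left out.

```python
import re
from collections import Counter


def sentence_units(text):
    """Step 1: split the text into sentence units and count the words in each."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]


def gap_scores(units, k=2):
    """Step 2: score every gap by comparing the k units on each side of it.

    The score here is a plain dot product of word counts (an assumption).
    scores[i] belongs to gap i + 1, i.e. the gap before unit index i + 1.
    """
    scores = []
    for gap in range(1, len(units)):
        left = sum(units[max(0, gap - k):gap], Counter())
        right = sum(units[gap:gap + k], Counter())
        scores.append(sum(left[w] * right[w] for w in left))
    return scores


def boundaries(scores):
    """Step 3: report the gaps whose score is a strict local minimum (a valley)."""
    return [
        i + 1
        for i in range(1, len(scores) - 1)
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]
```

Calling boundaries(gap_scores(sentence_units(text))) returns gap numbers, where gap g lies between the g-th and (g+1)-th sentence. A low score means the two neighbouring blocks share few words, which is the signal the translation above describes.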

Translation: The ultimate goal of passage-level structuring is not only to identify the subtopic units, but also to identify the topic of each one. This paper only discusses the former.
Notes: 1. Explains that topic segmentation actually involves two tasks: 1) identify the subtopic units 2) identify the type of each subtopic

Translation: Section 2 motivates topic segmentation and its application scenarios. Section 3 describes what a "subtopic" is and the discourse models behind the model proposed in this paper. Section 4 introduces the overall framework for detecting subtopic shifts using lexical co-occurrence information and describes other related work in empirical discourse analysis. Section 5 describes TextTiling in detail, Section 6 reports model performance, and Section 7 summarizes the work and looks to the future.
Note: It is nice that the introduction ends with a roadmap like this. Maybe this is a convention of long papers?

2. Why Multi-paragraph Units?

Translation: What we are taught in school is that a paragraph should be a unit that is internally coherent and externally independent. In real life, many paragraph breaks are only there to change the appearance of the text and make it easier to read. Newspapers are one example. In addition, paragraph breaks serve different functions in different types of text.
Notes: 1. Relates the problem to everyday life to explain the background

Translation: Most topic segmentation work operates at a finer granularity than this paper. TextTiling is aimed at expository text without much structural markup, because segmenting such text is very helpful for information retrieval and summarization. Typical examples of expository text are a 5-page scientific paper or a 20-page environmental impact report.
Notes: 1. Characteristics of other topic segmentation work 2. A more detailed introduction to the expository text targeted by TextTiling

Translation: TextTiling can be used for hypertext display and information retrieval. It can also be used for text summarization. Previous work on text summarization only uses the relative position of a sentence in the text and then extracts the sentence directly, without using the subtopic structure of the text. Some researchers have pointed out the need for subtopic structure, but did not give an algorithm.
Notes: 1. Hypertext display 2. IR 3. Text summarization. But I don't know what these three are.

Translation: Another area is automatic text generation. Some researchers work from the concept of "Basic Blocks": 1. A paragraph must contain a point of interest, such as a person or a location 2. A paragraph is a collection of concepts about that object. They emphasize a fine-grained approach: build a high-level structure from the local information of the text, and then combine the necessary related information.
Notes: I don't know what fine-grained and coarse-grained mean.

2.1 Online Text Display and Hypertext

Omitted for now; I will come back to this later.

2.2 Information Retrieval

Omitted for now; I will come back to this later.

3. Coarse-Grained Subtopic Structure

3.1 What is Subtopic Structure?

Explanation: These points are not important for me, so several paragraphs are translated together, separated by "/".
Translation: If "topic" is explained as a unifying principle that makes one passage about something and the next passage about something else, then it can be explained quite clearly. But without considering the context, a single "topic" is difficult to define. / If we consider "topic shift" instead, the question becomes how to identify the formal signs of a topic shift in the text. / Data show that the shift between topics is sometimes obvious and sometimes less so. / TextTiling also takes the position above, so the question now is how to detect subtopic shifts. Some researchers think two kinds of markers deserve particular attention: 1. Adverbial clauses 2. Certain prosodic markers. This paper shows that lexical co-occurrence patterns are also a good entry point.
Note: A short paper would definitely not include this... but it is still quite valuable.

3.2 Relationship to Segmentation in Hierarchical Discourse Models

Translation: Previous work on empirical discourse processing has adopted hierarchical discourse models. The most important ones are "attentional/intentional structure" and "Rhetorical Structure Theory". They mainly study phrase-level and clause-level units, and the texts they analyze are all short. Fine-grained topic segmentation is useful for tasks such as dialogue generation and turn-taking. The hierarchical models segment at the "utterance" level (?). The latest development is automatic segmentation using machine learning methods plus some carefully selected textual cues.
Notes: 1. The previous hierarchical models were fine-grained 2. What fine-grained segmentation is useful for 3. What the hierarchical model is 4. The latest development. This then leads into TextTiling.

Translation: TextTiling recognizes the major subtopic boundaries, and its segmentation is linear, not hierarchical. When working with paragraph-level units instead of "utterance"-level units, lower-complexity algorithms are needed. TextTiling only considers lexical distribution information, and ignores prosodic cues (pitch, pause, and duration) and discourse markers (oh, well, ok, however) and so on. This works well because 1. The amount of computation is small 2. Using the cues above directly can be very misleading.
Notes: 1. Clarifies how TextTiling differs from the hierarchical models 2. Clarifies the benefit of considering only lexical distribution information.

4. Detecting Subtopic Change via Lexical Co-occurrence Patterns

Translation: TextTiling assumes that a text on a given topic uses a particular set of words. When the topic changes, a large proportion of the vocabulary changes too. The algorithm finds boundaries by deciding where those thematic components change the most. Other researchers study thematic factors such as setting, time, and characters. By contrast, the author tries to find the places where a relatively large set of active themes changes at the same time, instead of only considering the types of thematic factors. This is because, in expository text, the subject matter contributes more to the discourse structure than setting, time, and characters do. For example:...
Notes: 1. A mainstream idea 2. Other researchers' methods based on this idea 3. How the author uses this idea 4. Why. Finally, we are getting to the main point.

Translation: The figure below shows the flow of subtopic structure based on lexical co-occurrence. A blank means the word does not occur. There are roughly three types of words: 1. Words that appear frequently throughout the text (life, moon); they usually indicate the main topic 2. Words that appear infrequently but are evenly distributed (form, scientist); they are not useful 3. The rest, which usually appear frequently within a run of consecutive sentences; these are the useful ones. The question now is how to determine where each group of these words starts and where it ends.
[Figure: distribution of selected words across the sentences of the text; image not preserved]

Notes: 1. Words that occur frequently throughout the text indicate the main topic; words that occur infrequently but evenly throughout the text are not useful; words that occur in clusters indicate subtopics 2. How to determine the topic split points. Wow, it is starting to get interesting.
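The three word types can be told apart mechanically from where each word occurs. The sketch below is only an illustration of that idea and not something from the paper: the function name classify_terms, the labels "global" / "scattered" / "clustered", and the two thresholds wide and frequent are assumptions I chose for the example.

```python
def classify_terms(sentences, wide=0.6, frequent=0.5):
    """Label each word by how it is distributed over a list of tokenised sentences.

    sentences: list of lists of words, in text order.
    wide, frequent: illustrative thresholds (assumptions, not from the paper).
    """
    n = len(sentences)
    positions = {}
    for i, words in enumerate(sentences):
        for w in set(words):
            positions.setdefault(w, []).append(i)

    labels = {}
    for w, idx in positions.items():
        span = (idx[-1] - idx[0] + 1) / n  # share of the text between first and last use
        coverage = len(idx) / n            # share of sentences that contain the word
        if span >= wide and coverage >= frequent:
            labels[w] = "global"       # type 1: frequent everywhere (main topic)
        elif span >= wide:
            labels[w] = "scattered"    # type 2: infrequent but spread out
        else:
            labels[w] = "clustered"    # type 3: confined to a run of sentences
    return labels
```

Note that a word occurring in only one sentence also ends up as "clustered" under these rules, so a real implementation would probably need a minimum-frequency filter as well.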

Translation: The figure above suggests that simple lexical co-occurrence relations can be used to determine subtopic boundaries. However, it is not enough to consider only recurring words. Even combining related words into chains is not enough, because a single segment often has several active themes. Example: sentences 37-51 cut across themes (move, continent, shoreline, time, species, and life do not all belong to the same theme), whereas in sentences 57-71 the words space, star, binary, trinary, astronomer, and orbit are closely related semantically.
Notes: 1. It is not enough to look only at the distribution of words or phrases, because one segment can involve several themes.

Translation: Since the words that indicate subtopic structure are not necessarily conceptually related, this paper considers the co-occurrence of multiple themes that are active at the same time. Earlier related work considers the word overlap between sentences. Word overlap forms a structure: if the structure is a fully connected graph, the topic is being discussed intensively; if it is a thin chain of connections, that point is probably a topic split point. The core idea is that the structure of the text is obtained as a function of the pattern of word connections between sentences.
Note: I am running out of time, so let me just look at the model. There were originally four paragraphs under this heading.

4.1 Comparing Adjacent Blocks of Text

As shown in the figure below, we define a window of size 2. The figure shows the lexical score of the gap between sentences 2 and 3, between sentences 4 and 5, and between sentences 6 and 7. According to the text, the blocks act like a sliding window that moves across the sentences one sentence at a time, so if the window size is k, each sentence takes part in the calculation of 2*k gap scores (I don't understand this yet). That said, the score of the gap between sentences 2 and 3 is simply the inner product of two vectors: vector 1 = [2, 1, 2, 1, 1] and vector 2 = [1, 1, 1, 1, 2], so the answer is 2*1 + 1*1 + 2*1 + 1*1 + 1*2 = 2 + 1 + 2 + 1 + 2 = 8.
[Figure: block comparison with a window of size 2; image not preserved]
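The arithmetic for the gap between sentences 2 and 3 can be reproduced in a few lines. This is a small sketch using the two vectors quoted above; the function names block_score and cosine_score are mine, and the normalised variant is an optional extra I added for comparison, since raw inner products grow with block length.

```python
import math


def block_score(left, right):
    """Inner product of two term-count vectors over the same vocabulary."""
    if len(left) != len(right):
        raise ValueError("both blocks must be counted over the same vocabulary")
    return sum(a * b for a, b in zip(left, right))


def cosine_score(left, right):
    """Inner product divided by the vector lengths (optional normalised variant)."""
    norm = math.sqrt(block_score(left, left) * block_score(right, right))
    return block_score(left, right) / norm if norm else 0.0


vector_1 = [2, 1, 2, 1, 1]  # word counts in the block before the gap
vector_2 = [1, 1, 1, 1, 2]  # word counts in the block after the gap
print(block_score(vector_1, vector_2))  # 8
```

Each position in the two vectors stands for one word, so a word contributes to the score only when it occurs on both sides of the gap.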
The following figure records my understanding and doubts:
[Figure: my annotated notes on the block comparison; image not preserved]
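One way to check the 2*k claim from the text is simply to count. The sketch below assumes that gap g lies between sentence g and sentence g + 1 (1-based), that each gap compares the k sentences before it with the k sentences after it, and that blocks are cut short at the ends of the text. The function name gap_memberships is hypothetical.

```python
from collections import Counter


def gap_memberships(n_sentences, k):
    """Count how many gap-score computations each sentence takes part in."""
    uses = Counter()
    for gap in range(1, n_sentences):
        # Block before the gap: sentences gap-k+1 .. gap (clipped at sentence 1).
        left = range(max(1, gap - k + 1), gap + 1)
        # Block after the gap: sentences gap+1 .. gap+k (clipped at the last sentence).
        right = range(gap + 1, min(n_sentences, gap + k) + 1)
        for s in list(left) + list(right):
            uses[s] += 1
    return [uses[s] for s in range(1, n_sentences + 1)]


print(gap_memberships(8, 2))  # [2, 3, 4, 4, 4, 4, 3, 2]
```

Under these assumptions a sentence sits in the left block of k gaps and in the right block of another k gaps, which gives 2*k for every sentence away from the two ends of the text.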

4.2 Vocabulary Introductions

4.3 Lexical Chains

4.4 Vector Space Similarity Comparisons

4.5 Other Related Approaches

5. The TextTiling Algorithm

5.1 Tokenization

5.2 Determining Scores

5.2.1 Blocks

5.2.2 Vocabulary Introduction

5.3 Boundary Identification

5.4 Smoothing the Plot

5.5 Determining the Number of Boundaries

6. Evaluation

6.1 Reader Judgments

6.2 Parameter Settings

6.3 Results: Qualitative Analysis

6.4 Results: Quantitative Analysis

6.5 Detecting Breaks between Consecutive Documents

7. Summary and Future Work

Origin blog.csdn.net/jokerxsy/article/details/110953045