Text summarization using NLP

1. Description

        Text summarization is the process of generating short, fluent and, most importantly, accurate summaries of longer text documents. The main idea behind automatic text summarization is to find a small subset of the most important information in the entire collection and present it in a human-readable format. As the volume of online text grows, automatic summarization becomes increasingly useful because it lets readers take in more relevant information in less time.

2. Why automatic text summarization?

  1. Summaries reduce reading time.
  2. When researching documents, summaries make the selection process much easier.
  3. Automatic summarization improves the effectiveness of indexing.
  4. Automatic summarization algorithms are less biased than human summarization.
  5. Personalized summaries are very useful in question answering systems because they provide personalized information.
  6. The use of automatic or semi-automatic summarization systems enables commercial summarization services to increase the number of text documents they can process.

3. Categories of text summarization

        Text summarization approaches can be classified along at least three dimensions: 1) input type, 2) purpose, and 3) output type.

3.1 Based on input type:

  1. Single document, where the input is relatively short. Many early summarization systems dealt with single-document summarization.
  2. Multiple documents, where the input can be arbitrarily long.

3.2 Based on purpose:

  1. Generic, where the model makes no assumptions about the domain or content of the text to summarize and treats all inputs as homogeneous. Most of the work that has been done revolves around generic summarization.
  2. Domain-specific, where the model uses domain-specific knowledge to form more accurate summaries. For example, summarizing research papers in a particular field, biomedical literature, etc.
  3. Query-based, where the summary contains only information that answers natural language questions about the input text.
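
To make the query-based case concrete, here is a minimal sketch that assumes a deliberately naive approach: keep only the sentences that share at least one content word with the query. The names query_based_summary, content_words and QUERY_STOPWORDS, the tiny stopword set, and the regex sentence splitter are all illustrative assumptions, not part of the pipeline described in this article.

```python
import re

# Tiny illustrative stopword set (an assumption; a real system would use a fuller list).
QUERY_STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "what", "who", "does", "do"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in QUERY_STOPWORDS}


def query_based_summary(text, query):
    """Return the sentences of `text` that share at least one content word with `query`."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    query_terms = content_words(query)
    return [s for s in sentences if content_words(s) & query_terms]


document = (
    "Sharapova plays on the WTA Tour. "
    "She says tennis is a small part of her life. "
    "The weather was mild that day."
)
print(query_based_summary(document, "What does she say about tennis?"))
```

Only the second sentence shares words ("she", "tennis") with the query, so it is the only one returned. A practical system would rank by relevance rather than filter on exact word overlap.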

3.3 Based on output type:

  1. Extractive, where the model selects important sentences from the input text to form the summary. Most summarization methods today are extractive in nature.
  2. Abstractive, where the model forms its own phrases and sentences to provide a more coherent summary, much as a human would. This approach is certainly more attractive, but much more difficult than extractive summarization.

4. How to perform text summarization

  • text cleaning
  • sentence tokenization
  • word tokenization
  • word frequency table
  • summarization
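
The steps above can be sketched end to end with only the standard library. This is a simplified stand-in for the spaCy version that follows, assuming a regex tokenizer and a tiny hand-picked stopword set. The names summarize and SIMPLE_STOPWORDS and the ratio parameter are hypothetical, introduced only for illustration.

```python
import re
from heapq import nlargest

# Hand-picked stopwords for the sketch (an assumption; spaCy ships a much larger list).
SIMPLE_STOPWORDS = {"a", "an", "and", "the", "is", "are", "it", "of", "to", "in", "on"}


def summarize(text, ratio=0.3):
    """Frequency-based extractive summary: returns the top-scoring sentences."""
    # 1. Text cleaning: collapse runs of whitespace and newlines.
    cleaned = " ".join(text.split())
    # 2. Sentence tokenization (naive split after ., ! or ?).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]

    # 3. Word tokenization, dropping stopwords.
    def words(s):
        return [w for w in re.findall(r"[a-z']+", s.lower()) if w not in SIMPLE_STOPWORDS]

    # 4. Word frequency table, normalized by the highest count.
    freq = {}
    for w in words(cleaned):
        freq[w] = freq.get(w, 0) + 1
    if not freq:
        return []
    top = max(freq.values())
    freq = {w: c / top for w, c in freq.items()}
    # 5. Summarize: score each sentence, keep the best `ratio` share (at least one).
    scores = {s: sum(freq.get(w, 0) for w in words(s)) for s in sentences}
    keep = max(1, int(len(sentences) * ratio))
    return nlargest(keep, scores, key=scores.get)


sample = "Cats sleep a lot. Cats and dogs play with cats. Birds sing."
print(summarize(sample))
```

Keeping at least one sentence (the max(1, ...) guard) is a design choice of this sketch, so that very short inputs still produce a summary.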

4.1 Text cleaning:

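# !pip install -U spacy
# !python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
# Stop words are filtered out later when counting word frequencies.
stopwords = list(STOP_WORDS)
# Load the small English pipeline and parse the input document.
# Note: text is the original document shown later in this section; define it before this line runs.
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)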

4.2 Word tokenization and word frequency table:

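# Split the parsed document into word-level tokens.
tokens = [token.text for token in doc]
print(tokens)
# Treat newlines as punctuation so they are skipped as well.
punctuation = punctuation + '\n'
punctuation
# Count every token that is neither a stop word nor punctuation.
# Keys keep the token's original casing.
word_frequencies = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
print(word_frequencies)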
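
The counting loop above can also be written with collections.Counter. The sketch below assumes plain string tokens rather than spaCy Token objects, so it runs without a language model; count_words and the sample inputs are hypothetical names used only for illustration.

```python
from collections import Counter
from string import punctuation


def count_words(tokens, stopwords):
    """Count tokens that are neither stopwords nor punctuation/newlines."""
    skip_chars = punctuation + "\n"
    return Counter(
        tok for tok in tokens
        if tok.lower() not in stopwords and tok.lower() not in skip_chars
    )


tokens = ["I", "play", "tennis", ",", "and", "tennis", "is", "my", "job", ".", "\n"]
stopwords = {"i", "and", "is", "my"}
print(count_words(tokens, stopwords))
```

Counter also offers most_common(), which is handy for inspecting the highest-frequency words before normalizing.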

4.3 Sentence Tokenization:

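# Normalize each count by the highest count so every frequency falls in (0, 1].
max_frequency = max(word_frequencies.values())
max_frequency
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency
print(word_frequencies)
# Split the document into sentences using spaCy's sentence boundaries.
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)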

4.4 Sentence scoring:

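# Score each sentence as the sum of the normalized frequencies of its words.
# The lookup uses the lowercased token, so a word contributes only if its
# lowercase form is a key in word_frequencies.
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
sentence_scores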
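
As a worked example of this scoring rule, the sketch below uses plain strings and a hand-made frequency table (both assumptions for illustration) instead of spaCy objects. Each sentence's score is the sum of the normalized frequencies of its words, so words missing from the table contribute nothing, and a sentence with no known words gets no entry at all.

```python
def score_sentences(sentences, word_frequencies):
    """Sum the frequency of every known word in each sentence."""
    scores = {}
    for sent in sentences:
        for word in sent.lower().split():
            if word in word_frequencies:
                scores[sent] = scores.get(sent, 0) + word_frequencies[word]
    return scores


# Hand-made normalized frequencies (hypothetical values).
word_frequencies = {"tennis": 1.0, "players": 0.5, "friends": 0.25}
sentences = ["tennis players play tennis", "friends are great", "hello there"]
print(score_sentences(sentences, word_frequencies))
```

The first sentence scores 1.0 + 0.5 + 1.0 = 2.5 and the second scores 0.25. Because scores are plain sums, longer sentences tend to score higher than short ones.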

4.5 Summarization:

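from heapq import nlargest
# Keep roughly the top 30% of the sentences.
select_length = int(len(sentence_tokens)*0.3)
select_length
# nlargest returns the highest-scoring sentences, ordered by score.
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
summary
# Join the text of the selected sentences into the final summary string.
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)
print(summary)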
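
Because nlargest returns sentences ordered by score, a summary built this way can read out of sequence. One possible refinement, which is not part of the original code, is to re-sort the selected sentences by their position in the document. The sketch below assumes plain strings; top_sentences_in_order is a hypothetical helper.

```python
from heapq import nlargest


def top_sentences_in_order(sentences, scores, n):
    """Pick the n best-scoring sentences, then restore document order."""
    # Rank sentence indices by score; unscored sentences count as 0.
    best = nlargest(n, range(len(sentences)), key=lambda i: scores.get(sentences[i], 0))
    return [sentences[i] for i in sorted(best)]


sentences = ["First point.", "Minor aside.", "Key finding.", "Closing note."]
scores = {"First point.": 0.8, "Minor aside.": 0.1, "Key finding.": 0.9, "Closing note.": 0.3}
print(" ".join(top_sentences_in_order(sentences, scores, 2)))
```

With spaCy sentence spans, sorting the selected spans by their start attribute should achieve the same effect.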

The original input document (text must be defined before doc = nlp(text) is run):

text = """
Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: ‘I don’t really hide any feelings too much.
I think everyone knows this is my job here. When I’m on the courts or when I’m on the court playing, I’m a competitor and I want to beat every single person whether they’re in the locker room or across the net.
So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players.
I have not a lot of friends away from the courts.’ When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men’s tour than the women’s tour? ‘No, not at all.
I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players.
I think every person has different interests. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life.
I think everyone just thinks because we’re tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do.
There are so many other things that we’re interested in, that we do.’
"""

4.6 Output (final summary):

I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players. Maria Sharapova has basically no friends as tennis players on the WTA Tour. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life. I think everyone just thinks because we’re tennis players So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. When she said she is not really close to a lot of players, is that something strategic that she is doing?

For the full code, check out my repository:

5. Conclusion

        This article has briefly walked through the key steps needed for automatic text summarization.

        Creating datasets can be a lot of work, and it is an often overlooked part of learning data science. That's a topic for another blog post, though.

        Anup Singh


Origin blog.csdn.net/gongdiwudu/article/details/132258609