NLP: Don't reinvent the wheel

Author | Abhijit Gupta · Translation | VK · Source | Towards Data Science

Introduction

Natural language processing (NLP) is an intimidating field. Generating useful conclusions from unstructured text is hard, and there are countless techniques and algorithms, each with its own use cases and complexity. For a developer with little exposure to NLP, it is difficult to know which methods to use and how to implement them.

What if you could get the best possible results with the least effort? Using the 80/20 principle, I will show you how to deliver solutions quickly (20% of the effort) without significantly sacrificing results (80% of the value).

"The 80/20 principle believes that a small number of causes, inputs or efforts usually lead to most results, outputs or returns"

- Richard Koch, author of The 80/20 Principle

How will we achieve this? With some great Python libraries! We can stand on the shoulders of giants and innovate quickly instead of reinventing the wheel. With pre-tested implementations and pre-trained models, we will focus on applying these methods and creating value.

This article is aimed at developers who want to integrate natural language processing into their projects quickly. Because we emphasize ease of use and fast results, we will also give up some performance. In my experience, 80% of the technique is sufficient for most projects, but you can look elsewhere for related methods if you need more.

Without further ado, let's get started!


What is NLP?

Natural language processing is a branch of linguistics, computer science, and artificial intelligence that enables the automatic processing of text by software. NLP enables machines to read, understand, and respond to messy, unstructured text.

People usually think of NLP as a subset of machine learning, but the reality is more subtle.

Some NLP tools rely on machine learning, and some even use deep learning. However, these methods often rely on large data sets and are difficult to implement. Instead, we will focus on a simpler, rule-based approach to speed up the development cycle.

Terminology

Starting from the smallest unit of data: a character is a single letter, digit, or punctuation mark. A word is a list of characters, a sentence is a list of words, a document is a list of sentences, and a corpus is a list of documents.
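
To make these terms concrete, here is a toy illustration (my own example, not from the original article) of how the units nest when represented as plain Python lists:

# corpus -> documents -> sentences -> words -> characters
corpus = [                                       # a corpus is a list of documents
    [                                            # a document is a list of sentences
        ["Natural", "language", "is", "messy"],  # a sentence is a list of words
        ["Preprocessing", "helps"],
    ],
]

word = corpus[0][0][0]   # "Natural"
characters = list(word)  # a word is a list of characters
print(characters)

# Output
# ['N', 'a', 't', 'u', 'r', 'a', 'l']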

Preprocessing

Preprocessing is probably the most important step in an NLP project. It involves cleaning up the inputs so that the models can ignore noise and focus on what matters most. A strong preprocessing pipeline will improve the performance of every model, so its value cannot be overstated.

Here are some common preprocessing steps:

  • Tokenization : Given a long string of characters, we can separate documents by line breaks, sentences by periods, and words by spaces. Implementation details will vary from dataset to dataset.
  • Lowercasing : Capitalization usually does not add value and makes string comparison harder, so convert everything to lowercase.
  • Remove punctuation : We may want to remove commas, quotation marks, and other punctuation that does not add meaning.
  • Remove stop words : Stop words are words like "she", "the", and "of" that add little to the meaning of a text and distract from the keywords.
  • Remove other irrelevant words : Depending on your application, you may want to remove other words that carry no signal. For example, if you are evaluating course reviews, words like "professor" and "course" may not be useful.
  • Stemming/Lemmatization : Both stemming and lemmatization reduce inflected words to their root form (for example, "running" to "run"). Stemming is faster but does not guarantee that the root is an English word; lemmatization uses a vocabulary to ensure the root is a real word, at the cost of speed. (A short sketch after this list illustrates the difference.)
  • Part-of-speech tagging : POS tagging labels each word with its part of speech (noun, verb, preposition, ...) based on meaning and context. For example, we can keep only the nouns for keyword extraction.
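
Here is the sketch mentioned above: a minimal illustration of stemming versus lemmatization and of part-of-speech tagging using NLTK. It assumes the relevant NLTK data packages (wordnet, averaged_perceptron_tagger) have been downloaded, and the outputs in the comments are what I would expect from NLTK's standard models.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Assumes nltk.download("wordnet") and nltk.download("averaged_perceptron_tagger") have been run
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]
print([stemmer.stem(w) for w in words])            # ['run', 'studi', 'better'] -- fast, but 'studi' is not a word
print([lemmatizer.lemmatize(w) for w in words])    # ['running', 'study', 'better'] -- real words, defaults to nouns
print(lemmatizer.lemmatize("running", pos="v"))    # 'run' -- a part-of-speech hint helps lemmatization

# Part-of-speech tagging labels each token with its grammatical role
print(nltk.pos_tag(["the", "professor", "teaches", "quickly"]))
# [('the', 'DT'), ('professor', 'NN'), ('teaches', 'VBZ'), ('quickly', 'RB')]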

For a more comprehensive introduction to these concepts, please review the following guides:

https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

These steps are the basis of successful preprocessing. Depending on your dataset and task, you may skip some steps or add new ones. Manually observe the data as it moves through the pipeline and correct issues as they arise.


Python library

Let's take a look at the main Python libraries for NLP. These tools will play a very important role during preprocessing and beyond.

NLTK

The Natural Language Toolkit is the most widely used NLP library in Python. Developed at UPenn for academic purposes, NLTK has a huge number of features and corpora. NLTK is great for manipulating data and running preprocessing: https://www.nltk.org/

NLTK is the leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces:

>>> import nltk

>>> sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."

>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]

This is an example from the NLTK website showing how easy it is to tokenize a sentence and tag parts of speech.

SpaCy

SpaCy is a more modern library. While NLTK ships multiple implementations for each feature, SpaCy keeps the best-performing one. SpaCy supports a wide range of features; for more information, read the documentation: https://spacy.io/

With just a few lines of code, we can use SpaCy to perform named entity recognition. Many other tasks can be done quickly with the SpaCy API.

import spacy

nlp = spacy.load("en_core_web_sm")
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him seriously")

doc = nlp(text)
for entity in doc.ents:
    print(entity.text, entity.label_)

# Output
# Sebastian Thrun PERSON
# Google ORG
# 2007 DATE

GenSim

Unlike NLTK and SpaCy, GenSim specializes in information retrieval (IR) problems. Developed with a focus on memory management, it contains many models for document similarity, including Latent Semantic Indexing, Word2Vec, and FastText: https://github.com/RaRe-Technologies/gensim

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora: https://github.com/RaRe-Technologies/gensim

The following example uses a pre-trained GenSim Word2Vec model to find the similarity between words. Don't worry about the messy details; we can get results quickly.

import gensim.downloader as api
wv = api.load("word2vec-google-news-300")

pairs = [
    ('car', 'minivan'),    # a minivan is a kind of car
    ('car', 'bicycle'),    # still a wheeled vehicle
    ('car', 'airplane'),   # no wheels, but still a vehicle
    ('car', 'cereal'),     # ... and so on
    ('car', 'communism'),
]

for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

# Output
# 'car'   'minivan'    0.69
# 'car'   'bicycle'    0.54
# 'car'   'airplane'   0.42
# 'car'   'cereal'     0.14
# 'car'   'communism'  0.06

And there is more…

This list is not comprehensive, but it covers some use cases. I recommend checking this repository for more tools and references: https://github.com/keon/awesome-nlp


Applications

Now that we have discussed preprocessing methods and Python libraries, let's put them together with a few examples. For each application, I will introduce several candidate NLP algorithms, pick one based on our rapid-development goal, and create a simple implementation using one of the libraries.

Application 1: Preprocessing

Preprocessing is a key part of any NLP solution, so let's see how the Python libraries can speed it up. In my experience, NLTK has all the tools we need and can be customized for unique use cases. Let's load a sample corpus:

import nltk

# Load the Brown corpus
corpus = nltk.corpus.brown

# Access the files of the corpus
print(corpus.fileids())

# Output
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16',
 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25', 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32',
 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'c...

Following the pipeline defined above, we can use NLTK to implement tokenization, remove punctuation and stop words, perform stemming, and more. See how easy it is to remove stop words:

from nltk.corpus import stopwords

sw = stopwords.words("english")
sw += ""  # 空字符串

def remove_sw(doc):
    sentences = []
    for sentence in doc:
        sentence = [word for word in sentence if word not in sw]
        sentences.append(sentence)
    return sentences

print("With Stopwords")
print(doc1[1])
print()

doc1 = remove_sw(doc1)

print("Without Stopwords")
print(doc1[1])

# Output
# With Stopwords
# ['the', 'jury', 'further', 'said', 'in', 'presentments', 'that', 'the', 'city', 'executive', 'committee', 'which', 'had',
# 'charge', 'of', 'the', 'election', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'city', 'of', 'atlanta', 'for', 
# 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted']

# Without Stopwords
# ['jury', 'said', 'presentments', 'city', 'executive', 'committee', 'charge', 'election', 'deserves', 'praise', 'thanks', 'city',
# 'atlanta', 'manner', 'election', 'conducted']

The entire preprocessing pipeline took me fewer than 40 lines of Python. See the full code here. Remember, this is a generic example; you should modify the process to fit the needs of your specific use case.
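
The full notebook is not reproduced here, but a condensed sketch of what such a pipeline might look like (my own version, assuming NLTK's punkt and stopwords data are available) is:

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

sw = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(raw_text):
    """Tokenize, lowercase, strip punctuation and stopwords, then stem."""
    sentences = []
    for sent in nltk.sent_tokenize(raw_text):
        tokens = [w.lower() for w in nltk.word_tokenize(sent)]
        tokens = [w for w in tokens if w not in sw and w not in string.punctuation]
        sentences.append([stemmer.stem(w) for w in tokens])
    return sentences

print(preprocess("The jury praised the city of Atlanta. The election was conducted well."))

# Expected output
# [['juri', 'prais', 'citi', 'atlanta'], ['elect', 'conduct', 'well']]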

Application 2: Document clustering

Document clustering is a common task in natural language processing, so let's discuss some approaches. The basic idea is to assign each document a vector representing the topics it discusses:

If the vectors were two-dimensional, we could plot each document as a point on a plane: in the original figure, documents A and B sit close together (closely related), while D and F sit far apart (loosely related). Whether the vectors are 3-dimensional, 100-dimensional, or 1000-dimensional, we can compute similarity with a distance metric.
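
To make the distance metric concrete, here is a tiny hand-written sketch of cosine similarity between document vectors (for illustration only; in practice GenSim or NumPy handles this for you):

import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_a = [0.9, 0.1]  # hypothetical 2-D topic vectors
doc_b = [0.8, 0.2]
doc_d = [0.1, 0.9]

print(round(cosine_similarity(doc_a, doc_b), 2))  # 0.99 -- closely related
print(round(cosine_similarity(doc_a, doc_d), 2))  # 0.22 -- loosely related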

The next question is how to construct these vectors for each document using unstructured text input. Here are a few options, from the simplest to the most complex:

  • Bag of words: Assign an index to each unique word. The vector of a given document is the frequency of each word.

  • TF-IDF: Weight the counts by how common each word is across the other documents. If two documents share a rare word, they are more similar than if they share a common one.

  • Latent Semantic Indexing (LSI): Bag of words and TF-IDF produce high-dimensional vectors, which weakens distance-based similarity measures. LSI compresses these vectors to a more manageable size while minimizing the loss of information.

  • Word2Vec: Use neural networks to learn the associations of words from a large text corpus. Then add the vectors of each word to get a document vector.

  • Doc2Vec: Built on the basis of Word2Vec, but using a better method to approximate the document vector from the word vector list.

Word2Vec and Doc2Vec are quite complex and require large datasets to learn the word embeddings. We could use pre-trained models, but they may not adapt well to domain-specific tasks. Instead, we will use bag of words, TF-IDF, and LSI.

Now select our library. GenSim is built specifically for this task and it contains simple implementations of all three algorithms, so let's use GenSim.

For this example, let's use the Brown corpus again. It has documents in 15 text categories, such as "adventure", "editorial", and "news". After running our NLTK preprocessing routine, we can start applying the GenSim models.
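
The snippets below assume that corpus is a list of 15 preprocessed documents, one per Brown category. A minimal sketch of how that structure might be built (the names and details are my own, not the article's) is:

import nltk

brown = nltk.corpus.brown

# One "document" per category: all of its words, lowercased and alphabetic only.
# Real preprocessing (stopword removal, stemming/lemmatization, ...) would follow the earlier pipeline.
corpus = [
    [w.lower() for w in brown.words(categories=cat) if w.isalpha()]
    for cat in brown.categories()
]

print(len(corpus), brown.categories()[:3])

# Output
# 15 ['adventure', 'belles_lettres', 'editorial']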

First, we create a dictionary that maps each unique token to an index.

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(corpus)
dictionary.filter_n_most_frequent(1)  # removes ""
num_words = len(dictionary)
print(dictionary)
print()

print("Most Frequent Words")
top10 = sorted(dictionary.cfs.items(), key=lambda x: x[1], reverse=True)[:10]
for i, (id, freq) in enumerate(top10):
    print(i, freq, dictionary[id])

# Output
# Dictionary(33663 unique tokens: ['1', '10', '125', '15th', '16']...)

# Most Frequent Words
# 0 3473 one
# 1 2843 would
# 2 2778 say
# 3 2327 make
# 4 1916 time
# 5 1816 go
# 6 1777 could
# 7 1665 new
# 8 1659 year
# 9 1575 take

Next, we apply bag of words, TF-IDF, and latent semantic indexing one after another:

corpus_bow = [dictionary.doc2bow(doc) for doc in corpus]
print(len(corpus_bow[0]))
print(corpus_bow[0][:20])

# Output
# 6106
# [(0, 1), (1, 3), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 2), (13, 2), (14, 2), (15,
# 1), (16, 2), (17, 2), (18, 3), (19, 1)]

tfidf_model = models.TfidfModel(corpus_bow)
corpus_tfidf = tfidf_model[corpus_bow]
print(len(corpus_tfidf[0]))
print(corpus_tfidf[0][:20])

# Output
# 5575
# [(0, 0.001040495879718581), (1, 0.0011016669638018743), (2, 0.002351365659027428), (3, 0.002351365659027428), (4, 
# 0.0013108697793088472), (5, 0.005170600993729588), (6, 0.003391861538746009), (7, 0.004130105114011007), (8, 
# 0.003391861538746009), (9, 0.008260210228022013), (10, 0.004130105114011007), (11, 0.001955787484706956), (12, 
# 0.0015918258736505996), (13, 0.0015918258736505996), (14, 0.008260210228022013), (15, 0.0013108697793088472), (16, 
# 0.0011452524080876978), (17, 0.002080991759437162), (18, 0.004839366251287288), (19, 0.0013108697793088472)]

lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=20)
corpus_lsi = lsi_model[corpus_tfidf]
print(len(corpus_lsi[0]))
print(corpus_lsi[0])

# Output
# 15
# [(0, 0.18682238167974372), (1, -0.4437583954806601), (2, 0.22275580411969662), (3, 0.06534575527078117), (4, 
# -0.10021080420155845), (5, 0.06653745783577146), (6, 0.05025291839076259), (7, 0.7117552624193217), (8, -0.3768886513901333), (9, 
# 0.1650380936828472), (10, 0.13664364557932132), (11, -0.03947144082104315), (12, -0.03177275640769521), (13, 
# -0.00890543444745628), (14, -0.009715808633565214)]

In about 10 lines of Python, we trained three separate models and extracted vector representations of the documents. Using cosine similarity to compare the vectors, we can find the most similar documents.

categories = ["adventure", "belles_lettres", "editorial", "fiction", "government", 
              "hobbies", "humor", "learned", "lore", "mystery", "news", "religion",
              "reviews", "romance", "science_fiction"]
num_categories = len(categories)

# Build a similarity index over the LSI vectors. This step is assumed here (the snippet as
# published omits it); gensim's similarities.MatrixSimilarity computes cosine similarity
# against every indexed document.
index = similarities.MatrixSimilarity(corpus_lsi)

for i in range(3):
    print(categories[i])
    sims = index[lsi_model[corpus_bow[i]]]
    top3 = sorted(enumerate(sims), key=lambda x: x[1], reverse=True,)[1:4]
    for j, score in top3:
        print(score, categories[j])
    print()

# Output
# adventure
# 0.22929086 fiction
# 0.20346783 romance
# 0.19324714 mystery

# belles_lettres
# 0.3659389 editorial
# 0.3413822 lore
# 0.33065677 news

# editorial
# 0.45590898 news
# 0.38146105 government
# 0.2897901 belles_lettres

That's it, we have results! Adventure is most similar to fiction, romance, and mystery, while editorials are most similar to news and government. View the complete code here: https://github.com/avgupta456/medium_nlp/blob/master/Similarity.ipynb.

Application 3: Sentiment analysis

Sentiment analysis is the interpretation of unstructured text as positive, negative, or neutral. It is a useful tool for analyzing reviews, gauging brand perception, and building AI chatbots.

Unlike document clustering, sentiment analysis does not use preprocessing. The punctuation, flow, and context of a passage can reveal a lot about sentiment, so we do not want to remove them.

For simplicity and effectiveness, I recommend using pattern-based sentiment analysis. By searching for specific keywords, sentence structure, and punctuation, these models measure the positive and negative aspects of text. Here are two libraries with built-in sentiment analyzers:

VADER sentiment analysis:

VADER stands for Valence Aware Dictionary and sEntiment Reasoner and is an extension of NLTK for sentiment analysis. It uses patterns to compute sentiment and works especially well with emoticons and text-message slang. It is also very easy to implement.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

print(analyzer.polarity_scores("This class is my favorite!!!"))
print(analyzer.polarity_scores("I hate this class :("))

# Output
# {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.5962}
# {'neg': 0.688, 'neu': 0.312, 'pos': 0.0, 'compound': -0.765}

TextBlob sentiment analysis:

A similar tool for sentiment analysis is TextBlob. TextBlob is actually a versatile library, similar to NLTK and SpaCy. Its sentiment tool differs from VADER in that it reports both polarity and subjectivity. Personally, I prefer VADER, but each has its own strengths and weaknesses. TextBlob is also very easy to implement:

from textblob import TextBlob

testimonial = TextBlob("This class is my favorite!!!")
print(testimonial.sentiment)

testimonial = TextBlob("I hate this class :(")
print(testimonial.sentiment)

# Output
# Sentiment(polarity=0.9765625, subjectivity=1.0)
# Sentiment(polarity=-0.775, subjectivity=0.95)

Note: pattern-based models do not handle texts as small as those in the examples above very well. I suggest applying sentiment analysis to texts averaging at least four sentences. For a quick demonstration, see the Jupyter Notebook: https://github.com/avgupta456/medium_nlp/blob/master/Sentiment.ipynb
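
One simple way to follow that advice (my own sketch, not from the article) is to score a longer passage sentence by sentence with VADER and average the compound scores:

import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

paragraph = (
    "The lectures were clear and well paced. The homework was challenging but fair. "
    "Office hours were always helpful. I would happily take this class again!"
)

# Score each sentence, then average the compound scores over the paragraph
sentences = nltk.sent_tokenize(paragraph)
scores = [analyzer.polarity_scores(s)["compound"] for s in sentences]
print(round(sum(scores) / len(scores), 2))  # expect a clearly positive average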

Other applications

Here are a few additional topics and some useful algorithms and tools to accelerate your development.

  • Keyword extraction : named entity recognition (NER) with SpaCy, Rapid Automatic Keyword Extraction (RAKE) with rake-nltk (a short RAKE sketch follows this list)
  • Text summarization : TextRank (similar to PageRank) with the PyTextRank SpaCy extension, TF-IDF with GenSim
  • Spell check : PyEnchant, SymSpell Python port
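
As one example from the list above, here is a minimal keyword-extraction sketch using the rake-nltk package (assuming it is installed via pip install rake-nltk and the NLTK stopwords data is available):

from rake_nltk import Rake

rake = Rake()  # uses NLTK's English stopwords and standard punctuation by default
rake.extract_keywords_from_text(
    "Natural language processing lets developers extract keywords and "
    "summarize documents without training their own models."
)

# Highest-scoring phrases first
print(rake.get_ranked_phrases()[:3])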

Hopefully these examples help demonstrate the vast range of resources available for natural language processing in Python. Whatever the problem, chances are someone has developed a library to streamline it, and using these libraries can produce good results in a short time.


Tips and tricks

With this introduction to NLP, an overview of the Python libraries, and some sample applications, you are just about ready to tackle your own challenges. Finally, here are a few tips and tricks to make the most of these resources.

  • Python tools : I recommend Poetry for dependency management, Jupyter Notebook for testing new models, Black and/or Flake8 for maintaining code style, and GitHub for version management.
  • Stay organized : It is easy to jump from library to library and paste code snippets into whatever you are currently writing, but it is not a good habit. I suggest a more deliberate approach, because you don't want to rush past a better solution.
  • Preprocessing : Garbage in, garbage out. It is important to implement a strong preprocessing pipeline to clean up the inputs. Visually inspect the processed text to make sure everything is working as expected.
  • Display results : How you choose to display your results can make a big difference. If the output text looks a bit rough, consider displaying aggregate statistics or numerical results instead.

Original link: https://towardsdatascience.com/natural-language-processing-nlp-dont-reinvent-the-wheel-8cf3204383dd
