7 natural language processing techniques data scientists need to know

Author | George Seif

Translator | Sun Wei, Editor-in-Chief | Tu Min

Image | CSDN, downloaded from Oriental IC

Produced by | CSDN (ID: CSDNnews)

The following is the translation:

Modern companies have to deal with large amounts of data. This data comes in many different forms, including documents, spreadsheets, audio recordings, emails, JSON, and more. One of the most common ways such data is recorded is as text, which is usually quite similar to the natural language we use every day.

Natural language processing (NLP) is the field of programming computers to process and analyze large amounts of natural language text. Knowledge of NLP is crucial for data scientists because text is an extremely convenient and commonly used medium for storing data.

To analyze and build models on text data, we need to know how to perform the basic data science tasks of cleaning, formatting, parsing, analyzing, visualizing, and modeling it. When the data is raw text rather than numbers, these tasks all require extra steps beyond our usual methods.

This guide provides a basic introduction to using natural language processing in data science, covering the 7 techniques most commonly used when processing text data, along with libraries such as NLTK and Scikit-learn.

(1) Tokenization

Tokenization refers to dividing the text into sentences or words. In this process, we will also discard punctuation marks and redundant symbols.

This step is not as simple as it seems. For example, the phrase "New York" would be split into two tokens, but New York is a proper noun and may be very important to our analysis, so it is often better to keep it as a single token. Pay attention to this during this step.
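As a hedged illustration of one way to handle this, NLTK ships an MWETokenizer that can re-merge known multi-word expressions after word tokenization; the phrase list below is just an example:

import nltk
from nltk.tokenize import MWETokenizer

# Re-join known multi-word expressions such as "New York" into a single token
mwe_tokenizer = MWETokenizer([('New', 'York')], separator=' ')
tokens = mwe_tokenizer.tokenize(nltk.word_tokenize("I love New York"))
print(tokens)

# Prints out ['I', 'love', 'New York']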

The advantage of tokenization is that it converts the text into a format that is easier to turn into raw numbers and is therefore more suitable for actual processing. It is also the obvious first step in analyzing text data.

import nltk
nltk.download('punkt')  # fetch the tokenizer models (only needed the first time)

sentence = "My name is George and I love NLP"
tokens = nltk.word_tokenize(sentence)
print(tokens)

# Prints out ['My', 'name', 'is', 'George', 'and', 'I', 'love', 'NLP']

(2) Stop Words Removal

After tokenization, the next natural step is to remove stop words. The goal of this step is similar to the previous one: convert the text data into a format that is easier to process. This step removes common English filler words such as "and", "the", and "a". Later, when analyzing the data, this removes the noise so we can focus on the words that carry actual meaning.

Removing stop words is very easy to do by checking words against a predefined list. The important thing to note is that there is no universal list of stop words; the list is usually built from scratch and tailored to the application at hand.

import nltk
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models (only needed the first time)
nltk.download('stopwords')  # predefined stop word lists (only needed the first time)

sentence = "This is a sentence for removing stop words"
tokens = nltk.word_tokenize(sentence)

stop_words = stopwords.words('english')
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)

# Prints out ['This', 'sentence', 'removing', 'stop', 'words']
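Building on the snippet above, the predefined NLTK list can also be extended with application-specific words; the extra words here are purely illustrative:

# Extend the standard English list with custom, domain-specific stop words
custom_stop_words = set(stopwords.words('english')) | {"via", "etc"}
filtered_tokens = [w for w in tokens if w.lower() not in custom_stop_words]
print(filtered_tokens)

# Prints out ['sentence', 'removing', 'stop', 'words']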

(3) Stemming

Another technique for cleaning up text data is stemming, which reduces words to their root form. The goal is to map words that differ slightly because of context but carry the same meaning to the same token for unified processing. For example, consider the different ways the word "cook" can appear in a sentence depending on the context:

All of these forms of cook mean essentially the same thing, so in theory we can map them to the same token during analysis. In this example, we map cook, cooks, cooked, and cooking to the token "cook", which greatly simplifies our further analysis of the text data.

import nltk

snowball_stemmer = nltk.stem.SnowballStemmer('english')

s_1 = snowball_stemmer.stem("cook")
s_2 = snowball_stemmer.stem("cooks")
s_3 = snowball_stemmer.stem("cooked")
s_4 = snowball_stemmer.stem("cooking")

# s_1, s_2, s_3, s_4 all have the same result

(4) Word Embeddings

From the above three steps, we have cleaned up the data and can now convert it into a format that can be used for actual processing.

Word embedding is a way of representing words numerically, so that words with similar meanings have similar representations. Today's word embeddings represent individual words as real-valued vectors in a predefined vector space.

All word vectors have the same length but different values. The distance between two word vectors represents their semantic similarity. For example, the vectors for the words "cook" and "bake" are very close, while the vectors for "football" and "bake" are far apart.

A common method for creating word embeddings is called GloVe, which stands for "Global Vectors". GloVe captures both global and local statistics of a text corpus to create word vectors.

GloVe uses a so-called co-occurrence matrix, which records how often each pair of words appears together in the corpus. For example, suppose we want to create a co-occurrence matrix for the following three sentences:

  • I love data science.

  • I love coding.

  • I should learn NLP.

From these three sentences we can count, for each pair of words, how often they appear together; that gives the co-occurrence matrix of this small corpus.
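As a rough sketch (not the exact construction used by GloVe, which counts co-occurrences within a sliding context window), a sentence-level co-occurrence count for this tiny corpus could be computed like this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love data science",
          "I love coding",
          "I should learn NLP"]

# Count which words appear in which sentence, then multiply the transposed
# word-sentence count matrix by itself to get word-word co-occurrence counts
count_vec = CountVectorizer(lowercase=False, token_pattern=r"\S+")
word_counts = count_vec.fit_transform(corpus)
cooccurrence = (word_counts.T @ word_counts).toarray()

print(pd.DataFrame(cooccurrence,
                   index=count_vec.get_feature_names_out(),
                   columns=count_vec.get_feature_names_out()))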

In real-world datasets the matrix will be much larger. The nice part is that the co-occurrence counts only need to be computed once over the data, after which they can be saved to disk.

After that, GloVe is trained to learn a fixed-length vector for each word such that the dot product of any two word vectors equals the logarithm of their co-occurrence value from the matrix. This is captured by the objective function from the GloVe paper:
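For reference, the objective can be written (in LaTeX notation) as the weighted least-squares loss below, where the w terms are the word and context vectors, the b terms are bias values, and f is a weighting function that down-weights very rare and very frequent co-occurrences:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2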

In the equation, X is the value at position (i, j) of the co-occurrence matrix and the w terms are the word vectors to be learned. With this objective, GloVe minimizes the difference between the dot product of two word vectors and the log of their co-occurrence, which effectively ties the learned vectors to the co-occurrence values in the matrix.

In the past few years, GloVe has proven to be a very powerful and versatile word embedding technique because it encodes word semantics and similarity very effectively. For data science applications, it is a proven method for converting words into a format that we can process and analyze.

A complete tutorial on how to use GloVe in Python is available here:

https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db
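As a quick illustration, and assuming the gensim library with its downloadable pre-trained "glove-wiki-gigaword-100" vectors (one common way to load GloVe in Python), the semantic closeness described above can be checked directly:

import gensim.downloader as api

# Download (on first use) and load 100-dimensional pre-trained GloVe vectors
glove = api.load("glove-wiki-gigaword-100")

print(glove.similarity("cook", "bake"))      # expected to be relatively high
print(glove.similarity("football", "bake"))  # expected to be much lower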

(5) Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency-inverse document frequency (usually called TF-IDF) is a weighting factor often used in applications such as information retrieval and text mining. TF-IDF uses statistics to measure how important a word is to a particular document.

  • TF (term frequency): measures how frequently a term appears in a document. It is computed by dividing the number of times the term occurs in the document by the total length of the document (for normalization).

  • IDF (inverse document frequency): measures how important a term is. For example, terms such as "is", "of", and "a" appear many times in many documents but carry little actual meaning; they are not the adjectives or verbs that carry content. IDF therefore weights each term according to its importance. It is computed by dividing the total number of documents in the dataset by the number of documents containing the term (adding 1 to the denominator to avoid dividing by zero) and then taking the logarithm of the quotient.

  • TF-IDF: the final score is simply the product of TF and IDF (see the sketch after this list).
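Here is a minimal sketch of these definitions in plain Python; the function names are illustrative, and scikit-learn's TfidfVectorizer used below applies a slightly different smoothing and normalization:

import math

def tf(term, doc_tokens):
  # term frequency: occurrences of the term divided by the document length
  return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
  # inverse document frequency: log of (total number of documents divided by
  # (1 + number of documents containing the term))
  containing = sum(1 for doc in all_docs if term in doc)
  return math.log(len(all_docs) / (1 + containing))

def tf_idf(term, doc_tokens, all_docs):
  # the final score is simply TF multiplied by IDF
  return tf(term, doc_tokens) * idf(term, all_docs)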

TF-IDF strikes a nice balance, taking into account both the local and the global statistics of the target word. The more frequently a word appears in a document, the higher its weight, provided the word does not also appear frequently across the whole collection of documents.

Because of its power, TF-IDF is commonly used by search engines to score and rank the relevance of documents for a given query. In data science, we can use this technique to find out which words, and which related information, in our text data are most important.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf(vectorizer, vectors):
  # get_feature_names_out() replaces get_feature_names(), which was removed
  # in newer versions of scikit-learn
  feature_names = vectorizer.get_feature_names_out()
  dense_vec = vectors.todense()
  dense_list = dense_vec.tolist()
  tfidf_data = pd.DataFrame(dense_list, columns=feature_names)
  return tfidf_data


vectorizer = TfidfVectorizer()

doc_1 = "TF-IDF uses statistics to measure how important a word is to " \
        "a particular document"
doc_2 = "The TF-IDF is perfectly balanced, considering both local and global " \
        "levels of statistics for the target word."
doc_3 = "Words that occur more frequently in a document are weighted higher, " \
        "but only if they're more rare within the whole document."
documents_list = [doc_1, doc_2, doc_3]

vectors = vectorizer.fit_transform(documents_list)

tfidf_data = get_tf_idf(vectorizer, vectors)

print(tfidf_data)
# Prints the TF-IDF data for all words across all documents

(6) Topic Modeling

In natural language processing, topic modeling is the process of extracting main topics from a collection of text data or documents. Essentially, this is a form of dimensionality reduction because we reduce large amounts of text data to a small number of topics. Topic modeling is useful in many data science scenarios.

Here are a few examples:

  • Text data analysis: extracting latent trends and the main components of the data;

  • Text classification: just as dimensionality reduction helps with classic machine learning problems, topic modeling is useful here because we compress the text down to its key features;

  • Building recommender systems: topic modeling automatically gives us a basic grouping of the text data and can even provide additional features for building and training models.

Topic modeling is usually done with Latent Dirichlet Allocation (LDA). With LDA, each text document is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words (individual tokens obtained through cleaning techniques such as tokenization, stop word removal, and stemming).

LDA assumes that each document is composed of multiple topics, and that these topics then generate words according to their probability distributions.

First, we tell LDA how many topics each document should be composed of and how many words each topic should consist of. Given the dataset of documents, LDA then tries to determine which combination and distribution of topics can accurately recreate those documents and all of the text in them. It checks which topics work by actually constructing documents: given a topic, words are sampled according to that topic's word probability distribution to complete the construction.

Once LDA finds the distribution of topics that most accurately recreates all of the documents and their contents in the dataset, those topics, with their corresponding distributions, are our final result.

from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer

NUM_TOPICS = 3

# The "document_word_matrix" is a 2D array where each row is a document
# and each column is a word. The cells contain the count of the word within
# each document. Here we build it from the documents_list used in the
# TF-IDF example above.
count_vectorizer = CountVectorizer(stop_words='english')
document_word_matrix = count_vectorizer.fit_transform(documents_list)

# Here we create and fit the LDA model
lda = LDA(n_components=NUM_TOPICS, n_jobs=-1)
lda.fit(document_word_matrix)
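Continuing the sketch above, and assuming the fitted lda model and the count_vectorizer used to build document_word_matrix, the top words of each discovered topic can be inspected like this:

# Print the highest-weighted words for each topic found by LDA
words = count_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
  top_word_ids = topic.argsort()[-5:][::-1]
  print(f"Topic {topic_idx}:", [words[i] for i in top_word_ids])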

(7) Sentiment Analysis

Sentiment analysis is a natural language processing technique that aims to identify and extract subjective information from text data. Like topic modeling, sentiment analysis can turn unstructured text into a basic summary of the information embedded in the data.

Most sentiment analysis techniques fall into one of two categories: rule-based and machine learning methods. Rule-based methods obtain results through a simple set of steps. After preprocessing steps such as tokenization, stop word removal, and stemming, a rule-based method might proceed as follows:

  1. Define word lists for the different sentiments. For example, if we want to decide whether a paragraph is negative or positive, we might define words such as "bad" and "terrible" for negative sentiment and words such as "awesome" and "amazing" for positive sentiment;

  2. Go through the text and count the positive and negative sentiment words separately;

  3. If there are more words marked as positive than negative, the sentiment of the text is positive, and vice versa (a minimal sketch of this procedure follows below).
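Here is a minimal sketch of this rule-based procedure; the word lists are illustrative, not a real sentiment lexicon:

positive_words = {"awesome", "amazing", "great", "good"}
negative_words = {"bad", "terrible", "awful", "poor"}

def rule_based_sentiment(tokens):
  # count positive and negative words and compare the totals
  pos = sum(1 for t in tokens if t.lower() in positive_words)
  neg = sum(1 for t in tokens if t.lower() in negative_words)
  if pos > neg:
    return "positive"
  if neg > pos:
    return "negative"
  return "neutral"

print(rule_based_sentiment("The food was awesome and amazing".split()))

# Prints out positive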

Rule-based methods work well when we only need a rough sense of the sentiment. However, today's most advanced systems usually use deep learning, or at least classic machine learning techniques, to automate the whole process.

With deep learning, sentiment analysis is modeled as a classification problem. The text is encoded into an embedding space (similar to the word embeddings described above), which is a form of feature extraction. These features are then passed to a classification model that classifies the sentiment of the text.

This learning-based approach is powerful because it can be automated as an optimization problem. Continuously feeding data to the model so that it keeps improving is also a huge benefit: more data keeps improving both the feature extraction and the sentiment classification.
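As a hedged sketch of this learning-based approach using scikit-learn, here with TF-IDF features and logistic regression rather than deep learning, and with a tiny made-up training set purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real model needs far more labeled data
train_texts = ["I love this movie", "What a great experience",
               "This was terrible", "I hate this product"]
train_labels = ["positive", "positive", "negative", "negative"]

# Feature extraction (TF-IDF) followed by a classifier
sentiment_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_model.fit(train_texts, train_labels)

print(sentiment_model.predict(["What a great movie"]))

# Expected to print ['positive'] with this toy data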

There are plenty of excellent tutorials on performing sentiment analysis with machine learning models. Here are a few of them:

  • With Logistic Regression: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

  • With Random Forest: https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/

  • With Deep Learning LSTM: https://towardsdatascience.com/sentiment-analysis-for-text-with-deep-learning-2f0a0c6472b5

Original: https://towardsdatascience.com/an-introductory-guide-to-nlp-for-data-scientists-with-the-common-7-techniques-584d623c40f0

This article is a CSDN translation; please credit the source when reprinting.

【End】
