In the era of big text data, every developer needs to know how to analyze text

Python and its open source tools make text analysis very convenient, so in this era of big text data, every developer needs to know how to analyze text.

Recommended reading: "Natural Language Processing and Computational Linguistics"


This book shows how to apply natural language processing and computational linguistics algorithms to existing data and draw interesting conclusions from it. These algorithms are based on today's mainstream statistical machine learning and artificial intelligence techniques, and ready-made implementations are available in the Python community, such as Gensim and spaCy.

The book starts with data cleaning, then shows how to run computational linguistics algorithms, and finally uses real language and text data to explore more advanced NLP and deep learning topics in Python. Readers will learn to use open source tools to tag, parse, and model text, gain practical knowledge of excellent frameworks, learn how to choose a tool such as Gensim for topic modeling, and learn how to do deep learning with Keras.

Who should read this book?

Readers are expected to have some familiarity with Python; if not, it does not matter, because the book introduces some Python basics. It also helps to know basic statistics. Since the main content of the book is natural language processing, a basic understanding of linguistics is very helpful as well.

The main content of this book

Chapter 1, What is text analysis. Today's technology lets developers easily collect massive amounts of text data from the Internet and study machine learning and computational linguistics with powerful, free, open source tools. The field is developing at an unprecedented speed. This chapter discusses in detail what text analysis is and the motivation for learning and understanding it.

Chapter 2, Python text analysis techniques. As mentioned in Chapter 1, this book uses Python because it is an easy-to-use and powerful programming language. This chapter covers the Python basics needed for text analysis. Why do these basics matter? Although we expect readers to know some Python and high-school mathematics, some readers may not have written Python code for a long time, and other Python developers have experience mainly with web frameworks such as Django, which calls for different skills than text analysis and string processing.

Chapter 3, spaCy language models. Chapter 2 introduced the idea of text analysis but did not discuss the technical details of building a text analysis pipeline. This chapter introduces spaCy's language models, which are the first step in text analysis and the first component of an NLP pipeline. It also introduces the spaCy open source library, shows how spaCy helps developers complete text analysis tasks, and discusses some of its more powerful features, such as POS tagging and NER. An example illustrates how to preprocess data quickly and effectively with spaCy.

Chapter 4, Gensim: tools for text vectorization, vector transformations, and n-grams. Although the previous chapters showed how to handle raw text data, no machine learning or information retrieval algorithm takes raw text as input. This chapter therefore represents text with a data structure called a vector so that algorithms can work with it, using Gensim and scikit-learn as the conversion tools. Alongside vectorization, it introduces preprocessing techniques such as bi-grams, tri-grams, and n-grams, and shows how word frequencies can be used to filter out uncommon words in a document.

Chapter 5, Part-of-speech tagging and its applications. Chapters 1 and 2 introduced text analysis and Python, and Chapters 3 and 4 helped readers set up the code for more advanced text analysis. This chapter discusses the first advanced NLP technique: POS tagging. We study what parts of speech are, how to recognize the part of speech of a word, and how part-of-speech tags can be used.

Chapter 6, NER tagging and its applications. The previous chapter showed how to do part-of-speech tagging with spaCy. This chapter explores another interesting use: NER tagging. It discusses what NER tagging is from the perspective of both linguistics and text analysis, gives detailed examples of its use, and shows how to train your own NER tagger with spaCy.

Chapter 7, Dependency parsing. Chapters 5 and 6 showed how spaCy performs sophisticated computational linguistics tasks such as POS tagging and NER tagging, but that is not all the spaCy package offers. This chapter explores the power of dependency parsing and how it is used in a variety of contexts and application scenarios. Before continuing with spaCy, we study the theoretical foundations of dependency parsing and train a dependency parsing model.

Chapter 8, Topic models. So far we have learned some computational linguistics algorithms and spaCy, and how to use those algorithms to annotate data and understand sentence structure. Although these algorithms capture the details of the text, they still lack a comprehensive picture of the data. In each corpus, which words appear more often than others? Can the data be grouped, or can latent topics be found? This chapter tries to answer these questions.

Chapter 9, Advanced topic modeling. The previous chapter showed the power of topic models as an intuitive way to understand and explore data. This chapter goes further into their practical side and shows how to build more effective topic models that better capture the topics that may appear in a corpus. Topic modeling is a way of understanding the documents in a corpus, and it gives developers a lot of room for analysis.

Chapter 10, Text clustering and text classification. The previous chapter introduced topic models and how they organize and help us understand documents and their substructures. This chapter moves on to new machine learning algorithms for text and two specific tasks, text clustering and text classification. It discusses the intuition behind the two algorithms and how to build such models with the popular Python machine learning library scikit-learn.

Chapter 11, Query word similarity and text summarization. Once text can be vectorized, we can compute the similarity or distance between text documents, which is exactly what this chapter introduces. There are many different vector representation techniques in use, from the standard bag-of-words representation and TF-IDF to topic model representations of documents. The chapter also covers how to implement text summarization and keyword extraction with Gensim.

Chapter 12, Word2Vec, Doc2Vec, and Gensim. The previous chapters have repeatedly discussed vectorization: how to understand it, and how to represent text data in mathematical form. All of the machine learning methods we use rely on these vector representations. This chapter goes a step further and uses machine learning techniques to generate vector representations of words that better capture their semantics. This technique is commonly known as word embedding, and Word2Vec and Doc2Vec are two mainstream variants of it.

Chapter 13, Deep learning for text. So far we have explored applications of machine learning in many contexts: topic modeling, clustering, classification, text summarization, and even POS and NER tagging all rely on machine learning. This chapter introduces one of the cutting-edge areas of machine learning: deep learning. Deep learning is a branch of machine learning inspired by biological structures; its algorithms and architectures are built from neural networks. Text generation, text classification, and word embeddings are all areas where deep learning can be applied. The chapter covers the basics of deep learning and an example of implementing a deep learning model for text.

Chapter 14, Deep learning with Keras and spaCy. The previous chapter introduced deep learning techniques for text and tried to generate text with a neural network. This chapter goes deeper into deep learning for text, in particular how to build a Keras model for text classification and how to integrate deep learning into the spaCy pipeline.

Chapter 15, Sentiment analysis and chatbots. By now we have the basic skills needed to start a text analysis project and can attempt more ambitious ones. Two text analysis scenarios have not yet been covered, although many of their underlying concepts already have: sentiment analysis and chatbots. This chapter serves as a guide for readers to build these two applications on their own. It does not provide complete code for a chatbot or a sentiment analyzer, but focuses on the relevant techniques and principles.

Sample chapter:

Chapter 10 Text Clustering and Text Classification

The previous chapter explored how to use topic models to better organize and understand documents. This chapter continues with the clustering and classification tasks in machine learning, how they work, and how to perform them with the popular Python machine learning library scikit-learn. The topics covered in this chapter are:

  • Text clustering
  • Text classification

10.1 Text clustering

Previously, we analyzed text in order to better understand what a text or corpus is made of. For example, POS tagging or NER tagging tells us which words appear in a document, and a topic model tells us what latent topics are hidden in the text. Of course, developers can also use a topic model to cluster documents, but that is not what topic models are good at, and expecting good results from it is unrealistic. Note that because the purpose of topic modeling is to find the hidden topics in a corpus rather than to group documents, there is currently no particularly good way to optimize it for clustering. For example, after topic modeling, a document may be composed of topics 1, 2, and 3 with proportions of 30%, 30%, and 40%; this information is not enough for clustering.

Below we introduce two more quantitative machine learning tasks: clustering and classification. Clustering is a popular machine learning task, and the techniques used in classic clustering tasks can also be applied to text. As the name suggests, clustering is the task of grouping data points so that points in the same group are more similar to each other than to points in other groups. Here a data point can be a document or a word. Clustering is an unsupervised learning problem: before we start assigning data points to clusters or groups, we do not know their categories (although we may have an idea of what we will find).

Classification is somewhat similar to clustering. It uses a training set of samples (or instances) whose categories are known to determine which of a set of categories an unknown sample belongs to, for example assigning incoming email to the spam or non-spam category, or assigning newspaper articles to designated sections or groups.

A well-known dataset for clustering and classification tasks is Iris, which contains petal measurements of flowers together with their species. Another very popular dataset is MNIST, which contains handwritten digits that should be classified according to the digit they represent.

Text clustering follows most of the principles of standard clustering problems, but in text analysis the dimensionality is far higher. For example, in the Iris dataset there are only 4 features for identifying classes or clusters, whereas for text we have to deal with the entire vocabulary when modeling the problem. Of course, we will use techniques such as SVD, LDA, and LSI to reduce the dimensionality as much as possible.

In the previous chapters, Gensim and spaCy were used extensively for quantitative tasks in computational linguistics. From now on we will start using a more traditional machine learning library, scikit-learn, parts of which have already appeared earlier in this book.

When studying clustering and classification algorithms we also have to mention Word2Vec and Doc2Vec, two methods for representing words and documents as vectors. They are a newer form of vector representation, more sophisticated than the earlier approaches. We return to Word2Vec and Doc2Vec in Chapter 12 and use them for clustering and classification.

10.2 Preparation before clustering

The most important preparation is still preprocessing: removing stop words and stemming, and then converting the documents into a vector representation.
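
As a reminder of what that preprocessing might look like, here is a minimal sketch using spaCy (assuming the en_core_web_sm model is installed; the book's notebooks may do this differently, and spaCy lemmatizes rather than stems):

import spacy

# Minimal sketch: drop stop words and non-alphabetic tokens, lowercase, lemmatize.
nlp = spacy.load('en_core_web_sm')

def preprocess(text):
    return [token.lemma_.lower() for token in nlp(text)
            if token.is_alpha and not token.is_stop]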

This section uses scikit-learn for all three tasks: clustering, classification, and preprocessing. First we need to choose a dataset. There are many options; here we choose the popular 20 Newsgroups dataset. Since the dataset is built into scikit-learn, it is easy to load and use.

Readers can refer to the accompanying Jupyter notebook on clustering and classification; the code snippets below are extracted from it to explain the process.

The code to load the data set is as follows:

import numpy as np
from sklearn.datasets import fetch_20newsgroups

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

labels = dataset.target
true_k = np.unique(labels).shape[0]
data = dataset.data

The code above imports and loads the 20 Newsgroups dataset; in this example we only select 4 categories. We build the dataset from the 'all' subset and shuffle it so that its order is random. The text data then has to be converted into a vector form that a machine learning algorithm can understand.
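
As a quick sanity check (not in the original text), we can print what was loaded; the object returned by fetch_20newsgroups exposes target_names:

print("%d documents" % len(data))
print("%d categories: %s" % (true_k, dataset.target_names))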

Here we use scikit-learn's built-in TfidfVectorizer class to simplify this step:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english',
                             use_idf=True)
X = vectorizer.fit_transform(data)

The object X is the input vector and contains the TF-IDF representation of the dataset. Even after the TF-IDF transformation we are still dealing with high-dimensional data. To better understand its nature, we visualize it. We can use PCA (Principal Component Analysis) to map the dataset into a two-dimensional space. PCA finds uncorrelated components in the dataset (in mathematical terms, linearly uncorrelated components). By identifying these uncorrelated components in a high-dimensional dataset, the data can be effectively reduced in dimensionality. Of course, the main purpose here is visualization; for the clustering itself we will use a different dimensionality reduction technique:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
X_visualise = pipeline.fit_transform(newsgroups_train.data).todense()

pca = PCA(n_components=2).fit(X_visualise)
data2D = pca.transform(X_visualise)
plt.scatter(data2D[:, 0], data2D[:, 1], c=newsgroups_train.target)

Let's quickly go over the code above. The dataset is loaded again, but with only two categories (the ones we want to visualize). On top of that, count vectorization and a TF-IDF transformation are run, and a PCA model with just two components is fitted. After plotting, the separation of the clusters in the dataset is clearly visible, as shown in Figure 10.1.


Figure 10.1 Data set visualization results

The two axes in Figure 10.1 represent the two principal components obtained from the PCA transformation.

Let's now return to the original vector X and use it for clustering. In the chapter on topic models we discussed several dimensionality reduction techniques, such as SVD and LSA/LSI (see Chapter 8); this example uses those techniques for dimensionality reduction.

 

After performing the SVD operation on the data set, normalization is also required.

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

n_components = 5
svd = TruncatedSVD(n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
X = lsa.fit_transform(X)

After the three steps of cleaning, TF-IDF conversion, and dimensionality reduction, the resulting vector X is the input we need, and we can move on to clustering.

10.3 K-means

K-means is a classic clustering algorithm whose principle is easy to understand. Given a user-specified number of clusters, it assigns each point to its nearest cluster center and repeatedly moves each center to reduce the distance between the center and the points assigned to it. As an iterative algorithm, it keeps doing this until the cluster centers stabilize. It is worth briefly understanding the principle behind the algorithm.
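
For intuition only, here is a bare-bones NumPy sketch of that loop (an illustrative assumption rather than scikit-learn's implementation; it ignores edge cases such as empty clusters). X can be any dense 2-D array, for example the LSA-reduced matrix built above:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Pick k distinct points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_samples, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        assignments = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return assignments, centroids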

Implementing K-means with scikit-learn is very simple. The library provides two implementations: standard K-means and mini-batch K-means. The code below includes both, and users can switch freely between them:

from sklearn.cluster import KMeans, MiniBatchKMeans

minibatch = True

if minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100,
                n_init=1)
km.fit(X)

By calling the fit function, we trained a model with 4 clusters. Instead of visualizing the clustering result right away, here we simply print the top terms of each cluster:

original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]

 

The first line of the preceding code must be kept: it is needed because of the LSI transformation, mapping the cluster centers back into the original term space.

terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
Cluster 0:
 graphics space image com university nasa images ac program posting
Cluster 1:
 god people com jesus don say believe think bible just
Cluster 2:
 space henry toronto nasa access com digex pat gov alaska
Cluster 3:
 sgi livesey keith solntze wpd jon com caltech morality moral

 

The results may vary from run to run, because these machine learning algorithms do not produce exactly the same result every time.

We can see that each cluster corresponds to one of the 4 categories originally selected, and the clustering result is good. We can further use the trained model to predict which cluster a new document belongs to; we only need to make sure that the same preprocessing steps are applied to the new document before prediction.

km.predict(X_test)
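
The chapter does not show how X_test is built; here is a hedged sketch, assuming new_documents is simply a list of raw strings (the examples below are hypothetical) and reusing the fitted vectorizer and LSA pipeline from earlier, never refitting them:

new_documents = [
    "NASA plans another mission to study the atmosphere of Mars.",
    "The new graphics card renders each image with a faster shading algorithm.",
]
X_test = lsa.transform(vectorizer.transform(new_documents))
print(km.predict(X_test))  # one cluster index per new document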

To recap the clustering steps: we loaded the dataset, selected 4 categories, ran the preprocessing steps, visualized the data, trained a K-means model, and printed the most important words of each cluster to see whether they were meaningful, with good results. Because we knew the number of categories in advance, we set K=4 for K-means.

Next, readers can try different preprocessing steps to get different clustering results. Another form of clustering is discussed below.

10.4 Hierarchical clustering

Before introducing hierarchical clustering, it is worth looking at the clustering documentation in scikit-learn. Switching between different models in scikit-learn is simple, and the other steps of the clustering process stay the same.
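
As a quick illustration of how easily models can be swapped, here is a minimal sketch (not from the book's notebook) that reuses the reduced matrix X and the cluster count true_k with scikit-learn's AgglomerativeClustering and Ward linkage:

from sklearn.cluster import AgglomerativeClustering

# Agglomerative (hierarchical) clustering with Ward linkage; X must be dense.
ward_model = AgglomerativeClustering(n_clusters=true_k, linkage='ward')
ward_labels = ward_model.fit_predict(X)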

We will try hierarchical clustering with Ward's algorithm. The algorithm is based on the idea of reducing the variance within each cluster and uses a distance metric to do so. Ward's method is one of the earliest methods used in hierarchical clustering; its core idea is to build clusters and arrange them hierarchically. This example uses a dendrogram to represent the hierarchical clustering.

Before running the algorithm on the dataset, we create a matrix of pairwise distances with scikit-learn:

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(X)

After establishing the distance matrix, we will start to call the ward and dendrogram functions in the SciPy library:

from scipy.cluster.hierarchy import ward, dendrogram

linkage_matrix = ward(dist)
fig, ax = plt.subplots(figsize=(10, 15))  # set size
ax = dendrogram(linkage_matrix, orientation="right")

SciPy encapsulates all the complex steps and gives us a nice diagram (see Figure 10.2). The dendrogram illustrates how the documents can be arranged. The x-axis shows the name or index of each document, although with so many documents they are not readable in the figure. The y-axis shows the distance between each level of the cluster hierarchy.


Figure 10.2 An example of a text clustering dendrogram generated by SciPy's Ward algorithm

Because of the number of documents, it is hard to tell from the figure whether the clustering result is optimal, or to understand the relationship between documents and clusters, so a smaller corpus can be used to further confirm the effect.

Here again, readers can try different dimensionality reduction and vector representation methods before feeding the corpus to the clustering algorithm. Word2Vec and Doc2Vec both provide very interesting ways to do this, and Gensim supports them.

Next we introduce text classification, another important family of machine learning algorithms for text.

10.5 Text classification

The previous section discussed clustering, an unsupervised learning task, whereas classification is a supervised learning task. What do supervised and unsupervised mean? In the previous example, some samples had labels indicating which category each document actually belongs to, but you may have noticed that we never used this information: when we trained the clustering model, we never used the labels. This kind of learning is called unsupervised learning, and clustering is a common example of an unsupervised learning task.

In classification problems we know which classes documents or data points should be assigned to, and we use this information to train the model. In fact, there is very little difference between our clustering and classification workflows: apart from using the labels, we simply train a different kind of model.

Before feeding text into any machine learning pipeline, we need to make sure the text cleaning and vectorization steps are done. No new steps are introduced here, but developers can adjust them to improve model accuracy or performance.
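
One optional adjustment, sketched below as an assumption rather than part of the original walkthrough, is to hold out a test set with scikit-learn's train_test_split so the classifiers can later be evaluated on unseen documents:

from sklearn.model_selection import train_test_split

# Hold out 20% of the vectorized data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)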

We will use a Naive Bayes classifier and a support vector machine (SVM) classifier for the classification task. The mathematical details of these models are beyond the scope of this book; interested readers can consult the scikit-learn documentation.

An SVM maps the input space into another space through a kernel function so that we can draw a line (or a hyperplane) in that space to separate the classes, as shown in Figure 10.3. The kernel is a mathematical function that performs this vector transformation.


Figure 10.3 How SVM performs vector conversion through kernel function

The Naive Bayes classifier works by applying Bayes' theorem. It assumes that every feature is independent of the others, and with that assumption we can predict which category a document is likely to belong to. Note that this independence is merely assumed; since it usually does not hold exactly, the method is called naive. The labels are used to compute the prior probability of a document belonging to each class; essentially, we are trying to find out which words predict which category. The code itself is very simple, and the only difference from before is that we use the labels to train the model. Only part of the code is listed here; for the complete, runnable code, refer to the Jupyter notebook. Don't forget to convert the data before training the model: if X is a sparse array, run X = X.toarray() first:

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
gnb = GaussianNB()
gnb.fit(X, labels)
svm = SVC()
svm.fit(X, labels)

The trained GNB and SVM models predict the category of unknown documents by calling the predict() method.

The prediction code for Naive Bayes is as follows:

gnb.predict(X_test)

The output is the predicted category for each document. Since the dataset contains 4 categories, the output looks like this:

array([0, 3, 3, ..., 3, 3, 3])

The SVM works similarly; run the following code:

svm.predict(X_test)

The results are as follows:

array([0, 3, 3, ..., 3, 3, 3])

Although clustering is more of an exploratory, interpretive process, in classification we usually want to improve the accuracy, or success rate, of predicting the correct class. GridSearchCV is a scikit-learn function that lets us select the best parameters for a classifier object, and classification_report can be used to check the performance of a classifier.

The official scikit-learn documentation gives the following example:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)

In this example we run a parameter search over the SVM's linear and rbf kernels with C values of 1 and 10.
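
As a complement, here is a hedged sketch of the classification_report mentioned above, assuming X_test and y_test come from a train/test split like the one sketched earlier in this section:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the fitted SVM on held-out data.
predicted = svm.predict(X_test)
print(classification_report(y_test, predicted))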

Another piece of code on the official website runs multiple classifiers on the same dataset and compares their results. Figure 10.4 compares the training and prediction times of these classifiers.


Figure 10.4 The performance of different classifiers on the 20NG dataset

Readers who want more powerful machine learning tools can look into how to classify documents with Word2Vec; Chapter 12 introduces this in detail.

10.6 Summary

In summary, readers can now build their own classification programs, for example classifying email as spam or not spam. We have learned various clustering algorithms, such as K-means and hierarchical clustering, discussed what supervised and unsupervised learning are, and learned how to run both kinds of algorithms with scikit-learn.

In addition, the clustering and topic modeling tools presented in this book can be used to explore text data in many ways. The next chapter tries to build a simple information retrieval system to find similar documents.

Source: blog.csdn.net/epubit17/article/details/108366602