Natural Language Processing: An Introduction to Text Classification

Text classification is the most common application among NLP tasks. Introductory natural language processing and machine learning courses usually start with spam detection, which is itself a text classification task. Common applications of text classification include sentiment analysis, tag assignment, news categorization, and so on.

For classification and clustering problems (and indeed most machine learning tasks), there are three steps:

  1. Get features,
  2. Feed them to a machine learning algorithm,
  3. Evaluate the results and tune the algorithm's parameters.

This article covers two different approaches: feature-based methods and end-to-end methods.

One. Classification based on features

1. Get features

In machine learning tasks, features are absolutely the key to solving the problem (the code below shows just how simple it is to apply a machine learning model), and the process of building and refining features is feature engineering.
Text data has a relatively simple structure (it is just strings of characters), so without adding extra features the feature engineering is straightforward, and the vector space model is generally used to represent the text.

Text preprocessing

Text preprocessing mainly means cleaning the text and removing irrelevant content. The most basic cleaning is stripping non-text content from crawled data (front-end HTML markup is almost certainly unrelated to the text we want to classify and only adds noise to the data), and removing stop words after word segmentation.

Stop words are words that appear frequently but are unrelated to the task, such as "you", "me", "or", "that". However, the stop word list is also highly task-dependent: in news classification, "good or bad" can be treated as a stop word, while in sentiment analysis "good or bad" is definitely a key feature word.
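As a tiny illustration (the stop-word set and tokens below are invented for the example), filtering is just a membership test against the stop-word list:

# A tiny stop-word filtering sketch; the stop-word set and tokens are invented for illustration
stopwords = {'的', '了', '吗', '你', '我'}
tokens = ['2018', '年', '养羊', '了', '吗']        # tokens produced by word segmentation
filtered = [w for w in tokens if w not in stopwords]
print(filtered)   # ['2018', '年', '养羊']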

Word segmentation

Word segmentation is the process of turning a text string into a list of individual words. The most basic approach is to call jieba:

import jieba

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode by default
print(", ".join(seg_list))
## 他, 来到, 了, 网易, 杭研, 大厦

jieba's GitHub page describes its algorithm as follows:

  • Efficient word-graph scanning based on a prefix dictionary, building a directed acyclic graph (DAG) of all possible word combinations of the characters in the sentence
  • Dynamic programming to find the most probable path, i.e. the segmentation with the maximum word frequency
  • For unregistered (out-of-vocabulary) words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm

My simple understanding is that jieba combines a dictionary with an HMM model to perform segmentation. See the blog post for details.

For common types of text, jieba's segmentation is already quite good, and its speed compares favorably with other tools. But for texts in specific domains (such as military or finance), the segmentation quality and granularity are not necessarily what we want. Word segmentation is itself a basic NLP task, so we can build our own segmentation dataset (a lot of manual labeling) and train our own segmentation tool.

Common word segmentation methods include HMM, CRF, LSTM, BI-LSTM-CRF, and BERT+BI-LSTM+CRF; BI-LSTM-CRF is the most common in real application scenarios.
Of course, in practice a jieba custom dictionary is usually enough to make the segmentation match our expectations, at a fraction of the effort of building a segmentation tool ourselves, as the sketch below shows.
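A rough sketch of jieba's custom-dictionary API (the registered term and the dictionary path are placeholders for illustration):

# Customizing jieba; the domain word and the userdict path are placeholders
import jieba

jieba.add_word('杭研大厦', freq=100)            # register a single domain term at runtime
# jieba.load_userdict('data/userdict.txt')      # or load a file: one "word [freq] [POS]" per line
print(list(jieba.cut('他来到了网易杭研大厦')))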

Feature construction

If every word is treated as a feature, then not every word is a feature we actually need. Three questions arise here:

  1. Which words should be chosen as features?
  2. What weight should each feature get?
  3. Can additional features be added?

Question 1
is usually handled by scoring the candidate feature terms independently with some metric and keeping the features (words) with the highest scores. Common metrics include (see Statistical Natural Language Processing by Zong Chengqing, Chapter 13):

  • Document frequency: count how many documents each word appears in and remove words whose frequency is too high or too low. A word that appears in almost every document cannot distinguish between categories; a word that appears only once or twice cannot represent the category of the documents it appears in, so it adds a feature dimension without providing enough information.
  • Mutual information: compute the mutual information between each word and the document categories, and keep the words with high mutual information.
  • Information gain: measure how much knowing whether a word appears increases our information about the document category, and keep the words with high information gain.
  • χ² statistic: measure the correlation between each word and the document categories, and keep the words with high correlation.

In fact, mutual information, information gain, and the χ² statistic all measure the correlation between words and document categories on the training set. If a word has little correlation with the document categories in the training set then, assuming the training and test sets are identically distributed, it will also have little correlation with the categories in the test set, so we can ignore that word. It feels similar to target encoding in machine-learning feature encoding.
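A minimal sketch of this kind of selection with sklearn's chi-square scorer (the toy documents and labels below are invented for illustration):

# Chi-square feature selection on toy data; documents and labels are invented for illustration
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ['股市 大涨 利好', '球队 夺冠 比赛', '股市 下跌 利空', '球队 比赛 失利']
labels = ['finance', 'sports', 'finance', 'sports']

counts = CountVectorizer().fit_transform(docs)
selector = SelectKBest(chi2, k=4)               # keep the 4 words most correlated with the label
selected = selector.fit_transform(counts, labels)
print(selected.shape)                            # (4 documents, 4 selected word features)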

Question 2
Having selected some words as features, we now need to decide the weight of each feature.
Commonly used weights are:

  • Term frequency: how often the word appears in the article.
  • TF-IDF value: TF-IDF is a statistical measure of how important a word is to one document in a collection or corpus. A word's importance increases proportionally with the number of times it appears in the document, but decreases inversely with how often it appears across the whole corpus.
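A small illustration of TF-IDF weighting with sklearn's TfidfVectorizer (the toy sentences are invented; sklearn >= 1.0 is assumed for get_feature_names_out):

# TF-IDF weights on toy sentences; the sentences are invented for illustration
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['股市 大涨', '股市 下跌', '球队 比赛 大涨']
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())    # column order of the vocabulary
print(matrix.toarray().round(2))        # each row is one document's TF-IDF vector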

Question 3
Besides the basic features above, we can also add extra features, such as the length of the article, its source, or even whether it contains capital letters. (Since the vector space model already has a very high dimensionality, the effect of adding a few such features is often not significant.)
We can also add other kinds of features, for example information from syntactic analysis.
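A sketch of how an extra hand-made feature could be appended to the sparse TF-IDF matrix (the feature and documents here are invented for illustration):

# Appending a hand-made feature (token count) to the TF-IDF matrix; toy data for illustration
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['股市 大涨 利好 消息', '球队 比赛']
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
lengths = csr_matrix(np.array([[len(d.split())] for d in docs], dtype=float))  # article length column
combined = hstack([tfidf_matrix, lengths])       # final feature matrix fed to the classifier
print(combined.shape)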

The most basic implementation:

# Data source: https://github.com/skdjfla/toutiao-text-classfication-dataset
from typing import Tuple, Union

import jieba
import pandas
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split


def load_dataset(filepath: str = 'data/头条分类数据.txt', sample: Union[bool, int] = False) -> Tuple:
    texts, labels = [], []
    with open(filepath, 'r') as f:
        for i, line in enumerate(f):
            if sample and i == sample:      # stop early when only a sample is wanted
                break
            _, _, label, text, _ = line.split('_!_')
            text = ' '.join(jieba.cut(text))    # segment, then join with spaces for the vectorizer
            texts.append(text)
            labels.append(label)
    return texts, labels


def apply(instance, train, test):
    """Fit on the training set, then transform train and test separately."""
    train = instance.fit_transform(train)
    test = instance.transform(test)
    return train, test


texts, labels = load_dataset(sample=10000)   # load a sample to keep the demo fast
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(trainDF['text'], trainDF['label'], test_size=0.05,
                                                    stratify=labels, random_state=0)
# Encode the label column
label_encoder = preprocessing.LabelEncoder()
y_train, y_test = apply(label_encoder, y_train, y_test)
# Vectorize the text column
count_vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train, X_test = apply(count_vectorizer, X_train, X_test)
X_train, X_test = apply(tfidf_transformer, X_train, X_test)

2. Feed the algorithm

With the preprocessing above, the text has been fully converted into numerical features that algorithms can handle easily. Almost any classification algorithm can work with this kind of data:

from collections import OrderedDict

from sklearn import linear_model, metrics, naive_bayes, neighbors, svm, tree
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from xgboost import XGBClassifier


class ModelTest():
    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train, self.y_train, self.X_test, self.y_test = X_train, y_train, X_test, y_test

    def eval(self, classifier, name=''):
        """Train the classifier and report its weighted F1 score on the test set."""
        classifier.fit(self.X_train, self.y_train)
        predictions = classifier.predict(self.X_test)

        score = metrics.f1_score(self.y_test, predictions, average='weighted')

        print('%s weighted f1-score : %.03f' % (name, score))

    def apply(self, instance):
        """Fit on the training set, then transform train and test separately."""
        self.X_train = instance.fit_transform(self.X_train)
        self.X_test = instance.transform(self.X_test)


modeltest = ModelTest(X_train, y_train, X_test, y_test)

models = OrderedDict([
    ('KNN', neighbors.KNeighborsClassifier()),
    ('logistic regression', linear_model.LogisticRegression()),
    ('SVM', svm.SVC()),
    ('naive Bayes', naive_bayes.MultinomialNB()),
    ('decision tree', tree.DecisionTreeClassifier()),
    ('decision tree bagging', BaggingClassifier()),
    ('random forest', RandomForestClassifier()),
    ('adaboost', AdaBoostClassifier()),
    ('gbdt', GradientBoostingClassifier()),
    ('xgboost', XGBClassifier()),
])
for name, clf in models.items():
    modeltest.eval(clf, name)

3. Debug algorithm parameters

As you can see, using sklearn for machine learning is really simple, and only some of the available algorithms are listed above. But two questions remain:

  1. Which algorithm works best on the current task?
  2. How should the algorithm's parameters be set?

Question 1
Which algorithm for which kind of problem? This takes task experience plus an understanding of the algorithms. As far as I understand it, naive Bayes is the most classic approach to text classification, and xgboost is a famous workhorse of Kaggle competitions, but it is not clear which will do best here. The only thing that is certain is that KNN is unlikely to be the best choice.
Question 2
How to tune the parameters? Try them one by one. sklearn provides grid search and random search tools (simple enough that you could quickly write one yourself).
Reference code:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

parameters = {'gamma': [0.001, 0.01, 0.1, 1], 'C': [0.001, 0.01, 0.1, 1, 10]}
gs = GridSearchCV(SVC(), parameters, refit=True, cv=5, verbose=1, n_jobs=-1)
gs.fit(X_train, y_train)
print('best parameters: ', gs.best_params_)
print('best score: ', gs.best_score_)

When the algorithm converges quickly and the dataset is small, grid search can indeed find the optimum within the user-defined parameter space. But when the algorithm is slow, the data is large, or the parameter space is big, random search often finds good results faster. A common strategy is therefore to use random search to locate the important parameter region first, and then use grid search to fine-tune the optimal parameters.
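A sketch of that two-step idea with sklearn's RandomizedSearchCV (the parameter ranges and n_iter are illustrative, and X_train / y_train are the matrices from the earlier preprocessing):

# Random search over a wide SVM parameter space; ranges and n_iter are illustrative only
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_dist = {'C': loguniform(1e-3, 1e2), 'gamma': loguniform(1e-4, 1e1)}
rs = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=5, n_jobs=-1, random_state=0)
rs.fit(X_train, y_train)
print('best params from random search:', rs.best_params_)
# A narrower grid search around rs.best_params_ can then fine-tune the result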

Two. End-to-end classification

As the previous sections show, most of the workload in a machine learning task sits in feature engineering (the features above are extremely basic, so this may not be obvious). Another family of methods does very little feature engineering and relies almost entirely on the model: the input is text, the output is a category, and everything in between is the model.
[Figure: results on a relation classification task, comparing a feature-based SVM with "Our CNN"]
Specifically, the figure compares an SVM with the paper's CNN ("Our CNN") on a relation classification task. The SVM is a feature-based method and clearly relies on a large number of external features, while the CNN only uses the text string. The performance gap between the two is actually not large, but the CNN needs far fewer features than the SVM, saving the time of building features by hand.
In my view, for most basic tasks an end-to-end approach can now do the job well and save a great deal of feature-engineering time; for application-specific or highly personalized tasks, hand-crafted features are still needed to get better results.

Of course, the end-to-end approach does not abandon textual information. The pre-trained word vectors used in common NLP tasks actually bring in richer basic semantic information (fastText being the exception). The following sections introduce the basic end-to-end text classification models.

1. fastText text classification

fastText is a word-vector and text-classification tool open-sourced by Facebook, and its typical application scenario is supervised text classification. Roughly speaking, it combines n-gram subword features + a CBOW-style architecture + hierarchical softmax + negative sampling. It trains quickly, needs no pre-trained word vectors, and works well. The code is as follows (it feels quite similar to sklearn):

Make sure the training and test files are in the following format (space-separated segmented text, with the label appended as a __label__<id> token):

How about raising sheep in 2018? __label__0
The first private research university in China was established, and undergraduates will be recruited in 2023 __label__3
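To produce files in this format, one could reuse the load_dataset function from the feature-based section (a sketch; splitting off a separate test file is omitted here):

# Writing the dataset into fastText format, reusing load_dataset defined earlier
texts, labels = load_dataset(sample=10000)
with open('data/fasttext.train.txt', 'w') as f:
    for text, label in zip(texts, labels):
        f.write('%s __label__%s\n' % (text, label))   # segmented text, then the label token
# data/fasttext.test.txt would be produced the same way from a held-out portion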

import fasttext
from sklearn import metrics, preprocessing

# Train the model
model = fasttext.train_supervised('data/fasttext.train.txt', epoch=10)
model.save_model("data/model_filename.bin")
# Test the model
model = fasttext.load_model("data/model_filename.bin")
texts_test, labels_test = [], []
with open('data/fasttext.test.txt', 'r') as f:
    for line in f:
        *text, label = line.strip().split(' ')    # the label is the last token on each line
        text = ' '.join(text)
        texts_test.append(text)
        labels_test.append(label)
label_encoder = preprocessing.LabelEncoder()
labels_test = label_encoder.fit_transform(labels_test)
predicts = list(zip(*(model.predict(texts_test)[0])))[0]   # take the top-1 predicted label per text
predicts = label_encoder.transform(predicts)

score = metrics.f1_score(labels_test, predicts, average='weighted')
print('weighted f1-score : %.03f' % score)

2. CNN text classification

fastText does not need pre-trained word vectors: it learns vector representations of words and n-gram substrings while training the classification task. Most other models, however, do need pre-trained word vectors.
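For example, pre-trained vectors can be loaded with gensim and packed into an embedding matrix for the models below (a sketch; the vector file path and the toy vocabulary are placeholders):

# Loading pre-trained word vectors with gensim and building an embedding matrix; paths are placeholders
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('data/word2vec.txt', binary=False)
vocab = ['股市', '比赛', '大涨']                      # the classifier's own vocabulary
embedding_matrix = np.zeros((len(vocab), wv.vector_size))
for i, word in enumerate(vocab):
    if word in wv:                                    # unseen words stay as zero vectors
        embedding_matrix[i] = wv[word]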
For text classification, CNN feels like the most cost-effective end-to-end model: the model is simple, training time is short, and the results are good.
The simplest CNN text classification model is shown below. [Figure: CNN text classification architecture]
This model is about the simplest possible CNN; compared with the dozens or hundreds of convolutional layers in computer vision models it is tiny, yet it can achieve good results. A detailed explanation of the model can be found in this blog post.
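A minimal sketch of such a model in Keras (my choice of framework; the hyperparameters are illustrative and not from the original post):

# A minimal Kim-style TextCNN sketch in Keras; hyperparameters are illustrative only
import tensorflow as tf
from tensorflow.keras import layers

def build_text_cnn(vocab_size=50000, embed_dim=128, max_len=40, num_classes=15):
    inputs = layers.Input(shape=(max_len,), dtype='int32')
    x = layers.Embedding(vocab_size, embed_dim)(inputs)          # word embedding layer
    pooled = []
    for kernel_size in (2, 3, 4):                                # several filter widths, like n-grams
        c = layers.Conv1D(128, kernel_size, activation='relu')(x)
        pooled.append(layers.GlobalMaxPooling1D()(c))            # max-over-time pooling
    x = layers.Concatenate()(pooled)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model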

3. RNN text classification

Plain RNNs are rarely used directly; the most commonly used RNN variant is the LSTM. RNNs are naturally suited to sequence data, and text is sequence data. For classification, an LSTM can use the output at the last time step as the sentence feature, or concatenate the outputs at all time steps, followed by a classification layer. An attention layer can also be added after the LSTM to improve the results.
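A minimal sketch of the "last output as sentence feature" variant in Keras (again, hyperparameters are illustrative, not from the original post):

# A minimal LSTM classifier sketch in Keras; hyperparameters are illustrative only
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm_classifier(vocab_size=50000, embed_dim=128, num_classes=15):
    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Bidirectional(layers.LSTM(64)),        # final hidden state serves as the sentence feature
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model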


4. Combination model

By now it should be clear that these models can be freely combined. Because the features are relatively uniform, we can put an RNN after a CNN, or a CNN after an RNN, and in principle get reasonable results. The problem is that this increases model complexity: a stronger model needs more data to fit well, and in practice our datasets are often not large enough to match that complexity.

5. HAN classification

In addition, for long texts, features can be extracted hierarchically at the word, sentence, and document levels. In the ordinary text classification above, the whole text is effectively treated as a single sentence.
HAN first uses an RNN to obtain sentence features, and then builds document features on top of the sentence features. This way of extracting features matches intuition better, and it classifies long texts more accurately.

6. More powerful word vectors

All the methods above actually use word2vec or GloVe pre-trained word vectors (fastText excepted); that is, word vectors are the basis of each model's sentence representation, and the models only differ in how they build sentence features on top of them. But such static word vectors (as opposed to contextualized word embeddings) cannot handle polysemy: the apple in "apples are delicious" and the Apple in "Apple released a new version" are two different things, and this kind of polysemy is very common.
Dynamic word vectors (contextualized word embeddings) can capture the correct meaning of a word in its sentence context: obtaining a word's vector also requires looking at what the whole sentence is saying. Producing such vectors is therefore not just a lookup, but also a process of understanding the sentence's semantics, so the representation of each word is more accurate.
(These general-purpose word vectors are produced while training a language model; the vector they assign to a word can differ somewhat from the vector the word should have in a specific task, so the task-specific representation can be obtained by fine-tuning.)
Common dynamic word vector representations include:

  • ELMo
  • GPT
  • BERT
  • ALBERT

For me, these different word vectors are just another way of representing words; they play the same role in the model as the earlier word vectors, and nowadays they are extremely easy to call from various libraries.
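A sketch of what calling a pre-trained model looks like with the HuggingFace transformers library (the model name, label count, and sentences are illustrative; fine-tuning on the classification data is omitted):

# Using a pre-trained Chinese BERT for sequence classification (transformers + PyTorch; illustrative only)
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=15)

inputs = tokenizer(['2018年养羊怎么样', '国内第一所民办研究型大学成立'],
                   padding=True, truncation=True, max_length=40, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits               # the classification head is untrained: fine-tuning is still needed
print(logits.argmax(dim=-1))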

For most NLP tasks, BERT brings a large performance improvement, though of course it also costs far more compute (and electricity) than simple static word vectors.
Compared with BERT, ALBERT has far fewer parameters with little loss in effectiveness.

Thank you all for reading.
This is my first article, and I am still far from getting started with NLP, so there may be many errors; criticism and corrections are welcome.

Below is my official account; feel free to scan the QR code. Thank you for your attention and support!

Official account name: Python into the pit NLP
This account is mainly dedicated to natural language processing, machine learning, coding algorithms, and Python knowledge sharing. I am just a beginner recording my study and work, and I hope we can all make progress together. Exchanges and sharing are welcome.


Original article: blog.csdn.net/lovoslbdy/article/details/104877576