Datawhale Zero-Based Introduction to NLP - News Text Classification, Task 03

Text has variable length, and methods that represent text as numbers or vectors that can be used in computation are called word embedding. Word embedding turns variable-length text into a fixed-length space. To convert raw text into fixed-length feature vectors, scikit-learn provides the following steps:

  • Tokenizing: split each string into tokens and assign an integer id to each possible token, using whitespace and punctuation as token separators.

  • Counting: count the occurrences of each token in each document.

  • Normalizing: re-weight the counts so that tokens occurring in the majority of documents carry diminishing importance (a minimal sketch of these three steps follows this list).
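As a rough illustration of these three steps, the snippet below tokenizes and counts a toy two-document corpus with CountVectorizer and then re-weights the raw counts with TfidfTransformer (the corpus and variable names are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['the cat sat on the mat', 'the dog sat on the log']

# Steps 1 and 2: tokenize on whitespace/punctuation, assign integer ids,
# and count token occurrences per document
counter = CountVectorizer()
counts = counter.fit_transform(docs)
print(counter.vocabulary_)      # token -> integer id
print(counts.toarray())         # document-term count matrix

# Step 3: re-weight/normalize the raw counts (here with TF-IDF weighting)
tfidf = TfidfTransformer()
print(tfidf.fit_transform(counts).toarray().round(2))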

This task uses traditional machine learning methods for text classification. Two approaches are covered: Count Vectors + a classification algorithm (LR/SVM/XGBoost, etc.) and TF-IDF + a classification algorithm (LR/SVM/XGBoost, etc.).

1 Count Vectors + classification algorithm

1.1 Count Vectors

The CountVectorizer class implements tokenization (word segmentation) and occurrence counting in a single class.

Its main constructor parameters are:

input:string {‘filename’, ‘file’, ‘content’}, default=’content’

Defines the format of the input data. If 'filename', the input is a list of file names whose contents are read and analyzed; if 'file', each sequence item must have a 'read' method (a file-like object) that is called to fetch the bytes in memory; if 'content', the input is expected to be a sequence of items of type string or bytes.

encoding:string,default='utf-8'

The encoding used to decode the input when analyzing.

lowercase:bool,default=True

Convert all characters to lowercase before tokenizing.

ngram_range:tuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for the word or character n-grams to be extracted.

analyzer:string, {‘word’, ‘char’, ‘char_wb’} or callable, default=’word’

Whether the features should be made of word n-grams or character n-grams; 'char_wb' creates character n-grams only from text inside word boundaries.

max_df:float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold (corpus-specific stop words). If a float, the value represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.

min_df:float in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold; this value is also called the cut-off in the literature. If a float, the value represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
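To see how analyzer, max_df and min_df interact, the short sketch below builds two vocabularies on the same made-up three-document corpus: one with the word analyzer plus document-frequency filtering, and one with character n-grams restricted to word boundaries:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'the cat sat on the mat',
    'the cat ate the fish',
    'the dog chased a bird',
]

# Word analyzer with document-frequency filtering: 'the' (in every document)
# is dropped by max_df, terms that occur in only one document are dropped by
# min_df, leaving just 'cat' in the vocabulary
word_vec = CountVectorizer(analyzer='word', max_df=0.9, min_df=2)
word_vec.fit(corpus)
print(sorted(word_vec.vocabulary_))   # ['cat']
print(word_vec.stop_words_)           # terms removed by max_df / min_df

# Character bigrams taken only from text inside word boundaries
char_vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
char_vec.fit(corpus)
print(sorted(char_vec.vocabulary_)[:10])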

example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
          'This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?'
        ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

print(X.toarray())


vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names())

print(X2.toarray())

 

1.2 Classification algorithm

Here, ridge regression from the linear models (RidgeClassifier) is used as the classifier; other algorithms such as SVM, LR or XGBoost can be used instead. Later, grid search (GridSearchCV) will be used to traverse their parameters.

Complete example:

import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import  SVC
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


# Read the first 15,000 training samples (tab-separated: label, text)
train_df = pd.read_csv('data/data45216/train_set.csv', sep='\t', nrows=15000)
print(train_df.shape)

# Bag-of-words features restricted to the 3,000 most frequent terms
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

# Train on the first 10,000 samples, validate on the remaining 5,000
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

The output result is: 0.65441877581244

2 TF-IDF+ classification algorithm

2.1 TF-IDF

TF-IDF stands for term frequency-inverse document frequency. The intuition is that if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it has good discriminating power between categories and is suitable for classification. TF-IDF assumes that high-frequency terms should receive high weights unless they also have a high document frequency; the inverse document frequency uses a term's document frequency to offset the effect of its term frequency, so that very common terms end up with lower weights.

Term frequency (TF) is the frequency with which a given term appears in a document. The raw term count is normalized (divided by the document length) to prevent a bias towards long documents. For term $t_i$ in document $d_j$, its importance can be expressed as:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where the numerator $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$, and the denominator is the total number of occurrences of all terms in $d_j$.

Inverse document frequency (IDF) is a measure of how much general importance a term carries. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}$$

where $|D|$ is the total number of documents in the corpus and $|\{ j : t_i \in d_j \}|$ is the number of documents containing term $t_i$. If the term does not appear in the corpus this denominator would be zero, so $1 + |\{ j : t_i \in d_j \}|$ is commonly used instead. The TF-IDF weight is then the product of TF and IDF.
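As a worked example, using the same toy corpus as the examples below (lower-cased, punctuation stripped), the textbook TF-IDF of the term 'document' in the first document can be computed by hand. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its values differ from this plain formula:

import math

# Toy corpus: 4 documents; compute the tf-idf of 'document' in the first one
docs = [
    'this is the first document'.split(),
    'this document is the second document'.split(),
    'and this is the third one'.split(),
    'is this the first document'.split(),
]

term = 'document'
tf = docs[0].count(term) / len(docs[0])        # 1 / 5 = 0.2
df = sum(1 for d in docs if term in d)         # appears in 3 of the 4 documents
idf = math.log(len(docs) / df)                 # log(4 / 3) ≈ 0.288
print(tf * idf)                                # ≈ 0.058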

The main constructor parameters of TfidfVectorizer are:

The parameters input, encoding, lowercase, ngram_range, analyzer, max_df and min_df have the same meaning as for CountVectorizer (see section 1.1 above). The TF-IDF-specific parameters are:

norm:{‘l1’, ‘l2’}, default=’l2’

Each output row is normalized to unit norm: 'l2' means the sum of squares of the vector elements is 1, 'l1' means the sum of absolute values of the vector elements is 1.

use_idf:bool, default=True

Enable inverse document frequency reweighting.

smooth_idf:bool, default=True

Smooth the IDF weights by adding one to document frequencies, as if an extra document contained every term in the collection exactly once. This prevents zero divisions.

sublinear_tf:bool, default=False

Apply sublinear TF scaling, i.e. replace tf with 1 + log(tf).
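The effect of these weighting options is easiest to see by running the same count matrix through differently configured transformers; a minimal sketch on a made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['apple apple banana', 'banana cherry', 'apple cherry cherry cherry']
counts = CountVectorizer().fit_transform(docs)

# Default settings: smoothed idf, l2-normalized rows
print(TfidfTransformer().fit_transform(counts).toarray().round(2))

# sublinear_tf: raw term counts are replaced by 1 + log(tf),
# damping the influence of very frequent terms
print(TfidfTransformer(sublinear_tf=True).fit_transform(counts).toarray().round(2))

# norm='l1': each row sums to 1 instead of having unit Euclidean length
print(TfidfTransformer(norm='l1').fit_transform(counts).toarray().round(2))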

example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
          'This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?'
        ]

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(x.toarray().round(2))

2.2 Classification algorithm

Here, ridge regression (RidgeClassifier) is again used as the classifier; other algorithms such as SVM, LR or XGBoost can be used instead, and grid search (GridSearchCV) will be used later to traverse their parameters.

Complete example (continuing from the code in section 1.2):

# TF-IDF features over word 1- to 3-grams, restricted to 3,000 features
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

The output result is: 0.8719098297954606

Homework for this chapter:

For parameter tuning, consider using grid search to traverse the hyperparameters of both the text-processing step and the classification algorithm. Hyperparameters are parameters that are not learned directly by the estimator; they are passed as arguments to the estimator's constructor. The hyperparameter space is searched for the setting with the best cross-validation score. A search consists of:

  • Estimator (regressor or classifier)
  • Parameter space
  • Methods of searching or sampling candidates
  • Cross-validation scheme
  • Scoring function

scikit-learn provides two general approaches to sampling search candidates: GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV samples a given number of candidates from a parameter space with a specified distribution. Here is an example using GridSearchCV:

from pprint import pprint

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Grid search over the TF-IDF parameters (and the SGD classifier's)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])

parameters = {
    #'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__max_features': (None, 5000, 10000, 50000),
    'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3)),  # unigrams, bigrams or trigrams
    'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

grid_search = GridSearchCV(pipeline, parameters,  verbose=1)
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
grid_search.fit(train_df['text'].tolist()[:10000],train_df['label'].values[:10000])

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Grid search over several classifiers:

import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import  SVC
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(ngram_range=(1,3),max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

# Define several classifiers and their parameter grids
classifiers = [
    ('xgb', xgb.XGBClassifier(), {
        'max_depth': [5, 10, 15, 20, 25],
        'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
        'n_estimators': [50, 100, 200, 300, 500],
    }),
    ('lgb', lgb.LGBMClassifier(), {
        'max_depth': [5, 10, 15, 20, 25],
        'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
        'n_estimators': [50, 100, 200, 300, 500],
    }),
    ('cat', cat.CatBoostClassifier(), {
        'max_depth': [5, 10, 15, 20, 25],
        'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
        'n_estimators': [50, 100, 200, 300, 500],
    }),
    ('svc', SVC(), {
        'kernel': ['rbf'],
        'gamma': [1e-3, 1e-4],
        'C': [1, 10, 100, 1000],
    }),
]

# Run a grid search for each classifier and report its best result
for name, clf, params in classifiers:
    grid_search = GridSearchCV(clf, params, n_jobs=1, verbose=1)
    grid_search.fit(train_test[:10000], train_df['label'].values[:10000])
    print(name, grid_search.best_score_, grid_search.best_params_)

 

Thinking: my TF-IDF + classifier reaches 0.87 on the local validation split but only 0.1773 on submission, while FastText reaches 0.82 locally and 0.833 on submission. What could be the reason?


Original post: blog.csdn.net/qq_28409193/article/details/107553905