Classification of news datasets based on logistic regression

1. About the author

Sheng Siyu, male, School of Electronic Information, Xi'an Polytechnic University, graduate student in 2021
Research direction: Research on industrial image defect detection based on unsupervised network
Email: [email protected]

Wu Yanzi, female, School of Electronic Information, Xi'an Polytechnic University, 2021 graduate student, Zhang Hongwei Artificial Intelligence Research Group
Research direction: Pattern Recognition and Artificial Intelligence
Email: [email protected]

2. Logistic regression

2.1 Logistic regression

In a regression model, the dependent variable is a numerical interval variable, and the model describes a linear relationship between the expectation of the dependent variable and the independent variables. A common example is the linear regression model:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
When regression models are used to analyze practical problems, the variables studied are often not interval variables but ordinal or attribute variables, as in binomial distribution problems. For example, we may analyze age, gender, body mass index, average blood pressure, disease index and other indicators to determine whether a person has diabetes, where Y=0 means no disease and Y=1 means disease. The response variable here follows a two-point (0-1) distribution, so it cannot be predicted with the continuous value of the function h (Y can only take 0 or 1).
In short, the linear regression model usually deals with the problem that the dependent variable is a continuous variable. If the dependent variable is a qualitative variable, the linear regression model is no longer applicable, and the logistic regression model needs to be used to solve it.
Logistic regression is used to deal with regression problems where the dependent variable is categorical. The most common case is the binary (binomial) problem, but multi-class problems can also be handled. Despite its name, it is actually a classification method.
[Figure: S-shaped (Sigmoid) curve]
In a two-class problem, the relationship between the probability and the independent variables is often an S-shaped curve, as shown in the figure above, which is modeled by the Sigmoid function.
Here we define the function as follows:
$$g(x) = \frac{1}{1 + e^{-x}}$$
The domain of the function is all real numbers, its values lie in the open interval (0, 1), and g(0) = 0.5. When |x| is large enough, g(x) is close to 0 or 1, so the output can be treated as a 0-or-1 decision: a value greater than 0.5 is assigned to class 1, a value less than 0.5 to class 0, and a value of exactly 0.5 may be assigned to either class. For a 0-1 variable, the probability of y=1 is defined as follows:
$$P(y = 1) = p$$
The probability of y=0 is defined as follows:
$$P(y = 0) = 1 - p$$
The expected value of the discrete random variable is as follows:
$$E(y) = 1 \cdot p + 0 \cdot (1 - p) = p$$
The linear model is used for analysis, and the formula is transformed as follows:
$$p = P(y = 1 \mid x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$
In practical applications, the relationship between the probability p and the independent variables is often nonlinear. To handle this, we introduce the logit transformation, so that logit(p) is linear in the independent variables. The logistic regression model is defined as follows:
$$\operatorname{logit}(p) = \ln\frac{p}{1 - p} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$
Solving for p gives the expression below, which has the form of the Sigmoid function and reflects the nonlinear relationship between the probability p and the independent variables. Taking 0.5 as the threshold, when the predicted p is greater than 0.5 we judge that y is more likely to be 1; otherwise y is taken to be 0.
$$p = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n)}}$$
After obtaining the required Sigmoid function, it is only necessary to fit the n parameters θ in the formula as in the previous linear regression.
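As a minimal sketch of the relationship between the Sigmoid function, the logit transformation and the 0.5 threshold, the following pure-Python snippet may help. The function names and the parameter values in `theta` are hypothetical and chosen only for illustration; they are not fitted to any data.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1); the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

def predict_label(theta, x, threshold=0.5):
    """Return 1 if the modelled probability exceeds the threshold, else 0.

    theta[0] is the intercept; theta[1:] pair up with the features in x.
    """
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1 if sigmoid(z) > threshold else 0

theta = [-1.0, 2.0]  # hypothetical parameters, not fitted to any data

print(sigmoid(0.0))                 # 0.5
print(predict_label(theta, [1.5]))  # z = 2.0, so the label is 1
print(predict_label(theta, [0.0]))  # z = -1.0, so the label is 0
```

Because `logit` undoes `sigmoid`, a linear model on the log-odds scale is equivalent to a Sigmoid-shaped model on the probability scale.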
The dependent variable of logistic regression can be binary or multi-class, but the binary case is more commonly used and easier to interpret, so binary logistic regression is the most common in practice. A multi-class problem can be reduced to binary problems: treat one class as the positive class and all the remaining classes together as the negative class.

2.2 Logistic regression algorithm

The LogisticRegression model lives in the sklearn.linear_model module, and calling the sklearn logistic regression algorithm takes only a few steps:
Import the model: import LogisticRegression and create an instance with LogisticRegression().
Train with fit(): call the fit(x, y) method to train the model, where x holds the features of the data and y the class labels.
Predict with predict(): use the trained model to predict a data set and return the predictions.

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
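The three steps can be sketched on a tiny data set as follows. The data here is made up for illustration, and the model is created with whatever defaults the installed scikit-learn version uses, which may differ from the signature shown above.

```python
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): one feature, two classes.
X = [[0.0], [0.5], [1.0], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()   # 1. create the model
clf.fit(X, y)                # 2. train on features X and labels y

new_points = [[0.2], [3.8]]
pred = clf.predict(new_points)         # 3. predict class labels
proba = clf.predict_proba(new_points)  # per-class probabilities

print(pred)
print(proba)
```

`predict` returns the class with the highest probability, while `predict_proba` exposes the underlying Sigmoid output for each class.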

1. Regularization selection parameters: penalty
LogisticRegression and LogisticRegressionCV apply regularization by default. The optional values of the penalty parameter are "l1" and "l2", corresponding to L1 regularization and L2 regularization respectively. The default is L2 regularization.
When tuning parameters, if the main goal is to reduce overfitting, choosing L2 regularization for penalty is usually enough. If the model still overfits with L2 regularization, that is, it still predicts poorly on unseen data, L1 regularization can be considered. L1 regularization is also useful when the model has many features and we want the coefficients of unimportant features to be driven to zero, so that the model coefficients are sparse.
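The sparsity effect of L1 regularization can be checked with a short sketch. The synthetic data set from `make_classification` and the value `C=0.05` are assumptions chosen for illustration; the exact number of zeroed coefficients depends on the data and on the scikit-learn version, and newer releases may emit a deprecation warning for the `penalty` argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of which carry signal.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# A small C means strong regularization; liblinear supports both penalties.
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print('zero coefficients with L1:', n_zero_l1)
print('zero coefficients with L2:', n_zero_l2)
```

L2 regularization shrinks coefficients toward zero without making them exactly zero, whereas L1 regularization typically sets the coefficients of uninformative features to exactly zero.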

2. Optimization algorithm selection parameter: solver
The solver parameter determines the optimization method used for the logistic regression loss function. Four algorithms are described here:
liblinear: uses the open source liblinear library, which internally applies coordinate descent to iteratively optimize the loss function.
lbfgs: a quasi-Newton method, which uses an approximation of the second derivative matrix of the loss function, that is, the Hessian matrix, to iteratively optimize the loss function.
newton-cg: also a member of the Newton's method family. It uses the second derivative matrix of the loss function, that is, the Hessian matrix, to iteratively optimize the loss function.
sag: Stochastic Average Gradient descent, a variant of gradient descent. Unlike ordinary gradient descent, each iteration uses only part of the samples to calculate the gradient, which makes it suitable for large data sets. SAG is a linearly convergent algorithm and converges much faster than SGD. For more on SAG, refer to the blog post "SAG, SVRG (Stochastic Gradient Descent) of Linear Convergence Stochastic Optimization Algorithms".
In summary, liblinear supports L1 and L2 regularization but only OvR for multi-class problems; lbfgs, sag and newton-cg support only L2 regularization, and support both OvR and MvM for multi-class problems.

3. Classification method selection parameter: multi_class
The multi_class parameter determines the multi-class strategy. There are two values to choose from, ovr and multinomial, and the default is ovr.
ovr is the one-vs-rest (OvR) approach mentioned above, and multinomial is what this article calls many-vs-many (MvM). Note that scikit-learn implements multinomial by fitting a single multinomial (softmax) loss over all classes jointly rather than by training pairwise classifiers. For binary logistic regression there is no difference between ovr and multinomial; the difference appears only in multi-class logistic regression.
The idea of OvR is simple: however many classes there are, each one is handled as a binary logistic regression. For the classification decision of the Kth class, we take all samples of the Kth class as positive examples and all other samples as negative examples, then fit a binary logistic regression to obtain the classification model for the Kth class. The models for the other classes are obtained in the same way.

MvM is relatively complicated, so the special case one-vs-one (OvO) is used here for explanation. If the model has T classes, we select two classes at a time from the T classes, say T1 and T2, put together all samples whose label is T1 or T2, take T1 as the positive class and T2 as the negative class, and fit a binary logistic regression to obtain the model parameters. In total T(T-1)/2 classifiers are needed.
From the description above, OvR is relatively simple, but its classification performance is relatively poor (this holds for most sample distributions; OvR may be better under some). MvM classification is relatively accurate, but it is not as fast as OvR.
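The bookkeeping behind OvR and OvO can be sketched in a few lines of plain Python. The helper names `ovr_targets` and `ovo_pairs` and the toy labels are hypothetical; the sketch only builds the binary sub-problems and does not train any classifier.

```python
from itertools import combinations

def ovr_targets(labels, positive_class):
    """One-vs-rest: relabel one class as 1 and every other class as 0."""
    return [1 if lab == positive_class else 0 for lab in labels]

def ovo_pairs(classes):
    """One-vs-one: every unordered pair of classes gets its own classifier."""
    return list(combinations(sorted(classes), 2))

labels = ['a', 'b', 'c', 'a', 'c', 'd']  # toy labels, T = 4 classes
classes = set(labels)

print(ovr_targets(labels, 'c'))  # [0, 0, 1, 0, 1, 0]
pairs = ovo_pairs(classes)
print(pairs)
print(len(pairs))                # T(T-1)/2 = 4*3/2 = 6
```

OvR needs T binary models, each trained on all samples, while OvO needs T(T-1)/2 models, each trained only on the samples of its two classes.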

If ovr is selected, the four optimization methods of loss function liblinear, newton-cg, lbfgs and sag can be selected. But if you choose multinomial, you can only choose newton-cg, lbfgs and sag.

4. Type weight parameter: class_weight
The class_weight parameter sets the weight of each class in the classification model. It can be omitted, in which case weights are not considered, that is, all classes have the same weight. If it is given, we can choose balanced to let the library calculate the class weights itself, or we can supply the weight of each class ourselves. For example, for a binary model with classes 0 and 1, we can define class_weight={0:0.9, 1:0.1}, so that class 0 has a weight of 90% and class 1 has a weight of 10%.
If class_weight is set to balanced, the library calculates the weights from the training sample sizes: the more samples a class has, the lower its weight, and the fewer samples, the higher its weight.
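The balanced heuristic can be written out by hand. The sketch below follows the formula given in the scikit-learn documentation, n_samples / (n_classes * count of that class); the helper name `balanced_class_weights` and the 90/10 toy labels are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

y = [0] * 90 + [1] * 10  # imbalanced toy labels
weights = balanced_class_weights(y)
print(weights)  # class 0: 100/(2*90), about 0.556; class 1: 100/(2*10) = 5.0
```

With these weights, each class contributes the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0 = 50).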

5. Sample weight parameter: sample_weight
When the classes are imbalanced, the sample is not an unbiased estimate of the overall population, which may reduce the predictive ability of the model. In this case, we can try to solve the problem by adjusting the sample weights. There are two ways to do so. The first is to use balanced in class_weight. The second is to set the weight of each sample through sample_weight when calling the fit function.
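The two ways of weighting can be compared in a short sketch. The synthetic data and the weight of 9.0 for the minority class are assumptions for illustration; the point is only that a per-class weight and the equivalent per-sample weights describe the same weighted loss, so the fitted coefficients should agree closely.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(n_samples=300, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Way 1: weight by class through the class_weight parameter.
class_w = {0: 1.0, 1: 9.0}
by_class = LogisticRegression(class_weight=class_w).fit(X, y)

# Way 2: give every sample its own weight when calling fit().
per_sample = np.where(y == 1, 9.0, 1.0)
by_sample = LogisticRegression().fit(X, y, sample_weight=per_sample)

max_diff = float(np.max(np.abs(by_class.coef_ - by_sample.coef_)))
print('largest coefficient difference:', max_diff)
```

sample_weight is the more general tool, since it can also express weights that differ between samples of the same class.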

3. Experimental procedure

3.1 Introduction to the fetch_20newsgroups (20 categories of news text) dataset

The 20newsgroups dataset contains more than 18,000 news articles covering 20 topics, hence the name. It is split into a training set and a test set, is commonly used for text classification, and its posts are spread nearly evenly across 20 newsgroups on different topics. The 20newsgroups dataset is one of the international standard datasets used for text classification, text mining and information retrieval research. Some newsgroups have very similar topics, while others are completely unrelated (e.g. /soc.religion.christian).

3.2 Experimental code

The whole process is divided into four parts: data collection, feature extraction, model training, and model evaluation.
1. Data collection

from sklearn.datasets import fetch_20newsgroups
# the remaining category names are missing from the source, so only two are listed
categories = ['alt.atheism', 'soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

data_home: the directory where the dataset is stored. By default, all data is kept in the '~/scikit_learn_data' folder.
subset: one of train, test and all, corresponding to the training set, the test set and all samples respectively.
categories: the categories to load. If categories are specified, only those categories are extracted; by default, all categories are extracted.
shuffle: whether to shuffle the order of the samples, which matters for models that assume the samples are independent of each other.
random_state: the random seed used for shuffling.
remove: a tuple naming the parts of each post to strip, such as headers, footers and quotes.
download_if_missing: whether to download the data if it is not available locally.
After testing, we know that twenty_train.data is a list in which each element is a str holding one article, and twenty_train.target holds the corresponding labels.
2. Feature extraction

from sklearn.feature_extraction.text import CountVectorizer
# turn each article into a vector of word counts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
from sklearn.feature_extraction.text import TfidfTransformer
# reweight the raw counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
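To see what the two transformers produce, here is a small sketch on three made-up sentences instead of the news articles. It assumes the default settings of CountVectorizer and TfidfTransformer, under which each TF-IDF row is scaled to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['the cat sat', 'the dog sat', 'the dog barked loudly']  # toy corpus

count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)           # sparse matrix of word counts
tfidf = TfidfTransformer().fit_transform(counts)  # reweighted by TF-IDF

# Euclidean length of each document vector after TF-IDF.
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()

the_idx = count_vect.vocabulary_['the']
cat_idx = count_vect.vocabulary_['cat']
print(counts.shape)                            # (documents, vocabulary size)
print(tfidf[0, the_idx], tfidf[0, cat_idx])    # 'the' is weighted below 'cat'
print(row_norms)
```

A word such as "the" that appears in every document receives a lower weight than a word such as "cat" that appears in only one, which is the purpose of the inverse document frequency term.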

3. Model training

from sklearn.linear_model import LogisticRegression  # logistic regression
# train on the TF-IDF features and the training labels
clf = LogisticRegression().fit(X_train_tfidf, twenty_train.target)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
# new documents must go through the same fitted transformers
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

The logistic regression model is trained and then used to predict the category of two new sentences, giving the following result:
[Figure: screenshot of the prediction output]
4. Model evaluation
A model is generally evaluated using PRF (precision, recall and F1 value) and Acc (accuracy). This information is easy to obtain with the metrics.classification_report method, which compares the predicted labels with the true labels:

from sklearn import metrics
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
# reuse the vocabulary and IDF weights fitted on the training set
X_test_counts = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
print("accuracy\t" + str(np.mean(predicted == twenty_test.target)))
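For reference, the quantities in the report can be computed by hand for one class. The sketch below uses made-up binary labels and a hypothetical helper `precision_recall_f1`; it only illustrates the definitions and is not how scikit-learn implements them.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # toy ground truth
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # toy predictions

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)  # 0.75 0.75 0.75 0.75
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 3/4, and 6 of the 8 predictions are correct.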

3.3 Running Results

[Figure: screenshot of the classification report and accuracy output]

