[Machine Learning Practice] Logistic Regression: digits Handwritten Digit Classification

【Import libraries and datasets】

As with linear regression, we first import the required libraries and the dataset.
Imports:

## for plotting and visualization
import matplotlib.pyplot as plt
## for scientific computing
import numpy as np
## for data analysis
import pandas as pd
## for loading built-in datasets (and generating synthetic ones)
from sklearn import datasets
## linear models
from sklearn import linear_model
### for cross-validation and train/test splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
# from sklearn.model_selection import cross_val_score
### scoring functions, performance metrics, distance computations, etc.
from sklearn import metrics
### for data preprocessing
from sklearn import preprocessing

Data set:
This time the dataset is again one of the small datasets bundled with sklearn: the digits handwritten digit dataset. Let's first look at the official introduction:
[Figure: official description of the digits dataset]
From the official description we know that the digits dataset is a collection of 1797 8*8 pixel images of handwritten digits. There are 10 classes, representing the digits 0 through 9, and the feature dimension is 64, corresponding to the 8*8 pixels of each image. Knowing this, we can inspect the data in detail.
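For reference, the same official introduction can be printed directly from the dataset's DESCR attribute; a minimal sketch:

from sklearn import datasets

digits = datasets.load_digits()
print(digits.DESCR)  # the official description shown in the figure above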

digits = datasets.load_digits()  # load the digits dataset
print(digits.keys())             # see which attributes digits provides
print(digits.data.shape)         # shape of the flattened feature matrix
print(digits.target_names)       # the class labels
print(digits.target[0])          # label of the first image
print(digits.images[0])          # 8*8 pixel matrix of the first image
print(digits.target[1])          # label of the second image
print(digits.images[1])          # 8*8 pixel matrix of the second image

Output:
dict_keys(['images', 'target_names', 'DESCR', 'data', 'target'])
(1797, 64)
[0 1 2 3 4 5 6 7 8 9]
0
[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
1
[[  0.   0.   0.  12.  13.   5.   0.   0.]
 [  0.   0.   0.  11.  16.   9.   0.   0.]
 [  0.   0.   3.  15.  16.   6.   0.   0.]
 [  0.   7.  15.  16.  16.   2.   0.   0.]
 [  0.   0.   1.  16.  16.   3.   0.   0.]
 [  0.   0.   1.  16.  16.   6.   0.   0.]
 [  0.   0.   1.  16.  16.   6.   0.   0.]
 [  0.   0.   0.  11.  16.  10.   0.   0.]]

There are 1797 samples in the digits dataset, with classification labels 0 through 9. Printing the labels and pixel arrays of the first two images shows that every pixel value lies between 0 and 16. This is still not very intuitive, so let's draw the images.

plt.gray()                         # use a grayscale colormap
for i in range(2):
    plt.matshow(digits.images[i])  # show the i-th image
    plt.show()
    print(digits.target[i])        # and print its label

The output is:
[Figure: the first two digit images, shown with their labels 0 and 1]

The digit shapes are now roughly recognizable. We can also plot a batch of these images together with their labels.

fig=plt.figure(figsize=(8,8))
fig.subplots_adjust(left=0,right=1,bottom=0,top=1,hspace=0.05,wspace=0.05)
for i in range(30):
    ax=fig.add_subplot(6,5,i+1,xticks=[],yticks=[])
    ax.imshow(digits.images[i],cmap=plt.cm.binary,interpolation='nearest')
    ax.text(0,7,str(digits.target[i]))  # draw the label in the lower-left corner
plt.show()

The output is:
[Figure: the first 30 digit images with their labels]

【Binary classification】

From the analysis of the dataset above, we know that digits has 10 classes. If we regard 0 through 4 as one class and 5 through 9 as another, the problem becomes binary classification.
Import the data:
Get the inputs and outputs of the dataset, and split them into training and test sets.

digits_X = digits.data   ## the inputs (features) of the dataset
digits_y = digits.target ## the outputs, i.e. the labels (classes)
### test_size: fraction of the data held out for testing
X_train,X_test,y_train,y_test = train_test_split(digits_X, digits_y, test_size = 0.1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
## Output:
(1617, 64)
(180, 64)
(1617,)
(180,)

The training set has 1617 samples and the test set has 180 samples.

Since the features of the digits input data are on inconsistent scales across dimensions, a standardization step is used to preprocess the data (applied per feature).
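As an aside, StandardScaler transforms each feature (column) to zero mean and unit variance across samples; a minimal sketch on a toy matrix (the values are arbitrary):

import numpy as np
from sklearn import preprocessing

X = np.array([[ 0.,  5., 13.],
              [ 0., 13., 15.],
              [ 0.,  3., 15.]])
X_scaled = preprocessing.StandardScaler().fit_transform(X)
## each column now has zero mean and (for non-constant columns) unit
## variance; the constant first column stays zero instead of dividing by 0
print(X_scaled.mean(axis=0))  # ~[0. 0. 0.]
print(X_scaled.std(axis=0))   # [0. 1. 1.]

Now let's see what the preprocessed digits data looks like.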

digits_X = digits.data   ## the inputs (features) of the dataset
digits_y = digits.target ## the outputs, i.e. the labels (classes)
### test_size: fraction of the data held out for testing
X_train,X_test,y_train,y_test = train_test_split(digits_X, digits_y, test_size = 0.1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
plt.gray()
plt.matshow(np.abs(X_train[0].reshape(8, 8)))  ## the first training image before scaling
plt.show()
X_train=preprocessing.StandardScaler().fit_transform(X_train)

plt.gray()
plt.matshow(np.abs(X_train[0].reshape(8, 8)))  ## the same image after scaling
plt.show()

print(X_train[0])
print("Mean of the first sample:", X_train[0].mean())
print("Variance of the first sample:", X_train[0].var())

The output is:
[Figure: the first training image before and after standardization]

[ 0.         -0.33114795 -1.08393502 -1.60706168  0.96486852  0.57531019
 -0.40894927 -0.11573126 -0.05788529 -0.61731417 -1.71250427  0.25255752
  1.19287415 -0.34182745 -0.51285157 -0.12578097 -0.04311306 -0.72760443
  0.55376957  1.52982569  1.43339513 -0.60237557 -0.54488676 -0.11662057
 -0.02487593  0.80537163  0.94404281  1.23094941  0.9929394  -0.42889005
 -0.63114251 -0.0497981   0.         -0.67552341 -1.22434888  0.14582454
  0.96358366 -0.3042967  -0.82781514  0.         -0.06468145 -0.53108301
 -1.06144103  0.26953149  1.32170303 -0.04757315 -0.8012934  -0.09359159
 -0.03735601 -0.40118922 -1.33164946 -0.1055701   1.23816962  0.69686567
 -0.75764252 -0.22004476 -0.02487593 -0.29910251 -1.08083868 -1.61483653
  0.44573836  0.38372346 -0.50824622 -0.19897474]
Mean of the first sample: -0.0993004141872
Variance of the first sample: 0.579330788574

The first picture above shows the image before standardization, and the second shows it after. Note that StandardScaler standardizes each feature (pixel position) across samples, which is why the mean and variance of a single sample, as printed above, are close to but not exactly 0 and 1.

Since we are implementing a binary classification problem here, we convert the labels from the ten classes 0 through 9 into the two classes 0 and 1.

print(y_train[0:20])
y_train_trans = (y_train > 4).astype(int)
print(y_train_trans[0:20])
### Output:
[1 0 1 3 8 8 0 3 4 1 9 1 9 7 9 4 3 8 0 5]
[0 0 0 0 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1]

The first line shows the original ten-class labels; the second shows the binary labels after transformation: labels less than or equal to 4 become class 0, and labels greater than or equal to 5 become class 1. The binary labels are kept in a separate variable, y_train_trans, so the original ten-class labels remain available for the multi-class experiments below.

Load the model and train:

## load a logistic regression model with L1 regularization;
## C=0.5 is the inverse regularization strength
## (the liblinear solver is used because it supports the L1 penalty)
model_LR_l1=linear_model.LogisticRegression(C=0.5, penalty='l1', tol=0.01, solver='liblinear')
## fit the model on the training data
model_LR_l1.fit(X_train,y_train_trans)
## prepare the test data (strictly speaking, the scaler fitted on the
## training set should be reused here rather than refit on the test set)
X_test_scale=preprocessing.StandardScaler().fit_transform(X_test)
y_test_trans = (y_test>4).astype(int)
## score the trained model on the test data
print(model_LR_l1.score(X_test_scale,y_test_trans))
## get predictions for the test inputs from the trained model
y_pred = model_LR_l1.predict(X_test_scale)

Score: 0.933333333333
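The metrics module imported at the top can compute the same accuracy explicitly; a minimal sketch, assuming y_test_trans and y_pred from the code above:

from sklearn import metrics

## fraction of correctly classified test samples; equivalent to
## model_LR_l1.score(X_test_scale, y_test_trans) above
print(metrics.accuracy_score(y_test_trans, y_pred))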

Graphical display:
We can look at the predictions for these 180 test samples (taking the first 30 as an example).

fig1=plt.figure(figsize=(8,8))
fig1.subplots_adjust(left=0,right=1,bottom=0,top=1,hspace=0.05,wspace=0.05)
for i in range(30):
    ax=fig1.add_subplot(6,5,i+1,xticks=[],yticks=[])
    ax.imshow(np.abs(X_test[i].reshape(8, 8)),cmap=plt.cm.binary,interpolation='nearest')
    ax.text(0,1,str(y_test[i]))  # true (ten-class) label at the top
    ax.text(0,7,str(y_pred[i]))  # predicted binary label at the bottom
plt.show()

In the figure, the top number is the true label (the original ten-class label, not the binary one) and the bottom number is the predicted value (the binary label: 0 means the true digit is less than or equal to 4, 1 means it is greater than or equal to 5).
[Figure: the first 30 test images with true labels (top) and binary predictions (bottom)]
We can see that the 5th, 6th, and 26th predictions are wrong. To my eye, the 4s in the 5th and 26th images look more like 9s, so the model judged them to be in the greater-than-or-equal-to-5 class and output 1; the 7 in the 6th image looks more like a 1, so the model judged it to be in the less-than-or-equal-to-4 class and output 0.

Let's see how many samples in total are misclassified.

fig2=plt.figure(figsize=(8,8))
fig2.subplots_adjust(left=0,right=1,bottom=0,top=1,hspace=0.05,wspace=0.05)
num=0
for i in range(180):
    if(y_test_trans[i]!=y_pred[i]):  # collect the misclassified samples
        num=num+1
        ax=fig2.add_subplot(12,5,num,xticks=[],yticks=[])
        ax.imshow(np.abs(X_test[i].reshape(8, 8)),cmap=plt.cm.binary,interpolation='nearest')
        ax.text(0,1,str(y_test[i]))  # true (ten-class) label
        ax.text(0,7,str(y_pred[i]))  # predicted binary label
plt.show()
print(num)

[Figure: the 12 misclassified test samples]
There are 12 misclassified samples in total: 1 - 12/180 ≈ 0.9333, which agrees with the score above.
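A confusion matrix summarizes these errors without plotting every image; a minimal sketch using the metrics module, assuming y_test_trans and y_pred from the binary model above:

from sklearn import metrics

## rows are true classes (0: digit <= 4, 1: digit >= 5), columns are
## predicted classes; the off-diagonal entries are the misclassifications
print(metrics.confusion_matrix(y_test_trans, y_pred))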

【Multi-class classification】

For multi-class classification, the multi_class parameter can be set to "ovr" (one-vs-rest) or "multinomial" (a true multinomial model, often grouped with MvM methods).
First, the OvR approach:

## load a logistic regression model using the stochastic average
## gradient (sag) solver, with the maximum number of iterations set to 5000
model_LR_mult=linear_model.LogisticRegression(solver='sag',max_iter=5000,random_state=42,multi_class='ovr')
## fit the model on the (ten-class) training data
model_LR_mult.fit(X_train,y_train)

If max_iter is set too small, the solver may fail to converge and sklearn raises a ConvergenceWarning:
[Figure: ConvergenceWarning message from a run with too few iterations]
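For reference, a minimal sketch that provokes and prints the warning (max_iter=5 is a hypothetical value chosen only for illustration; X_train and y_train are from above):

import warnings
from sklearn import linear_model
from sklearn.exceptions import ConvergenceWarning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    linear_model.LogisticRegression(solver='sag', max_iter=5,
                                    multi_class='ovr').fit(X_train, y_train)
for w in caught:
    print(w.message)  # the non-convergence warning

Back to evaluating the OvR model on the test set: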

X_test_scale=preprocessing.StandardScaler().fit_transform(X_test)
print(model_LR_mult.score(X_test_scale,y_test))
y_pred = model_LR_mult.predict(X_test_scale)
Output: 0.955555555556
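Under the hood, OvR fits one binary classifier per class; the per-class probabilities can be inspected with predict_proba. A minimal sketch, assuming model_LR_mult and X_test_scale from above:

## one probability column per class; the predicted label is the argmax
proba = model_LR_mult.predict_proba(X_test_scale)
print(proba.shape)        # (180, 10)
print(proba[0].argmax())  # predicted class for the first test sample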

Visualizing the predictions again: the top number is the true label and the bottom number is the predicted value.
[Figure: the first 30 test images with true labels (top) and predictions (bottom)]

Find the wrongly predicted samples in the test set:
[Figure: the 8 misclassified test samples]
There are 8 misclassified samples in total: 1 - 8/180 ≈ 0.95556, again matching the score.
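The cross_val_predict imported at the top can give a picture that depends less on one particular train/test split; a minimal sketch on the standardized training set, assuming model_LR_mult, X_train and y_train from above (cv=5 is an arbitrary choice):

from sklearn.model_selection import cross_val_predict
from sklearn import metrics

## 5-fold cross-validated predictions for every training sample
y_cv = cross_val_predict(model_LR_mult, X_train, y_train, cv=5)
print(metrics.accuracy_score(y_train, y_cv))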

Using the MvM (multinomial) approach:
The code is the same as above, except that the multi_class parameter is set to multinomial.

## load a logistic regression model using the sag solver, with the
## multinomial (MvM) multi-class strategy
model_LR_mult=linear_model.LogisticRegression(solver='sag',max_iter=5000,random_state=42,multi_class='multinomial')
## fit the model on the training data
model_LR_mult.fit(X_train,y_train)
X_test_scale=preprocessing.StandardScaler().fit_transform(X_test)
print(model_LR_mult.score(X_test_scale,y_test))
y_pred = model_LR_mult.predict(X_test_scale)
## plot the true labels and predictions for the first 30 test samples
fig5=plt.figure(figsize=(8,8))
fig5.subplots_adjust(left=0,right=1,bottom=0,top=1,hspace=0.05,wspace=0.05)
for i in range(30):
    ax=fig5.add_subplot(6,5,i+1,xticks=[],yticks=[])
    ax.imshow(np.abs(X_test[i].reshape(8, 8)),cmap=plt.cm.binary,interpolation='nearest')
    ax.text(0,1,str(y_test[i]))
    ax.text(0,7,str(y_pred[i]))
plt.show()
fig6=plt.figure(figsize=(8,8))
fig6.subplots_adjust(left=0,right=1,bottom=0,top=1,hspace=0.05,wspace=0.05)
## find the misclassified samples and plot them
num=0
for i in range(180):
    if(y_test[i]!=y_pred[i]):
        num=num+1
        ax=fig6.add_subplot(6,5,num,xticks=[],yticks=[])
        ax.imshow(np.abs(X_test[i].reshape(8, 8)),cmap=plt.cm.binary,interpolation='nearest')
# annotate the image with the true label (top) and prediction (bottom)
        ax.text(0,1,str(y_test[i]))
        ax.text(0,7,str(y_pred[i]))
plt.show()
print(num)
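Beyond overall accuracy, a per-class breakdown shows which digits the multinomial model confuses; a minimal sketch using classification_report, assuming y_test and y_pred from the code above:

from sklearn import metrics

## precision, recall and F1 score for each of the ten digit classes
print(metrics.classification_report(y_test, y_pred))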
