Exercise 3: Multiple Classification Problems

Introduction

In this exercise, we will use logistic regression to recognize handwritten digits (0 to 9). We will extend the logistic regression implementation from Exercise 2 and apply it to one-vs-all classification.

Before starting the exercise, you need to download the following data file:

ex3data1.mat - training set of handwritten digits

The exercise consists of the following required tasks:

Implement vectorized logistic regression ----(40 points)
Train one-vs-all classifiers ----(40 points)
Predict using the one-vs-all classifiers ----(20 points)

1 Multi-class Classification

In this part of the exercise, you will extend the logistic regression algorithm you implemented earlier and apply it to a multi-class classification problem.

1.1 Dataset

The file ex3data1.mat contains a training set of 5000 handwritten digits. Each sample is a 20 pixel by 20 pixel grayscale image, and each pixel is represented by a floating-point number giving the grayscale intensity at that position.
Each 20x20 pixel grid is unrolled into a 400-dimensional vector, and each training sample becomes one row of the data matrix. This gives a 5000x400 matrix $X$ in which every row is one image of a handwritten digit:

$X=\begin{bmatrix} -\ (x^{(1)})^{T}\ - \\ -\ (x^{(2)})^{T}\ - \\ \vdots \\ -\ (x^{(m)})^{T}\ - \end{bmatrix}$

The second part of the training set is a 5000-dimensional vector $y$ containing the training set labels.

1.2 Data Visualization


Images are represented in matrix X as 400-dimensional vectors (of which there are 5,000). The 400-dimensional "features" are the grayscale intensities of each pixel in the original 20 x 20 image. The class labels are in the vector y as the numeric classes representing the digits in the image.

Next, we need to load the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat

data = loadmat('/home/jovyan/work/ex3data1.mat')
data

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 'y': array([[10],
        [10],
        [10],
        ...,
        [ 9],
        [ 9],
        [ 9]], dtype=uint8)}

Use the shape attribute to check the shapes of $X$ and $y$:

data['X'].shape, data['y'].shape

((5000, 400), (5000, 1))
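
Since the figure for this section is not available, here is a minimal sketch of how a grid of digits could be assembled for display. The helper name tile_digits and the synthetic X_demo are illustrative assumptions, not part of the exercise. The transpose assumes the pixels were flattened column by column, as MATLAB does.

```python
import numpy as np

def tile_digits(X, indices, side=20, n_cols=10):
    """Arrange the selected rows of X into one 2-D image grid (hypothetical helper)."""
    indices = np.asarray(indices)
    n_rows = int(np.ceil(len(indices) / n_cols))
    grid = np.zeros((n_rows * side, n_cols * side))
    for k, idx in enumerate(indices):
        r, c = divmod(k, n_cols)
        # Assumption: pixels were flattened in column-major (MATLAB) order,
        # so the reshaped image is transposed to appear upright.
        digit = X[idx].reshape(side, side).T
        grid[r * side:(r + 1) * side, c * side:(c + 1) * side] = digit
    return grid

# Synthetic stand-in for data['X'] so the sketch runs without the .mat file.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 400))
grid = tile_digits(X_demo, indices=range(20))
print(grid.shape)  # (40, 200)
```

With the real data, something like plt.imshow(tile_digits(data['X'], np.random.choice(5000, 100, replace=False)), cmap='gray') should draw 100 random digits in a 10x10 grid.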

1.3 Vectorization of Logistic Regression

In this part of the exercise, you need to modify the logistic regression implementation so that it is fully vectorized (i.e. contains no for loops). Vectorized code is not only more concise, it can also take advantage of optimized linear algebra routines and is often much faster than iterative code. As we saw in Exercise 2, however, our cost function already has a fully vectorized implementation, so we can reuse the same implementation here.

1.3.1 Vectorization of cost function

You need to write code to vectorize the cost function. We already know that the cost function is:

$J\left( \theta \right)=\frac{1}{m}\sum\limits_{i=1}^{m}{[-{ {y}^{(i)}}\log \left( { {h}_{\theta }}\left( { {x}^{(i)}} \right) \right)-\left( 1-{ {y}^{(i)}} \right)\log \left( 1-{ {h}_{\theta }}\left( { {x}^{(i)}} \right) \right)]}$
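To compute each element of the sum, we need $h_{\theta}\left( x^{(i)} \right)$ for each sample $i$, where
$h_{\theta}\left( x^{(i)} \right)=g\left( \theta^{T}x^{(i)} \right)$
and the sigmoid function is:
$g\left( z \right)=\frac{1}{1+e^{-z}}$
It turns out that we can compute this quickly for all examples with matrix multiplication. We define $X$ and $\theta$ as:
$X=\begin{bmatrix} -\ (x^{(1)})^{T}\ - \\ -\ (x^{(2)})^{T}\ - \\ \vdots \\ -\ (x^{(m)})^{T}\ - \end{bmatrix},\quad \theta=\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$

Then, computing the matrix product $X\theta$ gives:
$X\theta=\begin{bmatrix} -\ (x^{(1)})^{T}\theta\ - \\ -\ (x^{(2)})^{T}\theta\ - \\ \vdots \\ -\ (x^{(m)})^{T}\theta\ - \end{bmatrix}=\begin{bmatrix} -\ \theta^{T}x^{(1)}\ - \\ -\ \theta^{T}x^{(2)}\ - \\ \vdots \\ -\ \theta^{T}x^{(m)}\ - \end{bmatrix}$

In the last equality we used the fact that $a^Tb=b^Ta$ when $a$ and $b$ are both vectors. This lets us compute the products $\theta^Tx^{(i)}$ for all samples $i$ in one line of code.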
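
To make the matrix-product identity concrete, the following small sketch compares a per-sample loop with the single product $X\theta$. It uses plain NumPy arrays and made-up random data (X_demo, theta_demo are illustrative names), not the exercise data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 6, 3
# Design matrix with a leading column of ones for the intercept.
X_demo = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
theta_demo = rng.normal(size=n + 1)

# One sample at a time: h_theta(x^(i)) = g(theta^T x^(i)).
h_loop = np.array([sigmoid(theta_demo @ X_demo[i]) for i in range(m)])

# All samples at once: the i-th entry of X @ theta is theta^T x^(i).
h_vec = sigmoid(X_demo @ theta_demo)

print(np.allclose(h_loop, h_vec))  # True
```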

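### Fill in your code here ###
def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^{-z}), applied element-wise
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # with np.matrix, * is a matrix product; theta becomes a (1, n+1) row
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    # X * theta.T computes theta^T x^(i) for every sample at once
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)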
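
One practical caveat: when $X\theta$ has very large entries, the sigmoid saturates at exactly 0 or 1 and the logarithm in the cost returns -inf. The sketch below is one possible workaround, not part of the exercise. It relies on the identities $\log g(z) = -\log(1+e^{-z})$ and $\log(1-g(z)) = -\log(1+e^{z})$, evaluated with np.logaddexp. The function name cost_stable and the demo values are assumptions made for illustration.

```python
import numpy as np

def cost_stable(theta, X, y):
    """Unregularized logistic cost computed directly from z = X @ theta."""
    theta = np.asarray(theta, dtype=float).ravel()
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    z = X @ theta
    # -log(g(z))     = logaddexp(0, -z)
    # -log(1 - g(z)) = logaddexp(0,  z)
    return np.mean(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))

# Deliberately extreme inputs: both samples are confidently misclassified.
X_extreme = np.array([[1.0, 1000.0], [1.0, -1000.0]])
y_extreme = np.array([0.0, 1.0])
theta_demo = np.array([0.0, 1.0])

value = cost_stable(theta_demo, X_extreme, y_extreme)
print(np.isfinite(value))  # True
```

For moderate inputs this should agree with the cost function above up to floating-point rounding.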

1.3.2 Gradient Vectorization

We know that this cost function should be minimized using gradient descent. To recap, the gradient of the logistic regression cost function is a vector whose $j$-th element is defined as
$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum\limits_{i=1}^{m}{[{ {h}_{\theta }}\left( { {x}^{(i)}} \right)-{ {y}^{(i)}}]x_{_{j}}^{(i)}}$
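To vectorize this operation, we write out the partial derivatives for all $\theta_j$:

$\begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \frac{\partial J}{\partial \theta_1} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix}=\frac{1}{m}\begin{bmatrix} \sum_{i=1}^{m}\left( h_{\theta}(x^{(i)})-y^{(i)} \right)x_{0}^{(i)} \\ \sum_{i=1}^{m}\left( h_{\theta}(x^{(i)})-y^{(i)} \right)x_{1}^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left( h_{\theta}(x^{(i)})-y^{(i)} \right)x_{n}^{(i)} \end{bmatrix}=\frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)})-y^{(i)} \right)x^{(i)}=\frac{1}{m}X^{T}\left( h_{\theta}(x)-y \right)$
where:
$h_{\theta}(x)-y=\begin{bmatrix} h_{\theta}(x^{(1)})-y^{(1)} \\ h_{\theta}(x^{(2)})-y^{(2)} \\ \vdots \\ h_{\theta}(x^{(m)})-y^{(m)} \end{bmatrix}$

Note that $x^{(i)}$ is a vector, while $(h_\theta(x^{(i)})-y^{(i)})$ is a scalar (a single number).
To understand the last step of the derivation, let $\beta_i = (h_{\theta}(x^{(i)})-y^{(i)})$ and observe that:
$\sum_{i}\beta_i x^{(i)}=\begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix}=X^{T}\beta$

With the operation vectorized, the partial derivatives can be computed without any loops. Next, write code that implements the vectorized gradient above.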

### Fill in your code here ###
def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    # (m, 1) column of residuals h_theta(x^(i)) - y^(i)
    error = sigmoid(X * theta.T) - y

    # (1/m) * X^T (h - y), transposed back to a (1, n+1) row
    grad = ((X.T * error) / len(X)).T

    return grad
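
As a check on the derivation, the sketch below computes the gradient twice: once with explicit loops over $j$ and $i$, and once as $\frac{1}{m}X^{T}(h_\theta(x)-y)$. It is written with plain NumPy arrays, and the names gradient_loop, gradient_vec and the random demo data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_loop(theta, X, y):
    """Element-by-element gradient, following the summation formula."""
    m, n = X.shape
    grad = np.zeros(n)
    for j in range(n):
        for i in range(m):
            grad[j] += (sigmoid(X[i] @ theta) - y[i]) * X[i, j]
    return grad / m

def gradient_vec(theta, X, y):
    """Vectorized gradient: (1/m) * X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((7, 1)), rng.normal(size=(7, 3))])
y_demo = rng.integers(0, 2, size=7).astype(float)
theta_demo = rng.normal(size=4)

print(np.allclose(gradient_loop(theta_demo, X_demo, y_demo),
                  gradient_vec(theta_demo, X_demo, y_demo)))  # True
```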

1.3.3 Vectorization of regularized logistic regression

In Exercise 2, we implemented the cost function and gradient of the regularized logistic regression algorithm. Its cost function is:

$J\left( \theta \right)=\frac{1}{m}\sum\limits_{i=1}^{m}{[-{ {y}^{(i)}}\log \left( { {h}_{\theta }}\left( { {x}^{(i)}} \right) \right)-\left( 1-{ {y}^{(i)}} \right)\log \left( 1-{ {h}_{\theta }}\left( { {x}^{(i)}} \right) \right)]}+\frac{\lambda }{2m}\sum\limits_{j=1}^{n}{\theta _{j}^{2}}$
Note that $\theta_0$ should not be regularized, since it is the bias term.
Correspondingly, the gradient descent update is:
$\begin{align} & \text{Repeat until convergence } \{ \\ & \quad {\theta }_{0}:={\theta }_{0}-\alpha \frac{1}{m}\sum\limits_{i=1}^{m}{[{h}_{\theta }\left( {x}^{(i)} \right)-{y}^{(i)}]x_{0}^{(i)}} \\ & \quad {\theta }_{j}:={\theta }_{j}-\alpha \left[ \frac{1}{m}\sum\limits_{i=1}^{m}{[{h}_{\theta }\left( {x}^{(i)} \right)-{y}^{(i)}]x_{j}^{(i)}}+\frac{\lambda }{m}{\theta }_{j} \right]\quad (j\ge 1) \\ & \} \end{align}$
Next, write a vectorized implementation of the cost function and gradient of the regularized logistic regression algorithm.

### Fill in your code here ###
# Note: despite its name, learningRate is the regularization strength lambda.
def costReg(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    # penalty on theta_1..theta_n only; theta_0 (the intercept) is excluded
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg

def gradientReg(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    error = sigmoid(X * theta.T) - y

    # (1/m) * X^T (h - y) plus the regularization term (lambda / m) * theta
    grad = ((X.T * error) / len(X)).T + ((learningRate / len(X)) * theta)

    # intercept gradient is not regularized
    grad[0, 0] = np.sum(np.multiply(error, X[:,0])) / len(X)

    # return a flat 1-D array, as scipy's minimize expects
    return np.array(grad).ravel()
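
A common way to gain confidence in a hand-written gradient is to compare it with a finite-difference approximation of the cost. The sketch below does this for the regularized case, using plain-array re-implementations (cost_reg, gradient_reg, numerical_gradient are illustrative names, and lam stands for $\lambda$). It is a sanity check under these assumptions, not part of the graded exercise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    m = X.shape[0]
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    # theta[0] (the intercept) is left out of the penalty
    return data_term + lam / (2 * m) * np.sum(theta[1:] ** 2)

def gradient_reg(theta, X, y, lam):
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]
    return grad

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((8, 1)), rng.normal(size=(8, 3))])
y_demo = rng.integers(0, 2, size=8).astype(float)
theta_demo = rng.normal(size=4) * 0.5
lam = 1.0

analytic = gradient_reg(theta_demo, X_demo, y_demo, lam)
numeric = numerical_gradient(lambda t: cost_reg(t, X_demo, y_demo, lam), theta_demo)
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```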

1.4 One-vs-All Classification

Now that we have defined the cost and gradient functions, we need to build a classifier. For handwritten digit recognition we have 10 possible classes (the digits 0-9), but logistic regression is a binary classifier.

In this exercise, your task is to implement one-vs-all classification: with $K$ distinct class labels we train $K$ classifiers, each deciding between "class $i$" and "not class $i$". We'll wrap the classifier training in a function that computes the final weights for each of the 10 classifiers and returns them as an array of shape $[k, n + 1]$, where $n$ is the number of features (called params in the code).

Note that:

Add a column of ones to $X$ so that $\theta_0$ acts as the intercept term.
Convert $y$ from class labels to binary labels for each classifier (class $i$ or not class $i$).
Use the minimize function from scipy.optimize to minimize the cost function of each classifier.
Store the optimal parameters found in the parameter array, and return that array with shape $[k, n + 1]$.

The most important part of writing vectorized code is making sure that all matrices are set up correctly and that their dimensions match.

### Fill in your code here ###
from scipy.optimize import minimize

def one_vs_all(X, y, num_labels, learning_rate):
    rows = X.shape[0]
    params = X.shape[1]

    # parameters of the k classifiers, shape (k, n + 1)
    all_theta = np.zeros((num_labels, params + 1))

    # insert a column of ones, used for the intercept term
    X = np.insert(X, 0, values=np.ones(rows), axis=1)

    # labels are 1-indexed; for each class, convert the labels to 0/1 indicators
    for i in range(1, num_labels + 1):
        theta = np.zeros(params + 1)
        y_i = np.array([1 if label == i else 0 for label in y])
        y_i = np.reshape(y_i, (rows, 1))

        # minimize the regularized cost function with scipy's minimize
        fmin = minimize(fun=costReg, x0=theta, args=(X, y_i, learning_rate), method='TNC', jac=gradientReg)
        all_theta[i-1,:] = fmin.x

    return all_theta
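
The label conversion inside the loop can also be written as a single array comparison. The sketch below, on a small made-up label vector y_demo, shows that the list comprehension and the vectorized comparison agree, and that broadcasting can build all indicator columns at once. The variable names are illustrative.

```python
import numpy as np

# Column vector of labels shaped like data['y'], with made-up values.
y_demo = np.array([[10], [1], [2], [10], [2]], dtype=np.uint8)
i = 10

# Loop version, as in one_vs_all.
y_loop = np.array([1 if label == i else 0 for label in y_demo]).reshape(-1, 1)

# Vectorized version: element-wise comparison, then cast booleans to 0/1.
y_vec = (y_demo == i).astype(int)
print(np.array_equal(y_loop, y_vec))  # True

# All classes at once: (5, 1) compared with (10,) broadcasts to (5, 10).
Y_onehot = (y_demo == np.arange(1, 11)).astype(int)
print(Y_onehot.shape)  # (5, 10)
```

Column i-1 of Y_onehot is then the binary target for the classifier of class i.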

Let's check the variables that need to be initialized, and the shape of the variables:

rows = data['X'].shape[0]
params = data['X'].shape[1]

all_theta = np.zeros((10, params + 1))

X = np.insert(data['X'], 0, values=np.ones(rows), axis=1)

theta = np.zeros(params + 1)

y_0 = np.array([1 if label == 0 else 0 for label in data['y']])
y_0 = np.reshape(y_0, (rows, 1))

X.shape, y_0.shape, theta.shape, all_theta.shape

((5000, 401), (5000, 1), (401,), (10, 401))

Here, theta is a 1-D array, so when it is converted to a matrix inside the gradient code it becomes a matrix of shape $(1, 401)$. We also need to check the class labels in $y$.

np.unique(data['y'])  # check which class labels are present

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=uint8)

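Note that the labels run from 1 to 10 rather than 0 to 9: in this dataset the digit 0 is stored as label 10. Next, to make sure the training function works correctly, run the following code and check that the output looks reasonable.

### Run and test your code ###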
all_theta = one_vs_all(data['X'], data['y'], 10, 1)
all_theta

array([[-2.38373823e+00,  0.00000000e+00,  0.00000000e+00, ...,
         1.30440684e-03, -7.49607957e-10,  0.00000000e+00],
       [-3.18277385e+00,  0.00000000e+00,  0.00000000e+00, ...,
         4.46416745e-03, -5.08967467e-04,  0.00000000e+00],
       [-4.79656036e+00,  0.00000000e+00,  0.00000000e+00, ...,
        -2.87471064e-05, -2.47976297e-07,  0.00000000e+00],
       ...,
       [-7.98398219e+00,  0.00000000e+00,  0.00000000e+00, ...,
        -8.95642491e-05,  7.22603652e-06,  0.00000000e+00],
       [-4.57124969e+00,  0.00000000e+00,  0.00000000e+00, ...,
        -1.33504169e-03,  9.98035730e-05,  0.00000000e+00],
       [-5.40535662e+00,  0.00000000e+00,  0.00000000e+00, ...,
        -1.16457336e-04,  7.86968213e-06,  0.00000000e+00]])

1.5 Prediction using classifiers

We are now ready for the final step: using the trained classifiers to predict a label for each image.

For this step, we compute, for every training sample, the probability of each class (using vectorized code), and assign the sample the class with the highest probability.

### Fill in your code here ###
def predict_all(X, all_theta):
    rows = X.shape[0]

    # as before, insert a column of ones so the matrix shapes match
    X = np.insert(X, 0, values=np.ones(rows), axis=1)

    # convert to matrices
    X = np.matrix(X)
    all_theta = np.matrix(all_theta)

    # probability of each class for every training sample
    h = sigmoid(X * all_theta.T)

    # index of the highest-probability class in each row
    h_argmax = np.argmax(h, axis=1)

    # arrays are zero-indexed, so add 1 to get the true label
    h_argmax = h_argmax + 1

    return h_argmax
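
To see the argmax step on something small enough to verify by hand, here is a toy sketch with 2 features and 3 classes. The weights in all_theta_demo are invented for illustration, and predict_labels is a plain-array variant that returns a flat 1-D array rather than a matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(X, all_theta):
    """Return the 1-indexed class with the highest probability for each row."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # add the intercept column
    probs = sigmoid(X1 @ all_theta.T)              # shape (m, k)
    return np.argmax(probs, axis=1) + 1

# Rows are classifiers; columns are [intercept, feature 1, feature 2].
all_theta_demo = np.array([[0.0,  5.0,  0.0],    # class 1 responds to feature 1
                           [0.0,  0.0,  5.0],    # class 2 responds to feature 2
                           [2.0, -5.0, -5.0]])   # class 3 responds when both are small
X_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])

print(predict_labels(X_demo, all_theta_demo))  # [1 2 3]
```

Because the sigmoid is monotonic, taking the argmax of the raw scores X1 @ all_theta.T would give the same labels.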

Now we can use the predict_all function to generate class predictions for each instance and see how our classifier works.

### Run and test your code ###
y_pred = predict_all(data['X'], all_theta)
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, data['y'])]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print('accuracy = {0}%'.format(accuracy * 100))

accuracy = 94.46%
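
The accuracy can also be computed without a Python loop. The sketch below uses small made-up arrays (y_true, y_pred_demo are illustrative names). Both sides are flattened first, because comparing an (n,) array with an (n, 1) array would broadcast to an (n, n) matrix instead of comparing element by element.

```python
import numpy as np

y_true = np.array([[10], [1], [2], [3]], dtype=np.uint8)  # column vector, like data['y']
y_pred_demo = np.array([10, 1, 2, 4])                      # flat predictions

# Flatten both sides so the comparison is element-wise.
accuracy = np.mean(y_pred_demo.ravel() == y_true.ravel())
print('accuracy = {0}%'.format(accuracy * 100))  # accuracy = 75.0%
```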

Summary

Multi-class classification is an important task in machine learning: a model must assign each sample to one of several categories. In practice, many algorithms besides one-vs-all logistic regression can be used, such as decision trees, neural networks, and support vector machines. Data preprocessing, feature selection, and model evaluation also deserve attention, since they affect the accuracy and reliability of the result. Finally, applying multi-class classification well means continuing to learn and explore new algorithms and techniques as data and requirements change.
