Exercise 3: Multiple Classification Problems
Introduction
In this exercise, we will use logistic regression to recognize handwritten digits (0 to 9). We will extend the logistic regression implementation from Exercise 2 and apply it to one-vs-all classification.
Before starting the exercise, you need to download the following data file:
- ex3data1.mat - training set of handwritten digits
The exercise consists of the following mandatory assignments:
- Implement vectorized logistic regression (40 points)
- Train a one-vs-all multi-class classifier (40 points)
- Predict using the multi-class classifier (20 points)
1 Multi-class Classification
In this part of the exercise, you will extend the logistic regression algorithm you implemented earlier and apply it to a multi-class classification problem.
1.1 Dataset
The file ex3data1.mat contains a training set of 5000 handwritten digits. Each sample is a 20 x 20 pixel grayscale image, and each pixel is represented by a floating-point number indicating the grayscale intensity at that position.
Unrolling the 20 x 20 pixel grid into a 400-dimensional vector turns each training sample into one row of the data matrix X. The file therefore gives us a 5000 x 400 matrix in which each row is one handwritten digit image.
The second part of the training set is a 5000-dimensional vector y containing the training set labels.
1.2 Data Visualization
Images are represented in matrix X as 400-dimensional vectors (of which there are 5,000). The 400-dimensional "features" are the grayscale intensities of each pixel in the original 20 x 20 image. The class labels are in the vector y as the numeric classes representing the digits in the image.
Next, we need to load the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
data = loadmat('/home/jovyan/work/ex3data1.mat')
data
{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',
'__version__': '1.0',
'__globals__': [],
'X': array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]),
'y': array([[10],
[10],
[10],
...,
[ 9],
[ 9],
[ 9]], dtype=uint8)}
We then check the shapes of the data matrix X and the label vector y using the shape attribute:
data['X'].shape, data['y'].shape
((5000, 400), (5000, 1))
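The section loads the data but does not show how a row of X maps back to a picture. A minimal sketch of that step follows. The helper name `row_to_image` is hypothetical, and the transpose rests on the assumption that the pixels were unrolled column by column (as is typical for MATLAB-exported data); if the digits come out rotated, drop the `.T`.

```python
import numpy as np

def row_to_image(row, side=20):
    """Reshape a flattened (side*side,) pixel vector into a (side, side) image.

    Assumption: pixels were unrolled in column-major order,
    so we transpose after reshaping.
    """
    row = np.asarray(row, dtype=float)
    return row.reshape(side, side).T

# Stand-in for one row of data['X'] (hypothetical values, not real pixels)
fake_row = np.arange(400, dtype=float)
image = row_to_image(fake_row)
print(image.shape)
```

With the real data, something like `plt.imshow(row_to_image(data['X'][0]), cmap='gray')` should display the first training sample.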
1.3 Vectorization of Logistic Regression
In this part of the exercise, you need to modify the logistic regression implementation so that it is fully vectorized (i.e. contains no for loops). Vectorized code is not only more concise, it can also take advantage of optimized linear algebra routines and is often much faster than iterative code. If the cost function from Exercise 2 is already fully vectorized, the same implementation can be reused here.
1.3.1 Vectorization of cost function
You need to write code that vectorizes the cost function. Recall that the cost function is:
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right) - \left(1-y^{(i)}\right)\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right]$$
To compute each element in the sum, we need to compute $h_\theta(x^{(i)})$ for every sample $i$, where
$$h_\theta(x) = g\left(\theta^T x\right)$$
And the sigmoid function is:
$$g(z) = \frac{1}{1+e^{-z}}$$
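It turns out that we can compute this quickly for all examples at once using matrix multiplication. We define $X$ and $\theta$ as:
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$
Then, performing the matrix multiplication $X\theta$ gives:
$$X\theta = \begin{bmatrix} (x^{(1)})^T\theta \\ (x^{(2)})^T\theta \\ \vdots \\ (x^{(m)})^T\theta \end{bmatrix} = \begin{bmatrix} \theta^T x^{(1)} \\ \theta^T x^{(2)} \\ \vdots \\ \theta^T x^{(m)} \end{bmatrix}$$
In the last equality we use the fact that $a^Tb = b^Ta$ when $a$ and $b$ are both vectors. This means $\theta^T x^{(i)}$ can be computed for all examples in one line of code.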
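To make the matrix-product argument concrete, here is a small sketch on synthetic data (the shapes and random values are illustrative assumptions, not the exercise data). It compares a per-sample loop against the single product $X\theta$ and checks that both give the same hypothesis values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 samples, 3 features (synthetic)
theta = rng.normal(size=3)       # parameter vector

# Loop version: one inner product theta^T x^(i) per sample
h_loop = np.array([sigmoid(theta @ X[i]) for i in range(X.shape[0])])

# Vectorized version: a single matrix-vector product X theta
h_vec = sigmoid(X @ theta)

print(np.allclose(h_loop, h_vec))
```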
### Fill in your code here ###
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def cost(theta, X, y):
    # np.matrix makes * a matrix product; theta becomes a (1, n) row vector
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)
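One quick sanity check for a cost implementation: at $\theta = 0$ every hypothesis value is 0.5, so the cost must equal $\log 2$ regardless of the data. The sketch below verifies this on synthetic data. The function name `cost_ndarray` is hypothetical, and it uses plain ndarrays rather than `np.matrix`, so it is an alternative formulation and not the code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_ndarray(theta, X, y):
    """Unregularized logistic cost using plain ndarrays (no np.matrix)."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))          # synthetic features
y = rng.integers(0, 2, size=8)       # synthetic 0/1 labels

j0 = cost_ndarray(np.zeros(4), X, y)
print(j0, np.log(2))
```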
1.3.2 Gradient Vectorization
We know that this cost function can be minimized using gradient descent. To recap, the gradient of the logistic regression cost function is a vector whose $j$-th element is defined as:
$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left[h_\theta\left(x^{(i)}\right) - y^{(i)}\right]x_j^{(i)}$$
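To vectorize this operation, we write out the partial derivatives for all $\theta_j$:
$$\begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \frac{\partial J}{\partial \theta_1} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} = \frac{1}{m}\begin{bmatrix} \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)} \\ \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_n^{(i)} \end{bmatrix} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)} = \frac{1}{m}X^T\left(h_\theta(x)-y\right)$$
where:
$$h_\theta(x) - y = \begin{bmatrix} h_\theta(x^{(1)})-y^{(1)} \\ h_\theta(x^{(2)})-y^{(2)} \\ \vdots \\ h_\theta(x^{(m)})-y^{(m)} \end{bmatrix}$$
Note that $x^{(i)}$ is a vector, while $(h_\theta(x^{(i)})-y^{(i)})$ is a scalar (a single number).
Setting $\beta_i = (h_\theta(x^{(i)})-y^{(i)})$, the last step can be understood as follows:
$$\sum_{i=1}^{m}\beta_i x^{(i)} = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix} = X^T\beta$$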
With the operation vectorized, the partial derivatives can be computed without any loops. Next, you need to write code that implements the vectorized gradient described above.
### Fill in your code here ###
def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    # error is the (m, 1) column vector h_theta(x) - y
    error = sigmoid(X * theta.T) - y
    # (1/m) X^T (h - y), transposed to a (1, n) row
    grad = ((X.T * error) / len(X)).T
    return grad
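A common way to gain confidence in a vectorized gradient is to compare it against central finite differences of the cost. The following is a sketch on synthetic data. The names `cost_ndarray`, `gradient_ndarray` and `numerical_gradient` are hypothetical, and the functions use plain ndarrays instead of `np.matrix`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_ndarray(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient_ndarray(theta, X, y):
    """Vectorized gradient (1/m) X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))                       # synthetic features
y = rng.integers(0, 2, size=20).astype(float)      # synthetic 0/1 labels
theta = rng.normal(size=3)

analytic = gradient_ndarray(theta, X, y)
numeric = numerical_gradient(lambda t: cost_ndarray(t, X, y), theta)
print(np.max(np.abs(analytic - numeric)))
```

The printed maximum difference should be very small; a large value would suggest a shape or sign mistake in the analytic gradient.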
1.3.3 Vectorization of regularized logistic regression
In Exercise 2, we implemented the cost function and gradient computation for regularized logistic regression. The cost function is:
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right) - \left(1-y^{(i)}\right)\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Note that $\theta_0$ should not be regularized, since it is the bias (intercept) term.
Correspondingly, the gradient descent update is:
$$\begin{aligned} & \text{Repeat until convergence } \{ \\ & \quad \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]x_0^{(i)} \\ & \quad \theta_j := \theta_j - \alpha\left(\frac{1}{m}\sum_{i=1}^{m}\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]x_j^{(i)} + \frac{\lambda}{m}\theta_j\right) \qquad (j = 1, \dots, n) \\ & \} \end{aligned}$$
Next, you need to write a vectorized implementation of the cost function and gradient of regularized logistic regression.
### Fill in your code here ###
def costReg(theta, X, y, learningRate):
    # note: despite its name, learningRate plays the role of the regularization strength lambda
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    # the penalty skips the intercept theta_0 (column 0)
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg
def gradientReg(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    error = sigmoid(X * theta.T) - y
    grad = ((X.T * error) / len(X)).T + ((learningRate / len(X)) * theta)
    # intercept gradient is not regularized
    grad[0, 0] = np.sum(np.multiply(error, X[:,0])) / len(X)
    return np.array(grad).ravel()
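The effect of excluding $\theta_0$ from the penalty can be checked directly: changing $\lambda$ should shift every gradient component by $\frac{\lambda}{m}\theta_j$ except the first. The sketch below illustrates this on synthetic data. The function name `gradient_reg_ndarray` and the parameter name `lam` are hypothetical, and the code assumes column 0 of X is the all-ones intercept column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_reg_ndarray(theta, X, y, lam):
    """Regularized gradient; column 0 of X is assumed to be the intercept column."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    penalty = (lam / m) * theta    # new array, theta itself is not modified
    penalty[0] = 0.0               # theta_0 is not regularized
    return grad + penalty

rng = np.random.default_rng(3)
X = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 2))])   # m = 10 samples
y = rng.integers(0, 2, size=10).astype(float)
theta = np.array([0.5, -1.0, 2.0])

g0 = gradient_reg_ndarray(theta, X, y, lam=0.0)
g1 = gradient_reg_ndarray(theta, X, y, lam=1.0)
print(g1 - g0)
```

With $m = 10$ and $\lambda = 1$, the difference should be approximately $(0, -0.1, 0.2)$, i.e. zero for the intercept and $\theta_j / 10$ elsewhere.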
1.4 Multi-class Classification: Classifier
Now that we have defined the cost and gradient functions, we need to build a classifier. For handwritten digit recognition we have 10 possible classes (0-9), but logistic regression is a binary classifier.
In this exercise, your task is to implement one-vs-all classification, in which $k$ distinct class labels give rise to $k$ classifiers, each deciding between "class $i$" and "not class $i$". We wrap the classifier training in a function that computes the final weights for each of the 10 classifiers and returns them as an array of shape $[k, n+1]$, where $n$ is the number of features.
Note that:
- A column of ones needs to be added to X so that $\theta_0$ can act as the intercept term.
- Convert $y$ from class labels to binary labels for each classifier (either class i or not class i).
- Use the minimize function from scipy.optimize to minimize the cost function of each classifier.
- Store the optimal parameters found in the parameter array, and return that array with shape $[k, n+1]$.
The most important part of implementing vectorized code is making sure that all matrices are set up correctly and that their dimensions match.
### Fill in your code here ###
from scipy.optimize import minimize
def one_vs_all(X, y, num_labels, learning_rate):
    rows = X.shape[0]
    params = X.shape[1]
    # parameters of the k classifiers, shape (k, n+1)
    all_theta = np.zeros((num_labels, params + 1))
    # insert a column of ones for the intercept term
    X = np.insert(X, 0, values=np.ones(rows), axis=1)
    # labels are 1-indexed (1..num_labels), so the loop starts at 1
    for i in range(1, num_labels + 1):
        theta = np.zeros(params + 1)
        # convert the class labels to 0/1 indicators for class i
        y_i = np.array([1 if label == i else 0 for label in y])
        y_i = np.reshape(y_i, (rows, 1))
        # minimize the cost function with scipy's minimize
        fmin = minimize(fun=costReg, x0=theta, args=(X, y_i, learning_rate), method='TNC', jac=gradientReg)
        all_theta[i-1,:] = fmin.x
    return all_theta
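To see the one-vs-all procedure end to end without the digit data, here is a self-contained sketch on three synthetic 2-D clusters. It deliberately swaps the optimizer: plain batch gradient descent is used instead of scipy's `minimize`, so it illustrates the idea rather than reproducing the function above. The function name `train_one_vs_all_gd`, the step size `lr`, the iteration count and the cluster centres are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all_gd(X, y, num_labels, lam=0.0, lr=0.1, iters=2000):
    """Train one regularized logistic classifier per label with gradient descent.

    Labels are assumed to run from 1 to num_labels, as in the exercise data.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])       # add intercept column
    all_theta = np.zeros((num_labels, n + 1))
    for i in range(1, num_labels + 1):
        y_i = (y == i).astype(float)           # 1 for class i, 0 otherwise
        theta = np.zeros(n + 1)
        for _ in range(iters):
            grad = Xb.T @ (sigmoid(Xb @ theta) - y_i) / m
            grad[1:] += (lam / m) * theta[1:]  # skip the intercept
            theta -= lr * grad
        all_theta[i - 1] = theta
    return all_theta

# Three synthetic, well-separated 2-D clusters labelled 1, 2, 3
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
y = np.repeat([1, 2, 3], 30)

all_theta = train_one_vs_all_gd(X, y, num_labels=3)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
pred = np.argmax(sigmoid(Xb @ all_theta.T), axis=1) + 1
print(all_theta.shape, np.mean(pred == y))
```

The returned array has one row of weights per class, matching the $[k, n+1]$ layout described above. On clusters this well separated, the training accuracy should be close to 1.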
Let's check the variables that need to be initialized, and the shape of the variables:
rows = data['X'].shape[0]
params = data['X'].shape[1]
all_theta = np.zeros((10, params + 1))
X = np.insert(data['X'], 0, values=np.ones(rows), axis=1)
theta = np.zeros(params + 1)
y_0 = np.array([1 if label == 0 else 0 for label in data['y']])
y_0 = np.reshape(y_0, (rows, 1))
X.shape, y_0.shape, theta.shape, all_theta.shape
((5000, 401), (5000, 1), (401,), (10, 401))
Here theta is a 1-D array, so when it is converted to a matrix inside the gradient code it becomes a matrix of shape $(1, 401)$. We should also check the class labels in y.
np.unique(data['y'])  # check which class labels are present
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint8)
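The labels run from 1 to 10, while the digits run from 0 to 9. Assuming, as in the original course data, that label 10 stands for the digit 0 and every other label equals its digit, the mapping back to digits is a single modulo operation. The label values in this sketch are made up for illustration.

```python
import numpy as np

# Hypothetical label vector in the same 1..10 encoding as data['y']
labels = np.array([[10], [1], [2], [9], [10]], dtype=np.uint8)

# Assumption: label 10 stands for the digit 0; other labels equal their digit
digits = labels % 10
print(digits.ravel())
```

This also explains why the training loop iterates over 1..num_labels and why the prediction step later adds 1 to the argmax index.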
Next, to make sure the training function works correctly, run the following code to see if you get reasonable output.
### Run and test your code ###
all_theta = one_vs_all(data['X'], data['y'], 10, 1)
all_theta
array([[-2.38373823e+00, 0.00000000e+00, 0.00000000e+00, ...,
1.30440684e-03, -7.49607957e-10, 0.00000000e+00],
[-3.18277385e+00, 0.00000000e+00, 0.00000000e+00, ...,
4.46416745e-03, -5.08967467e-04, 0.00000000e+00],
[-4.79656036e+00, 0.00000000e+00, 0.00000000e+00, ...,
-2.87471064e-05, -2.47976297e-07, 0.00000000e+00],
...,
[-7.98398219e+00, 0.00000000e+00, 0.00000000e+00, ...,
-8.95642491e-05, 7.22603652e-06, 0.00000000e+00],
[-4.57124969e+00, 0.00000000e+00, 0.00000000e+00, ...,
-1.33504169e-03, 9.98035730e-05, 0.00000000e+00],
[-5.40535662e+00, 0.00000000e+00, 0.00000000e+00, ...,
-1.16457336e-04, 7.86968213e-06, 0.00000000e+00]])
1.5 Prediction using classifiers
We are now ready for the final step, in which you use the trained classifiers to predict a label for each image.
For this step, we compute the probability of each class for each training sample (using vectorized code) and assign each sample the class with the highest probability.
### Fill in your code here ###
def predict_all(X, all_theta):
    rows = X.shape[0]
    # as before, insert a column of ones so the shapes match
    X = np.insert(X, 0, values=np.ones(rows), axis=1)
    # convert to matrices
    X = np.matrix(X)
    all_theta = np.matrix(all_theta)
    # probability of each class for every training sample
    h = sigmoid(X * all_theta.T)
    # index of the highest probability in each row
    h_argmax = np.argmax(h, axis=1)
    # the array is zero-indexed, so add 1 to get the true label
    h_argmax = h_argmax + 1
    return h_argmax
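The argmax step can be traced by hand on a tiny example. The weights and samples below are hypothetical values picked so that each sample clearly favours one classifier; the sketch uses plain ndarrays rather than `np.matrix`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for 3 classifiers on 2 features (intercept in column 0)
all_theta = np.array([[0.0,  2.0,  0.0],
                      [0.0,  0.0,  2.0],
                      [1.0, -2.0, -2.0]])
X = np.array([[ 3.0,  0.0],
              [ 0.0,  3.0],
              [-1.0, -1.0]])

Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column
scores = Xb @ all_theta.T                       # shape (samples, classifiers)
probs = sigmoid(scores)
labels = np.argmax(probs, axis=1) + 1           # shift 0-based index to labels 1..k
print(labels)   # prints [1 2 3]
```

Because the sigmoid is monotonically increasing, taking the argmax over the raw scores gives the same labels as taking it over the probabilities.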
Now we can use the predict_all function to generate class predictions for each instance and see how our classifier works.
### Run and test your code ###
y_pred = predict_all(data['X'], all_theta)
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, data['y'])]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print ('accuracy = {0}%'.format(accuracy * 100))
accuracy = 94.46%
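The accuracy can also be computed without a Python loop, and a per-class breakdown often shows which digits are harder. The sketch below uses made-up predictions and labels shaped like the output of predict_all and data['y']; if y_pred is an np.matrix, wrapping it in np.asarray first is a reasonable precaution.

```python
import numpy as np

# Hypothetical predictions and labels, both of shape (m, 1)
y_pred = np.array([[1], [2], [2], [3], [3], [3]])
y_true = np.array([[1], [2], [1], [3], [3], [2]])

# Overall accuracy: mean of the element-wise comparison
accuracy = np.mean(y_pred == y_true)

# Per-class accuracy: fraction of each true class that was predicted correctly
per_class = {int(c): float(np.mean(y_pred[y_true == c] == c))
             for c in np.unique(y_true)}
print(accuracy, per_class)
```

Here 4 of the 6 predictions match, and classes 1 and 2 are each right half of the time while class 3 is always right.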
Summary
Multi-class classification is an important task in machine learning: the goal is to assign each sample to one of several categories. In practice, many different algorithms can be used for it, such as decision trees, neural networks, and support vector machines. Data preprocessing, feature selection, and model evaluation also deserve attention, since they affect the accuracy and reliability of the result. Finally, applying multi-class methods well requires continuing to learn and explore new algorithms and techniques as data and requirements change.