Machine learning---LDA code

1. Get projection coordinates

import numpy as np


def GetProjectivePoint_2D(point, line):
    # point: [a, b] -- coordinates of the point
    # line:  [k, t] -- slope k and intercept t of the line
    # returns the coordinates of the projection of the point onto the line
    a = point[0]
    b = point[1]
    k = line[0]
    t = line[1]

    if   k == 0:      return [a, t]   # horizontal line y = t
    elif k == np.inf: return [0, b]   # vertical line (x = 0)
    x = (a + k*b - k*t) / (k*k + 1)   # foot of the perpendicular
    y = k*x + t
    return [x, y]

This function computes the coordinates of the projection of a point onto a straight line. The parameter point gives the coordinates of the point in the form [a, b]; the parameter line gives the parameters of the line in the form [k, t], where k is the slope and t is the intercept. The return value is the coordinates of the projection point, in the form [a', b'].

The function is implemented as follows:

Read the abscissa a and ordinate b from the point parameter, and the slope k and intercept t from the line parameter.

If the slope k is 0, the line is horizontal: the ordinate of the projection point is the intercept t, the abscissa is unchanged, and the function returns [a, t].

If the slope k is positive infinity, the line is treated as vertical: the abscissa of the projection point is 0, the ordinate is unchanged, and the function returns [0, b].

Otherwise, compute the abscissa of the projection point as x = (a + k*b - k*t) / (k*k + 1), obtain the ordinate from the line equation y = k*x + t, and return the projection point [x, y].

Note: this function uses np.inf from the NumPy library to represent positive infinity.
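
As a quick sanity check, here is a minimal usage sketch; the input values are chosen purely for illustration. Projecting the point (2, 0) onto the line y = x should give (1, 1):

# minimal usage sketch; input values are illustrative only
p = GetProjectivePoint_2D([2, 0], [1, 0])   # project (2, 0) onto the line y = x (slope 1, intercept 0)
print(p)  # [1.0, 1.0]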

2. Draw a scatter plot

import matplotlib.pyplot as plt

# dataset is assumed to have been loaded earlier as a NumPy array (the
# watermelon_3a data): columns 1-2 hold the features (density, ratio_sugar)
# and column 3 holds the class label
X = dataset[:,1:3]
y = dataset[:,3]
print(X, y)
# draw scatter diagram to show the raw data
f1 = plt.figure(1)       
plt.title('watermelon_3a')  
plt.xlabel('density')  
plt.ylabel('ratio_sugar')  
print(X[y == 0,0], X[y == 0,1])
plt.scatter(X[y == 0,0], X[y == 0,1], marker = 'o', color = 'k', s=100, label = 'bad')
plt.scatter(X[y == 1,0], X[y == 1,1], marker = 'o', color = 'g', s=100, label = 'good')
plt.legend(loc = 'upper right')  
plt.show()

In Python, Boolean indexing can be applied directly to NumPy array objects.

Evaluating y == 0 produces a Boolean array in which each element records whether the value at the corresponding position of y equals 0: if an element of y is 1, the corresponding Boolean element is False; if it is 0, the corresponding element is True. This Boolean array can then be used as an index into another array to select the elements that satisfy the condition. For a two-dimensional array, the first index selects rows and the second selects columns. Therefore, X[y == 0, 0] selects, from column 0 of X, the rows whose corresponding value in y equals 0.
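
A small self-contained illustration; the array values here are invented purely for demonstration:

labels = np.array([0, 1, 0])
data = np.array([[0.1, 0.2],
                 [0.3, 0.4],
                 [0.5, 0.6]])
print(labels == 0)           # [ True False  True]
print(data[labels == 0, 0])  # [0.1 0.5] -- column 0 of the rows where labels == 0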

3. LDA classification

from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics
import matplotlib.pyplot as plt

# split the data into train and test sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=0)

# model fitting
lda_model = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X_train, y_train)

# model validation
y_pred = lda_model.predict(X_test)

# summarize the fit of the model
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

# draw the classifier decision boundary
f2 = plt.figure(2) 
h = 0.001
# x0_min, x0_max = X[:, 0].min()-0.1, X[:, 0].max()+0.1
# x1_min, x1_max = X[:, 1].min()-0.1, X[:, 1].max()+0.1

x0, x1 = np.meshgrid(np.arange(-1, 1, h),
                     np.arange(-1, 1, h))

# x0, x1 = np.meshgrid(np.arange(x0_min, x0_max, h),
#                      np.arange(x1_min, x1_max, h))

z = lda_model.predict(np.c_[x0.ravel(), x1.ravel()]) 

# Put the result into a color plot
z = z.reshape(x0.shape)
plt.contourf(x0, x1, z)

# Plot also the training points
plt.title('watermelon_3a')  
plt.xlabel('density')  
plt.ylabel('ratio_sugar')  
plt.scatter(X[y == 0,0], X[y == 0,1], marker = 'o', color = 'k', s=100, label = 'bad')
plt.scatter(X[y == 1,0], X[y == 1,1], marker = 'o', color = 'g', s=100, label = 'good')
plt.show()

The train_test_split function in the model_selection module splits a data set into a training set and a test set and returns the split data.

Specifically, the parameters of train_test_split include:

X: the feature data set to be split

y: the target variable data set to be split

test_size: the size of the test set, either a floating-point number (a proportion) or an integer (a number of samples)

random_state: the random seed used to control the randomness of the split

The function returns four arrays:

X_train: feature data of the training set

X_test: feature data of the test set

y_train: target variable data of the training set

y_test: target variable data of the test set
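
A minimal sketch of the split behaviour (demo_X and demo_y are toy arrays made up for this example):

demo_X = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
demo_y = np.array([0, 0, 1, 1, 1])
a_train, a_test, b_train, b_test = model_selection.train_test_split(
    demo_X, demo_y, test_size=0.4, random_state=0)
print(a_train.shape, a_test.shape)  # (3, 2) (2, 2)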

Some common parameters of the LinearDiscriminantAnalysis class and their meanings:

solver: the solver used to estimate the parameters of the LDA model. The options are 'svd' (singular value decomposition), which works for data of any dimensionality; 'lsqr' (least squares), which is suited to high-dimensional data; and 'eigen' (eigenvalue decomposition).

shrinkage: controls the amount of shrinkage applied by the LDA model. The options are None (no shrinkage, i.e. the classic LDA method), 'auto' (the shrinkage amount is determined automatically from the data), or a floating-point number (a manually specified shrinkage value; the larger the value, the stronger the shrinkage).

n_components: the number of feature dimensions to keep after dimensionality reduction. The default is None, meaning no dimensionality reduction is performed.

priors: the prior probabilities of the classes. If priors are specified, the LDA model is computed using them.

store_covariance: whether to store the class covariance matrix. The default is False, meaning the covariance matrix is not stored. If set to True, the covariance matrix can be accessed through the covariance_ attribute.
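
Two hedged configuration sketches (which one works better depends on the data; note that shrinkage is only supported by the 'lsqr' and 'eigen' solvers):

# classic LDA with the default SVD solver, keeping one discriminant direction
lda_svd = LinearDiscriminantAnalysis(solver='svd', n_components=1)

# LDA with automatically chosen shrinkage; requires the 'lsqr' or 'eigen' solver
lda_shrunk = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')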

lda_model.predict(X_test) uses the trained LDA model lda_model to make predictions on the test set X_test and returns the predicted target values.

metrics.confusion_matrix(y_test, y_pred) computes the confusion matrix between the predictions and the true labels. Its parameters are y_test, the true labels of the test set (a one-dimensional array or list), and y_pred, the model's predictions on the test set (also a one-dimensional array or list).

metrics.classification_report(y_test, y_pred) generates a classification report for the model, containing several metrics that evaluate its performance on the test set: precision (the proportion of samples predicted as positive that are actually positive), recall (the proportion of actually positive samples that are correctly predicted as positive), the F1-score (the harmonic mean of precision and recall, combining both into a single number), and support (the number of test samples in each class).
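
A tiny worked example (the labels are invented for illustration): with true labels [0, 0, 1, 1] and predictions [0, 1, 1, 1], class 1 has 2 true positives, 1 false positive and 0 false negatives, so its precision is 2/3, its recall is 1.0, and its F1-score is 2 * (2/3 * 1.0) / (2/3 + 1.0) = 0.8.

demo_true = [0, 0, 1, 1]
demo_pred = [0, 1, 1, 1]
print(metrics.confusion_matrix(demo_true, demo_pred))
# [[1 1]
#  [0 2]]
print(metrics.classification_report(demo_true, demo_pred))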

np.arange(-1, 1, h) generates a one-dimensional array from a start value, an end value and a step size; here the start value is -1, the end value is 1 and the step size is h. The np.meshgrid function turns these two one-dimensional arrays into two two-dimensional arrays x0 and x1, where every row of x0 is np.arange(-1, 1, h) and every column of x1 is np.arange(-1, 1, h).

ravel() flattens an array, producing a one-dimensional array that contains all elements of the original arranged from left to right and top to bottom. np.c_[x0.ravel(), x1.ravel()] flattens the two two-dimensional arrays x0 and x1 and concatenates them column-wise into a new two-dimensional array whose rows are the coordinates of all points in the two-dimensional grid.
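
A tiny illustration of the grid construction, using a coarse step of 0.5 just to keep the arrays small:

g0, g1 = np.meshgrid(np.arange(-1, 1, 0.5), np.arange(-1, 1, 0.5))
print(g0.shape, g1.shape)                  # (4, 4) (4, 4)
grid_points = np.c_[g0.ravel(), g1.ravel()]
print(grid_points.shape)                   # (16, 2) -- one row per grid point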

The purpose of predicting on these points is to visualize the classifier's decision boundary. By generating a set of coordinate points over a two-dimensional grid and predicting each of them, the predicted class of every point is obtained, and these predictions can then be used to draw the decision boundary. The decision boundary is the boundary along which the classifier separates the different classes in feature space; plotting it gives an intuitive picture of how the classifier labels the different regions.

After the points of the grid have been predicted, the predicted class z of each point is reshaped and passed to plt.contourf(), which plots the predictions as a filled colour map in which different colours represent different classes.

4. Calculate parameter w

u = []  
for i in range(2): # two class
    u.append(np.mean(X[y==i], axis=0))  # column mean

# 2nd: compute the within-class scatter matrix, see Eq. (3.33) in the book
m,n = np.shape(X)
Sw = np.zeros((n,n))
for i in range(m):
    x_tmp = X[i].reshape(n,1)  # row -> column vector
    if y[i] == 0: u_tmp = u[0].reshape(n,1)
    if y[i] == 1: u_tmp = u[1].reshape(n,1)
    Sw += np.dot( x_tmp - u_tmp, (x_tmp - u_tmp).T )

Sw = np.mat(Sw)
# np.linalg.svd returns U, the singular values, and V transposed (Vh)
U, sigma, V = np.linalg.svd(Sw)

# inverse of Sw built from the SVD: Sw^-1 = V * diag(sigma)^-1 * U^T,
# which is numerically more stable than inverting Sw directly
Sw_inv = V.T * np.linalg.inv(np.diag(sigma)) * U.T
# 3rd: compute the parameter w, see Eq. (3.39) in the book
w = np.dot( Sw_inv, (u[0] - u[1]).reshape(n,1) )

print(w)

This code computes the parameter w of Linear Discriminant Analysis. It works as follows:

First, an empty list u is created to store the mean vector of each class. Then, by looping over the two classes, the mean vector of each class is computed and appended to u.

Next, the within-class scatter matrix is computed. An all-zero matrix Sw of shape (n, n) is initialized, where n is the dimension of the feature vectors. Looping over each sample, the sample vector is reshaped into a column vector and the mean vector of its class is selected; the difference between the sample vector and its class mean is multiplied by its own transpose, and the result is accumulated into Sw.

Sw is then converted to matrix type and a singular value decomposition (SVD) is performed on it, yielding the matrices U, sigma and V. The inverse matrix Sw_inv is computed as V.T * np.linalg.inv(np.diag(sigma)) * U.T, where np.linalg.inv(np.diag(sigma)) is the inverse of the diagonal matrix of singular values; computing the inverse through the SVD is numerically more stable than inverting Sw directly.

Finally, the parameter w is obtained by multiplying Sw_inv by the difference between the mean vector u[0] of class 0 and the mean vector u[1] of class 1.
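
For reference, the quantities being computed here are the standard two-class LDA expressions for the within-class scatter matrix and the projection direction (written in LaTeX; \mu_0 and \mu_1 are u[0] and u[1] in the code):

S_w = \sum_{x \in X_0} (x - \mu_0)(x - \mu_0)^{T} + \sum_{x \in X_1} (x - \mu_1)(x - \mu_1)^{T}

w = S_w^{-1} (\mu_0 - \mu_1)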

5. Draw two-dimensional scatter plots and LDA results

f3 = plt.figure(3)
plt.xlim( -0.2, 1 )
plt.ylim( -0.5, 0.7 )

p0_x0 = -X[:, 0].max()
p0_x1 = ( w[1,0] / w[0,0] ) * p0_x0
p1_x0 = X[:, 0].max()
p1_x1 = ( w[1,0] / w[0,0] ) * p1_x0

plt.title('watermelon_3a - LDA')  
plt.xlabel('density')  
plt.ylabel('ratio_sugar')  
plt.scatter(X[y == 0,0], X[y == 0,1], marker = 'o', color = 'k', s=10, label = 'bad')
plt.scatter(X[y == 1,0], X[y == 1,1], marker = 'o', color = 'g', s=10, label = 'good')
plt.legend(loc = 'upper right')  

plt.plot([p0_x0, p1_x0], [p0_x1, p1_x1])

# draw projective point on the line


m,n = np.shape(X)
for i in range(m):
    x_p = GetProjectivePoint_2D( [X[i,0], X[i,1]], [w[1,0] / w[0,0] , 0] ) 
    if y[i] == 0: 
        plt.plot(x_p[0], x_p[1], 'ko', markersize = 5)
    if y[i] == 1: 
        plt.plot(x_p[0], x_p[1], 'go', markersize = 5)   
    plt.plot([ x_p[0], X[i,0]], [x_p[1], X[i,1] ], 'c--', linewidth = 0.3)

plt.show()

Create a figure object named f3. Set the display range of the x-axis to -0.2 to 1 and of the y-axis to -0.5 to 0.7.

p0_x0 is the negative of the maximum value of the first feature (density), and p0_x1 is the value of the line at that abscissa, computed from the slope w[1,0] / w[0,0]. Similarly, p1_x0 is the maximum value of the first feature, and p1_x1 is the value of the line at that abscissa. In other words, p0_x0 and p1_x0 are the left and right x-coordinates of the segment to be drawn, and p0_x1 and p1_x1 are the corresponding y-values of the line through the origin with slope w[1,0] / w[0,0].

The purpose of this is to visualize the direction found by LDA in two dimensions. Computing these two endpoints allows a straight line to be drawn through the origin along the direction of w; this is the one-dimensional line onto which the samples are projected below, which makes it easy to see how well the projection separates the two classes.

Set the title of the figure to 'watermelon_3a - LDA', the x-axis label to 'density' and the y-axis label to 'ratio_sugar'. Plot the class-0 samples as black dots of size 10 with label 'bad', and the class-1 samples as green dots of size 10 with label 'good'. Display a legend in the upper right corner of the figure, and draw the straight line between the two endpoints computed above.

Get the shape of the data set X: m is the number of samples and n the number of features. Then iterate over each sample.

x_p = GetProjectivePoint_2D([X[i, 0], X[i, 1]], [w[1, 0] / w[0, 0], 0]) computes the projection of the sample point onto the line. The function takes two arguments. The first, [X[i, 0], X[i, 1]], is the coordinate of the sample point on the two-dimensional plane: X[i, 0] is its x-coordinate and X[i, 1] its y-coordinate. The second, [w[1, 0] / w[0, 0], 0], describes the line in the [slope, intercept] form expected by GetProjectivePoint_2D: the slope is w[1, 0] / w[0, 0] and the intercept is 0. The function projects the point onto this line and returns the coordinates of the projection point.

If the sample belongs to class 0, its projection point is drawn as a black dot of size 5; if it belongs to class 1, as a green dot of size 5. Finally, a cyan dashed line with line width 0.3 is drawn from the sample point to its projection point.
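
As a small check of the geometry (the values below are invented for illustration), the segment from a point to its projection is perpendicular to the line, which can be verified numerically:

slope = 2.0
pt = [3.0, 1.0]
proj = GetProjectivePoint_2D(pt, [slope, 0])        # proj == [1.0, 2.0]
residual = np.array(pt) - np.array(proj)
print(np.dot(residual, np.array([1.0, slope])))     # ~0.0: residual is orthogonal to the line direction (1, slope)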

6. LDA example

X = np.delete(X, 14, 0)
y = np.delete(y, 14, 0)

u = []  
for i in range(2): # two class
    u.append(np.mean(X[y==i], axis=0))  # column mean

# 2nd: compute the within-class scatter matrix, see Eq. (3.33) in the book
m,n = np.shape(X)
Sw = np.zeros((n,n))
for i in range(m):
    x_tmp = X[i].reshape(n,1)  # row -> column vector
    if y[i] == 0: u_tmp = u[0].reshape(n,1)
    if y[i] == 1: u_tmp = u[1].reshape(n,1)
    Sw += np.dot( x_tmp - u_tmp, (x_tmp - u_tmp).T )

Sw = np.mat(Sw)
# np.linalg.svd returns U, the singular values, and V transposed (Vh)
U, sigma, V = np.linalg.svd(Sw)

# inverse of Sw built from the SVD: Sw^-1 = V * diag(sigma)^-1 * U^T,
# which is numerically more stable than inverting Sw directly
Sw_inv = V.T * np.linalg.inv(np.diag(sigma)) * U.T
# 3rd: compute the parameter w, see Eq. (3.39) in the book
w = np.dot( Sw_inv, (u[0] - u[1]).reshape(n,1) )

print(w)

# 4-th draw the LDA line in scatter figure

# f2 = plt.figure(2)
f4 = plt.figure(4)
plt.xlim( -0.2, 1 )
plt.ylim( -0.5, 0.7 )

p0_x0 = -X[:, 0].max()
p0_x1 = ( w[1,0] / w[0,0] ) * p0_x0
p1_x0 = X[:, 0].max()
p1_x1 = ( w[1,0] / w[0,0] ) * p1_x0

plt.title('watermelon_3a - LDA')  
plt.xlabel('density')  
plt.ylabel('ratio_sugar')  
plt.scatter(X[y == 0,0], X[y == 0,1], marker = 'o', color = 'k', s=10, label = 'bad')
plt.scatter(X[y == 1,0], X[y == 1,1], marker = 'o', color = 'g', s=10, label = 'good')
plt.legend(loc = 'upper right')  

plt.plot([p0_x0, p1_x0], [p0_x1, p1_x1])

# draw projective point on the line


m,n = np.shape(X)
for i in range(m):
    x_p = GetProjectivePoint_2D( [X[i,0], X[i,1]], [w[1,0] / w[0,0] , 0] ) 
    if y[i] == 0: 
        plt.plot(x_p[0], x_p[1], 'ko', markersize = 5)
    if y[i] == 1: 
        plt.plot(x_p[0], x_p[1], 'go', markersize = 5)   
    plt.plot([ x_p[0], X[i,0]], [x_p[1], X[i,1] ], 'c--', linewidth = 0.3)

plt.show()

This code applies the Linear Discriminant Analysis (LDA) algorithm to the classification problem once more. The specific steps are as follows:

Remove the sample at row index 14 from the data set X and from the labels y, then redo the two-class analysis.

Compute the mean vectors u of the two classes: u[0] is the mean vector of the first class and u[1] that of the second class.

Compute the within-class scatter matrix Sw, which measures how dispersed the samples are within each class, and obtain its inverse via singular value decomposition (SVD).

Compute the parameter w, the discriminant direction vector, as the product of the inverse of Sw and the difference between u[0] and u[1].

Draw the scatter plot and the LDA line, whose slope is determined by the parameter w.

Draw the projection of each sample point onto the LDA line.

Finally, display the figure.

Origin: blog.csdn.net/weixin_43961909/article/details/132512157