1. Logistic regression model
import numpy as np

class LogisticRegression(object):

    def __init__(self, learning_rate=0.1, max_iter=100, seed=None):
        self.seed = seed
        self.lr = learning_rate
        self.max_iter = max_iter

    def fit(self, x, y):
        np.random.seed(self.seed)
        # Initialize the parameters w and b from a standard normal distribution.
        self.w = np.random.normal(loc=0.0, scale=1.0, size=x.shape[1])
        self.b = np.random.normal(loc=0.0, scale=1.0)
        self.x = x
        self.y = y
        for i in range(self.max_iter):
            self._update_step()
            # print('loss: \t{}'.format(self.loss()))
            # print('score: \t{}'.format(self.score()))
            # print('w: \t{}'.format(self.w))
            # print('b: \t{}'.format(self.b))

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def _f(self, x, w, b):
        # Linear function followed by the sigmoid: p = sigmoid(x.w + b).
        z = x.dot(w) + b
        return self._sigmoid(z)

    def predict_proba(self, x=None):
        if x is None:
            x = self.x
        y_pred = self._f(x, self.w, self.b)
        return y_pred

    def predict(self, x=None):
        if x is None:
            x = self.x
        y_pred_proba = self._f(x, self.w, self.b)
        # Threshold the probabilities at 0.5 to get 0/1 class labels.
        y_pred = np.array([0 if y_pred_proba[i] < 0.5 else 1 for i in range(len(y_pred_proba))])
        return y_pred

    def score(self, y_true=None, y_pred=None):
        if y_true is None or y_pred is None:
            y_true = self.y
            y_pred = self.predict()
        acc = np.mean([1 if y_true[i] == y_pred[i] else 0 for i in range(len(y_true))])
        return acc

    def loss(self, y_true=None, y_pred_proba=None):
        if y_true is None or y_pred_proba is None:
            y_true = self.y
            y_pred_proba = self.predict_proba()
        # Mean cross-entropy loss.
        return np.mean(-1.0 * (y_true * np.log(y_pred_proba) + (1.0 - y_true) * np.log(1.0 - y_pred_proba)))

    def _calc_gradient(self):
        # The gradient of the cross-entropy loss is (p - y) . x with p the
        # predicted probability, so use predict_proba() here rather than the
        # thresholded labels from predict().
        y_pred = self.predict_proba()
        d_w = (y_pred - self.y).dot(self.x) / len(self.y)
        d_b = np.mean(y_pred - self.y)
        return d_w, d_b

    def _update_step(self):
        d_w, d_b = self._calc_gradient()
        self.w = self.w - self.lr * d_w
        self.b = self.b - self.lr * d_b
        return self.w, self.b
This code implements the training and prediction process of the logistic regression model.

In the class's __init__ method, you can set the learning rate learning_rate, the maximum number of iterations max_iter, and the random seed seed.

In the fit method, the initial parameters w and b are drawn from a standard normal distribution, and the parameters are then updated iteratively until the maximum number of iterations is reached. On each iteration, the _update_step method is called to perform one parameter update.
In the _sigmoid method, the sigmoid function is defined; it converts the output of the linear function into a probability value.
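One caveat worth knowing: np.exp(-z) can overflow and raise a warning for large negative z. A numerically stable variant (a sketch; stable_sigmoid is a hypothetical helper, not part of the original class) evaluates np.exp only on non-positive arguments:

import numpy as np

def stable_sigmoid(z):
    # For z >= 0, use 1 / (1 + e^(-z)); for z < 0, use the algebraically
    # equivalent e^z / (1 + e^z), so np.exp never sees a large positive input.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out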
In the _f method, the input x is dot-multiplied with the weights w, and the intercept b is added to obtain the output z of the linear function. z is then passed to the _sigmoid method, which converts it into a probability value.
In the predict_proba method, the probability that a sample is a positive example is predicted from the input features and the current parameters. The output of the sigmoid function lies between 0 and 1 and can be interpreted as the probability that the sample is a positive example.
In the predict method, the predicted probability values are converted into binary classification labels: probabilities below 0.5 become 0, and the rest become 1.
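For example, thresholding a probability vector at 0.5 (np.where shown here as a vectorized equivalent of the list comprehension in predict, purely for illustration):

import numpy as np

proba = np.array([0.10, 0.49, 0.50, 0.93])
print(np.where(proba < 0.5, 0, 1))  # [0 0 1 1]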
In the score method, the accuracy of the model is calculated. It first checks whether custom true labels y_true and predicted labels y_pred were provided; if not, it uses the true labels self.y stored during fitting and the predicted labels obtained by calling predict. Then a list comprehension compares the true and predicted label at each position: equal pairs contribute a 1, indicating a correct prediction, and unequal pairs a 0, indicating a wrong one. Next, np.mean averages the elements of the list, i.e. computes the proportion of correct predictions. Finally, the accuracy acc is returned.
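The same accuracy can also be computed without a list comprehension, since NumPy compares arrays elementwise (a small illustration with made-up labels):

import numpy as np

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
print(np.mean(y_true == y_pred))  # 0.75: three of the four labels match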
In the loss method, the loss function of the model is calculated, namely the cross-entropy loss:
L(y, p) = -(y * log(p) + (1 - y) * log(1 - p)),
where L is the loss, y is the true label (0 or 1), and p is the model's predicted probability (between 0 and 1). When y = 1, the loss simplifies to -log(p): the smaller the predicted probability of the positive class, the larger the loss. When y = 0, it simplifies to -log(1 - p): the smaller the predicted probability of the negative class, the larger the loss. The idea behind this loss function is that when the model's prediction agrees with the true label the loss is close to 0, and when it disagrees the loss grows. By minimizing the cross-entropy loss, the model fits the training data better and improves classification accuracy.
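A quick numerical illustration of the two cases:

import numpy as np

# y = 1: loss is -log(p); a confident correct prediction is cheap,
# a confident wrong one is expensive.
print(-np.log(0.9))      # ~0.105
print(-np.log(0.1))      # ~2.303
# y = 0: loss is -log(1 - p), the mirror image.
print(-np.log(1 - 0.1))  # ~0.105
print(-np.log(1 - 0.9))  # ~2.303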
In the _calc_gradient method, the gradient of the loss function with respect to the parameters is calculated. For the cross-entropy loss above, the gradients are d_w = mean of (p - y) * x over the samples and d_b = mean of (p - y), where p is the predicted probability.
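One way to sanity-check this analytic gradient (a sketch, assuming the LogisticRegression class above is in scope) is to compare it against a central finite-difference approximation of the loss:

import numpy as np

clf = LogisticRegression(learning_rate=0.1, max_iter=0, seed=0)
x = np.random.rand(20, 2)
y = (x[:, 0] + x[:, 1] > 1).astype(int)
clf.fit(x, y)  # max_iter=0: only initializes w, b and stores x, y

d_w, d_b = clf._calc_gradient()
eps = 1e-6
w0 = clf.w.copy()
# Perturb the first weight up and down and measure the loss change.
clf.w = w0 + np.array([eps, 0.0])
loss_plus = clf.loss()
clf.w = w0 - np.array([eps, 0.0])
loss_minus = clf.loss()
clf.w = w0
print(d_w[0], (loss_plus - loss_minus) / (2 * eps))  # the two values should match closely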
In the _update_step method, the parameters are updated in the direction opposite to the gradient, scaled by the learning rate.
The purpose of this code is to implement a logistic regression model and provide training, prediction, accuracy, and loss computations. By iteratively updating the parameters, the model learns to predict a sample's class from its input features.
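For example, a minimal usage sketch (with made-up toy data, assuming the class above is defined):

import numpy as np

x = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y = np.array([0, 1, 0, 1])

clf = LogisticRegression(learning_rate=0.1, max_iter=500, seed=0)
clf.fit(x, y)
print(clf.predict(x))  # predicted labels for the training points
print(clf.score())     # training accuracy
print(clf.loss())      # training cross-entropy loss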
2. Data segmentation
import numpy as np

def generate_data(seed):
    np.random.seed(seed)
    # Class 0: 300 samples centered at (5, 4) with standard deviation 1.
    data_size_1 = 300
    x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
    x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
    y_1 = [0 for _ in range(data_size_1)]
    # Class 1: 400 samples centered at (10, 8) with standard deviation 2.
    data_size_2 = 400
    x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
    x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
    y_2 = [1 for _ in range(data_size_2)]
    # Merge the two classes into one feature matrix and one label vector.
    x1 = np.concatenate((x1_1, x1_2), axis=0)
    x2 = np.concatenate((x2_1, x2_2), axis=0)
    x = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
    y = np.concatenate((y_1, y_2), axis=0)
    # Shuffle the samples so the two classes are interleaved.
    data_size_all = data_size_1 + data_size_2
    shuffled_index = np.random.permutation(data_size_all)
    x = x[shuffled_index]
    y = y[shuffled_index]
    return x, y

def train_test_split(x, y):
    # First 70% of the (already shuffled) samples for training, rest for testing.
    split_index = int(len(y) * 0.7)
    x_train = x[:split_index]
    y_train = y[:split_index]
    x_test = x[split_index:]
    y_test = y[split_index:]
    return x_train, y_train, x_test, y_test
This code implements data generation and the splitting of the data into a training set and a test set. Here's an explanation of the code:

The generate_data(seed) function generates the data. Given the random seed seed, np.random.seed(seed) is called so that the same data is generated on every run. Two classes of data are produced. The first class has 300 samples: np.random.normal generates the two features x1_1 and x2_1, drawn from normal distributions with means 5.0 and 4.0 and standard deviation 1.0, and the labels y_1 are all 0. The second class has 400 samples: np.random.normal generates the two features x1_2 and x2_2, drawn from normal distributions with means 10.0 and 8.0 and standard deviation 2.0, and the labels y_2 are all 1. The np.concatenate and np.hstack methods merge the two classes, and the feature matrix x and label vector y are returned.
np.concatenate is a NumPy function that joins multiple arrays along a specified axis.
np.random.permutation is a NumPy function that randomly shuffles a sequence or array; here it shuffles the sample indices.
np.hstack is a NumPy function that concatenates multiple arrays horizontally (column-wise).
reshape(-1, 1) reshapes a one-dimensional array such as x1 into a two-dimensional array with one column, with the number of rows inferred automatically from the array length.
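A small illustration of the reshape and hstack steps (toy values):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])
print(x1.reshape(-1, 1).shape)  # (3, 1): one column, row count inferred
x = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
print(x)
# [[1. 4.]
#  [2. 5.]
#  [3. 6.]]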
The train_test_split(x, y) function splits the data into training and test sets, using 70% of the samples for training and 30% for testing. The steps are as follows: int(len(y) * 0.7) computes the split index; slicing then divides the feature matrix x and label vector y into the training set x_train, y_train and the test set x_test, y_test, which are returned. Note that because the samples were already shuffled in generate_data, this contiguous slice still yields a random split.
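For example (a quick check of the resulting shapes, using the sample sizes defined above):

x, y = generate_data(seed=272)
x_train, y_train, x_test, y_test = train_test_split(x, y)
print(x_train.shape, x_test.shape)  # (490, 2) (210, 2): 70% / 30% of 700 samples
print(y_train.shape, y_test.shape)  # (490,) (210,)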
3. Result generation
import numpy as np
import matplotlib.pyplot as plt
# import data_helper
# from logistic_regression import *

# data generation
x, y = generate_data(seed=272)
x_train, y_train, x_test, y_test = train_test_split(x, y)

# visualize data
# plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
# plt.show()
# plt.scatter(x_test[:,0], x_test[:,1], c=y_test, marker='.')
# plt.show()

# data normalization: scale features to [0, 1] using the training set's
# min and max, and apply the same transformation to the test set so the
# test data is not normalized with statistics of its own.
x_min = np.min(x_train, axis=0)
x_max = np.max(x_train, axis=0)
x_train = (x_train - x_min) / (x_max - x_min)
x_test = (x_test - x_min) / (x_max - x_min)

# Logistic regression classifier
clf = LogisticRegression(learning_rate=0.1, max_iter=500, seed=272)
clf.fit(x_train, y_train)

# plot the result
split_boundary_func = lambda x: (-clf.b - clf.w[0] * x) / clf.w[1]
xx = np.arange(0.1, 0.6, 0.1)
cValue = ['g', 'b']
plt.scatter(x_train[:, 0], x_train[:, 1], c=[cValue[i] for i in y_train], marker='o')
plt.plot(xx, split_boundary_func(xx), c='red')
plt.show()

# accuracy and loss on the test set
y_test_pred = clf.predict(x_test)
y_test_pred_proba = clf.predict_proba(x_test)
print(clf.score(y_test, y_test_pred))
print(clf.loss(y_test, y_test_pred_proba))
# print(y_test_pred_proba)
split_boundary_func is an anonymous function (lambda function) that accepts one parameter x and returns a value. Specifically, it uses the classifier's intercept clf.b and weights clf.w[0] and clf.w[1] to compute the vertical coordinate of the classification boundary line at horizontal coordinate x.
In the logistic regression model, the classification boundary line can be expressed as:
w0 * x + w1 * y + b = 0
where w0 and w1 are the model's weights and b is its intercept. Solving for y gives y = (-b - w0 * x) / w1, which is exactly what split_boundary_func(x) computes: the ordinate of the boundary line as a function of the abscissa.
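As a quick sanity check (a sketch reusing clf and split_boundary_func from the script above), any point on this line should receive a predicted probability of about 0.5, since it makes z = 0 and sigmoid(0) = 0.5:

# Pick an arbitrary abscissa, put the point on the boundary line,
# and ask the classifier for its probability.
x0 = 0.3
y0 = split_boundary_func(x0)
print(clf.predict_proba(np.array([[x0, y0]])))  # ~[0.5]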
The plt.scatter function draws a scatter plot. x_train[:,0] and x_train[:,1] are the two feature columns (independent variables) of the training data: x_train[:,0] takes the first feature of every row, and x_train[:,1] takes the second. c=[cValue[i] for i in y_train] specifies the color of each data point: y_train holds the labels (dependent variable) indicating each point's class, and cValue is a list of colors, one per class. The list comprehension picks the color corresponding to each label, so points from different classes are drawn in different colors. marker='o' draws the points as circles.
To summarize: first, the generate_data function generates two-dimensional data, and the dataset is split into a training set and a test set. Scatter plots of the two sets can then be displayed for visualization. Next, the feature data is normalized to the range [0, 1] using the training set's minimum and maximum. Then a logistic regression classifier object clf is created and trained on the training set. After that, the split_boundary_func function is defined to draw the classification boundary line, and the training data and boundary line are plotted together. Finally, the trained model makes predictions on the test set, and the model's accuracy and loss on the test set are computed.