Machine Learning---Logistic Regression Code

1. Logistic regression model


import numpy as np

class LogisticRegression(object):

    def __init__(self, learning_rate=0.1, max_iter=100, seed=None):
        self.seed = seed
        self.lr = learning_rate
        self.max_iter = max_iter

    def fit(self, x, y):
        np.random.seed(self.seed)
        self.w = np.random.normal(loc=0.0, scale=1.0, size=x.shape[1])
        self.b = np.random.normal(loc=0.0, scale=1.0)
        self.x = x
        self.y = y
        for i in range(self.max_iter):
            self._update_step()
            # print('loss: \t{}'.format(self.loss()))
            # print('score: \t{}'.format(self.score()))
            # print('w: \t{}'.format(self.w))
            # print('b: \t{}'.format(self.b))

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def _f(self, x, w, b):
        z = x.dot(w) + b
        return self._sigmoid(z)

    def predict_proba(self, x=None):
        if x is None:
            x = self.x
        y_pred = self._f(x, self.w, self.b)
        return y_pred

    def predict(self, x=None):
        if x is None:
            x = self.x
        y_pred_proba = self._f(x, self.w, self.b)
        y_pred = (y_pred_proba >= 0.5).astype(int)  # threshold the probabilities at 0.5
        return y_pred

    def score(self, y_true=None, y_pred=None):
        if y_true is None or y_pred is None:
            y_true = self.y
            y_pred = self.predict()
        acc = np.mean(np.asarray(y_true) == np.asarray(y_pred))  # fraction of correct predictions
        return acc

    def loss(self, y_true=None, y_pred_proba=None):
        if y_true is None or y_pred_proba is None:
            y_true = self.y
            y_pred_proba = self.predict_proba()
        return np.mean(-1.0 * (y_true * np.log(y_pred_proba) + (1.0 - y_true) * np.log(1.0 - y_pred_proba)))

    def _calc_gradient(self):
        # Gradient of the mean cross-entropy loss: uses the predicted
        # probabilities p, not the thresholded 0/1 labels
        y_pred_proba = self.predict_proba()
        d_w = (y_pred_proba - self.y).dot(self.x) / len(self.y)
        d_b = np.mean(y_pred_proba - self.y)
        return d_w, d_b

    def _update_step(self):
        d_w, d_b = self._calc_gradient()
        self.w = self.w - self.lr * d_w
        self.b = self.b - self.lr * d_b
        return self.w, self.b

This code implements the training and prediction process of the logistic regression model.

In the class's initialization method __init__, you can set the learning rate learning_rate, the maximum number of iterations max_iter, and the random seed seed.

In the fit method, the parameters w and b are initialized with random draws from a standard normal distribution, and the parameters are then updated iteratively until the maximum number of iterations is reached. On each iteration, the _update_step method is called to perform one update.
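
As a quick illustration, here is a minimal usage sketch of the class on a tiny hand-made dataset (x_toy and y_toy are invented for this example and are not part of the original code):

import numpy as np

x_toy = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # 4 samples, 2 features
y_toy = np.array([0, 1, 0, 1])  # the label here depends only on the first feature
model = LogisticRegression(learning_rate=0.1, max_iter=100, seed=0)
model.fit(x_toy, y_toy)
print(model.predict(x_toy))  # predicted 0/1 labels for the training points
print(model.score())         # accuracy on the stored training data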

In the _sigmoid method, the sigmoid function is defined; it converts the output of the linear function into a probability value.
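
Note that np.exp(-z) can overflow and emit a runtime warning when z is a large negative number. The sketch below shows one common numerically stable variant; it is an optional improvement, not part of the original code:

def stable_sigmoid(z):
    # z: 1-D array of linear outputs
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so exp never overflows
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out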

In the _f method, the input x is dot-multiplied with the weights w and the bias b is added, giving the output z of the linear function; z is then passed as an argument to the _sigmoid method, which converts it into a probability value.

In the predict_proba method, the probability that a sample is a positive example is predicted from the input features and the current parameters. The sigmoid function's output lies between 0 and 1, so it can be interpreted as the probability that the sample is a positive example.

In the predict method, the predicted probability values are converted into binary classification labels, using 0.5 as the threshold.

In the score method, the accuracy of the model is calculated. It first checks whether true labels y_true and predicted labels y_pred were supplied; if not, it uses the true labels self.y saved during fitting and the predictions obtained by calling the predict method. It then compares the true and predicted labels element-wise (a match counts as 1, a mismatch as 0) and averages the result with np.mean, i.e., it computes the proportion of correct predictions, which is returned as the accuracy acc.

In the loss method, the model's loss function, the cross-entropy loss, is calculated:

L(y, p) = -(y * log(p) + (1 - y) * log(1 - p))

where L is the loss, y is the true label (0 or 1), and p is the model's predicted probability (ranging from 0 to 1). When y = 1, the loss reduces to -log(p): the smaller the predicted probability of the positive class, the larger the loss. When y = 0, it reduces to -log(1 - p): the smaller the predicted probability of the negative class, the larger the loss. The idea behind this loss function is that when the model's prediction agrees with the true label, the loss is close to 0; when it disagrees, the loss grows. By minimizing the cross-entropy loss, the model fits the training data better and classification accuracy improves.
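
A small worked example of this formula (the numbers are illustrative only):

import numpy as np

y_true = np.array([1, 1, 0])
p = np.array([0.9, 0.1, 0.2])
loss = np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))
# Per-sample losses: -log(0.9) ≈ 0.105, -log(0.1) ≈ 2.303, -log(0.8) ≈ 0.223
# The confident wrong prediction (p = 0.1 for a true 1) dominates: mean ≈ 0.877
print(loss)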

In the _calc_gradient method, the gradient of the loss function with respect to the parameters is calculated: d_w is the average over the samples of (p - y) * x and d_b is the average of (p - y), where p is the predicted probability.
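
As a sanity check on the analytic gradient, it can be compared against a numerical finite difference after fitting. This is an illustrative sketch, not part of the original code; it relies on the training data that fit stores on the model:

def numerical_grad_b(clf, eps=1e-6):
    # Central difference of the training loss with respect to the bias b
    clf.b += eps
    loss_plus = clf.loss()
    clf.b -= 2 * eps
    loss_minus = clf.loss()
    clf.b += eps  # restore the original value
    return (loss_plus - loss_minus) / (2 * eps)

# After clf.fit(...), numerical_grad_b(clf) should closely match the d_b
# returned by clf._calc_gradient().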

In the _update_step method, the parameters are updated using the gradient and the learning rate: w = w - lr * d_w and b = b - lr * d_b.

The purpose of this code is to implement a logistic regression model, providing training, prediction, accuracy, and loss functions. By iteratively updating the parameters, the model learns to predict a sample's class from its input features.

2. Data generation and splitting

import numpy as np

def generate_data(seed):
    np.random.seed(seed)
    data_size_1 = 300
    x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
    x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
    y_1 = [0 for _ in range(data_size_1)]
    data_size_2 = 400
    x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
    x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
    y_2 = [1 for _ in range(data_size_2)]
    x1 = np.concatenate((x1_1, x1_2), axis=0)
    x2 = np.concatenate((x2_1, x2_2), axis=0)
    x = np.hstack((x1.reshape(-1,1), x2.reshape(-1,1)))
    y = np.concatenate((y_1, y_2), axis=0)
    data_size_all = data_size_1+data_size_2
    shuffled_index = np.random.permutation(data_size_all)
    x = x[shuffled_index]
    y = y[shuffled_index]
    return x, y
def train_test_split(x, y):
    split_index = int(len(y)*0.7)
    x_train = x[:split_index]
    y_train = y[:split_index]
    x_test = x[split_index:]
    y_test = y[split_index:]
    return x_train, y_train, x_test, y_test

This code implements data generation and the split into training and test sets. Here is an explanation of the code:

The generate_data(seed) function generates the data. np.random.seed(seed) sets the random seed so that the same data is generated on every run. Two classes of data are produced. The first class has 300 samples: np.random.normal generates the two features x1_1 and x2_1, drawn from normal distributions with means 5.0 and 4.0 and standard deviation 1.0, and the labels y_1 are all 0. The second class has 400 samples: np.random.normal generates the two features x1_2 and x2_2, drawn from normal distributions with means 10.0 and 8.0 and standard deviation 2.0, and the labels y_2 are all 1. The np.concatenate and np.hstack functions merge the two classes into the feature matrix x and label vector y, np.random.permutation produces a shuffled index that mixes the two classes together, and x and y are returned.

np.concatenate is a NumPy function used to join multiple arrays along a specified axis.

np.random.permutation is a NumPy function used to randomly shuffle a sequence or array.

np.hstack is a NumPy function used to stack multiple arrays horizontally (column-wise).

reshape(-1, 1) reshapes the array x1 into a two-dimensional array with one column; the number of rows is computed automatically from the array length.
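
A quick, illustrative check of the generated data:

x, y = generate_data(seed=272)
print(x.shape)  # (700, 2): 300 class-0 samples plus 400 class-1 samples, 2 features
print(y.shape)  # (700,)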

The train_test_split(x, y) function splits the data into training and test sets, taking 70% of the samples as the training set and 30% as the test set. The steps are as follows: int(len(y) * 0.7) computes the split index, and slicing divides the feature matrix x and label vector y into the training-set feature matrix x_train and label vector y_train and the test-set feature matrix x_test and label vector y_test, which are returned. Because the data was already shuffled in generate_data, a plain slice yields a random split.
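
A variant with a configurable split ratio might look like the sketch below (train_ratio is an added parameter, not part of the original code). Like the original, it relies on the data already having been shuffled in generate_data:

def train_test_split_ratio(x, y, train_ratio=0.7):
    split_index = int(len(y) * train_ratio)
    return x[:split_index], y[:split_index], x[split_index:], y[split_index:]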

3. Result generation

import numpy as np

import matplotlib.pyplot as plt

#import data_helper

#from logistic_regression import *

# data generation

x, y = generate_data(seed=272)
x_train, y_train, x_test, y_test = train_test_split(x, y)

# visualize data
# plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
# plt.show()
# plt.scatter(x_test[:,0], x_test[:,1], c=y_test, marker='.')
# plt.show()

# data normalization: min-max scaling to [0, 1]
# the test set is scaled with the training-set minimum and maximum,
# so both sets go through the same transform
x_min = np.min(x_train, axis=0)
x_max = np.max(x_train, axis=0)
x_train = (x_train - x_min) / (x_max - x_min)
x_test = (x_test - x_min) / (x_max - x_min)

# Logistic regression classifier
clf = LogisticRegression(learning_rate=0.1, max_iter=500, seed=272)
clf.fit(x_train, y_train)

# plot the result
split_boundary_func = lambda x: (-clf.b - clf.w[0] * x) / clf.w[1]
xx = np.arange(0.1, 0.6, 0.1)
cValue = ['g','b'] 
plt.scatter(x_train[:,0], x_train[:,1], c=[cValue[i] for i in y_train], marker='o')
plt.plot(xx, split_boundary_func(xx), c='red')
plt.show()

# loss on test set
y_test_pred = clf.predict(x_test)
y_test_pred_proba = clf.predict_proba(x_test)
print(clf.score(y_test, y_test_pred))
print(clf.loss(y_test, y_test_pred_proba))
# print(y_test_pred_proba)

split_boundary_func is an anonymous function (lambda function) that accepts one argument x and returns the vertical coordinate of the classification boundary line, computed from the trained classifier clf's intercept clf.b and weights clf.w[0] and clf.w[1].

In a logistic regression model, the decision boundary is where the predicted probability equals 0.5, and since sigmoid(z) = 0.5 exactly when z = 0, the classification boundary line can be expressed as:

w0 * x + w1 * y + b = 0

where w0 and w1 are the model's weights and b is the model's intercept. Solving for y gives split_boundary_func(x) = (-b - w0 * x) / w1, i.e., the ordinate as a function of the abscissa.
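
A quick check of this algebra with made-up parameters (w0, w1, and b below are hypothetical values, not the trained ones):

w0, w1, b = 2.0, 1.0, -1.5
boundary_y = lambda x: (-b - w0 * x) / w1
# Any point on the returned line satisfies the boundary equation:
assert abs(w0 * 0.5 + w1 * boundary_y(0.5) + b) < 1e-12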

The plt.scatter function draws a scatter plot. x_train[:,0] and x_train[:,1] are the two feature columns (independent variables) of the training data: x_train[:,0] takes the first feature value of every row and x_train[:,1] takes the second. c=[cValue[i] for i in y_train] specifies the color of each data point: y_train holds the labels (dependent variable) of the training data, indicating each point's class, and cValue is a list of colors, one per class, so the list comprehension selects the color matching each label and points from different classes get different colors. marker='o' sets the marker shape to circles.

First, the generate_data function generates some 2D data, and the dataset is split into training and test sets.

Then, scatter plots of the training and test sets can be shown for data visualization (the corresponding calls are commented out above).

Next, the feature data of the training and test sets is normalized, scaled into the range [0, 1] using the training-set statistics.

Then, a logistic regression classifier object clf is created and trained on the training-set data.

Next, the split_boundary_func function is defined to draw the classification boundary line, and the training-set data and the boundary line are plotted.

Finally, the trained model is used to predict on the test set, and the model's accuracy and loss on the test set are computed.
