Logistic regression model for machine learning

1 Introduction to the logistic regression model

        Logistic regression (LR), also called logistic regression analysis, is a machine learning algorithm for classification and prediction, used mainly for binary classification problems. It learns from historical data to estimate the probability of a future outcome. For example, we can take the probability that a user makes a purchase as the dependent variable and the user's attributes, such as gender, age, and registration time, as independent variables, then predict the purchase probability from those attributes.

        Logistic regression predicts the probability that an input sample belongs to a given class. Its core idea is to pass a linear combination of the input features through the sigmoid function (also called the logistic function), which maps any real number to a value between 0 and 1. The formula for the sigmoid function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = w^{T}x + b$ is the linear combination of the features $x$ with weights $w$ and bias $b$.
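As a minimal sketch of the sigmoid function and of how a linear score becomes a probability, the following plain-Python example may help. The helper names `sigmoid` and `predict_proba`, and the weights, bias, and feature values, are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample: sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two features (say, age and months since registration).
p = predict_proba(weights=[0.05, -0.25], bias=-0.5, features=[30.0, 2.0])
print(round(p, 3))
```

The two branches in `sigmoid` are mathematically equal; splitting them only keeps `math.exp` away from very large arguments.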

2 Application Scenarios of Logistic Regression

        Logistic regression is a simple yet efficient machine learning algorithm with several advantages. First, the model is easy to understand and implement, and it is cheap to train, which makes it well suited to large data sets. Second, it is interpretable: the sign and size of each coefficient show whether a feature pushes the prediction toward or away from the positive class, and how strongly. Third, it copes with high-dimensional data and can handle problems with a large number of features. Fourth, it outputs probabilities rather than only class labels, which is useful for tasks that need probability estimates or uncertainty analysis. Finally, it is fairly tolerant of noisy, imperfect real-world data, although missing values usually have to be imputed or dropped before training. In short, logistic regression is a practical classification algorithm that is widely adopted. Common application scenarios include the following.

  • Finance: Logistic regression can be used for financial risk management tasks such as credit scoring, fraud detection, and customer churn prediction.

  • Medical field: Logistic regression can be used for medical decision support tasks such as disease diagnosis, patient prognosis assessment, and drug response prediction.

  • Marketing: Logistic regression can be used for marketing tasks such as customer classification, user behavior analysis, and ad click-through rate prediction.

  • Natural Language Processing: Logistic regression can be used for natural language processing tasks such as text classification, sentiment analysis, and spam filtering.

  • Image recognition: Logistic regression can be applied to binary classification problems in image classification and object detection.
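
The interpretability mentioned in this section comes from the fact that each coefficient acts on the log-odds of the positive class. As a rough sketch, a coefficient w can be read as an odds ratio exp(w). The feature names and coefficient values below are made up for illustration and do not come from any fitted model.

```python
import math

# Hypothetical fitted coefficients, one per feature (illustrative values only).
coefficients = {"age": 0.03, "is_new_user": -0.7, "visits_last_week": 0.25}


def odds_ratios(coefs):
    """Map each coefficient w to exp(w): the factor by which the odds p / (1 - p)
    change when that feature increases by one unit, other features held fixed."""
    return {name: math.exp(w) for name, w in coefs.items()}


ratios = odds_ratios(coefficients)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds)")
```

Comparing coefficient sizes across features is only meaningful when the features are on comparable scales, for example after standardization.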

        Its simplicity and interpretability explain why logistic regression is so widely used in practice. However, it may not suit complex nonlinear problems. In those cases, consider a more expressive model, or combine logistic regression with feature engineering to improve performance.
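
To illustrate how feature engineering can help a linear model on a nonlinear problem, consider XOR, which no linear rule on the two raw inputs can separate. The sketch below adds one product feature, after which a linear decision rule suffices. The weights are hand-picked for the illustration rather than learned, and the helper names are hypothetical.

```python
def linear_score(weights, bias, features):
    """Linear part of a logistic regression model: w . x + b."""
    return bias + sum(w * x for w, x in zip(weights, features))


def classify(weights, bias, features):
    """Predict 1 when the score is positive, which is the same as sigmoid(score) > 0.5."""
    return 1 if linear_score(weights, bias, features) > 0 else 0


def add_interaction(x):
    """Feature engineering step: append the product x1 * x2 to the raw inputs."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# XOR truth table: (inputs, label)
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Hand-picked weights for (x1, x2, x1*x2); illustrative, not the result of training.
weights, bias = (1.0, 1.0, -2.0), -0.5

predictions = [classify(weights, bias, add_interaction(x)) for x, _ in xor_data]
print(predictions)  # [0, 1, 1, 0]
```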

3 Binary Classification of Bank Fraudsters with PyTorch

(1) Dataset

Each record in the bank fraud data set has 15 features and a binary label that says whether the person is fraudulent (dishonest) or not. The model is still linear: the output of a single linear layer is passed through a sigmoid to give a probability between 0 and 1. By convention, a probability greater than 0.5 is classified as 1, and anything else as 0.
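
The thresholding step can be sketched in a few lines of plain Python. The function name `to_labels` and the probability values are made up for illustration; the strict `>` comparison mirrors the `y_pred>0.5` test used in the article's `accuracy` function.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities into 0/1 labels; values above the threshold become 1."""
    return [1 if p > threshold else 0 for p in probabilities]


probs = [0.91, 0.50, 0.12, 0.67]  # made-up model outputs
print(to_labels(probs))           # [1, 0, 0, 1]
print(to_labels(probs, 0.7))      # [1, 0, 0, 0]
```

A probability of exactly 0.5 is assigned to class 0 here. Whether 0.5 is the right cut-off depends on the relative cost of the two kinds of error, so the threshold is left as a parameter.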

Sample rows from the data set (15 feature values followed by the label):

0,56.75,12.25,0,0,6,0,1.25,0,0,4,0,0,200,0,-1
0,31.67,16.165,0,0,1,0,3,0,0,9,1,0,250,730,-1
1,23.42,0.79,1,1,8,0,1.5,0,0,2,0,0,80,400,-1
1,20.42,0.835,0,0,8,0,1.585,0,0,1,1,0,0,0,-1
0,26.67,4.25,0,0,2,0,4.29,0,0,1,0,0,120,0,-1
0,34.17,1.54,0,0,2,0,1.54,0,0,1,0,0,520,50000,-1
1,36,1,0,0,0,0,2,0,0,11,1,0,0,456,-1
0,25.5,0.375,0,0,6,0,0.25,0,0,3,1,0,260,15108,-1
0,19.42,6.5,0,0,9,1,1.46,0,0,7,1,0,80,2954,-1
0,35.17,25.125,0,0,10,1,1.625,0,0,1,0,0,515,500,-1
0,32.33,7.5,0,0,11,2,1.585,0,1,0,0,2,420,0,1
1,38.58,5,0,0,2,0,13.5,0,1,0,0,0,980,0,1

In the last column, -1 marks a dishonest (fraudulent) person and 1 marks an honest one. The code below remaps -1 to 0 so that the labels are the 0/1 targets that the binary cross-entropy loss expects.
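
As a small standard-library sketch of that layout, the following splits each record into 15 features and a remapped label. It assumes comma-separated rows with the label last, as described above; the two rows are copied from the sample, and `load_rows` is a hypothetical helper rather than part of the article's code.

```python
import csv
import io

# Two rows copied from the sample: 15 feature values followed by the label.
sample = """0,56.75,12.25,0,0,6,0,1.25,0,0,4,0,0,200,0,-1
1,38.58,5,0,0,2,0,13.5,0,1,0,0,0,980,0,1
"""


def load_rows(text):
    """Split each row into a feature list and a 0/1 label (-1 is remapped to 0)."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        values = [float(v) for v in row]
        features.append(values[:-1])                  # first 15 columns
        labels.append(0 if values[-1] == -1 else 1)   # last column
    return features, labels


X, y = load_rows(sample)
print(len(X[0]), y)  # 15 [0, 1]
```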

(2) Complete PyTorch code

import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split


def accuracy(y_pred,y_true):
    # Threshold the probabilities at 0.5, then return the fraction of correct predictions
    y_pred = (y_pred>0.5).type(torch.int32)
    acc = (y_pred == y_true).float().mean()
    return acc

loss_fn = nn.BCELoss()
epochs = 1000
batch = 16
lr = 0.0001

data = pd.read_csv("credit.csv",header=None)
X = data.iloc[:,:-1]
Y = data.iloc[:,-1].replace(-1,0)


X = torch.from_numpy(X.values).type(torch.float32)
Y = torch.from_numpy(Y.values).type(torch.float32)

train_x,test_x,train_y,test_y = train_test_split(X,Y)

train_ds = TensorDataset(train_x,train_y)
train_dl = DataLoader(train_ds,batch_size=batch,shuffle=True)

test_ds = TensorDataset(test_x,test_y)
test_dl = DataLoader(test_ds,batch_size=batch)

model = nn.Sequential(
                nn.Linear(15,1),
                nn.Sigmoid()
)
optim = torch.optim.Adam(model.parameters(),lr=lr)

accuracy_rate = []

for epoch in range(epochs):
    for x,y in train_dl:
        y_pred = model(x)
        # Squeeze only the last dimension so a final batch of size 1 keeps shape (1,)
        y_pred = y_pred.squeeze(-1)
        loss = loss_fn(y_pred,y)
        optim.zero_grad()
        loss.backward()
        optim.step()

    with torch.no_grad():
        # Accuracy and loss on the training set
        y_pred = model(train_x)
        y_pred = y_pred.squeeze(-1)
        epoch_accuracy  = accuracy(y_pred,train_y)
        epoch_loss  = loss_fn(y_pred,train_y)

        # Store a plain float (in percent) for plotting
        accuracy_rate.append(epoch_accuracy.item()*100)

        # Accuracy and loss on the test set
        y_pred = model(test_x)
        y_pred = y_pred.squeeze(-1)

        epoch_test_accuracy  = accuracy(y_pred,test_y)
        epoch_test_loss  = loss_fn(y_pred,test_y)


        print('epoch:',epoch,
              'train_loss:',round(epoch_loss.item(),3),
              "train_accuracy",round(epoch_accuracy.item(),3),
              'test_loss:',round(epoch_test_loss.item(),3),
              "test_accuracy",round(epoch_test_accuracy.item(),3)
             )


accuracy_rate = np.array(accuracy_rate)
times = np.linspace(1, epochs, epochs)
plt.xlabel('epochs')
plt.ylabel('accuracy rate')
plt.plot(times, accuracy_rate)
plt.show()
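
The `loss_fn` in the listing above is `nn.BCELoss`, the binary cross-entropy. As a minimal plain-Python sketch, assuming per-sample probabilities and 0/1 targets, the quantity it averages can be written out as follows. The function name `binary_cross_entropy` and the `eps` clamp are illustrative choices and not PyTorch's implementation, which guards against log(0) in its own way.

```python
import math


def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


# Made-up predictions: confident and correct predictions give a small loss.
loss = binary_cross_entropy([0.9, 0.2, 0.6], [1, 0, 1])
print(round(loss, 3))
```

A prediction of 0.5 for every sample gives a loss of ln 2, about 0.693, so a training loss well below that value indicates the model has learned something beyond guessing.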

(3) Output

epoch: 951 train_loss: 0.334 train_accuracy 0.869 test_loss: 0.346 test_accuracy 0.866
epoch: 952 train_loss: 0.334 train_accuracy 0.863 test_loss: 0.348 test_accuracy 0.866
epoch: 953 train_loss: 0.337 train_accuracy 0.867 test_loss: 0.358 test_accuracy 0.86
epoch: 954 train_loss: 0.334 train_accuracy 0.867 test_loss: 0.35 test_accuracy 0.866
epoch: 955 train_loss: 0.334 train_accuracy 0.871 test_loss: 0.346 test_accuracy 0.866
epoch: 956 train_loss: 0.333 train_accuracy 0.865 test_loss: 0.348 test_accuracy 0.872
epoch: 957 train_loss: 0.333 train_accuracy 0.871 test_loss: 0.349 test_accuracy 0.866
epoch: 958 train_loss: 0.333 train_accuracy 0.867 test_loss: 0.347 test_accuracy 0.866
epoch: 959 train_loss: 0.334 train_accuracy 0.863 test_loss: 0.352 test_accuracy 0.866
epoch: 960 train_loss: 0.333 train_accuracy 0.867 test_loss: 0.35 test_accuracy 0.878
epoch: 961 train_loss: 0.334 train_accuracy 0.873 test_loss: 0.346 test_accuracy 0.866
epoch: 962 train_loss: 0.334 train_accuracy 0.865 test_loss: 0.353 test_accuracy 0.866
epoch: 963 train_loss: 0.333 train_accuracy 0.873 test_loss: 0.35 test_accuracy 0.866
epoch: 964 train_loss: 0.334 train_accuracy 0.863 test_loss: 0.345 test_accuracy 0.872
epoch: 965 train_loss: 0.333 train_accuracy 0.861 test_loss: 0.351 test_accuracy 0.866
epoch: 966 train_loss: 0.333 train_accuracy 0.873 test_loss: 0.348 test_accuracy 0.866
epoch: 967 train_loss: 0.333 train_accuracy 0.863 test_loss: 0.348 test_accuracy 0.866
epoch: 968 train_loss: 0.333 train_accuracy 0.867 test_loss: 0.351 test_accuracy 0.866
epoch: 969 train_loss: 0.334 train_accuracy 0.869 test_loss: 0.345 test_accuracy 0.878
epoch: 970 train_loss: 0.333 train_accuracy 0.869 test_loss: 0.348 test_accuracy 0.872
epoch: 971 train_loss: 0.335 train_accuracy 0.865 test_loss: 0.344 test_accuracy 0.86
epoch: 972 train_loss: 0.333 train_accuracy 0.867 test_loss: 0.35 test_accuracy 0.86
epoch: 973 train_loss: 0.334 train_accuracy 0.871 test_loss: 0.345 test_accuracy 0.872
epoch: 974 train_loss: 0.333 train_accuracy 0.865 test_loss: 0.351 test_accuracy 0.866
epoch: 975 train_loss: 0.333 train_accuracy 0.873 test_loss: 0.351 test_accuracy 0.86
epoch: 976 train_loss: 0.333 train_accuracy 0.869 test_loss: 0.346 test_accuracy 0.878
epoch: 977 train_loss: 0.333 train_accuracy 0.863 test_loss: 0.351 test_accuracy 0.866
epoch: 978 train_loss: 0.332 train_accuracy 0.865 test_loss: 0.351 test_accuracy 0.866
epoch: 979 train_loss: 0.332 train_accuracy 0.871 test_loss: 0.349 test_accuracy 0.866
epoch: 980 train_loss: 0.333 train_accuracy 0.865 test_loss: 0.345 test_accuracy 0.872
epoch: 981 train_loss: 0.332 train_accuracy 0.867 test_loss: 0.348 test_accuracy 0.872
epoch: 982 train_loss: 0.332 train_accuracy 0.863 test_loss: 0.349 test_accuracy 0.872
epoch: 983 train_loss: 0.333 train_accuracy 0.865 test_loss: 0.353 test_accuracy 0.866
epoch: 984 train_loss: 0.332 train_accuracy 0.865 test_loss: 0.35 test_accuracy 0.872
epoch: 985 train_loss: 0.333 train_accuracy 0.867 test_loss: 0.353 test_accuracy 0.86
epoch: 986 train_loss: 0.333 train_accuracy 0.871 test_loss: 0.345 test_accuracy 0.866
epoch: 987 train_loss: 0.331 train_accuracy 0.865 test_loss: 0.349 test_accuracy 0.872
epoch: 988 train_loss: 0.332 train_accuracy 0.869 test_loss: 0.345 test_accuracy 0.872
epoch: 989 train_loss: 0.332 train_accuracy 0.865 test_loss: 0.353 test_accuracy 0.866
epoch: 990 train_loss: 0.331 train_accuracy 0.865 test_loss: 0.348 test_accuracy 0.872
epoch: 991 train_loss: 0.333 train_accuracy 0.875 test_loss: 0.344 test_accuracy 0.86
epoch: 992 train_loss: 0.332 train_accuracy 0.865 test_loss: 0.351 test_accuracy 0.866
epoch: 993 train_loss: 0.331 train_accuracy 0.869 test_loss: 0.348 test_accuracy 0.872
epoch: 994 train_loss: 0.331 train_accuracy 0.871 test_loss: 0.348 test_accuracy 0.872
epoch: 995 train_loss: 0.331 train_accuracy 0.865 test_loss: 0.347 test_accuracy 0.872
epoch: 996 train_loss: 0.331 train_accuracy 0.865 test_loss: 0.347 test_accuracy 0.872
epoch: 997 train_loss: 0.331 train_accuracy 0.867 test_loss: 0.35 test_accuracy 0.872
epoch: 998 train_loss: 0.331 train_accuracy 0.867 test_loss: 0.349 test_accuracy 0.872
epoch: 999 train_loss: 0.331 train_accuracy 0.865 test_loss: 0.348 test_accuracy 0.872

[Figure: training-set accuracy (%) versus epochs, as plotted by the code above]

4 Complete data set and code download

Complete Code and Dataset: Code and Dataset

Origin blog.csdn.net/lsb2002/article/details/131244160