[Basics of Machine Learning] Logistic (log-odds) regression to classify Titanic survivors


Logistic regression (also called log-odds regression) can be regarded as the simplest artificial neural network. It fits the data to select a line (more generally, a hyperplane) that divides the data set into two parts, thereby achieving classification.


1. Theoretical background of logistic regression

The core of logistic regression is illustrated in the figure below. Our goal is to find the optimal w and b, which together define a straight line (hyperplane) that divides the data set into different parts.
[Figure: structure of logistic regression]

1. Define the Hyperplane

We need to find a line (hyperplane) that divides the data set into two parts. The hyperplane is defined as follows, where w0 and b are equivalent (the bias b can be absorbed into w as w0 by appending a constant feature x0 = 1):
[Figure: hyperplane definition, w^T x + b = 0]

2. Activation Function

For a classification problem we need to decide the category of each data point. For a binary classification problem we introduce an activation function whose output is thresholded to 0 or 1. Two common activation functions, Sigmoid and Tanh, are shown in the figure below (blue is Sigmoid, red is Tanh).
[Figure: sigmoid and tanh activation functions]

  • For sigmoid, when we input a value, we classify it as 1 if sigmoid(x) >= 0.5 and as 0 otherwise.
  • For tanh, we classify it as 1 if tanh(x) >= 0 and as 0 otherwise. (A small threshold helper is sketched after the activation-function code below.)
import numpy as np

# Activation functions

# Sigmoid
def sigmoid(x):
    return np.array(1.0/(1.0 + np.exp(-x)))

# Tanh
def tanh(x):
    return (np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))
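
As a small illustration of the thresholding rule above, a prediction helper might look like the following. This sketch and the name predict_sigmoid are my own, not part of the original post; it only assumes numpy and the sigmoid function defined above.

# A minimal sketch of the 0.5-threshold rule for sigmoid outputs
# (illustrative helper, not from the original article)
def predict_sigmoid(w, b, x):
    # x has shape (n_features, n_samples); returns a 0/1 label per sample
    a = sigmoid(np.dot(w.T, x) + b)
    return (a >= 0.5).astype(int)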

3. Loss Function

To obtain the optimal hyperplane parameters, we define a loss function and iteratively minimize its value. The loss function is defined as follows:
[Figure: loss function]
where y is the predicted value (which depends on the parameters w and b) and y_hat is the actual label.
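
Since the loss-function figure is not reproduced here, for reference the binary cross-entropy loss that this kind of implementation minimizes (assumed here to be the standard form) can be written in the notation above, with $y^{(i)} = \sigma(w^\top x^{(i)} + b)$ the prediction and $\hat{y}^{(i)}$ the actual label over $m$ samples, as

$$L(w, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,\hat{y}^{(i)}\log y^{(i)} + \big(1 - \hat{y}^{(i)}\big)\log\big(1 - y^{(i)}\big)\Big]$$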

4. Gradient Descent

Moving against the gradient (the direction of steepest descent) gives the fastest convergence, and the parameters are updated by back-propagation. In the update formulas for w and b shown below, alpha denotes the learning rate.
[Figure: update rules for w and b]
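
The figure with the update formulas is not reproduced; written out explicitly in terms of the variables used in the code below (a = sigmoid(w^T X + b) the predictions, Y the true labels, m the number of samples), the gradients and updates are

$$dz = a - Y,\qquad dw = \frac{1}{m}\,X\,dz^\top,\qquad db = \frac{1}{m}\sum_{i=1}^{m} dz_i$$
$$w \leftarrow w - \alpha\,dw,\qquad b \leftarrow b - \alpha\,db$$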

5. Calculation steps

Putting the above together gives the complete computation procedure (the number of iterations must be specified in advance):
[Figure: calculation steps]
The gradient-descent code for obtaining the parameters is as follows:

# Solve for w and b by gradient descent

def GD(input_data_x, input_data_y, alpha = 0.001, itera = 1000):
    # input_data_x: features with shape (n_features, n_samples)
    # input_data_y: labels with shape (1, n_samples)
    [n, m] = input_data_x.shape
    # Initialize w and b
    w = np.ones((n, 1))
    b = np.ones(1)
    
    for i in range(itera):
        # Forward pass
        z = np.dot(w.T, input_data_x) + b
        a = sigmoid(z)
        
        # Error term: gradient of the loss with respect to z
        dz = a - input_data_y
        # Update w and b
        b -= alpha * (np.sum(dz))/m
        w -= alpha * np.dot(input_data_x, dz.T)/m
    
    return w, b

w, b = GD(train_data.T, train_label.T)

Test on the test data set:

def test(test_data, test_label, w, b):
    # Predicted probability for each test sample
    sigmoid_result = sigmoid(np.dot(w.T, test_data) + b)
    result = []
    # Threshold at 0.5 to decide the class
    for sr in sigmoid_result[0]:
        if sr >= 0.5: 
            result.append(1)
        else:
            result.append(0)

    # Count misclassified samples and print the error rate
    error = 0
    for i in range(len(result)):
        if result[i] != test_label[0][i]:
            error += 1
    
    print(error/len(result))

test(test_data.T, test_label.T, w, b)

The logistic regression implemented in this article achieves an error rate of 0.23979591836734693 (roughly 76% accuracy) on the Titanic survivor classification task.


2. Data (Titanic Survivor Data Set)

The data set for this article comes from Kaggle/titanic

1. Data feature description

  • age: passenger age
  • fare: ticket fare
  • sex: gender
  • pclass: socioeconomic status, 1 (upper), 2 (middle), 3 (lower)
  • survived: 1 (survived), 0 (died)

Survival is classified according to passenger age, fare, gender, socioeconomic status, and so on; in the example below only age and fare are actually used (selecting the full feature set is left commented out in the loading code).

2. Import data

We need to divide the imported data into two parts: training set and test set. The training set accounts for 70% of the total data set, and the rest is the test set.

import pandas as pd 

def load_csv_data():
    csv_file = pd.read_csv(r'F:\UCAS\Work\Course\2020\ML\ML-Learning\ML_action\4.LogicRegression\data\titanic\train_and_test2.csv')
    data_set = np.array(csv_file)

    # Label: the last column indicates survival
    survived = data_set[:, [-1]]
    survived = survived.astype(int)
    # Features: columns 1, 2, 3, 21 are age, fare, sex, socioeconomic status
    # input_data = data_set[:, [1,2,3,21]]
    input_data = data_set[:, [1,2]]
    return input_data, survived

Partition the data into training and test sets:

# Import the data
input_data, label = load_csv_data()

# ninput_data = normlize(input_data)
# Split the data: first 70% for training, the rest for testing
m, n = input_data.shape
train_size = int(m*0.7)
# Training set
train_data = input_data[0:train_size, :]
train_label = label[0:train_size]
# Test set
test_data = input_data[train_size:-1, :]
test_label = label[train_size:-1]
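
The commented-out normlize call above suggests the author also experimented with feature scaling before training. A minimal min-max scaling sketch is shown below; the function normalize_features is my own illustration, not the original normlize, and is not required to reproduce the results above.

# A minimal min-max scaling sketch (illustrative; not the original normlize function)
def normalize_features(x):
    # Scale each feature column of an (n_samples, n_features) array to [0, 1]
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min + 1e-12)

Scaling age and fare to a comparable range generally helps gradient descent converge faster for this kind of model.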

3. Visualization

The figure below shows the training data together with the dividing line found by logistic regression, where red means death and green means survival. The horizontal axis is age and the vertical axis is ticket fare. Roughly speaking, younger passengers with higher fares had a higher survival rate. Gender or other features can be handled in the same way, which is not repeated here.
[Figure: age vs. fare scatter plot with the logistic regression decision boundary]

from matplotlib import pyplot as plt

# Color each training point: green = survived, red = died
colors = ['green' for x in range(0, train_size)]
for i in range(len(colors)):
    if train_label[i] == 0:
        colors[i] = 'red'

# Decision boundary: w[0]*x + w[1]*y + b = 0, i.e. y = -(w[0]*x + b)/w[1]
x = np.linspace(0, 88)
y = -(w[0]*x + b)/w[1]
plt.plot(x, y, color='blue')

plt.scatter(train_data[:, 0], train_data[:, 1], color = colors)
plt.show()

