Logistic Regression: Analysis of Bank Loan Defaults

1. Principle of Logistic Regression

1. Sigmoid function

The principle of logistic regression is to map the result of linear regression, which ranges over $(-\infty, +\infty)$, into the interval $(0, 1)$ through the logistic (sigmoid) function. The linear regression function and the sigmoid function are introduced below.

  • Linear regression function
    Mathematical expression of linear regression function:
    $y = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dots + \theta_{n}x_{n} = \theta^{T}x$
    where $x_{i}$ are the independent variables, $y$ is the dependent variable whose range is $(-\infty, +\infty)$, $\theta_{0}$ is a constant term, and $\theta_{i}$ are the coefficients to be solved for; the different weights $\theta_{i}$ reflect how much each independent variable contributes to the dependent variable.
    For an equation of one variable, y = a + bx, regression analysis involving only one independent variable and one dependent variable is called simple (univariate) linear regression.
    For the two-variable equation y = a + b1x1 + b2x2 or the three-variable equation y = a + b1x1 + b2x2 + b3x3, regression analysis involving two or more independent variables is called multiple linear regression.
    Whether simple or multiple, both are forms of linear regression analysis.

  • Sigmoid function
    Function expression:
    $g(z) = \frac{1}{1 + e^{-z}}$
    Function image:
    [Figure: plot of the sigmoid curve]
    It can be seen from the figure that as $z$ tends to $-\infty$, $g(z)$ tends to 0, and as $z$ tends to $+\infty$, $g(z)$ tends to 1; the range of the function is $(0, 1)$. The reason is that as $z \to -\infty$, $e^{-z}$ tends to $+\infty$, so $g(z)$ tends to 0; as $z \to +\infty$, $e^{-z}$ tends to 0, so $g(z)$ tends to 1.
    It can also be observed that by the time $z$ reaches about 5, $g(z)$ is already around 0.99, and the larger $z$ becomes, the closer $g(z)$ gets to 1. This makes the function a natural way to describe the probabilities we meet in daily life: we might say the probability of rain tomorrow is 0.3 and of sunshine 0.7, or that a coin lands heads with probability 0.5 and tails with probability 0.5. Since a probability is also a number between 0 and 1, it is natural to relate the range of the sigmoid function to probability. A minimal implementation of this function is sketched below.
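    The code blocks later in this post call a sigmoid() helper that is never shown. A minimal NumPy sketch of it, matching the formula above (the name sigmoid is assumed to be the one used by the rest of the code), is:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); works for scalars, NumPy arrays and matrices
    return 1.0 / (1.0 + np.exp(-z))

    For example, sigmoid(0) returns 0.5 and sigmoid(5) is roughly 0.993, which matches the observation above.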

2. Use the gradient descent method to solve the parameters

The gradient descent process is similar to the process of a person going down a mountain:

  • step1: Make clear where you are now;
  • step2: Find the direction of the fastest descent from the current position;
  • step3: Walk one step along the direction found in the second step to reach a new position, and the new position is lower than the previous position;
  • step4: judge whether you have reached the bottom of the mountain; if the lowest point has not been reached, return to step 2 and continue, and if it has been reached, stop.

From the above analysis, the key to solving for the parameters with the gradient descent method is to find the direction of fastest descent and to choose the step size.
So what is the direction in which the function drops the fastest?
If you have learned the derivative of a one-variable function, you should know that the geometric meaning of the derivative is the slope of the tangent line at a certain point. In addition, the derivative can also represent the rate of change of the function at this point. The larger the derivative, the greater the change of the function at this point.
[Figure: tangent lines at points p1 and p2 on a curve, showing their different slopes]

It can be seen from the figure that the slope of point p2 is greater than the slope of point p1, that is, the derivative of point p2 is greater than the derivative of point p1. For a multidimensional vector
$x = (x_{1}, x_{2}, \dots, x_{n})$
its derivative is called the gradient, the vector of partial derivatives: when differentiating with respect to one variable, the other variables are treated as constants, and doing this for every component gives
$x' = (x_{1}', x_{2}', \dots, x_{n}')$
For a specific point of the function, the gradient indicates the direction in which the value of the function changes most rapidly starting from that point. With this, the direction needed by the gradient descent method to solve for the parameters has been found: it is the gradient direction of the function.
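To make the idea concrete, here is a minimal sketch of gradient descent on a simple one-variable function (the example function f(x) = (x - 3)^2 and the learning rate are illustrative choices, not taken from the original post):

def gradient_descent(derivative, x0, alpha=0.1, num_steps=100):
    # repeatedly take a small step against the derivative (the 1-D gradient)
    x = x0
    for _ in range(num_steps):
        x = x - alpha * derivative(x)
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3; its derivative is 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)   # converges to approximately 3, the "lowest point of the mountain"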

2. Using Logistic Regression for Classification

1. Data preprocessing

  1. Obtaining the Dataset
    My dataset contains users' personal information (age, education, length of service, address, income, debt ratio, credit card debt, etc.); the goal is to use this information to determine whether the user will default.
    [Figure: sample rows of the dataset]
    During preprocessing, the user information is first stored in a two-dimensional array. The return value data_arr holds the data features and label_arr holds the label corresponding to each sample:
import numpy as np

def load_data_set(fileName):
    # read the data file: the first 8 columns of each line are features,
    # the 9th column is the class label (defaulted or not)
    data_arr = []
    label_arr = []
    f = open(fileName, 'r')
    for line in f.readlines():
        line_arr = line.strip().split()
        data_arr.append([float(line_arr[i]) for i in range(8)])
        label_arr.append(int(line_arr[8]))
    f.close()
    # min-max normalization of the features (maxminnorm is defined below
    # and must be in scope before this function is called)
    data_arr = maxminnorm(data_arr)
    return data_arr, label_arr

data_arr, class_labels = load_data_set('data/loandata.txt')

When preprocessing, note that if some of the values are very large, the data should be normalized; otherwise numerical overflow can occur. In the sigmoid function
$g(z) = \frac{1}{1 + e^{-z}}$
when $z$ is a large negative number, $e^{-z}$ becomes extremely large, so the data needs to be normalized first. Using the min-max normalization formula
$x_{i} = \frac{x_{i} - \min(x)}{\max(x) - \min(x)}$
all values are guaranteed to lie between 0 and 1 while the characteristics represented by the data are preserved, and no overflow occurs when the values are fed into the sigmoid function.

def maxminnorm(array):
    # min-max normalization: scale every column into the range [0, 1]
    # (assumes each column has max > min, i.e. no constant column)
    array = np.array(array)
    maxcols = array.max(axis=0)
    mincols = array.min(axis=0)
    data_shape = array.shape
    data_rows = data_shape[0]
    data_cols = data_shape[1]
    t = np.empty((data_rows, data_cols))
    for i in range(data_cols):
        t[:, i] = (array[:, i] - mincols[i]) / (maxcols[i] - mincols[i])
    return t

2. Use gradient ascent to calculate regression coefficients

  • Gradient ascent model
    The gradient ascent method is used because the coefficients come from maximum likelihood estimation, so the log-likelihood is maximized rather than minimized. The parameter data_arr is an ordinary array (a two-dimensional ndarray can also be passed in), and class_labels is the list of category labels, a row vector. To make the matrix calculations convenient, the row vector is transposed into a column vector and assigned to label_mat.
def grad_ascent(data_arr, class_labels):
    data_mat = np.mat(data_arr)
    # convert the labels to a matrix, then transpose into a column vector
    label_mat = np.mat(class_labels).transpose()
    # m -> number of samples, n -> number of features
    m, n = np.shape(data_mat)
    # learning rate
    alpha = 0.001
    # maximum number of iterations
    max_cycles = 500
    # weights holds the regression coefficients; ones((n, 1)) creates a column
    # vector with one entry per feature, all initialized to 1
    weights = np.ones((n, 1))
    for k in range(max_cycles):
        h = sigmoid(data_mat * weights)
        error = label_mat - h
        weights = weights + alpha * data_mat.transpose() * error
    return weights
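    For reference, the update performed inside the loop above is the gradient ascent step on the log-likelihood, written with $X$ for the data matrix, $y$ for the column vector of labels, $g$ for the sigmoid function and $\alpha$ for the learning rate:
    $w \leftarrow w + \alpha X^{T}\left(y - g(Xw)\right)$
    where $y - g(Xw)$ is exactly the error vector computed in the code.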
  • Stochastic gradient ascent algorithm:
    The gradient ascent algorithm must traverse the entire data set every time the regression coefficients are updated. If the number of samples is large, the computational cost becomes too high. If only one sample point is used to update the regression coefficients at a time, the method is called the stochastic gradient ascent algorithm. The differences from the full gradient ascent algorithm are: (1) in the latter the variables h and error are vectors, while in the former they are single numerical values; (2) the former has no matrix conversion step, and all variables are NumPy arrays.
def stoc_grad_ascent0(data_mat, class_labels):
    m, n = np.shape(data_mat)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        # sum(data_mat[i] * weights) evaluates z = w1*x1 + w2*x2 + ... + wn*xn,
        # so h here is a single numerical value rather than a matrix
        h = sigmoid(sum(data_mat[i] * weights))
        error = class_labels[i] - h
        weights = weights + alpha * error * data_mat[i]
    return weights
  • Improved stochastic gradient ascent algorithm:
    Compared with the stochastic gradient ascent algorithm, this version improves several aspects: (1) alpha is adjusted on every iteration, which dampens the fluctuations in the coefficients; although alpha keeps decreasing as the number of iterations grows, it never decreases to 0; (2) the regression coefficients are updated using randomly selected samples, which further reduces fluctuations; (3) the number of iterations can be changed by passing it as the third parameter, which defaults to 150.
def stoc_grad_ascent1(data_mat, class_labels, num_iter=150):
    m, n = np.shape(data_mat)
    weights = np.ones(n)
    for j in range(num_iter):
        # this must be a list, otherwise the del below would not work
        data_index = list(range(m))
        for i in range(m):
            # as i and j grow, alpha keeps decreasing but never reaches 0
            alpha = 4 / (1.0 + j + i) + 0.01
            # pick a random index between 0 and len(data_index);
            # random.uniform(x, y) returns a random real number in [x, y]
            rand_index = int(np.random.uniform(0, len(data_index)))
            h = sigmoid(np.sum(data_mat[data_index[rand_index]] * weights))
            error = class_labels[data_index[rand_index]] - h
            weights = weights + alpha * error * data_mat[data_index[rand_index]]
            del(data_index[rand_index])
    return weights

3. Training and Validation

  • The final classification function
    The classification function computes the sigmoid value from the regression coefficients and the feature vector, and returns 1 if the value is greater than 0.5, otherwise 0.
def classify_vector(in_x, weights):
    # predicted probability that the sample belongs to class 1
    prob = sigmoid(np.sum(in_x * weights))
    if prob > 0.5:
        return 1.0
    return 0.0
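    As a quick illustration, a single sample can be classified as follows (the feature values here are made up, not taken from the dataset, and data_arr / class_labels are assumed to have been loaded by load_data_set above):

# train on the normalized data loaded earlier, then classify one hypothetical sample
weights = stoc_grad_ascent1(np.array(data_arr), class_labels)
sample = np.array([0.35, 0.5, 0.2, 0.1, 0.15, 0.4, 0.25, 0.3])
print(classify_vector(sample, weights))   # 1.0 or 0.0 depending on the predicted class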
  • Training and testing the model
def colic_test():
    f_train = open('data/loandata.txt', 'r')
    f_test = open('data/loandata_test.txt', 'r')
    training_set = []
    training_labels = []
    # training_set stores the features of the training data,
    # training_labels stores the class label of each training sample
    for line in f_train.readlines():
        curr_line = line.strip().split('\t')
        if len(curr_line) == 1:
            continue    # skip empty lines
        line_arr = [float(curr_line[i]) for i in range(8)]
        training_set.append(line_arr)
        training_labels.append(float(curr_line[8]))
    # compute the regression coefficients on this data set with all three methods
    train_weights0 = grad_ascent(np.array(training_set), training_labels)
    train_weights1 = stoc_grad_ascent0(np.array(training_set), training_labels)
    train_weights2 = stoc_grad_ascent1(np.array(training_set), training_labels)
    error_count0 = 0
    error_count1 = 0
    error_count2 = 0
    num_test_vec = 0.0
    # read the test set, count the misclassified samples and compute the error rate
    for line in f_test.readlines():
        num_test_vec += 1
        curr_line = line.strip().split('\t')
        if len(curr_line) == 1:
            continue    # skip empty lines
        line_arr = [float(curr_line[i]) for i in range(8)]
        if int(classify_vector(np.array(line_arr), train_weights0)) != int(curr_line[8]):
            error_count0 += 1
        if int(classify_vector(np.array(line_arr), train_weights1)) != int(curr_line[8]):
            error_count1 += 1
        if int(classify_vector(np.array(line_arr), train_weights2)) != int(curr_line[8]):
            error_count2 += 1
    right_rate = 1 - (error_count0 / num_test_vec)
    print('Gradient ascent accuracy: {}'.format(right_rate))
    right_rate = 1 - (error_count1 / num_test_vec)
    print('Stochastic gradient ascent accuracy: {}'.format(right_rate))
    right_rate = 1 - (error_count2 / num_test_vec)
    print('Improved stochastic gradient ascent accuracy: {}'.format(right_rate))
    return
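A minimal way to run the whole experiment, assuming the two data files exist at the paths used above and that sigmoid() is defined as sketched earlier:

if __name__ == '__main__':
    # the two stochastic methods draw samples randomly, so the printed
    # accuracies can vary slightly from run to run
    colic_test()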

Run result:
[Screenshot of the printed accuracies omitted]

Summary

  • Pay attention to normalization during data preprocessing
    In the data set, if a considerable part of the values are too large, evaluating the sigmoid function will overflow. After normalization all data lie between 0 and 1 and the characteristics represented by the data are preserved. Without normalization, the problem can also be avoided by evaluating the sigmoid function in a numerically stable way; a sketch of such a variant follows this list.
  • Advantages of Gradient Descent Algorithm
    Besides gradient descent, unconstrained optimization algorithms in machine learning include the least squares method as well as Newton's method and quasi-Newton methods. Comparing gradient descent with least squares: gradient descent requires choosing a step size, while least squares does not; gradient descent is an iterative solution, while least squares is an analytical (closed-form) solution. If the sample size is not too large and an analytical solution exists, least squares has the advantage and is very fast. However, when the sample size is large, obtaining the analytical solution becomes difficult or slow because it requires inverting a very large matrix, and the iterative gradient descent method becomes more advantageous.
  • Improved stochastic gradient ascent algorithm
    As described in section 2 above, adjusting alpha on every iteration and updating the coefficients with randomly selected samples reduce the fluctuations of plain stochastic gradient ascent, and the number of iterations can be passed as the third parameter (default 150).
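As mentioned in the first summary point, overflow can also be avoided without normalization by evaluating the sigmoid in a numerically stable form. A minimal sketch, not part of the original code (stable_sigmoid is a name chosen here for illustration):

import numpy as np

def stable_sigmoid(z):
    # evaluates 1 / (1 + e^(-z)) element-wise without forming e^(-z) for
    # large negative z, so the exponent passed to np.exp is never positive
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))   # [0.  0.5 1. ], no overflow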

Code link:
Link: https://pan.baidu.com/s/1uQqJsQfOhTl5kpVol3-vTw?pwd=wrwg
Extraction code: wrwg

Original post: blog.csdn.net/chenxingxingxing/article/details/128163802