Python Data Analysis: Logistic Regression

Logistic regression (LR for short) maps a set of input features to the probabilities of the two classes, 0 and 1.
  • Pros: low computational cost; easy to understand and implement
  • Cons: prone to underfitting; classification accuracy may not be high
  • Applicable data: numeric and nominal values
Basic model:
  • Training sample:
    $X = (x_{0}, x_{1}, x_{2}, \ldots, x_{n})$

  • Parameters to learn:
    $\Theta = (\theta_{0}, \theta_{1}, \theta_{2}, \ldots, \theta_{n})$

$Z = \theta_{0} x_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \cdots + \theta_{n} x_{n}$

  • Vector form:
    $Z = \Theta^{T} X$

  • The sigmoid function maps the linear output into a non-linear (0, 1) range:
    $g(Z) = \dfrac{1}{1 + e^{-Z}}$
    [Figure: the S-shaped sigmoid curve]

  • Prediction function:
    $h_{\theta}(X) = g(\Theta^{T} X) = \dfrac{1}{1 + e^{-\Theta^{T} X}}$

  • In probability form:

    • Positive sample:
      $h_{\theta}(X) = P(y = 1 \mid X; \Theta)$

    • Negative sample:
      $1 - h_{\theta}(X) = P(y = 0 \mid X; \Theta)$

  • Loss function:

$\operatorname{cost}\left(h_{\theta}(x), y\right) = \begin{cases} -\log\left(h_{\theta}(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_{\theta}(x)\right) & \text{if } y = 0 \end{cases}$

$\operatorname{cost}\left(h_{\theta}(x), y\right) = \sum_{i=1}^{m} \left[ -y_{i} \log\left(h_{\theta}(x^{(i)})\right) - (1 - y_{i}) \log\left(1 - h_{\theta}(x^{(i)})\right) \right]$

  • Goal: use the training samples to find the parameters $\Theta$ that minimize the loss function

  • Solution: gradient descent. Differentiating the loss with respect to each $\theta_j$ gives the update rule:

    $\theta_{j} := \theta_{j} - \alpha \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right) x_{j}^{(i)}, \quad j = 0, \ldots, n$

    • $\alpha$ is the learning rate
    • Update all $\theta_j$ simultaneously
    • Iterate the update until convergence (a minimal NumPy sketch of this loop follows the list below)
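
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for logistic regression. It is only an illustrative example, not the implementation given in the next section (which optimizes with L-BFGS-B instead); the names sigmoid, gradient_descent, alpha and n_iters are chosen for this sketch.

import numpy as np

def sigmoid(z):
    # g(Z) = 1 / (1 + e^(-Z))
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}
    m = X.shape[0]
    X = np.hstack((np.ones((m, 1)), X))   # prepend x_0 = 1 for the intercept
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X.dot(theta))         # h_theta(x^(i)) for every sample
        grad = X.T.dot(h - y)             # sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
        theta = theta - alpha * grad / m  # simultaneous update of all theta_j
                                          # (dividing by m only rescales the step relative to the formula above)
    return theta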

Code implementation:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b


class LogisticRegression(object):
    """
        Binary logistic regression classifier, fitted with L-BFGS-B.
    """
    def __init__(self, c=1.):
        self.c = c

    def fit(self, X, y):
        """
            Fit the model by minimizing the regularized negative log-likelihood.
        """
        self._beta = np.zeros(X.shape[1] + 1)  # one weight per feature plus the intercept

        # minimize the cost function with L-BFGS-B
        result = fmin_l_bfgs_b(cost_func,               # objective: returns (value, gradient)
                               self._beta,              # initial parameter vector
                               args=(X, y, self.c))     # extra arguments forwarded to cost_func

        self._beta = result[0]
        return self

    def predict(self, X):
        """
            Predict class labels (0 or 1).
        """
        return np.argmax(self.predict_proba(X), axis=1)

    def predict_proba(self, X):
        """
            Predict class probabilities, returned as columns [P(y=0), P(y=1)].
        """
        X = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend the intercept column x_0 = 1
        XBeta = np.dot(X, self._beta).reshape((-1, 1))

        probs = 1. / (1. + np.exp(-XBeta))
        return np.hstack((1 - probs, probs))


def cost_func(beta, X, y, C):
    """
        Cost function / objective.
        Returns the regularized negative log-likelihood and its gradient.
    """

    # prepend a column of ones to X for the intercept term
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    # reshape y into a column vector
    y = y.reshape((-1, 1))

    # precompute X @ beta
    XBeta = np.dot(X, beta).reshape((-1, 1))

    # precompute exp(X @ beta)
    exp_XBeta = np.exp(XBeta)

    # regularized negative log-likelihood (scalar):
    #   C * sum_i [ log(1 + exp(x_i . beta)) - y_i * (x_i . beta) ] + 0.5 * ||beta||^2
    neg_ll = C * np.sum(np.log(1. + exp_XBeta) - y * XBeta) + 0.5 * np.inner(beta, beta)

    # gradient of the regularized negative log-likelihood:
    #   C * sum_i [ (sigmoid(x_i . beta) - y_i) * x_i ] + beta
    grad_neg_ll = C * np.sum((1. / (1. + exp_XBeta)) * exp_XBeta * X - y * X, axis=0) + beta

    return neg_ll, grad_neg_ll


def cal_acc(true_labels, pred_labels):
    """
        Compute classification accuracy.
    """
    n_total = len(true_labels)
    correct_list = [true_labels[i] == pred_labels[i] for i in range(n_total)]

    acc = sum(correct_list) / n_total
    return acc
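
A minimal usage sketch of the class above, on a tiny synthetic dataset (the data and the variable names X_train and y_train are made up purely for illustration):

import numpy as np

rng = np.random.RandomState(0)
# two Gaussian blobs as stand-in data: class 0 around (-2, -2), class 1 around (+2, +2)
X_train = np.vstack((rng.randn(50, 2) - 2, rng.randn(50, 2) + 2))
y_train = np.hstack((np.zeros(50), np.ones(50)))

clf = LogisticRegression(c=1.).fit(X_train, y_train)
pred_labels = clf.predict(X_train)
print("train accuracy:", cal_acc(y_train, pred_labels))
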
Calling logistic regression from sklearn:
sklearn.linear_model.LogisticRegression
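
For comparison, a brief sketch of the scikit-learn call (independent of the class defined above); the parameter values shown are just illustrative defaults:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, solver='lbfgs')   # C is the inverse regularization strength
clf.fit(X_train, y_train)                         # X_train, y_train as in the example above
print(clf.predict_proba(X_train)[:5])             # P(y=0), P(y=1) for the first five samples
print(clf.score(X_train, y_train))                # mean accuracy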

Reprinted from blog.csdn.net/weixin_41792682/article/details/89639993