Python Data Analysis: Logistic Regression

Logistic regression (LR for short) maps a set of input features to the probabilities of the two classes, 0 and 1.
  • Pros: low computational cost; easy to understand and implement
  • Cons: prone to underfitting; classification accuracy may not be high
  • Applicable data: numeric and nominal values
Basic model:
  • Training sample:
    $X = (x_{0}, x_{1}, x_{2}, \ldots, x_{n})$

  • Parameters to learn:
    $\Theta = (\theta_{0}, \theta_{1}, \theta_{2}, \ldots, \theta_{n})$

$Z = \theta_{0} x_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \cdots + \theta_{n} x_{n}$

  • Vector form:
    $Z = \Theta^{T} X$

  • The sigmoid function maps the linear output into a non-linear (0, 1) range:
    $g(Z) = \dfrac{1}{1 + e^{-Z}}$
    [Figure: the S-shaped sigmoid curve]

  • Prediction function:
    $h_{\theta}(X) = g(\Theta^{T} X) = \dfrac{1}{1 + e^{-\Theta^{T} X}}$

  • In probability form:

    • Positive sample:
      $h_{\theta}(X) = P(y = 1 \mid X; \Theta)$

    • Negative sample:
      $1 - h_{\theta}(X) = P(y = 0 \mid X; \Theta)$

  • Loss function:

$\operatorname{cost}\left(h_{\theta}(x), y\right) = \begin{cases} -\log\left(h_{\theta}(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_{\theta}(x)\right) & \text{if } y = 0 \end{cases}$

$\operatorname{cost}\left(h_{\theta}(x), y\right) = \sum_{i=1}^{m} \left[ -y_{i} \log\left(h_{\theta}(x^{(i)})\right) - (1 - y_{i}) \log\left(1 - h_{\theta}(x^{(i)})\right) \right]$

  • Goal: use the training samples to find the parameters $\Theta$ that minimize the loss function

  • Solution: gradient descent. Differentiating the loss with respect to each $\theta_j$ gives the update rule:

    $\theta_{j} := \theta_{j} - \alpha \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right) x_{j}^{(i)}, \quad j = 0, \ldots, n$

    • $\alpha$ is the learning rate
    • Update all $\theta_j$ simultaneously
    • Iterate the update until convergence (a minimal NumPy sketch of this loop follows the list below)
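
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for logistic regression. It is only an illustrative example, not the implementation given in the next section (which optimizes with L-BFGS-B instead); the names sigmoid, gradient_descent, alpha and n_iters are chosen for this sketch.

import numpy as np

def sigmoid(z):
    # g(Z) = 1 / (1 + e^(-Z))
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}
    m = X.shape[0]
    X = np.hstack((np.ones((m, 1)), X))   # prepend x_0 = 1 for the intercept
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X.dot(theta))         # h_theta(x^(i)) for every sample
        grad = X.T.dot(h - y)             # sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
        theta = theta - alpha * grad / m  # simultaneous update of all theta_j
                                          # (dividing by m only rescales the step relative to the formula above)
    return theta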

Code implementation:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b


class LogisticRegression(object):
    """
        Binary logistic regression classifier, fitted with L-BFGS-B.
    """
    def __init__(self, c=1.):
        self.c = c

    def fit(self, X, y):
        """
            Fit the model by minimizing the regularized negative log-likelihood.
        """
        self._beta = np.zeros(X.shape[1] + 1)  # one weight per feature plus the intercept

        # minimize the cost function with L-BFGS-B
        result = fmin_l_bfgs_b(cost_func,               # objective: returns (value, gradient)
                               self._beta,              # initial parameter vector
                               args=(X, y, self.c))     # extra arguments forwarded to cost_func

        self._beta = result[0]
        return self

    def predict(self, X):
        """
            Predict class labels (0 or 1).
        """
        return np.argmax(self.predict_proba(X), axis=1)

    def predict_proba(self, X):
        """
            Predict class probabilities, returned as columns [P(y=0), P(y=1)].
        """
        X = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend the intercept column x_0 = 1
        XBeta = np.dot(X, self._beta).reshape((-1, 1))

        probs = 1. / (1. + np.exp(-XBeta))
        return np.hstack((1 - probs, probs))


def cost_func(beta, X, y, C):
    """
        Cost function / objective.
        Returns the regularized negative log-likelihood and its gradient.
    """

    # prepend a column of ones to X for the intercept term
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    # reshape y into a column vector
    y = y.reshape((-1, 1))

    # precompute X @ beta
    XBeta = np.dot(X, beta).reshape((-1, 1))

    # precompute exp(X @ beta)
    exp_XBeta = np.exp(XBeta)

    # regularized negative log-likelihood (scalar):
    #   C * sum_i [ log(1 + exp(x_i . beta)) - y_i * (x_i . beta) ] + 0.5 * ||beta||^2
    neg_ll = C * np.sum(np.log(1. + exp_XBeta) - y * XBeta) + 0.5 * np.inner(beta, beta)

    # gradient of the regularized negative log-likelihood:
    #   C * sum_i [ (sigmoid(x_i . beta) - y_i) * x_i ] + beta
    grad_neg_ll = C * np.sum((1. / (1. + exp_XBeta)) * exp_XBeta * X - y * X, axis=0) + beta

    return neg_ll, grad_neg_ll


def cal_acc(true_labels, pred_labels):
    """
        Compute classification accuracy.
    """
    n_total = len(true_labels)
    correct_list = [true_labels[i] == pred_labels[i] for i in range(n_total)]

    acc = sum(correct_list) / n_total
    return acc
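
A minimal usage sketch of the class above, on a tiny synthetic dataset (the data and the variable names X_train and y_train are made up purely for illustration):

import numpy as np

rng = np.random.RandomState(0)
# two Gaussian blobs as stand-in data: class 0 around (-2, -2), class 1 around (+2, +2)
X_train = np.vstack((rng.randn(50, 2) - 2, rng.randn(50, 2) + 2))
y_train = np.hstack((np.zeros(50), np.ones(50)))

clf = LogisticRegression(c=1.).fit(X_train, y_train)
pred_labels = clf.predict(X_train)
print("train accuracy:", cal_acc(y_train, pred_labels))
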
Calling logistic regression from sklearn:
sklearn.linear_model.LogisticRegression
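
For comparison, a brief sketch of the scikit-learn call (independent of the class defined above); the parameter values shown are just illustrative defaults:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, solver='lbfgs')   # C is the inverse regularization strength
clf.fit(X_train, y_train)                         # X_train, y_train as in the example above
print(clf.predict_proba(X_train)[:5])             # P(y=0), P(y=1) for the first five samples
print(clf.score(X_train, y_train))                # mean accuracy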

Reprinted from blog.csdn.net/weixin_41792682/article/details/89639993