10 logistic regression

Classification algorithms - logistic regression

Scenarios (binary)

  • Ad click-through rate (CTR) prediction: a typical two-class problem, click or no click
  • Whether an email is spam
  • Whether a patient is sick
  • Financial fraud
  • Whether an account is fake

Logistic regression defined

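  1. Logistic regression: a classification algorithm that takes the output of a linear regression equation as its input and converts it into a probability with the sigmoid function.
  2. Sigmoid function: g(z) = 1 / (1 + e^(-z)). It maps any input value z into the interval (0, 1), which can be read as a probability value.

  3. Logistic regression formula: h(x) = g(w^T x + b) = 1 / (1 + e^(-(w^T x + b))), where w and b are the weights and bias of the linear regression part.

Linear regression output -> sigmoid conversion -> probability value in (0, 1) -> classification. The threshold is generally 0.5.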
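
The pipeline above can be sketched in a few lines of plain Python. This is only a minimal illustration: the weights, bias and sample values are made up, and the helper names `predict_proba` and `predict_class` are hypothetical, not part of any library.

```python
import math

def sigmoid(z):
    """Map a real number z into (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, features):
    """Linear regression output passed through the sigmoid: P(class 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_class(weights, bias, features, threshold=0.5):
    """Return 1 if P(class 1) reaches the threshold, otherwise 0."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical weights and sample, chosen only for illustration
weights, bias = [0.8, -0.4], 0.1
sample = [2.0, 1.0]

print(predict_proba(weights, bias, sample))   # probability of class 1
print(predict_class(weights, bias, sample))   # class label at threshold 0.5
```

Raising or lowering `threshold` trades false positives against false negatives without retraining the model.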

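  4. Loss function and optimization of logistic regression
  • The principle is the same as in linear regression, but because this is a classification problem the loss function is different. It has no closed-form solution, so it can only be minimized iteratively, typically with gradient descent.
  • Log-likelihood loss: cost = -log(h(x)) when y = 1, and cost = -log(1 - h(x)) when y = 0, where h(x) is the predicted probability of class 1.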

When y = 1 is the target value, the closer the predicted probability of class 1 is to 100%, the smaller the loss (close to 0).
Note: whichever category has the smaller amount of data is treated as the positive example, labeled 1.

When class 0 is the target value, the greater the predicted probability of class 1, the greater the loss. (The lower the predicted probability of class 1, the better.)
Logistic regression only determines the probability of belonging to one category, class 1. If that probability is small (below the threshold), the sample is assigned to the other category, class 0.

  • Example (complete loss function: the smaller the total loss, the higher the accuracy):
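
As a rough illustration of how the log-likelihood loss behaves, the sketch below sums the per-sample loss for two sets of predictions. The labels and probabilities are invented for the example, and the function names are hypothetical.

```python
import math

def log_loss_single(y_true, p_class1, eps=1e-15):
    """Log-likelihood loss for one sample; p_class1 is the predicted P(class 1)."""
    p = min(max(p_class1, eps), 1 - eps)  # clip to avoid log(0)
    return -math.log(p) if y_true == 1 else -math.log(1 - p)

def total_log_loss(y_true, p_class1):
    """Sum of the per-sample losses."""
    return sum(log_loss_single(y, p) for y, p in zip(y_true, p_class1))

# Hypothetical target values and predicted probabilities of class 1
y = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.4, 0.7, 0.3, 0.6]

print(total_log_loss(y, confident_and_right))  # small total loss
print(total_log_loss(y, mostly_wrong))         # much larger total loss
```

Predictions that put high probability on the true class give a small total loss, while predictions leaning toward the wrong class are penalized heavily.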

Loss function

Mean square error

  • Has only one minimum; there are no additional local minima

Log-likelihood function

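  • Gradient descent on this loss may end at a local minimum instead of the global minimum, although the result is usually still good enough in practice. (For plain logistic regression the loss is in fact convex; local minima matter when the same loss is used with non-convex models.)
  • Ways to improve:
    • Initialize randomly several times and compare the resulting minima
    • Adjust the learning rate during the solution process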
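
The two improvement ideas can be sketched in plain Python as follows. This is an illustrative toy, not the solver scikit-learn uses: the data, the decay schedule `learning_rate / (1 + decay * t)` and all function names are assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mean_log_loss(w, b, X, y, eps=1e-15):
    """Average log-likelihood loss over all samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -math.log(p) if yi == 1 else -math.log(1 - p)
    return total / len(X)

def gradient_descent(X, y, rng, learning_rate=0.5, decay=0.01, n_iters=200):
    """Batch gradient descent starting from a random point."""
    w = [rng.uniform(-1, 1) for _ in X[0]]
    b = rng.uniform(-1, 1)
    n = len(X)
    for t in range(n_iters):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        step = learning_rate / (1.0 + decay * t)  # shrink the step over time
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def best_of_restarts(X, y, n_restarts=5, seed=0):
    """Run several random initializations and keep the lowest final loss."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        w, b = gradient_descent(X, y, rng)
        loss = mean_log_loss(w, b, X, y)
        if best is None or loss < best[0]:
            best = (loss, w, b)
    return best

# Hypothetical, already centered one-feature data
X = [[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]]
y = [0, 0, 0, 1, 1, 1]
best_loss, best_w, best_b = best_of_restarts(X, y)
print(best_loss, best_w, best_b)
```

`n_restarts` corresponds to the random initialization idea, and `learning_rate` together with `decay` to the learning rate adjustment.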

Logistic regression Case

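Benign / malignant breast tumor data

  • Download raw data: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

  • Data description
    (1) 699 samples with 11 columns in total: the first column is the sample id, the next 9 columns are
    tumor-related medical features, and the last column is a numerical value indicating the tumor type (2 = benign, 4 = malignant).
    (2) Contains 16 missing values, marked with "?".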

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

def logistic():
    """
    Predict benign / malignant breast tumors with logistic regression.
    :return: None
    """
    # 1. Read the data
    # Build the column names
    column = ['Sample code number','Clump Thickness', 'Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', names=column)

    # 2. Handle missing values
    data = data.replace(to_replace='?', value=np.nan)
    # Drop rows containing NaN, then convert back to integers
    # (the column with '?' was read as strings)
    data = data.dropna().astype(int)

    # 3. Split the data into features (columns 1-9) and target (column 10)
    x_train, x_test, y_train, y_test = train_test_split(data[column[1:10]], data[column[10]], test_size=0.25)

    # 4. Standardize (classification problem: the target values are not standardized)
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # 5. Logistic regression prediction
    lg = LogisticRegression(C=1.0)
    lg.fit(x_train, y_train)  # train the lg model
    y_predict = lg.predict(x_test)
    print('Accuracy:', lg.score(x_test, y_test))
    print('Classification report (precision / recall):', classification_report(y_test, y_predict, labels=[2,4], target_names=['benign','malignant']))

if __name__ == '__main__':
    logistic()
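
The case prints a classification report because, for tumor data, accuracy alone can hide missed malignant cases. The sketch below shows how recall and precision for the malignant class are computed. The label lists are invented for illustration (2 = benign, 4 = malignant, as in the `labels=[2,4]` call), and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, fn

def recall(y_true, y_pred, positive):
    """Share of truly positive samples that were predicted positive."""
    tp, _, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(y_true, y_pred, positive):
    """Share of predicted positives that are truly positive."""
    tp, fp, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

def accuracy(y_true, y_pred):
    """Share of all samples predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical labels: 2 = benign, 4 = malignant
y_test = [2, 2, 4, 4, 4, 2, 4, 2]
y_predict = [2, 2, 4, 2, 4, 2, 4, 2]

print('Accuracy:', accuracy(y_test, y_predict))                 # 0.875
print('Malignant recall:', recall(y_test, y_predict, 4))        # 0.75
print('Malignant precision:', precision(y_test, y_predict, 4))  # 1.0
```

Here the accuracy looks high, yet one in four malignant samples is missed, which is what the recall column of the report exposes.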

Logistic regression summary

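  1. Applications: ad click-through rate prediction, whether a patient is sick, financial fraud, whether an account is fake (binary classification problems where a probability is needed)
  2. Advantages: suited to scenarios that need a classification probability; simple and fast
  3. Disadvantages: inconvenient for multi-class problems: 1 vs 1, 1 vs


  • Whether there is a prior probability ( P(C) ), that is, whether historical data needs to be summarized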

Origin www.cnblogs.com/hp-lake/p/11979505.html