sklearn: logistic regression, the ROC curve, and the KS curve

I. The sklearn logistic regression classes

  In sklearn, logistic regression models are built mainly with two classes, LogisticRegression and LogisticRegressionCV, which differ only in cross-validation and in how the regularization coefficient C is handled. Both are described below; important parameters are marked with **, and a minimal construction sketch follows the parameter list:

  

  sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

  • penalty ** : 'l1' or 'l2', the type of regularization. Note that 'l1' cannot be used in the multinomial (MvM) case.
  • dual: a Boolean. If True, the dual formulation is solved (only penalty='l2' with solver='liblinear' has a dual form); if False, the primal formulation is solved. You generally do not need to care about this parameter.
  • C ** : a float, the inverse of the regularization strength: the smaller the value, the stronger the regularization. The default is recommended.
  • fit_intercept: a Boolean, whether to fit an intercept. If False, the intercept b is not computed (the model assumes your data are already centered). An intercept is usually needed, so this can be left alone.
  • intercept_scaling: a float, only meaningful when solver='liblinear'. When fit_intercept is True, a synthetic constant feature is appended whose weight plays the role of the intercept b. Because this synthetic feature is also included in the regularization term, intercept_scaling can be adjusted to reduce the influence of regularization on it. You do not need to bother with it.
  • class_weight ** : weights for the classes of y, a dict or the string 'balanced'. If a dict, each class is explicitly assigned a weight, e.g. {class_1: 0.4, class_2: 0.6}. If 'balanced', the weights are computed automatically: the more samples a class has, the smaller its weight. If not specified, every class gets weight 1.
  • random_state: an integer, a RandomState instance, or None. If an integer, it seeds the random number generator; if a RandomState instance, that generator is used; if None, the default random number generator is used. Generally nothing to worry about.
  • solver ** : a string specifying the optimization algorithm; it can take the following values. 'newton-cg': Newton's method. 'lbfgs': the L-BFGS quasi-Newton method. 'liblinear': the liblinear solver. 'sag': Stochastic Average Gradient descent. Note: 'liblinear' is better suited to small datasets and 'sag' to large ones, and 'newton-cg', 'lbfgs', and 'sag' only handle penalty='l2'.
  • max_iter ** : an integer, the maximum number of iterations. Use the default or tune it yourself.
  • multi_class ** : a string specifying the strategy for multiclass problems; it can take the following values. 'ovr': the one-vs-rest strategy. 'multinomial': direct multiclass logistic regression, which sklearn implements with the softmax function; note that the multiclass y is one-hot encoded internally for this loss.
  • verbose: a positive integer, turns log output of intermediate iterations on or off.
  • warm_start: a Boolean. If True, training continues from the result of the previous fit; otherwise it trains from scratch.
  • n_jobs: an integer, the number of CPUs used for parallel tasks. If -1, all CPUs are used.
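
  As a quick illustration, here is a minimal construction sketch using a few of the important parameters above (the specific values are arbitrary choices for demonstration, not recommendations):

from sklearn.linear_model import LogisticRegression

# sketch: L2-regularized logistic regression with stronger
# regularization (C < 1) and automatically balanced class weights
model = LogisticRegression(penalty='l2', C=0.5, class_weight='balanced',
                           solver='lbfgs', max_iter=200)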

  

  For the other class, only the parameters that differ from LogisticRegression are described (a short sketch follows the list):

  sklearn.linear_model.LogisticRegressionCV(Cs=10, fit_intercept=True, cv=None, dual=False, penalty='l2', scoring=None, solver='lbfgs', tol=0.0001, max_iter=100, class_weight=None, n_jobs=1, verbose=0, refit=True, intercept_scaling=1.0, multi_class='ovr', random_state=None)

  • Cs ** : the candidate regularization strengths (the counterpart of C); an integer gives the number of values searched on a log scale, or a list of floats can be passed. The default is recommended.
  • cv ** : the cross-validation strategy; the default is stratified k-fold cross-validation (Stratified K-Folds). If an integer is given, it specifies the number of folds.
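
  For example, a minimal sketch of the cross-validated variant (the values are assumptions for illustration: Cs=10 searches 10 candidate values, cv=5 requests stratified 5-fold cross-validation):

from sklearn.linear_model import LogisticRegressionCV

# sketch: search 10 candidate regularization strengths with 5-fold CV
model_cv = LogisticRegressionCV(Cs=10, cv=5, penalty='l2', solver='lbfgs')
# after fitting, model_cv.C_ holds the chosen regularization strength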

 

  Suppose we create a model with model = sklearn.linear_model.LogisticRegression(). When computing the AUC and plotting the ROC curve, pay attention to the following points (a short sketch follows the list):

  1. model.predict() returns class labels, e.g. 0 or 1 for binary classification; it does not give the predicted probabilities of the logistic regression.
  2. To obtain the predicted probabilities, use model.predict_proba(x_test). This method returns an array with i rows and j columns, where i is the number of samples and j the number of classes; entry (i, j) is the probability that the i-th sample belongs to the j-th class, and the class probabilities of each sample sum to 1.
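
  A small sketch of the difference between the two methods (model and x_test are assumed to come from an already fitted binary classifier):

labels = model.predict(x_test)       # hard class labels, e.g. array([0, 1, 1, ...])
probs = model.predict_proba(x_test)  # shape (i, j) array of class probabilities; rows sum to 1
p_pos = probs[:, 1]                  # probability of the positive class, what ROC/AUC needs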

 

II. The ROC- and KS-related functions in sklearn

  1. sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)

  • y_true: the true binary labels (0/1).
  • y_score: the predicted scores for y (e.g. the probability of the positive class).
  • average: how the value is averaged; one of [None, 'micro', 'macro' (default), 'samples', 'weighted'].
  • max_fpr: the maximum fpr to consider; may be None.

  Return value: a float, the AUC value, i.e. the area under the ROC curve.
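
  A minimal usage sketch (y_test and p_pos are assumed to be the true labels and the positive-class probabilities from the sketch above):

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, p_pos)  # a float: the area under the ROC curve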

 

  2. sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)

  • y_true: the true binary labels (0/1).
  • y_score: the predicted scores for y.
  • pos_label: the label treated as the positive class; if the labels are {0, 1} or {-1, 1}, class 1 is taken as positive by default.
  • sample_weight: sample weights.
  • drop_intermediate: whether to drop suboptimal thresholds; removing them yields a cleaner ROC curve.

  Return value: fpr, tpr, thresholds. Three vectors are returned, holding the false positive rates, the true positive rates, and the corresponding thresholds. They can be used to draw the ROC curve and to find the threshold where KS is largest.
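
  A minimal sketch of unpacking the three vectors (same assumed y_test and p_pos as above):

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, p_pos)
# index i of the three aligned vectors gives the (fpr, tpr) point
# produced by classifying with the cutoff thresholds[i]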

 

III. Sample code

  

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import numpy as np

iris = load_iris()
# turn the three-class iris data into binary data: merge label 1 into label 0, and relabel class 2 as 1
iris.target[iris.target == 1], iris.target[iris.target == 2] = 0, 1
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)  # split into training and test sets

model = LogisticRegression(solver='newton-cg', multi_class='ovr')  # create the model
model.fit(x_train, y_train)  # fit on the training data

# predict_proba returns an i * j array: i is the number of samples, j the number of classes,
# and entry (i, j) is the probability that sample i belongs to class j; each row sums to 1.
# model.predict() cannot be used here, because it outputs 0 or 1 rather than probabilities,
# so the ROC curve could not be computed from it.
# model._predict_proba_lr can also be used to compute the LR probability values.
y_pre = model.predict_proba(x_test)

y_0 = list(y_pre[:, 1])  # take the second column: the probability of class 1 (close to 0 means class 0, close to 1 means class 1)

fpr, tpr, thresholds = roc_curve(y_test, y_0)  # compute fpr, tpr, thresholds
auc = roc_auc_score(y_test, y_0)  # compute the AUC

# plot the ROC curve
plt.figure()
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.show()

# compute KS
KS_max = 0
best_thr = 0
for i in range(len(fpr)):
    if i == 0:
        KS_max = tpr[i] - fpr[i]
        best_thr = thresholds[i]
    elif tpr[i] - fpr[i] > KS_max:
        KS_max = tpr[i] - fpr[i]
        best_thr = thresholds[i]

print('maximum KS value:', KS_max)
print('best threshold:', best_thr)
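
Since KS is simply the maximum of tpr - fpr, the loop above can also be replaced by an equivalent vectorized computation with numpy (a sketch using the fpr, tpr, thresholds arrays already returned by roc_curve):

ks_idx = np.argmax(tpr - fpr)       # position of the largest tpr-fpr gap
KS_max = tpr[ks_idx] - fpr[ks_idx]  # the KS statistic
best_thr = thresholds[ks_idx]       # the threshold that attains it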

 
