One, sklearn logistic regression classes
In sklearn, logistic regression models are built mainly with two classes, LogisticRegression and LogisticRegressionCV. The only difference between them is that LogisticRegressionCV selects the regularization factor C by cross-validation. Both classes are described below (important parameters are marked with **):
sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
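- penalty **: 'l1' or 'l2'; the norm used for the regularization term. Note that 'l1' does not apply to the MvM (many-vs-many, i.e. multi_class='multinomial') case, because among the solvers listed for this class only 'liblinear' supports 'l1', and it does not support the multinomial case.
- dual: a Boolean. If True, the dual formulation is solved (a dual formulation exists only for penalty='l2' with solver='liblinear'); if False, the primal formulation is solved. You generally do not need to change this parameter.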
- C **: a floating-point number giving the inverse of the regularization strength. The smaller the value, the stronger the regularization. A value of 10 is suggested.
- fit_intercept: a Boolean specifying whether to fit an intercept. If False, the intercept b is not computed (the model assumes your data is already centered). An intercept is generally needed, so this can be left alone.
- intercept_scaling: a floating-point number, meaningful only when solver='liblinear' and fit_intercept=True. In that case a synthetic feature with constant value equal to intercept_scaling is appended to each sample, and the intercept b is obtained from the weight of that feature. Because the weight of the synthetic feature is regularized like any other weight, intercept_scaling can be increased to lessen the effect of regularization on the intercept. You generally do not need to bother with it.
- class_weight **: the weight of each class of y, given as a dictionary or the string 'balanced'. If a dictionary, it gives the weight of each class, such as {class_1: 0.4, class_2: 0.6}. If 'balanced', the weights are computed automatically, inversely proportional to the class frequencies, so the fewer samples a class has, the higher its weight. If not specified, every class has weight 1.
- random_state: an integer, a RandomState instance, or None. • If an integer, it is the seed of the random number generator. • If a RandomState instance, it is the random number generator itself. • If None, the default random number generator is used. You generally do not need to bother with it.
- solver **: a string specifying the optimization algorithm. It can take the following values. 'newton-cg': Newton's method. 'lbfgs': the L-BFGS quasi-Newton method. 'liblinear': the liblinear library. 'sag': the Stochastic Average Gradient descent algorithm. Note: for small data sets 'liblinear' is more appropriate; for large data sets 'sag' is more appropriate. 'newton-cg', 'lbfgs', and 'sag' handle only penalty='l2'.
- max_iter **: an integer specifying the maximum number of iterations. Keep the default or tune it yourself.
- multi_class **: a string specifying the strategy for multi-class problems. • 'ovr': the one-vs-rest strategy. • 'multinomial': fits a multi-class logistic regression directly, using the softmax function. Note that y is still passed as a 1-D array of class labels; sklearn encodes it internally, so you do not need to one-hot encode it yourself.
- verbose: an integer. 0 turns off the log output of intermediate iterations, and a positive number turns it on.
- warm_start: a Boolean. If True, training continues from the result of the previous fit; otherwise training starts from scratch.
- n_jobs: an integer, the number of CPU cores used to run tasks in parallel. If -1, all CPU cores are used.
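To see how these parameters fit together, here is a minimal sketch. The synthetic imbalanced data set from make_classification, the chosen parameter values, and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary data (an assumption for illustration only).
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=0
)

clf = LogisticRegression(
    penalty="l2",             # L2 regularization
    C=10.0,                   # smaller C means stronger regularization
    class_weight="balanced",  # give the rarer class a higher weight
    solver="liblinear",       # suited to small data sets
    max_iter=200,
    random_state=0,
)
clf.fit(X, y)

print(clf.coef_.shape)  # one row of weights for a binary problem
print(clf.intercept_)   # the fitted intercept b
```

The fitted weights live in clf.coef_ and the intercept in clf.intercept_; changing C or class_weight and refitting is a quick way to see their effect.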
For the other class, only the parameters that differ are described:
sklearn.linear_model.LogisticRegressionCV(Cs=10, fit_intercept=True, cv=None, dual=False, penalty='l2', scoring=None, solver='lbfgs', tol=0.0001, max_iter=100, class_weight=None, n_jobs=1, verbose=0, refit=True, intercept_scaling=1.0, multi_class='ovr', random_state=None)
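- Cs **: the candidate values of the regularization factor C. If an integer, that many values are chosen on a logarithmic scale between 1e-4 and 1e4. The default is recommended.
- cv **: the cross-validation strategy. By default, stratified k-fold cross-validation (Stratified K-Folds) is used; if an integer, it specifies the number of folds.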
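A short sketch of LogisticRegressionCV choosing C by cross-validation is shown here. The synthetic data, the values Cs=5 and cv=3, and the larger max_iter are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Cs=5 asks for 5 candidate values of C on a logarithmic scale;
# cv=3 asks for 3-fold stratified cross-validation.
clf_cv = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000)
clf_cv.fit(X, y)

print(clf_cv.Cs_)  # the candidate C values that were tried
print(clf_cv.C_)   # the C selected by cross-validation
```

After fitting, the model can be used like a plain LogisticRegression, with the selected C already applied.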
Suppose we create a model with model = sklearn.linear_model.LogisticRegression(). If you want to plot the ROC curve and compute the AUC, pay attention to the following:
- model.predict() returns class labels, for example 0 or 1 in a binary problem; it does not return the probabilities predicted by the logistic regression.
- To obtain the predicted probabilities, use model.predict_proba(x_test). This method returns an array of shape (i, j), where i is the number of samples and j is the number of classes; element ij is the probability that the i-th sample belongs to class j, and the probabilities of all classes for the i-th sample sum to 1.
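The difference between the two methods can be checked with a small sketch. The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

labels = model.predict(X[:5])       # hard class labels, shape (5,)
proba = model.predict_proba(X[:5])  # class probabilities, shape (5, 2)

print(labels)
print(proba.sum(axis=1))            # each row sums to 1

# Columns follow model.classes_, so column 1 is the probability of class 1.
pos_scores = proba[:, 1]
```

For ROC and AUC, pos_scores (the probability of the positive class) is what should be passed on, not labels.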
Two, sklearn functions related to ROC and KS
1. sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
- y_true: the true binary labels, 0 or 1.
- y_score: the predicted scores for y, for example the predicted probability of the positive class.
- average: how the returned value is averaged; one of None, 'micro', 'macro' (default), 'samples', 'weighted'.
- max_fpr: the maximum fpr. If set, the standardized partial AUC over the range [0, max_fpr] is returned. It can be None.
Return value: the AUC as a float, i.e. the area under the ROC curve.
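A tiny worked example of roc_auc_score follows; the four labels and scores are made-up toy values chosen so the result is easy to verify by hand.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (made up for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Of the four (positive, negative) pairs, three are ranked correctly (only 0.35 versus 0.4 is not), which gives 3/4 = 0.75.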
2. sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
- y_true: the true binary labels, 0 or 1.
- y_score: the predicted scores for y, for example the predicted probability of the positive class.
- pos_label: the label of the positive class. If it is None and the labels are {0, 1} or {-1, 1}, label 1 is treated as the positive class.
- sample_weight: sample weights.
- drop_intermediate: whether to drop suboptimal thresholds that would not appear on a plotted ROC curve; dropping them gives a lighter ROC curve.
Return value: fpr, tpr, thresholds. Three arrays are returned: the thresholds, and the fpr and tpr at each threshold. They can be used to draw the ROC curve and to find the threshold corresponding to the maximum KS.
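The KS value is the largest gap between tpr and fpr across the thresholds. A minimal sketch, using made-up toy labels and scores and NumPy's argmax in place of an explicit loop, could look like this:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores (made up for illustration).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# KS is the largest gap between tpr and fpr over all thresholds.
ks_values = tpr - fpr
best_idx = int(np.argmax(ks_values))  # first index where the gap is largest
ks_max = ks_values[best_idx]
best_thr = thresholds[best_idx]

print(ks_max)    # 0.5
print(best_thr)  # 0.8
```

When several thresholds tie for the maximum gap, as they do here, argmax returns the first one, which is the highest threshold because roc_curve returns thresholds in decreasing order.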
Three, the sample code
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris
    from sklearn.metrics import roc_auc_score, roc_curve
    import matplotlib.pyplot as plt

    iris = load_iris()
    # Turn the three-class iris data into a binary problem:
    # label 1 is merged into label 0, then label 2 becomes label 1.
    # The order matters: 1 -> 0 must run before 2 -> 1.
    iris.target[iris.target == 1] = 0
    iris.target[iris.target == 2] = 1

    # Split into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

    # Create the model. The target is binary, so multi_class is not needed here
    # (the parameter is deprecated in recent scikit-learn releases).
    model = LogisticRegression(solver='newton-cg')
    model.fit(x_train, y_train)  # pass in the training data

    # Predicted probabilities for the test data: an array of shape (i, j),
    # where i is the number of samples and j is the number of classes;
    # element ij is the probability that the i-th sample belongs to class j,
    # and each row sums to 1.
    # model.predict() cannot be used here, because it outputs 0 or 1 rather than
    # probabilities, so the ROC curve could not be computed from it.
    # model._predict_proba_lr (a private method) can also compute the probabilities.
    y_pre = model.predict_proba(x_test)
    # Take the second column: the first column is the probability of class 0,
    # the second column is the probability of class 1.
    y_0 = list(y_pre[:, 1])

    fpr, tpr, thresholds = roc_curve(y_test, y_0)  # compute fpr, tpr, thresholds
    auc = roc_auc_score(y_test, y_0)               # compute the AUC

    # Plot the curve
    plt.figure()
    plt.plot(fpr, tpr)
    plt.title('ROC curve')
    plt.show()

    # Compute KS
    KS_max = 0
    best_thr = 0
    for i in range(len(fpr)):
        if i == 0:
            KS_max = tpr[i] - fpr[i]
            best_thr = thresholds[i]
        elif tpr[i] - fpr[i] > KS_max:
            KS_max = tpr[i] - fpr[i]
            best_thr = thresholds[i]

    print('maximum of KS: ', KS_max)
    print('best threshold: ', best_thr)