Classification model evaluation, model selection and tuning

Model evaluation criteria

  • Accuracy
    • estimator.score() is the most common metric: the proportion of correct predictions
  • Confusion matrix
    • In classification tasks, the predicted results and the true labels form four combinations (TP, FN, FP, TN), which make up the confusion matrix (this extends naturally to multi-class problems)
  • Precision
    • Of the samples predicted positive, the proportion that are truly positive, i.e. how accurate the positive predictions are: Precision = TP / (TP + FP)
  • Recall
    • Of the truly positive samples, the proportion that are predicted positive; it measures the ability to find all positive samples: Recall = TP / (TP + FN)

  • Other classification criteria
    • F1-score, reflecting the robustness of the model: F1 = 2 * Precision * Recall / (Precision + Recall). Its concrete output is shown in the code results below.

These criteria are the standard tools for model evaluation and the foundation for what follows; the short sketch after this list shows how to compute them.
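As a minimal sketch (the label vectors here are hypothetical, chosen only to illustrate the calls), these criteria map directly onto functions in sklearn.metrics:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # hypothetical true labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predicted labels

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1-score:", f1_score(y_true, y_pred))          # 2PR / (P + R)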


Classification model evaluation API

sklearn.metrics.classification_report(y_true, y_pred, target_names=None)
    y_true: true target values
    y_pred: target values predicted by the estimator
    target_names: names of the target categories
    return: precision and recall for each class

Example:

# This case uses news classification as the example
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer  # feature extraction
from sklearn.naive_bayes import MultinomialNB  # naive Bayes

# 1. Obtain the data
news = fetch_20newsgroups(subset='all')
# 2. Split the data
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)
# 3. Feature extraction on the data sets
tf = TfidfVectorizer()
# Learn the word importance statistics on the training set
x_train = tf.fit_transform(x_train)
# Apply the same word importance statistics to the test set
x_test = tf.transform(x_test)
# 4. Apply the naive Bayes algorithm
bys = MultinomialNB(alpha=0.1)
bys.fit(x_train, y_train)
# 5. Predict
predict = bys.predict(x_test)
print("Predicted article categories:", predict)
print("Accuracy:", bys.score(x_test, y_test))
# 6. Model evaluation
from sklearn.metrics import classification_report
print(classification_report(y_test, predict, target_names=news.target_names))
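Note that classification_report returns a plain-text table with one row per category, listing precision, recall, f1-score and support (the number of true samples in that class); it is wrapped in print() above so the table displays properly.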



Model selection and tuning

Cross-validation

  To make model evaluation more accurate and reliable.

  Cross-validation process:

The training data is further divided into a training set and a validation set. For example: the data is split into five parts, one of which serves as the validation set. Testing is then repeated five times, each time with a different part as the validation set. This yields five sets of model results, which are averaged to give the final result. This procedure is known as 5-fold cross-validation.
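A minimal sketch of this procedure (using the iris data set and a k-nearest neighbors classifier as stand-ins; neither is part of the original example), since sklearn.model_selection.cross_val_score runs k-fold cross-validation directly:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
kn = KNeighborsClassifier()

# cv=5 -> 5-fold cross-validation: one accuracy score per validation fold
scores = cross_val_score(kn, x, y, cv=5)
print(scores)         # the five per-fold scores
print(scores.mean())  # averaged as the final result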


Grid search

  Tuning parameters, which in this context are known as hyperparameters.

  Hyperparameter search: grid search


Under normal circumstances, many parameters must be specified manually (such as the k value in the k-nearest neighbors algorithm); these are called hyperparameters. Tuning them by hand is tedious, so several candidate hyperparameter combinations are preset for the model. Each combination is evaluated with cross-validation, and the combination giving the best result is used to build the final model.



Grid search and cross-validation API


sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)
    Performs an exhaustive search over the specified parameters of an estimator
    estimator: the estimator object
    param_grid: estimator parameters as a dict, e.g. {"n_neighbors": [1, 3, 5]}
    cv: number of cross-validation folds
    fit: input the training data
    score: accuracy
Result analysis:
    best_score_: best score obtained during cross-validation
    best_estimator_: the estimator with the best parameters
    cv_results_: test-set and training-set accuracy results for every cross-validation round

   Example code:

# Example code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler

# Read the data
import pandas as pd
import numpy as np

data = np.loadtxt('datingTestSet.txt', dtype=str, delimiter='\t', encoding='gbk')
df = pd.DataFrame(data)

# Feature values (first three columns, converted to numbers)
x = df.iloc[:, :3].astype(float)
# Target values
y = df[3]


# Feature engineering
# Normalization
scaler = MinMaxScaler()
x = scaler.fit_transform(x)

# Split the data set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# Instantiate the estimator
kn = KNeighborsClassifier()

# Build the candidate parameter values to search
param = {'n_neighbors': [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]}

# Grid search with cross-validation
gc = GridSearchCV(kn, param_grid=param, cv=2)
gc.fit(x_train, y_train)

# Prediction accuracy
print('Accuracy on the test set:', gc.score(x_test, y_test))
print('Best result in cross-validation:', gc.best_score_)
print('Best model selected:', gc.best_estimator_)
print('Results for each hyperparameter and each cross-validation round:', gc.cv_results_)
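A note on the settings above: cv=2 merely keeps the demo fast; cv=5 or cv=10 is more common in practice. GridSearchCV also exposes best_params_, which returns just the winning hyperparameter values rather than the whole fitted estimator.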

Origin www.cnblogs.com/luowei93/p/11964729.html