Model evaluation criteria
- Accuracy
  - estimator.score() is the most common measure: the percentage of predictions that are correct
- Confusion matrix
  - In a classification task, the predicted result and the true label can combine in four different ways (true positive, false positive, false negative, true negative); together these form the confusion matrix (also applicable to multi-class classification)
- Precision
  - The proportion of samples predicted as positive that are truly positive, i.e. how exact the positive predictions are
- Recall
  - The proportion of truly positive samples that are predicted as positive, i.e. how completely the positives are found; it measures the ability to pick out positive samples
- Other classification criteria
  - F1-score, which reflects the robustness of the model. Its value appears in the output of the example code.
These are the standard criteria for model evaluation, and they form the foundation for what follows.
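To make the criteria above concrete, here is a minimal sketch that computes them by hand on a tiny example and compares the results with scikit-learn's metric functions. The labels in y_true and y_pred are made up for illustration (they are not from the news dataset used later), and the sketch assumes a binary task in which 1 is the positive class.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for a binary task: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy, "| sklearn:", accuracy_score(y_true, y_pred))
print("precision:", precision, "| sklearn:", precision_score(y_true, y_pred))
print("recall:", recall, "| sklearn:", recall_score(y_true, y_pred))
print("F1:", f1, "| sklearn:", f1_score(y_true, y_pred))
```

The hand-computed values and the library values should agree, which is a quick way to confirm which quantity each name refers to.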
Classification model evaluation API
sklearn.metrics.classification_report(y_true, y_pred, target_names=None)
- y_true: the true target values
- y_pred: the target values predicted by the estimator
- target_names: the names of the target categories
- return: the precision and recall for each class
Example:
    # News classification as an example
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer  # feature extraction
    from sklearn.naive_bayes import MultinomialNB  # naive Bayes

    # 1. Obtain the data
    news = fetch_20newsgroups(subset='all')
    # 2. Split the data
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)
    # 3. Extract features from the data sets
    tf = TfidfVectorizer()
    # Compute word importance statistics on the training set
    x_train = tf.fit_transform(x_train)
    # Apply the same word importance statistics to the test set
    x_test = tf.transform(x_test)
    # 4. Run the naive Bayes algorithm
    bys = MultinomialNB(alpha=0.1)
    bys.fit(x_train, y_train)
    # 5. Predict
    predict = bys.predict(x_test)
    print("predicted article category:", predict)
    print("accuracy:", bys.score(x_test, y_test))
    # 6. Evaluate the model
    from sklearn.metrics import classification_report
    print(classification_report(y_test, predict, target_names=news.target_names))
Model selection and tuning
Cross-validation
Cross-validation makes the evaluation of a model more accurate and reliable.
Cross-validation process:
The training data is split further into a training set and a validation set. For example, divide the data into five parts and use one of them as the validation set; then run five rounds (groups) of tests, using a different part as the validation set each time.
This gives results from five groups of models, and their average is taken as the final result. This is called 5-fold cross-validation.
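The 5-fold procedure just described can be sketched as follows. This is only an illustration under assumptions that are not in the original notes: it uses the built-in iris data as the "training data", a k-nearest neighbors classifier with n_neighbors=5, and a fixed random_state so the folds are reproducible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the training data.
X, y = load_iris(return_X_y=True)

# Five folds; shuffle first because the iris samples are ordered by class.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Each round trains on four parts and validates on the remaining part.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The final result is the average over the five validation scores.
manual_mean = float(np.mean(fold_scores))

# cross_val_score performs the same loop when given the same splitter.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=kf)

print("accuracy per fold:", fold_scores)
print("mean accuracy:", manual_mean)
print("cross_val_score mean:", cv_scores.mean())
```

Writing the loop out once shows what the helper does; in practice cross_val_score is usually the shorter choice.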
Grid search
Grid search is used to tune parameters, specifically the ones known as hyperparameters.
Hyperparameter search: grid search
Under normal circumstances, many parameters have to be specified manually (such as the value of K in the k-nearest neighbor algorithm); these are called hyperparameters. Tuning them by hand is tedious, so several hyperparameter combinations are preset for the model.
Each hyperparameter combination is evaluated with cross-validation. Finally, the best combination is selected to build the model.
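As a rough sketch of this idea, the loop below cross-validates each preset value of K and keeps the one with the best mean score, then runs scikit-learn's GridSearchCV over the same candidates for comparison. The iris data, the candidate list candidate_ks, and cv=5 are assumptions chosen for illustration; they do not come from the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical preset values for the hyperparameter K.
candidate_ks = [1, 3, 5, 7, 9]

# Manual search: evaluate every candidate with 5-fold cross-validation.
best_k, best_score = None, -np.inf
for k in candidate_ks:
    mean_score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if mean_score > best_score:
        best_k, best_score = k, mean_score

# GridSearchCV automates the same search over the same candidates.
gs = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": candidate_ks}, cv=5)
gs.fit(X, y)

print("manual search:", best_k, best_score)
print("GridSearchCV:", gs.best_params_, gs.best_score_)
```

With more than one hyperparameter the manual version needs nested loops over every combination, which is the bookkeeping that GridSearchCV takes over.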
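Grid search and cross-validation API
sklearn.model_selection.GridSearchCV(estimator, param_grid, cv=None)
Performs an exhaustive search over the specified parameter values of an estimator.
- estimator: the estimator object
- param_grid: the estimator parameters (dict), e.g. {"n_neighbors": [1, 3, 5]}
- cv: the number of cross-validation folds
- fit: takes the training set data
- score: the accuracy
Result analysis:
- best_score_: the best result obtained in cross-validation
- best_estimator_: the model with the best parameters
- cv_results_: the test-set and training-set accuracy results from each cross-validation run
Example code: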
    # Code example
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.preprocessing import MinMaxScaler
    import pandas as pd
    import numpy as np

    # Read the data (np.object was removed in NumPy 1.24; the built-in object is equivalent)
    data = np.loadtxt('datingTestSet.txt', dtype=object, delimiter='\t', encoding='gbk')
    df = pd.DataFrame(data)
    # Get the feature values
    x = df.iloc[:, :3]
    # Get the target values
    y = df[3]
    # Feature engineering: normalization
    scaler = MinMaxScaler()
    x = scaler.fit_transform(x)
    # Split the data set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # Instantiate the estimator
    kn = KNeighborsClassifier()
    # Build the parameter values to search over
    param = {'n_neighbors': [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]}
    # Run the grid search with cross-validation
    gc = GridSearchCV(kn, param_grid=param, cv=2)
    gc.fit(x_train, y_train)
    # Report the accuracy
    print('Accuracy on the test set:', gc.score(x_test, y_test))
    print('Best result in cross-validation:', gc.best_score_)
    print('Best model selected:', gc.best_estimator_)
    print('Results for each hyperparameter in each cross-validation run:', gc.cv_results_)