Evaluation metrics for classification and prediction algorithms

  The accuracy a classification or prediction model achieves on its own training set cannot reflect its future performance. To judge the performance of a predictive model effectively, we need a data set that did not participate in building the model and evaluate the model's accuracy on that data set; this independent data set is called the test set. Prediction models are usually evaluated with metrics such as absolute/relative error, mean absolute error, mean squared error, root mean squared error, and mean absolute percentage error.

1. Absolute error and relative error

  Let $Y$ denote the actual value and $\hat{Y}$ the predicted value; the absolute error $E$ is calculated as: $E = Y - \hat{Y}$

  The relative error $e$ is calculated as: $e = \frac{Y - \hat{Y}}{Y}$
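
  For example, if the actual value is $Y = 200$ and the predicted value is $\hat{Y} = 190$, then the absolute error is $E = 200 - 190 = 10$ and the relative error is $e = 10 / 200 = 0.05$, i.e. 5%.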

2. Mean absolute error

  The mean absolute error is calculated as: $MAE = \frac{1}{n} \sum_{i=1}^{n} \left| E_{i} \right| = \frac{1}{n} \sum_{i=1}^{n} \left| Y_{i} - \hat{Y}_{i} \right|$

  where $MAE$ denotes the mean absolute error, $E_{i}$ the absolute error between the $i$-th actual value and its prediction, $Y_{i}$ the $i$-th actual value, and $\hat{Y}_{i}$ the $i$-th predicted value.

  Since prediction errors can be positive or negative, the absolute values are taken before averaging so that errors of opposite sign do not cancel out. MAE is one of the composite indicators for error analysis.
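
  As a quick illustration, here is a minimal sketch of the MAE formula in plain Python (the function name is my own):

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average of the absolute errors |Y_i - Y_hat_i|."""
    n = len(y_true)
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / n
```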

3. Mean squared error

  The mean squared error is calculated as: $MSE = \frac{1}{n} \sum_{i=1}^{n} E_{i}^{2} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_{i} - \hat{Y}_{i} \right)^{2}$

  where MSE denotes the mean squared error. MSE is the average of the squared prediction errors; squaring, like taking absolute values, avoids the problem of positive and negative errors cancelling out.

  Because squaring strengthens the contribution of large errors to the indicator, MSE is more sensitive to large errors, which is a great advantage. MSE is one of the composite indicators for error analysis.
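
  A matching sketch of the MSE formula (again plain Python, name mine):

```python
def mean_squared_error(y_true, y_pred):
    """MSE: average of the squared errors (Y_i - Y_hat_i)^2.

    Squaring removes the sign of each error and amplifies large errors.
    """
    n = len(y_true)
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n
```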

4. Root mean squared error (RMSE)

  The root mean squared error is calculated as: $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} E_{i}^{2}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_{i} - \hat{Y}_{i} \right)^{2}}$

  where RMSE denotes the root mean squared error; the other symbols are as defined above.

  RMSE is the square root of the mean squared error and represents the dispersion of the predicted values around the actual values; it is also known as the standard error of the fit. In the best-fit case, RMSE $= 0$. RMSE is one of the composite indicators for error analysis.
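
  A minimal sketch of RMSE, built on the MSE computation above (name mine):

```python
import math

def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the MSE, expressed in the same units as Y."""
    n = len(y_true)
    mse = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n
    return math.sqrt(mse)
```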

5. Mean absolute percentage error

  The mean absolute percentage error is calculated as: $MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| E_{i} / Y_{i} \right| = \frac{1}{n} \sum_{i=1}^{n} \left| \left( Y_{i} - \hat{Y}_{i} \right) / Y_{i} \right|$

  where MAPE denotes the mean absolute percentage error. A MAPE below 10% is generally considered to indicate high prediction accuracy.
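
  A minimal sketch of the MAPE formula with a small usage example; the function name and the sample numbers are my own:

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE: average of |(Y_i - Y_hat_i) / Y_i|.

    Undefined when any actual value Y_i is zero.
    """
    n = len(y_true)
    return sum(abs((y - yh) / y) for y, yh in zip(y_true, y_pred)) / n

# Example (made-up numbers): actual vs. predicted values
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 195.0]
print(mean_absolute_percentage_error(y_true, y_pred))  # ~0.064, i.e. about 6.4%
```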

6. Kappa statistic

  The Kappa statistic measures agreement when two or more observers judge the same objects, or when the same observer judges the same objects on two or more occasions. It evaluates consistency by comparing the agreement actually observed against the agreement that would arise by chance. Kappa and weighted Kappa statistics are suitable not only for agreement and reproducibility tests on unordered and ordered categorical data, but also reflect the magnitude of the agreement. (A computational sketch follows the list below.)

  The Kappa value lies in $[-1, +1]$, and its magnitude carries different meanings:

    Kappa $= +1$: the two judgment results are in complete agreement.

    Kappa $= -1$: the two judgment results are in complete disagreement.

    Kappa $= 0$: the agreement between the two judgments is no better than chance.

    Kappa $< 0$: the observed agreement is even worse than what chance alone would produce; this is rarely meaningful in practical applications.

    Kappa $> 0$: the agreement is meaningful; the larger the Kappa, the better the consistency.

    Kappa $\geq 0.75$: the agreement is quite satisfactory.

    Kappa $< 0.4$: the agreement is inadequate.
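
  Here is a minimal sketch of how the (unweighted) Kappa statistic can be computed from a square agreement table; the function name and the example numbers are my own:

```python
def cohens_kappa(table):
    """Kappa = (p_o - p_e) / (1 - p_e).

    table[i][j]: number of items rater 1 put in class i and rater 2 in class j.
    p_o: observed agreement (diagonal share);
    p_e: agreement expected by chance, from the marginal totals.
    """
    total = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / total
    row_marginals = [sum(row) for row in table]
    col_marginals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_marginals, col_marginals)) / total ** 2
    return (p_o - p_e) / (1 - p_e)

# Two raters, two classes: agreement on 35 of 50 items
print(cohens_kappa([[20, 5], [10, 15]]))  # 0.4
```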

7. Accuracy

   Accuracy is calculated as: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \%$

  where the terms mean:

  TP (True Positives): the number of positive samples correctly classified as positive.

  TN (True Negatives): the number of negative samples correctly classified as negative.

  FP (False Positives): the number of negative samples incorrectly classified as positive.

  FN (False Negatives): the number of positive samples incorrectly classified as negative.
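
  Putting the four counts and the accuracy formula together, a minimal binary-classification sketch (names mine; the positive label is configurable):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```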

8. Precision

  Precision is calculated as: $\text{Precision} = \frac{TP}{TP + FP} \times 100 \%$
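
  The corresponding one-liner, reusing the counts helper sketched in section 7:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): the share of predicted positives that are correct."""
    return tp / (tp + fp)
```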

9. Recall

  Recall is calculated as: $\text{Recall} = \frac{TP}{TP + FN} \times 100 \%$
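
  And its counterpart, again reusing the counts from section 7:

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): the share of actual positives that are found."""
    return tp / (tp + fn)
```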

10. ROC curve

  The Receiver Operating Characteristic (ROC) curve owes its name to the fact that every point on the curve reflects the same sensibility: all points are responses to the same signal stimulus, obtained under different decision criteria. The ROC curve plots the false-alarm probability on the horizontal axis against the hit probability on the vertical axis, tracing the different outcomes a subject produces under a given stimulus condition as the decision criterion is varied.

  This is a very effective model evaluation method and gives quantitative guidance for choosing a decision threshold. Plotting sensitivity on the vertical axis against 1 - specificity on the horizontal axis yields the ROC curve. The area under the curve (AUC) is closely related to the quality of each method: it reflects the statistical probability of correct classification by the classifier, and the closer its value is to 1, the better the algorithm.
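
  As a sketch of the AUC interpretation described above (the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one), the following uses the rank-comparison (Mann-Whitney) formulation; function name and example data are my own:

```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Labels are assumed to be 0 (negative) / 1 (positive).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```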

11. Confusion matrix

  The confusion matrix is a common representation in the field of pattern recognition. It depicts the relationship between the true classes of the sample data and the classes assigned by the recognizer, and is a common way to evaluate classifier performance. Suppose a classification task involves $N$ classes, and the evaluation data set $D$ contains $T_{0}$ samples, with $T_{i}$ samples in class $i$ ($i = 1, \dots, N$). A classifier $C$ is built with some recognition algorithm, and $cm_{ij}$ denotes the percentage of class-$i$ samples that classifier $C$ assigns to class $j$. This yields the following $N \times N$ confusion matrix: $$CM(C, D)=\left(\begin{array}{cccccc}{cm_{11}} & {cm_{12}} & {\dots} & {cm_{1i}} & {\dots} & {cm_{1N}} \\ {cm_{21}} & {cm_{22}} & {\dots} & {cm_{2i}} & {\dots} & {cm_{2N}} \\ {\vdots} & {\vdots} & {} & {\vdots} & {} & {\vdots} \\ {cm_{i1}} & {cm_{i2}} & {\dots} & {cm_{ii}} & {\dots} & {cm_{iN}} \\ {\vdots} & {\vdots} & {} & {\vdots} & {} & {\vdots} \\ {cm_{N1}} & {cm_{N2}} & {\dots} & {cm_{Ni}} & {\dots} & {cm_{NN}}\end{array}\right)$$

  In the confusion matrix, the row index of each element corresponds to the target's true class, and the column index to the class assigned by the classifier. The diagonal elements give the percentage of each class correctly recognized by classifier $C$, while the off-diagonal elements give the percentages of erroneous judgments.

  From the confusion matrix, the classifier's correct and erroneous recognition rates can be derived.

  Correct recognition rate of each class: $R_{i}=cm_{ii}, \quad i=1, \cdots, N$

  Average correct recognition rate: $R_{A}=\sum_{i=1}^{N}\left(cm_{ii} \cdot T_{i}\right) / T_{0}$

  Error recognition rate of each class: $W_{i}=\sum_{j=1, j \neq i}^{N} cm_{ij}=1-cm_{ii}=1-R_{i}$

  Average error recognition rate: $W_{A}=\sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N}\left(cm_{ij} \cdot T_{i}\right) / T_{0}=1-R_{A}$
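
  A minimal NumPy sketch of the row-normalized confusion matrix and the recognition rates above, assuming NumPy is available, integer class labels $0, \dots, N-1$, and at least one sample per class (names mine):

```python
import numpy as np

def confusion_matrix_rates(y_true, y_pred, n_classes):
    """Build CM(C, D) and derive R_i, R_A, W_i, W_A as defined above."""
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    t_i = counts.sum(axis=1)               # T_i: samples per true class
    cm = counts / t_i[:, None]             # cm_ij: row-normalized fractions
    r_i = np.diag(cm)                      # R_i: per-class correct rate
    r_a = (r_i * t_i).sum() / t_i.sum()    # R_A: weighted average correct rate
    w_i = 1.0 - r_i                        # W_i: per-class error rate
    w_a = 1.0 - r_a                        # W_A: average error rate
    return cm, r_i, r_a, w_i, w_a
```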

