Evaluation metrics for machine learning with imbalanced positive and negative samples, with code implementations

References:

https://www.zhihu.com/question/428547855

https://www.jianshu.com/p/7919ef304b19

Confusion matrix

                   Predicted positive      Predicted negative
Actual positive    TP (true positive)      FN (false negative)
Actual negative    FP (false positive)     TN (true negative)
  • $F1 = \frac{2 * P * R}{P + R}$
  • P, precision: of all samples the model predicts as positive, the fraction that are truly positive
    • $P = \frac{TP}{TP + FP}$
  • R, recall: of all true positives, the fraction the model correctly predicts as positive
    • $R = \frac{TP}{TP + FN}$
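As a quick sanity check, the three quantities above can be computed with scikit-learn (a minimal sketch; the label arrays are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)   # TP=2, FP=1 -> 2/3
r = recall_score(y_true, y_pred)      # TP=2, FN=1 -> 2/3
f1 = f1_score(y_true, y_pred)         # 2*P*R/(P+R) = 2/3
print(p, r, f1)
```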

GAUC

https://blog.csdn.net/qq_42363032/article/details/120070512

Modified F1

The plain F1 score is not very useful for imbalanced learning problems. An alternative is to define an F1 score for the positive class and an F1 score for the negative class separately. The former is the usual F1 score; the latter follows from a slight modification of the formulas above. The two are then weighted by the proportions of positive and negative samples to give a weighted F1 score. This new F1 score still roughly reflects the model's true performance, but if the samples are highly imbalanced, the weighted F1 score breaks down as well.

  • F1 score for the positive class

    • $F1_{positive} = \frac{2 * P_{positive} * R_{positive}}{P_{positive} + R_{positive}}$
    • $P_{positive} = \frac{TP}{TP + FP}$
    • $R_{positive} = \frac{TP}{TP + FN}$
  • F1 score for the negative class

    • $F1_{negative} = \frac{2 * P_{negative} * R_{negative}}{P_{negative} + R_{negative}}$
    • $P_{negative} = \frac{TN}{TN + FN}$ (of all samples predicted negative, the fraction that are truly negative)
    • $R_{negative} = \frac{TN}{TN + FP}$
  • Let the proportion of positive samples be $\alpha$ and the proportion of negative samples be $\beta$

  • $F1_{weighted} = \alpha * F1_{positive} + \beta * F1_{negative}$
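Since $\alpha$ and $\beta$ are the class supports as fractions, this weighted F1 matches scikit-learn's `f1_score` with `average='weighted'`. A quick cross-check on toy labels:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 0])

# Per-class F1 in label order [0, 1]
f1_neg, f1_pos = f1_score(y_true, y_pred, average=None, labels=[0, 1])
alpha = (y_true == 1).mean()   # positive-class proportion
beta = (y_true == 0).mean()    # negative-class proportion
f1_weighted = alpha * f1_pos + beta * f1_neg

# Matches sklearn's support-weighted average
print(f1_weighted, f1_score(y_true, y_pred, average='weighted'))
```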

Specificity

A simple metric, common both in practice and in papers, is specificity: the model's recall on the negative class. It is easy to compute: specificity = TN / (TN + FP).

Specificity is common for two reasons. In practice, especially in imbalanced learning problems, the minority class is usually the one we care about most, so watching the model's recall on it is often very important. In papers, when your main leaderboard score can't beat everyone else's, you can take a different tack and argue that specificity matters greatly; it then becomes your trusty wingman, giving you one more well-supported way to claim that your method outperforms others.

  • TNR: true negative rate, the fraction of all negatives that the model identifies as negative
    Formula: $Specificity = TNR = \frac{TN}{TN + FP}$

G-Mean

G-Mean is another metric that can evaluate model performance on imbalanced data; its formula is given below.

Multiply the recall on the positive class by the recall on the negative class (specificity), then take the square root:

$G\text{-}Mean = \sqrt{Recall * Specificity} = \sqrt{\frac{TP}{TP+FN} * \frac{TN}{TN+FP}}$
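Both factors fall out of the confusion matrix directly, so specificity and G-Mean can be computed together (a minimal sketch with illustrative labels):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)        # recall on the positive class
specificity = tn / (tn + fp)   # recall on the negative class
g_mean = np.sqrt(recall * specificity)
print(recall, specificity, g_mean)
```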

MCC

MCC (the Matthews correlation coefficient) is a metric used in machine learning to measure binary classification performance. It takes true positives, true negatives, false positives, and false negatives into account, and is generally considered a balanced metric that can be applied even when the two classes differ greatly in size.
MCC is essentially a correlation coefficient between the actual and the predicted classification. It ranges over [-1, 1]: 1 indicates a perfect prediction, 0 indicates the prediction is no better than random, and -1 indicates that the predicted and actual classes disagree completely.
$MCC = \frac{TP * TN - FP * FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
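scikit-learn provides this metric directly as `matthews_corrcoef` (toy labels for illustration; with TP=2, TN=4, FP=1, FN=1 the formula gives 7/15):

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
mcc = matthews_corrcoef(y_true, y_pred)
print(mcc)  # (2*4 - 1*1) / sqrt(3*3*5*5) = 7/15
```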

----

GAUC code implementation

# Compute GAUC
# GAUC computes each user's own AUC first, then takes a weighted average.
# The weight is each user's click count; users whose samples are all
# positive or all negative are filtered out (their AUC is undefined).
# flag=False: no correction.  Otherwise, when is_outflow=1 and the model
# predicts 1, the prediction may be corrected down to 0.
import numpy as np
from sklearn.metrics import roc_auc_score

def calculationGAUC(df, models, flag=False):
    gbdt, ohecodel, lr = models[0], models[1], models[2]
    # Accumulators for the test-set GAUC
    sumWAUCclick, sumWclick = 0, 0
    sumWAUCall, sumWall = 0, 0
    for suuid, data in df.groupby('suuid'):
        # Skip users whose samples are all positive or all negative
        if data['y'].nunique() == 1:
            continue
        # Weights: the user's click count and the user's sample count
        wclick = data['y'].sum()
        wall = len(data)
        # Predict for this user and compute the per-user AUC
        x, y = np.array(data.iloc[:, 1:-1]), np.array(data.iloc[:, -1])
        x_leaves = gbdt.apply(x)[:, :, 0]
        x_trans = ohecodel.transform(x_leaves)
        yproba = lr.predict_proba(x_trans)[:, 1]  # predicted probabilities
        # Hard predictions from a 0.8 threshold (instead of lr.predict's 0.5)
        y_pre = [1 if proba > 0.8 else 0 for proba in yproba]
        if not flag:
            aucUser = roc_auc_score(y, y_pre)
        else:
            # Correction: when is_outflow == 1, shrink the probability
            # by 0.1 before re-thresholding the hard prediction
            is_outflowdata = np.array(data.loc[:, 'is_outflow'])
            for i in range(len(is_outflowdata)):
                if is_outflowdata[i] == 1:
                    if yproba[i] > 0.8:
                        yproba[i] = yproba[i] - 0.1
                    y_pre[i] = 1 if yproba[i] > 0.8 else 0
            aucUser = roc_auc_score(y, y_pre)
        # Accumulate numerators and denominators
        sumWAUCclick += wclick * aucUser
        sumWAUCall += wall * aucUser
        sumWclick += wclick
        sumWall += wall

    gaucclick = sumWAUCclick / sumWclick
    gaucall = sumWAUCall / sumWall
    return gaucclick, gaucall
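The function above is tied to a specific GBDT+LR pipeline. Stripped to its core, GAUC over precomputed scores can be sketched as follows (the column names `suuid`, `y`, `score` and the toy DataFrame are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def gauc(df, user_col='suuid', label_col='y', score_col='score'):
    # Per-user AUC, weighted by that user's click count; users whose
    # labels are all-positive or all-negative are skipped (AUC undefined)
    num, den = 0.0, 0.0
    for _, g in df.groupby(user_col):
        if g[label_col].nunique() < 2:
            continue
        w = g[label_col].sum()
        num += w * roc_auc_score(g[label_col], g[score_col])
        den += w
    return num / den

df = pd.DataFrame({
    'suuid': ['a', 'a', 'a', 'b', 'b', 'c'],
    'y':     [1,   0,   0,   1,   0,   1],
    'score': [0.9, 0.2, 0.4, 0.3, 0.6, 0.8],
})
# User a ranks its positive first (AUC 1), user b ranks it last (AUC 0),
# user c has only one sample and is skipped -> GAUC = 0.5
print(gauc(df))
```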

Modified F1 code implementation

def weightF1ForPN(y, y_pre, F1_positive, alpha, beta):
    lenall = len(y)
    pre = 0            # number of samples predicted negative
    rec = 0            # number of truly negative samples
    precisionLen = 0   # predicted negative AND truly negative (TN, for precision)
    recallLen = 0      # truly negative AND predicted negative (TN, for recall)

    for i in range(lenall):
        # Negative-class precision: of all predicted negatives, the fraction truly negative
        if y_pre[i] == 0:
            pre += 1
            if y[i] == 0:
                precisionLen += 1
        # Negative-class recall: of all true negatives, the fraction predicted negative
        if y[i] == 0:
            rec += 1
            if y_pre[i] == 0:
                recallLen += 1
    p_negative = precisionLen / pre
    r_negative = recallLen / rec
    print('  samples predicted negative: {}, of which truly negative: {}, negative-class precision: {}'.format(pre, precisionLen, p_negative))
    print('  negative samples: {}, of which predicted negative: {}, negative-class recall: {}'.format(rec, recallLen, r_negative))
    F1_negative = (2 * p_negative * r_negative) / (p_negative + r_negative)
    print('  negative-class F1: {}'.format(F1_negative))
    f1_weight = alpha * F1_positive + beta * F1_negative
    return f1_weight, p_negative, r_negative

MCC code implementation

def evalMCC(y, y_pre):
    # Tally the confusion-matrix counts from the binary labels
    lenall = len(y)
    TP, FP, FN, TN = 0, 0, 0, 0
    for i in range(lenall):
        if y_pre[i] == 1:
            if y[i] == 1:
                TP += 1
            if y[i] == 0:
                FP += 1
        if y_pre[i] == 0:
            if y[i] == 1:
                FN += 1
            if y[i] == 0:
                TN += 1
    numerator = TP * TN - FP * FN
    denominator = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
    # By convention MCC is 0 when any marginal count is 0
    mcc = numerator / denominator if denominator != 0 else 0
    return mcc


Reposted from blog.csdn.net/qq_42363032/article/details/121560262