Wilson Confidence Interval Algorithm

This algorithm rests on the binomial distribution of user choices. Each recorded data point is an independent 0-1 event (a Bernoulli trial), so the number of positive outcomes follows a binomial distribution. There are many formulas for the confidence interval of a binomial proportion. The most common is the "normal approximation interval", but it is only suitable for large samples (np > 5 and n(1 − p) > 5); for small samples its accuracy is very poor. The Wilson algorithm solves the small-sample accuracy problem. Given the observed counts, its input is the confidence level and its output is the confidence interval. To rank items against each other, use the lower limit of the confidence interval as the score.
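
The small-sample problem can be seen in a short comparison. The following is a minimal sketch, not code from the original post: the function names `normal_lower` and `wilson_lower` are illustrative, and z = 1.96 (95% confidence) is assumed. With 2 positive votes out of 2, the normal approximation collapses to a lower limit of 1, while the Wilson lower limit stays well below 1; with 800 out of 1000 the two nearly agree.

```python
import math


def normal_lower(pos, total, z=1.96):
    """Lower limit of the normal approximation interval: p - z * sqrt(p(1-p)/n)."""
    p = pos / total
    return p - z * math.sqrt(p * (1 - p) / total)


def wilson_lower(pos, total, z=1.96):
    """Lower limit of the Wilson interval for the same data."""
    p = pos / total
    centre = p + z * z / (2 * total)
    spread = (z / (2 * total)) * math.sqrt(4 * total * p * (1 - p) + z * z)
    return (centre - spread) / (1 + z * z / total)


# Small sample: the normal approximation is overconfident.
print(normal_lower(2, 2), wilson_lower(2, 2))

# Large sample: the two lower limits are close to each other.
print(normal_lower(800, 1000), wilson_lower(800, 1000))
```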

S is the score given by Wilson's confidence interval formula (the lower limit of the interval), where n is the total number of samples, u is the number of positive examples, v is the number of negative examples, p = u/n is the proportion of positive examples, and z is the quantile of the standard normal distribution corresponding to the chosen confidence level. At the 95% confidence level, z is 1.96. As a simple example, if someone receives 80 votes in favor and 20 votes against, then n is 100, u is 80, v is 20, and p is 0.8.
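
Written out in LaTeX, the score takes the following form. This is a reconstruction from the Python implementation given later in this post, using the symbols n, u, v, p and z defined above:

```latex
S = \frac{p + \dfrac{z^{2}}{2n} - \dfrac{z}{2n}\sqrt{4np(1-p) + z^{2}}}{1 + \dfrac{z^{2}}{n}},
\qquad p = \frac{u}{n}, \quad n = u + v
```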

Quantile table of normal distribution:
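
Commonly used quantiles can be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library and assumes a two-sided interval, so a confidence level c maps to the quantile at 1 − (1 − c)/2. The helper name `z_for_confidence` is illustrative and does not come from the original post.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided standard normal quantile for a confidence level in (0, 1)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


# Print a small table: confidence level and the matching quantile z.
for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.2f}  {z_for_confidence(level):.3f}")
```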

Algorithm properties:

  1. Property: the score S lies in [0, 1); effect: the score is normalized and suitable for sorting.
  2. Property: when the number of positive examples u is 0, p is 0 and the score S is 0; effect: an item with no favorable reviews gets the lowest score.
  3. Property: when the number of negative examples v is 0, p is 1 and S degenerates to 1/(1 + z^2 / n), which is always less than 1; effect: scores always remain comparable.
  4. Property: when p is held constant (with p > 0), the correction terms in z shrink as n grows, so the score S increases with n, and vice versa; effect: at the same favorable rate p, the larger the total number of examples n, the higher the score S.
  5. Property: as n tends to infinity, S converges to p; effect: the larger the total number of reviews n, the more the score S is determined by the favorable rate p.
  6. Property: the larger the quantile z, the more the total number n matters and the less the favorable rate p matters, and vice versa; effect: the larger z is, the more weight the total number of reviews n carries and the lower the discrimination by p; the smaller z is, the more weight the favorable rate p carries.
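
These properties can be checked numerically. The following is a minimal sketch under stated assumptions: it uses a pure-Python variant of the score with `math` instead of NumPy, z = 1.96 unless noted, and sample counts chosen only for illustration.

```python
import math


def wilson_score(pos, total, z=1.96):
    """Lower limit of the Wilson interval (pure-Python variant, for illustration)."""
    p = pos / total
    centre = p + z * z / (2 * total)
    spread = (z / (2 * total)) * math.sqrt(4 * total * p * (1 - p) + z * z)
    return (centre - spread) / (1 + z * z / total)


# Property 2: no positive examples gives a score of 0 (up to rounding error).
no_positives = wilson_score(0, 50)

# Property 3: no negative examples gives 1 / (1 + z^2 / n), which is below 1.
all_positives = wilson_score(50, 50)
closed_form = 1 / (1 + 1.96 ** 2 / 50)

# Property 4: same rate p = 0.8, growing n (10, 100, 1000) gives growing scores.
same_rate = [wilson_score(8 * k, 10 * k) for k in (1, 10, 100)]

# Property 5: for very large n the score approaches p = 0.8.
large_n = wilson_score(8 * 10**7, 10**8)

# Property 6: a larger z widens the gap between small-n and large-n items.
gap_small_z = wilson_score(800, 1000, z=1.0) - wilson_score(8, 10, z=1.0)
gap_large_z = wilson_score(800, 1000, z=3.0) - wilson_score(8, 10, z=3.0)

print(no_positives, all_positives, closed_form)
print(same_rate, large_n)
print(gap_small_z, gap_large_z)
```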

Python code implementation:

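import numpy as np


def wilson_score(pos, total, p_z=0.8):
    """
    Wilson score function (lower limit of the Wilson confidence interval)
    :param pos: number of positive examples
    :param total: total number of examples
    :param p_z: quantile of the normal distribution (e.g. 1.96 for 95% confidence)
    :return: Wilson score
    """
    if total == 0:
        return 0.  # assumption: no data is treated as the lowest score
    pos_rat = pos / float(total)  # proportion of positive examples
    score = (pos_rat + (np.square(p_z) / (2. * total))
             - ((p_z / (2. * total)) * np.sqrt(4. * total * (1. - pos_rat) * pos_rat + np.square(p_z)))) / \
            (1. + np.square(p_z) / total)
    return score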

Application test:
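
As one possible application test, the sketch below ranks a few hypothetical items by their score. The item names and vote counts are made up for illustration, the score is a pure-Python variant using `math` instead of NumPy, and z = 1.96 is assumed.

```python
import math


def wilson_score(pos, total, z=1.96):
    """Lower limit of the Wilson interval; returns 0.0 when there are no votes."""
    if total == 0:
        return 0.0
    p = pos / total
    centre = p + z * z / (2 * total)
    spread = (z / (2 * total)) * math.sqrt(4 * total * p * (1 - p) + z * z)
    return (centre - spread) / (1 + z * z / total)


# Hypothetical items: (name, positive votes, total votes).
items = [
    ("C", 1, 1),     # 100% positive, but only one vote
    ("A", 80, 100),  # the 80-in-favor, 20-against example
    ("D", 0, 5),     # no positive votes
    ("B", 8, 10),    # same rate as A, far fewer votes
]

# Sort by score, highest first.
ranked = sorted(items, key=lambda item: wilson_score(item[1], item[2]), reverse=True)
for name, pos, total in ranked:
    print(name, pos, total, round(wilson_score(pos, total), 4))
```

Item C has a perfect rate but ranks below A and B, because a single vote carries little evidence.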

Origin blog.csdn.net/gf19960103/article/details/105053027