Quantile-based outlier removal in PySpark

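First, a brief explanation of quartiles. With the data sorted in ascending order, the lower quartile Q1 is the value at the 25% position, the upper quartile Q3 is the value at the 75% position, and the interquartile range is Q3 - Q1.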

How to identify outliers in a data set using quartiles

Tukey's test (also known as Tukey's fences) can be used to identify outliers in a data set.

The method is as follows, where Q3 is the upper quartile, Q1 is the lower quartile, and k is a coefficient, usually 1.5 or 3:

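  • Maximum estimate (upper bound) = Q3 + k(Q3 - Q1)
  • Minimum estimate (lower bound) = Q1 - k(Q3 - Q1)

Values above the maximum estimate or below the minimum estimate are treated as outliers.
With k = 3, the flagged values are extreme outliers;
with k = 1.5, they are moderate outliers.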
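
As a minimal sketch of the rule above, written in plain Python and independent of Spark, the two estimates and the moderate/extreme distinction could look like this. The function names `tukey_fences` and `classify` are illustrative only and not part of any library. The quartiles are passed in directly; Q1 = 10 and Q3 = 10172 are the run_time quartiles implied by the bounds printed in the sample output of this post.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (minimum estimate, maximum estimate) for quartiles q1 and q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(x, q1, q3):
    """Label x as 'normal', 'moderate' or 'extreme' using k = 1.5 and k = 3."""
    inner_low, inner_high = tukey_fences(q1, q3, k=1.5)
    outer_low, outer_high = tukey_fences(q1, q3, k=3)
    if x < outer_low or x > outer_high:
        return 'extreme'
    if x < inner_low or x > inner_high:
        return 'moderate'
    return 'normal'


q1, q3 = 10, 10172
print(tukey_fences(q1, q3, k=1.5))    # (-15233.0, 25415.0)
print(tukey_fences(q1, q3, k=3))      # (-30476, 40658)
print(classify(1648093296, q1, q3))   # extreme
```

A value between the k = 1.5 and k = 3 bounds (for example 30000 with these quartiles) is labelled moderate, while anything beyond the k = 3 bounds is labelled extreme.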

Code

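# Outlier: a value that is not normal. This includes missing values and values larger or smaller than the normal range.
#  + Quantile-based removal of extreme values
#  + Median absolute deviation (MAD) based removal of extreme values
#  + Normal-distribution based removal of extreme values
# The core of all three approaches is the same: derive a normal range from the raw data; any value outside that range is an outlier.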
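# Quantile-based removal of extreme values in Spark
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html?highlight=approxquantile
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

# `ss` is the SparkSession used throughout this example
ss = SparkSession.builder.getOrCreate()

# Sample data: id 2 has an unusually high cost, id 7 an extreme run_time
df = ss.createDataFrame([
    (1, 143.5, 651),
    (2, 509.9, 10),
    (3, 33.9, 1333),
    (4, 66.2, 904),
    (5, 133.2, 0),
    (6, 124.1, 10172),
    (7, 1.2, 1648093296)], ['id', 'cost', 'run_time'])
df.show()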

cols = ['cost', 'run_time']

def quantile_excludes_outliers(df, cols, k_q1=1.5, k_q3=1.5):
    '''
    Remove outliers based on quartiles
    :param df: spark dataframe
    :param cols: list of column names
    :param k_q1: coefficient for the lower quartile
    :param k_q3: coefficient for the upper quartile
    :return: spark dataframe
    '''
    bounds = {}
    for c in cols:
        # Returns [lower quartile, upper quartile]; 0.05 is the relative error
        q1, q3 = df.approxQuantile(c, [0.25, 0.75], 0.05)
        iqr = q3 - q1
        bounds[c] = [q1 - k_q1 * iqr, q3 + k_q3 * iqr]  # [minimum estimate, maximum estimate]

    print(bounds)

    for c in cols:
        low, high = bounds[c]
        # Flag rows with a native column expression rather than a Python UDF
        is_outlier = (fn.col(c) < low) | (fn.col(c) > high)
        df = df.withColumn(c + '_is_outlier', fn.when(is_outlier, 'yes').otherwise('no'))
        df = df.filter(fn.col(c + '_is_outlier') == 'no')

    df.show()

    # Drop the helper flag columns before returning
    df = df.drop(*[c + '_is_outlier' for c in cols])

    df.show()

    return df

quantile_excludes_outliers(df, cols, k_q1=1.5, k_q3=1.5)

+---+-----+----------+
| id| cost|  run_time|
+---+-----+----------+
|  1|143.5|       651|
|  2|509.9|        10|
|  3| 33.9|      1333|
|  4| 66.2|       904|
|  5|133.2|         0|
|  6|124.1|     10172|
|  7|  1.2|1648093296|
+---+-----+----------+

{'cost': [-130.49999999999997, 307.9], 'run_time': [-15233.0, 25415.0]}
+---+-----+--------+---------------+-------------------+
| id| cost|run_time|cost_is_outlier|run_time_is_outlier|
+---+-----+--------+---------------+-------------------+
|  1|143.5|     651|             no|                 no|
|  3| 33.9|    1333|             no|                 no|
|  4| 66.2|     904|             no|                 no|
|  5|133.2|       0|             no|                 no|
|  6|124.1|   10172|             no|                 no|
+---+-----+--------+---------------+-------------------+

+---+-----+--------+
| id| cost|run_time|
+---+-----+--------+
|  1|143.5|     651|
|  3| 33.9|    1333|
|  4| 66.2|     904|
|  5|133.2|       0|
|  6|124.1|   10172|
+---+-----+--------+
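
The notes at the top of the code also list median absolute deviation (MAD) as another way to set the normal range. For comparison, here is a rough, Spark-free sketch that takes the bounds as median ± k × MAD. The helper name `mad_bounds`, the choice k = 3, and the omission of any scaling constant are assumptions made for illustration; none of them comes from the original code.

```python
from statistics import median


def mad_bounds(values, k=3):
    """Return (lower, upper) = median -/+ k * MAD for a list of numbers."""
    med = median(values)
    # MAD: the median of the absolute deviations from the median
    mad = median(abs(v - med) for v in values)
    return med - k * mad, med + k * mad


run_time = [651, 10, 1333, 904, 0, 10172, 1648093296]
low, high = mad_bounds(run_time)
kept = [v for v in run_time if low <= v <= high]
print((low, high))  # (-1778, 3586)
print(kept)         # [651, 10, 1333, 904, 0]
```

On this sample the MAD bounds are tighter than the quartile bounds, so the run_time value 10172, which the quartile method keeps, would also be dropped.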

Referenced from:

https://blog.csdn.net/qq_30031221/article/details/109180961

https://zhuanlan.zhihu.com/p/344502263

Origin blog.csdn.net/qq_42363032/article/details/123801944