3-sigma outlier removal and outlier replacement with Python lists and NumPy: worked examples

Example: removing outliers directly

import numpy as np

data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
data_array = np.asarray(data_list)

# Mean and standard deviation (np.std defaults to the population form, ddof=0)
mean = np.mean(data_array, axis=0)
std = np.std(data_array, axis=0)

# Keep only the values strictly inside (mean - 3*std, mean + 3*std)
mask = (data_array > mean - 3 * std) & (data_array < mean + 3 * std)
preprocessed_data_array = data_array[mask].tolist()
print(preprocessed_data_array)
  • Output:
[1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
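
Since the title also mentions plain Python lists, the same filter can be sketched without NumPy. This is only a minimal sketch: the function name remove_outliers_3sigma is hypothetical, and it assumes the population standard deviation (statistics.pstdev), which matches the np.std default used above.

```python
import statistics

def remove_outliers_3sigma(values):
    """Return the values lying strictly within 3 standard deviations of the mean.

    Hypothetical helper for illustration; uses the population standard
    deviation, the same convention as np.std with its default ddof=0.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    lower, upper = mean - 3 * std, mean + 3 * std
    return [x for x in values if lower < x < upper]

data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
cleaned = remove_outliers_3sigma(data_list)
print(cleaned)
```

With this input the two extreme values (10000 and -10000) fall outside the band and the remaining 20 values are returned unchanged, as plain Python ints.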

Example: replacing outliers with the mean

import numpy as np

data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
# print(sum(data_list)/len(data_list))

# Note: specify dtype=float here; in an integer array, the mean written in below would be truncated to an integer
data_array = np.asarray(data_list, dtype=float)

mean = np.mean(data_array, axis=0)
std = np.std(data_array, axis=0)
print(mean)

floor = mean - 3*std
upper = mean + 3*std

# Overwrite, in place, every value outside [floor, upper] with the mean
data_array[(data_array < floor) | (data_array > upper)] = mean
print(data_array)
  • Output:
2.727272727272727
[1.         2.         3.         4.         5.         5.
 4.         3.         2.         1.         1.         2.
 3.         4.         5.         5.         4.         3.
 2.         1.         2.72727273 2.72727273]
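  • Note: if the array is created without dtype=float, i.e. data_array = np.asarray(data_list), it has an integer dtype, so the mean is truncated to an integer when it is written into the array (2.727 becomes 2). The output is then: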
2.727272727272727
[1 2 3 4 5 5 4 3 2 1 1 2 3 4 5 5 4 3 2 1 2 2]
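
The truncation noted above comes from writing a float into an integer array in place. One possible way around it, sketched here under the assumption that a new array is acceptable, is to let np.where build the result: combining an integer array with a float mean promotes the output to float64. The function name replace_outliers_with_mean is hypothetical.

```python
import numpy as np

def replace_outliers_with_mean(values):
    """Return a new array in which 3-sigma outliers are replaced by the mean.

    Hypothetical helper for illustration; the input array is not modified.
    """
    arr = np.asarray(values)
    mean = arr.mean()
    std = arr.std()  # population standard deviation (ddof=0)
    is_outlier = (arr < mean - 3 * std) | (arr > mean + 3 * std)
    # np.where allocates a new array; int values + float mean -> float64 result
    return np.where(is_outlier, mean, arr)

data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
int_array = np.asarray(data_list)  # integer dtype on purpose, no dtype=float
replaced = replace_outliers_with_mean(int_array)
print(replaced.dtype)
print(replaced[-2:])
```

The trade-off is memory: the in-place version reuses the existing buffer, while this sketch allocates a second array but does not depend on the dtype of the input.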

References

  • https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html

Reposted from blog.csdn.net/sdnuwjw/article/details/111053069