MATLAB outlier processing

Data preprocessing often has to deal with outliers and missing values. This post introduces common techniques for these two situations.

When the data has both outliers and missing values, there is no strict rule about which to process first. I usually deal with outliers first and then missing values, because identified outliers can be converted into missing values and handled together with them.

How to identify outliers

Outliers (also called anomalous values) are sample points whose values deviate significantly from the rest of the sample. Common ways to identify them fall into the following two cases:

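  • The data has a given valid range: any value outside that range is an outlier
  • The data has no given range: use a statistical rule such as the 3-sigma principle or a box plot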
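When the data comes with a known valid range, the check reduces to a comparison against the limits. A minimal sketch, assuming hypothetical exam scores restricted to [0, 100] (the data and the names score, lo and hi are illustrative, not from the original post):

```matlab
% Hypothetical example: exam scores that must lie in [0, 100]
score = [78 92 105 64 -3 88 100];
lo = 0;      % assumed lower limit of the valid range
hi = 100;    % assumed upper limit of the valid range
tmp = (score < lo) | (score > hi);   % logical mask of out-of-range values
ind = find(tmp)                      % positions of the out-of-range values
```

Here ind is [3 5], because 105 and -3 fall outside the assumed range.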

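3-sigma principle

  • The 3-sigma principle can be used to identify outliers only on the premise that the data follows a normal distribution, in which case about 99.73% of values fall within mean ± 3 sigma. Normality can be judged from:
    • Historical experience: for example, height is known to be approximately normally distributed
    • Statistics: a normality test
  • If you want to narrow the interval a little and remove more outliers:
    • 2 sigma can be used instead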
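x = [48 51 57 57 49 86 48 53 59 50 48 47 53 56 60];   % assume x is a sample drawn from a normal distribution
u = mean(x,'omitnan');  % mean, ignoring missing values in the data
sigma = std(x,'omitnan');   % sample standard deviation (normalized by n-1); std(x,1,'omitnan') gives the population standard deviation
lb = u - 3*sigma    % lower bound of the interval
ub = u + 3*sigma   % upper bound of the interval
tmp = (x < lb) | (x > ub);   % logical mask of values outside [lb, ub]
ind = find(tmp)   % positions of the outliers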

This returns ind = 6, meaning the 6th element (the value 86) is an outlier.
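The normality premise behind the 3-sigma rule can be checked with a statistical test before relying on the result. A rough sketch, assuming the Statistics and Machine Learning Toolbox is installed and choosing the Lilliefors test (lillietest) as one possible normality test; the original post does not say which test to use:

```matlab
x = [48 51 57 57 49 86 48 53 59 50 48 47 53 56 60];
% lillietest tests the null hypothesis that x comes from a normal
% distribution whose mean and variance are not specified in advance
[h, p] = lillietest(x);   % h = 1 rejects normality at the 5% level, h = 0 does not
if h == 1
    disp('Normality rejected: consider the box plot rule instead of 3 sigma');
else
    disp('Normality not rejected: the 3-sigma rule is reasonable');
end
```

With a sample this small the test has little power, and a strong outlier can itself cause normality to be rejected, so the result is best read together with prior knowledge about the variable.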

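Boxplot to identify outliers

  • When it is not known whether the population is normally distributed, we usually use the box plot rule to identify outliers
  • Generally speaking, the box plot rule removes more outliers than 3 sigma
  • A value is an outlier if it lies below Q1 - k*IQR or above Q3 + k*IQR, where Q1 and Q3 are the lower and upper quartiles and IQR = Q3 - Q1; k is usually 1.5, but other values can be used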

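x = [48 51 57 57 49 86 48 53 59 50 48 47 53 56 60];
% The percentile function prctile requires the Statistics and Machine Learning Toolbox
Q1 = prctile(x,25); % lower quartile
Q3 = prctile(x,75); % upper quartile
IQR = Q3-Q1; % interquartile range
lb = Q1 - 1.5*IQR % lower bound
ub = Q3 + 1.5*IQR % upper bound
tmp = (x < lb) | (x > ub);   % logical mask of values outside [lb, ub]
ind = find(tmp)   % positions of the outliers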
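As an alternative to computing the bounds by hand, MATLAB's built-in isoutlier function (available since R2017a) offers both rules through its method argument. This is a sketch of an alternative, not the method used in the original post, and the variable names are illustrative:

```matlab
x = [48 51 57 57 49 86 48 53 59 50 48 47 53 56 60];
tmp_box   = isoutlier(x, 'quartiles');   % more than 1.5*IQR below Q1 or above Q3
tmp_sigma = isoutlier(x, 'mean');        % more than 3 standard deviations from the mean
ind_box   = find(tmp_box)                % positions flagged by the box plot rule
ind_sigma = find(tmp_sigma)              % positions flagged by the 3-sigma rule
```

For this sample both calls flag only the 6th element (the value 86), matching the manual calculations. The multiplier can be changed with the 'ThresholdFactor' option, for example isoutlier(x, 'mean', 'ThresholdFactor', 2) for a 2-sigma rule.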

What to do after identifying outliers

Once an outlier is identified, we can often treat the outlier as a missing value and hand it over to the missing value handling method.

In code (note: if multiple columns of data need to be processed, you can write a loop):

x(ind) = nan   % nan means Not-a-Number (for example, 0/0 returns nan); here we use it to mark missing values in the data
  • If the data comes with a given range, outliers are easy to find
  • If the data has no given range, outlier handling can generally be skipped
    • Unless the data is clearly unreasonable or hurts the model's training performance
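The note above about processing several columns in a loop can be sketched as follows. This is a minimal illustration, assuming each column of a matrix X is one variable and applying the box plot rule with k = 1.5 to each column; the second column is made-up data and the names X, x1, x2 and k are hypothetical:

```matlab
x1 = [48 51 57 57 49 86 48 53 59 50 48 47 53 56 60]';                 % sample from the post
x2 = [160 172 168 165 170 169 171 120 166 167 173 164 168 170 166]';  % hypothetical second variable
X = [x1 x2];          % each column is one variable, each row one sample
k = 1.5;              % box plot multiplier
for j = 1:size(X,2)
    col = X(:,j);
    Q1 = prctile(col,25);     % lower quartile (needs the Statistics and Machine Learning Toolbox)
    Q3 = prctile(col,75);     % upper quartile
    IQR = Q3 - Q1;            % interquartile range
    tmp = (col < Q1 - k*IQR) | (col > Q3 + k*IQR);   % outlier mask for this column
    X(tmp,j) = nan;           % mark outliers as missing values
end
[row, col_idx] = find(isnan(X))   % row and column positions now marked as missing
```

After the loop, X(6,1) (the value 86) and X(8,2) (the value 120) are nan, and the matrix can be passed on to whatever missing-value method is used next.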

Source of content: the fourth live broadcast of Mathematical Modeling Qingfeng, "Using MATLAB to quickly implement machine learning" (with supporting lecture notes and follow-up videos).

Reposted from: blog.csdn.net/weixin_57345774/article/details/126965835