A compilation of data processing methods

Data processing methods

The main approach is to write data processing code, mostly in Python, using libraries such as Pandas, NumPy, and Scikit-learn. Code examples are given below.

1. Missing data processing

Missing data refers to null values appearing in a row record or a column feature of the data set. Commonly used missing-value handling methods include the following:
(1) Deletion. If the proportion of missing data in a row record or a column feature exceeds a specified threshold, that row or column can be treated as an invalid record or invalid feature and deleted directly.

# Reference code
# Detect missing values
df.isnull()
# Drop missing values
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# Drop rows whose 'age' value is 0 with pandas
df.drop(index=df.age[df.age == 0].index)
# Parameters:
# axis: axis. 0 or 'index' drops rows; 1 or 'columns' drops columns.
# how: filter mode. 'any' drops the row/column if it contains at least one null value; 'all' drops it only if all its values are null.
# thresh: minimum number of non-null elements, int, default None. The row/column is dropped if it has fewer non-null elements than this value.
# subset: subset, a list of row or column labels. If axis=0 or 'index', the elements of subset are column labels; if axis=1 or 'columns', they are row labels. Only the sub-region restricted by subset is checked when deciding whether to drop a row/column.
# inplace: whether to modify in place, bool, default False. If True, the original DataFrame is modified and None is returned.

(2) Filling based on summary statistics. The statistic used depends on the type and distribution of the feature. For example, if the feature is discrete, missing values can be filled with the mode; if the feature is continuous and fairly uniformly distributed, the mean (of the global variable or of the attribute) can replace the missing values; if the feature is continuous but skewed, the median is a better choice.

# Reference code

# Option 1: use Pandas directly
# Fill with the mean
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
# Fill with the mode
data['column_name'] = data['column_name'].fillna(data['column_name'].mode()[0])
# Fill with the median
data['column_name'] = data['column_name'].fillna(data['column_name'].median())
# Forward fill (use the previous value)
data['column_name'] = data['column_name'].fillna(method='pad')
# Backward fill (use the next value)
data['column_name'] = data['column_name'].fillna(method='bfill')

# Option 2: compute the mean manually, then fill
import numpy as np
import pandas as pd

# Compute the mean of the 'column' column, skipping missing entries
c = avg = 0
for ele in df['column']:
    if pd.notnull(ele):
        c += 1
        avg += ele
avg /= c

# Replace missing values (actual NaN, not the string "NaN") with the mean
df = df.replace(to_replace=np.nan, value=avg)

# Show the data
df

(3) Filling based on interpolation. This type of method fills in missing values through random interpolation, Lagrange interpolation, polynomial interpolation, and similar techniques. For example, polynomial interpolation fits a polynomial to the existing data so that all samples conform to its distribution; the missing value of a sample is then obtained by evaluating the polynomial.

# Reference code

# Interpolation with library functions
# SciPy's interp1d performs 1-D interpolation, including linear, polynomial (cubic) and nearest-neighbour interpolation.
import numpy as np
from scipy.interpolate import interp1d

# x and y are the time axis and the column to be interpolated
x = data['time']
y = data['column']
# xnew is the new, complete time axis = [1, 2, 3, ..., 23]
xnew = np.linspace(1, 23, num=23)
# Linear interpolation
f1 = interp1d(x, y, kind='linear')  # change kind for other methods: 'cubic' for polynomial curves, 'nearest' for nearest-neighbour
ynew1 = f1(xnew)

# Custom interpolation function
import pandas as pd
from scipy.interpolate import lagrange  # Lagrange interpolation

data = pd.read_excel('data.xls')

# Column-wise interpolation function
def ploy(s, n, k=6):
    # take the k values before and after position n
    y = s[list(range(n - k, n)) + list(range(n + 1, n + 1 + k))]
    y = y[y.notnull()]
    return lagrange(y.index, list(y))(n)

for i in data.columns:
    for j in range(len(data)):
        if (data[i].isnull())[j]:
            data[i][j] = ploy(data[i], j)
data.to_excel('data_1.xls')

(4) Model-based filling, which uses a supervised or unsupervised model to fill in missing values. For example, K-nearest-neighbour filling finds the samples closest to the one with the missing value and fills it with the mean or weighted mean of those neighbours.

# Reference code
# scikit-learn 0.22 added a very convenient imputation method, KNNImputer. It fills missing values with a KNN-based estimate, which is generally more reliable than plain mean or median filling.
from sklearn.impute import KNNImputer
# n_neighbors is the number of "neighbour" samples used; here n_neighbors=2.
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(data)

# IterativeImputer: multivariate imputation that considers the overall distribution of the data in the feature space before filling the samples with missing values
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

iterimp = IterativeImputer(random_state=123)
oceandfiter = iterimp.fit_transform(oceandf)
# Get the imputed variable - here column is the 5th column
column = oceandfiter[:, 4]

# MissForest: fills missing values using the idea of random forests
# (MissForest is provided by the third-party missingpy package)
from missingpy import MissForest

forestimp = MissForest(n_estimators=100, random_state=123)
oceandfforest = forestimp.fit_transform(oceandf)
# Get the imputed variable - here column is the 5th column
column = oceandfforest[:, 4]

(5) Hot-deck filling. This method finds the sample in the data set most similar to the one with missing data and uses that sample's value to fill the gap. The crux is how to define similarity, which may differ from problem to problem. The method is conceptually simple and exploits relationships within the data to estimate the missing value, but its drawback is that the similarity criterion is hard to define and involves many subjective factors.

# Reference code

import pandas as pd
from sklearn.impute import KNNImputer

def hot_deck_imputation(dataframe: pd.DataFrame):
    # Although this uses KNNImputer, the parameter is fixed: n_neighbors=2
    hot_deck_imputer = KNNImputer(n_neighbors=2, weights="uniform")
    new_df = hot_deck_imputer.fit_transform(dataframe)
    return new_df

(6) Prediction method. This type of method uses a predictive model to estimate each missing value: the available data serve as training samples for building the model, which then predicts the missing entries. It makes the most of the known data and is a popular missing-data technique.

# Reference code

# Some approaches use ML or DL models to predict missing values, e.g. RNN-based methods.
# Yonghong Luo et al. (Nankai University) proposed GAN-based time-series imputation (E2GAN), published at NeurIPS 2018 and IJCAI 2019:
1. Luo, Yonghong, et al. "Multivariate time series imputation with generative adversarial networks." Advances in Neural Information Processing Systems. 2018.
2. Luo, Yonghong, et al. "E2GAN: end-to-end generative adversarial network for multivariate time series imputation." Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2019.
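
A minimal sketch of the prediction idea, assuming a DataFrame df whose column 'target_col' has missing values and whose other columns are complete numeric features (the column names and the choice of regressor are illustrative):

# Hypothetical example: impute one column by predicting it from the other features
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def predictive_impute(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
    known = df[df[target_col].notnull()]      # rows where the target is observed
    unknown = df[df[target_col].isnull()]     # rows to be imputed
    features = [c for c in df.columns if c != target_col]

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(known[features], known[target_col])  # train on the observed rows

    df = df.copy()
    df.loc[df[target_col].isnull(), target_col] = model.predict(unknown[features])
    return df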

2. Data resampling

For time-series data, resampling converts a series from one frequency to another. There are two cases: downsampling converts high-frequency data to a lower frequency, and upsampling does the opposite, converting low-frequency data to a higher frequency.

High-density sensors can generate large amounts of millisecond-level time-series data; resampling such data to the second, minute, or hour level compresses it and reduces the data volume.

The resample() function provided by Pandas can be used to achieve data resampling.

# Reference code

DataFrame.resample(rule, on='index_column', axis=0, fill_method=None, label='left', closed='left')
# on: resample on a specific column, e.g. resample('3D', on='trade_date').sum()
# axis=0: default is the row axis; set axis=1 for columns
# fill_method=None: how to fill when upsampling, e.g. 'ffill' (forward fill) or 'bfill' (backward fill)
# label='right': when downsampling, which edge labels the aggregated bin; e.g. is 9:30-9:35 labelled 9:30 or 9:35 (default 9:35)
# closed='right': when downsampling, which end of each interval is closed, 'right' or 'left' (default 'right')
# It is recommended to set label='left', closed='left' consistently
# The rule values are as follows:
B	business day frequency
C	custom business day frequency
D	calendar day frequency
W	weekly frequency
M	month-end frequency
SM	semi-month-end frequency (15th and end of month)
BM	business month-end frequency
CBM	custom business month-end frequency
MS	month-start frequency
SMS	semi-month-start frequency (1st and 15th)
BMS	business month-start frequency
CBMS	custom business month-start frequency
Q	quarter-end frequency
BQ	business quarter-end frequency
QS	quarter-start frequency
BQS	business quarter-start frequency
A	year-end frequency
BA	business year-end frequency
AS	year-start frequency
BAS	business year-start frequency
BH	business hour frequency
H	hourly frequency
T	minute frequency
S	second frequency
L	milliseconds
U	microseconds
N	nanoseconds

data_h = data_index.resample('H').first()  # downsample to hourly frequency, taking the first value in each bin
data_h

3. Outlier processing

When a data point clearly deviates from the distribution of the other data points, or is clearly different from them, it is judged to be an outlier; anomaly detection methods can then be used to detect and remove such outliers.

Abnormal data detection mainly includes the following methods:
(1) Methods based on statistical analysis, which judge whether a value is abnormal from the descriptive statistics and the valid range of a feature. For example, if the stipulated range of the age feature is [0, 200], a negative value or a value greater than 200 is judged abnormal.
(2) Density-based methods, which rely on the fact that the local density of an outlier is significantly lower than that of most of its neighbouring points; they are suitable for non-uniform data sets.
(3) Clustering-based methods: normal data points tend to "flock together" around dense neighbourhoods, while abnormal points deviate far away, which can be used to judge anomalies.
(4) Tree-based methods, which judge anomalies by partitioning. For example, Isolation Forest (iForest) is considered one of the most effective anomaly detection methods; it computes an anomaly score for each sample point, and a sample whose score is high and exceeds a threshold is judged abnormal.
(5) Prediction-based methods, which, for time-series data, compare the predicted curve with the real data to detect outliers.

Some specific methods are introduced as follows:

(1) Anomaly detection based on statistical distribution

A data distribution model can be created by estimating the parameters of a probability distribution. An object is an anomaly if it does not fit the model well, i.e. if it is unlikely to follow the distribution.

The 3σ rule

Assuming that a set of measurement data contains only random error, the standard deviation is computed from the raw data and an interval is determined with a given probability; errors beyond this interval are considered outliers.

The probability of falling in the interval (μ−3σ, μ+3σ) is 99.74%, so data outside this interval can be considered abnormal.

Assuming the data are generated by a normal distribution N(μ, σ), where μ is the mean of the series and σ its standard deviation, the probability of a value falling outside (μ−3σ, μ+3σ) is only 0.27%, and outside (μ−4σ, μ+4σ) only about 0.01%. Based on an understanding of the business and of the time-series curve, an appropriate K (number of standard deviations) can be chosen to raise alerts of different severity levels.

# Reference code

import numpy as np

# 3-sigma outlier detection
def three_sigma(df_col):  # df_col: one column of a DataFrame
    rule = (df_col.mean() - 3 * df_col.std() > df_col) | (df_col.mean() + 3 * df_col.std() < df_col)
    index = np.arange(df_col.shape[0])[rule]
    outrange = df_col.iloc[index]
    return outrange

# Alternative: return the lower and upper bounds
def three_sigma(s):
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3 * std, mu + 3 * std
    return lower, upper

Z-score

The Z-score is a standard score that measures the distance between a data point and the mean in units of standard deviations: if a point is 2 standard deviations from the mean, its Z-score is 2. Using |Z-score| = 3 as the threshold for removing outliers is equivalent to the 3σ rule.

# Reference code

def z_score(s):
    z_score = (s - np.mean(s)) / np.std(s)
    return z_score

Moving average (MA) method

The easiest way to identify irregular data is to flag points that deviate from the distribution's summary statistics, such as the mean, median, quantiles, or mode.
Assuming abnormal points lie a certain number of standard deviations from the mean, we can compute the local mean of the time series inside a sliding window and use it to measure the degree of deviation. This technique is known as the moving average (MA), and it is designed to smooth out short-term fluctuations and highlight long-term trends. Variants include the cumulative moving average, weighted moving average, exponentially weighted moving average, double exponential smoothing, triple exponential smoothing, and so on. Mathematically, an n-period simple moving average can also be viewed as a low-pass filter.

The method has obvious flaws: the data may contain noise that resembles abnormal behaviour, so the boundary between normal and abnormal is often unclear; and the definition of abnormal or normal may change frequently, so a threshold based on the moving average is not always applicable.

# Reference code

# Simple Moving Average (SMA) | Weighted Moving Average (WMA) | Exponentially Weighted Moving Average (EWMA)
# pandas has built-in implementations for SMA and EWMA; WMA is implemented by hand here
import numpy as np
import pandas as pd

class WMA(object):
    """
    Weighted moving average implementation
    """
    @staticmethod
    def get_wma_weights(span, flag=True):
        """
        Compute the WMA weight of each point
        """
        paras = range(1, span + 1)
        count = sum(paras)
        if flag:
            return [float(para) / count for para in paras]
        else:
            return [float(para) / count for para in paras][::-1]

    def get_wma_values(self, datas):
        """
        Compute the WMA values
        """
        wma_values = []
        wma_keys = datas.index
        for length in range(1, len(datas) + 1):
            wma_value = 0
            weights = self.get_wma_weights(length)
            for index, weight in zip(datas.index, weights):
                wma_value += datas[index] * weight
            wma_values.append(wma_value)
        return pd.Series(wma_values, wma_keys)

# Compute the deviation (variance of the residuals against the moving average)
def calculate_variance(data, moving_average):
    variance = 0
    flag_list = moving_average.isnull()
    count = 0
    for index in range(len(data)):
        if flag_list[index]:
            count += 1
            continue
        variance += (data[index] - moving_average[index]) ** 2
    variance /= (len(data) - count)
    return variance

# EWMA fit (pd.ewma was removed in newer pandas; use the ewm accessor)
ewma_line = data.ewm(span=4).mean()
# Simple moving average (pd.rolling_mean was removed; use rolling)
sma_line = data.rolling(window=4).mean()
# Weighted moving average
wma_line = WMA().get_wma_values(data)

sma_var = calculate_variance(data, sma_line)
wma_var = calculate_variance(data, wma_line)
ewma_var = calculate_variance(data, ewma_line)

Box plot (quantile-based anomaly detection)

A box plot is a statistical graph that displays the dispersion of a set of data. It mainly reflects the distribution characteristics of the raw data and can also be used to compare the distributions of several groups. It is drawn by first finding the maximum, minimum, median, and upper and lower quartiles of the data; outliers and suspected outliers are then delimited by these quantiles.

The IQR is the third quartile minus the first quartile; values greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR are considered outliers.

# Reference code

# Box-plot outlier detection function
def box_plot(Ser):
    '''
    Ser: the DataFrame column to analyse for outliers
    '''
    Low = Ser.quantile(0.25) - 1.5 * (Ser.quantile(0.75) - Ser.quantile(0.25))
    Up = Ser.quantile(0.75) + 1.5 * (Ser.quantile(0.75) - Ser.quantile(0.25))
    index = (Ser < Low) | (Ser > Up)
    Outlier = Ser.loc[index]
    return Outlier

box_plot(df['counts']).head(8)


# Alternative: return the lower and upper bounds
def boxplot(s):
    q1, q3 = s.quantile(.25), s.quantile(.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
    return lower, upper

Grubbs anomaly test

The Grubbs test is a method for finding an outlier in a sample, an outlier being a value that deviates too far from the mean. Such values may be normal data from extreme cases, or they may be errors introduced during measurement. The Grubbs test requires the population to be normally distributed.

Algorithm flow:

①. Sort the samples in ascending order.
②. Compute the sample mean and standard deviation.
③. Compute the differences between the minimum/maximum and the mean; the larger difference identifies the suspicious value.
④. Compute the z-score of the suspicious value; if it is greater than the Grubbs critical value, it is an outlier.
⑤. The Grubbs critical value is obtained from a table (it is determined by two values: the significance level α, which is smaller the stricter the test, and the sample size n). Remove the outlier and repeat steps ①-④ on the remaining sequence.
Since only an abnormal/normal judgment is needed here, it suffices to judge whether tail_avg is an outlier.

# Reference code

# Call the Grubbs test from the corresponding library directly
# Install the package: pip install outlier_utils==0.0.3
from outliers import smirnov_grubbs as grubbs
grubbs.test(data, alpha)  # two-sided test by default
# https://github.com/c-bata/outlier-utils
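
A minimal sketch of one pass of the manual procedure (steps ①-④) described above; here the critical value is computed from the t distribution rather than looked up in a table, so scipy is assumed to be available:

import numpy as np
from scipy import stats

def grubbs_one_pass(values, alpha=0.05):
    """Return (is_outlier, suspicious_value) for the most extreme point."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, std = x.mean(), x.std(ddof=1)
    # the point farthest from the mean is the suspicious value
    idx = np.argmax(np.abs(x - mean))
    g = abs(x[idx] - mean) / std                      # Grubbs statistic (z-score of the suspicious value)
    # two-sided Grubbs critical value derived from the t distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g > g_crit, x[idx]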

(2) Density-based anomaly detection

Density-based anomaly detection rests on the premise that normal data points "flock together": normal data appear in dense neighbourhoods, while abnormal points lie far away.
For this scenario, a score is computed over the nearest set of data points; the score can use Euclidean distance or another distance measure, depending on whether the data are categorical or numerical.

(1) The density of an object can be computed fairly directly, especially when a proximity measure between objects exists: objects in low-density regions are relatively far from their neighbours and may be regarded as anomalies. (2) A more sophisticated approach accounts for the fact that a data set may contain regions of different densities, and classifies a point as an outlier only if its local density is significantly lower than that of most of its neighbours.

  • The density-based anomaly score is computed as:

$$\operatorname{density}(x, k)=\left(\frac{\sum_{y \in N(x, k)} \operatorname{distance}(x, y)}{|N(x, k)|}\right)^{-1}$$

where $N(x, k)$ is the set of the k nearest neighbours of x, $|N(x, k)|$ is the size of that set, and y is a nearest neighbour of x.
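
A small sketch of this score, assuming numeric data and Euclidean distance; scikit-learn's NearestNeighbors is used to find the k nearest neighbours:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_score(X, k=5):
    """density(x, k) = inverse of the mean distance to the k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbour
    dist, _ = nn.kneighbors(X)
    mean_dist = dist[:, 1:].mean(axis=1)             # drop the zero self-distance
    return 1.0 / mean_dist                           # sparse regions (large mean distance) get a low score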

LOF

The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method that computes the local density deviation of a given data point relative to its neighbours. The anomaly score of each sample is called the local outlier factor. The score is local and depends on how isolated the sample is from its surrounding neighbourhood: locality is given by the k nearest neighbours, and distances are used to estimate local density. By comparing the local density of a sample with that of its neighbours, samples whose density is significantly lower than their neighbours' can be identified and treated as abnormal.

The local relative density (local outlier factor) of a data point P is the ratio of the average local reachable density of the points in P's neighbourhood to the local reachable density of P itself:
$$LOF_k(P)=\frac{\sum_{O \in N_k(P)} \frac{\operatorname{lrd}(O)}{\operatorname{lrd}(P)}}{|N_k(P)|}=\frac{\sum_{O \in N_k(P)} \operatorname{lrd}(O)}{|N_k(P)|} \Big/ \operatorname{lrd}(P)$$
The local reachable density of a data point P is the reciprocal of the average reachable distance from P's k nearest neighbours to P; the larger the distance, the lower the density:
$$\operatorname{lrd}_k(P)=\frac{1}{\frac{\sum_{O \in N_k(P)} \operatorname{reach\_dist}_k(P, O)}{|N_k(P)|}}$$
The k-th reachable distance from point P to point O is the maximum of O's k-distance and the distance between P and O:
$$\operatorname{reach\_dist}_k(O, P)=\max \{d_k(O), d(O, P)\}$$


The k-distance of point O, $d_k(O)$, is the distance between O and its k-th nearest neighbour.

Overall, the LOF algorithm flow is as follows:

  • For each data point, calculate its distance from all other points, sorted from closest to farthest;
  • For each data point, find its K-Nearest-Neighbor and calculate the LOF score.
# Reference code
from sklearn.neighbors import LocalOutlierFactor as LOF

clf = LOF(n_neighbors=2)
res = clf.fit_predict(data)
print(res)
print(clf.negative_outlier_factor_)

COF

COF is a variant of LOF. Compared with LOF, COF can handle outliers in low-density regions. The local density of COF is calculated from the average chaining distance. First the k nearest neighbours of each point are computed; then the Set Based Nearest Path (SBN Path) of each point is computed. For example:


Suppose k=5, so the neighbours of F are B, C, D, E, G. For F, the closest point is E, so the first element of the SBN Path is F and the second is E. The point closest to E is D, so the third element is D; the points closest to D are C and G, so the fourth and fifth elements are C and G; finally, the point closest to C is B, so the sixth element is B. The SBN Path of F is therefore {F, E, D, C, G, B}, and the corresponding distance sequence e = {e1, e2, ..., ek} is, following this example, e = {3, 2, 1, 1, 1}.
In other words, to compute the SBN Path of point p it suffices to build the minimum spanning tree of the graph formed by p and its neighbours and then run a shortest-path algorithm starting from p. With the SBN Path, the average chaining distance of point p is computed as:
$$ac\_dist(p)=\sum_{i=1}^k \frac{2(k+1-i)}{k(k+1)} \operatorname{dist}(e_i)$$
With ac_dist, COF can be calculated:
$$\operatorname{COF}(p)=\frac{ac\_dist(p)}{\frac{1}{k} \sum_{o \in N_k(p)} ac\_dist(o)}$$

# Reference code

from pyod.models.cof import COF
cof = COF(contamination=0.06,  # proportion of outliers
          n_neighbors=20,      # number of neighbours
          )
cof_label = cof.fit_predict(iris.values)  # iris data
print("Number of detected outliers:", np.sum(cof_label == 1))

(3) Cluster-based anomaly detection

Similar data points usually belong to similar groups or clusters, as determined by their distance to the local cluster centres. Normal data lie close to a cluster centre, while abnormal data lie far from any centre. Clustering is one of the most popular unsupervised learning algorithms, and cluster-based anomaly detection can be divided into two steps:

①Use a clustering algorithm to cluster the data;
②Compute the anomaly degree of each sample point: the anomaly degree of a point equals its distance to the nearest cluster centre.

  • Method 1: discard small clusters that are far away from the other clusters.
    This method can be used with any clustering algorithm, but it requires a minimum cluster size and a threshold on the distance between small clusters and the other clusters. Often the process is simplified to discarding all clusters below some minimum size. The scheme is highly sensitive to the choice of the number of clusters, and it makes it difficult to attach an outlier score to individual objects. Note that treating a group of objects as outliers extends the concept of an outlier from individual objects to groups of objects, without essentially changing anything.
  • Method 2: cluster all objects, then evaluate how strongly each object belongs to its cluster.
    • For prototype-based clustering, the distance from an object to its cluster centre can measure how strongly the object belongs to the cluster.
    • For clustering methods based on an objective function, the objective function can be used to evaluate how strongly an object belongs to any cluster.
      In a special case, an object can be classified as an outlier if removing it significantly improves the objective. For example, for K-means, removing objects that are far from their cluster centres can significantly reduce the sum of squared errors (SSE) of the clustering.

In summary, clustering builds a model of the data, and anomalies can distort that model.

K-Means clustering

K groups of similar data points are created; samples that do not belong well to these clusters (i.e. are far from every cluster centre) may be marked as abnormal.

# Reference code

from sklearn.cluster import KMeans

# Fit models with different numbers of centroids - here up to 20 clusters
n_cluster = range(1, 20)
kmeans = [KMeans(n_clusters=i).fit(data) for i in n_cluster]
scores = [kmeans[i].score(data) for i in range(len(kmeans))]
# Choose 15 centroids and add the cluster labels to the DataFrame
df['cluster'] = kmeans[14].predict(data)

# Distance between each point and its nearest centroid; the largest distances are treated as anomalies
# (getDistanceByPoint is a helper, not shown here, that returns those distances;
#  outliers_fraction is the assumed proportion of outliers)
distance = getDistanceByPoint(data, kmeans[14])
# Number of anomalies
number_of_outliers = int(outliers_fraction * len(distance))
# Threshold for the anomaly decision: distances above it are judged anomalous
threshold = distance.nlargest(number_of_outliers).min()
# anomaly21 holds the result: 0 = normal, 1 = anomaly
df['anomaly21'] = (distance >= threshold).astype(int)

DBSCAN

Outliers can be found through clustering

  • 1 core idea: density based.
    The DBSCAN algorithm finds all dense regions of sample points and treats each dense region as a cluster.

  • 2 algorithm parameters: the neighbourhood radius R and the minimum number of points minpoints.
    They define what "dense" means: a neighbourhood is dense when the number of points within radius R is at least minpoints.

  • 3 types of points: core points, boundary points and noise points.
    Core point: a point with at least minpoints sample points within its neighbourhood radius R.
    Boundary point: a point that is not a core point but lies in the neighbourhood of some core point.
    Noise point: a point that is neither a core point nor a boundary point.

  • 4 kinds of relations between points: directly density-reachable, density-reachable, density-connected, not density-connected.
    Directly density-reachable: if P is a core point and Q lies in the R-neighbourhood of P, then Q is directly density-reachable from P. Every core point is directly density-reachable from itself. The relation is not symmetric: Q being directly density-reachable from P does not imply the reverse.
    Density-reachable: if there are core points P2, P3, ..., Pn such that P2 is directly density-reachable from P1, P3 from P2, ..., and Q from Pn, then Q is density-reachable from P1. This relation is also not symmetric.
    Density-connected: if there is a core point S from which both P and Q are density-reachable, then P and Q are density-connected. This relation is symmetric, and two density-connected points belong to the same cluster.
    Not density-connected: two points that are not density-connected belong to different clusters, or one of them is a noise point.

The anomaly judgment is the same as for the k-means approach: abnormal points are found in the course of clustering. Unlike k-means, DBSCAN does not need the number of clusters k as input and can find clusters of arbitrary shape.

# Reference code

# Use DBSCAN from sklearn directly
from sklearn.cluster import DBSCAN
import numpy as np
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)

clustering.labels_
# array([ 0,  0,  0,  1,  1, -1])
# 0, 0, 0: the first three samples form one cluster
# 1, 1: the middle two samples form another cluster
# -1: the last sample is an anomaly and belongs to no cluster

(4) Tree-based anomaly detection

Such methods fall into the category of partition-based methods.

The simplest partition-based method is threshold detection, which sets thresholds from human experience and judges data abnormal accordingly.
Specifically, to avoid false alarms caused by single-point jitter, the threshold is applied to a windowed mean rather than to individual points. The accumulation over a window of size w is
$$\hat{x}(t)=\frac{x_t+x_{t-1}+\ldots+x_{t-w+1}}{w}$$
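
A minimal sketch of this windowed threshold check, assuming a pandas Series s and illustrative values for the window size and threshold:

import pandas as pd

def window_threshold_alerts(s: pd.Series, w: int = 5, threshold: float = 10.0) -> pd.Series:
    """Flag points where the mean of the last w values exceeds the threshold."""
    window_mean = s.rolling(window=w).mean()  # x_hat(t) = (x_t + ... + x_{t-w+1}) / w
    return window_mean > threshold            # boolean Series of alerts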
A more advanced partition-based anomaly detection algorithm is iForest (Isolation Forest), an ensemble-based fast anomaly detection method with linear time complexity and high precision. Compared with LOF and One-Class SVM it uses less memory and is faster. The principle is as follows: the data points of the series are recursively partitioned into trees, and points that are isolated at a shallower depth are easier to separate, i.e. more likely to be outliers.
The algorithm does not describe how a sample differs from the others with distance or density measures; it directly models "isolation". Suppose we have a set of one-dimensional data and want to split it so that points A and B end up alone. We randomly pick a value X between the minimum and maximum and split the data into the groups < X and >= X, then repeat this step inside each group until the data can no longer be split. Very dense clusters are cut many times before the cutting stops, i.e. before every point sits alone in its own subspace, whereas sparsely distributed points mostly end up alone in a subspace very early. This is how Isolation Forest performs anomaly detection.


# Reference code

from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

data = load_iris(as_frame=True)
X, y = data.data, data.target
df = data.frame

# Train the model
iforest = IsolationForest(n_estimators=100, max_samples='auto',
                          contamination=0.05, max_features=4,
                          bootstrap=False, n_jobs=-1, random_state=1)

# fit_predict trains and predicts in one call; -1 means anomaly, 1 means normal
df['label'] = iforest.fit_predict(X)

# decision_function returns the anomaly score
df['scores'] = iforest.decision_function(X)

(5) Anomaly detection based on dimensionality reduction

PCA

PCA is a linear method. The eigenvectors obtained from the eigen-decomposition reflect the different directions of variance in the original data, and the eigenvalues are the variances of the data in those directions. The eigenvector with the largest eigenvalue is therefore the direction of greatest variance, and the eigenvector with the smallest eigenvalue the direction of least variance. How the variance varies across directions reflects the intrinsic characteristics of the data; if a single sample is not consistent with these overall characteristics, for example if it deviates strongly from the other samples in certain directions, it may be an outlier.

  • One PCA-based approach is to select k eigenvectors and compute the reconstruction error of each sample after projection onto them; the reconstruction error of normal points should be smaller than that of abnormal points.
  • Similarly, the weighted Euclidean distance from each sample to the hyperplane spanned by the k selected eigenvectors can be computed (the smaller the eigenvalue, the larger the weight).
  • It is also possible to analyse the covariance matrix directly and use the Mahalanobis distance of a sample (its distance to the centre of the distribution, taking correlations between features into account) as the degree of abnormality; this can be understood as a soft version of PCA (Soft PCA).
# Reference code

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# centered_training_data / training_data / features are assumed to be prepared beforehand
pca = PCA()
pca.fit(centered_training_data)
transformed_data = pca.transform(training_data)
y = transformed_data

# Compute anomaly scores
lambdas = pca.singular_values_
M = ((y * y) / lambdas)

# First q eigenvectors and last r eigenvectors
q = 5
print("Explained variance by first q terms: ", sum(pca.explained_variance_ratio_[:q]))
q_values = list(pca.singular_values_ < .2)
r = q_values.index(True)

# Sum the distances for each sample point
major_components = M[:, range(q)]
minor_components = M[:, range(r, len(features))]
major_components = np.sum(major_components, axis=1)
minor_components = np.sum(minor_components, axis=1)

# Manually set the c1 and c2 thresholds
components = pd.DataFrame({'major_components': major_components,
                           'minor_components': minor_components})
c1 = components.quantile(0.99)['major_components']
c2 = components.quantile(0.99)['minor_components']

# Build the classifier
def classifier(major_components, minor_components):
    major = major_components > c1
    minor = minor_components > c2
    return np.logical_or(major, minor)

results = classifier(major_components=major_components, minor_components=minor_components)

Autoencoder

Because PCA is linear, it is quite limited in the kinds of features it can extract. In recent years many neural-network methods have been applied to time-series anomaly detection; the autoencoder, for instance, overcomes these limitations by introducing the inherent non-linearity of neural networks.

An autoencoder (Auto-Encoder) is an unsupervised learning model. Based on back-propagation and an optimization method such as gradient descent, it uses the input data $X$ itself as supervision to guide the network to learn a mapping that produces a reconstructed output $X^R$. In time-series anomaly detection, anomalies are a minority compared with normal data, so if the difference between the reconstruction $X^R$ and the original input exceeds a certain threshold, the original series is considered to contain an anomaly. (If the samples are numerical, MSE or MAE can be used as the measure; the larger a sample's reconstruction error, the more likely it is abnormal.)

The model consists of two main parts: an Encoder and a Decoder.

The encoder compresses the high-dimensional input $X$ into a low-dimensional latent variable $h$, forcing the network to learn the most informative features; the decoder restores the latent variable $h$ to the original dimension. Ideally the decoder output reproduces the original input perfectly or approximately, i.e. $X^R \approx X$.

# Reference code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers

# dataset_train / dataset_test are assumed to be prepared DataFrames
# Scale the data
scaler = preprocessing.MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(dataset_train),
                       columns=dataset_train.columns,
                       index=dataset_train.index)
# Random shuffle of the training data
X_train = X_train.sample(frac=1)
X_test = pd.DataFrame(scaler.transform(dataset_test),
                      columns=dataset_test.columns,
                      index=dataset_test.index)

tf.random.set_seed(10)
act_func = 'relu'
# Input layer:
model = Sequential()
# First hidden layer, connected to input vector X.
model.add(Dense(10, activation=act_func,
                kernel_initializer='glorot_uniform',
                kernel_regularizer=regularizers.l2(0.0),
                input_shape=(X_train.shape[1],)
               )
         )
model.add(Dense(2, activation=act_func,
                kernel_initializer='glorot_uniform'))
model.add(Dense(10, activation=act_func,
                kernel_initializer='glorot_uniform'))
model.add(Dense(X_train.shape[1],
                kernel_initializer='glorot_uniform'))
model.compile(loss='mse', optimizer='adam')
print(model.summary())

# Train model for 100 epochs, batch size of 10:
NUM_EPOCHS = 100
BATCH_SIZE = 10
history = model.fit(np.array(X_train), np.array(X_train),
                    batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    validation_split=0.05,
                    verbose=1)

plt.plot(history.history['loss'],
         'b',
         label='Training loss')
plt.plot(history.history['val_loss'],
         'r',
         label='Validation loss')
plt.legend(loc='upper right')
plt.xlabel('Epochs')
plt.ylabel('Loss, [mse]')
plt.ylim([0, .1])
plt.show()

# Inspect the reconstruction-error distribution on the training set to decide the normal error range
X_pred = model.predict(np.array(X_train))
X_pred = pd.DataFrame(X_pred,
                      columns=X_train.columns)
X_pred.index = X_train.index

scored = pd.DataFrame(index=X_train.index)
scored['Loss_mae'] = np.mean(np.abs(X_pred - X_train), axis=1)
plt.figure()
sns.distplot(scored['Loss_mae'],
             bins=10,
             kde=True,
             color='blue')
plt.xlim([0.0, .5])

# Compare test-set errors against the threshold to find anomalies
X_pred = model.predict(np.array(X_test))
X_pred = pd.DataFrame(X_pred,
                      columns=X_test.columns)
X_pred.index = X_test.index
threshold = 0.3
scored = pd.DataFrame(index=X_test.index)
scored['Loss_mae'] = np.mean(np.abs(X_pred - X_test), axis=1)
scored['Threshold'] = threshold
scored['Anomaly'] = scored['Loss_mae'] > scored['Threshold']
scored.head()

(6) Classification-based anomaly detection

OneClassSVM

SVM (Support Vector Machine) is an effective technique for anomaly detection. SVM is usually associated with supervised learning: it is a class of generalized linear classifiers for binary classification whose decision boundary is the maximum-margin hyperplane fitted to the training samples.

There is, however, an extension (One-Class SVM) that can identify anomalies as an unsupervised problem, where the training data are unlabelled. The algorithm learns a soft boundary around the normal training instances and then flags test instances that fall outside the learned region as anomalies. Depending on the use case, the output of an anomaly detector can be a numeric score to be thresholded with domain knowledge, or a label (e.g. binary or multi-label).

One-Class SVM finds the hyperplane based on a single class of data (the normal data), adapting the SVM objective of maximizing the margin against negative samples, and thus performs anomaly detection in an unsupervised way. It can be understood as a novelty detection algorithm: everything that differs from the normal data is treated as novel, the boundary is set according to practical needs, and only data beyond the boundary are considered abnormal.

# Reference code

from sklearn import svm
# Fit the model
clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1)
clf.fit(X)
y_pred = clf.predict(X)
# Number of detected outliers (predicted as -1)
n_error_outlier = y_pred[y_pred == -1].size

(7) Prediction-based anomaly detection

For a single time series, compare the predicted curve with the real data, compute the residual at each point, model the residual series, and detect anomalies on it with methods such as K-sigma or quantiles.

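A minimal sketch of this process, using a naive rolling-mean forecast as the predictor and a 3-sigma rule on the residuals (the forecasting model and window size are illustrative assumptions):

import pandas as pd

def prediction_based_anomalies(s: pd.Series, window: int = 12, k: float = 3.0) -> pd.Series:
    """Flag points whose residual against a rolling-mean forecast exceeds k sigma."""
    forecast = s.rolling(window=window).mean().shift(1)  # predict each point from the previous window
    residual = s - forecast                              # residual series
    mu, sigma = residual.mean(), residual.std()
    return (residual - mu).abs() > k * sigma             # boolean Series: True = anomaly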

Data Fusion Processing

Fusing the data produced by multi-dimensional sensors yields data that are more accurate, more complete and more reliable than any single source. Data fusion consists of two steps: preprocessing and fusion.

  1. Preprocessing
    1) External correction removes the influence of external noise such as terrain, weather, air pressure and wind speed on the resulting data; its purpose is mainly to remove the effect of external random factors on the consistency of the measurements.
    2) Internal calibration removes the influence of differences in sensor parameters such as sensitivity and resolution; its purpose is mainly to eliminate differences between the data obtained by different sensors.

  2. Data fusion
    According to the purpose and level of fusion, an appropriate fusion algorithm is selected to combine the extracted features or multi-dimensional data into a representation or estimate that is more accurate than any single sensor provides.

According to the level of the objects being fused, data fusion is divided into decision-level fusion, feature-level fusion and data-level fusion.

1) Data-level fusion
Operates on the front-end data, i.e. the raw data collected by the sensors; this is the lowest-level fusion. Commonly used data-level fusion methods include the wavelet transform method, algebraic methods, the Kauth-Thomas Transformation (KT), and so on.

2) Feature-level fusion
Feature-level fusion is oriented to the features of the monitored object: feature information reflecting the attributes of things is extracted from the raw sensor data for comprehensive analysis and processing. It is the intermediate level of data fusion.
The general process is: preprocess the data, extract features, fuse the extracted features, and finally interpret the attributes of the fused data.

3) Decision-level fusion
On top of the two lower levels, feature extraction, data classification and logical operations are performed to support managers' decisions; this is the highest level of data fusion. It is fault-tolerant and has good real-time performance: decisions can still be made when one or several sensors fail.
The general process is: preprocess the data, extract features, interpret the attributes of the features, fuse the attributes, and finally interpret the fused attributes.

Fusing multi-sensor measurements through different fusion operations reduces the amount of data to store and lowers the data resolution, while still retaining all the information required after fusion.

# Reference code

# A simple example of fusing measurements of one quantity from two sensors with a Kalman-style update

# Kalman gain = error of data 1 / (error of data 1 + error of data 2)
# the errors correspond to variances
def kalman_gain(e1, e2):
    return e1 / (e1 + e2)

# estimate = estimate of data 1 + gain * (measurement of data 2 - estimate of data 1)
def now_estimated_value(X1, K, X2):
    return X1 + K * (X2 - X1)

# updated estimation error = (1 - Kalman gain) * error of data 1 + Kalman gain * error of data 2
def now_estimated_error(K, e1, e2):
    return (1 - K) * e1 + K * e2

# Loop body (e1, e2, X1, X2 are the current errors and values of the two sensors)
K = kalman_gain(e1, e2)
X_k = now_estimated_value(X1, K, X2)
e_EST = now_estimated_error(K, e1, e2)

Data Dimensionality Reduction Processing

When analysing big data with many variables, there is a wealth of information and the variables may be correlated, which increases the complexity of the analysis. Analysing each indicator in isolation fails to exploit the information in the data fully, while blindly reducing the number of indicators loses a lot of useful information and may even lead to wrong conclusions.

Consider instead transforming the closely related variables into as few new, mutually uncorrelated variables as possible, so that a smaller set of composite indicators can represent the information contained in the original variables. The goal is to reduce the number of indicators to analyse while minimizing the loss of information contained in the original indicators, so that the collected data can still be analysed comprehensively.

PCA analysis

The essence of Principal Component Analysis (PCA) is to map high-dimensional data into a lower-dimensional space through a linear mapping, maximizing the variance of the data in the projected dimensions, so that relatively few dimensions retain as much of the character of the original data points as possible. The main idea is to map n-dimensional features onto k dimensions; these k dimensions are new orthogonal features, also known as principal components, reconstructed from the original n-dimensional features. PCA finds a set of mutually orthogonal axes one by one, and the choice of the new axes depends closely on the data itself: the first new axis is the direction of greatest variance in the original data, the second is the direction of greatest variance in the plane orthogonal to the first axis, the third has the greatest variance among directions orthogonal to the first two, and so on, until n such axes are obtained. Most of the variance is contained in the first k axes, and the later axes contain almost none, so the remaining axes can be ignored and only the first k kept. This effectively retains the dimensions that carry most of the variance and ignores the feature dimensions whose variance is almost zero, achieving dimensionality reduction of the data features.

Computationally, the covariance matrix of the data matrix is calculated, its eigenvalues and eigenvectors are obtained, and the matrix formed by the eigenvectors corresponding to the k largest eigenvalues (i.e. the largest variances) is selected. The data matrix is then projected into the new space, reducing the dimensionality of the data features.

# Reference code

# Call PCA from sklearn directly
# X_new is the standardized data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # n_components=2 reduces, e.g., three-dimensional data to two dimensions
pca.fit(X_new)  # fit the model on the standardized data
X_transformed = pca.transform(X_new)  # reduce the dimensionality of the standardized data
print(X_transformed)
print(pca.components_)  # linear-combination coefficients
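
As a sketch of the computation described above (covariance matrix, eigen-decomposition, top-k projection), assuming X_new is a standardized NumPy array of shape (n_samples, n_features):

import numpy as np

def pca_manual(X, k=2):
    """Project X onto the k eigenvectors of its covariance matrix with the largest eigenvalues."""
    cov = np.cov(X, rowvar=False)           # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigen-decomposition of the symmetric covariance matrix
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in descending order
    top_k = eigvecs[:, order[:k]]           # eigenvectors of the k largest eigenvalues
    return X @ top_k                        # projected (dimension-reduced) data

X_reduced = pca_manual(X_new, k=2)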

Data standardization

Data standardization maps data into a specified interval by some method and ratio. Depending on the function used, methods can be divided into three categories: linear, broken-line (piecewise-linear), and curved dimensionless methods. Some raw data sets are dimensional data that have not been transformed; if they are fed directly into a model, features with different scales slow down convergence, and when the scale differences are large the model may ignore features with smaller scales and fail to reach the desired performance. Therefore, before model training, the data should be converted to dimensionless form by standardization to eliminate the influence of scale on the model. Common standardization methods are as follows:

(1) Min-max standardization (normalization): this method maps data based on the two extreme values in the sample: the maximum is mapped to 1, the minimum to 0, and the other values are distributed in between. For each attribute A, let minA and maxA be its minimum and maximum; an original value x of A is mapped to a value x' in the interval [0, 1] by min-max standardization: new value = (original value - minimum) / (maximum - minimum).
$$X_{new}=\frac{X-X_{\min}}{X_{\max}-X_{\min}}$$

# Reference code

import pandas
from sklearn import preprocessing
# feature_range sets the target interval, default (0, 1)
min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(0, 1))
# Scale (map) the data to the fixed interval
scaled_data = min_max_normalizer.fit_transform(data)
# Convert the transformed data into a DataFrame
price_frame_normalized = pandas.DataFrame(scaled_data)

(2) Z-score standardization: the mean is normalized to 0 and the variance to 1. The data are standardized using the mean and standard deviation of the original data; an original value x of A is normalized to x' by the z-score. The method is suitable when the maximum and minimum of attribute A are unknown, or when there are outliers beyond the value range. The formula is: new value = (original value - mean) / standard deviation.
$$X_{new}=\frac{X-\mu}{\sigma}$$

# Reference code

from sklearn import preprocessing
# Z-score standardization
zscore_scaler = preprocessing.StandardScaler()
data_scaler_1 = zscore_scaler.fit_transform(data)

(3) Regularization (sample normalization): data regularization scales some norm of each sample (e.g. the L1 norm) to 1. The process is applied per sample: each sample is scaled to unit norm.

# Reference code

def proportional_normalization(value):
    """Proportional normalization
    formula: value / sum
    returns values in [0, 1]
    """
    new_value = value / value.sum()
    return new_value
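
For the per-sample unit-norm scaling described above, scikit-learn's normalize function can also be used; a small sketch (the toy data and the L1/L2 choice are illustrative):

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 4.0]])
X_l1 = normalize(X, norm='l1')  # each row sums to 1 (in absolute value)
X_l2 = normalize(X, norm='l2')  # each row has Euclidean norm 1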

(4) Log transformation: this method transforms a sample sequence that meets certain conditions with the logarithm function; the formula is shown below.
$$X_{new}=\frac{\log_{10} X}{\log_{10} X_{\max}}$$
where $X=\{x_1, x_2, \cdots, x_i, \cdots, x_n\}$ is the original feature sequence and each $x_i \geq 1$.

# Reference code

import numpy as np

def log_transfer(value):
    """Log transformation; requires all original values to be greater than 1
    formula: log10(x) / log10(max)
    returns values in [0, 1]
    """
    new_value = np.log10(value) / np.log10(value.max())
    return new_value

(5) Logistic function transformation: this method transforms the data with the Sigmoid function; the formula is as follows.
$$X_{new}=\frac{1}{1+e^{-x}}$$

# Reference code

import numpy as np

def sigmoid(value):
    """Logistic transformation using the Sigmoid function"""
    new_value = 1.0 / (1 + np.exp(-value))
    return new_value

Feature engineering

Feature engineering refers to analysing and transforming raw data to obtain a better representation for the target task, and it is an essential part of building a good model. After data preprocessing, the data therefore need to be analysed and processed with a series of feature-engineering methods to mine key information and improve the stability and robustness of the model. Commonly used feature-engineering methods include feature encoding, correlation analysis, and feature selection.

1. Feature encoding

  1. One-hot encoding
    One-hot encoding is one of the most common encoding methods; it maps categorical features to vectors containing only 0s and 1s. If a categorical feature has n distinct categories, an n-entry vocabulary is built before encoding. Encoding the i-th category of the vocabulary produces an n-dimensional vector whose i-th position is 1 and all other positions are 0. One-hot encoding is simple and easy to use, but it cannot express correlations between categories, which affects how well the model fits its parameters to some extent, and when the number of categories is huge the encoded data become very sparse, slowing training and hurting results. (See the encoding sketch after this list.)

  2. Word2vec encoding
    Word2vec encoding first appeared in NLP (Natural Language Processing) and, because it works well, has been widely adopted in other fields. It overcomes the shortcomings of one-hot encoding: categorical features are mapped to vectors that carry contextual relevance, have a fixed size, and do not form a sparse matrix. Word2vec is based on the CBOW (Continuous Bag-of-Words) and Skip-Gram models. Each embedding vector is updated automatically during model training to obtain the best context-aware representation, and a fixed encoding vector is obtained once training finishes. (See the encoding sketch after this list.)
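
A small sketch of both encodings; the pandas one-hot call is standard, while the Word2vec part assumes the third-party gensim library and uses a toy corpus (all names and values are illustrative):

import pandas as pd
from gensim.models import Word2Vec  # assumes gensim (>=4.0) is installed

# One-hot encoding of a categorical column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
one_hot = pd.get_dummies(df['color'], prefix='color')

# Word2vec: learn a small embedding for each category from "sentences" of categories
sentences = [['red', 'green'], ['blue', 'red'], ['green', 'blue', 'red']]
w2v = Word2Vec(sentences, vector_size=8, window=2, min_count=1, sg=1)
red_vector = w2v.wv['red']  # 8-dimensional embedding of the category 'red'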

2. Feature correlation analysis

Data features generally have some degree of linear or non-linear correlation, and traditional models such as SVM and LR have difficulty learning these correlations between features. Auxiliary methods are therefore used to analyse feature correlation; based on the results, together with domain knowledge and a professional understanding of the business problem, key features that better describe the target can be constructed through feature combination, feature crossing, or addition, subtraction, multiplication and division. Common correlation-analysis methods include the Pearson correlation coefficient and the maximal information coefficient.

  1. Pearson correlation coefficient
    The Pearson correlation coefficient measures the degree of linear correlation between two features. Given two feature variables X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) in the data set, the Pearson coefficient of X and Y is computed by the formula below. Its value lies in [-1, 1]: zero means no linear correlation, positive values mean positive correlation, negative values mean negative correlation, and the larger the absolute value, the stronger the correlation between the two features (a usage sketch follows this list).
    $$r=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2} \sqrt{\sum_{i=1}^n\left(y_i-\bar{y}\right)^2}}$$

where $x_i$ and $y_i$ are the values of feature variables X and Y in the i-th sample, $\bar{x}$ and $\bar{y}$ are the means of X and Y, and n is the sample size.

  2. Maximal information coefficient
    The maximal information coefficient (MIC) measures the degree of linear or non-linear correlation between two features; the larger the MIC, the stronger the correlation. Its formula is given below. The idea of MIC is as follows: first, the data points are scattered into the two-dimensional plane according to the values of the two features; then the plane is gridded at a given resolution, and the maximal mutual information over the different partitions at that resolution is computed and normalized; finally, the maximum of these values over all grid resolutions is taken as the MIC.
    $$\operatorname{MIC}(D)=\max_{a \times b<B(n)} \frac{I(a \times b, X, Y)}{\log (\min (a, b))}$$

where $I(a \times b, X, Y)$ is the maximal mutual information between feature variables X and Y under grid resolution $a \times b$, $\log(\min(a, b))$ is the normalization factor, and $B(n)$ is the maximum grid resolution.
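
A brief usage sketch for both measures; Pearson uses SciPy, while MIC assumes the third-party minepy package (the synthetic data are illustrative):

import numpy as np
from scipy.stats import pearsonr
from minepy import MINE  # assumes minepy is installed

x = np.random.rand(100)
y = 2 * x + np.random.rand(100) * 0.1

# Pearson correlation coefficient and p-value
r, p_value = pearsonr(x, y)

# Maximal information coefficient
mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)
mic_value = mine.mic()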

3. Feature selection

High-dimensional features are filtered by combining domain expertise with the requirements of the task, so that the needed features can be selected for subsequent steps such as model training. Commonly used feature-selection methods are:
(1) Variance selection: compute the variance of each feature column and decide, according to a preset threshold, whether the feature should be kept or dropped. If the variance of a column is very small, its values barely change and contribute nothing to model training, so the column can be deleted.
(2) Tree-model selection: select features with tree models (such as XGBoost or RF), which score the importance of each feature by its information gain; features with high importance are kept.
(3) Recursive feature elimination: choose a base model (such as SVM or LR) and train it on the data set for several rounds; after each round, delete the features with the lowest weights and train again on the remaining features, repeating until the number of remaining features matches the preset number.
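
A short sketch of the three approaches with scikit-learn; the thresholds, estimators and number of selected features are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# (1) Variance selection: drop features whose variance is below the threshold
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# (2) Tree-model selection: rank features by importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_

# (3) Recursive feature elimination with a linear base model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
selected_mask = rfe.support_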


Origin blog.csdn.net/m0_56075892/article/details/128306954