Understanding noisy data, with examples of Min-Max normalization and Z-score normalization (zero-mean normalization) [Data Preprocessing]


1. Noisy data

Noisy data is meaningless data; the term is often used as a synonym for corrupted data. Common techniques for handling it:

1. Binning:

  • Smooths the values of sorted data by consulting their "neighbors" (the values around them); this is local smoothing.

2. Regression:

  • Fit a function (a regression function) to the data to smooth it.

3. Clustering:

  • Cluster similar values into clusters.

4. Others:

  • E.g., data reduction, discretization, and concept hierarchies.

1.1 Binning

Binning smooths the values of sorted data by consulting their "neighbors" (the values around them). It performs local smoothing.

  • Partitioning: equal-frequency or equal-width
  • Smoothing: by bin means, by bin medians, or by bin boundaries (replacing each value in the bin accordingly)
  • The minimum and maximum values in a bin are treated as the bin boundaries; each value in the bin is then replaced by the nearest boundary value. A minimal sketch of these steps follows below.
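Below is a minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, assuming a small, already-sorted list of values (the numbers are illustrative):

import numpy as np

# Sorted data to smooth (illustrative values)
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency partitioning: three bins with three values each
bins = np.array_split(data, 3)

# Smoothing by bin means: every value in a bin becomes the bin's mean
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of the bin's min/max
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print(np.concatenate(by_means))   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(np.concatenate(by_bounds))  # [ 4  4 15 21 21 24 25 25 34]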

1.2 Smoothing data by binning

[Figure: example of smoothing sorted data by binning]

1.3 Handling noisy data: regression and clustering

1. Regression:

Fit a function (a regression function) to the data to smooth it.

  • linear regression

  • multiple linear regression

2. Clustering: cluster similar values into clusters; values that fall outside the clusters can be detected as outliers.

Note that noisy data can be useful: in some applications (e.g., fraud detection) the outliers themselves are the signal of interest.

1.4 Regression

[Figure: fitting a regression function (line) to the data]
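A minimal sketch of regression-based smoothing on synthetic data (the trend, noise level, and variable names are all illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * x.ravel() + 1 + rng.normal(0, 2, size=50)  # linear trend plus noise

# Fit a regression function, then replace each noisy value with its fitted value
reg = LinearRegression().fit(x, y)
y_smooth = reg.predict(x)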

1.5 Cluster analysis

[Figure: cluster analysis; values falling outside all clusters are potential outliers]
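A minimal sketch of clustering-based outlier detection, assuming synthetic 2-D data with one planted outlier (the distance threshold is an illustrative choice, not a fixed rule):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),   # cluster around (0, 0)
               rng.normal(8, 1, size=(50, 2)),   # cluster around (8, 8)
               [[20.0, 20.0]]])                  # one planted outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points unusually far from their cluster center as suspected outliers
threshold = dist.mean() + 3 * dist.std()
print(np.where(dist > threshold)[0])             # [100] -> the planted outlier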

1.6 Data cleaning as a process

1.6.1 Discrepancy detection

1. Use "metadata": data about data.

  • For example: what is the data type of each attribute? What is its domain?

2. Encoding format: watch for inconsistent usage and inconsistent data representations

  • Example: dates "2015/12/08" and "08/12/2015"

3. Field overloading:

  • Developers squeeze the definition of a new attribute into unused portions (e.g., unused bits) of attributes that are already defined

4. Uniqueness rules:

  • Each value of a given attribute must differ from all other values of that attribute.

5. Consecutive rules:

  • There can be no missing values between the lowest and highest values of the attribute, and all values must also be unique (e.g., check numbers)

6. Null value rules:

  • Specify how blanks, question marks, special characters, or other strings indicate the null condition (for example, when a value for a given attribute is not available), and how such values should be handled.

1.6.2 Data transformation (correcting discrepancies)

  • Data scrubbing tools: use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect and correct errors in the data. These tools rely on parsing and fuzzy matching when cleaning data from multiple sources.
  • Data auditing tools: find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate them.
  • Data migration tools: allow simple transformations to be specified.
  • ETL (extraction/transformation/loading) tools: allow users to specify transformations through a graphical user interface.
  • Typically, these tools support only a restricted set of transformations.

1.6.3 Iteration

  • Discrepancy detection and data transformation (correcting the discrepancies) are executed as a two-step iterative process.
  • Several iterations are usually required before the result is satisfactory.

1.6.4 Increasing interactivity

  • Data cleaning tools:
    • Kettle is an open-source data cleaning tool
  • Declarative languages for specifying data transformation operations have also been developed

2. Data integration and transformation

  • Data integration merges data from multiple sources into a consistent data store, such as a data warehouse.
  • The source data may include multiple databases, data cubes, or flat files.
  • Data transformation converts or consolidates the data into forms appropriate for mining.

2.1 Data integration

1. Entity identification

  • Metadata can help avoid errors

2. Attribute redundancy and correlation analysis

  • Correlation analysis (see the sketch after this list)

3. Duplicate data (tuple redundancy)

4. Detection and resolution of data value conflicts

  • Representations, scales, or encodings may differ
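A minimal sketch of correlation analysis for spotting redundant attributes, using synthetic data (the variable names and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 2.0 * a + rng.normal(scale=0.1, size=100)  # b is nearly a linear copy of a

# A correlation coefficient close to 1.0 suggests one attribute may be redundant
print(np.corrcoef(a, b)[0, 1])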

2.2 Data transformation

1. Smoothing: remove noise from the data. Techniques include binning, regression, and clustering.

2. Aggregation: summarize or aggregate the data.

3. Data generalization: use concept hierarchies to replace low-level or "raw" data with higher-level concepts.

4. Normalization: scale attribute values so that they fall within a small, specified range. Methods include Min-Max, Z-score, and decimal scaling.

5. Attribute construction (feature construction): build new attributes from the given ones and add them to the attribute set to aid the mining process; this can improve accuracy and the understanding of high-dimensional structure (see the sketch below).
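A minimal sketch of attribute construction, assuming a hypothetical table with width and height columns:

import pandas as pd

# Hypothetical data: construct an "area" attribute from existing attributes
df = pd.DataFrame({"width": [2.0, 3.5, 1.2], "height": [4.0, 2.0, 5.5]})
df["area"] = df["width"] * df["height"]  # the new attribute joins the attribute set
print(df)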

Data cube aggregation

[Figure: data cube aggregation]
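A minimal sketch of the aggregation idea, assuming hypothetical quarterly sales that are rolled up to yearly totals:

import pandas as pd

# Hypothetical quarterly sales rolled up to yearly totals
sales = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 312, 428, 390, 610],
})
print(sales.groupby("year")["amount"].sum())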

Concept hierarchies

[Figures: examples of concept hierarchies]

2.3 Normalization

2.3.1 Min-Max normalization

The normalization formula:

v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

where min_A and max_A are the minimum and maximum values of attribute A, and [new_min_A, new_max_A] is the target interval.
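For example (an illustrative calculation): mapping the value 73,600 of an income attribute with min_A = 12,000 and max_A = 98,000 to the range [0.0, 1.0] gives (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716.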

2.3.2 Min-Max normalization example code (wine dataset)

1. Preparation: load the wine dataset and extract the data:

from sklearn.datasets import load_wine

# Load the wine dataset: X holds the features, y the class labels
wine = load_wine()
X = wine.data
y = wine.target

2. Create the support vector machine classifier:

from sklearn.svm import SVC

svm = SVC()  # avoid shadowing the sklearn.svm module with the instance name

3. Fit the model and print the training score on the raw data:

svm.fit(X, y)
print("SVM training score:", svm.score(X, y))

The output is:

[screenshot: SVM training score on the raw data]

4. Apply Min-Max normalization manually:

# Scale each feature column to [0, 1] using its own min and max
wine_minmax_X = X.copy()
for i in range(X.shape[1]):  # the wine dataset has 13 feature columns
    column = X[:, i]
    wine_minmax_X[:, i] = (column - column.min()) / (column.max() - column.min())
print(wine_minmax_X)
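For reference, a sketch showing that sklearn's built-in MinMaxScaler (assuming X and wine_minmax_X from the steps above) produces the same result as the manual loop:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_minmax = MinMaxScaler().fit_transform(X)
print(np.allclose(X_minmax, wine_minmax_X))  # True: both scale each column to [0, 1]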

5. Control group: refit on the raw data and print its training score:

svm.fit(X, y)
print("SVM training score:", svm.score(X, y))

6. SVM training score after Min-Max normalization:

svm.fit(wine_minmax_X, y)
print("SVM training score after normalization:", svm.score(wine_minmax_X, y))

7. Result: normalization improves the training score considerably.

[screenshot: training scores before and after Min-Max normalization]

2.3.3 Drawbacks

1. Outliers, if present, can distort the normalization.

2. If new data arrives after normalization and falls outside the original interval [min_A, max_A], an "out of bounds" error occurs.

2.3.4 Z-score normalization (zero-mean normalization)

Z-score normalization (zero-mean normalization): the values of attribute A are normalized based on the mean and standard deviation of A.

v' = (v − Ā) / σ_A

where Ā and σ_A are the mean and standard deviation of attribute A.
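For example (an illustrative calculation): with a mean of 54,000 and a standard deviation of 16,000 for an income attribute, the value 73,600 is transformed to (73,600 − 54,000) / 16,000 = 1.225.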

This method is less sensitive to outliers than Min-Max normalization.

2.3.5 Z-score normalization (zero-mean normalization) example code (wine dataset)

1. Preparation: load the wine dataset and extract the data:

from sklearn.datasets import load_wine

# Load the wine dataset: X holds the features, y the class labels
wine = load_wine()
X = wine.data
y = wine.target

2. Create the support vector machine classifier:

from sklearn.svm import SVC

svm = SVC()  # avoid shadowing the sklearn.svm module with the instance name

3. Standardize the data column-wise with sklearn's preprocessing module:

from sklearn import preprocessing

# Preprocessing: standardize each column (zero mean, unit variance)
wine_X = preprocessing.scale(X)
print(wine_X)

4. The output is:

[screenshot: the column-standardized feature matrix]

5. Write the column standardization manually:

# Standardize each feature column: subtract its mean, divide by its std
wine_zscore_X = X.copy()
for i in range(X.shape[1]):  # 13 feature columns
    column = X[:, i]
    wine_zscore_X[:, i] = (column - column.mean()) / column.std()
print(wine_zscore_X)
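A quick sanity check (a sketch, assuming wine_X and wine_zscore_X from the steps above): the manual loop should reproduce preprocessing.scale's output, since both use the population standard deviation:

import numpy as np

print(np.allclose(wine_X, wine_zscore_X))  # True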

6. View the results:

[screenshot: the manually standardized feature matrix]


Source: juejin.im/post/7166502046361714702