[python] Data mining, analysis and cleaning: a summary of data standardization (normalization) methods


Link to this article: https://blog.csdn.net/weixin_47058355/article/details/130342784?spm=1001.2014.3001.5501

Foreword

Data standardization converts data measured on different scales, in different units, or over different ranges into a common standard so that values can be compared and analyzed.
This article uses the Titanic data set, which you can find on Kaggle: Portal

1. Data standardization

1.1 Min-max normalization

Min-max normalization is one of the most common data normalization methods: a linear transformation of the original data maps it into the [0, 1] interval.
Its advantage is that it is simple and effectively compresses the data into [0, 1], which is convenient for subsequent processing. Note, however, that extreme values or outliers in the original data can cause large swings in the normalized results, so in practice you should choose a normalization method appropriate to the data at hand.

import pandas as pd
data = pd.read_csv('./Titanic_train.csv')  # read the Titanic training set

def MinMaxScale(data):
    return (data - data.min()) / (data.max() - data.min())  # linear map to [0, 1]

MinMaxScale(data['Fare'])
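scikit-learn also provides a ready-made MinMaxScaler for this. A minimal sketch on a few hand-picked fare values (illustrative assumptions, not the full Titanic column):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A few hand-picked fare values (illustrative, not the full data set)
fares = np.array([7.25, 71.2833, 8.05, 53.1, 512.3292]).reshape(-1, 1)

scaler = MinMaxScaler()               # defaults to the [0, 1] range
scaled = scaler.fit_transform(fares)  # column-wise (x - min) / (max - min)
print(scaled.ravel())
```

After scaling, the smallest fare maps to 0 and the largest to 1, matching the hand-written function above.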


1.2 Standard deviation standardization (Z-score)

Standard deviation (z-score) standardization is another method commonly used in data preprocessing. It resembles min-max standardization, but uses the standard deviation as its reference: subtract the overall mean from each data point, then divide the difference by the standard deviation of the data set, which yields a new, standardized data set.
Its advantage is that it handles outliers better, because it measures each point's distance from the overall mean in units of the standard deviation. It also removes the influence of different measurement units across dimensions, so that different variables can be compared on an equal footing.
Note, however, that extreme values (outliers) in the data set can strongly affect the computed standard deviation and hence the standardized result, so in practice it should be combined with outlier detection and handling.

def Z_score_Scale(data):
    return (data - data.mean()) / data.std()  # subtract the mean, divide by the std

Z_score_Scale(data['Fare'])

insert image description here

Besides defining the function yourself, you can also use a library:

import pandas as pd
from sklearn import preprocessing
data = pd.read_csv('./Titanic_train.csv')  # read the file
scaler = preprocessing.StandardScaler()  # obtain the transformer
data['Fare'] = scaler.fit_transform(data['Fare'].values.reshape(-1, 1))  # apply the transformation
data['Fare']
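One subtlety worth knowing: the hand-written function and StandardScaler can disagree slightly, because pandas' .std() defaults to the sample standard deviation (ddof=1) while StandardScaler divides by the population standard deviation (ddof=0). A small sketch on toy values (assumed, not from the data set):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # toy values for illustration

manual = (s - s.mean()) / s.std()  # pandas .std() uses ddof=1 (sample std)
sk = StandardScaler().fit_transform(s.values.reshape(-1, 1)).ravel()  # ddof=0

print(manual.values)  # slightly smaller in magnitude
print(sk)
```

For a large column like Fare the difference is negligible, but it explains why the two outputs are not bit-for-bit identical.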


1.3 Decimal scaling standardization

This method scales feature values by shifting the position of the decimal point so that the data falls into a smaller range, which can improve the efficiency and accuracy of model training.
Decimal scaling makes the data distribution more concentrated, avoids training difficulties caused by features of very different magnitudes, and reduces computation and storage costs. Note, however, that the method depends on the extreme values in the data set, so outliers can distort the result.

import numpy as np

def Decimal_Scale(data):
    # divide by 10**j, where j = ceil(log10(max absolute value)),
    # so every scaled value has absolute value at most 1
    return data / 10 ** np.ceil(np.log10(data.abs().max()))

Decimal_Scale(data['Fare'])
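A quick sanity check of decimal scaling on hand-picked values (assumed for illustration, not from the Titanic data): the largest absolute value is 999, so j = 3 and every value is divided by 1000.

```python
import numpy as np
import pandas as pd

def decimal_scale(data):
    # divide by 10**j, where j = ceil(log10(max absolute value))
    j = np.ceil(np.log10(data.abs().max()))
    return data / 10 ** j

s = pd.Series([-325.0, 47.0, 999.0])  # toy values for illustration
print(decimal_scale(s).tolist())      # every value now lies within (-1, 1)
```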


Summary

The benefits of data standardization include:

Improved model accuracy and reliability. Standardization removes the effects of differing scales and units, making comparisons between variables fairer and improving the model's predictive accuracy and reliability.

Easier data comparison and analysis. Standardized data share a similar magnitude, range and distribution, so they can be compared and analyzed more conveniently, revealing relationships and trends in the data.

Reduced data-processing complexity. Standardization converts data to a common standard, reducing the complexity and difficulty of processing and saving time and labor.

In short, data standardization is an indispensable step in data analysis: it improves model accuracy and reliability, facilitates comparison and analysis, and reduces processing complexity.

Feel free to leave a comment, like and bookmark; if this helped you, you can buy me a cup of coffee.
