Data standardization and its implementation in Python

First, the principle
        Data normalization: the data is scaled proportionally so that it falls within a specific interval.

Data normalization methods:

  • Min-Max standardization
  • Z-Score standardization (standard score)
  • Decimal scaling standardization
  • Mean normalization
  • Vector normalization
  • Exponential conversion

1, Min-Max standardization
        Min-Max standardization applies a linear transformation to the raw data, mapping the values into the interval [0,1].

Formula: x^{'} = \frac{x - x_{min}}{x_{max} - x_{min}}

Where x is the original data, x_{min} is the minimum value of the original data, and x_{max} is the maximum value of the original data.
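The Min-Max mapping, (x - x_min) / (x_max - x_min), can be sketched in vectorized form with NumPy. This is a minimal sketch, not the article's own implementation: the function name `min_max_scale` and the choice to return zeros for a constant array (where the denominator would be 0) are assumptions made for illustration.

```python
import numpy as np

def min_max_scale(values):
    """Map values linearly onto [0, 1] using (x - x_min) / (x_max - x_min)."""
    arr = np.asarray(values, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # Assumption: a constant array has no spread, so return all zeros
        return np.zeros_like(arr)
    return (arr - arr.min()) / span

scaled = min_max_scale([1.0, 2.0, 3.0, 4.0, 5.0])
print(scaled)  # the smallest value maps to 0 and the largest to 1
```

Guarding against a zero span avoids a division by zero when every value in the data is identical.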

2, Z-Score standardization

        Also known as the standard score, this method normalizes the data using the mean and the standard deviation of the original data.

Formula: x^{'} = \frac{x - \mu}{\sigma}

Where x is the original data, μ is the mean of the original data, and σ is the standard deviation of the original data.
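A short NumPy sketch of the Z-Score transformation, (x - μ) / σ, follows. The function name `z_score` and its `ddof` parameter are illustrative assumptions. With `ddof=0` the population standard deviation is used, which is also NumPy's default for `np.std`; pass `ddof=1` if the sample standard deviation is wanted instead.

```python
import numpy as np

def z_score(values, ddof=0):
    """Standardize values to zero mean and unit standard deviation."""
    arr = np.asarray(values, dtype=float)
    sigma = arr.std(ddof=ddof)
    if sigma == 0:
        # Assumption: constant data cannot be scaled, so return all zeros
        return np.zeros_like(arr)
    return (arr - arr.mean()) / sigma

z = z_score([1.0, 2.0, 3.0, 4.0, 5.0])
print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```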

3, Decimal scaling standardization
        This method standardizes the data by moving the position of the decimal point. The number of places the decimal point moves depends on the maximum absolute value in the original data.

Formula: x^{'} = \frac{x}{10^{j}}

Where x is the original data and j is the number of digits in the integer part of the maximum absolute value, so that every scaled value has an absolute value below 1.

For example, given the array [-309, -10, -43, 87, 344, 970], the maximum absolute value is 970, so j = 3 and the data are normalized to [-0.309, -0.01, -0.043, 0.087, 0.344, 0.97].
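The array in this example can be checked with a few lines of Python. This is a sketch under the assumption that j is chosen as the smallest non-negative integer for which the largest absolute value, divided by 10^j, drops below 1. The helper name `decimal_scale` is hypothetical.

```python
def decimal_scale(values):
    """Return (scaled values, j) where each value is divided by 10**j."""
    max_abs = max(abs(v) for v in values)
    j = 0
    # Increase j until the largest absolute value falls below 1
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

data = [-309, -10, -43, 87, 344, 970]
scaled, j = decimal_scale(data)
print(j)       # number of places the decimal point is moved
print(scaled)
```

Counting with a loop rather than a logarithm sidesteps floating-point rounding at exact powers of ten, such as a maximum of 1000.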

4, Mean normalization
        This method subtracts the mean from the original data and divides the result by the difference between the maximum and minimum values.

Formula: x^{'} = \frac{x - \mu}{x_{max} - x_{min}}

Where x is the raw data, μ is the mean of the original data, x_{min} is the minimum value of the original data, and x_{max} is the maximum value of the original data. The denominator may also be replaced by x_{max} alone.
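Mean normalization, (x - μ) / (x_max - x_min), differs from Min-Max scaling only in what is subtracted, so the results are centered on 0 instead of starting at 0. A minimal NumPy sketch, with the function name `mean_normalize` and the zero-span fallback chosen here as assumptions:

```python
import numpy as np

def mean_normalize(values):
    """Center values on the mean and divide by the range (max - min)."""
    arr = np.asarray(values, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # Assumption: constant data has no range, so return all zeros
        return np.zeros_like(arr)
    return (arr - arr.mean()) / span

centered = mean_normalize([1.0, 2.0, 3.0, 4.0, 5.0])
print(centered)  # values are symmetric around 0 for this evenly spaced input
```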

5, Vector normalization
        This method normalizes the raw data by dividing each value by the sum of all the values.

Formula: x^{'} = \frac{x}{\sum_{i=1}^{n}x_i}

Where x is the raw data and the denominator is the sum of all the data.
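Dividing each value by the sum of all the data makes the normalized values add up to 1. The sketch below illustrates this. The name `sum_normalize` is hypothetical, and raising an error when the sum is 0 is an assumption made here, since the source does not say how that case should be handled.

```python
import numpy as np

def sum_normalize(values):
    """Divide each value by the sum of all values."""
    arr = np.asarray(values, dtype=float)
    total = arr.sum()
    if total == 0:
        # Assumption: a zero sum makes the ratio undefined, so fail loudly
        raise ValueError("the sum of the data is zero; cannot normalize")
    return arr / total

shares = sum_normalize([1.0, 2.0, 3.0, 4.0])
print(shares)
print(shares.sum())  # approximately 1
```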

6, Exponential conversion
        This method converts the original values into normalized data through an exponential or logarithmic function. Common choices are the lg function, the Softmax function, and the Sigmoid function.

Formulas:

(1) lg function:

 x^{'} = \frac{lg(x)}{lg(x_{max})}

Where x is the raw data and x_{max} is the maximum value of the original data.

(2) Softmax function:

 x^{'} = \frac{e^x}{\sum_{i=1}^{n}e^{x_i}}

Where x is the raw data, e is the base of the natural logarithm, and the denominator is the sum of e raised to the power of each value in the raw data.

(3) Sigmoid function:
x^{'} = \frac{1}{1+e^{-x}}
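The three conversions can be sketched with NumPy as follows. The function names are illustrative assumptions. The lg version assumes every value is positive and that the maximum is greater than 1, otherwise the logarithm or the division is undefined. The Softmax version subtracts the maximum before exponentiating; that shift cancels out in the ratio, so the result is unchanged, but it avoids overflow for large inputs.

```python
import numpy as np

def lg_scale(values):
    """lg scaling: lg(x) / lg(x_max). Assumes all x > 0 and x_max > 1."""
    arr = np.asarray(values, dtype=float)
    return np.log10(arr) / np.log10(arr.max())

def softmax(values):
    """Softmax: e^x divided by the sum of e^x_i over all values."""
    arr = np.asarray(values, dtype=float)
    exps = np.exp(arr - arr.max())  # the shift cancels out in the ratio
    return exps / exps.sum()

def sigmoid(values):
    """Sigmoid: 1 / (1 + e^(-x)), mapping any real x into (0, 1)."""
    arr = np.asarray(values, dtype=float)
    return 1.0 / (1.0 + np.exp(-arr))

data = [1.0, 10.0, 100.0]
print(lg_scale(data))
print(softmax([1.0, 2.0, 3.0]))
print(sigmoid([-2.0, 0.0, 2.0]))
```

Note that Softmax outputs depend on the whole array and sum to 1, while Sigmoid transforms each value independently.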

Second, the code for data standardization

import numpy as np
import math
 
class DataNum:
    def __init__(self):
        self.arr = [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0]
        self.x_max = max(self.arr)  # maximum value
        self.x_min = min(self.arr)  # minimum value
        self.x_mean = sum(self.arr) / len(self.arr) # mean
        self.x_std = np.std(self.arr)   # standard deviation (population, ddof=0)
        print("Original data:\n{}".format(self.arr))
 
    def Min_Max(self):
        arr_ = list()
        distance = self.x_max-self.x_min
        for x in self.arr:
            arr_.append(round((x-self.x_min)/distance,4))   # keep 4 decimal places
        print("Min_Max standardization result:\n{}".format(arr_))
 
    def Z_Score(self):
        arr_ = list()
        for x in self.arr:
            arr_.append(round((x-self.x_mean)/self.x_std,4))
        print("Z_Score standardization result:\n{}".format(arr_))
 
    def DecimalScaling(self):
        arr_ = list()
        # j is the number of integer digits in the maximum absolute value
        max_abs = max(abs(x) for x in self.arr)
        j = len(str(int(max_abs)))
        for x in self.arr:
            arr_.append(round(x/(math.pow(10,j)),4))   # keep 4 decimal places
        print("DecimalScaling standardization result:\n{}".format(arr_))
 
    def Mean(self):
        arr_ = list()
        distance = self.x_max - self.x_min
        for x in self.arr:
            arr_.append(round((x - self.x_mean) / distance, 4))  # keep 4 decimal places
        print("Mean normalization result:\n{}".format(arr_))
 
    def Vector(self):
        arr_ = list()
        arr_sum = sum(self.arr)
        for x in self.arr:
            arr_.append(round(x / arr_sum, 4))  # keep 4 decimal places
        print("Vector normalization result:\n{}".format(arr_))
 
    def exponential(self):
        arr_1 = list()  # lg
        arr_2 = list()  # SoftMax
        arr_3 = list()  # Sigmoid
        sum_e = sum([math.exp(x) for x in self.arr])
        for x in self.arr:
            arr_1.append(round(math.log10(x) / math.log10(self.x_max), 4))  # keep 4 decimal places
            arr_2.append(round(math.exp(x) / sum_e, 4))  # keep 4 decimal places
            arr_3.append(round(1 / (1+math.exp(-x)), 4))  # keep 4 decimal places
        print("lg standardization result:\n{}".format(arr_1))
        print("SoftMax standardization result:\n{}".format(arr_2))
        print("Sigmoid standardization result:\n{}".format(arr_3))
 
    def do(self):
        # Call the methods on self rather than on the global variable dn
        self.Min_Max()
        self.Z_Score()
        self.DecimalScaling()
        self.Mean()
        self.Vector()
        self.exponential()
 
if __name__ == '__main__':
    dn = DataNum()
    dn.do()

Origin www.cnblogs.com/SysoCjs/p/11595540.html