Introduction to Machine Learning (3): Feature Engineering - Feature Preprocessing

Feature Engineering
1. Why do we need feature engineering?
Because "data and features determine the upper limit of machine learning, while models and algorithms only approach that upper limit", applying professional domain knowledge and skills to process the data makes the algorithms perform better.
2. What is feature engineering?
The sklearn library is used for feature engineering, while the pandas library is used for data cleaning and data processing.

Feature preprocessing

Definition: the process of transforming feature data, through conversion functions, into feature data better suited to the algorithm model. The relevant tools live in the sklearn.preprocessing module.

Why do we need to normalize/standardize? The units or magnitudes of the features differ greatly, or the variance of a particular feature is several orders of magnitude larger than that of the other features, so that feature easily influences (dominates) the target result. Making the data dimensionless converts features of different scales onto the same scale.
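To see why a large-scale feature can dominate, here is a minimal sketch (with made-up toy values) of how Euclidean distance between two samples is controlled almost entirely by the feature with the largest magnitude:

```python
import numpy as np

# Two samples with features on very different scales:
# feature 1 is an income-like value (tens of thousands),
# feature 2 is an age-like value (tens).
a = np.array([50000.0, 25.0])
b = np.array([52000.0, 60.0])

# The Euclidean distance is dominated almost entirely by the first feature;
# the large relative difference in the second feature (25 vs 60) barely matters.
dist = np.linalg.norm(a - b)
income_only = abs(a[0] - b[0])
print(dist, income_only)  # both are approximately 2000
```

After normalization or standardization, both features would contribute on a comparable scale.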

1. Normalization

Definition: transform the original data by mapping each feature linearly into a given range (by default (0, 1)):

    X'  = (x - min) / (max - min)
    X'' = X' * (mx - mi) + mi

where max and min are the column maximum and minimum, and (mi, mx) is the target interval (the default feature_range of (0, 1) gives X'' = X').
Example:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

def minmax_demo():
    """Normalization demo.
    :return: None"""
    # Read the data with pandas
    data0 = pd.read_csv('C:/Users/Admin/Desktop/a.TXT')
    print('data0:\n', data0)
    # From the txt data, keep all rows but only the first three columns
    data = data0.iloc[:, :3]
    print('data:\n', data)
    # Instantiate a transformer
    transfer = MinMaxScaler()
    # Call fit_transform
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new)

if __name__ == '__main__':
    minmax_demo()
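The file path above is specific to the author's machine. A self-contained sketch of the same MinMaxScaler call, using inline toy data (the values are made up for illustration):

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Three samples, three features, on different scales
X = np.array([[90.0, 2.0, 10.0],
              [60.0, 4.0, 15.0],
              [75.0, 3.0, 13.0]])

transfer = MinMaxScaler()  # default feature_range=(0, 1)
X_new = transfer.fit_transform(X)

# Each column now lies in [0, 1]: the column minimum maps to 0,
# the column maximum maps to 1, and values in between scale linearly.
print(X_new)
```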


Disadvantages:

The normalization formula uses the column maximum and minimum, and these are very susceptible to outliers, so this method has poor robustness and is only suitable for traditional small-data scenarios.
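The outlier sensitivity is easy to demonstrate with a toy column (values made up for illustration) where one extreme value sets the maximum:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# A single-feature column where 100 is an outlier
x = np.array([[1.0], [2.0], [3.0], [100.0]])

scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())
# The three normal values are squeezed into [0, ~0.02]: the outlier
# defines max in (x - min) / (max - min), collapsing the normal range.
```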

2. Standardization

Definition: transform the original data so that each column has mean 0 and standard deviation 1:

    X' = (x - mean) / σ

where mean is the column mean and σ is the column standard deviation.

from sklearn.preprocessing import StandardScaler
import pandas as pd

def stand_demo():
    """Standardization demo.
    :return: None"""
    # Read the data with pandas
    data0 = pd.read_csv('C:/Users/Admin/Desktop/a.TXT')
    print('data0:\n', data0)
    # From the txt data, keep all rows but only the first three columns
    data = data0.iloc[:, :3]
    print('data:\n', data)
    # Instantiate a transformer
    transfer = StandardScaler()
    # Call fit_transform
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new)

if __name__ == '__main__':
    stand_demo()
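Again, the file path is machine-specific. A self-contained sketch with inline toy data, checking that StandardScaler actually produces mean 0 and standard deviation 1 per column:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Two features on different scales (values made up for illustration)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

X_new = StandardScaler().fit_transform(X)

# After standardization each column has mean 0 and (population) std 1.
print(X_new.mean(axis=0))
print(X_new.std(axis=0))
```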

Advantages: relatively stable when enough samples are available, suitable for modern, noisy big-data scenarios.


Origin blog.csdn.net/qq_45234219/article/details/114694058