Feature Engineering
1. Why do we need feature engineering?
Because "data and features determine the upper limit of machine learning, while models and algorithms only approximate that limit." Applying domain knowledge and skills to process the data makes the algorithms perform better.
2. What is feature engineering?
The sklearn library is used for feature engineering, and the pandas library is used for data cleaning and data processing.
Feature dimensionality reduction
Definition: the process of reducing the number of features (i.e., the number of columns) to obtain a set of "uncorrelated" principal variables
Method 1: Feature selection
1. Filter: mainly examines the characteristics of each feature itself, and the correlation between features and between a feature and the target value
① Variance selection method: low-variance feature filtering
② Correlation coefficient method: measures the correlation between features
2. Embedded: the algorithm automatically selects features (based on the association between features and the target value)
① Decision tree: information entropy, information gain
② Regularization: L1, L2
③ Deep learning: convolution, etc.
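The "Regularization: L1" point above can be sketched with sklearn's SelectFromModel: an L1 penalty drives the weights of unhelpful features to zero, so the model itself performs the selection. The dataset and estimator settings below are illustrative assumptions, not from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L1-penalized model: weights of weak features shrink to exactly zero,
# and SelectFromModel keeps only the features with nonzero weight.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
).fit(X, y)

X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)  # fewer columns after selection
```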
Feature selection definition: the dataset contains redundant or correlated variables; the aim is to find the main features among the original features.
1. Filter
① Variance selection method: low-variance feature filtering
Principle: a small feature variance means the sample values of that feature are very similar, so low-variance features are deleted; a large feature variance means the sample values of that feature differ considerably, so high-variance features are retained.
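A minimal sketch of low-variance filtering with sklearn's VarianceThreshold (the toy matrix below is made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 2, 0, 3],
    [0, 1, 4, 3],
    [0, 1, 1, 3],
])

# Columns 0 and 3 are constant (variance 0), so they are dropped;
# threshold=0.0 removes every feature whose variance is not above 0.
selector = VarianceThreshold(threshold=0.0)
X_new = selector.fit_transform(X)
print(X_new)
# [[2 0]
#  [1 4]
#  [1 1]]
```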
②Correlation coefficient method
When the correlation coefficient between two features is very high, choose one of the following:
(1) Keep one of them
(2) Weighted summation
(3) Principal component analysis
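The correlation coefficient itself can be computed with scipy's pearsonr; the two toy feature columns below are invented for illustration.

```python
from scipy.stats import pearsonr

revenue = [10, 20, 30, 40, 50]
profit = [2, 4, 6, 8, 10]  # perfectly proportional to revenue

# Pearson's r ranges from -1 to 1; values near +/-1 mean the two
# features are strongly correlated.
r, p_value = pearsonr(revenue, profit)
print(r)  # 1.0 -- perfectly correlated, so we could keep just one
```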
Method 2: Principal Component Analysis (PCA)
For example: given five points drawn in a plane rectangular coordinate system, the data is two-dimensional; we use principal component analysis to reduce it to one dimension:
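The five-point example can be sketched with sklearn's PCA (the coordinates below are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

# Five 2-D points (illustrative coordinates).
points = np.array([[2, 8], [3, 8], [5, 1], [7, 4], [9, 5]])

# Keep a single principal component: 2-D -> 1-D.
pca = PCA(n_components=1)
reduced = pca.fit_transform(points)
print(reduced.shape)  # (5, 1): each point is now a single number
```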
Case: Exploring user preferences for item categories, segmentation and dimensionality reduction
Processing flow:
read the four tables;
merge the tables so that user_id and aisle_id appear in one table;
find the relationship between user_id and aisle_id with a cross-tabulation;
take the first ten thousand rows. Since there are too many zeros and too much redundancy, apply PCA dimensionality reduction.
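The processing flow above can be sketched as follows. The Instacart-style table and column names (aisles, orders, products, order_products, user_id, aisle_id) are assumptions inferred from the steps described, not confirmed by the notes.

```python
import pandas as pd
from sklearn.decomposition import PCA

def user_aisle_pca(aisles, orders, products, order_products, n_rows=10000):
    """Merge the four tables, cross-tabulate user_id x aisle_id,
    and PCA-reduce the resulting (mostly zero) count table."""
    # Merge step by step until user_id and aisle_id sit in one table.
    tab = pd.merge(aisles, products, on="aisle_id")
    tab = pd.merge(tab, order_products, on="product_id")
    tab = pd.merge(tab, orders, on="order_id")

    # Rows = users, columns = aisles, values = purchase counts.
    table = pd.crosstab(tab["user_id"], tab["aisle_id"])

    # Keep the first n_rows rows; the table is sparse (many zeros),
    # so reduce it with PCA, retaining 95% of the variance.
    data = table[:n_rows]
    return PCA(n_components=0.95).fit_transform(data)

# In practice the four tables would come from CSV files, e.g.
# aisles = pd.read_csv("aisles.csv"), orders = pd.read_csv("orders.csv"), ...
```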