Introduction to Machine Learning (4): Feature Engineering - Feature Dimensionality Reduction

Feature Engineering
1. Why do we need feature engineering?
Because "data and features determine the upper limit of machine learning, while models and algorithms merely approach that upper limit", applying professional background knowledge and skills to process the data makes the algorithms perform better.

2. What is feature engineering?
Feature engineering uses that background knowledge to turn raw data into features the algorithms can work with directly. In practice, the sklearn library is used for feature engineering, and the pandas library is used for data cleaning and data processing.

Feature dimensionality reduction

Definition: the process of reducing the number of features (i.e., the number of columns) to obtain a set of mutually uncorrelated principal variables.

Method 1: Feature selection
         1. Filter methods: mainly examine the intrinsic properties of the features, i.e., the correlation between features and between features and the target value.
                  ① Variance selection method: low-variance feature filtering
                  ② Correlation coefficient method: measures the correlation between features
         2. Embedded methods: the algorithm selects features automatically, based on the association between features and the target value.
                  ① Decision tree: information entropy, information gain
                  ② Regularization: L1, L2
                  ③ Deep learning: convolution, etc.

Feature selection definition: the data contains redundant or correlated variables; the aim is to find the principal features among the original features.

1. Filter (filtering)
       ① Variance selection method: low-variance feature filtering
       Principle: if a feature's variance is small, the sample values of that feature are all very similar, so the low-variance feature is deleted; if a feature's variance is large, the sample values of that feature differ considerably, so the high-variance feature is retained.
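A minimal sketch of low-variance filtering with sklearn's VarianceThreshold; the toy matrix and the threshold value are illustrative, not from the original post:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: columns 0 and 3 are constant (zero variance).
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

# Remove every feature whose variance does not exceed the threshold.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)  # (3, 2): both constant columns are dropped
```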
       ② Correlation coefficient method
       The Pearson correlation coefficient measures the degree of linear correlation between two features. Its value r always lies in [-1, 1]: values near 1 indicate strong positive correlation, values near -1 strong negative correlation, and values near 0 little linear correlation. It is computed as

       r = (nΣxy − ΣxΣy) / (√(nΣx² − (Σx)²) · √(nΣy² − (Σy)²))
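As a quick sketch, the coefficient can be computed with scipy.stats.pearsonr; the two feature columns below are invented for illustration:

```python
from scipy.stats import pearsonr

# Two hypothetical feature columns (illustrative values only).
x = [12.5, 10.3, 13.0, 9.7, 11.8, 14.2]
y = [21.2, 18.5, 23.1, 16.9, 20.0, 25.4]

# pearsonr returns the correlation coefficient r and a two-sided p-value.
r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.3f}")  # r close to 1: strong positive correlation
```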
When the correlation coefficient between two features is very high, there are three options:
(1) Keep only one of them
(2) Combine them by weighted summation
(3) Apply principal component analysis

Method 2: Principal Component Analysis (PCA)

Definition: PCA transforms high-dimensional data into lower-dimensional data; in the process some of the original information may be discarded and new, mutually uncorrelated variables are created. Its purpose is to compress the dimensionality of the data while losing as little information as possible.
For example: given five points drawn in a plane rectangular coordinate system, the data is two-dimensional; we use principal component analysis to reduce it to one dimension:
The idea: find a single straight line such that, when the five points are projected onto it, the projected points stay as spread out as possible, i.e., as little information as possible is lost; the positions along that line are the one-dimensional representation.
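A minimal sketch of this 2-D to 1-D reduction with sklearn's PCA; the five point coordinates are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Five 2-D points (made-up coordinates).
points = np.array([[2, 8], [3, 8], [5, 1], [7, 4], [9, 5]])

# n_components=1 projects the points onto the single direction
# of maximum variance, turning 2-D data into 1-D data.
pca = PCA(n_components=1)
points_1d = pca.fit_transform(points)

print(points_1d.ravel())              # one coordinate per point
print(pca.explained_variance_ratio_)  # share of variance retained
```

Note that n_components can also be a float between 0 and 1, in which case sklearn keeps enough components to explain that fraction of the variance.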

Case: exploring users' preferences for item categories through segmentation and dimensionality reduction

Processing flow: read the four tables, merge them into one table, build a user_id × aisle cross table, and reduce its dimensionality with PCA.
Read the four tables:
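A sketch of this step with pandas. This looks like the Instacart market-basket dataset, so the file names below follow that convention; they are assumptions, since the original screenshots are not available:

```python
import pandas as pd

# Assumed Instacart-style file names (the screenshots with the real
# paths are not available).
aisles = pd.read_csv("aisles.csv")                          # aisle_id, aisle
products = pd.read_csv("products.csv")                      # product_id, aisle_id, ...
orders = pd.read_csv("orders.csv")                          # order_id, user_id, ...
order_products = pd.read_csv("order_products__prior.csv")   # order_id, product_id, ...
```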
Merge the tables so that user_id and aisle_id end up in the same table:
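One way to chain the merges with pandas.merge, joining on the shared key columns (column names as assumed above):

```python
# Each merge joins on the key the two tables share, so the final
# table carries user_id and aisle on every row.
table = pd.merge(aisles, products, on="aisle_id")
table = pd.merge(table, order_products, on="product_id")
table = pd.merge(table, orders, on="order_id")
```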
Find the relationship between user_id and aisle_id:
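A cross table is a natural fit here: one row per user, one column per aisle, each cell counting purchases. A sketch with pd.crosstab (column names as assumed above):

```python
# Count how often each user bought from each aisle.
user_aisle = pd.crosstab(table["user_id"], table["aisle"])
print(user_aisle.shape)  # (number of users, number of aisles)
```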
Take the first ten thousand rows. Because the resulting table contains a great many zeros and much redundancy, perform PCA dimensionality reduction:
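A sketch of the final step. The 10,000-row cut comes from the text above; passing a float to n_components keeps enough components to explain that fraction of the variance, and 0.95 here is an assumed setting:

```python
from sklearn.decomposition import PCA

# Keep the first ten thousand users, as described above.
subset = user_aisle[:10000]

# n_components=0.95 keeps enough components to retain 95% of the
# variance (the exact fraction is an assumption).
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(subset)

print(subset.shape, "->", reduced.shape)  # far fewer columns after PCA
```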


Source: blog.csdn.net/qq_45234219/article/details/114821567