【Data analysis】—feature engineering, feature design, feature selection, feature evaluation, feature learning

What is feature engineering? (Feature Engineering)

  • Feature engineering is the task of extracting effective features from the data after (or during) data preprocessing, so that these features capture as much of the information in the original data as possible and the subsequent model achieves better results.

Significance of feature engineering

  • The well-known machine learning researcher Andrew Ng described feature engineering as follows: "Although extracting data features is difficult, time-consuming, and requires expert knowledge of the relevant field, the foundation of applied machine learning is feature engineering."
  • The better the features, the greater the flexibility.
    Good features enable even simple models to achieve good performance, run fast, and remain easy to understand and maintain.
  • The better the features, the simpler the model.
    Good features do not require spending much time searching for optimal parameters, which reduces model complexity and keeps the model simple.
  • The better the features, the better the model performs.
    There is no doubt that good features make the model perform better; the ultimate goal of feature engineering is to improve model performance.

The process of feature engineering


Feature design

How to design features from raw data?

Extraction of basic features

Extracting basic features means preprocessing the raw data and transforming it into usable numerical features. Common methods include data normalization, discretization, missing-value completion, and data transformation.
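A minimal sketch of these basic steps with pandas and scikit-learn (the column names, bin count, and fill strategy below are illustrative assumptions, not prescriptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

# Toy raw data; the column names are made up for illustration
df = pd.DataFrame({"height": [1.62, 1.75, np.nan, 1.80],
                   "income": [3200.0, 8800.0, 5100.0, np.nan]})

# Missing-value completion: fill NaN with the column mean
df[["height", "income"]] = SimpleImputer(strategy="mean").fit_transform(df)

# Normalization: rescale each column into [0, 1]
scaled = MinMaxScaler().fit_transform(df[["height", "income"]])
df["height_norm"], df["income_norm"] = scaled[:, 0], scaled[:, 1]

# Discretization: bucket income into 3 equal-width bins
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
df["income_bin"] = binner.fit_transform(df[["income"]]).ravel()

# Data transformation: log-transform a skewed column
df["income_log"] = np.log1p(df["income"])
print(df)
```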

Create new features

Based on the relevant domain knowledge, new features are built on top of the basic features, for example by constructing ratios and crosses (interactions) between features.
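A small sketch of ratio and cross features (the column names are hypothetical; which ratios make sense depends on the domain):

```python
import pandas as pd

df = pd.DataFrame({"clicks": [12, 5, 30], "impressions": [200, 80, 150]})

# Ratio feature: click-through rate = clicks / impressions
df["ctr"] = df["clicks"] / df["impressions"]

# Cross (interaction) feature: product of two basic features
df["clicks_x_impressions"] = df["clicks"] * df["impressions"]
print(df)
```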

Function Transformation Features

  • The left figure is a time-series plot generated from two sine functions (7 and 17 cycles per second, respectively) plus some noise;
  • The right figure is the frequency spectrum obtained by the Fourier transform. After the transform, the two dominant frequencies, 7 and 17, are successfully recovered (the vertical axis is the amplitude of each frequency). A sketch of this experiment follows below.
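The figures themselves are not reproduced here, but the experiment they describe can be sketched with NumPy: build a noisy signal from two sine waves at 7 and 17 cycles per second, then read the dominant frequencies off the amplitude spectrum (the sampling rate and noise level are assumptions):

```python
import numpy as np

fs = 200                       # sampling rate in Hz (chosen for illustration)
t = np.arange(0, 2, 1 / fs)    # 2 seconds of samples

# Signal: sine waves at 7 Hz and 17 Hz plus Gaussian noise
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
signal += 0.3 * np.random.randn(len(t))

# Fourier transform -> amplitude spectrum over the positive frequencies
amplitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(t), d=1 / fs)

# The two largest peaks sit near 7 Hz and 17 Hz
top2 = freqs[np.argsort(amplitude)[-2:]]
print(sorted(top2))            # approximately [7.0, 17.0]
```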

One-hot feature representation (One-hot Representation)

  • Represent each attribute value as a very long vector, where each dimension corresponds to one attribute value (such as a word)
    • Function: [0, 0, 1, 0, 0, ..., 0, 0, 0, 0]
    • Image: [0, 0, 0, 0, 0, ..., 0, 0, 0, 1]
  • Advantages: intuitive and simple
  • Drawbacks:
    • The "curse of dimensionality" problem: especially when the corpus contains a large vocabulary, the space and time overhead of the one-hot representation becomes enormous
    • The "semantic gap" phenomenon: any two words are completely isolated from each other, and the representation cannot describe the relationships among words in a sentence (the same is true for the bag-of-words model mentioned earlier). For example, one-hot vectors cannot reveal the connection between "function" and "even function", even though these two words are closely related (see the sketch below).

Statistical Characteristics of Data

  • For example: word-frequency statistics in documents
  • The dictionary (the vocabulary built from the corpus)
  • The document word-frequency feature (a term-count vector for each document), sketched below
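A minimal sketch of the dictionary and the document word-frequency features, using scikit-learn's CountVectorizer on made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs play"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # the dictionary (vocabulary)
print(counts.toarray())                      # word-frequency vector per document
```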

TF-IDF (Term Frequency - Inverse Document Frequency)

  • The algorithm is simple and efficient, and industry often uses it for initial data preprocessing
  • Main idea: find the **"keywords"** that best represent a document
  • Term Frequency (TF)
    • TF = the frequency with which a word (feature value) appears in a sentence (data sample)
  • Inverse Document Frequency (IDF)
    • IDF = log(total number of sentences (samples) in the corpus (database) / number of sentences (samples) containing the word (feature value))
  • Importance of each word (feature value):
    • $w_{ij} = tf \times idf = TF_{ij} \times \log(N / DF_i)$

How to find key features (words)?

  1. According to TF, find the high-frequency words (feature values) in a sentence (after removing meaningless words, such as the stop words "的", "是", "了", etc.)
  2. According to IDF, assign weights to the remaining words in the sentence and sort them: the more common a word (feature value) is in the corpus, the smaller its weight
  3. According to TF-IDF, compute the TF-IDF values of all words (feature values) in a sentence (sample), then sort and filter them to obtain the most representative features ("keywords") of each sentence

Calculate TF-IDF

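The worked example in the original figures is not reproduced; below is a minimal sketch that computes TF-IDF directly from the formula above (libraries such as scikit-learn provide TfidfVectorizer, which uses a slightly different smoothed variant):

```python
import math
from collections import Counter

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "log"],
        ["cats", "and", "dogs", "play"]]

N = len(docs)                                             # total number of documents
df = Counter(word for doc in docs for word in set(doc))   # document frequency DF_i

def tf_idf(doc):
    tf = Counter(doc)
    # w_ij = TF_ij * log(N / DF_i)
    return {w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf}

for doc in docs:
    # Sort by weight to get each document's most representative "keywords"
    print(sorted(tf_idf(doc).items(), key=lambda kv: -kv[1])[:3])
```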

  • Advantages
    • A simple and fast way to measure word (feature) importance, with results that match the actual situation reasonably well
    • Broad applicability: not limited to text data
  • Shortcomings
    • Measuring the importance of a word purely by "word frequency" is not comprehensive; sometimes important words do not appear many times
    • It cannot reflect the position or order of words: a word appearing at the beginning of a document and a word appearing at the end are treated as equally important
    • It cannot discover implicit connections between words (features), such as synonyms

TF-IDF (Term Frequency - Inverse Document Frequency): applications

  • Search engines; keyword extraction; text similarity; text summarization
  • Recommender systems
    • TF-IDF can be computed for "user-tag-item" features
    • TF-IDF for user-tag pairs (a hedged sketch follows below)
    • User: i. Tag: l. Total number of users: M.
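The exact user-tag formula from the original figure is not reproduced here. As an illustration only, one common way to apply the same idea is to treat each user as a "document" and the tags they apply as "words": a tag weighs heavily for a user if that user applies it often while few other users use it. The data and weighting below are assumptions, not the figure's exact formula.

```python
import math
from collections import Counter

# Hypothetical data: each user and the tags they have applied
user_tags = {
    "u1": ["sci-fi", "sci-fi", "comedy"],
    "u2": ["comedy", "romance"],
    "u3": ["sci-fi", "documentary"],
}

M = len(user_tags)                                            # total number of users
tag_users = Counter(t for tags in user_tags.values() for t in set(tags))

def user_tag_weights(user):
    tf = Counter(user_tags[user])
    # weight = (times this user applied the tag) * log(M / number of users who used the tag)
    return {t: tf[t] * math.log(M / tag_users[t]) for t in tf}

print(user_tag_weights("u1"))
```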

Feature Combination: Constructing Higher-Order Features

All of the features constructed above can be combined: pairwise, in triples, and so on.
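A minimal sketch of pairwise (second-order) crosses with scikit-learn's PolynomialFeatures; triples and higher orders follow by raising degree:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # three basic features per sample

# degree=2, interaction_only=True -> keep the originals plus every product x_i * x_j
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_cross = poly.fit_transform(X)

print(poly.get_feature_names_out(["f1", "f2", "f3"]))
print(X_cross)
```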

Example: The 2nd "China University Computer Contest-Big Data Challenge"

Simply put, the goal of this problem is to use data analysis to distinguish human mouse trajectories from mouse trajectories generated by code. The mouse trajectory here refers to a means of completing a verification step: the track of the mouse while dragging a slider to a designated area.
Raw data format: the coordinates of a series of consecutive points together with their corresponding times, plus the coordinates of the target point.
For example: (2,3,4), (2,5,6), (4,3,7), (4,3). This trajectory contains three points, each expressed as (x, y, time), and the coordinates of the target (end) point are (4,3).

Extraction of basic features

  • Statistics of the trajectory motion data: mean / extreme value / maximum / median of velocity / acceleration / angular acceleration / angular velocity
  • Description of the trajectory: whether the movement is unidirectional along the x-axis, the smoothness of the curve, etc. (a sketch of such statistics follows below)
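A minimal sketch of extracting such statistics from a single trajectory of (x, y, time) points; the values are made up, and the contest's exact feature definitions may differ:

```python
import numpy as np

# One trajectory: each row is (x, y, time)
track = np.array([[2.0, 3.0, 4.0],
                  [2.0, 5.0, 6.0],
                  [4.0, 3.0, 7.0]])

dx, dy, dt = np.diff(track[:, 0]), np.diff(track[:, 1]), np.diff(track[:, 2])

speed = np.hypot(dx, dy) / dt              # speed between consecutive points
accel = np.diff(speed) / dt[1:]            # change of speed over time

features = {
    "speed_mean": speed.mean(),
    "speed_max": speed.max(),
    "speed_median": np.median(speed),
    "accel_mean": accel.mean() if accel.size else 0.0,
    "x_unidirectional": bool(np.all(dx >= 0) or np.all(dx <= 0)),
}
print(features)
```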

Create new features

  • Simple binary operations on the basic features: add / subtract / multiply / divide / sum of squares / square of sums / sum of reciprocals
  • Partial derivatives of the motion data along a given dimension
  • Features based on domain expertise

How to select effective features (Subset Selection problem)

  • In practical applications, the number of features is often large, and some of them may be irrelevant.
  • The larger the number of features, the longer it takes to analyze them and train the model, and the easier it is to run into the "curse of dimensionality" and make the model overly complex.
  • Feature selection reduces the number of features by eliminating irrelevant or redundant features, thereby simplifying the model and improving its generalization ability.

How to generate feature subsets

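The original figures are not reproduced. As an illustration, the simplest subset-generation strategy is exhaustive enumeration of all candidate subsets, which is only feasible for a small number of features; heuristic searches such as forward or backward selection are used when the feature count is large.

```python
from itertools import combinations

features = ["f1", "f2", "f3"]

# Exhaustive enumeration of all non-empty feature subsets (2^n - 1 of them)
subsets = [list(c) for r in range(1, len(features) + 1)
           for c in combinations(features, r)]
print(subsets)
# [['f1'], ['f2'], ['f3'], ['f1', 'f2'], ['f1', 'f3'], ['f2', 'f3'], ['f1', 'f2', 'f3']]
```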

How to evaluate feature subsets?

Different feature selection algorithms use different criteria to evaluate feature subsets, and some must be combined with the subsequent learning algorithm. According to how the subset-evaluation criterion is combined with the subsequent algorithm, feature selection methods are mainly divided into three types: Filter, Wrapper, and Embedded.

1. Filter evaluation strategy method

  • Analyzes inherent properties of the dataset, independently of the subsequent learning algorithm
  • Uses heuristic criteria based on information statistics to evaluate feature subsets
  • Heuristic evaluation functions: distance measures, information measures, dependency measures, consistency measures (an information-measure sketch follows below)
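A minimal sketch of a Filter-style selection using an information measure (mutual information) in scikit-learn; the scores are computed without involving the downstream model:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest mutual information with the label
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # information-measure score per feature
print(X_selected.shape)        # (150, 2)
```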

2. Wrapper Evaluation Strategy Method

  • Treats feature selection as a component of the learning algorithm: it must be combined with the subsequent learning algorithm, and the classification performance of that algorithm is used directly as the measure of feature importance
  • Because the performance of the classifier is used directly as the evaluation standard, the selected feature subset is the one that yields the best classification performance
  • Compared with the Filter approach, the feature subset selected by the Wrapper method is much smaller, which helps identify key features, and the classification performance of the model is better. However, the generalization ability of the Wrapper method is poor: when the learning algorithm changes, feature selection must be redone for the new algorithm, and the computational complexity is high. (A sketch follows below.)
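A minimal sketch of a Wrapper-style selection with scikit-learn's SequentialFeatureSelector: a greedy forward search that, at each step, keeps the feature whose addition gives the best cross-validated accuracy of the chosen classifier (the classifier and subset size here are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Greedy forward search driven directly by the classifier's cross-validated accuracy
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)

print(sfs.get_support())       # boolean mask of the selected feature subset
```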

3. Embedded Evaluation Strategy Method

The Embedded feature selection method combines the learning algorithm with the feature-selection mechanism and evaluates features during the learning process. The feature-selection algorithm is embedded in the learning and classification algorithm; that is, feature selection is part of the model, and model training and feature selection are carried out simultaneously and jointly (in other words, the algorithm performs feature selection automatically). Common methods are:

1). Feature selection method with a penalty term

The basic idea is to add a penalty term to the model's loss function; during training, the penalty term penalizes the coefficients of the features. For feature selection, the L1 regularization term is most often used.
Regularization adds extra constraints or penalties to the existing model (loss function) in order to prevent overfitting and improve generalization.
The loss function changes from the original $E(X, Y)$ to $E(X, Y) + \lambda \|w\|_1$, where $w$ is the vector of model coefficients (also called parameters or weights), $\|\cdot\|$ is usually the L1 or L2 norm, and $\lambda$ is a tunable parameter that controls the strength of the regularization. When used with linear models, L1 regularization and L2 regularization are also known as Lasso and Ridge, respectively.
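A minimal sketch of L1-penalized (Lasso) selection: features whose coefficients are driven to exactly zero by the penalty are discarded (the dataset and alpha value are placeholders):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# alpha plays the role of lambda above: larger alpha -> stronger penalty, sparser w
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.sum(lasso.coef_ != 0), "features kept out of", X.shape[1])

# Keep only the features with non-zero coefficients
X_selected = SelectFromModel(lasso, prefit=True).transform(X)
print(X_selected.shape)
```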

2). Feature selection method based on tree model

At each step of the tree-growing process, these algorithms must select a feature that splits the sample set into subsets of higher purity, and each time they choose the feature that gives the best split, so the process of growing a decision tree is itself a feature-selection process. When the decision tree is fully grown, the set of features used at the node splits is the final selected feature subset. Algorithms such as gradient-boosted decision trees (GBDT) and random forests (RF) are often used in competitions.

Example:
  • Input the 200-dimensional features obtained from the preliminary screening into xgboost (an efficient implementation of the gradient boosting machine (GBM) algorithm)
  • Training yields the importance of each feature, i.e., the weight of the role it plays when splitting tree nodes; a threshold is then chosen by hand to select the feature subset
  • To ensure that important features are not missed, it is advisable to set the tree depth relatively high (a sketch follows below)
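A minimal sketch of this procedure with xgboost (the dataset, tree depth, and threshold are placeholders standing in for the contest's 200-dimensional features):

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Depth set relatively high so that important features are not missed
model = xgb.XGBClassifier(n_estimators=100, max_depth=6, eval_metric="logloss")
model.fit(X, y)

importance = model.feature_importances_       # importance from tree-node splits
threshold = 0.01                              # chosen by hand, as in the text
selected = np.where(importance > threshold)[0]
print(len(selected), "features selected:", selected)
```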

Disadvantages of traditional feature engineering

Traditional feature engineering relies heavily on manual design and expert domain knowledge, which is difficult and time-consuming; this motivates learning features automatically from the data, as introduced in the next section.

Feature learning

How can features be learned automatically from the data itself? Here we mainly introduce three network structures commonly used in deep learning.

Autoencoder structure (Auto-Encoder)

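The figure is not reproduced; below is a minimal PyTorch sketch of the idea: an encoder compresses the input into a low-dimensional code (the learned features) and a decoder tries to reconstruct the input from that code, so the network is trained purely on reconstruction error, without labels (the sizes and data are placeholders):

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, n_input=64, n_code=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_input, 32), nn.ReLU(),
                                     nn.Linear(32, n_code))
        self.decoder = nn.Sequential(nn.Linear(n_code, 32), nn.ReLU(),
                                     nn.Linear(32, n_input))

    def forward(self, x):
        code = self.encoder(x)               # low-dimensional learned features
        return self.decoder(code)

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 64)                      # dummy data for illustration

for _ in range(100):
    loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

features = model.encoder(x)                  # use the code as features downstream
print(features.shape)                        # torch.Size([128, 8])
```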

Convolutional Neural Network (CNN): Commonly used in image feature extraction



Recurrent Neural Network (RNN): Commonly used for feature extraction of sequence data


Feature learning using standard datasets (feature pre-training)

  • Purpose: verifying model performance & pre-training models for application problems
  • Image data pre-training: ImageNet
    • http://www.image-net.org/
    • 14 million images, 20,000 categories, with labels
    • Commonly used models: ResNet, AlexNet, VGG, etc.
    • Common applications: image classification, object detection, object localization, scene classification, etc.
  • Text data pre-training: Twitter, Wiki
    • https://nlp.stanford.edu/projects/glove/
    • 2 billion tweets, 27 billion tokens, 1.2M vocabulary
    • Commonly used models: word-embedding models such as CBOW, Skip-gram (Word2Vec), and GloVe
    • Common applications: text classification, textual inference, translation, etc.
  • The pre-trained features can be used directly as the input of other models (a sketch follows below)
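For example, a hedged sketch of using an ImageNet-pre-trained ResNet from torchvision as a fixed feature extractor (assumes torchvision >= 0.13 for the weights argument; the random tensor stands in for a preprocessed image batch):

```python
import torch
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet and drop its classification head
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
extractor.eval()

images = torch.rand(4, 3, 224, 224)          # stand-in for a preprocessed image batch
with torch.no_grad():
    feats = extractor(images).flatten(1)     # one 512-dimensional feature per image

print(feats.shape)                           # torch.Size([4, 512])
# These vectors can now be fed directly into another model (e.g., an SVM or a small classifier)
```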


Origin blog.csdn.net/weixin_56462041/article/details/130145480