Feature Engineering: Feature Generation, Feature Selection (3)

Reprinted from: https://blog.csdn.net/cymy001/article/details/79169862

Feature generation

A new feature introduced during feature engineering needs to be verified to actually improve prediction accuracy, rather than being a useless feature that only increases the computational complexity of the algorithm.

1. Timestamp processing

Timestamp attributes usually need to be decomposed into multiple dimensions such as year, month, day, hour, minute, and second. But in many applications much of this information is unnecessary. For example, in a supervised system that tries to predict the degree of traffic failure in a city as a function of "location + time", a model fed the raw timestamp is likely to be misled into learning trends that differ only by seconds, which is unreasonable. Likewise, the "year" dimension adds little useful variation to the model; we may only need dimensions such as hour, day, and month. So when representing time, try to make sure that everything you provide is something your model actually needs. And don't forget time zones: if your data comes from different geographic sources, normalize it using the time zone.
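A minimal pandas sketch of this kind of decomposition, assuming a hypothetical column named ts that holds time-zone-aware timestamps:

```python
import pandas as pd

# Hypothetical raw data: timestamps carrying an explicit UTC+8 offset.
df = pd.DataFrame({"ts": ["2018-01-25 08:30:00+08:00",
                          "2018-01-25 19:05:00+08:00"]})

# Parse with time-zone awareness and normalize everything to UTC.
df["ts"] = pd.to_datetime(df["ts"], utc=True)

# Keep only the dimensions the model plausibly needs (e.g. month, day, hour);
# year, minute and second are deliberately dropped here.
df["month"] = df["ts"].dt.month
df["day"] = df["ts"].dt.day
df["hour"] = df["ts"].dt.hour
```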

2. Decomposing category attributes (one-hot)

Some attributes are categorical rather than numeric. A simple example is a color attribute taking values in {red, green, blue}. The most common approach is to convert each categorical value into its own binary attribute that takes a value in {0, 1}. The number of added attributes therefore equals the number of categories, and for each instance in your dataset only one of them is 1 (the others are 0). This is one-hot encoding (similar to converting to dummy variables).
If you are not familiar with this encoding, the decomposition may look like unnecessary trouble (because it greatly increases the dimensionality of the dataset), and you might instead try converting the categorical attribute into a single scalar value, for example mapping {red, green, blue} to {1, 2, 3}. There are two problems with this. First, for a mathematical model it implies that red and green are in some sense more "similar" to each other than red and blue are (since |1 - 2| < |1 - 3|). Unless your categories have a natural ordering (such as stations along a railway line), this may mislead your model. Second, it can make statistical summaries (like the mean) meaningless, or worse, misleading. Returning to the color example: if your dataset contains equal numbers of red and blue instances but no green ones, the mean of the color values is still 2, which corresponds to green.
The one case where converting a categorical attribute into a scalar works well is when there are only two categories, i.e. {0, 1} corresponds to {category 1, category 2}. No ordering is required, and the value can be interpreted as the probability of belonging to category 1 or category 2.
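A small sketch of one-hot encoding with pandas, assuming an illustrative color column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# One binary column per category; for each row exactly one of them is 1.
dummies = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df.drop(columns="color"), dummies], axis=1)
print(df)
```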

3. Binning/partitioning (numerical to categorical)

Sometimes it makes more sense to convert a numeric attribute into a categorical one, dividing its range of values into definite blocks so that the algorithm is less affected by noise. Partitioning is meaningful only when you understand the domain knowledge behind the attribute and can determine that it divides into clean ranges, that is, when all values falling into one partition exhibit a common characteristic. In practice, partitioning can help avoid overfitting when you do not want your model to keep trying to distinguish between values that are very close together. For example, if you are interested in a city as a whole, you can merge all the values that fall inside that city into a single block. Binning can also reduce the impact of small errors by mapping a given value to its nearest block. However, if the number of bins approaches the number of possible values, or if precision is important to you, binning is not appropriate.
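A possible sketch with pandas, assuming illustrative age values and domain-motivated bin edges:

```python
import pandas as pd

age = pd.Series([3, 17, 25, 42, 68, 90])

# Fixed bin edges chosen from domain knowledge; the labels replace the raw numbers.
age_group = pd.cut(age, bins=[0, 18, 35, 60, 120],
                   labels=["minor", "young adult", "middle-aged", "senior"])
print(age_group)
```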

4. Cross features (combined categorical features)

Cross features are one of the most important techniques in feature engineering: two or more categorical attributes are combined into one. This is very useful when the combined feature carries more information than the individual features. Mathematically, it is the Cartesian product of all possible values of the categorical features. If feature A has two possible values {A1, A2} and feature B has possible values {B1, B2}, then the cross features between A and B are {(A1,B1), (A1,B2), (A2,B1), (A2,B2)}. You can give these combined features any name you like, but keep in mind that each combined feature represents the joint information of A and B. A good example of a useful cross feature is (longitude, latitude). The same longitude corresponds to many places on the map, and so does the same latitude, but once longitude and latitude are combined they identify a specific geographic area, within which properties tend to be similar.
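A minimal sketch of crossing two categorical features with pandas, using the A/B example above (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": ["A1", "A2", "A1"], "B": ["B1", "B1", "B2"]})

# Concatenate the two categorical values, then one-hot encode the combination,
# giving one indicator column per (A, B) pair such as (A1, B1), (A1, B2), ...
df["A_x_B"] = df["A"] + "_" + df["B"]
crossed = pd.get_dummies(df["A_x_B"], prefix="cross")
```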

5. Feature selection

To get a better model, an algorithm is used to automatically select a subset of the original features. In this process you do not build or modify features; you prune them to reduce noise and redundancy. Among the data features, some are more important than others for improving model accuracy, and some are redundant given other features. Feature selection automatically chooses the subset of features most useful for solving the problem. Feature selection algorithms may use scoring methods, such as correlation or other measures of feature importance, to rank and select features; other methods search over subsets of features by trial and error.
Auxiliary models can also be built. Stepwise regression is an example of a feature selection algorithm that runs automatically during model construction. Regularization methods such as Lasso regression and ridge regression also perform a form of feature selection by adding extra constraints or penalty terms to an existing model (its loss function), which helps prevent overfitting and improves generalization.
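As a sketch of the regularization idea mentioned above, one way to select features with an L1-penalized model in scikit-learn (the diabetes dataset and the alpha value are only illustrative choices):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# The L1 penalty drives the coefficients of uninformative features to zero;
# SelectFromModel keeps only the features with non-zero coefficients.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)
```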

6. Feature scaling

Some features span a much larger range of values than others; compare a person's income with his or her age, for example. More specifically, some models (such as ridge regression) require that feature values be scaled to the same range. Scaling prevents some features from receiving vastly different weights than others merely because of their scale.
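A short sketch with scikit-learn's StandardScaler, using made-up age/income values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Income spans a much larger numeric range than age.
X = np.array([[25, 30_000.0],
              [40, 120_000.0],
              [58, 65_000.0]])

# Each column is rescaled to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
```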

7. Feature extraction

Feature extraction involves a family of algorithms that automatically generate a new set of features from the original attributes; dimensionality reduction algorithms fall into this category. Feature extraction is the process of automatically reducing the dimensionality of the observations to a much smaller set that is still sufficient for modeling. For tabular data, available methods include projection methods such as principal component analysis and unsupervised clustering algorithms. For image data, they may include line detection and edge detection; each field has its own specialized methods.
The key point of feature extraction is that these methods are automatic (though they may need to be designed and built up from simpler methods) and can handle otherwise unmanageable high-dimensional data. In most cases, these different kinds of data (images, language, video, and so on) are stored in a numeric format that approximates the underlying observation.
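A minimal PCA sketch with scikit-learn, projecting the four iris attributes onto two components (the number of components here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Automatically extract 2 new features (principal components) from the 4 original ones.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```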

Feature selection

A feature selection procedure generally consists of four steps:
(1) Subset generation: generate candidate feature subsets according to a certain search strategy;
(2) Subset evaluation: evaluate the quality of each feature subset with an evaluation function;
(3) Stopping condition: decide when the feature selection algorithm should stop;
(4) Subset validation: verify the validity of the finally selected feature subset.

The search strategies used in feature selection fall into three families: complete search, heuristic search, and random search.
Feature selection is essentially a combinatorial optimization problem, and the most direct way to solve a combinatorial optimization problem is to search. In theory, all possible feature combinations can be examined by exhaustive search, and the subset that optimizes the evaluation criterion is returned as the final output. However, the search space over n features has size 2^n, so the computational cost of the exhaustive method grows exponentially with the feature dimension. Practical applications often involve hundreds or even thousands of features, so although the exhaustive method is simple, it is rarely applicable in practice. Other search methods include heuristic search and random search. These strategies strike a better balance between computational efficiency and the quality of the selected feature subset, which is the goal of many feature selection algorithms.
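To make the 2^n growth concrete, here is a small exhaustive-search sketch (the iris dataset, logistic regression, and 3-fold cross-validation are only illustrative choices; with n = 4 features there are just 15 non-empty subsets):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -np.inf, None
# Enumerate all 2^n - 1 non-empty subsets -- only feasible for small n.
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=3).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```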

Complete search (Complete)

Breadth-First Search: traverse the feature subspace breadth-first. This enumerates all combinations; it is an exhaustive search and not very practical.
Branch and Bound: add branch-and-bound pruning on top of exhaustive search, for example by pruning branches that cannot possibly outperform the current best solution.
Others, such as Beam Search and Best-First Search.

Heuristic

Sequential Forward Selection (SFS): start from the empty set and add the single best feature at each step.
Sequential Backward Selection (SBS): start from the full set and remove the single least useful feature at each step.
Plus-L Minus-R Selection (LRS): either start from the empty set, adding L features and then removing R features at each round (with L > R), or start from the full set, removing R features and then adding L features at each round (with L < R), keeping the best subset found.
Others, such as Bidirectional Search (BDS) and Sequential Floating Selection.
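SFS and SBS are available directly in recent versions of scikit-learn (0.24+) as SequentialFeatureSelector; a short sketch with illustrative choices of estimator and subset size:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# direction="forward" corresponds to SFS, direction="backward" to SBS.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=2,
                                direction="forward").fit(X, y)
X_selected = sfs.transform(X)
```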

Random search (Random)

Random Generation plus Sequential Selection (RGSS): randomly generate a feature subset, then run the SFS and SBS algorithms starting from that subset.
Simulated Annealing (SA): accept a solution worse than the current one with a certain probability, and let this probability decrease gradually over time.
Genetic Algorithm (GA): breed the next generation of feature subsets through operations such as crossover and mutation; the higher a subset's score, the higher its probability of being selected for reproduction.
A common shortcoming of random algorithms: because they rely on random factors, the experimental results are difficult to reproduce.
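A deliberately simplified sketch of the random-generation idea (it only evaluates random subsets and omits the follow-up SFS/SBS step of full RGSS; the dataset and model are illustrative):

```python
import random

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rng = random.Random(0)  # fix the seed, since random strategies are hard to reproduce

best_score, best_subset = -np.inf, None
for _ in range(20):
    # Randomly generate a candidate subset; fall back to feature 0 if it is empty.
    subset = [i for i in range(X.shape[1]) if rng.random() < 0.5] or [0]
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, subset], y, cv=3).mean()
    if score > best_score:
        best_score, best_subset = score, subset
```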

Embedded: in embedded feature selection, the feature selection algorithm itself is built into the learning algorithm. The most typical examples are decision tree algorithms such as ID3, C4.5, and CART. At every recursive step of tree growth, a decision tree must select a feature to split the sample set into smaller subsets, and the criterion for choosing that feature is usually the purity of the child nodes after the split: the purer the child nodes, the better the split. Thus the process of growing a decision tree is itself a process of feature selection.
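A small sketch of the embedded idea using a decision tree in scikit-learn: the importances produced while the tree grows are reused to select features (keeping features above the mean importance is SelectFromModel's default threshold):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The tree chooses split features by node purity while it grows;
# the resulting importances fall out of that same process.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)

# Keep the features whose importance exceeds the mean importance.
X_selected = SelectFromModel(tree, prefit=True).transform(X)
```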

Filter: the evaluation criteria of filter-based feature selection are derived from the intrinsic properties of the dataset itself and are independent of any specific learning algorithm, so the approach has good generality. It usually selects features or feature subsets that are highly correlated with the class label, on the assumption that more relevant features or subsets will yield higher accuracy on a classifier. Filter methods use four kinds of evaluation criteria: distance measures, information measures, correlation measures, and consistency measures.
Advantages: the algorithm is highly general; it skips classifier training, so its complexity is low and it suits large-scale datasets; it can quickly remove large numbers of irrelevant features, which makes it well suited as a pre-filter. Disadvantages: because the evaluation criteria are independent of the specific learning algorithm, the selected feature subset is usually inferior to the Wrapper approach in terms of classification accuracy.
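A minimal filter-style sketch with scikit-learn, scoring each feature against the label independently of any classifier (the mutual-information score and k=2 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Rank features by an information measure computed from the data alone,
# then keep the k highest-scoring ones -- no classifier is trained.
X_selected = SelectKBest(score_func=mutual_info_classif, k=2).fit_transform(X, y)
```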

Wrapper: wrapper-based feature selection uses the performance of a learning algorithm to evaluate the quality of a feature subset. For every candidate subset, the Wrapper method therefore trains a classifier and evaluates the subset according to that classifier's performance. Many learning algorithms can be used for the evaluation, such as decision trees, neural networks, Bayesian classifiers, nearest-neighbor methods, and support vector machines.
Advantages: compared with the Filter approach, the feature subsets found by the Wrapper approach generally have better classification performance. Disadvantages: the selected features are not general; when the learning algorithm changes, feature selection must be redone for the new algorithm. Because a classifier is trained and tested for every subset evaluated, the computational complexity is high, and for large-scale datasets the running time becomes very long.
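A minimal wrapper-style sketch using recursive feature elimination (RFE) in scikit-learn; the linear SVM and the target subset size are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The SVM is retrained repeatedly and the lowest-weighted features are pruned
# each round, so the evaluation is tied to this particular classifier.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=2).fit(X, y)
X_selected = rfe.transform(X)
```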

The goal of unsupervised feature learning is to capture the underlying structure of high-dimensional data and extract low-dimensional features.
In feature learning, the K-means algorithm can cluster unlabeled input data and then use the resulting centroids to generate new features.
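A possible sketch of that idea with scikit-learn, assuming the number of clusters is chosen by hand: the distance of each sample to every centroid becomes a new feature.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels are ignored: this is unsupervised

# Cluster the data, then use the distance to each centroid as a new feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
new_features = kmeans.transform(X)  # shape (n_samples, 3): distances to the 3 centroids
```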

The above content references:
1) http://www.dataguru.cn/article-9861-1.html 
2) https://www.jianshu.com/p/ab697790090f
