An Introduction to Feature Engineering and Seven Commonly Used Methods

First, an introduction to feature engineering

  • Simply put, feature engineering is the art of presenting data to an algorithm. Good feature engineering blends domain expertise, intuition, and basic mathematical skill;

  • Essentially, the data presented to the algorithm should capture the relevant structure or properties of the underlying data. Feature engineering is the process of converting raw data attributes into features. Attributes represent all the dimensions of the data used in modeling; if a model learns directly from all the raw attributes, it may fail to find the underlying trends in the data. By preprocessing the data with feature engineering, your model suffers less interference from noise and is better able to identify those trends;

  • In fact, good features can let you achieve good results even with a simple model;

  • However, when feature engineering introduces a new feature, you do need to verify that it actually improves prediction accuracy, rather than adding a useless feature that only increases the computational complexity of the algorithm.

Second, common methods

1. Timestamp processing

Timestamps often need to be separated into multiple dimensions, such as year, month, day, hour, minute, and second. In many applications, however, much of that information is unnecessary, so when you present time to a model, try to make sure you provide only the components your model actually needs. And do not forget time zones: if your data comes from sources in different geographical regions, remember to normalize the timestamps with time-zone information.
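As a minimal sketch of this step (assuming pandas is available; the data and column names are invented for illustration), the following normalizes mixed-time-zone timestamps to UTC and keeps only a few calendar components:

```python
import pandas as pd

# Hypothetical event log with timestamps from two different time zones.
df = pd.DataFrame({
    "ts": ["2023-06-01 09:30:00+08:00", "2023-06-01 02:15:00+01:00"],
})

# Normalize all timestamps to a single time zone (UTC) before splitting.
df["ts"] = pd.to_datetime(df["ts"], utc=True)

# Keep only the components the model actually needs.
df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["day"] = df["ts"].dt.day
df["hour"] = df["ts"].dt.hour

print(df[["year", "month", "day", "hour"]])
```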

2. Discrete variable processing

As a simple example, consider a discrete variable composed of the values {red, yellow, blue}. The most common approach is to turn each possible value into a binary feature taking a value from the set {0, 1}; this is often called one-hot encoding.
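A minimal sketch with pandas (the column name "color" is invented for illustration):

```python
import pandas as pd

# A discrete variable with three possible values.
df = pd.DataFrame({"color": ["red", "yellow", "blue", "red"]})

# One-hot encoding: one binary {0, 1} column per possible value.
one_hot = pd.get_dummies(df["color"], prefix="color").astype(int)
print(one_hot)
# Columns: color_blue, color_red, color_yellow, each 0 or 1 per row.
```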

3. Binning / partitioning

Sometimes, converting a continuous variable into categories makes patterns more apparent while also reducing noise interference for the algorithm: values falling within a given range are grouped into a block. For example, suppose we want to predict which people will buy a certain product in our shop. A user's age is a continuous variable, and we can divide it into 15 and under, 15-24, 25-34, 35-44, and 45 and above. Moreover, rather than treating these categories as unrelated split points, you can represent them with scalar values, because similar ages exhibit similar properties.

Partitioning an attribute into simple ranges only makes sense when you understand the variable's domain well enough, that is, when all the values that fall into one partition share common characteristics. In practice, binning can help avoid overfitting when you do not want your model to keep trying to distinguish values that are too close together. For example, if what interests you is a city as a whole, you can aggregate all the values that fall within that city into one whole. Binning also reduces the influence of small errors by assigning a given value to its nearest block. However, if the number of bins is close to the number of all possible values, or if accuracy is very important to you, binning is inappropriate.
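A minimal sketch with pandas (the ages are invented; the bin edges follow the example above):

```python
import pandas as pd

ages = pd.Series([12, 18, 27, 41, 70])

# Bin edges matching the ranges above: <=15, 15-24, 25-34, 35-44, 45+.
bins = [0, 15, 24, 34, 44, 120]
labels = ["<=15", "15-24", "25-34", "35-44", "45+"]

age_bin = pd.cut(ages, bins=bins, labels=labels)

# Alternatively, use ordinal codes instead of hard categories,
# since neighboring ages behave similarly.
age_code = pd.cut(ages, bins=bins, labels=False)

print(pd.DataFrame({"age": ages, "bin": age_bin, "code": age_code}))
```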

4. Feature crosses

Feature crosses can be regarded as one of the most important methods in feature engineering: two or more categorical attributes are combined into a single feature. This is a very useful technique when the combination captures information that no single feature does. Mathematically, it is the cross product of all the values of the categorical features.

Suppose a feature A has two possible values {A1, A2}, and a feature B has possible values {B1, B2}. Then the cross of A and B is {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}, and you can give these combined features whatever names you like. Just keep in mind that each combination represents the synergy between the information carried by A and B.
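A minimal sketch with pandas (the column names and values are invented for illustration): the cross is built by concatenating the two categorical values, and the result can then be one-hot encoded like any other categorical feature:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["A1", "A1", "A2", "A2"],
    "B": ["B1", "B2", "B1", "B2"],
})

# Cross feature: the Cartesian combination of the values of A and B.
df["A_x_B"] = df["A"] + "_" + df["B"]

# One-hot encode the crossed feature.
crossed = pd.get_dummies(df["A_x_B"], prefix="cross").astype(int)
print(crossed)
```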

5. Feature selection

To obtain a better model, feature selection uses certain algorithms to automatically choose a subset of the original features. In this process you neither build new features nor modify existing ones; the improvement comes from reducing redundancy and noise among the features.

Feature selection algorithms may use scoring methods to rank and choose features, for example by correlation or by other measures of feature importance; still other methods search for feature subsets by trial and error.

There are also approaches that build feature selection into the model construction process. Stepwise regression is an example of an algorithm that performs feature selection automatically as the model is built. Regularization methods such as Lasso regression and ridge regression can also be classed as feature selection: they add extra constraints or a penalty term to the existing model (loss function), which prevents overfitting and improves generalization ability.
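A minimal sketch with scikit-learn (assuming it is installed; the synthetic data is invented for illustration), using Lasso's L1 penalty to drive the coefficients of uninformative features to zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# The L1 penalty pushes coefficients of useless features to exactly zero.
model = Lasso(alpha=1.0).fit(X, y)

# Features with nonzero coefficients are the selected subset.
selected = np.flatnonzero(model.coef_)
print("Selected feature indices:", selected)
```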

6. Feature scaling

Sometimes you may notice that certain features span a much larger range of values than others; compare a person's income with their age, for example. More concretely, some models (like ridge regression) require your feature values to be scaled into the same range. Scaling prevents some features from receiving drastically different weights merely because of their scale.
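A minimal sketch with scikit-learn (the income and age values are invented for illustration): standardization puts the two features on a comparable scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features with very different spans: income and age.
X = np.array([[50000.0, 25],
              [120000.0, 40],
              [30000.0, 60]])

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```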

7. Feature extraction

Feature extraction involves a series of algorithms that automatically generate new features from the original set of attributes; dimensionality-reduction algorithms fall into this category. Feature extraction is the process of automatically reducing the dimensionality of the observations to a set small enough for data modeling.

For tabular data, usable methods include projection methods such as principal component analysis, as well as unsupervised clustering algorithms.
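A minimal sketch with scikit-learn (the data is synthetic, for illustration only): PCA projects the observations onto a smaller number of components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 observations, 5 original attributes

# Project onto the 2 directions of greatest variance.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)
```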

For image data, this may include methods such as line detection and edge detection; each domain has its own corresponding methods.


This article is taken from the blog of "boiled water with sugar".

Reproduced from: https://www.jianshu.com/p/b718547e4c72


Origin: blog.csdn.net/weixin_34293246/article/details/91086590