Machine Learning - Feature Engineering

Disclaimer: This is an original blog post, released under the CC 4.0 BY-SA license. When reposting, please attach a link to the original source and this notice.
Original link: https://blog.csdn.net/xiao_lxl/article/details/97391338


In machine learning, the data and the features determine the upper limit of what can be achieved; models and algorithms merely approach that upper limit.

Feature engineering is a series of processing steps applied to raw data, refining it into features that serve as input to algorithms and models. In essence, feature engineering is a process of representing the data. In practice, its main goals are to remove impurities and redundancy from the raw data and to design more effective features that describe the relationship between the problem being solved and the prediction model.

1. Why do we need to normalize numerical features?

Normalization removes the influence of differing units and scales between data features, making different indicators comparable.

Normalizing (Normalization) numerical features maps all features into a roughly common interval.

The two most commonly used normalization methods are:

1) Min-Max Scaling (linear normalization). A linear transformation of the original data that maps the result into [0, 1], scaling the raw data proportionally.

The normalization formula is:

X_norm = (X - X_min) / (X_max - X_min)

where X is the original value, and X_max and X_min are the maximum and minimum of that feature over the data set.
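As a quick illustration, here is a minimal NumPy sketch of Min-Max Scaling (the matrix `X` below is made-up data, one column per feature):

```python
import numpy as np

# Made-up feature matrix: one row per sample, one column per feature.
X = np.array([[1.0, 200.0],
              [5.0, 800.0],
              [10.0, 400.0]])

X_min = X.min(axis=0)                   # per-feature minimum
X_max = X.max(axis=0)                   # per-feature maximum
X_norm = (X - X_min) / (X_max - X_min)  # every column now lies in [0, 1]
print(X_norm)
```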

2) The other common method is z-score normalization (zero-mean normalization), which maps the feature to a distribution with mean 0 and standard deviation 1. Precisely, assuming the original feature has mean μ and standard deviation σ, z-score normalization is defined as:
z = (x - μ) / σ
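And the corresponding z-score sketch (same made-up matrix as above; `mu` and `sigma` are the per-feature mean and standard deviation):

```python
import numpy as np

# Same made-up matrix as in the Min-Max sketch above.
X = np.array([[1.0, 200.0],
              [5.0, 800.0],
              [10.0, 400.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature standard deviation
X_std = (X - mu) / sigma  # every column now has mean 0 and std 1
print(X_std.mean(axis=0), X_std.std(axis=0))
```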
Why is normalization of numerical features generally necessary? We can illustrate its importance with stochastic gradient descent. Suppose there are two numerical features: x1 ranges over [0, 10] and x2 over [0, 3]. The objective function can then have contour plots like the following:
[Figure: contour plots of the objective function; left: before normalization (elongated ellipses), right: after normalization (circles)]
With the same learning rate, x1 updates faster than x2, and more iterations are needed to find the optimum [1]. If x1 and x2 are normalized to the same interval, the contours of the objective become the circles on the right, x1 and x2 update at consistent rates, and the optimum is found faster.
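The effect can be reproduced with a toy experiment (not from the original post): run gradient descent on the quadratic bowl J(w) = (s1*w1)^2 + (s2*w2)^2, where s1 and s2 stand in for the feature scales, and count iterations until convergence:

```python
import numpy as np

def steps_to_converge(s1, s2, tol=1e-6, max_iter=100_000):
    """Gradient descent on J(w) = (s1*w1)^2 + (s2*w2)^2 from w = (1, 1)."""
    lr = 0.9 / (2 * max(s1, s2) ** 2)  # near-largest stable step size
    w = np.array([1.0, 1.0])
    for i in range(max_iter):
        grad = np.array([2 * s1**2 * w[0], 2 * s2**2 * w[1]])
        w = w - lr * grad
        if np.abs(w).max() < tol:
            return i + 1
    return max_iter

print(steps_to_converge(10.0, 3.0))  # unequal scales, elongated contours: many steps
print(steps_to_converge(1.0, 1.0))   # equal scales, circular contours: a handful
```

With unequal scales, the step size must be set for the steepest direction, which starves the flatter one; after normalization, a single step size suits both directions.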

Which models need normalization, and which do not? Models solved by gradient descent generally need normalized inputs, including linear regression, logistic regression, support vector machines (Support Vector Machine), and neural networks. Decision-tree models, however, do not. Take C4.5 as an example: when splitting a node, the decision tree relies mainly on the information gain ratio of conditions such as x >= threshold and x < threshold, and the information gain ratio is unaffected by whether x has been normalized, because normalization does not change the relative order of the samples on x.
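A small experiment (a sketch using scikit-learn, not code from the original post) makes the tree invariance concrete; since Min-Max Scaling is monotonic, the fitted tree should make identical predictions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)  # monotonic per-feature rescaling

pred_raw = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
pred_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)
print(np.array_equal(pred_raw, pred_scaled))  # expected: True
```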

2. How should categorical features be handled?

Categorical features are features that take values from a limited set of options, such as gender (male, female) or blood type (A, B, AB, O). They usually appear in the raw input as strings, and only a few models, such as decision trees, can consume string inputs directly. For models such as logistic regression and linear support vector machines, categorical features must be converted into numerical features before the model can work properly.

This section describes three commonly used conversion methods:
Ordinal Encoding,
One-Hot Encoding,
Binary Encoding.

Ordinal Encoding is generally used when the categories have an inherent size relationship. For example, grades can be divided into Low, Medium, and High, with the ordering High > Medium > Low. Ordinal Encoding assigns each category a numerical ID according to this size relation, e.g. High is encoded as 3, Medium as 2, and Low as 1; the converted values retain the ordering.
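A minimal sketch (the grades list is made-up data):

```python
# Ordinal encoding that preserves the ordering High > Medium > Low.
grade_order = {"Low": 1, "Medium": 2, "High": 3}

grades = ["High", "Low", "Medium", "High"]  # made-up sample data
encoded = [grade_order[g] for g in grades]
print(encoded)  # [3, 1, 2, 3]
```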

One-Hot Encoding is usually used for features whose categories have no size relationship. Taking blood type as an example, there are four values (A, B, AB, O); with One-Hot Encoding, blood type becomes a 4-dimensional sparse vector: type A is represented as (1, 0, 0, 0), type B as (0, 1, 0, 0), type AB as (0, 0, 1, 0), and type O as (0, 0, 0, 1). When the number of categories is large, the following points deserve attention:
Use a sparse vector to save space: in a one-hot encoding, only one dimension has the value 1 and all other positions are 0, so the vector can be stored in sparse form, and most algorithm implementations now accept sparse vectors as input (see the sketch after this list).
Combine with feature selection to reduce dimensionality: high-dimensional features cause several problems: (1) in the K-nearest-neighbors algorithm, distances between two points in a high-dimensional space are hard to measure effectively; (2) in logistic regression, the number of model parameters grows with the dimension, which easily leads to overfitting; (3) usually only some of the dimensions help with classification or prediction, so it is worth pairing one-hot encoding with feature selection to reduce the dimensionality.
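As a sketch of the first point, one-hot vectors for the blood-type example can be built directly in SciPy's sparse format, so only the single nonzero entry per row is stored (the samples list is made up):

```python
import numpy as np
from scipy.sparse import csr_matrix

categories = ["A", "B", "AB", "O"]
index = {c: i for i, c in enumerate(categories)}

samples = ["A", "O", "AB", "A", "B"]          # made-up sample data
rows = np.arange(len(samples))                # one row per sample
cols = np.array([index[s] for s in samples])  # position of the single 1
data = np.ones(len(samples))

one_hot = csr_matrix((data, (rows, cols)), shape=(len(samples), len(categories)))
print(one_hot.toarray())  # first row is (1, 0, 0, 0), i.e. blood type A
```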

Finally, Binary Encoding proceeds in two steps: first assign each category an ID via Ordinal Encoding, then take the binary representation of that ID as the encoding. Using blood types A, B, AB, O as an example, the procedure is shown below: type A has Ordinal Encoding 1, whose binary representation is (0, 0, 1); type B has Ordinal Encoding 2, whose binary representation is (0, 1, 0); the representations of AB and O follow in the same way. In essence, Binary Encoding hashes each category ID into its binary representation, which yields a 0/1 vector of lower dimension than One-Hot Encoding and saves storage space.
Blood type | Ordinal ID | Binary encoding
A          | 1          | (0, 0, 1)
B          | 2          | (0, 1, 0)
AB         | 3          | (0, 1, 1)
O          | 4          | (1, 0, 0)
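A minimal sketch of the two steps (assign an ordinal ID, then write it in binary):

```python
categories = ["A", "B", "AB", "O"]  # ordinal IDs 1..4, as in the table above

# Number of bits needed for the largest ID (3 bits for IDs 1..4).
width = len(bin(len(categories))[2:])

def binary_encode(category):
    ordinal = categories.index(category) + 1
    return [int(b) for b in format(ordinal, f"0{width}b")]

print(binary_encode("A"))  # [0, 0, 1]  (ID 1)
print(binary_encode("O"))  # [1, 0, 0]  (ID 4)
```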
Beyond the encoding methods in this section, interested readers can refer to [2] for other methods, including Helmert Contrast, Sum Contrast, Polynomial Contrast, and Backward Difference Contrast.

3. How should high-dimensional combinations of features be handled?

To improve the ability to fit complex relationships, feature engineering often combines pairs of first-order discrete features into second-order features, forming interaction features (Interaction Feature). Take the problem of predicting ad clicks as an example: as shown in the first table below, the raw data contains two discrete features, language and type. To improve the fitting ability, language and type can be combined into second-order features, as shown in the second table:

[Tables 1 and 2: the raw discrete features (language, type) and their second-order combinations]
Taking logistic regression as an example, suppose the data's feature vector is X = (x_1, x_2, ..., x_k); then we have
Y = sigmoid( sum_i sum_j w_ij <x_i, x_j> )
where <x_i, x_j> denotes the second-order combination of x_i and x_j, and the dimension of w_ij equals |x_i| * |x_j|. In the ad-click prediction problem above, the dimension of w is 2 × 2 = 4 (language takes two values, Chinese or English; type takes two values, movie or TV series).
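A sketch of how such second-order features can be enumerated (the values mirror the language × type example; the index helper is hypothetical, not from the post):

```python
from itertools import product

languages = ["Chinese", "English"]
types = ["movie", "TV series"]

# Every (language, type) pair becomes one cross feature with its own weight,
# so w has |languages| * |types| = 4 dimensions.
cross_index = {pair: i for i, pair in enumerate(product(languages, types))}

sample = ("English", "movie")  # made-up sample
print(cross_index[sample])     # index of the cross feature this sample activates
print(len(cross_index))        # 4
```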
Combining features this way seems fine for the ad-click prediction problem above, but a problem arises once ID-type features are introduced. Take a recommendation problem as an example: in the raw data, combining the user ID with the item ID produces the new features shown below:
[Tables 3 and 4: raw user ID / item ID data and the features formed by combining them]
Suppose the number of users is m and the number of items is n; then the parameters to learn have size m × n. On the Internet, both the number of users and the number of items can reach tens of millions, making it almost impossible to learn m × n parameters. In this case, an effective approach is to represent each user and each item by a k-dimensional low-dimensional vector (k << m, k << n):
Y = sigmoid( sum_i sum_j (x_i' · x_j') <x_i, x_j> )
that is, w_ij = x_i' · x_j', where x_i' and x_j' are the low-dimensional vectors corresponding to x_i and x_j. In the recommendation problem above, the parameters to learn then have size m × k + n × k. Readers familiar with recommendation algorithms will recognize that this is exactly the idea behind matrix factorization; we hope this offers an alternative way to understand it.
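A quick numeric sketch of the saving (sizes and vectors are made up; `U` and `V` hold the low-dimensional user and item vectors x_i' and x_j'):

```python
import numpy as np

m, n, k = 10_000, 5_000, 16  # users, items, latent dimension (k << m, n)
rng = np.random.default_rng(0)
U = rng.normal(size=(m, k))  # user vectors x_i'
V = rng.normal(size=(n, k))  # item vectors x_j'

w_ij = U[42] @ V[7]          # combination weight for user 42 and item 7
print(m * n)                 # 50,000,000 parameters without factorization
print((m + n) * k)           # 240,000 parameters with factorization
```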

4. How can we find combinations of features effectively?

The previous section introduced dimensionality reduction as a way to cut the number of parameters learned for pairwise feature combinations. In real data, however, we often face many kinds of high-dimensional features. Simply combining them pairwise still leaves too many parameters and risks overfitting, and not all combinations are meaningful. We therefore need an effective way to find which features should be combined.

This section describes a decision-tree-based method for finding feature combinations [3]. Take click prediction as an example: suppose the raw input features include age, gender, user type (trial, paid), and item type (skin care, food, etc.), and suppose we have constructed two decision trees from the raw inputs and the labels (click or no click). Every path from the root to a leaf can then be viewed as one feature combination. Concretely, "age <= 35" and "gender = female" form one combination, "age <= 35" and "item type = skin care" form another, "user type = paid" and "item type = food" a third, and "user type = paid" and "age <= 40" a fourth. Given the two samples in the table below, the first sample can be encoded as (1, 1, 0, 0), since it satisfies both "age <= 35" and "gender = female" as well as "age <= 35" and "item type = skin care". Likewise, the second sample can be encoded as (0, 0, 1, 1), since it satisfies both "user type = paid" and "item type = food" as well as "user type = paid" and "age <= 40".

[Table: two example samples and their encodings under the four combinations above; sample 1 encodes as (1, 1, 0, 0), sample 2 as (0, 0, 1, 1)]
Finally, how are the multiple decision trees constructed from the raw input? Here we use Gradient Boosting Decision Trees (GBDT), whose core idea is that each new tree is built on the residuals of the trees built before it.
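The pipeline can be sketched with scikit-learn (an illustration in the spirit of this section, not the post's own code): each tree's leaf index identifies one root-to-leaf path, i.e. one learned feature combination, which can then be one-hot encoded and fed to a linear model such as logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Made-up binary classification data standing in for click/no-click logs.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
gbdt.fit(X, y)

leaves = gbdt.apply(X)[:, :, 0]  # (n_samples, n_trees) leaf index per tree
combos = OneHotEncoder().fit_transform(leaves)  # one column per root-to-leaf path
print(combos.shape)              # these become input features for a linear model
```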
