Feature Engineering

There is a widely circulated saying in the industry: data and features determine the upper limit of machine learning, while models and algorithms merely approach this upper limit.

Feature engineering is essentially an engineering activity that extracts as much useful information as possible from raw data into features for use by algorithms and models.

The classic experience of feature selection can be summed up in three main approaches:

1) Filter: score each feature according to its divergence (variance) or its correlation with the target, set a threshold (or the number of features to keep), and select features accordingly.

2) Wrapper: according to an objective function (usually a predictive-performance score), add or remove several features at a time until a good subset is found.

3) Embedded: first train a machine learning model, obtain the weight coefficient of each feature, and select features by sorting the coefficients from large to small. Similar to the Filter method, except that feature quality is determined by training.
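A minimal sketch of these three approaches using scikit-learn (the iris dataset, the model choices, and keeping 2 features are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: score each feature independently (here with the ANOVA F-score) and keep the top k
X_filter = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Wrapper: recursively remove features based on the model's predictive performance
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit_transform(X, y)

# Embedded: train a model, then keep features whose learned coefficients are large enough
X_embedded = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear')).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)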


Feature preprocessing converts data, through statistical (mathematical) methods, into the form required by the algorithm; for this reason feature preprocessing is also called data preprocessing.

Normalization (min-max scaling)

Characteristics: map the original data into a given interval (by default [0, 1]) via a linear transformation.

Formula:

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

The formula acts on each column (column: feature, row: sample). Here max and min are the maximum and minimum values of each column, X'' is the final result, and mx and mi are the upper and lower bounds of the target mapping interval, which default to 1 and 0 respectively.

A worked example: for a feature column with values [1, 2, 0], min = 0 and max = 2, so the value 1 maps to (1 - 0) / (2 - 0) = 0.5 under the default [0, 1] interval.


When multiple features are equally important, normalization is needed so that no single feature dominates the final result merely because of its scale.

The disadvantage of normalization is that outliers strongly affect the maximum and minimum values, so it is only suitable for small, clean datasets in traditional settings.
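A small sketch of this weakness, using a toy single-feature column where 1000 is an outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.], [2.], [3.], [1000.]])  # 1000. is an outlier
print(MinMaxScaler().fit_transform(X).ravel())
# approximately [0. 0.001 0.002 1.] -- the three normal values are squeezed together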

Here is the Python code for normalization (using sklearn):

from sklearn import preprocessing
import numpy as np

# each column is a feature, each row is a sample
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
# scale each column to [0, 1] using that column's min and max
min_max_scaler = preprocessing.MinMaxScaler()
X_minMax = min_max_scaler.fit_transform(X)
print(X_minMax)
The following is the running result:
[[ 0.5         0.          1.        ]
 [ 1.          0.5         0.33333333]
 [ 0.          1.          0.        ]]

Standardization

Characteristics: transform the original data so that each feature has mean 0 and variance 1.

Formula:

X' = (x - mean) / σ

where mean is the per-column mean and σ is the per-column standard deviation.
The advantage of standardization is that, given a reasonable amount of data, a small number of outliers has little effect on the mean, and therefore also a relatively small effect on the variance.

Here is the Python code for standardization:

from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
# calculate mean
X_mean = X.mean(axis=0)
print('X_mean:\n',X_mean)
# calculate variance
X_std = X.std(axis=0)
print('X_std:\n',X_std)
# standardize X
X1 = (X-X_mean)/X_std
print('X1:\n',X1)
# use function preprocessing.scale to standardize X
X_scale = preprocessing.scale(X)
print('X_scale:\n',X_scale)
Running result:
X_mean:
 [ 1.          0.          0.33333333]
X_std:
 [ 0.81649658  0.81649658  1.24721913]
X1:
 [[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
X_scale:
 [[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
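In practice, to apply the same standardization to new data (for example a test set), a StandardScaler can be fitted once on the training data and reused; a minimal sketch, reusing the X defined above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)         # learns per-column mean and std from X
print(scaler.transform(X))               # same result as preprocessing.scale(X)
print(scaler.transform([[2., 2., 2.]]))  # a new sample, standardized with X's statistics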

Normalization (scaling samples to unit norm)

Normalization here means scaling each individual sample to have unit norm (the norm of every sample becomes 1). This is useful if you later want to use a quadratic form (such as a dot product) or another kernel method to compute the similarity between two samples.

The main idea is to compute the p-norm of each sample and then divide every element of that sample by the norm, so that after processing every sample has p-norm (l1-norm or l2-norm) equal to 1.

Formula for the p-norm: ||X||_p = (|x_1|^p + |x_2|^p + ... + |x_n|^p)^(1/p)

This method is mainly used in text classification and clustering. For example, the dot product of two l2-normalized TF-IDF vectors equals the cosine similarity of the two vectors.
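A minimal sketch of this property (the two vectors are arbitrary examples):

import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[1., 2., 0.]])
b = np.array([[2., 1., 1.]])
# dot product of the two l2-normalized vectors ...
cos_sim = float(normalize(a, norm='l2') @ normalize(b, norm='l2').T)
# ... equals the cosine similarity computed directly from the definition
direct = float(a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim, direct)  # both are approximately 0.7303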

1. You can use the preprocessing.normalize() function to transform the specified data:

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized                                      
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

2. You can use the preprocessing.Normalizer() class to fit the training set and then transform both the training and test sets:

>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')
>>> normalizer.transform(X)                            
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
>>> normalizer.transform([[-1.,  1., 0.]])             
array([[-0.70...,  0.70...,  0.  ...]])

When convolutional neural networks are used for image problems, there are three common kinds of image-data preprocessing, described below. We represent the data as a matrix X of shape [N x D], where N is the number of samples and D is the length of a single flattened image vector.

  • De-mean: this is the most common image-data preprocessing. In short, the per-feature mean over all training images is subtracted from each image, which centers every dimension of the input data at 0. With numpy this is a one-liner: X -= np.mean(X, axis = 0). There are variants: the simplest is to compute a single mean over all pixels and subtract the same value from every pixel; a slightly more refined version computes a separate mean for each of the three RGB color channels.
  • Normalization: intuitively, this ensures the data in every dimension varies over a comparable range. There are two common ways to achieve it. One is to divide each dimension by its standard deviation after zero-centering: X /= np.std(X, axis = 0). The other is to divide by the maximum absolute value per dimension so that all data fall between -1 and 1. More generally, normalization is worth considering on any dataset where the magnitudes of different dimensions vary greatly. For images, this step is optional: pixel values already lie in [0, 255], so the natural range of image inputs is the same across dimensions. A minimal sketch of both steps follows this list.
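Here is the sketch of de-meaning and normalizing a batch of flattened images (the shape and random data are illustrative):

import numpy as np

# toy stand-in: N=5 "images" of D=4 pixels each, values in [0, 255]
X = np.random.rand(5, 4) * 255.0

X -= np.mean(X, axis=0)  # de-mean: center each pixel dimension at 0
X /= np.std(X, axis=0)   # normalize: unit standard deviation per dimension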


The batch normalization (BN) transform is

x̂_k = (x_k - E[x_k]) / sqrt(Var[x_k])

Here E[x_k] is the mean of neuron x_k over a mini-batch of training data, and the denominator is the standard deviation of that neuron's activations over the mini-batch. When BN is applied to a convolutional layer, a strategy similar to weight sharing is used: an entire feature map is treated as one neuron, the mean and variance are computed over all samples and all spatial positions of that feature map, and the feature map is then normalized with these statistics.
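A minimal numpy sketch of these per-feature-map statistics (the shapes and the small epsilon are illustrative assumptions):

import numpy as np

x = np.random.randn(8, 3, 16, 16)             # batch of N=8 samples, C=3 feature maps
mean = x.mean(axis=(0, 2, 3), keepdims=True)  # E[x_k]: one mean per feature map
var = x.var(axis=(0, 2, 3), keepdims=True)    # Var[x_k]: one variance per feature map
x_hat = (x - mean) / np.sqrt(var + 1e-5)      # normalized feature maps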

  • PCA and whitening: another form of data preprocessing. After de-meaning, we can compute the covariance matrix of the data, which reveals the correlations between its dimensions. For preprocessing the input of a neural network, whitening is in principle the best option; however, it is computationally expensive and is not differentiable everywhere, so it is rarely used in deep learning. A short sketch of PCA whitening follows.
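A minimal sketch of PCA whitening on zero-centered data (the toy data and the 1e-5 smoothing term are illustrative assumptions):

import numpy as np

X = np.random.randn(100, 5)          # toy data: 100 samples, 5 dimensions
X -= X.mean(axis=0)                  # de-mean first
cov = X.T @ X / X.shape[0]           # covariance matrix of the data
U, S, _ = np.linalg.svd(cov)         # eigenvectors U, eigenvalues S
X_rot = X @ U                        # decorrelate: rotate into the PCA basis
X_white = X_rot / np.sqrt(S + 1e-5)  # whiten: equalize variance across dimensions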

