Processing and selection of features in machine learning


Basic concepts

Feature engineering is the process of transforming raw data attributes into features by processing the raw data. Attributes are the dimensions of the data itself, while features are salient characteristics presented by the data, usually obtained by computing, combining, or transforming attributes. For example, principal component analysis converts a large number of data attributes into a few features. To some extent, good data and features are the foundation of a well-performing model.

As the name suggests, feature engineering covers a lot of ground; the most important parts are feature processing and feature selection.

Feature processing includes:

  • Data cleaning
  • Data normalization
  • Feature construction and derivation

Feature selection includes:

  • Filter methods
  • Wrapper methods
  • Embedded methods


Data cleaning

Data cleaning is the process of finding and correcting identifiable errors in data files and processing the data into the form needed for modeling.

Data cleaning includes:

  • Missing value handling
  • Outlier detection and processing
  • Adjusting sample proportions and weights


Missing value handling

Missing values arise when data in a raw dataset are clustered, grouped, deleted, or truncated due to a lack of information; that is, the values of one or more attributes in an existing dataset are incomplete.

There are two main methods for dealing with missing values: deleting missing values and filling missing values.

1. Remove missing values

If the proportion of missing values in a sample or variable exceeds a certain threshold, say more than half, the information contained in that sample or variable is limited. Forcing the data to be filled may introduce too much artificial information and degrade the model. In this case we generally remove the entire sample or variable from the data, that is, we delete the missing values.

2. Missing value filling

  • Random filling method

As the name implies, a random value is generated to fill the missing position. This method ignores the characteristics of the data, and abnormal values may remain after filling, so it is generally not recommended.

  • Mean filling method

Find the variable most strongly correlated with the variable that has missing values, divide the data into groups by that variable, compute the mean of each group, and fill each missing position with the mean of its group. If no well-correlated variable is found, the mean of the variable's existing (non-missing) values can be used instead. This approach changes the distribution of the data to some extent.

  • Most-similar filling method

Find the sample in the dataset that is most similar to the one with the missing value, and fill the missing position with that sample's value.
Similar to mean filling, one can find the variable (say y) most strongly correlated with the variable containing missing values (say x), sort the data by y to obtain a corresponding ordering of x, and then replace each missing value of x with the preceding value in that ordering.

  • Regression filling method

Treat the variable with missing values as the target variable y and its non-missing records as the training set. Find a variable x that is highly correlated with y and fit a regression equation; then use the values of x at the positions where y is missing as the prediction set, and replace the missing values with the predicted results.

  • k-nearest-neighbor filling method

Using the kNN idea, select the k nearest neighbors of the record containing the missing value, and estimate the missing value as a weighted average of the neighbors' values, with weights based on their distances from that record.
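As a minimal sketch of two of the filling methods above (mean filling by a correlated grouping variable, and k-nearest-neighbor filling), assuming pandas and scikit-learn are available; the column names are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: "income" has missing values, "city" is a correlated grouping variable.
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B"],
    "age":    [23, 31, 45, 52, 38],
    "income": [3200.0, np.nan, 5100.0, np.nan, 4800.0],
})

# Mean filling: fill each missing value with the mean of its group,
# where the groups come from the correlated variable "city".
df["income_mean_filled"] = df["income"].fillna(
    df.groupby("city")["income"].transform("mean")
)

# k-nearest-neighbor filling: estimate each missing value as a
# distance-weighted average of the k most similar rows.
imputer = KNNImputer(n_neighbors=2, weights="distance")
filled = imputer.fit_transform(df[["age", "income"]])
df["income_knn_filled"] = filled[:, 1]

print(df)
```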


Outlier detection and processing

Outliers are measured values that deviate from the mean by more than two standard deviations; values that deviate from the mean by more than three standard deviations are called highly abnormal outliers. Outliers are generally caused by systematic error, human error, or natural variation in the data itself.

  • Univariate outlier detection (Grubbs' method)

First, arrange the observations of the variable in ascending order: x1, x2, ..., xn.

Second, calculate the mean x̄ and the standard deviation S.

At the same time, compute the deviations, that is, the difference between the mean and the maximum value and the difference between the mean and the minimum value, and take the suspect value to be the one with the larger deviation from the mean.

Next, calculate the statistic Gi = |xi − x̄| / S (the ratio of the residual to the standard deviation), where i is the index of the suspect value.

Finally, compare Gi with the critical value GP(n) given in the Grubbs table. If the calculated Gi is greater than the critical value GP(n), the measurement can be judged to be an outlier and eliminated. The critical value GP(n) depends on two parameters: the detection level α and the number of measurements n.

Detection level α: if the requirements are strict, α can be set small, for example α = 0.01, giving a confidence probability P = 1 − α = 0.99; if the requirements are loose, α can be set larger, for example α = 0.10, giving P = 0.90; commonly α = 0.05 and P = 0.95 are used.
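A minimal sketch of this procedure, assuming SciPy is available; the critical value GP(n) is approximated from the t-distribution rather than looked up in a table:

```python
import numpy as np
from scipy import stats

def grubbs_is_outlier(x, alpha=0.05):
    """Test whether the value farthest from the mean is an outlier."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, s = x.mean(), x.std(ddof=1)            # sample standard deviation S
    g = np.abs(x - mean).max() / s               # Gi: residual / standard deviation
    # Approximate critical value GP(n) from the t-distribution.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t ** 2 / (n - 2 + t ** 2))
    return g > g_crit, g, g_crit

values = [12.1, 12.3, 11.9, 12.0, 12.2, 15.8]    # 15.8 is the suspect value
print(grubbs_is_outlier(values, alpha=0.05))
```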

  • Multivariate outlier detection (based on distance calculation)

Distance-based multivariate outlier detection is similar in spirit to the k-nearest-neighbor algorithm. The general idea is to compute the distance from each sample point to the center of the data; if the distance is too large, the point is judged to be an outlier. The distance is usually measured with the Mahalanobis distance, because it is not affected by the scale of the variables and, in the multivariate case, it also accounts for the correlation between variables, which makes it preferable to the Euclidean distance.
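A minimal sketch of this idea, assuming NumPy and SciPy are available; the chi-square quantile used as the distance threshold is an assumption, since the text does not fix a specific cutoff:

```python
import numpy as np
from scipy import stats

def mahalanobis_outliers(X, alpha=0.01):
    """Flag samples whose squared Mahalanobis distance to the center is too large."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - center
    # Squared Mahalanobis distance of each sample to the center.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    # Threshold: chi-square quantile with p degrees of freedom (an assumption).
    threshold = stats.chi2.ppf(1 - alpha, df=X.shape[1])
    return d2 > threshold, d2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = np.vstack([X, [[8.0, -8.0]]])     # append one clear outlier
flags, d2 = mahalanobis_outliers(X)
print(np.where(flags)[0])             # the appended point (index 50) should be flagged
```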

  • Outlier Handling

In the univariate case, outliers can be handled much like missing values, using deletion, mean filling, or regression filling; in the multivariate case, one can try filling with the mean vector or deleting the samples.

In general, the treatment of missing values and outliers should be decided according to the actual situation, because in some cases outliers reflect real problems.


Adjusting sample proportions and weights

When a dataset suffers from sample imbalance, the proportions and weights of the samples need to be adjusted so that a better-performing model can be trained. For specific methods, see the previous article: Class Imbalance in Machine Learning
http://www.cnblogs.com/wkslearner/p/8870673.html


Data normalization

In machine learning, different models have different requirements, so we often need to apply different normalization treatments to the data in order to obtain better-performing models.

In data processing, data normalization operations that are often encountered are:

  • Making data dimensionless
  • Discretization of continuous variables
  • Quantification of discrete variables
  • Data transformation


Making data dimensionless

Dimensionless processing transforms data of different scales onto the same scale and is often used when indicators are processed for comparison and evaluation. It removes the units of the data, converting values into dimensionless pure numbers, so that indicators with different units or magnitudes can be compared and weighted.

Common methods for making data dimensionless include:

  • Standardization method
  • Extreme value method
  • Averaging method
  • Standard deviation method


1. Standardization method

Standardization subtracts the mean of a variable from each of its values and divides by the variable's standard deviation, so that after the transformation the variable has mean 0 and standard deviation 1. After applying this method, different variables share the same mean and standard deviation, which also removes the differences in their degrees of variation.
The standardization formula is: x' = (x − x̄) / S

2. Extreme value method

The extreme value method converts the original data into values within a specific range using the maximum and minimum of the variable, thereby removing the influence of dimension and order of magnitude. This method depends heavily on the two extreme values.

In general, there are 3 ways of extreme value method:

The first method divides each value of the variable by the variable's range (maximum minus minimum), so the spread of the normalized values of each variable is 1.
The formula is: x' = x / (max − min)

The second method divides the difference between each value and the minimum by the variable's range, so the normalized values fall in [0, 1].
The formula is: x' = (x − min) / (max − min)

The third method divides each value by the variable's maximum, so the maximum of the normalized variable is 1.
The formula is: x' = x / max

3. Averaging method

The averaging method divides each value of a variable directly by the variable's mean. Unlike standardization, the averaging method retains information about the differences in the degree of variation between variables.
The averaging formula is: x' = x / x̄

4. Standard deviation method

The standard deviation method is a variant of the standardization method: it divides the variable's values directly by the standard deviation, without first subtracting the mean. After this transformation the mean of the variable is the ratio of the original mean to the standard deviation, rather than 0.
The formula is: x' = x / S
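A minimal sketch of the four dimensionless transformations above, applied column by column to a small NumPy array (the data values are purely illustrative):

```python
import numpy as np

# Two variables with different units and magnitudes (values are illustrative).
X = np.array([[10.0, 200.0],
              [12.0, 180.0],
              [11.0, 240.0],
              [15.0, 220.0]])

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)
x_min, x_max = X.min(axis=0), X.max(axis=0)

standardized = (X - mean) / std               # standardization: mean 0, std 1
by_range     = (X - x_min) / (x_max - x_min)  # extreme value method (2nd form), in [0, 1]
by_mean      = X / mean                       # averaging method
by_std       = X / std                        # standard deviation method

print(standardized.round(3))
```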


Discretization of continuous variables

Some algorithms require continuous variables to be converted into discrete ones. In some cases, discretized variables simplify model computation and improve model stability. For example, logistic regression is often trained on discretized variables, which can speed up training and improve the interpretability of the model.

There are roughly two types of methods for discretizing continuous variables:

  • Chi-square test method
  • Information gain method


1. Chi-square test method

Usually the values of the variable are first sorted and each distinct value is treated as a separate group. The chi-square value is then computed for every pair of adjacent groups, and the pair with the smallest chi-square value is merged. This operation is repeated until a stopping condition we set is met, for example a minimum of 5 groups, i.e. the continuous variable is divided into 5 groups.

The chi-square statistic measures the difference between the observed distribution of the data and a chosen expected or assumed distribution. It is obtained by squaring the difference between the observed frequency (fo) and the theoretical (expected) frequency (fe), dividing by the theoretical frequency, and summing over all cells. Its calculation formula is: χ² = Σ (fo − fe)² / fe

The chi-square value contains two pieces of information:

  • The absolute magnitude of the deviation between the observed and theoretical values.
  • The relative magnitude of that deviation with respect to the theoretical value.
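A minimal sketch of this merging procedure, assuming a one-dimensional feature x and a class label y; the stopping condition is simply a minimum number of groups, as described above:

```python
import numpy as np

def chi2_of_pair(counts_a, counts_b):
    """Chi-square value for two adjacent groups of per-class counts."""
    observed = np.vstack([counts_a, counts_b]).astype(float)
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row * col / observed.sum()
    expected[expected == 0] = 1e-12            # guard against empty classes
    return ((observed - expected) ** 2 / expected).sum()

def chimerge(x, y, min_groups=5):
    classes = np.unique(y)
    values = np.unique(x)
    # Start with one group per distinct value, each holding per-class counts.
    groups = [(v, v, np.array([np.sum((x == v) & (y == c)) for c in classes]))
              for v in values]
    while len(groups) > min_groups:
        chis = [chi2_of_pair(groups[i][2], groups[i + 1][2])
                for i in range(len(groups) - 1)]
        i = int(np.argmin(chis))               # merge the most similar adjacent pair
        lo, _, ca = groups[i]
        _, hi, cb = groups[i + 1]
        groups[i:i + 2] = [(lo, hi, ca + cb)]
    return [(lo, hi) for lo, hi, _ in groups]  # value ranges of the final groups

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
print(chimerge(x, y, min_groups=3))
```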


2. Information Gain Method

The information gain method is a top-down splitting technique that uses information-theoretic calculations to determine split points.

First, each distinct value is treated as a candidate split point and the data is split into two parts; among all candidate splits, the one that yields the smallest information entropy is chosen. Then, within the two resulting intervals, the interval with the larger entropy is found and split again in the same way, and the process continues until a stopping condition is met, for example when the specified number of intervals is reached.

The information content of the data depends on the task. For a classification task, the amount of information carried by a label value y is: I(y) = −log2 p(y)

Among them, p(y) is the probability of occurrence of y. The smaller p(y), the greater the amount of information y contains. This is intuitive.

Entropy is defined as the expected value of information.
For a data set S that can be divided into m categories, its information entropy is the expected amount of information contained in a randomly drawn label: H(S) = −Σ pi · log2 pi, where pi is the proportion of samples in S belonging to category i.

The information entropy of a data set represents its degree of disorder: the larger the entropy, the more disordered the data set.

If S is divided in some way, for example according to the values of a certain attribute, into n subsets, each subset has its own information entropy. The difference between the entropy of the original S and the weighted sum of the subsets' entropies (weighted by subset size) is the information gain brought by this division.
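A minimal sketch of one entropy-based binary split, assuming a feature x and class labels y; the split point chosen is the one that maximizes the information gain defined above:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(x, y):
    """Pick the split point of x that maximizes the information gain on y."""
    base = entropy(y)
    best_point, best_gain = None, -np.inf
    for t in np.unique(x)[:-1]:                # candidate split points
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        gain = base - (w * entropy(left) + (1 - w) * entropy(right))
        if gain > best_gain:
            best_point, best_gain = t, gain
    return best_point, best_gain

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(best_split(x, y))    # splits at 4 with an information gain of 1.0 bit
```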

To be continued.....
