A Brief Introduction to Twelve Feature Engineering Techniques

http://blog.itpub.net/29829936/viewspace-2648602/

The contents of this article are as follows:

  • I. Introduction
  • II. Errors and Missing Values
  • III. Types of Features
  • IV. Feature Engineering Techniques
  • 4.1 Binning
  • 4.2 One-Hot Encoding
  • 4.3 Feature Hashing (Hashing Trick)
  • 4.4 Embedding
  • 4.5 Log Transformation
  • 4.6 Feature Scaling
  • 4.7 Normalization
  • 4.8 Feature Interaction
  • V. Processing Time Features
  • 5.1 Binning
  • 5.2 Trendlines
  • 5.3 Closeness to Major Events
  • 5.4 Time Difference


I. Introduction

Feature engineering in machine learning is the process of transforming raw input data into features that better represent the underlying problem, thereby helping to improve the accuracy of the predictive model.

Finding the right features is a difficult and time-consuming task that requires expert knowledge, and applied machine learning can largely be understood as feature engineering. Feature engineering has a great impact on how well a machine learning model works; as the saying goes, "data and features determine the upper limit of a machine learning model's performance."


II. Errors and Missing Values

Before feature engineering, missing and erroneous data need to be dealt with. Erroneous data can be corrected; some errors are format errors, for example mixed date formats such as "2018-09-19" and "20180920" that should be unified into one format.

Common ways to handle missing data (see the sketch after this list):

  1. Remove the row/column
  2. Fill with the mean
  3. Fill with the median
  4. Fill with the mode
  5. Use an algorithm to predict the missing value
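
A minimal sketch of options 1-5, assuming a small pandas DataFrame with made-up column names; scikit-learn's KNN imputer stands in for "use an algorithm to predict":

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 48.0, np.nan],
                   "city": ["SH", "BJ", None, "SH", "BJ"]})

df.dropna()                              # 1. remove rows with missing values
df["age"].fillna(df["age"].mean())       # 2. fill with the mean
df["age"].fillna(df["age"].median())     # 3. fill with the median
df["city"].fillna(df["city"].mode()[0])  # 4. fill with the mode
# 5. predict the missing value with an algorithm (here: k-nearest neighbors)
age_imputed = KNNImputer(n_neighbors=2).fit_transform(df[["age"]])
```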

III. Types of Features

The input features of machine learning include several types:

Numerical features: integers, floating-point numbers, and so on; the values may have a meaningful order.

Categorical features: IDs, gender, and the like.

Time features: time series such as month, year, quarter, day, hour, and so on.

Spatial features: latitude and longitude, which can be converted into zip codes, cities, and so on.

Text features: documents, natural language, sentences, and so on; their processing is not covered here.

IV. Feature Engineering Techniques

4.1 Binning

Binning is a data pre-processing technique used to reduce the effect of minor observation errors. Raw values that fall into a given small interval (a bin) are replaced by a value representative of that interval, usually its center. It is a form of quantization. Statistical binning is a way of grouping a number of more-or-less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might want to group their ages into a smaller number of intervals. Time data can also be binned: the 24 hours of a day might be divided into late night [0, 5), early morning [5, 8), morning [8, 11), noon [11, 14), afternoon [14, 19), evening [19, 22), and night [22, 24). Because, say, 11:00 and 12:00 are not actually very different, binning can smooth away these "errors".
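
A minimal sketch of binning hours of the day with pandas, using bin edges reconstructed from the intervals above (the labels are illustrative):

```python
import pandas as pd

hours = pd.Series([3, 6, 9, 12, 15, 20, 23])

# right=False makes each interval half-open, [a, b), as in the text above.
bins = [0, 5, 8, 11, 14, 19, 22, 24]
labels = ["late night", "early morning", "morning", "noon",
          "afternoon", "evening", "night"]
print(pd.cut(hours, bins=bins, labels=labels, right=False).tolist())
# ['late night', 'early morning', 'morning', 'noon', 'afternoon',
#  'evening', 'night']
```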

4.2 One-Hot Encoding

One-hot encoding is a data pre-processing technique that turns categorical data into feature vectors of equal length. For example, gender takes the values male and female, and each record is either male or female, so we can create a 2-dimensional feature: a male record is represented as (1, 0) and a female record as (0, 1). In general, create a vector whose dimension equals the total number of categories, set the dimension corresponding to a record's category to 1, and set all other dimensions to 0. One-hot encoding is suitable for categorical variables with few categories.
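
A minimal sketch with pandas (scikit-learn's OneHotEncoder would work equally well):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "male"]})
print(pd.get_dummies(df["gender"], dtype=int))
#    female  male
# 0       0     1
# 1       1     0
# 2       0     1
```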

4.3 Feature Hashing (Hashing Trick)

For categorical variables with many categories, feature hashing (the hashing trick) can be used. Feature hashing converts a data point into a vector: a hash function maps the raw values to indices within a fixed range. Compared with one-hot encoding it has several advantages, such as support for online learning and a greatly reduced dimensionality. For details, see the literature on feature hashing (Feature Hashing).
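
A minimal sketch with scikit-learn's FeatureHasher; the user IDs and the number of output dimensions are made up for illustration:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash a high-cardinality categorical feature (e.g. user IDs) into 8 dimensions.
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["user_12345"], ["user_67890"]])
print(X.toarray())  # two 8-dimensional vectors, however many distinct IDs exist
```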

4.4 Embedding

Embedding uses a neural network to convert the raw input data into new features: your high-dimensional feature space is projected onto a lower-dimensional one learned from the actual task, so that features that are more or less similar end up a small distance apart in the embedding space. This allows a classifier to learn representations in a more comprehensive way. For example, word embedding maps each word to a vector of hundreds or even thousands of dimensions; when used for document classification, semantically similar words are mapped to vectors that lie close to each other, which in turn helps further machine learning applications. This works much better than one-hot encoding.
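
A minimal sketch of an embedding lookup table using PyTorch's nn.Embedding; in practice the table is trained end-to-end on the task, while here it is left at its random initialization and the word indices are made up:

```python
import torch
import torch.nn as nn

# A vocabulary of 10,000 words, each mapped to a 300-dimensional vector.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=300)

word_ids = torch.tensor([42, 7, 42])  # hypothetical word indices
vectors = embedding(word_ids)         # shape: (3, 300)
print(vectors.shape)
```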

4.5 Log Transformation

Log transformation means taking the logarithm of a value; it compresses a wide range of values into a smaller interval. It has a strong influence on the shape of a distribution and is usually used to reduce right skewness, so that the final distribution is more symmetric. It cannot be applied to zero or negative values. On a logarithmic scale, each unit step corresponds to multiplication by the base. In some machine learning models, log-transforming certain features can even turn multiplicative relations into additive ones, making the model simpler, but that is beyond the scope of this article.

As described above, log transformation compresses a wide range of values into a narrower one, which is effective for handling some outliers. For example, the number of pages a user views follows a long-tailed distribution: users who view 500 or 1,000 pages in a short time may both be outliers, and the difference between their behavior may not be that big; log transformation reflects this as well.
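
A minimal sketch with NumPy; log1p computes log(1 + x), which also handles zero counts safely:

```python
import numpy as np

page_views = np.array([0, 3, 10, 50, 500, 1000])
print(np.log1p(page_views))
# ~ [0, 1.39, 2.40, 3.93, 6.22, 6.91] -> 500 and 1000 end up close together
```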


4.6 Feature Scaling

Feature scaling is a method of standardizing the range of independent variables or features of the data. In data processing it is also known as data normalization, and it is typically performed as a data pre-processing step. Feature scaling can map data with a wide range into a specified interval. Because the ranges of raw values vary widely, the objective functions of some machine learning algorithms do not work properly without normalization. For example, most classifiers compute the distance between two points by the Euclidean distance; if one feature has a wide range of values, the distance will be dominated by that particular feature. The ranges of all features should therefore be normalized so that each feature contributes approximately proportionately to the final distance.

Another reason for feature scaling is that gradient descent converges much faster with it than without it. Feature scaling comes in two common types (see the sketch after this list):

  • Min-max scaling
  • Standard (Z) scaling
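
A minimal sketch of both with scikit-learn, on made-up one-column data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

print(MinMaxScaler().fit_transform(X))    # min-max: maps values into [0, 1]
print(StandardScaler().fit_transform(X))  # standard (Z): zero mean, unit variance
```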

4.7 Normalization

In the simplest case, normalization means adjusting values measured on different scales to a notionally common scale. In more complex cases, normalization may refer to more sophisticated adjustments whose purpose is to bring the entire probability distributions of the adjusted values into alignment. Often the goal is alignment with a normal distribution.

In another statistical usage, normalization converts values measured on different scales into a range in which they can be compared with each other, removing the influence of gross magnitude. Normalized data also matters for some optimization algorithms such as gradient descent.
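
One way to align a feature's distribution with the normal distribution, as described above, is a quantile transform; a minimal sketch with scikit-learn on synthetic skewed data (the choice of technique is my own illustration, not prescribed by the article):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 1))  # strongly right-skewed data

# Map the empirical distribution onto a standard normal distribution.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
X_normal = qt.fit_transform(X)
print(X_normal.mean(), X_normal.std())  # roughly 0 and 1
```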

4.8 Feature Interaction

Adding interaction terms to a regression model is a very common practice. It can greatly expand the model's ability to explain the relationships between the dependent variable and the predictors. For details, see introductions to interaction terms in regression models (Interactions in Regression).
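
A minimal sketch that adds pairwise interaction terms with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# interaction_only=True adds only cross terms (x1 * x2), not squares.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(poly.fit_transform(X))  # columns: x1, x2, x1*x2 -> [[2, 3, 6], [4, 5, 20]]
```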

V. Processing Time Features

Time features almost always need processing; a time series is a sequence of features with a meaningful order. A few simple approaches are listed here.

5.1 Binning

This is the most commonly used method. Sometimes the difference between 11:00 and 12:00 is not meaningful, and the binning method described above can be applied.

5.2 Trendlines

Encode trends rather than totals wherever possible: for example, spending in the last week, the last month, and the last year rather than total spending. Two customers with the same total spending may have very different consumer behavior.
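
A minimal sketch of trailing-window spending features with pandas; the dates and amounts are made up:

```python
import pandas as pd

spend = pd.DataFrame({
    "date": pd.to_datetime(["2019-05-01", "2019-06-10",
                            "2019-06-20", "2019-06-25"]),
    "amount": [100.0, 40.0, 60.0, 30.0],
}).set_index("date")

# Spending over trailing windows, instead of a single lifetime total.
daily = spend["amount"].resample("D").sum()
last_week = daily.rolling("7D").sum()
last_month = daily.rolling("30D").sum()
print(last_week.iloc[-1], last_month.iloc[-1])  # 90.0 130.0
```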

5.3 Closeness to Major Events

For example, the few days before a holiday, or the first Saturday of each month. Values near such important time points may be more meaningful.
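
A minimal sketch computing the signed distance in days to one major event; the dates are made up:

```python
import pandas as pd

dates = pd.to_datetime(["2018-12-20", "2018-12-24", "2019-01-02"])
holiday = pd.Timestamp("2018-12-25")

# Signed distance (in days) to the event; small values mean "close to it".
days_to_holiday = (holiday - dates).days
print(list(days_to_holiday))  # [5, 1, -8]
```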

5.4 Time Difference

The interval between a user's previous interaction and the current one, i.e., the time difference, is also very meaningful.
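
A minimal sketch of per-user time differences with pandas; the users and timestamps are made up:

```python
import pandas as pd

events = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime(["2019-06-01 10:00", "2019-06-01 10:05",
                          "2019-06-03 09:00", "2019-06-02 08:00",
                          "2019-06-02 08:30"]),
})

# Seconds since each user's previous interaction (NaN for the first event).
events["gap_seconds"] = events.groupby("user")["ts"].diff().dt.total_seconds()
print(events)
```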


Origin: blog.csdn.net/weixin_42137700/article/details/93520364