Hundred-Face Machine Learning, Chapter 1: Feature Engineering (self-study notes)

Feature engineering: a series of engineering steps performed on raw data to refine it into features that serve as input to algorithms and models. In essence, feature engineering is a process of representing data. In practice, it aims to remove impurities and redundancy from the raw data and to design more effective features that characterize the relationship between the problem being solved and the prediction model.

For machine learning problems, data and features often determine the upper limit of the results, while the choice and tuning of algorithms and models only approach that upper limit. The framework of feature engineering: (figure omitted)

Commonly used data types:

(1) Structured data: can be regarded as a table in a relational database. Each column has a clear definition and holds one of two basic data types, numerical or categorical; each row represents one sample.

(2) Unstructured data: mainly text, images, audio, and video. The information it contains cannot be represented by a single numerical value, it has no clear categorical definition, and each piece of data varies in size.

Question 1: Why do we normalize numerical data?

To eliminate the influence of different scales (dimensions) between features and make different indicators comparable, the data is normalized. Normalizing numerical features brings all features into roughly the same numerical interval. Common methods:

(1) Linear function normalization (Min-Max Scaling): linearly transforms the data so that the result maps into the range [0, 1], scaling the data proportionally: X_norm = (X - X_min) / (X_max - X_min), where X_min and X_max are the minimum and maximum values of the feature.

(2) Z-score Normalization: maps the data onto a distribution with mean 0 and standard deviation 1. Assuming the original feature has mean μ and standard deviation σ, the formula is z = (x - μ) / σ.
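Both methods are one line each with NumPy; a minimal sketch on made-up values:

```python
import numpy as np

x = np.array([1.0, 5.0, 3.0, 9.0])        # toy feature values

# (1) Min-Max scaling: map linearly into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)                           # [0.   0.5  0.25 1.  ]

# (2) Z-score normalization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()
print(x_zscore.mean(), x_zscore.std())    # ~0.0 and 1.0
```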

Example: with gradient descent, normalized data converges to the optimal solution faster. Intuitively, when features sit on very different scales, the contours of the loss function are elongated ellipses and the updates zig-zag; after normalization the contours are closer to circles and the updates point more directly at the minimum.

In practical applications, models solved by gradient descent usually require normalized inputs, including linear regression, logistic regression, support vector machines, and neural network models. It does not apply to decision tree models: normalization does not change the information gain of the samples on a feature x, so the tree's splits are unaffected.
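A toy demonstration of the convergence claim (all numbers, the learning rates, and the stopping rule below are made up for illustration): the same least-squares problem is solved by gradient descent on raw features and on z-score-normalized features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on wildly different scales: roughly [0, 1] vs [0, 1000]
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = X @ np.array([2.0, 0.003]) + rng.normal(0, 0.1, 200)

def gd_steps(X, y, lr, tol=1e-6, max_iter=100_000):
    """Gradient descent on mean squared error; returns the number of steps
    until the gradient norm drops below tol (or max_iter if it never does)."""
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column
print(gd_steps(X, y, lr=1e-7))       # raw scales force a tiny lr; stalls at the cap
print(gd_steps(X_norm, y, lr=0.1))   # converges in well under a thousand steps
```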

Question 2: How should categorical features be handled during data processing?

Categorical features take values from a limited set of options, such as gender (male, female) or blood type (A, B, AB, O). Their raw input is usually a string. Apart from a few models such as decision trees that can work with string inputs directly, most models, such as logistic regression, require categorical features to be converted into numerical ones.

Processing methods: Ordinal Encoding, One-hot Encoding, Binary Encoding.

Ordinal encoding (serial number encoding): used for categories that have an ordering relationship. For example, grades can be divided into low, medium, and high, which carry an ordering; they can be encoded as high = 3, medium = 2, low = 1.
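A minimal ordinal-encoding sketch in plain Python (the 1/2/3 mapping follows the example above; the sample list is made up):

```python
# Mapping chosen to preserve the ordering low < medium < high
grade_to_id = {"low": 1, "medium": 2, "high": 3}

grades = ["high", "low", "medium", "high"]   # toy samples
encoded = [grade_to_id[g] for g in grades]
print(encoded)                               # [3, 1, 2, 3]
```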

One-hot encoding: used for features whose categories have no ordering relationship. If blood type has four values, it can be converted into a 4-dimensional sparse vector, for example:

blood type   A   B   AB   O
A            1   0   0    0
B            0   1   0    0
AB           0   0   1    0
O            0   0   0    1
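A minimal one-hot sketch with scikit-learn's OneHotEncoder (pandas.get_dummies would also work); the sample column below is made up:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

blood_types = np.array([["A"], ["B"], ["AB"], ["O"], ["A"]])  # toy samples

encoder = OneHotEncoder()                 # outputs a SciPy sparse matrix by default
onehot = encoder.fit_transform(blood_types)

print(encoder.categories_)                # learned categories: ['A', 'AB', 'B', 'O']
print(onehot.toarray())                   # dense view of the 4-dimensional vectors
```

Note that the encoder keeps its output sparse by default, which is exactly the space-saving representation discussed in point (1) below.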

When the categories take many distinct values, pay attention to the following when using one-hot encoding:

(1) Use sparse vectors to save space. In one-hot encoding, only one dimension of the feature vector takes the value 1 and all other positions are 0, so a sparse representation of the vector can save space effectively.

(2) Combine with feature selection to reduce dimensionality (see the sketch after this list). High-dimensional features bring the following problems: first, in KNN it is hard to measure the distance between two points effectively in a high-dimensional space; second, in a logistic regression model the number of parameters grows with the dimension, which easily causes overfitting; third, usually only some of the dimensions are helpful for classification or prediction, so one-hot encoding can be combined with feature selection to reduce the dimensionality. (Note to self: I am not sure what "combining with feature selection" means concretely. Under this encoding only part of the features are kept; how are those features chosen, and with what method? Will it cause a loss of information?)
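One common reading of "combining with feature selection" (my interpretation, not from the book) is filter-style selection: score every one-hot dimension against the label and keep only the top-scoring ones. A minimal sketch with scikit-learn's SelectKBest and the chi-squared test, on a made-up design matrix:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy one-hot matrix (6 samples x 4 dimensions) and labels; all values
# are made up purely for illustration.
X = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

selector = SelectKBest(chi2, k=2)      # keep the 2 highest-scoring dimensions
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())          # boolean mask of the kept dimensions
```

Dropping dimensions does discard some information; the bet behind feature selection is that the discarded dimensions carry little signal about the label, so the loss is small compared with the gain in robustness.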

Binary encoding: done in two steps: first use ordinal encoding to assign a category ID to each category, then take the binary representation of that ID as the result. For example:

blood type   Category ID   Binary representation
A            1             001
B            2             010
AB           3             011
O            4             100

Binary encoding essentially performs a binary hash mapping of the category ID, producing a 0/1 feature vector whose dimension is lower than that of one-hot encoding, which saves storage space.
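A minimal sketch of the two steps in plain Python (the category list and IDs follow the table above; the bit-width computation is my assumption):

```python
import math

categories = ["A", "B", "AB", "O"]
n_bits = math.ceil(math.log2(len(categories) + 1))   # IDs start at 1, so 4 categories -> 3 bits

def binary_encode(value):
    cat_id = categories.index(value) + 1              # step 1: ordinal category ID
    return [int(b) for b in format(cat_id, f"0{n_bits}b")]  # step 2: binary digits

for c in categories:
    print(c, binary_encode(c))   # A [0,0,1]  B [0,1,0]  AB [0,1,1]  O [1,0,0]
```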

Other encoding methods: Helmert Contrast, Sum Contrast, Polynomial Contrast, Backward Difference Contrast.

Origin: blog.csdn.net/qq_28409193/article/details/88057629