Feature Engineering – Feature Engineering

Feature engineering is an important part of the machine learning workflow. It is the "translation" of raw data into a form understandable by the model.

The Importance of Feature Engineering

Simple models based on large amounts of data outperform complex models based on small amounts of data.

More data is better than smart algorithms , and good data is better than a lot of data.

Therefore, how to maximize data value based on given data is what feature engineering should do.

In a survey in 2016, it was found that 80% of the work of data scientists is in acquiring, cleaning and organizing data. Less than 20% of the time to construct a machine learning pipeline. Details are as follows:

Data scientists spend 80% of their time acquiring, cleaning and organizing data

Set training set: 3%
Cleaning and organizing data: 60%
Collect data sets: 19%
Mining data patterns: 9%
Adjustment Algorithm: 5%
Other: 4%

What is feature engineering

Let's first look at the position of feature engineering in the machine learning process:

The Place of Feature Engineering in the Machine Learning Pipeline

As can be seen from the figure above, feature engineering is between raw data and features. His task is the process of "translating" raw data into features.

Features: It is the numerical expression of the original data, and it is an expression that can be directly used by the machine learning algorithm model.

Feature engineering is a process that converts data into features that better represent business logic, thereby improving the performance of machine learning.

That may not be easy to understand. In fact, feature engineering is very similar to cooking:

We buy the ingredients, wash them, chop them, and start cooking them to our liking to create delicious meals.

In the example above:

Ingredients are like raw data

The process of cleaning, cutting vegetables, and cooking is like feature engineering

The final delicious food is the characteristic

Humans need to eat processed food, which is safer and tastier.

The machine algorithm model is also similar. The original data cannot be directly fed to the model, and the data needs to be cleaned, organized, and converted.

Finally, the features that the model can digest can be obtained.

In addition to converting raw data into features, there are two key points that are easily overlooked:

Focus 1: Better representation of business logic

Feature engineering can be said to be a mathematical expression of business logic.

We use machine learning to solve specific problems in business. There are many ways to convert the same raw data into features, and we need to choose those that can "better represent the business logic" to better solve the problem. rather than those simpler methods.

Focus 2: Improving Machine Learning Performance

Performance means shorter time and lower cost. Even the same model will have different performance due to different feature engineering. So we need to choose those feature engineering that can perform better.

4 Steps to Evaluate Feature Engineering Performance

The business evaluation of feature engineering is very important, but there are various methods, and different businesses have different evaluation methods.

Get a baseline performance of your machine learning model before applying any feature engineering
Apply one or more feature engineering
For each feature engineering, obtain a performance metric and compare it to the baseline performance
If the delta in performance is greater than a certain threshold, feature engineering is considered beneficial and applied in the machine learning pipeline

For example: the accuracy rate of the baseline performance is 40%, after applying some kind of feature engineering, the accuracy rate increases to 76%, then the change is 90%.

（76%-40%）/ 40%=90%

Summarize

Feature engineering is the most time-consuming work in the machine learning process, and it is also one of the most important work contents.

Definition of feature engineering: It is a process that converts data into features that can better represent business logic, thereby improving the performance of machine learning.

Two key points that are easily overlooked in feature engineering:

Better representation of business logic
Improve machine learning performance

The 4 steps of feature engineering performance evaluation:

Get a baseline performance of your machine learning model before applying any feature engineering
Apply one or more feature engineering
For each feature engineering, obtain a performance metric and compare it to the baseline performance
If the delta in performance is greater than a certain threshold, feature engineering is considered beneficial and applied in the machine learning pipeline