[Practical Machine Learning] 2.4 Feature Engineering

What is feature engineering

  • Feature engineering is the process of transforming raw data into features that better express the essence of the problem, so that applying these features to a prediction model improves its accuracy on unseen data.
  • Simply put, feature engineering is about discovering features that have a significant impact on the dependent variable y. The independent variables x are usually called features, and the goal of feature engineering is to find the important ones.
  • How can raw data be decomposed and aggregated to better express the essence of the problem? That is the purpose of feature engineering. As the saying goes, "feature engineering is manually designing what the input x's should be"; "you have to turn your inputs into things the algorithm can understand."
  • Feature engineering is the most time-consuming and important step in developing a data mining model.

Feature engineering was crucial to machine learning (ML) before the advent of deep learning (DL). Deep learning can extract features automatically (for example, CNNs extract image features), which makes feature extraction easy but consumes large amounts of data and compute. In traditional machine learning, features must be extracted manually, which is why feature engineering became its core work.


Feature engineering for common types of data

Tabular Data Features

  • Int/float: use the value directly, or bin it into n intervals. For example, to our eyes there is little difference between 1.00 million and 1.01 million (we do not need such fine granularity), but to a machine they are completely different values, so everything between 1.0 million and 1.1 million can be placed in the same bin.
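The binning idea above can be sketched in a few lines of Python; the bin edges below are made-up values for illustration, not from the lecture:

```python
def bin_value(x, edges):
    """Map a numeric value to a bin index given sorted bin edges.

    Values below the first edge get bin 0; values at or above the
    last edge get the final bin.
    """
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

# Prices between 1.0M and 1.1M all fall into the same bin (index 2),
# removing distinctions the model does not need.
edges = [500_000, 1_000_000, 1_100_000, 2_000_000]
print(bin_value(1_000_000, edges))  # 2
print(bin_value(1_010_000, edges))  # 2
```

In practice the same thing is usually done with a library call such as quantile-based bucketing, but the principle is just this lookup.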

  • Categorical data: one-hot encoding. For example, suppose we want to classify houses and there are 100 house types, but only the first 10 or so are common; the rest may be noise or simply rare. Those long-tail types can all be mapped to a single Unknown type.

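A minimal sketch of this long-tail handling, using a hypothetical helper that keeps only the most frequent categories and maps the rest to Unknown (names and data are illustrative):

```python
from collections import Counter

def one_hot_with_unknown(values, top_k):
    """One-hot encode categorical values, keeping only the top_k most
    frequent categories and mapping everything else to 'Unknown'."""
    common = [cat for cat, _ in Counter(values).most_common(top_k)]
    vocab = common + ["Unknown"]
    encoded = []
    for v in values:
        cat = v if v in common else "Unknown"
        encoded.append([1 if cat == c else 0 for c in vocab])
    return vocab, encoded

houses = ["apartment", "apartment", "villa", "apartment", "castle"]
vocab, rows = one_hot_with_unknown(houses, top_k=2)
print(vocab)     # ['apartment', 'villa', 'Unknown']
print(rows[4])   # 'castle' is rare, so it becomes Unknown: [0, 0, 1]
```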

  • Time data (Date-time): a timestamp can be expanded into a vector of features such as those below. For instance, day_of_year indicates the day of the year and week_of_year the week of the year. Because people behave differently at different times of the year, these can be important features.

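The expansion above can be done with the Python standard library alone; the exact feature names are illustrative:

```python
from datetime import datetime

def datetime_features(ts: datetime) -> dict:
    """Expand a timestamp into the kind of feature vector described above."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "day_of_week": ts.weekday(),           # 0 = Monday
        "day_of_year": ts.timetuple().tm_yday,
        "week_of_year": ts.isocalendar()[1],   # ISO week number
        "hour": ts.hour,
    }

feats = datetime_features(datetime(2021, 12, 25, 14, 30))
print(feats["day_of_year"], feats["week_of_year"])
```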

  • Feature combination: we can also cross different features. Crossing [cat, dog] with [male, female] generates 4 new categories, which are then one-hot encoded. We tend to cross features that we believe are correlated.


    • (cat, male) ⇒ [1, 0, 0, 0]
    • (cat, female) ⇒ [0, 1, 0, 0]
    • (dog, male) ⇒ [0, 0, 1, 0]
    • (dog, female) ⇒ [0, 0, 0, 1]
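The cross above can be generated mechanically with a Cartesian product; this sketch reproduces the (animal, sex) encoding from the list:

```python
from itertools import product

def cross_one_hot(values_a, cats_a, values_b, cats_b):
    """One-hot encode the Cartesian product of two categorical features."""
    # combos = [(cat, male), (cat, female), (dog, male), (dog, female)]
    combos = list(product(cats_a, cats_b))
    return [[1 if (a, b) == c else 0 for c in combos]
            for a, b in zip(values_a, values_b)]

animals = ["cat", "dog"]
sexes = ["male", "female"]
print(cross_one_hot(["cat"], animals, ["female"], sexes))  # [[0, 1, 0, 0]]
```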

Text Feature

  • Represent text with token features
    • Bag of Words (BoW) model: represent each word with a one-hot encoding, then sum the encodings. The limitations are that the dictionary must be built carefully (neither too large nor too small), and word-order information is lost.

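A minimal BoW sketch over a fixed vocabulary; note that out-of-vocabulary words are simply dropped and word order is lost, exactly the limitations described above:

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count occurrences of each vocabulary word in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "cat", "sat", "dog"]
print(bag_of_words("The cat sat and the dog sat", vocab))  # [2, 1, 2, 1]
```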

    • Word Embeddings: for example Word2vec. Words are embedded into a vector space such that similar words are closer together in that space.

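"Closer together" is usually measured with cosine similarity. The vectors below are made up for illustration, not real Word2vec output:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors:
    close to 1 for words used in similar contexts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.3]
car = [0.1, 0.9, 0.7]
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```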

    • Pre-trained general-purpose language models: for example BERT or GPT-3. These rely on large transformer models trained with self-supervised learning on unlabeled data, which lets them capture richer contextual and sequential information. They outperform the previous methods, but training them is expensive.


Image/Video Features

  • Traditionally, features were extracted manually, for example with SIFT.
  • Nowadays, feature extraction almost always relies on pre-trained deep neural networks.


Summary

  • Features are a very important part of machine learning.
  • Extracting features with deep learning is now mainstream. However, applying deep learning to tabular data is still somewhat difficult, partly because there are no large public tabular datasets.

References

2.4 Feature Engineering, from Stanford 2021 Fall: Practical Machine Learning (Chinese version), bilibili

Origin blog.csdn.net/weixin_46421722/article/details/129508636