Practical Machine Learning Notes (6): Feature Engineering

1. Feature Engineering

  • Machine learning algorithms prefer well-defined, fixed-length inputs and outputs

  • Feature engineering (FE) was the key to ML methods before deep learning (DL)

    • in a computer vision task, people tried various FE methods and then trained an SVM model
  • DL trains deep neural networks to extract features automatically, whereas many classical ML methods need FE to produce them

    • the learned features are relevant to the task

2. Tabular data features

  • int/float: use directly, or bin into n unique int values

  • categorical data: one-hot encoding

    • map rare categories into “unknown”
  • Date-time: a feature list such as

    • [year, month, day, day_of_year, week_of_year, …]
  • Feature combination: Cartesian product of two feature groups

    • [cat, dog] x [male, female] -->
    • [(cat, male), (cat, female), (dog, male), (dog, female)]
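
The four tabular transformations listed above can be sketched with the Python standard library alone. This is only an illustrative sketch, not how any particular library implements them: the function names (`bin_value`, `one_hot_vocab`, `one_hot`, `date_features`, `cross`), the equal-width binning rule, and the `min_count` threshold for "rare" categories are all assumptions made for this example.

```python
from collections import Counter
from datetime import date
from itertools import product


def bin_value(x, lo, hi, n_bins):
    """Map a number to an equal-width bin index in 0..n_bins-1 (assumed scheme)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    idx = int((x - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp values outside [lo, hi)


def one_hot_vocab(values, min_count):
    """List the categories seen at least min_count times, plus a shared 'unknown'."""
    counts = Counter(values)
    frequent = sorted(c for c, n in counts.items() if n >= min_count)
    return frequent + ["unknown"]


def one_hot(value, vocab):
    """Encode one value as a 0/1 list; rare or unseen values map to 'unknown'."""
    key = value if value in vocab else "unknown"
    return [1 if key == c else 0 for c in vocab]


def date_features(d):
    """Expand a date into [year, month, day, day_of_year, week_of_year (ISO)]."""
    return [d.year, d.month, d.day, d.timetuple().tm_yday, d.isocalendar()[1]]


def cross(group_a, group_b):
    """Feature combination: Cartesian product of two feature groups."""
    return list(product(group_a, group_b))


if __name__ == "__main__":
    print(bin_value(25, 0, 100, 4))
    vocab = one_hot_vocab(["cat", "cat", "dog", "dog", "emu"], min_count=2)
    print(vocab, one_hot("emu", vocab))
    print(date_features(date(2022, 3, 13)))
    print(cross(["cat", "dog"], ["male", "female"]))
```

In practice a library such as pandas or scikit-learn would usually handle these steps; the point here is only to show that each bullet turns a raw column into fixed-length numeric input.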

3. Text features

  • Represent text as token features

    • Bag of words (BoW) model

      • limitations: needs careful vocabulary design, loses context
    • Word embeddings (e.g. Word2vec)

      • vectorize words such that similar words are placed close together
      • trained by predicting a target word from its context words
  • Pre-trained language models (e.g. BERT, GPT-3): use pre-trained deep neural networks to extract features

    • giant transformer models
    • trained with large amounts of unannotated data
    • fine-tuned for downstream tasks
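
To make the bag-of-words bullet and its limitations concrete, here is a minimal sketch in plain Python. The whitespace tokenizer and the helper names (`tokenize`, `build_vocab`, `bow_vector`) are assumptions for illustration; real pipelines typically use a more careful tokenizer and vocabulary.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def build_vocab(docs):
    """Collect the sorted set of tokens seen across all documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bow_vector(text, vocab):
    """Count how often each vocabulary token occurs; other tokens are dropped."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]


if __name__ == "__main__":
    docs = ["dog bites man", "man bites dog"]
    vocab = build_vocab(docs)
    for doc in docs:
        print(doc, bow_vector(doc, vocab))
    # A token outside the vocabulary ("cat") is silently ignored.
    print(bow_vector("cat bites dog", vocab))
```

Both example sentences produce the same vector even though their meanings differ, which is the "loses context" limitation, and the dropped token shows why the vocabulary needs careful design.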

4. Image/video features

  • traditionally, image features were extracted with hand-crafted methods such as SIFT
  • now commonly use pre-trained deep neural networks
    • ResNet: trained on ImageNet (image classification)
    • I3D: trained on Kinetics (action classification)

5. Summary

  • Features matter
  • Features are either hand-crafted or learned by (pre-trained) deep neural networks


Reposted from blog.csdn.net/jerry_liufeng/article/details/123455027