1. Feature Engineering
- Machine learning algorithms prefer well-defined, fixed-length inputs and outputs
- Feature engineering (FE) was the key to ML methods before deep learning (DL)
  - e.g., in a computer vision task, people would try various FE methods and then train an SVM model
- DL trains deep neural networks to extract features automatically, whereas many classical ML methods rely on manual FE
  - the learned features are relevant to the task
2. Tabular Data Features
- int/float: use directly, or bin into n unique integer values
- categorical data: one-hot encoding
  - map rare categories to "unknown"
- Date-time: expand into a feature list such as
  - [year, month, day, day_of_year, week_of_year, …]
- Feature combination: Cartesian product of two feature groups
  - [cat, dog] × [male, female] →
  - [(cat, male), (cat, female), (dog, male), (dog, female)]
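The four tabular transformations above can be sketched in plain Python (the function names and bin edges below are illustrative, not from the original notes):

```python
from itertools import product
from datetime import datetime

def bin_value(x, edges):
    """Binning: map a numeric value to the index of its bucket (edges are hypothetical)."""
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

def one_hot(value, vocab):
    """One-hot encoding; rare/unseen categories are mapped to an extra "unknown" slot."""
    v = value if value in vocab else "unknown"
    return [1 if v == c else 0 for c in vocab + ["unknown"]]

def datetime_features(ts):
    """Expand an ISO date string into [year, month, day, day_of_year, week_of_year]."""
    d = datetime.fromisoformat(ts)
    return [d.year, d.month, d.day, d.timetuple().tm_yday, d.isocalendar()[1]]

def cross(a, b):
    """Feature combination: Cartesian product of two feature groups."""
    return list(product(a, b))
```

For example, `cross(["cat", "dog"], ["male", "female"])` yields the four pairs listed above, and `one_hot("fish", ["cat", "dog"])` falls into the "unknown" slot.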
3. Text Features
- Represent text as token features
- Bag-of-words (BoW) model
  - limitations: needs careful vocabulary design; loses word order and context
- Word embeddings (e.g., Word2vec)
  - vectorize words so that similar words are placed close together
  - trained by predicting the target word from its context words
- Pre-trained language models (e.g., BERT, GPT-3): deep neural networks used as feature extractors
  - giant transformer models
  - trained on large amounts of unannotated data
  - fine-tuned for downstream tasks
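A minimal bag-of-words sketch makes both limitations concrete: the vocabulary must be chosen up front, and counting tokens discards word order (names below are illustrative):

```python
from collections import Counter

def build_vocab(corpus, min_count=1):
    """Vocabulary design: collect tokens, dropping those rarer than min_count."""
    counts = Counter(tok for doc in corpus for tok in doc.lower().split())
    return sorted(t for t, c in counts.items() if c >= min_count)

def bow_vector(doc, vocab):
    """Count each vocabulary token; word order (context) is lost."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in vocab]

corpus = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(corpus)          # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
vec = bow_vector("the cat sat on the cat", vocab)  # [2, 0, 0, 1, 1, 2]
```

Note that "the cat sat on the cat" and "the cat on the cat sat" map to the same vector, which is exactly the missing-context limitation.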
4. Image/Video Features
- traditionally, image features were hand-crafted, e.g., SIFT
- now commonly extracted with pre-trained deep neural networks
  - ResNet: trained on ImageNet (image classification)
  - I3D: trained on Kinetics (action classification)
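To illustrate what "hand-crafted" means, here is a toy descriptor: a normalized intensity histogram over a grayscale image. It is far simpler than SIFT (which bins gradient orientations), but shows the idea of computing fixed statistics from pixels rather than learning them:

```python
def intensity_histogram(image, n_bins=4):
    """Toy hand-crafted feature: normalized histogram of pixel intensities (0-255).

    `image` is a nested list of grayscale pixel values; real pipelines
    used richer descriptors such as SIFT, not this simplification.
    """
    hist = [0] * n_bins
    pixels = [p for row in image for p in row]
    for p in pixels:
        hist[min(p * n_bins // 256, n_bins - 1)] += 1
    return [h / len(pixels) for h in hist]

img = [[0, 64], [128, 255]]          # a 2x2 grayscale "image"
feat = intensity_histogram(img)      # one pixel per bin -> [0.25, 0.25, 0.25, 0.25]
```

A pre-trained network like ResNet replaces such fixed statistics with features learned from ImageNet, typically taken from the layer before the classification head.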
5. Summary
- Features matter
- Features are either hand-crafted or learned by deep neural networks