1. Feature Engineering
- Machine learning algorithms prefer well-defined, fixed-length inputs and outputs
- Feature engineering (FE) was the key to ML methods before deep learning (DL)
  - e.g. in a computer vision task, people would try various FE methods and then train an SVM model
- DL trains deep neural networks to extract features automatically, whereas many classical ML methods require manual FE
  - the extracted features are relevant to the task
2. Tabular data features
- int/float: use directly, or bin into n unique integer values
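Binning can be sketched in plain Python; the bin edges and ages below are made-up illustrative values:

```python
def bin_value(x, edges):
    """Return the index of the bin x falls into, given sorted bin edges."""
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

# Bin a raw age column into 4 integer categories: <18, 18-34, 35-59, >=60.
edges = [18, 35, 60]
ages = [12, 25, 47, 70]
binned = [bin_value(a, edges) for a in ages]
print(binned)  # [0, 1, 2, 3]
```

The continuous column becomes a small integer category, which downstream models can treat like any other discrete feature.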
- Categorical data: one-hot encoding
  - map rare categories to "unknown"
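A minimal sketch of one-hot encoding with a rare-category cutoff; the `min_count` threshold and sample categories are assumptions for illustration:

```python
from collections import Counter

def one_hot_encoder(values, min_count=2):
    """Build a one-hot encoder that maps rare categories to 'unknown'."""
    counts = Counter(values)
    mapped = [v if counts[v] >= min_count else "unknown" for v in values]
    vocab = sorted(set(mapped))
    index = {v: i for i, v in enumerate(vocab)}

    def encode(v):
        v = v if counts.get(v, 0) >= min_count else "unknown"
        vec = [0] * len(vocab)
        vec[index[v]] = 1
        return vec

    return vocab, encode

vocab, encode = one_hot_encoder(["cat", "cat", "dog", "dog", "axolotl"])
print(vocab)             # ['cat', 'dog', 'unknown']
print(encode("cat"))     # [1, 0, 0]
print(encode("ferret"))  # unseen at training time -> [0, 0, 1]
```

Routing both rare and unseen categories to "unknown" keeps the vocabulary small and lets the encoder handle values that never appeared in training.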
- Date-time: expand into a feature list such as
  - [year, month, day, day_of_year, week_of_year, …]
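The standard library can produce such a feature list directly; `day_of_week` is one plausible extra feature added here beyond the list above:

```python
from datetime import date

def date_features(d):
    """Expand a single date into a dict of integer features."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "week_of_year": d.isocalendar()[1],  # ISO week number
        "day_of_week": d.weekday(),          # Monday = 0
    }

print(date_features(date(2021, 7, 1)))
```

Splitting a timestamp this way lets a model pick up periodic patterns (weekly, yearly) that a raw epoch value hides.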
- Feature combination: Cartesian product of two feature groups
  - [cat, dog] * [male, female] -->
  - [(cat, male), (cat, female), (dog, male), (dog, female)]
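The cross above is exactly `itertools.product`:

```python
from itertools import product

# Cartesian product of two categorical feature groups.
species = ["cat", "dog"]
sex = ["male", "female"]
crossed = list(product(species, sex))
print(crossed)
# [('cat', 'male'), ('cat', 'female'), ('dog', 'male'), ('dog', 'female')]
```

Each combined pair can then be one-hot encoded as a single new categorical feature, letting linear models capture interactions between the two original columns.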
3. Text features
- Represent text as token features
- Bag of words (BoW) model
  - limitations: needs careful vocabulary design; misses word order and context
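A minimal BoW sketch with whitespace tokenization (real pipelines use more careful tokenizers and vocabulary pruning); note the count vectors discard word order, which is exactly the missing-context limitation:

```python
def bag_of_words(docs):
    """Count token occurrences per document over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok in doc.lower().split():
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the dog sat sat"])
print(vocab)    # ['cat', 'dog', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 2, 1]]
```

Every document becomes a fixed-length vector of token counts, matching ML's preference for fixed-size inputs.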
- Word embeddings (e.g. Word2vec)
  - vectorize words so that similar words are placed close together
  - trained by predicting a target word from its context words
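"Similar words close together" is usually measured by cosine similarity. The tiny 3-d vectors below are hand-made assumptions purely for illustration; real Word2vec vectors are learned and typically have 100+ dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy hand-made embeddings (illustrative only, not trained).
emb = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}
print(cosine(emb["cat"], emb["dog"]) > cosine(emb["cat"], emb["car"]))  # True
```

With learned embeddings, the same comparison lets downstream models generalize across related words instead of treating each token as an unrelated one-hot index.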
- Pre-trained language models (e.g. BERT, GPT-3): deep neural networks pre-trained to extract features
  - giant transformer models
  - trained on large amounts of unannotated data
  - fine-tuned for downstream tasks
4. Image/video features
- traditionally, image features were extracted by hand-crafted methods such as SIFT
- now it is common to use pre-trained deep neural networks
  - ResNet: trained on ImageNet (image classification)
  - I3D: trained on Kinetics (action classification)
5. Summary
- Features matter
- Features are either hand-crafted or learned by deep neural networks