Handling different data types
Numerical
- Statistics: max, min, mean, standard deviation (or variance)
- Discretization: for example, prices are divided into segments (of equal or unequal width); each segment is represented by one vector, so different prices may map to the same vector.
- Hash bucketing
- Histogram (distribution) of the corresponding variable's statistics within each category
- Numerical => categorical
- Scaling (range adjustment)/normalization
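As a minimal sketch of the discretization and scaling items above, the following plain-Python snippet bins prices into equal-width segments, one-hot encodes the bin index, and applies min-max and z-score scaling. The function names, the price range 0 to 100, and the sample prices are illustrative assumptions, not part of the original notes.

```python
import statistics

def equal_width_bin(value, low, high, n_bins):
    """Return the index of the equal-width bin that contains value."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    width = (high - low) / n_bins
    return int((value - low) / width)

def one_hot(index, size):
    """Represent a bin index as a vector with a single 1."""
    return [1 if i == index else 0 for i in range(size)]

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

# Hypothetical prices, binned into 4 segments over an assumed range of 0 to 100.
prices = [5.0, 12.0, 18.0, 47.0, 99.0]
bins = [equal_width_bin(p, 0.0, 100.0, 4) for p in prices]
vectors = [one_hot(b, 4) for b in bins]  # different prices can share a vector
scaled = min_max_scale(prices)
standardized = z_score(prices)
```

Unequal-width segments could be obtained in the same way by replacing the fixed width with a list of bin edges, for example quantiles of the training data.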
Categorical
- One-hot encoding/dummy variables: for example, red, yellow, and blue each correspond to a vector; every value of the variable gets its own one-hot vector.
- Hashing and clustering
- Tip: for each value of a categorical variable, compute the proportion of each target class and use these proportions as numerical features.
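The one-hot encoding and the target-proportion tip can be sketched as follows. The helper names and the toy colors/clicked data are assumptions made for illustration only.

```python
from collections import Counter

def fit_one_hot(values):
    """Assign each distinct category a position in the one-hot vector."""
    return {c: i for i, c in enumerate(sorted(set(values)))}

def one_hot(value, index):
    """Encode one category; unseen categories become an all-zero vector."""
    vec = [0] * len(index)
    if value in index:
        vec[index[value]] = 1
    return vec

def target_rate(categories, targets, positive=1):
    """For each category, the share of rows whose target equals `positive`."""
    total = Counter(categories)
    hits = Counter(c for c, t in zip(categories, targets) if t == positive)
    return {c: hits[c] / total[c] for c in total}

# Hypothetical data: item color and whether the item was clicked.
colors = ["red", "yellow", "blue", "red", "blue", "red"]
clicked = [1, 0, 0, 1, 1, 0]

index = fit_one_hot(colors)                    # blue -> 0, red -> 1, yellow -> 2
encoded = [one_hot(c, index) for c in colors]  # one vector per row
rates = target_rate(colors, clicked)           # categorical -> numerical
```

In practice the proportions would be computed on training data only, so that the target does not leak into the evaluation data.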
Time
Time can be treated as either a continuous or a discrete value.
1. Continuous values
- Duration (time spent viewing a single page)
- Interval (time since the last purchase/click)
2. Discrete values
- Time of day
- Day of the week
- Week of the year
- Quarter of the year
- Weekday/weekend
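A minimal sketch of deriving the continuous and discrete time features listed above from a timestamp, using only the standard `datetime` module. The function name `time_features`, the dictionary keys, and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime

def time_features(ts, last_event):
    """Derive continuous and discrete features from a timestamp."""
    _, iso_week, _ = ts.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return {
        # Continuous: interval since the last purchase/click, in seconds.
        "seconds_since_last_event": (ts - last_event).total_seconds(),
        # Discrete features.
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),        # Monday = 0, Sunday = 6
        "week_of_year": iso_week,
        "quarter": (ts.month - 1) // 3 + 1,
        "is_weekend": ts.weekday() >= 5,
    }

# Hypothetical events: the current click and the previous one.
now = datetime(2024, 3, 16, 14, 30)
last = datetime(2024, 3, 14, 9, 0)
features = time_features(now, last)
```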
Text
- Bag of words: after preprocessing and stop-word removal, the remaining words form a list, which is mapped to a sparse vector over the vocabulary.
- Extend the bag of words from single words to n-grams.
- TF-IDF:
TF (term frequency): TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
IDF (inverse document frequency): IDF(t) = ln(total number of documents / number of documents containing t)
TF-IDF weight = TF(t) * IDF(t)
- Bag of words => word2vec
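The bag-of-words, n-gram, and TF-IDF steps can be sketched in plain Python as below. The sketch assumes the common definition of TF as the term count divided by the document length; the stop-word list, the whitespace tokenizer, and the three sample sentences are toy assumptions.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on"}  # toy stop-word list (assumption)

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return weights

texts = ["the cat sat on the mat", "the dog sat", "a cat is a cat"]
docs = [tokenize(t) for t in texts]

# Bag of words as a sparse vector: {vocabulary index: count}.
vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}
sparse = [{vocab[t]: c for t, c in Counter(d).items()} for d in docs]

bigrams = ngrams(docs[0], 2)
weights = tf_idf(docs)
```

A term that occurs in every document gets IDF = ln(1) = 0, so its TF-IDF weight vanishes regardless of how often it appears.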
Statistical
- Difference from the average: how far a value lies above or below the mean
- Quantile: which quantile the value falls in
- Rank: the position of the value in a sorted order
- Proportion: for example, the shares of good/neutral/bad reviews in e-commerce
Combination
- Simple feature combination: concatenation
- Model-based feature combination: use GBDT to produce combined features, then feed them together with the original features into LR (logistic regression) training.
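The two kinds of combination can be sketched as follows. One common reading of the GBDT step, assumed here, is that the index of the leaf a sample reaches in each tree is one-hot encoded and appended to the original features. The two trees below are hand-written, hypothetical stand-ins for trained GBDT trees, and no training is performed; only the encoding step is shown.

```python
def cross(*values):
    """Simple combination: concatenate categorical values into one feature."""
    return "_".join(str(v) for v in values)

# Hypothetical stand-ins for trained trees: each returns the index of the
# leaf that a sample falls into. A real GBDT would learn these splits.
def tree_0(x):
    if x["age"] < 30:
        return 0 if x["price"] < 50 else 1
    return 2

def tree_1(x):
    return 0 if x["price"] < 20 else 1

TREES = [(tree_0, 3), (tree_1, 2)]  # (tree, number of leaves)

def leaf_features(x):
    """One-hot encode the leaf reached in each tree and concatenate."""
    vec = []
    for tree, n_leaves in TREES:
        leaf = tree(x)
        vec.extend(1 if i == leaf else 0 for i in range(n_leaves))
    return vec

sample = {"age": 25, "price": 60, "gender": "f", "category": "shoes"}

crossed = cross(sample["gender"], sample["category"])
# Original numerical features followed by the tree-derived combined features;
# a vector like this would be the input to the LR model.
lr_input = [sample["age"], sample["price"]] + leaf_features(sample)
```

Each leaf corresponds to a conjunction of split conditions along its path, which is why a leaf indicator acts as a combined feature.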