Data and Feature Processing

Handling different data types

Numerical
  • Statistics: max, min, mean, std (variance)
  • Discretization
    For example, prices are divided into segments (of equal or unequal width), each segment is represented by one vector, and different prices that fall into the same segment share the same vector.
  • Hash bucket
  • Histogram of the variable's values within each category (its distribution)
  • Numerical => categorical
  • Scaling/normalization
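The discretization and scaling items above can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the function names, the sample prices, and the bin edges are invented for the example, and real pipelines would more likely use a library such as pandas or scikit-learn.

```python
from bisect import bisect_right
from statistics import mean, pstdev

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def custom_bins(values, edges):
    """Unequal-width bins: the bin index is the number of edges <= value."""
    return [bisect_right(edges, v) for v in values]

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0
    return [(v - mu) / sigma for v in values]

prices = [5.0, 12.0, 18.0, 40.0, 95.0, 100.0]  # made-up sample data
price_bins = equal_width_bins(prices, n_bins=4)   # [0, 0, 0, 1, 3, 3]
price_tiers = custom_bins(prices, edges=[10, 50])  # [0, 1, 1, 1, 2, 2]
scaled = min_max_scale(prices)
z_scores = standardize(prices)
```

Note how 5, 12, and 18 all fall into bin 0 under equal-width binning: different prices end up with the same discrete value, which is the point of the technique.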
Categorical
  • One-hot encoding/dummy variables
    For example, red, yellow, and blue each correspond to one vector; every category value gets its own one-hot vector.
  • Hashing and clustering
  • Tip: for each category value, compute the proportion of each target class and use it as a numerical feature.
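As a rough sketch of the two main ideas in this list, the snippet below one-hot encodes a color column and then computes the target proportion per category. The data, the binary `clicked` target, and the function names are all assumptions made for illustration.

```python
from collections import Counter

def one_hot(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors

def target_rate(values, targets):
    """For each category, the fraction of rows whose binary target is 1."""
    totals, positives = Counter(), Counter()
    for v, t in zip(values, targets):
        totals[v] += 1
        positives[v] += t
    return {c: positives[c] / totals[c] for c in totals}

colors = ["red", "yellow", "blue", "red", "blue", "red"]  # made-up data
clicked = [1, 0, 0, 1, 1, 0]                              # hypothetical binary target

categories, encoded = one_hot(colors)
rates = target_rate(colors, clicked)
# Replace each category by its target proportion to get a numerical feature.
color_rate_feature = [rates[c] for c in colors]
```

Computing the proportions on the same rows that are later used for training leaks the target into the feature, so in practice they are usually estimated on separate data or out-of-fold.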
Time

Can be treated as either a continuous or a discrete value.
1. Continuous values

  • Duration (time spent on a single page view)

  • Interval (time since the last purchase/click)

2. Discrete values

  • Time of day
  • Day of the week
  • Week of the year
  • Quarter of the year
  • Weekday/weekend
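A minimal sketch of how these time features might be derived from a timestamp with Python's `datetime` module. The function name, the dictionary keys, and the two sample timestamps are assumptions for the example; the week number follows the ISO 8601 convention.

```python
from datetime import datetime

def time_features(ts, last_event):
    """Derive continuous and discrete features from an event timestamp."""
    _, iso_week, _ = ts.isocalendar()
    return {
        # Continuous: interval since the previous purchase/click, in seconds.
        "seconds_since_last_event": (ts - last_event).total_seconds(),
        # Discrete features.
        "hour": ts.hour,
        "day_of_week": ts.weekday(),          # Monday = 0 ... Sunday = 6
        "week_of_year": iso_week,             # ISO 8601 week number
        "quarter": (ts.month - 1) // 3 + 1,
        "is_weekend": ts.weekday() >= 5,
    }

features = time_features(
    ts=datetime(2024, 3, 16, 21, 30),         # a Saturday evening
    last_event=datetime(2024, 3, 14, 9, 0),
)
```

The discrete features can then be one-hot encoded like any other categorical variable, while the interval can be scaled or binned like a numerical one.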
Text
  • Bag of words
    After the text is preprocessed and stop words are removed, the remaining words form a list, which is mapped to a sparse vector over the vocabulary.
  • n-grams: extend the bag of words from single words to n-grams.
  • TF-IDF:
    TF (term frequency): TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
    IDF (inverse document frequency): IDF(t) = ln(total number of documents / number of documents containing t)
    TF-IDF weight = TF(t) * IDF(t)

  • bag of words => word2vec
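The bag-of-words, n-gram, and TF-IDF steps can be sketched as follows. This is a toy version under stated assumptions: the stop-word list and documents are invented, tokenization is a plain whitespace split, and TF is taken as the term count divided by the number of tokens in the document (a common convention). Each document's weights are stored as a dictionary, which acts as a sparse vector.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on"}  # hypothetical toy stop-word list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
bigrams = ngrams(tokenize(docs[0]), 2)
weights = tf_idf(docs)
```

In the first document, "mat" appears in only one of the three documents while "cat" appears in two, so "mat" receives the larger weight.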
Statistical
  • Difference from the mean: how far a value lies above or below the average
  • Quantiles
  • Rank: the position of a value in the sorted order
  • Proportion: for example, the ratio of good/medium/bad reviews in e-commerce
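The statistical features listed here might be computed along these lines with the standard library. The helper names and sample data are assumptions for illustration; `statistics.quantiles` uses its default interpolation method.

```python
from collections import Counter
from statistics import mean, quantiles

def diff_from_mean(values):
    """How far each value lies above (+) or below (-) the average."""
    mu = mean(values)
    return [v - mu for v in values]

def ranks(values):
    """Rank 1 is the largest value; tied values share the best rank."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def proportions(labels):
    """Share of each label among all labels."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

prices = [10.0, 20.0, 30.0, 40.0, 100.0]          # made-up item prices
reviews = ["good", "good", "medium", "bad"]        # made-up review labels

price_gap = diff_from_mean(prices)
price_quartiles = quantiles(prices, n=4)           # three quartile cut points
price_rank = ranks(prices)
review_share = proportions(reviews)
```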

Combination
  • Simple combination features: concatenation (splicing)

  • Model-based feature combination
    Use GBDT to produce combined features, then feed them together with the original features into LR (logistic regression) for training.
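One way the GBDT + LR combination could look with scikit-learn is sketched below. It assumes scikit-learn and NumPy are installed; the synthetic dataset and hyperparameters are arbitrary choices for illustration. Each sample's leaf index in every tree is treated as a categorical feature, one-hot encoded, and concatenated with the original features before fitting logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary classification data (assumption: stands in for real features).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
gbdt.fit(X, y)

# apply() returns the leaf reached in each tree; for binary classification
# the shape is (n_samples, n_estimators, 1), so drop the last axis.
leaves = gbdt.apply(X)[:, :, 0]

# Each (tree, leaf) pair becomes one binary "combined" feature.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves).toarray()

# Put the combined features together with the original features and train LR.
combined = np.hstack([X, leaf_features])
lr = LogisticRegression(max_iter=1000)
lr.fit(combined, y)
predictions = lr.predict(combined)
```

For brevity the GBDT and the LR are fitted on the same rows here; fitting them on separate splits is a common precaution against overfitting.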
