FTRL (Follow The Regularized Leader) learning summary

Summary:

  1. Algorithm overview

  2. Algorithm key points and derivation

  3. Algorithm characteristics, advantages, and disadvantages

  4. Precautions

  5. Implementation and specific examples

  6. Applicable scenarios

Content:

  1. Algorithm overview

  FTRL is a widely used optimization algorithm suited to online learning on ultra-large-scale data with many sparse features. It is convenient and practical, works well in practice, and is often used to update online CTR prediction models.

  The FTRL algorithm combines the advantages of the FOBOS and RDA algorithms: like FOBOS it maintains relatively high accuracy, and like RDA it can trade a small amount of accuracy for much better sparsity.

  FTRL performs very well on convex optimization problems with non-smooth regularization terms (such as L1 regularization): it controls model sparsity through the L1 term while still converging quickly.

  2. Algorithm key points and derivation
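  A brief sketch of the update, following the notation of McMahan et al. (KDD 2013), which introduced FTRL-Proximal: the algorithm picks the next weight vector by minimizing the linearized loss over all past rounds, a proximal term that keeps the iterate close to past solutions, and an L1 penalty:

  $$ \mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left( \mathbf{g}_{1:t}\cdot\mathbf{w} + \frac{1}{2}\sum_{s=1}^{t}\sigma_s\,\lVert\mathbf{w}-\mathbf{w}_s\rVert_2^2 + \lambda_1\lVert\mathbf{w}\rVert_1 \right) $$

  Here $\mathbf{g}_{1:t}=\sum_{s=1}^{t}\mathbf{g}_s$ is the accumulated gradient and $\sigma_s$ is defined so that $\sum_{s=1}^{t}\sigma_s = 1/\eta_t$, the inverse of the (per-coordinate) learning rate. Keeping only the state $z_{t,i} = g_{1:t,i} - \sum_{s\le t}\sigma_s w_{s,i}$ yields the closed-form coordinate update

  $$ w_{t+1,i} = \begin{cases} 0 & \text{if } |z_{t,i}| \le \lambda_1, \\ -\eta_{t,i}\big(z_{t,i} - \operatorname{sgn}(z_{t,i})\,\lambda_1\big) & \text{otherwise,} \end{cases} $$

  which makes the source of the sparsity explicit: any coordinate whose accumulated signal stays below $\lambda_1$ is exactly zero.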

  

  3. Algorithm characteristics, advantages, and disadvantages

  Engineering tricks used in the FTRL-Proximal production implementation:

  1. Saving memory

    1) Poisson Inclusion: when a feature arrives in a training sample for the first time, admit it into the model with probability p; once admitted, update it normally.
    2) Bloom Filter Inclusion: use a (counting) Bloom filter to admit a feature only after it has appeared at least k times; because the filter can report false positives, the threshold holds only probabilistically. Both gates are sketched below.
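  A minimal Python sketch of both admission gates (the class names PoissonGate and CountingBloomGate, and the parameters p and k, are illustrative, not from the source):

```python
import random
import hashlib

class PoissonGate:
    """Admit a brand-new feature into the model with probability p."""
    def __init__(self, p):
        self.p = p
        self.admitted = set()

    def admit(self, feature):
        if feature in self.admitted:
            return True
        if random.random() < self.p:
            self.admitted.add(feature)
            return True
        return False

class CountingBloomGate:
    """Admit a feature once it has (probabilistically) been seen k times.
    Uses a counting Bloom filter, so false positives may admit a feature early."""
    def __init__(self, k, size=1 << 20, hashes=3):
        self.k, self.size, self.hashes = k, size, hashes
        self.counts = [0] * size

    def _slots(self, feature):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{feature}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def admit(self, feature):
        slots = list(self._slots(feature))
        if min(self.counts[s] for s in slots) >= self.k:
            return True          # seen (at least) k times -> update the weight
        for s in slots:
            self.counts[s] += 1  # otherwise just count this occurrence
        return False
```

  In both cases, memory is saved because weights are never allocated for the long tail of features that appear only once or twice.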

  2. Recoding floating-point numbers

    1) Feature weights do not need to be stored as 32-bit or 64-bit floating-point numbers; nearly all coefficients fall in a small range, so full precision wastes storage.
    2) A 16-bit fixed-point encoding (q2.13 in the paper) is enough, but randomized rounding is needed so that accumulated rounding error does not hurt the regret bound; see the sketch below.
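  A sketch of the 16-bit scheme under the paper's q2.13 layout (1 sign bit, 2 integer bits, 13 fractional bits); the randomized rounding keeps the stored value unbiased, which is what protects the regret bound. Function names are illustrative:

```python
import random

# q2.13 fixed point: 13 fractional bits, representable range roughly [-4, 4).
SCALE = 1 << 13

def encode_q213(w):
    """Round w to the q2.13 grid, rounding up or down at random in
    proportion to the remainder so the rounding error is unbiased."""
    scaled = w * SCALE
    low = int(scaled // 1)              # floor of the scaled value
    frac = scaled - low                 # remainder in [0, 1)
    q = low + (1 if random.random() < frac else 0)
    return max(-4 * SCALE, min(4 * SCALE - 1, q))  # clamp to 16-bit range

def decode_q213(q):
    return q / SCALE
```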
  3. Train several similar models
    1) For the same sequence of training data, train multiple similar models at the same time.
    2) These models have some exclusive features of their own and some features shared with the others.
    3) Starting point: some feature dimensions can be exclusive to each model, while the features shared by all models can be trained together on the same data.
  4. Single Value Structure
    1) Multiple models share one feature-weight store (for example, kept in cbase or redis), and each model updates this common feature structure.
    2) For a given model, for a given dimension of the feature vector it has trained, compute the iterative result directly and average it with the stored old value.
  5. Computing gradient sums from counts
    Use the numbers of positive (P) and negative (N) samples to approximate the gradient sums needed by the per-coordinate learning rate (all models see the same N and P); see the sketch below.
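  A worked version of this count-based approximation (as given in the FTRL-Proximal paper, McMahan et al., KDD 2013): if a coordinate has occurred in P positive and N negative events, assume each event had the same click probability p = P/(N+P). For logistic loss the gradient magnitude is 1-p on a positive event and p on a negative one, so

  $$ \sum_t g_{t,i}^2 \approx P\left(1-\frac{P}{N+P}\right)^2 + N\left(\frac{P}{N+P}\right)^2 = \frac{PN}{N+P}, $$

  which is everything the learning-rate schedule $\eta_{t,i} = \alpha/(\beta+\sqrt{\sum_t g_{t,i}^2})$ requires.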

      

  6. Subsampling the training data
   1) In practice the CTR is far below 50%, so positive samples are much more valuable. Subsampling the training data set therefore greatly reduces its size.
   2) Keep every positive sample (queries where at least one ad was clicked) and sample negative samples (queries where no ad was clicked) at a rate r. Training directly on this sampled data, however, produces badly biased predictions.
   3) Solution: give each sample an importance weight when training. The weight multiplies the loss, so the gradient is multiplied by the same weight, which removes the bias in expectation; see the sketch below.
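  A minimal sketch of the reweighting; the weights ω = 1 for positives and ω = 1/r for sampled negatives follow the paper, while the function names are illustrative:

```python
def importance_weight(y, r):
    """ω = 1 for positive samples, 1/r for sampled negatives, so the
    expected weighted gradient matches the one on the full data set."""
    return 1.0 if y == 1 else 1.0 / r

def weighted_logistic_gradient(p, y, x, r):
    """Gradient of ω * logloss for a sparse sample x: {feature index: value}."""
    w = importance_weight(y, r)
    return {i: w * (p - y) * v for i, v in x.items()}
```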

 

   Algorithm features:

   Online learning with high real-time performance; handles large-scale sparse data; can train models with very large numbers of parameters; uses a separate (per-coordinate) learning rate for each feature.

   Shortcomings:

     

  4. Precautions

  5. Implementation and specific examples

    FTRL on the Kaggle "Springleaf Marketing Response" data

    Spark Streaming on Angel FTRL
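    As a concrete reference, a minimal per-coordinate FTRL-Proximal logistic-regression sketch (the hyperparameters alpha, beta, l1, l2 follow the paper's notation; this is an illustrative toy, not the code of either project above):

```python
import math

class FTRLProximal:
    """Minimal per-coordinate FTRL-Proximal for logistic regression.
    Stores z (adjusted gradient sum) and n (squared-gradient sum)
    only for features that have been seen, so the model stays sparse."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # z_i: accumulated gradient minus proximal correction
        self.n = {}  # n_i: accumulated squared gradient

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 clipping: this is where the sparsity comes from
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / \
               ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x is a sparse sample: {feature index: value}."""
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, y):
        """One online step on sample x with label y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v                    # gradient of logloss
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n_new
        return p

# Toy usage: three sparse samples with binary features.
model = FTRLProximal()
for x, y in [({0: 1, 3: 1}, 1), ({1: 1, 3: 1}, 0), ({0: 1, 2: 1}, 1)]:
    model.update(x, y)
print(model.predict({0: 1, 3: 1}))
```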

  6. Applicable scenarios

    CTR prediction models
