Spark - Isotonic Regression Theory and Practice

1. Introduction

Isotonic regression is also known as order-preserving or monotonic regression. The word "isotonic" comes from the Greek roots 'iso' (equal) and 'tonos' (stretching): compared with conventional regression, it adds an order-preserving constraint on the fitted values. This article gives a brief theoretical analysis and a code walkthrough of Spark's Isotonic Regression.

2. Isotonic Regression Theory

1. Functional form

Isotonic regression belongs to the family of regression algorithms and is similar in form to conventional regression. Given observations y1, y2, ..., yn with inputs x1, x2, ..., xn, we look for a mapping function f(x) that minimizes the following loss:

\min \sum_{i=1}^{n} w_i (y_i - f(x_i))^2

subject to the order-preserving constraint on the fitted values:

x_i \leq x_j \Rightarrow f(x_i) \leq f(x_j)

Here the mapping function f(x) is the regression function we finally generate.

2. Introduction to PAV Algorithm

Spark MLlib uses the Pool Adjacent Violators (PAV) algorithm to solve isotonic regression. As with other linear models, we first assume a common variance, i.e. each y_i follows a normal distribution:

N(\mu_i, \sigma^2)

with means that satisfy the monotonicity assumption:

x_i \leq x_j \Rightarrow \mu_i \leq \mu_j

To satisfy the monotonic constraint, a point that violates it is pooled with its adjacent points into one sequence, and every point in that pool is assumed to share the same distribution. That is, if there is an out-of-order pair:

x_i < x_{i+1} \quad \text{but} \quad y_i > y_{i+1}

then the two points are pooled into a new common distribution:

N(\frac{y_i + y_{i+1}}{2}, \sigma ^2)

At this point, for x[i] and x[i+1], f(x) returns (y[i] + y[i+1]) / 2 as the prediction, i.e. μ = (y[i] + y[i+1]) / 2. Moving on to x[i+2]: if (y[i] + y[i+1]) / 2 > y[i+2], the sequence is still out of order, so y[i+2] is absorbed to form a new distribution:

N(\frac{y_i + y_{i+1} + y_{i+2}}{3}, \sigma ^2)

This absorb-and-merge loop repeats. If instead (y[i] + y[i+1]) / 2 ≤ y[i+2], then x[i] and x[i+1] remain in the same distribution and x[i+2] starts its own distribution with μ = y[i+2]. The sequence-expansion process continues until the fitted values for all x are monotone. Put simply, any point that violates monotonicity is pooled with its neighboring points and merged into a single distribution, and the predicted value over that interval becomes a flat segment. Here we have assumed all weights w are equal to 1; with unequal weights, the pool mean becomes a weighted average instead of a simple average.
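
To make the pooling process concrete, here is a minimal PAV sketch in Scala. It is an illustration of the idea under the assumption that the (y, x, w) triples are already sorted by x; it is not Spark's internal implementation.

    import scala.collection.mutable.ArrayBuffer

    // Minimal PAV sketch: input (y, x, w) sorted by x, output one (x, fitted y) per pool.
    def pav(points: Seq[(Double, Double, Double)]): Seq[(Double, Double)] = {
      // Each pool keeps its weighted mean, total weight and right boundary x.
      case class Pool(mean: Double, weight: Double, rightX: Double)
      val pools = ArrayBuffer[Pool]()

      points.foreach { case (y, x, w) =>
        pools += Pool(y, w, x)
        // Absorb backwards while the previous pool's mean violates monotonicity.
        while (pools.length >= 2 && pools(pools.length - 2).mean > pools.last.mean) {
          val right   = pools.remove(pools.length - 1)
          val left    = pools.remove(pools.length - 1)
          val mergedW = left.weight + right.weight
          val mean    = (left.mean * left.weight + right.mean * right.weight) / mergedW
          pools += Pool(mean, mergedW, right.rightX)
        }
      }
      // The pooled mean is the monotone fitted value at each pool boundary.
      pools.map(p => (p.rightX, p.mean)).toSeq
    }

For example, the sequence y = (1, 3, 2, 4) with unit weights pools the violating pair (3, 2) into a mean of 2.5, giving the monotone fit (1, 2.5, 2.5, 4).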

Tips:

In a regression task, f(x) must return the same prediction for the same x, but the raw data may contain the same x with different y values. For such x we first collect all of their y values and use mean(y) as the target for that x before entering the steps above. Taking the mean is natural here: since the loss is the mean squared error, for values y1, y2, ..., yn the loss is minimized at mean(y).
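
As a small sketch of this preprocessing step (the data here is purely illustrative), duplicate x values can be collapsed to the mean of their y values before the pooling above runs:

    // Collapse duplicate x values to the mean of their y values (unit weights assumed).
    val raw = Seq((2.0, 3.0), (2.0, 5.0), (2.0, 7.0), (4.0, 6.0)) // (x, y) pairs
    val collapsed = raw
      .groupBy(_._1)                                               // group points sharing the same x
      .map { case (x, pts) => (x, pts.map(_._2).sum / pts.size) }  // mean of y per x
      .toSeq
      .sortBy(_._1)
    // collapsed == Seq((2.0, 5.0), (4.0, 6.0)): x = 2 now maps to mean(3, 5, 7) = 5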

3. Algorithm illustration

The figure below shows how the PAV algorithm forms sequences and absorbs violating points into the same distribution.

Under the regression loss and the monotonicity requirement, the resulting f(x) ends up as a piecewise function.

3. Isotonic Regression in practice 

1. Simulate out-of-order data

Here we simulate a regression problem with y = 10 * x, randomly perturbing a fraction of the points to create out-of-order data.

    // Build out-of-order data
    val dataBuffer = new ArrayBuffer[(Double, Double)]()
    val random = new Random()

    (0 to 10000).foreach(num => {
      val x = random.nextDouble()
      val y = if (random.nextDouble() < 0.1) {
        x * 10 - random.nextDouble() * 2
      } else {
        x * 10
      }
      dataBuffer.append((y, x))
    })

2. Model training

Split the data into a training set and a test set, then train the model. setIsotonic(true) requires a non-decreasing (isotonic) fit; setting it to false would fit an antitonic (non-increasing) model.

    // Split into training and test sets
    val splits = sc.parallelize(dataBuffer).map(data => {
      (data._1, data._2, 1.0)
    }).randomSplit(Array(0.6, 0.4))

    val training = splits(0)
    val test = splits(1)

    // Train the model
    val model = new IsotonicRegression().setIsotonic(true).run(training)

3. Model evaluation

Calculate MSE using original labels and predicted values.

    // Predicted values vs. true labels
    val predictionAndLabel = test.map { point =>
      val predictedLabel = model.predict(point._2)
      (predictedLabel, point._1)
    }

    // Compute MSE
    val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow((p - l), 2) }.mean()
    println(s"Mean Squared Error = $meanSquaredError")

4. Model saving and loading

Save the model via the .save method and load the model via the .load method.

    // Save and load model
    model.save(sc, "target/tmp/myIsotonicRegressionModel")
    val sameModel = IsotonicRegressionModel.load(sc, "target/tmp/myIsotonicRegressionModel")

Tips: complete code

    // Required imports for the complete example (MLlib RDD-based API)
    import scala.collection.mutable.ArrayBuffer
    import scala.util.Random

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.mllib.regression.IsotonicRegression

    val conf = (new SparkConf).setAppName("IsotonicLR").setMaster("local[*]")

    val spark = SparkSession
      .builder
      .config(conf)
      .getOrCreate()

    val sc = spark.sparkContext

    // Build out-of-order data
    val dataBuffer = new ArrayBuffer[(Double, Double)]()
    val random = new Random()

    (0 to 10000).foreach(num => {
      val x = random.nextDouble()
      val y = if (random.nextDouble() < 0.1) {
        x * 10 - random.nextDouble() * 2
      } else {
        x * 10
      }
      dataBuffer.append((y, x))
    })

    // Split into training and test sets
    val splits = sc.parallelize(dataBuffer).map(data => {
      (data._1, data._2, 1.0)
    }).randomSplit(Array(0.6, 0.4))

    val training = splits(0)
    val test = splits(1)

    // Train the model
    val model = new IsotonicRegression().setIsotonic(true).run(training)

    // Predicted values vs. true labels
    val predictionAndLabel = test.map { point =>
      val predictedLabel = model.predict(point._2)
      (predictedLabel, point._1)
    }

    // Compute MSE
    val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow((p - l), 2) }.mean()
    println(s"Mean Squared Error = $meanSquaredError")

4. Isotonic Regression application 

1. Custom model loading

In real scenarios, loading an Isotonic Regression model that Spark trained and saved requires passing in sc, i.e. a SparkContext.

But if we want to use the model in a Flink environment, for example, it is not convenient to initialize a SparkContext. The SparkContext is only needed to read the saved model data, and the data behind an isotonic regression model is actually very simple: two Array[Double] values, boundaries and predictions, plus a Boolean isotonic flag. So we can store the two arrays in Redis and construct a new IsotonicRegressionModel directly in our own job:

    val selfModel = new IsotonicRegressionModel(boundaries, predictions, isotonic)

The advantage is that this removes the dependency on sc, and a model constructed this way is serializable, so broadcasting it is not a problem.

2. Custom code testing

    val boundaries = model.boundaries
    val predictions = model.predictions
    println(boundaries.mkString(","))
    println(predictions.mkString(","))
    val isotonic = true
    val selfModel = new IsotonicRegressionModel(boundaries, predictions, isotonic)

    val predictionAndLabelSelf = test.map { point =>
      val predictedLabel = selfModel.predict(point._2)
      (predictedLabel, point._1)
    }

    // Calculate mean squared error between predicted and real labels.
    val meanSquaredErrorSelf = predictionAndLabelSelf.map { case (p, l) => math.pow((p - l), 2) }.mean()
    println(s"Mean Squared Error Self = $meanSquaredError")

Here the boundaries and predictions of the model we just trained are used to construct the isotonic regression model directly; the MSE it produces is identical to that of the model obtained via save and load. In practice, simply store the two arrays in a medium such as Redis.
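
As a rough sketch of that setup (the Jedis client, host, and key names here are assumptions for illustration, not part of the original code), the two arrays can be stored as delimited strings and the model rebuilt later without any SparkContext:

    import org.apache.spark.mllib.regression.IsotonicRegressionModel
    import redis.clients.jedis.Jedis

    // Hypothetical Redis connection and keys; adapt to your own environment.
    val jedis = new Jedis("localhost", 6379)

    // Save: serialize the two arrays as comma-separated strings.
    jedis.set("ir:boundaries", model.boundaries.mkString(","))
    jedis.set("ir:predictions", model.predictions.mkString(","))

    // Load: parse the strings back and rebuild the model without a SparkContext.
    val boundariesFromRedis  = jedis.get("ir:boundaries").split(",").map(_.toDouble)
    val predictionsFromRedis = jedis.get("ir:predictions").split(",").map(_.toDouble)
    val rebuiltModel = new IsotonicRegressionModel(boundariesFromRedis, predictionsFromRedis, true)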

3. Custom model prediction

The prediction logic of the official predict method is very clear: first, binary search is used to locate the test value x within boundaries, and then the result is computed according to four cases:

- x less than all boundaries: return predictions.head

- x greater than all boundaries: return predictions.last

- Index not found (x falls between two boundaries): linearly interpolate between the two neighboring predictions

- Index found (exact match): return predictions(index)

Interested readers can copy the source code and call it themselves; a sketch of the same logic is given below.
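
Below is a sketch of that four-case logic, written against the public boundaries and predictions arrays; it mirrors the behavior described above rather than being a verbatim copy of Spark's source:

    // Sketch of isotonic prediction over sorted boundaries with matching predictions.
    def predictSelf(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
      // Linear interpolation between two neighboring (boundary, prediction) points.
      def interpolate(x1: Double, y1: Double, x2: Double, y2: Double, x: Double): Double =
        y1 + (y2 - y1) * (x - x1) / (x2 - x1)

      val foundIndex  = java.util.Arrays.binarySearch(boundaries, x)
      val insertIndex = -foundIndex - 1

      if (insertIndex == 0) {
        predictions.head                  // x is below all boundaries
      } else if (insertIndex == boundaries.length) {
        predictions.last                  // x is above all boundaries
      } else if (foundIndex < 0) {
        interpolate(                      // x falls between two boundaries
          boundaries(insertIndex - 1), predictions(insertIndex - 1),
          boundaries(insertIndex), predictions(insertIndex), x)
      } else {
        predictions(foundIndex)           // exact match on a boundary
      }
    }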

5. Summary

A classic application of isotonic regression is drug dosage testing, where the dose and the patient's response are assumed to be positively and monotonically related. Isotonic regression fits this setting: it keeps the order of X unchanged and pools the mean of Y. For example, if the model predicts the same value for doses between 20 and 30, then taking cost and drug tolerance into account, a dose of 20 can be considered the ideal choice.

In addition, isotonic regression can be applied to calibrating click-through-rate (CTR) estimates, because our prior knowledge is that the model's predicted score should be monotonically related to the true click-through rate. Based on this prior, isotonic regression can be used to calibrate CTR predictions.
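
As a brief sketch of that idea (the scoredLogs RDD of (click label, raw score) pairs below is an assumed input, not from the original code), the model score is used as the feature and the observed click as the label:

    // Fit a calibration curve: label = observed click (0/1), feature = raw model score.
    val calibrationData = scoredLogs.map { case (click, rawScore) => (click, rawScore, 1.0) }
    val calibrator = new IsotonicRegression().setIsotonic(true).run(calibrationData)

    // Map a new raw score to a calibrated click-through-rate estimate.
    val calibratedCtr = calibrator.predict(0.37)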


