Spark MLlib: Source Code Analysis of Classification Models


Weight Updates

MLlib updates the weights of its linear models (LR, SVM, linear regression, etc.) as follows:

$$w = w - \text{thisIterStepSize}\cdot\left(\text{gradient} + \text{regParam}\cdot w\right)$$

$$w = \left(1 - \text{thisIterStepSize}\cdot\text{regParam}\right)\cdot w - \text{thisIterStepSize}\cdot\text{gradient}$$
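A quick scalar sketch in plain Scala (hypothetical numbers, not MLlib code) confirming that the two forms are algebraically identical:

val w = 0.8
val gradient = 0.3
val thisIterStepSize = 0.1
val regParam = 0.01
// first form: subtract the regularized gradient
val form1 = w - thisIterStepSize * (gradient + regParam * w)
// second form: shrink the weight, then subtract the plain gradient
val form2 = (1.0 - thisIterStepSize * regParam) * w - thisIterStepSize * gradient
assert(math.abs(form1 - form2) < 1e-12)  // both give 0.7692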

Logistic Regression (LR) Recap

Logistic regression is a widely used classification model in machine learning for separating samples into classes. Its details are not the focus of this article; for the underlying principles and derivations, see zuoxy09's post 机器学习算法与Python实践之(七)逻辑回归(Logistic Regression).
The cost function of logistic regression is:

$$J(\theta) = -\left[\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta\left(x^{(i)}\right)\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \tag{1}$$

Applying gradient descent, we differentiate $J(\theta)$ with respect to $\theta_j$ and obtain the update rule:

$$\theta_{j+1} = \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \tag{2}$$

For background on gradient descent methods, see 随机梯度下降与批量梯度下降 (stochastic vs. batch gradient descent).
Rearranging the expression above gives:

$$\theta_{j+1} = \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)x_j^{(i)} \tag{3}$$
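Comparing with the MLlib update quoted at the start of the article: runMiniBatchSGD (shown below) divides the gradient sum by the mini-batch size, so the $\frac{1}{m}$ factor is folded into the gradient term, and regParam plays the role of $\frac{\lambda}{m}$:

$$\theta = \theta\left(1 - \alpha\cdot\text{regParam}\right) - \alpha\cdot\text{gradient}, \qquad \text{gradient} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)x^{(i)}$$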

Parallel Gradient Descent

Gradient descent maps naturally onto a MapReduce implementation, where:
the map function computes the gradient of each point: $\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)x_j^{(i)}$
the reduce function sums the gradients of all points and adds the regularization term: $\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j$
A minimal RDD sketch of this pattern follows below.
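As a concrete illustration, here is a minimal sketch (not the MLlib implementation) of one batch-gradient step for logistic regression written as a map plus a reduce over a Spark RDD; the plain Array[Double] vectors and the hypothetical gradientStep helper are simplifications introduced here for clarity:

import org.apache.spark.rdd.RDD

def gradientStep(
    data: RDD[(Double, Array[Double])],   // (label, features)
    theta: Array[Double],
    alpha: Double,
    lambda: Double): Array[Double] = {
  val m = data.count().toDouble
  // map: per-point gradient (h_theta(x) - y) * x
  val gradSum = data.map { case (y, x) =>
    val margin = x.zip(theta).map { case (xi, ti) => xi * ti }.sum   // theta . x
    val h = 1.0 / (1.0 + math.exp(-margin))                          // sigmoid
    x.map(_ * (h - y))
  // reduce: element-wise sum of the per-point gradients
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  // add the regularization term and apply one update step
  theta.zip(gradSum).map { case (t, g) => t - alpha * (g / m + lambda / m * t) }
}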

MLlib Source Code Analysis

The core code that solves for the logistic regression parameters $\theta_j$ via gradient descent lives in the GradientDescent#runMiniBatchSGD method, which breaks down into two parts:

/**
  *
  * @param data training data as an RDD of (label, [features]) pairs
  * @param gradient here a LogisticGradient, used to compute the gradient and loss of each sample
  * @param updater here a SquaredL2Updater, used to update the weights at each iteration
  * @param stepSize initial step size
  * @param numIterations number of iterations
  * @param regParam regularization parameter
  * @param miniBatchFraction fraction of the samples used in each iteration
  * @param initialWeights initial weight vector
  * @param convergenceTol stop iterating once the change between iterations drops below this threshold
  * @return (Vector, Array[Double]) the final weights and the loss recorded at each iteration
*/
def runMiniBatchSGD(
    data: RDD[(Double, Vector)],
    gradient: Gradient,
    updater: Updater,
    stepSize: Double,
    numIterations: Int,
    regParam: Double,
    miniBatchFraction: Double,
    initialWeights: Vector,
    convergenceTol: Double): (Vector, Array[Double]) = {

  // History of the loss value at each iteration
  val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
  // Record previous weight and current one to calculate solution vector difference
  var previousWeights: Option[Vector] = None
  var currentWeights: Option[Vector] = None
  val numExamples = data.count()
  // Initialize weights as a column vector
  var weights = Vectors.dense(initialWeights.toArray)
  val n = weights.size
  // Compute the initial regularization value before the first iteration
  var regVal = updater.compute(
    weights, Vectors.zeros(weights.size), 0, 1, regParam)._2
  // indicates whether converged based on convergenceTol
  var converged = false 
  var i = 1
  while (!converged && i <= numIterations) {
    val bcWeights = data.context.broadcast(weights)
    // Sample a subset (fraction miniBatchFraction) of the total data
    // compute and sum up the subgradients on this subset (this is one map-reduce)
    // For how treeAggregate works, see http://stackoverflow.com/questions/29860635/how-to-interpret-rdd-treeaggregate
    // Compute the gradient sum and loss sum over the sampled mini-batch
    val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
      .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
        seqOp = (c, v) => {
          // c: (grad, loss, count), v: (label, features)
          val loss = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
          (c._1, c._2 + loss, c._3 + 1)
        },
        combOp = (c1, c2) => {
          // c: (grad, loss, count)
          (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
        })

    if (miniBatchSize > 0) {
      stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
      // Update the weights and the regularization value
      val update = updater.compute(
        weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
        stepSize, i, regParam)
      weights = update._1
      regVal = update._2

      previousWeights = currentWeights
      currentWeights = Some(weights)
      if (previousWeights != None && currentWeights != None) {
        converged = isConverged(previousWeights.get,
          currentWeights.get, convergenceTol)
      }
    } else {
      logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
    }
    i += 1
  }
  (weights, stochasticLossHistory.toArray)
}
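Before looking at the two helper computations, here is a hedged sketch of how runMiniBatchSGD can be driven; the parameter values are illustrative, and the exact overloads/visibility of the GradientDescent object may vary across Spark versions:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}
import org.apache.spark.rdd.RDD

// `trainData` is assumed to already be an RDD of (label, features) pairs.
def trainLR(trainData: RDD[(Double, Vector)], numFeatures: Int): (Vector, Array[Double]) = {
  GradientDescent.runMiniBatchSGD(
    trainData.cache(),
    new LogisticGradient(),            // gradient of the logistic loss
    new SquaredL2Updater(),            // L2-regularized weight update
    /* stepSize          = */ 1.0,
    /* numIterations     = */ 100,
    /* regParam          = */ 0.01,
    /* miniBatchFraction = */ 1.0,
    /* initialWeights    = */ Vectors.zeros(numFeatures),
    /* convergenceTol    = */ 0.001)
}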

runMiniBatchSGD relies on two components:
1. Computing the per-sample gradient and accumulating it: gradient.compute

$$\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)x_j^{(i)}$$

override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    val dataSize = data.size
    val margin = -1.0 * dot(data, weights)
    // multiplier = h_theta(x) - y, the per-sample factor inside the sum in equation (2)
    val multiplier = (1.0 / (1.0 + math.exp(margin))) - label
    // cumGradient holds the running sum of gradients
    // axpy below performs cumGradient += multiplier * data
    axpy(multiplier, data, cumGradient)
    // loss for this sample
    if (label > 0) {
      // The following is equivalent to log(1 + exp(margin)) but more numerically stable.
      MLUtils.log1pExp(margin)
    } else {
      MLUtils.log1pExp(margin) - margin
    }
}
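A small standalone check of this computation with hypothetical numbers (LogisticGradient and Vectors are the public MLlib classes used above):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.LogisticGradient

val lrGradient = new LogisticGradient()
val features = Vectors.dense(1.0, 2.0)
val weights = Vectors.dense(0.5, -0.25)          // theta . x = 0
val cumGradient = Vectors.dense(0.0, 0.0)        // accumulator, updated in place
val loss = lrGradient.compute(features, 1.0, weights, cumGradient)
// multiplier = sigmoid(0) - 1 = -0.5, so cumGradient becomes (-0.5, -1.0)
// and loss = log(1 + exp(0)) = log 2 ≈ 0.693
println(s"loss = $loss, cumGradient = $cumGradient")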

2. Updating the weights: updater.compute

$$\theta_{j+1} = \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\cdot\text{gradient}$$

override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    // step size alpha for this iteration (decays as 1/sqrt(iter))
    val thisIterStepSize = stepSize / math.sqrt(iter)
    // apply equation (3); brzWeights holds the current weights theta
    val brzWeights: BV[Double] = weightsOld.toBreeze.toDenseVector
    // :*= scales the vector in place by a scalar
    brzWeights :*= (1.0 - thisIterStepSize * regParam)
    // brzWeights += -thisIterStepSize * gradient
    brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
    // L2 norm used for the returned regularization value
    val norm = brzNorm(brzWeights, 2.0)
    (Vectors.fromBreeze(brzWeights), 0.5 * regParam * norm * norm)
  }
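And a matching single-step check of SquaredL2Updater.compute with hypothetical numbers:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater

val updater = new SquaredL2Updater()
val weightsOld = Vectors.dense(1.0, -2.0)
val gradient = Vectors.dense(0.5, 0.5)
val (weightsNew, regVal) = updater.compute(weightsOld, gradient, 0.1, 1, 0.01)
// thisIterStepSize = 0.1 / sqrt(1) = 0.1
// weightsNew = (1 - 0.1 * 0.01) * weightsOld - 0.1 * gradient = (0.949, -2.048)
// regVal     = 0.5 * 0.01 * ||weightsNew||^2
println(s"weightsNew = $weightsNew, regVal = $regVal")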

Linear SVM

cost function

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\max\left(0,\,1 - y^{(i)}h_\theta\left(x^{(i)}\right)\right) + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Note: $h_\theta(x^{(i)}) = \theta^T x^{(i)}$, and the labels are taken as $y^{(i)} \in \{-1, +1\}$.
Applying gradient descent to $J(\theta)$ yields the update for $\theta_j$:

$$\theta_{j+1} = \begin{cases} \theta_j - \alpha\left(\lambda\theta_j - \frac{1}{m}\sum_{i=1}^{m} y^{(i)}x_j^{(i)}\right), & \text{if } y^{(i)}h_\theta\left(x^{(i)}\right) < 1 \\ \theta_j - \alpha\lambda\theta_j, & \text{otherwise} \end{cases}$$

where the sum runs only over the points that violate the margin, i.e. those with $y^{(i)}h_\theta(x^{(i)}) < 1$.

For the derivation of the SVM parameter update, see here.
For a margin-violating point, the hinge loss contributes $-y^{(i)}x^{(i)}$ to the gradient; points outside the margin contribute nothing.
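MLlib implements this hinge-loss gradient in a HingeGradient class, structured like LogisticGradient.compute above. The following self-contained sketch uses plain arrays instead of MLlib's internal BLAS helpers, so it mirrors the structure rather than reproducing the exact source:

// label is assumed to be 0/1 as in the logistic case and is rescaled to -1/+1
def hingeCompute(
    data: Array[Double],
    label: Double,
    weights: Array[Double],
    cumGradient: Array[Double]): Double = {
  val dotProduct = data.zip(weights).map { case (x, w) => x * w }.sum   // theta . x
  val labelScaled = 2 * label - 1.0
  if (1.0 > labelScaled * dotProduct) {
    // margin violated: cumGradient += -y * x, loss = 1 - y * (theta . x)
    for (j <- data.indices) cumGradient(j) += -labelScaled * data(j)
    1.0 - labelScaled * dotProduct
  } else {
    0.0  // outside the margin: zero gradient, zero loss
  }
}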
