Before reading on, please read leboop's post 《Spark MLlib协同过滤之交替最小二乘法ALS原理与实践》 (Spark MLlib collaborative filtering: the alternating least squares (ALS) algorithm, principles and practice).
The core code is as follows:
// Define the ALS estimator and initialize its parameters
val als = new ALS().setRank(50)
  .setMaxIter(10)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
// Train the model
val model = als.fit(training)
The rest of this post walks through this code in detail.
I. Source code of the fit function
The fit function is defined in the ALS class and overrides the fit function of the abstract class Estimator.
The ALS class is declared as follows:
@Since("1.3.0")
class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel] with ALSParams
  with DefaultParamsWritable
The Estimator class:
abstract class Estimator[M <: Model[M]] extends PipelineStage
The overridden fit function is shown below; the numbered comments (1)–(6) mark the steps explained afterwards:
@Since("2.0.0")
override def fit(dataset: Dataset[_]): ALSModel = {//11111111111111111111111111
transformSchema(dataset.schema)//222222222222222222222222222222222
import dataset.sparkSession.implicits._
//333333333333333333333333333333
val r = if ($(ratingCol) != "") col($(ratingCol)).cast(FloatType) else lit(1.0f)
val ratings = dataset
.select(checkedCast(col($(userCol))), checkedCast(col($(itemCol))), r)
.rdd
.map { row =>
Rating(row.getInt(0), row.getInt(1), row.getFloat(2))
}
//44444444444444444444444444
val instr = Instrumentation.create(this, ratings)
instr.logParams(rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha, userCol,
itemCol, ratingCol, predictionCol, maxIter, regParam, nonnegative, checkpointInterval,
seed, intermediateStorageLevel, finalStorageLevel)
//555555555555555555555555555555555
val (userFactors, itemFactors) = ALS.train(ratings, rank = $(rank),
numUserBlocks = $(numUserBlocks), numItemBlocks = $(numItemBlocks),
maxIter = $(maxIter), regParam = $(regParam), implicitPrefs = $(implicitPrefs),
alpha = $(alpha), nonnegative = $(nonnegative),
intermediateRDDStorageLevel = StorageLevel.fromString($(intermediateStorageLevel)),
finalRDDStorageLevel = StorageLevel.fromString($(finalStorageLevel)),
checkpointInterval = $(checkpointInterval), seed = $(seed))
//6666666666666666666666666666
val userDF = userFactors.toDF("id", "features")
val itemDF = itemFactors.toDF("id", "features")
val model = new ALSModel(uid, $(rank), userDF, itemDF).setParent(this)
instr.logSuccess(model)
copyValues(model)
}
1. The fit function's parameter
The parameter dataset is of type Dataset, Spark's distributed, table-like collection: you can think of it as a table in a relational database, with rows and columns. Here it has three columns: the user ID userId, the item ID itemId, and the user's rating of the item, rating. This column structure is commonly called the dataset's schema.
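As a concrete illustration (a hedged sketch, not from the original post; the SparkSession named spark and the toy ratings are assumptions), a training Dataset with this schema can be built like so:

import spark.implicits._  // `spark` is an existing SparkSession

// Toy data: (userId, itemId, rating) triples.
val training = Seq(
  (1, 10, 4.0f),
  (1, 11, 2.5f),
  (2, 10, 5.0f)).toDF("userId", "itemId", "rating")

training.printSchema()
// root
//  |-- userId: integer (nullable = false)
//  |-- itemId: integer (nullable = false)
//  |-- rating: float (nullable = false)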
2. Schema transformation and type checking
transformSchema(dataset.schema) validates and transforms the schema of the input data. It delegates to validateAndTransformSchema, which checks that the userCol, itemCol and ratingCol columns are numeric (users and items will later be cast to Int, ratings to Float) and appends a predictionCol column of type FloatType. In other words, the transformed schema now contains four columns.
The relevant code:
protected def validateAndTransformSchema(schema: StructType): StructType = {
  // user and item will be cast to Int
  SchemaUtils.checkNumericType(schema, $(userCol))
  SchemaUtils.checkNumericType(schema, $(itemCol))
  // rating will be cast to Float
  SchemaUtils.checkNumericType(schema, $(ratingCol))
  SchemaUtils.appendColumn(schema, $(predictionCol), FloatType)
}
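To make the four-column result concrete, here is a minimal sketch using only Spark's public StructType API (SchemaUtils is private to MLlib, so the numeric check and the appended column are reproduced by hand; the column names assume the defaults used in this post):

import org.apache.spark.sql.types._

val inputSchema = StructType(Seq(
  StructField("userId", IntegerType),
  StructField("itemId", IntegerType),
  StructField("rating", FloatType)))

// Mirrors SchemaUtils.checkNumericType for the three input columns.
require(inputSchema.forall(_.dataType.isInstanceOf[NumericType]),
  "userCol, itemCol and ratingCol must be numeric")

// Mirrors SchemaUtils.appendColumn: add the prediction column.
val outputSchema = StructType(inputSchema :+ StructField("prediction", FloatType))
println(outputSchema.simpleString)
// struct<userId:int,itemId:int,rating:float,prediction:float>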
3. Type conversion
Here the userCol and itemCol columns are cast to Int (via checkedCast, which rejects values that cannot be represented exactly as an Int) and the ratingCol column is cast to Float; if no rating column is configured (the name is the empty string), the literal 1.0f is used instead. Each row is then mapped to MLlib's built-in Rating case class, which holds these three fields.
val r = if ($(ratingCol) != "") col($(ratingCol)).cast(FloatType) else lit(1.0f)
val ratings = dataset
  .select(checkedCast(col($(userCol))), checkedCast(col($(itemCol))), r)
  .rdd
  .map { row =>
    Rating(row.getInt(0), row.getInt(1), row.getFloat(2))
  }
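The same conversion can be reproduced with public APIs only (a hedged sketch; checkedCast is private to MLlib, so a plain cast stands in for it, and training is the toy DataFrame from section I.1):

import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{FloatType, IntegerType}

val ratingsRDD = training
  .select(col("userId").cast(IntegerType),
    col("itemId").cast(IntegerType),
    col("rating").cast(FloatType))
  .rdd
  .map(row => Rating(row.getInt(0), row.getInt(1), row.getFloat(2)))

ratingsRDD.take(1).foreach(println)  // Rating(1,10,4.0)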
4. Instrumentation for the training session
This step creates the Instrumentation object that logs the training session's parameters (and, at the end, its success).
5. Training the latent factors
Calling ALS.train produces the userFactors matrix (user features) and the itemFactors matrix (item features). This is described in detail below.
6. Building the ALSModel
Finally, the two factor RDDs are converted into DataFrames with columns id and features, wrapped together with the rank into an ALSModel whose parent is set to this estimator; logSuccess records the outcome, and copyValues copies the estimator's parameter values onto the returned model.
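As a hedged usage sketch (not from the original post; test is a hypothetical DataFrame with userId and itemId columns), the returned model exposes the factor DataFrames and can score user/item pairs:

model.userFactors.show(3)  // columns: id, features (an array of `rank` floats)
model.itemFactors.show(3)
val predictions = model.transform(test)  // appends the "prediction" column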
II. Source code of the train function
Like fit, train also belongs to ALS, although it is defined in the ALS companion object rather than in the class itself. Its source:
@DeveloperApi
def train[ID: ClassTag]( // scalastyle:ignore
    ratings: RDD[Rating[ID]],
    rank: Int = 10,
    numUserBlocks: Int = 10,
    numItemBlocks: Int = 10,
    maxIter: Int = 10,
    regParam: Double = 0.1,
    implicitPrefs: Boolean = false,
    alpha: Double = 1.0,
    nonnegative: Boolean = false,
    intermediateRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
    finalRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
    checkpointInterval: Int = 10,
    seed: Long = 0L)(
    implicit ord: Ordering[ID]): (RDD[(ID, Array[Float])], RDD[(ID, Array[Float])]) = {
  require(!ratings.isEmpty(), s"No ratings available from $ratings")
  require(intermediateRDDStorageLevel != StorageLevel.NONE,
    "ALS is not designed to run without persisting intermediate RDDs.")
  val sc = ratings.sparkContext
  // Precompute the rating dependencies of each partition
  val userPart = new ALSPartitioner(numUserBlocks)
  val itemPart = new ALSPartitioner(numItemBlocks)
  val blockRatings = partitionRatings(ratings, userPart, itemPart)
    .persist(intermediateRDDStorageLevel)
  val (userInBlocks, userOutBlocks) =
    makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)
  userOutBlocks.count() // materialize blockRatings and user blocks
  val swappedBlockRatings = blockRatings.map {
    case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, localRatings)) =>
      ((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, localRatings))
  }
  val (itemInBlocks, itemOutBlocks) =
    makeBlocks("item", swappedBlockRatings, itemPart, userPart, intermediateRDDStorageLevel)
  itemOutBlocks.count() // materialize item blocks
  // Encoders for storing each user/item's partition ID and index within its partition using a
  // single integer; used as an optimization
  val userLocalIndexEncoder = new LocalIndexEncoder(userPart.numPartitions)
  val itemLocalIndexEncoder = new LocalIndexEncoder(itemPart.numPartitions)
  // These are the user and item factor matrices that, once trained, are multiplied together to
  // estimate the rating matrix. The two matrices are stored in RDDs, partitioned by column such
  // that each factor column resides on the same Spark worker as its corresponding user or item.
  val seedGen = new XORShiftRandom(seed)
  var userFactors = initialize(userInBlocks, rank, seedGen.nextLong())
  var itemFactors = initialize(itemInBlocks, rank, seedGen.nextLong())
  val solver = if (nonnegative) new NNLSSolver else new CholeskySolver
  var previousCheckpointFile: Option[String] = None
  val shouldCheckpoint: Int => Boolean = (iter) =>
    sc.checkpointDir.isDefined && checkpointInterval != -1 && (iter % checkpointInterval == 0)
  val deletePreviousCheckpointFile: () => Unit = () =>
    previousCheckpointFile.foreach { file =>
      try {
        val checkpointFile = new Path(file)
        checkpointFile.getFileSystem(sc.hadoopConfiguration).delete(checkpointFile, true)
      } catch {
        case e: IOException =>
          logWarning(s"Cannot delete checkpoint file $file:", e)
      }
    }
  if (implicitPrefs) {
    for (iter <- 1 to maxIter) {
      userFactors.setName(s"userFactors-$iter").persist(intermediateRDDStorageLevel)
      val previousItemFactors = itemFactors
      itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
        userLocalIndexEncoder, implicitPrefs, alpha, solver)
      previousItemFactors.unpersist()
      itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
      // TODO: Generalize PeriodicGraphCheckpointer and use it here.
      val deps = itemFactors.dependencies
      if (shouldCheckpoint(iter)) {
        itemFactors.checkpoint() // itemFactors gets materialized in computeFactors
      }
      val previousUserFactors = userFactors
      userFactors = computeFactors(itemFactors, itemOutBlocks, userInBlocks, rank, regParam,
        itemLocalIndexEncoder, implicitPrefs, alpha, solver)
      if (shouldCheckpoint(iter)) {
        ALS.cleanShuffleDependencies(sc, deps)
        deletePreviousCheckpointFile()
        previousCheckpointFile = itemFactors.getCheckpointFile
      }
      previousUserFactors.unpersist()
    }
  } else {
    for (iter <- 0 until maxIter) {
      itemFactors = computeFactors(userFactors, userOutBlocks, itemInBlocks, rank, regParam,
        userLocalIndexEncoder, solver = solver)
      if (shouldCheckpoint(iter)) {
        val deps = itemFactors.dependencies
        itemFactors.checkpoint()
        itemFactors.count() // checkpoint item factors and cut lineage
        ALS.cleanShuffleDependencies(sc, deps)
        deletePreviousCheckpointFile()
        previousCheckpointFile = itemFactors.getCheckpointFile
      }
      userFactors = computeFactors(itemFactors, itemOutBlocks, userInBlocks, rank, regParam,
        itemLocalIndexEncoder, solver = solver)
    }
  }
  val userIdAndFactors = userInBlocks
    .mapValues(_.srcIds)
    .join(userFactors)
    .mapPartitions({ items =>
      items.flatMap { case (_, (ids, factors)) =>
        ids.view.zip(factors)
      }
    // Preserve the partitioning because IDs are consistent with the partitioners in userInBlocks
    // and userFactors.
    }, preservesPartitioning = true)
    .setName("userFactors")
    .persist(finalRDDStorageLevel)
  val itemIdAndFactors = itemInBlocks
    .mapValues(_.srcIds)
    .join(itemFactors)
    .mapPartitions({ items =>
      items.flatMap { case (_, (ids, factors)) =>
        ids.view.zip(factors)
      }
    }, preservesPartitioning = true)
    .setName("itemFactors")
    .persist(finalRDDStorageLevel)
  if (finalRDDStorageLevel != StorageLevel.NONE) {
    userIdAndFactors.count()
    itemFactors.unpersist()
    itemIdAndFactors.count()
    userInBlocks.unpersist()
    userOutBlocks.unpersist()
    itemInBlocks.unpersist()
    itemOutBlocks.unpersist()
    blockRatings.unpersist()
  }
  (userIdAndFactors, itemIdAndFactors)
}
1. The train function's parameters
These correspond one-to-one to the ALS parameters described in 《Spark MLlib协同过滤之交替最小二乘法ALS原理与实践》, so they are not explained again here; a direct-call sketch follows below.
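As a hedged sketch, the call below invokes the developer-API ALS.train directly with the same settings used at the top of this post; ratingsRDD refers to the RDD[Rating[Int]] built in the sketch in section I.3 and is an assumption, not code from the post:

import org.apache.spark.ml.recommendation.ALS

val (userFactors, itemFactors) = ALS.train(
  ratings = ratingsRDD,
  rank = 50,
  maxIter = 10,
  regParam = 0.01)

// Each element pairs an ID with its length-`rank` factor vector.
userFactors.take(1).foreach { case (id, f) => println(s"user $id -> ${f.length} factors") }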
2. Partitioning users and items
// Precompute the rating dependencies of each partition
val userPart = new ALSPartitioner(numUserBlocks)
val itemPart = new ALSPartitioner(numItemBlocks)
val blockRatings = partitionRatings(ratings, userPart, itemPart)
  .persist(intermediateRDDStorageLevel)
First, a userPart partitioner and an itemPart partitioner are created. partitionRatings then uses these two partitioners to regroup the ratings data into blocks of the form ((user block id, item block id), block of ratings), and the result is persisted, by default to memory and disk. The partitioning used here is hash partitioning; the source:
private[recommendation] type ALSPartitioner = org.apache.spark.HashPartitioner

class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
For example: given the nine numbers {1,2,3,4,5,6,7,8,9} and 4 partitions, each number is assigned by its remainder mod 4, so numbers with the same remainder land in the same partition: {1,5,9} (remainder 1), {2,6} (remainder 2), {3,7} (remainder 3), and {4,8} (remainder 0). The sketch below verifies this with Spark's HashPartitioner.
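A quick check (a hedged sketch; the SparkContext sc and the toy numbers are assumptions, not from the original post). Partitioning these keys with HashPartitioner reproduces the grouping above, because an Int's hashCode is the value itself, so getPartition(k) == k % 4 here:

import org.apache.spark.HashPartitioner

// Pair each number with a dummy value so it can be partitioned by key.
val numbers = sc.parallelize((1 to 9).map(n => (n, ())))
val partitioned = numbers.partitionBy(new HashPartitioner(4))

// Print the keys that landed in each partition.
partitioned.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.map(_._1).toList))
}.collect().foreach { case (idx, ks) => println(s"partition $idx -> $ks") }
// partition 0 -> List(4, 8)
// partition 1 -> List(1, 5, 9)
// partition 2 -> List(2, 6)
// partition 3 -> List(3, 7)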