Implementing Association Analysis with Spark

1. Understanding Association Rules

     The result of a market basket analysis is a set of association rules describing relationship patterns among specified products. A typical rule can be written as: {peanut butter, jam} -> {bread}
In plain language, this rule says: a customer who buys peanut butter and jam is also likely to buy bread. What we are analyzing is the relationship between items, i.e. whether certain items tend to occur together.

2. Test Data

a,b,c
a,b,d
b,a,d
b,c,e
b,d,e
a,b,c
a,b,e
a,b,e
a,b,c
a,b,c
a,b
a,d
b,d
b,e
c,d,e
a,e
b,d

You can think of a, b, c, d, e as products: product a, product b, and so on.

The lines above are purchase transactions, and from them we want to find relationships among the products. This requires two concepts, support and confidence; if they are unfamiliar, look them up first.

Support: the support of an itemset or rule is the frequency with which it occurs in the data.
Confidence: a measure of a rule's predictive power or accuracy.

For example, there are 17 transactions here. To compute the support of a, we find that a appears in 11 of them, so its support is 11/17. The confidence formula is confidence(x -> y) = support(x ∪ y) / support(x), i.e. the probability that y occurs given that x has occurred.
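These numbers can be cross-checked without Spark by scanning the transaction list above. The following plain-Scala sketch (object and method names are illustrative, not part of any library) computes the support of {a}, the support of {a, b}, and the confidence of the rule a -> b:

```scala
// Plain-Scala sketch: recompute support and confidence over the 17
// transactions listed in section 2. Names here are illustrative only.
object SupportConfidence {
  val transactions: Seq[Set[String]] = Seq(
    "a,b,c", "a,b,d", "b,a,d", "b,c,e", "b,d,e", "a,b,c", "a,b,e", "a,b,e",
    "a,b,c", "a,b,c", "a,b", "a,d", "b,d", "b,e", "c,d,e", "a,e", "b,d"
  ).map(_.split(",").toSet)

  val n: Double = transactions.size // 17

  // support(X): fraction of transactions that contain every item of X
  def support(items: Set[String]): Double =
    transactions.count(t => items.subsetOf(t)) / n

  // confidence(X -> Y) = support(X union Y) / support(X)
  def confidence(x: Set[String], y: Set[String]): Double =
    support(x ++ y) / support(x)

  def main(args: Array[String]): Unit = {
    println(f"support(a)       = ${support(Set("a"))}%.3f")              // 11/17
    println(f"support(a,b)     = ${support(Set("a", "b"))}%.3f")         // 9/17
    println(f"confidence(a->b) = ${confidence(Set("a"), Set("b"))}%.3f") // 9/11
  }
}
```

Since a occurs in 11 transactions and {a, b} in 9, confidence(a -> b) = (9/17) / (11/17) = 9/11 ≈ 0.818.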

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.mage.ml.association_rules

import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.SparkSession

/**
  * Association rule mining with FP-Growth.
  */
object FPGrowthExample {

    def main(args: Array[String]): Unit = {
        val spark: SparkSession = SparkSession
            .builder
            .master("local")
            .appName("FPGrowth")
            .getOrCreate()
        import spark.implicits._

        // Load the data
        val shoppings: Dataset[String] = spark.read.textFile("shopping_cart")

        // Split each line on commas and convert it to a DataFrame
        val df: DataFrame = shoppings.map(_.split(",")).toDF("items")

        val growth = new FPGrowth().setItemsCol("items")
        // Set the minimum support and confidence
        growth.setMinConfidence(0.8)
        growth.setMinSupport(0.3)
        // Set the number of partitions
        growth.setNumPartitions(2)

        val model: FPGrowthModel = growth.fit(df)
        // Print the frequent itemsets
        model.freqItemsets.show()

        // Print the association rules that satisfy the support and confidence thresholds
        model.associationRules.show()

        spark.stop()
    }
}
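Because the dataset is tiny, the thresholds used above can be cross-checked by brute force. The sketch below is plain Scala, not Spark's FP-Growth implementation, and its names are illustrative: it enumerates every itemset over the 17 transactions, keeps those with support >= 0.3, and then derives rules with confidence >= 0.8 from the frequent itemsets, mirroring what freqItemsets and associationRules should report:

```scala
// Brute-force cross-check (illustrative, not Spark's algorithm): enumerate
// all itemsets, keep those with support >= 0.3, then build rules X -> y
// from frequent itemsets of size >= 2 and keep those with confidence >= 0.8.
object BruteForceCheck {
  val transactions: Seq[Set[String]] = Seq(
    "a,b,c", "a,b,d", "b,a,d", "b,c,e", "b,d,e", "a,b,c", "a,b,e", "a,b,e",
    "a,b,c", "a,b,c", "a,b", "a,d", "b,d", "b,e", "c,d,e", "a,e", "b,d"
  ).map(_.split(",").toSet)

  val n: Double = transactions.size

  def support(s: Set[String]): Double =
    transactions.count(t => s.subsetOf(t)) / n

  // every non-empty itemset whose support clears the 0.3 threshold
  val frequent: Seq[Set[String]] =
    transactions.flatten.toSet.subsets()
      .filter(_.nonEmpty)
      .filter(s => support(s) >= 0.3)
      .toSeq

  // rules X -> y from frequent itemsets of size >= 2, confidence >= 0.8
  val rules: Seq[(Set[String], String, Double)] = for {
    s    <- frequent if s.size >= 2
    y    <- s.toSeq
    x     = s - y
    conf  = support(s) / support(x)
    if conf >= 0.8
  } yield (x, y, conf)

  def main(args: Array[String]): Unit = {
    frequent.foreach(s => println(s"frequent: ${s.mkString(",")}"))
    rules.foreach { case (x, y, c) =>
      println(f"rule: ${x.mkString(",")} => $y  confidence = $c%.3f")
    }
  }
}
```

On this data, only the pair {a, b} is frequent besides the five single items (it occurs in 9 of 17 transactions, support ≈ 0.529), so the only rule clearing both thresholds is {a} => {b} with confidence 9/11 ≈ 0.818; the reverse rule {b} => {a} has confidence 9/14 ≈ 0.643 and is dropped.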


Reference: https://blog.csdn.net/qq_41455420/article/details/89532574



Reposted from blog.csdn.net/qq_40511966/article/details/103494862