Spark Core practical exercises

1. Data preparation

        The data for this project is a collection of user behavior data from an e-commerce website, covering four types of user behavior: search, click, place order, and pay.

Data Format


  • Fields are separated by "_".
  • Each row represents one user behavior, so each row is exactly one of the four behavior types.
  • If the search keyword is null, the behavior is not a search.
  • If the clicked category id and product id are both -1, the behavior is not a click.
  • An order can contain multiple products, so the order's category ids and product ids are both multi-valued, separated by commas. If the behavior is not an order, these fields are null.
  • Payment behavior follows the same convention as ordering.
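As a sketch of the format, a single record (with hypothetical values) can be split on "_" like this; the date and time fields contain no underscores, so the split is safe:

```scala
// A hypothetical record following the format above: 13 fields joined by "_".
// search_keyword is "null" (not a search), click ids are 6/16 (a click),
// and the order/pay fields are "null" (not an order or payment).
val sampleLine =
  "2019-07-17_95_sessionA_37_2019-07-17 00:00:02_null_6_16_null_null_null_null_1"

val fields = sampleLine.split("_")

val searchKeyword   = fields(5)          // "null" => this behavior is not a search
val clickCategoryId = fields(6).toLong   // -1 would mean this behavior is not a click
val orderCategoryIds = fields(8)         // comma-separated ids, or "null"
```

The field order matches the `UserVisitAction` case class defined below.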

Data set download

Link: https://pan.baidu.com/s/1ZLhYdXz1Foi6MpeUBFGCnQ
extraction code: 12lt

Requirement 1: Top 10 popular categories

Statement of needs

        A category is a classification of products. Large e-commerce sites have multiple category levels; in this project there is only one level. Different companies define "popular" differently; here we rank categories by their click, order, and payment counts.
        The ranking rule for this project is: rank by the number of clicks first (more clicks ranks higher); if the click counts are equal, compare the number of orders; if the order counts are also equal, compare the number of payments.

Requirement analysis

  • Determine which of the three operation types ("click", "order", or "payment") each record is.
  • Combine the category and the operation counts into (category, (clickCount, orderCount, payCount)); the tuple can be encapsulated as an object.
  • Aggregate the counts of the three operations by the category key.
  • Sort and take the top 10.
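The ranking rule happens to match Scala's default tuple ordering, which compares element by element, so sorting by a (clicks, orders, pays) tuple implements it directly. A plain-collections sketch with made-up counts:

```scala
// (categoryId, (clickCount, orderCount, payCount)) with made-up numbers
val counts = List(
  ("cat1", (10L, 5L, 2L)),
  ("cat2", (10L, 7L, 1L)), // same clicks as cat1, more orders => ranks higher
  ("cat3", (12L, 1L, 0L))  // most clicks => ranks first
)

// Tuples compare element by element: clicks first, then orders, then pays.
// Reverse the default ordering to sort in descending order.
val ranked = counts.sortBy(_._2)(Ordering[(Long, Long, Long)].reverse).map(_._1)
// ranked == List("cat3", "cat2", "cat1")
```

The Spark code below uses the same trick in `sortBy` with `ascending = false`.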

Code:

Create case classes to encapsulate the data

// A case class automatically generates the apply and unapply methods
case class UserVisitAction(date: String, //date of the user's click action
                           user_id: Long, //user ID
                           session_id: String, //session ID
                           page_id: Long, //page ID
                           action_time: String, //time of the action
                           search_keyword: String, //keyword the user searched for
                           click_category_id: Long, //ID of the clicked product category
                           click_product_id: Long, //ID of the clicked product
                           order_category_ids: String, //all category IDs in one order
                           order_product_ids: String, //all product IDs in one order
                           pay_category_ids: String, //all category IDs in one payment
                           pay_product_ids: String, //all product IDs in one payment
                           city_id: Long) //city ID
// Case class to store a product category and its operation counts
// Case class fields are val by default; var makes them reassignable
case class CategoryCountInfo(categoryId: String, //category id
                             var clickCount: Long, //number of clicks
                             var orderCount: Long, //number of orders
                             var payCount: Long) //number of payments
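A quick illustration of what the case class gives us (with made-up values): the generated `apply` means no `new` is needed, and the `var` fields can be mutated in place, which the `reduceByKey` step below relies on.

```scala
// Same definition as above, repeated so this sketch is self-contained
case class CategoryCountInfo(categoryId: String,
                             var clickCount: Long,
                             var orderCount: Long,
                             var payCount: Long)

// apply is generated automatically, so no `new` is needed
val info = CategoryCountInfo("1", 1, 0, 0)

// var fields can be reassigned in place (val fields could not)
info.clickCount += 5
```

Mutating one side in place inside `reduceByKey` avoids allocating a new object per merge; an immutable `copy`-based merge would also work.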

Requirement 1 code

    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
    val sc = new SparkContext(conf)
    val dataRDD: RDD[String] = sc.textFile("D:\\学习资料\\spark\\spark\\2.资料\\spark-core数据\\user_visit_action.txt")
    // Parse each line into a UserVisitAction object
    val userActionRDD: RDD[UserVisitAction] = dataRDD.map { line =>
      val lineSplit: Array[String] = line.split("_")
      UserVisitAction(
        lineSplit(0),
        lineSplit(1).toLong,
        lineSplit(2),
        lineSplit(3).toLong,
        lineSplit(4),
        lineSplit(5),
        lineSplit(6).toLong,
        lineSplit(7).toLong,
        lineSplit(8),
        lineSplit(9),
        lineSplit(10),
        lineSplit(11),
        lineSplit(12).toLong
      )
    }
    // Convert the data format
    val cateInfoRDD: RDD[(String, CategoryCountInfo)] = userActionRDD.flatMap { userAction =>
      // Determine the operation type
      if (userAction.click_category_id != -1) {
        // Convert to (categoryId, CategoryCountInfo)
        List((userAction.click_category_id.toString,
          CategoryCountInfo(userAction.click_category_id.toString, 1, 0, 0)))
      } else if (userAction.order_category_ids != "null") {
        // An order can contain several category ids, separated by commas
        val cateIds: Array[String] = userAction.order_category_ids.split(",")
        // ListBuffer collects one entry per category id
        val list = new mutable.ListBuffer[(String, CategoryCountInfo)]()
        for (cateId <- cateIds) {
          list.append((cateId, CategoryCountInfo(cateId, 0, 1, 0)))
        }
        list
      } else if (userAction.pay_category_ids != "null") {
        val cateIds: Array[String] = userAction.pay_category_ids.split(",")
        // ListBuffer collects one entry per category id
        val list = new mutable.ListBuffer[(String, CategoryCountInfo)]()
        for (cateId <- cateIds) {
          list.append((cateId, CategoryCountInfo(cateId, 0, 0, 1)))
        }
        list
      } else {
        // Return an empty collection
        Nil
      }
    }
    // Aggregate by key (category id)
    val reduceRDD: RDD[(String, CategoryCountInfo)] = cateInfoRDD.reduceByKey(
      (cate1, cate2) => {
        cate1.clickCount = cate1.clickCount + cate2.clickCount
        cate1.orderCount = cate1.orderCount + cate2.orderCount
        cate1.payCount = cate1.payCount + cate2.payCount
        cate1
      }
    )
    // Drop the now-redundant key
    val cateCountRDD: RDD[CategoryCountInfo] = reduceRDD.map(_._2)
    // Sort and take the top 10
    val resArr: Array[CategoryCountInfo] = cateCountRDD.sortBy(
      // Tuples are compared element by element, which implements the ranking rule
      cate => (cate.clickCount, cate.orderCount, cate.payCount),
      // descending
      false
    ).take(10)
    sc.stop()

Requirement 2: Top 10 active sessions for each of the Top 10 popular categories

Statement of needs

        For each of the top 10 popular categories, get the 10 sessionIds with the most clicks on that category. (Note: here we only look at the number of clicks, not orders or payments.)
        This lets us see which user group is most interested in each category, and the most typical session behavior within each category.

Requirement analysis

  • Get the ids of the top 10 popular categories from Requirement 1.
  • Filter the data, keeping only the click logs whose category id is in the top 10.
  • Convert the data format to (categoryId_sessionId, 1).
  • Aggregate to count the clicks per category-session pair.
  • Convert the data format to (categoryId, (sessionId, count)).
  • Group by category id.
  • Sort each group in descending order and take the top 10.
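The group-then-sort steps can be sketched with plain collections (made-up session counts, top 2 instead of top 10 to keep it small):

```scala
// (categoryId, (sessionId, clickCount)) pairs with made-up values
val pairs = List(
  ("1", ("s1", 3)), ("1", ("s2", 7)), ("1", ("s3", 5)),
  ("2", ("s1", 4)), ("2", ("s4", 2))
)

// Group by category, then sort each group's sessions by count descending
// and keep the top 2 (the real job keeps the top 10 via mapValues)
val topPerCategory: Map[String, List[(String, Int)]] =
  pairs.groupBy(_._1).map { case (cate, vs) =>
    cate -> vs.map(_._2).sortWith(_._2 > _._2).take(2)
  }
```

In Spark, `groupByKey` plus `mapValues` plays the role of `groupBy` plus the per-group `map` here; note that sorting inside `mapValues` happens in memory per category, which is fine only because each category has a bounded number of sessions.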

Code

  • Requirement 2 relies on the result of Requirement 1.
	// ======================= Requirement 1 code above =======================
    // Extract the top 10 category ids
    val top10Ids: Array[String] = resArr.map(_.categoryId)
    // top10Ids is sent to every task, so broadcast it as an optimization
    val broadcast: Broadcast[Array[String]] = sc.broadcast(top10Ids)
    // Keep only the click data whose category id is in the top 10
    val filterRDD: RDD[UserVisitAction] = userActionRDD.filter(
      // Note: click_category_id is a Long and must be converted to String
      datas => {
        if (datas.click_category_id != -1) {
          broadcast.value.contains(datas.click_category_id.toString)
        } else {
          false
        }
      }
    )
    // Convert to (categoryId_sessionId, 1)
    val cateIdAndSession1: RDD[(String, Int)] = filterRDD.map(
      datas => (datas.click_category_id + "_" + datas.session_id, 1)
    )
    // Aggregate by key: (categoryId_sessionId, count)
    val cateIdAndSessionCount: RDD[(String, Int)] = cateIdAndSession1.reduceByKey(_ + _)
    // Convert to (categoryId, (sessionId, count))
    val cateIdAndSessionCount2: RDD[(String, (String, Int))] = cateIdAndSessionCount.map {
      case (idAndSession, count) =>
        val split: Array[String] = idAndSession.split("_")
        (split(0), (split(1), count))
    }
    // Group by category id
    val cateGroupRDD: RDD[(String, Iterable[(String, Int)])] = cateIdAndSessionCount2.groupByKey()
    // Sort each group in descending order and take the top 10
    val res2RDD: RDD[(String, List[(String, Int)])] = cateGroupRDD.mapValues(
      datas => {
        val list: List[(String, Int)] = datas.toList
        list.sortWith(_._2 > _._2).take(10)
      }
    )
    res2RDD.foreach(println)

    sc.stop()

Requirement 3: Page single-jump conversion rate

Statement of needs

        Calculate the page single-jump conversion rate. For example, if a user visits the page path 3,5,7,9,10,21 within one session, then the jump from page 3 to page 5 is called a single jump, and 7-9 is likewise a single jump; the single-jump rate measures the probability of each such jump.
        For example, to compute the 3-5 single-jump rate: let A be the number of visits (PV) to page 3 across the eligible sessions, and let B be the number of times page 3 is immediately followed by page 5 in those sessions. Then B/A is the 3-5 single-jump rate.
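For the example path above, the jump pairs and one rate can be computed directly with plain collections (a single-session sketch; the real job first aggregates both counts over all sessions):

```scala
// The example page path from a single session
val path = List(3, 5, 7, 9, 10, 21)

// Zip the path with its own tail to get the single jumps:
// (3,5), (5,7), (7,9), (9,10), (10,21)
val jumps = path.zip(path.tail)

// Visit count per page (A) within this session
val pageViews: Map[Int, Int] = path.groupBy(identity).map { case (p, vs) => p -> vs.size }

// 3-5 single-jump rate = jump count B / visit count A of page 3
val rate35 = jumps.count(_ == (3, 5)).toDouble / pageViews(3)
```

The zip-with-tail trick is the core of the Spark solution below; it turns an ordered page sequence into its adjacent pairs in one pass.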

Requirement analysis

  • First count the number of visits to each page
        --> convert the data format to (pageId, 1)
        --> aggregate by page id to get the total visits per page: (pageId, count)
        --> then count the single-jump occurrences page A -> page B, page B -> page C, ...
  • Pair each visit with its session and page: (sessionId_pageId, access time)
  • Convert the data format and group by sessionId
        --> session1: page A -> page B -> page C
        --> session2: page A -> page B -> page C
  • Within each group, sort by access time and return each session's page visit sequence as a List
        --> page visit sequence: A -> B -> C -> D -> E
        --> zip it with its own tail: B -> C -> D -> E
        --> giving (A,B) (B,C) (C,D) (D,E)
  • Count how often each pair occurs, i.e. the single-jump count per page pair: ((A,B), count)
  • Finally, single-jump count of a pair / total visits of its source page = single-jump rate
Code

    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
    val sc = new SparkContext(conf)
    val dataRDD: RDD[String] = sc.textFile("D:\\学习资料\\spark\\spark\\2.资料\\spark-core数据\\user_visit_action.txt")
    // Parse each line into a UserVisitAction object
    val userVisitActionRDD: RDD[UserVisitAction] = dataRDD.map { line =>
      val lineSplit: Array[String] = line.split("_")
      UserVisitAction(
        lineSplit(0),
        lineSplit(1).toLong,
        lineSplit(2),
        lineSplit(3).toLong,
        lineSplit(4),
        lineSplit(5),
        lineSplit(6).toLong,
        lineSplit(7).toLong,
        lineSplit(8),
        lineSplit(9),
        lineSplit(10),
        lineSplit(11),
        lineSplit(12).toLong
      )
    }

    // First count the number of visits to each page
    // (pageId, 1) -> (pageId, count)
    val pageAnd1: RDD[(Long, Long)] = userVisitActionRDD.map(
      action => (action.page_id, 1L)
    )
    // Aggregate by pageId and collect to a Map for easy lookup
    val pageAndCount: Map[Long, Long] = pageAnd1.reduceByKey(_ + _).collect().toMap

    // Convert to (sessionId_pageId, time)
    val sessionAndPage: RDD[(String, String)] = userVisitActionRDD.map(
      action => (action.session_id + "_" + action.page_id, action.action_time)
    )
    // Convert to (sessionId, (pageId, time)) so we can group by session
    val sessionPageTime: RDD[(String, (String, String))] = sessionAndPage.map {
      case (session_page, time) =>
        val split: Array[String] = session_page.split("_")
        (split(0), (split(1), time))
    }
    // Group by sessionId: each user's visited pages and times
    val sessionGroup: RDD[(String, Iterable[(String, String)])] = sessionPageTime.groupByKey()
    // Drop the sessionId, keeping Iterable[(pageId, time)]
    val pageTime: RDD[Iterable[(String, String)]] = sessionGroup.map(_._2)
    // Sort each session's visits by time in ascending order
    val pageSortRDD: RDD[List[(String, String)]] = pageTime.map(
      data => data.toList.sortWith(_._2 < _._2)
    )
    // Drop the time, keeping only the page visit sequence
    val pageRDD: RDD[List[String]] = pageSortRDD.map(
      datas => datas.map(_._1)
    )

    // Zip each sequence with its own tail to build ((A, B), 1) pairs,
    // which makes counting A -> B jumps easy
    val zipRDD: RDD[((String, String), Long)] = pageRDD.flatMap(
      pageList => pageList.zip(pageList.tail).map((_, 1L))
    )
    val breakCount: RDD[((String, String), Long)] = zipRDD.reduceByKey(_ + _)

    // Single-jump rate = jump count / visit count of the source page
    val resRDD: RDD[String] = breakCount.map {
      case (breakPage, count) =>
        val sum: Long = pageAndCount.getOrElse(breakPage._1.toLong, 1L)
        val res: Double = count * 1.0 / sum
        breakPage + ":" + res
    }
    resRDD.foreach(println)

    sc.stop()

Origin blog.csdn.net/FlatTiger/article/details/115205775