SparkSQL (Part 2): Window Functions

SparkSQL window functions

A so-called window function processes multiple rows of data and returns both ordinary columns and aggregated columns in the same result set.

Syntax: window_function() over([partition by <partition columns> order by <sort rules> ...])

Window functions fall into three categories:

  • Aggregate window functions
  • Ranking window functions
  • Analytic window functions

count(...) over(partition by ... order by ...) - total row count within the group
sum(...) over(partition by ... order by ...) - sum within the group
max(...) over(partition by ... order by ...) - maximum value within the group
min(...) over(partition by ... order by ...) - minimum value within the group
avg(...) over(partition by ... order by ...) - average value within the group
rank() over(partition by ... order by ...) - rank value, may be non-consecutive
dense_rank() over(partition by ... order by ...) - rank value, always consecutive
first_value(...) over(partition by ... order by ...) - first value within the group
last_value(...) over(partition by ... order by ...) - last value within the group
lag(...) over(partition by ... order by ...) - value from the row n rows before the current row
lead(...) over(partition by ... order by ...) - value from the row n rows after the current row
ratio_to_report(...) over(partition by ...) - the expression inside ratio_to_report() is the numerator, the window defined by over() is the denominator
percent_rank() over(partition by ... order by ...) - relative rank of the current row within the group
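As a quick illustration of this syntax, here is a minimal, self-contained sketch (my own example data and view name t_salary, not from the original post) that runs several of the functions above in a single Spark SQL query:

import org.apache.spark.sql.SparkSession

object WindowFunctionSyntaxDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("window-syntax").master("local[*]").getOrCreate()
    import spark.implicits._

    // hypothetical test data: (department, salary)
    Seq(("d1", 100), ("d1", 200), ("d1", 200), ("d2", 300))
      .toDF("dept", "salary")
      .createOrReplaceTempView("t_salary")

    spark.sql(
      """
        |select
        |  dept,
        |  salary,
        |  sum(salary)     over(partition by dept)                      as dept_total,
        |  rank()          over(partition by dept order by salary desc) as rnk,
        |  dense_rank()    over(partition by dept order by salary desc) as dense_rnk,
        |  lag(salary, 1)  over(partition by dept order by salary)      as prev_salary,
        |  lead(salary, 1) over(partition by dept order by salary)      as next_salary
        |from t_salary
        |""".stripMargin).show()

    spark.stop()
  }
}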

Example applications of the window functions

  • rank(): find, for each user, the top 10 pages by number of visits in a single day
// test data
 val rdd = spark.sparkContext.makeRDD(
  List(
      ("2018-01-01", 1, "www.baidu.com", "10:01"),
      ("2018-01-01", 2, "www.baidu.com", "10:01"),
      ("2018-01-01", 1, "www.sina.com", "10:01"),
      ("2018-01-01", 3, "www.baidu.com", "10:01"),
      ("2018-01-01", 3, "www.baidu.com", "10:01"),
      ("2018-01-01", 1, "www.sina.com", "10:01")
    ))

Outline of the solution
1. Count how many times each user visits each page
2. Sort each user's page counts in descending order and rank them with a ranking window function
3. Keep the rows where rank <= 10 to get each user's top ten pages

Method 1: DataFrame API (method calls)

 import sp.implicits._
    // import window function support
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    rdd
      .toDF("time","uid","path","ztime")
      .groupBy("uid","path")  // group by user and visited URL
      .count()   // aggregation after grouping: counts how many times each user visited each URL
      // add a column named rank; rank() may produce non-consecutive values; over() defines the window
      .withColumn("rank",rank() over(Window.partitionBy("uid").orderBy($"count" desc)))
      .where("rank<=10")  // top ten pages by visit count; the argument is a condition expression
      .show()
      
	+---+--------------+-----+----+
	|uid|          path|count|rank|
	+---+--------------+-----+----+
	|  1|www.hao123.com|    2|   1|
	|  1| www.baidu.com|    1|   2|  rank() (values may be non-consecutive): tied for second; if user 1 visited one more URL its rank would be 4, giving 1 2 2 4
	|  1|  www.sina.com|    1|   2|  dense_rank() (values are always consecutive): that extra URL would get rank 3 instead, giving 1 2 2 3
	|  3| www.baidu.com|    1|   1|
	|  3|  www.sina.com|    1|   1|
	|  2| www.baidu.com|    1|   1|
	+---+--------------+-----+----+
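To see the dense_rank() behavior described in the comments above, the same pipeline can be run with the ranking function swapped (a sketch that assumes the same rdd and imports as Method 1):

    rdd
      .toDF("time","uid","path","ztime")
      .groupBy("uid","path")
      .count()
      // dense_rank(): rank values are always consecutive (1 2 2 3 instead of 1 2 2 4)
      .withColumn("rank",dense_rank() over(Window.partitionBy("uid").orderBy($"count" desc)))
      .where("rank<=10")
      .show()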

Method 2: pure SQL

 // import implicit conversions: convert an RDD into a DF or DS
    import sp.implicits._
    rdd
      .toDF("time","uid","path","ztime")
      // create a temporary view (table): usable only in the SparkSession that created it
      .createOrReplaceTempView("t_path")
     
spark
    .sql(
       """
         |select
         |   *
         |from
         |   (
         |     select
         |       uid,
         |       path,
         |       path_count,
         |       rank() over(partition by uid order by path_count desc) as rank
         |     from
         |       (
         |         select
         |           uid,
         |           path,
         |           count(path) as path_count
         |         from
         |           t_path
         |         group by
         |           uid, path
         |       )
         |   )
         |where
         |   rank <= 10
         |""".stripMargin)
     .show()
  • Aggregate over a window: get each user's basic information together with the average salary of the user's department
// test data
val rdd = sp.sparkContext.makeRDD(
      List(
        (1,"zs",true,1,15000),
        (2,"ls",false,2,18000),
        (3,"ww",false,2,14000),
        (4,"zl",false,1,18000),
        (5,"win7",false,1,16000)
      ))

Method 1: DataFrame API (method calls)

  // import implicit conversions: convert an RDD into a DF
    import sp.implicits._
    // import window function support
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    rdd
      .toDF("id","name","sex","dept","salary")
      .withColumn("avg_Salary",avg("salary") over(Window.partitionBy("dept")
      .orderBy($"salary" desc)
      // visible range of rows within the window: from Long.MinValue to Long.MaxValue, i.e. the whole partition
      .rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)))
      .show()
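Note that in Spark, when an orderBy is specified without an explicit frame, the default frame runs from the start of the partition to the current row, so avg() would become a running average; the explicit rowsBetween above makes the whole partition visible. A sketch of the running-average variant (same rdd and imports as above, the column name running_avg is my own):

    rdd
      .toDF("id","name","sex","dept","salary")
      // default frame with orderBy: unbounded preceding .. current row, i.e. a running average
      .withColumn("running_avg",avg("salary") over(Window.partitionBy("dept").orderBy($"salary" desc)))
      .show()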

Method 2: pure SQL

rdd
  .toDF("id","name","sex","dept","salary")
  // register a temporary view so the data can be queried with SQL
  .createOrReplaceTempView("t_user")

spark
.sql(
  """
          | select
          |   id,
          |   name,
          |   sex,
          |   dept,
          |   salary,
          |   avg(salary) over(partition by dept rows between unbounded preceding and unbounded following) as avg_salary
          | from
          |   t_user
          |""".stripMargin)
.show()

Window rowsBetween usage:

Example
Table A has three records with the columns ID, start_date, end_date.
The data is:

1 2018-02-03 2019-02-03;
2 2019-02-04 2020-03-04;
3 2018-08-04 2019-03-04;

Using SQL, produce the following result from these three records:

A 2018-02-03 2018-08-04;
B 2018-08-04 2019-02-03;
C 2019-02-03 2019-02-04;
D 2019-02-04 2019-03-04;
E 2019-03-04 2020-03-04;

(Hint: treat this as rebuilding a set of contiguous time intervals, with their breakpoints, from the scattered time points contained in the given intervals.)

Problem-solving approach:

  1. Split each record into its individual dates
  2. Sort the dates in ascending order
  3. Use a window function to pair consecutive dates into intervals
package method

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window

object SparkSQLWordCountOnWindowFunction2 {
  def main(args: Array[String]): Unit = {
    // 1. Build the SparkSession, the core object of Spark SQL
    val spark = SparkSession.builder().appName("wordcount").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark
    .sparkContext
    .makeRDD(List(
      (1, "2018-02-03", "2019-02-03"),
      (2, "2019-02-04", "2020-03-04"),
      (3, "2018-08-04", "2019-03-04")
    ))
    val df = rdd
    .flatMap(t3 => {
      Array[String](t3._2, t3._3)
    })
    .toDF("value")

    import org.apache.spark.sql.functions._

    val w1 = Window.orderBy($"value" asc).rowsBetween(0,1)
    df
    .withColumn("next", max("value") over (w1))
    .show()

    spark.stop()
  }
}

Valid data range within the window (rowsBetween arguments):
value = 0: the current row
value = n: n rows after the current row
value = -n: n rows before the current row
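The code above only pairs each date with the one that follows it. A possible way to finish the rebuild (my own sketch, not the original author's code; it assumes the same df, imports, and SparkSession as above) is to take the next date with lead(), drop the last date, which has no successor, and number the resulting intervals in place of the A, B, C labels:

    val w2 = Window.orderBy($"value" asc)
    df
      .withColumn("next", lead("value", 1) over (w2))  // the date that follows the current one
      .where($"next".isNotNull)                        // the last date has no successor, drop it
      .withColumn("label", row_number() over (w2))     // 1, 2, 3, ... standing in for A, B, C, ...
      .select("label", "value", "next")
      .show()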

Summary:

  1. Spark SQL syntax is similar to the pure SQL syntax of a relational database
  2. Differences between a GlobalTempView (global view) and a TempView (temporary view):
    • Global view: stored in the global_temp database; can be used across multiple SparkSessions
    • Temporary view: stored in the default database; can only be used by the SparkSession that created it
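A minimal sketch of that difference (assuming an existing DataFrame df and SparkSession spark):

    // temporary view: visible only in the SparkSession that created it
    df.createOrReplaceTempView("t_local")
    spark.sql("select * from t_local").show()

    // global temporary view: stored in the global_temp database and
    // visible to other sessions of the same application
    df.createGlobalTempView("t_global")
    spark.newSession().sql("select * from global_temp.t_global").show()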

Origin blog.csdn.net/Mr_YXX/article/details/105061974