A detailed explanation of Spark window functions (Window)

1. Why do you need a window function

Before version 1.4, Spark SQL supported two kinds of functions for computing a single return value. The first kind is built-in functions and UDFs, which take the values in a single row as input and produce a single return value for each input row. The second kind is aggregate functions, typically SUM, MAX, and AVG, which operate on a group of rows and compute one return value per group.

Both kinds of functions are widely used in practice, but there is still a large class of operations that neither can express on its own. One of the most common scenarios is needing to operate on a set of rows while still returning one value for each input row; the two kinds of functions above are of no help here. For example, it is very hard to compute a moving average, compute a cumulative sum, or access the value of a row that appears before the current row. Fortunately, since version 1.4, Spark SQL provides window functions to fill this gap.

The core concept of window functions is the "frame": a frame is a set of rows, and a query may involve many such sets. By operating over these frames, we can express what the ordinary functions above cannot. To see the concrete usage clearly, let's go straight to examples. Talk is cheap, show me the code.

2. Construct the data set

To facilitate testing, we first construct a data set:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

  def test() = {
    // setAppName is required: SparkContext creation fails without an application name
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("windowtest")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    val data = Array(("lili", "ml", 90),
      ("lucy", "ml", 85),
      ("cherry", "ml", 80),
      ("terry", "ml", 85),
      ("tracy", "cs", 82),
      ("tony", "cs", 86),
      ("tom", "cs", 75))

    // Column names for the DataFrame
    val schemas = Seq("name", "subject", "score")
    val df = spark.createDataFrame(data).toDF(schemas: _*)

    df.show()
  }

After running the above test method locally, the output is as follows:

+------+-------+-----+
|  name|subject|score|
+------+-------+-----+
|  lili|     ml|   90|
|  lucy|     ml|   85|
|cherry|     ml|   80|
| terry|     ml|   85|
| tracy|     cs|   82|
|  tony|     cs|   86|
|   tom|     cs|   75|
+------+-------+-----+

The data set is ready.

3. View rankings in groups

A frequently used scenario: viewing each student's ranking within their subject. This is a typical grouping problem, and exactly where window functions shine.

A window definition has three parts:

1. Partitioning: how are rows grouped? The window only selects data within the current row's group.
2. Ordering: how are rows sorted? Rows are first sorted in the specified way before window data is selected.
3. Frame selection: relative to the current row, which surrounding rows are selected?

Corresponding to these three parts, the general syntax of a window function is:

window_func(args) OVER ( [PARTITION BY col_name, col_name, ...] [ORDER BY col_name, col_name, ...] [ROWS | RANGE BETWEEN (CURRENT ROW | (UNBOUNDED | [num]) PRECEDING) AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)] )

Among them:
window_func is the window function itself;
over indicates that this is a window function call;
partition by corresponds to grouping, i.e. which column(s) to group by;
order by corresponds to sorting, i.e. which column(s) to sort by;
rows (or range) corresponds to frame selection (see the sketch after this list).
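
The frame part (ROWS BETWEEN ...) deserves a quick illustration, since the examples later in this article only use partition by and order by. Here is a minimal sketch computing a running sum of scores within each subject, assuming the person_subject_score view that is registered in the code listing below:

    -- the frame is "all rows from the top of the group down to the current row"
    select name, subject, score,
           sum(score) over (partition by subject order by score desc
                            rows between unbounded preceding and current row) as running_sum
    from person_subject_score

For the ml group this yields 90, 175, 260, 340 row by row.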

The window_func in Spark falls into three categories:
1. Ranking functions: rank, dense_rank, row_number, percent_rank, ntile, etc. We will look at these with examples later.
2. Analytic functions: cume_dist, lag, etc. (see the lag sketch after this list).
3. Aggregate functions: our commonly used max, min, sum, avg, etc.
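
Analytic functions other than cume_dist are not demonstrated later in this article, so here is a minimal lag sketch, again assuming the person_subject_score view, that makes the "access the value of the row before the current row" scenario from section 1 concrete:

    -- prev_score: the score of the previous row within the same subject (null for the first row)
    select name, subject, score,
           lag(score, 1) over (partition by subject order by score desc) as prev_score
    from person_subject_score

With lag in hand, the gap to the next-better student is simply score - prev_score.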

Back to the requirement above: view each student's ranking within their subject.

  def test() = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("windowtest")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val sqlContext = spark.sqlContext

    val data = Array(("lili", "ml", 90),
      ("lucy", "ml", 85),
      ("cherry", "ml", 80),
      ("terry", "ml", 85),
      ("tracy", "cs", 82),
      ("tony", "cs", 86),
      ("tom", "cs", 75))

    val schemas = Seq("name", "subject", "score")
    val df = spark.createDataFrame(data).toDF(schemas: _*)
    // Register a temp view so the data can be queried with SQL
    df.createOrReplaceTempView("person_subject_score")

    val sqltext = "select name, subject, score, rank() over (partition by subject order by score desc) as rank from person_subject_score"
    val ret = sqlContext.sql(sqltext)
    ret.show()
  }

Running the above code produces the following result:

+------+-------+-----+----+
|  name|subject|score|rank|
+------+-------+-----+----+
|  tony|     cs|   86|   1|
| tracy|     cs|   82|   2|
|   tom|     cs|   75|   3|
|  lili|     ml|   90|   1|
|  lucy|     ml|   85|   2|
| terry|     ml|   85|   2|
|cherry|     ml|   80|   4|
+------+-------+-----+----+

Focus on the window part:

rank() over (partition by subject order by score desc) as rank

rank() takes each row's rank within its group; partition by subject groups the rows by subject; order by score desc sorts by score in descending order. Together, these give each student's ranking within their subject!

row_number and dense_rank are also ranking-related window functions. Let's see the differences among them through an example:

  def test() = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("windowtest")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val sqlContext = spark.sqlContext

    val data = Array(("lili", "ml", 90),
      ("lucy", "ml", 85),
      ("cherry", "ml", 80),
      ("terry", "ml", 85),
      ("tracy", "cs", 82),
      ("tony", "cs", 86),
      ("tom", "cs", 75))

    val schemas = Seq("name", "subject", "score")
    val df = spark.createDataFrame(data).toDF(schemas: _*)
    df.createOrReplaceTempView("person_subject_score")

    // rank: tied rows share a rank, and the following rank is skipped
    val sqltext = "select name, subject, score, rank() over (partition by subject order by score desc) as rank from person_subject_score"
    val ret = sqlContext.sql(sqltext)
    ret.show()

    // row_number: a unique sequential number within each group
    val sqltext2 = "select name, subject, score, row_number() over (partition by subject order by score desc) as row_number from person_subject_score"
    val ret2 = sqlContext.sql(sqltext2)
    ret2.show()

    // dense_rank: tied rows share a rank, and no rank is skipped
    val sqltext3 = "select name, subject, score, dense_rank() over (partition by subject order by score desc) as dense_rank from person_subject_score"
    val ret3 = sqlContext.sql(sqltext3)
    ret3.show()
  }

The outputs of the three queries are as follows:

+------+-------+-----+----+
|  name|subject|score|rank|
+------+-------+-----+----+
|  tony|     cs|   86|   1|
| tracy|     cs|   82|   2|
|   tom|     cs|   75|   3|
|  lili|     ml|   90|   1|
|  lucy|     ml|   85|   2|
| terry|     ml|   85|   2|
|cherry|     ml|   80|   4|
+------+-------+-----+----+

+------+-------+-----+----------+
|  name|subject|score|row_number|
+------+-------+-----+----------+
|  tony|     cs|   86|         1|
| tracy|     cs|   82|         2|
|   tom|     cs|   75|         3|
|  lili|     ml|   90|         1|
|  lucy|     ml|   85|         2|
| terry|     ml|   85|         3|
|cherry|     ml|   80|         4|
+------+-------+-----+----------+

+------+-------+-----+----------+
|  name|subject|score|dense_rank|
+------+-------+-----+----------+
|  tony|     cs|   86|         1|
| tracy|     cs|   82|         2|
|   tom|     cs|   75|         3|
|  lili|     ml|   90|         1|
|  lucy|     ml|   85|         2|
| terry|     ml|   85|         2|
|cherry|     ml|   80|         3|
+------+-------+-----+----------+

From the example above, the differences among the three are not hard to see:
rank produces rankings with gaps: tied rows share the same rank, and the following rank is skipped. In the example above: 1, 2, 2, 4.
dense_rank produces rankings without gaps: tied rows share the same rank, and no rank is skipped. In the example above: 1, 2, 2, 3.
row_number, as the name suggests, simply numbers the rows: even tied rows get distinct numbers. In the example above: 1, 2, 3, 4.
Don't get hung up on formal definitions; the example above says it all!
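
Nothing forces the three functions into separate queries: each carries its own over clause, so they can be compared side by side in a single statement. A minimal sketch:

    select name, subject, score,
           rank()       over (partition by subject order by score desc) as rank,
           dense_rank() over (partition by subject order by score desc) as dense_rank,
           row_number() over (partition by subject order by score desc) as row_number
    from person_subject_score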

4. View quantiles

Let's look at another example: how do we find a student's quantile position within their subject?
This is where the cume_dist function comes in.
It is computed as: the number of rows in the group that sort at or before the current row (ties included), divided by the total number of rows in the group.

Again, look at the code:

    val sqltext5 = "select name, subject, score, cume_dist() over (partition by subject order by score desc) as cumedist from person_subject_score"
    val ret5 = sqlContext.sql(sqltext5)
    ret5.show()

Combining the earlier data-initialization code with the SQL above, the final result is as follows:

+------+-------+-----+------------------+
|  name|subject|score|          cumedist|
+------+-------+-----+------------------+
|  tony|     cs|   86|0.3333333333333333|
| tracy|     cs|   82|0.6666666666666666|
|   tom|     cs|   75|               1.0|
|  lili|     ml|   90|              0.25|
|  lucy|     ml|   85|              0.75|
| terry|     ml|   85|              0.75|
|cherry|     ml|   80|               1.0|
+------+-------+-----+------------------+

As you can see, this meets the requirement perfectly. For instance, lucy gets 0.75 because, under the descending sort, three of the four ml rows (lili at 90, lucy and terry at 85) sort at or before her row: 3/4 = 0.75.
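
percent_rank, mentioned in the function list above but not shown, is a close relative of cume_dist: it computes (rank - 1) / (rows in group - 1), so the top row of each group always gets 0.0. A minimal sketch for comparison (sqltext6 is a hypothetical continuation of the numbering above):

    // percent_rank: relative rank scaled to [0, 1], 0.0 for the top row of each group
    val sqltext6 = "select name, subject, score, percent_rank() over (partition by subject order by score desc) as percentrank from person_subject_score"
    sqlContext.sql(sqltext6).show()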

5. Use DataFrame API to complete window query

The examples above go through sqlContext and SQL. The DataFrame API can express the same window queries, and the approach is very simple: for the supported functions, call the over() method, e.g. rank().over(...).

Take the previous requirement as an example. To view each student's ranking within their subject, the DataFrame API is used as follows:

  def test() = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("windowtest")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    val data = Array(("lili", "ml", 90),
      ("lucy", "ml", 85),
      ("cherry", "ml", 80),
      ("terry", "ml", 85),
      ("tracy", "cs", 82),
      ("tony", "cs", 86),
      ("tom", "cs", 75))

    val schemas = Seq("name", "subject", "score")
    val df = spark.createDataFrame(data).toDF(schemas: _*)

    // Window comes from org.apache.spark.sql.expressions.Window;
    // rank() and col() come from org.apache.spark.sql.functions._.
    // No temp view is needed: the query is expressed directly on the DataFrame.
    val window = Window.partitionBy("subject").orderBy(col("score").desc)
    val df2 = df.withColumn("rank", rank().over(window))
    df2.show()
  }

The output is as follows:

+------+-------+-----+----+
|  name|subject|score|rank|
+------+-------+-----+----+
|  tony|     cs|   86|   1|
| tracy|     cs|   82|   2|
|   tom|     cs|   75|   3|
|  lili|     ml|   90|   1|
|  lucy|     ml|   85|   2|
| terry|     ml|   85|   2|
|cherry|     ml|   80|   4|
+------+-------+-----+----+
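
The frame clause from section 3 is also reachable from the DataFrame API through rowsBetween. Here is a minimal sketch, assuming the same df and Window import as above plus the Window.unboundedPreceding and Window.currentRow boundary constants (available since Spark 2.1), that attaches a running sum of scores within each subject:

    // frame: all rows from the start of the partition up to and including the current row
    val runningWindow = Window.partitionBy("subject")
      .orderBy(col("score").desc)
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val df3 = df.withColumn("running_sum", sum("score").over(runningWindow))
    df3.show()

This is the DataFrame counterpart of the SQL frame "rows between unbounded preceding and current row".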
