Spark SQL in Depth: UDFs, UDAFs, and Window Functions

Copyright notice: this is an original post by the author and may not be reposted without permission. https://blog.csdn.net/Luomingkui1109/article/details/86183122

1. UDF (User-Defined Functions)

    • Register a UDF:   spark.udf.register("addname", (x: String) => "name:" + x)

    • Call it from SQL:   spark.sql("select addname(name) as name from people").show

A complete example in spark-shell:

scala>  val df = spark.read.json("source/employees.json")
df: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> df.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+

scala>  spark.udf.register("addName", (x:String)=> "Name:"+x)
res38: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala>  df.createOrReplaceTempView("people")
scala>  spark.sql("Select addName(name), salary from people").show()

+-----------------+------+
|UDF:addName(name)|salary|
+-----------------+------+
|     Name:Michael|  3000|
|        Name:Andy|  4500|
|      Name:Justin|  3500|
|       Name:Berta|  4000|
+-----------------+------+
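
The UDF registered above is referenced by name from SQL on this SparkSession. The same logic can also be used through the DataFrame API; a minimal sketch, run in the same spark-shell session (the column alias addedName is just an illustration, not from the original example):

import org.apache.spark.sql.functions.udf

// Wrap the same lambda as a column-level UDF for use with select/withColumn
val addNameUdf = udf((x: String) => "Name:" + x)

df.select(addNameUdf($"name").as("addedName"), $"salary").show()

// Or call the SQL-registered UDF by name via selectExpr:
df.selectExpr("addName(name)", "salary").show()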

2. UDAF (User-Defined Aggregate Functions)

    Both the strongly typed Dataset API and the weakly typed DataFrame API ship with common aggregate functions such as count(), countDistinct(), avg(), max(), and min(). Beyond these built-ins, users can define their own aggregate functions.
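
For reference, the built-in aggregates can be called directly from the DataFrame API; a small sketch against the employees DataFrame from the previous section:

import org.apache.spark.sql.functions.{avg, count, max, min}

df.select(count("name"), avg("salary"), max("salary"), min("salary")).show()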

    A custom UDAF that computes the average salary is shown below:

package com.luomk.sql
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

/**
  * Computes the average salary of employees.
  */
class MyAverage extends UserDefinedAggregateFunction{
  // Input data type of the aggregate function
  override def inputSchema: StructType = StructType(StructField("salary",LongType)::Nil)
  // Schema of the intermediate aggregation buffer (per-partition partial results)
  override def bufferSchema: StructType = StructType(StructField("sum",LongType)::StructField("count",LongType)::Nil)
  // Return type of the aggregate function
  override def dataType: DataType = DoubleType
  // Whether the function is deterministic (the same input always yields the same output)
  override def deterministic: Boolean = true
  // Initialize the aggregation buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }

  // Update the buffer with a new input row within each partition
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    buffer(1) = buffer.getLong(1) + 1
  }

  // Merge buffers coming from different partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)

  }

  // Compute the final result
  override def evaluate(buffer: Row): Double = {
    buffer.getLong(0).toDouble / buffer.getLong(1)
  }
}


object Test{
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("udaf").setMaster("local[*]")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    import spark.implicits._
    spark.udf.register("average", new MyAverage())
    val df = spark.read.json("/Users/g2/workspace/myprive/spark/sparkDemo/sparkCore/sparkcore_sql/src/main/resources/employees.json")
    df.createOrReplaceTempView("employee")
    spark.sql("select average(salary) as avg from employee").show()
    spark.stop()
  }
}
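
UserDefinedAggregateFunction covers the untyped DataFrame side. For the strongly typed Dataset API mentioned above, Spark also provides the Aggregator abstraction; the following is a sketch of the same average-salary logic in typed form (the case class names Employee and Average are illustrative):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyTypedAverage extends Aggregator[Employee, Average, Double] {
  // Zero value of the aggregation buffer
  def zero: Average = Average(0L, 0L)
  // Fold one input record into the buffer
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Combine two partial buffers
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Produce the final result from the merged buffer
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage: val ds = df.as[Employee]
//        ds.select(MyTypedAverage.toColumn.name("average_salary")).show()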

3. Window Functions

Window (analytic) functions such as rank() compute a value for every row over a group of related rows (the window partition) without collapsing those rows the way group by does. The example below uses rank() to find the top-scoring student(s) in each class.

//***************  Original class score table  ****************//
+----+-----+-----+
|name|class|score|
+----+-----+-----+
|   a|    1|   80|
|   b|    1|   78|
|   c|    1|   95|
|   d|    2|   74|
|   e|    2|   92|
|   f|    3|   99|
|   g|    3|   99|
|   h|    3|   45|
|   i|    3|   55|
|   j|    3|   78|
+----+-----+-----+


// Load the raw data as an RDD, convert it to a Dataset, and register it as a temporary view named score via createOrReplaceTempView (one way to do this is sketched right below); then run the following queries:
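
A minimal sketch of that preparation step in spark-shell, assuming a comma-separated text file with one name,class,score record per line (the path source/score.txt and the case class Score are assumptions, not from the original post):

import spark.implicits._

// Backticks let the field be named "class", matching the column used in the SQL below
case class Score(name: String, `class`: Int, score: Int)

val scoreDS = spark.sparkContext
  .textFile("source/score.txt")     // hypothetical input path
  .map(_.split(","))
  .map(a => Score(a(0), a(1).trim.toInt, a(2).trim.toInt))
  .toDS()

scoreDS.createOrReplaceTempView("score")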

//***************  Find the top-scoring student(s) in each class  ***************//

/*******  Ranked table produced by the window function  ********/
spark.sql("select name,class,score, rank() over(partition by class order by score desc) rank from score").show()

+----+-----+-----+----+
|name|class|score|rank|
+----+-----+-----+----+
|   c|    1|   95|   1|
|   a|    1|   80|   2|
|   b|    1|   78|   3|
|   f|    3|   99|   1|
|   g|    3|   99|   1|
|   j|    3|   78|   3|
|   i|    3|   55|   4|
|   h|    3|   45|   5|
|   e|    2|   92|   1|
|   d|    2|   74|   2|
+----+-----+-----+----+

/*******  Final result: keep only rank 1  ********/
spark.sql("select * from " +
  "( select name,class,score,rank() over(partition by class order by score desc) rank from score) " +
  "as t " +
  "where t.rank=1").show()

+----+-----+-----+----+
|name|class|score|rank|
+----+-----+-----+----+
|   c|    1|   95|   1|
|   f|    3|   99|   1|
|   g|    3|   99|   1|
|   e|    2|   92|   1|
+----+-----+-----+----+
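
Note that class 3 has two students tied at 99, so rank() assigns both of them rank 1 and then skips rank 2 (j gets rank 3 above). If exactly one row per class is wanted, row_number() can be used instead; a sketch of that variant (the tie-breaking order among equal scores is then arbitrary):

spark.sql("select * from " +
  "( select name,class,score, row_number() over(partition by class order by score desc) rn from score) as t " +
  "where t.rn=1").show()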

/**************  Find the top-scoring student(s) in each class (group by + join)  ***************/
spark.sql("select class, max(score) max from score group by class").show()

+-----+---+
|class|max|
+-----+---+
|    1| 95|
|    3| 99|
|    2| 92|
+-----+---+

spark.sql("select a.name, b.class, b.max from score a, " +
  "(select class, max(score) max from score group by class) as b " +
  "where a.score = b.max").show()

+----+-----+---+
|name|class|max|
+----+-----+---+
|   e|    2| 92|
|   c|    1| 95|
|   f|    3| 99|
|   g|    3| 99|
+----+-----+---+
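
One caveat about this group-by approach: the join condition matches only on a.score = b.max, so a student whose score happened to equal another class's maximum would be picked up with the wrong class (that collision does not occur in this data set). A safer variant, as a sketch, also matches on the class:

spark.sql("select a.name, a.class, b.max from score a, " +
  "(select class, max(score) max from score group by class) as b " +
  "where a.class = b.class and a.score = b.max").show()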
