1. UDFs (user-defined functions)
• Register a UDF: spark.udf.register("addname", (x: String) => "name:" + x)
• Use it in SQL: spark.sql("select addname(name) as name from people").show
A complete example:
scala> val df = spark.read.json("source/employees.json")
df: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]
scala> df.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+
scala> spark.udf.register("addName", (x:String)=> "Name:"+x)
res38: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.createOrReplaceTempView("people")
scala> spark.sql("Select addName(name), salary from people").show()
+-----------------+------+
|UDF:addName(name)|salary|
+-----------------+------+
|     Name:Michael|  3000|
|        Name:Andy|  4500|
|      Name:Justin|  3500|
|       Name:Berta|  4000|
+-----------------+------+
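The same function can also be applied without SQL, through the DataFrame column API. A minimal sketch (udf comes from org.apache.spark.sql.functions; in spark-shell the $-column syntax is already available via the session implicits):
import org.apache.spark.sql.functions.udf
// Wrap the plain Scala function as a column expression
val addNameUdf = udf((x: String) => "Name:" + x)
df.select(addNameUdf($"name").as("name"), $"salary").show()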
2. UDAFs (user-defined aggregate functions)
Both the strongly typed Dataset and the weakly typed DataFrame provide common aggregate functions, such as count(), countDistinct(), avg(), max(), and min(). Beyond these, users can define their own aggregate functions.
Example:
package com.luomk.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

/**
 * Compute the average employee salary.
 */
class MyAverage extends UserDefinedAggregateFunction {
  // Data type of the input to the aggregate function
  override def inputSchema: StructType = StructType(StructField("salary", LongType) :: Nil)
  // Data types of the intermediate aggregation buffer
  override def bufferSchema: StructType = StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  // Data type of the return value
  override def dataType: DataType = DoubleType
  // Whether the function always returns the same output for the same input
  override def deterministic: Boolean = true
  // Initialize the aggregation buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Update the buffer with a new input row, within a single partition
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    buffer(1) = buffer.getLong(1) + 1
  }
  // Merge the buffers produced by different partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Compute the final result
  override def evaluate(buffer: Row): Double = {
    buffer.getLong(0).toDouble / buffer.getLong(1)
  }
}

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("udaf").setMaster("local[*]")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.udf.register("average", new MyAverage())
    val df = spark.read.json("/Users/g2/workspace/myprive/spark/sparkDemo/sparkCore/sparkcore_sql/src/main/resources/employees.json")
    df.createOrReplaceTempView("employee")
    spark.sql("select average(salary) as avg from employee").show()
    spark.stop()
  }
}
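For the strongly typed Dataset API, Spark also provides the Aggregator abstraction. Below is a sketch of the same average-salary computation in that style, mirroring the typed example in the Spark SQL documentation (the case class and object names are illustrative):
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverageTyped extends Aggregator[Employee, Average, Double] {
  // The zero value for the aggregation buffer
  def zero: Average = Average(0L, 0L)
  // Fold one input record into the buffer
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Combine two buffers from different partitions
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Produce the final result from the buffer
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage: read the same employees.json as a typed Dataset and aggregate it
// val ds = spark.read.json("source/employees.json").as[Employee]
// ds.select(MyAverageTyped.toColumn.name("average_salary")).show()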
3. Window functions
//*************** The original scores table ***************//
+----+-----+-----+
|name|class|score|
+----+-----+-----+
|   a|    1|   80|
|   b|    1|   78|
|   c|    1|   95|
|   d|    2|   74|
|   e|    2|   92|
|   f|    3|   99|
|   g|    3|   99|
|   h|    3|   45|
|   i|    3|   55|
|   j|    3|   78|
+----+-----+-----+
// Load the raw data as an RDD, convert it to a Dataset, and register it as a temporary view with createOrReplaceTempView (see the sketch below); then query it as follows:
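A minimal sketch of that setup, assuming the raw data is a comma-separated text file at source/score.txt (both the path and the delimiter are illustrative):
import spark.implicits._

val scoreDF = spark.sparkContext
  .textFile("source/score.txt")                 // lines like "a,1,80"
  .map(_.split(","))
  .map(f => (f(0), f(1).toInt, f(2).toInt))
  .toDF("name", "class", "score")
scoreDF.createOrReplaceTempView("score")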
//*************** Students with the highest score in each class ***************//
//*************** Output of the window function ***************//
spark.sql("select name,class,score, rank() over(partition by class order by score desc) rank from score").show()
+----+-----+-----+----+
|name|class|score|rank|
+----+-----+-----+----+
|   c|    1|   95|   1|
|   a|    1|   80|   2|
|   b|    1|   78|   3|
|   f|    3|   99|   1|
|   g|    3|   99|   1|
|   j|    3|   78|   3|
|   i|    3|   55|   4|
|   h|    3|   45|   5|
|   e|    2|   92|   1|
|   d|    2|   74|   2|
+----+-----+-----+----+
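Note how rank() handles ties: f and g share rank 1 in class 3, so the next student, j, jumps to rank 3. If consecutive ranking is wanted instead, dense_rank() would assign j rank 2; a quick variant:
spark.sql("select name, class, score, dense_rank() over(partition by class order by score desc) dr from score").show()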
//*************** The final result ***************//
spark.sql("select * from " +
  "(select name, class, score, rank() over(partition by class order by score desc) rank from score) " +
  "as t " +
  "where t.rank = 1").show()
+----+-----+-----+----+
|name|class|score|rank|
+----+-----+-----+----+
|   c|    1|   95|   1|
|   f|    3|   99|   1|
|   g|    3|   99|   1|
|   e|    2|   92|   1|
+----+-----+-----+----+
//*************** Students with the highest score in each class (GROUP BY) ***************//
spark.sql("select class, max(score) max from score group by class").show()
+-----+---+
|class|max|
+-----+---+
|    1| 95|
|    3| 99|
|    2| 92|
+-----+---+
spark.sql("select a.name, b.class, b.max from score a, " +
"(select class, max(score) max from score group by class) as b " +
"where a.score = b.max").show()
+----+-----+---+
|name|class|max|
+----+-----+---+
|   e|    2| 92|
|   c|    1| 95|
|   f|    3| 99|
|   g|    3| 99|
+----+-----+---+
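The window-function query can also be written with the DataFrame API's Window specification. A sketch, assuming the view's underlying DataFrame is named scoreDF as in the setup above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Partition by class, order by score descending, then keep only rank 1
val byClass = Window.partitionBy("class").orderBy(desc("score"))
scoreDF.withColumn("rank", rank().over(byClass))
  .where(col("rank") === 1)
  .show()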