One, UDF
Test data user.json:
{"id": 1001, "name": "foo", "sex": "man", "age": 20}
{"id": 1002, "name": "bar", "sex": "man", "age": 24}
{"id": 1003, "name": "baz", "sex": "man", "age": 18}
{"id": 1004, "name": "foo1", "sex": "woman", "age": 17}
{"id": 1005, "name": "bar2", "sex": "woman", "age": 19}
{"id": 1006, "name": "baz3", "sex": "woman", "age": 20}
1. Register a custom operator through an anonymous function
The goal is to convert woman and man in user.json to female and male.
//Create the SparkSession
val spark = SparkSession
.builder
.master("local[*]")
.appName("SparkUDF")
.enableHiveSupport() //enable Hive support
.getOrCreate()
//SparkSession reads the JSON file directly
val userDF = spark.read.json("in/user.json")
val sc: SparkContext = spark.sparkContext
//Register the DataFrame as a view so it can be queried with SQL
userDF.createOrReplaceTempView("userDF")
//Register a custom operator via an anonymous function: convert woman and man to female and male
spark.udf.register("Sex",(sex:String)=>{
var result="unknown"
if (sex=="woman"){
result="female"
}else if(sex=="man"){
result="male"
}
result
})
spark.sql("select Sex(sex) from userDF").show()
The results of the operation are as follows:
+------------+
|UDF:Sex(sex)|
+------------+
| male|
| male|
| male|
| female|
| female|
| female|
+------------+
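Besides calling it from SQL, the same transformation can be applied through the DataFrame API. A minimal sketch, assuming the SparkSession above is in scope; the `sexUdf` and `newSex` names are illustrative choices, not from the original:

```scala
import org.apache.spark.sql.functions.udf

// Wrap the same anonymous function with functions.udf for DataFrame-API use
val sexUdf = udf((sex: String) => sex match {
  case "woman" => "female"
  case "man"   => "male"
  case _       => "unknown"
})
// newSex is a hypothetical column name chosen for this example
userDF.withColumn("newSex", sexUdf(userDF("sex"))).show()
```

Functions registered with `spark.udf.register` are visible to SQL; `functions.udf` produces a column expression usable directly on DataFrames.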
2. Register a custom operator through a named method
//Create the SparkSession
val spark = SparkSession
.builder
.master("local[*]")
.appName("SparkUDF")
.enableHiveSupport() //enable Hive support
.getOrCreate()
//SparkSession reads the JSON file directly
val userDF = spark.read.json("in/user.json")
val sc: SparkContext = spark.sparkContext
//Register the DataFrame as a view so it can be queried with SQL
userDF.createOrReplaceTempView("userDF")
/*
Register a custom operator via a named method.
In Scala, methods and functions are different concepts: a method cannot be passed
as a parameter or assigned to a variable, but a function can. A trailing underscore
converts a method into a function:
*/
spark.udf.register("Sex",Sex _)
spark.sql("select Sex(sex) as A from userDF").show()
}
//Convert woman and man to female and male
def Sex(sex:String): String ={
var result="unknown"
if (sex=="woman"){
result="female"
}else if(sex=="man"){
result="male"
}
result
}
The results of the operation are as follows:
+------+
| A|
+------+
| male|
| male|
| male|
|female|
|female|
|female|
+------+
Two, UDAF
1. Introduction to UDAF
First, what is a UDAF (User Defined Aggregate Function)? It is a user-defined aggregate function. The difference between an aggregate function and an ordinary function is this: an ordinary function accepts one row of input and produces one output, while an aggregate function accepts a group of input rows (usually multiple rows) and produces one output. In other words, it aggregates a set of values.
That is, it takes multiple rows of input and produces a single output.
A common misunderstanding about UDAF
We may subconsciously assume that a UDAF must be used with group by. In fact, a UDAF can be used either with or without group by. This is easy to understand by analogy with built-in functions such as max and min in MySQL. You can write:
select max(foo) from foobar group by bar;
which groups by the bar field and then finds the maximum of each group; there are many groups, and the function processes each one. You can also write:
select max(foo) from foobar;
in which case the whole table is treated as one group and the maximum is found within that group (the entire table). Therefore, an aggregate function processes a group, regardless of how many records the group contains.
2. UDAF use
2.1 Inherit UserDefinedAggregateFunction
The routine for using UserDefinedAggregateFunction:
- The custom class inherits UserDefinedAggregateFunction and implements the method for each stage
- Register the UDAF in Spark and bind a name to it
- Call it in SQL statements using the bound name
Below is a UDAF example that calculates an average. First define a MyAgeAvgFunction class that inherits UserDefinedAggregateFunction:
class MyAgeAvgFunction extends UserDefinedAggregateFunction {
//Input schema of the aggregate function
override def inputSchema: StructType = {
new StructType().add("age",LongType)
// StructType(StructField("age",LongType)::Nil) //equivalent to the line above
}
//Schema of the aggregation buffer
override def bufferSchema: StructType = {
new StructType().add("sum",LongType).add("count",LongType)
// StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil) //equivalent to the line above
}
//Data type of the aggregate function's return value
override def dataType: DataType = {
DoubleType
}
//Whether the function is deterministic, i.e. the same input always produces the same output
override def deterministic: Boolean = true
//Initialize the buffer
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0)=0L
buffer(1)=0L
}
//Process one new input row
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0)=buffer.getLong(0)+input.getLong(0)
buffer(1)=buffer.getLong(1)+1
}
//Merge two aggregation buffers
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
//total of the ages
buffer1(0)=buffer1.getLong(0)+buffer2.getLong(0)
//count
buffer1(1)=buffer1.getLong(1)+buffer2.getLong(1)
}
//Compute the final result
override def evaluate(buffer: Row): Any = {
buffer.getLong(0).toDouble/buffer.getLong(1)
}
}
Then register and use it:
val spark = SparkSession.builder()
.appName("SparkUDAF")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val df: DataFrame = spark.read.json("in/user.json")
//Create and register the custom UDAF
val myUdaf = new MyAgeAvgFunction
spark.udf.register("myAvgAge",myUdaf)
df.createOrReplaceTempView("userinfo")
val resultDF: DataFrame = spark
.sql("select sex,Round(myAvgAge(age),2) as avgage from userinfo group by sex")
resultDF.show()
Data set user.json:
{"id": 1001, "name": "foo", "sex": "man", "age": 20}
{"id": 1002, "name": "bar", "sex": "man", "age": 24}
{"id": 1003, "name": "baz", "sex": "man", "age": 18}
{"id": 1004, "name": "foo1", "sex": "woman", "age": 17}
{"id": 1005, "name": "bar2", "sex": "woman", "age": 19}
{"id": 1006, "name": "baz3", "sex": "woman", "age": 20}
The results of the operation are as follows:
+-----+------+
| sex|avgage|
+-----+------+
| man| 20.67|
|woman| 18.67|
+-----+------+
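To match the earlier point that a UDAF does not require group by, the same registered function can aggregate the whole table as one group. A minimal sketch against the same userinfo view:

```scala
// Treat the entire table as a single group: average over all six rows
spark.sql("select Round(myAvgAge(age),2) as avgage from userinfo").show()
// (20+24+18+17+19+20)/6 = 19.67
```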
2.2 Inherit Aggregator
There is another way: inherit the Aggregator class, which has the advantage of being strongly typed.
The code demonstration of that approach is omitted here; see:
https://www.cnblogs.com/cc11001100/p/9471859.html
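As a rough sketch of the typed approach (assumed API shape; the AvgBuffer case class and object name are illustrative, not from the original post):

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Illustrative buffer type for a typed average
case class AvgBuffer(var sum: Long, var count: Long)

object MyAvgAggregator extends Aggregator[Long, AvgBuffer, Double] {
  // Initial (zero) value of the buffer
  def zero: AvgBuffer = AvgBuffer(0L, 0L)
  // Fold one input value into the buffer
  def reduce(b: AvgBuffer, age: Long): AvgBuffer = {
    b.sum += age; b.count += 1; b
  }
  // Merge two partial buffers (across partitions)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  // Produce the final result from the buffer
  def finish(b: AvgBuffer): Double = b.sum.toDouble / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

In Spark 2.x an Aggregator is applied to a typed Dataset via `.toColumn`; Spark 3.x can also register it for SQL with `functions.udaf`.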
Three, UDTF
1. Introduction to UDTF
A custom UDTF operator is created by extending the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDTF. A UDTF takes one row of input and produces multiple rows of output.
2. UDTF use
2.1 Inherit GenericUDTF
Create a MyUDTF class that inherits GenericUDTF:
class MyUDTF extends GenericUDTF{
override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
if(argOIs.length!=1){
throw new UDFArgumentException("exactly one argument must be passed in")
}
if (argOIs(0).getCategory!=ObjectInspector.Category.PRIMITIVE){
throw new UDFArgumentException("argument type does not match")
}
val fieldNames = new util.ArrayList[String]
val fieldOIs = new util.ArrayList[ObjectInspector]()
fieldNames.add("type")
//define the type of the output column here
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector)
ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs)
}
//input:  Hadoop Scala kafka hive hbase Oozie
//output: HEAD type String
//Hadoop
//Scala
//kafka
//hive
//hbase
//Oozie
override def process(objects: Array[AnyRef]): Unit = {
//split the string into individual words
val strings: Array[String] = objects(0).toString.split(" ")
for (str <- strings){
val tmp = new Array[String](1)
tmp(0)=str
forward(tmp)
}
}
override def close(): Unit = {
}
}
Then enable Hive support and use it:
val spark = SparkSession.builder()
.appName("SparkUDTFDemo")
.master("local[*]")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val lines: RDD[String] = sc.textFile("in/udtf.txt")
val stuDF: DataFrame = lines.map(_.split("//"))
.filter(x => x(1).equals("ls"))
.map(x=> (x(0), x(1), x(2)))
.toDF("id", "name", "class")
stuDF.createOrReplaceTempView("student")
spark.sql("create temporary function MyUDTF as 'shuju.MyUDTF' ")
val resultDF: DataFrame = spark.sql("select MyUDTF(class) from student")
resultDF.show()
Data set udtf.txt:
01//zs//Hadoop scala spark hive hbase
02//ls//Hadoop scala kafka hive hbase Oozie
03//ww//Hadoop scala spark hive sqoop
The results of the operation are as follows:
+------+
| type|
+------+
|Hadoop|
| scala|
| kafka|
| hive|
| hbase|
| Oozie|
+------+
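For this particular word-splitting case, the built-in explode and split functions reach the same one-row-to-many-rows result without a custom UDTF. A sketch against the same student view:

```scala
// Equivalent one-to-many transform using Spark's built-in functions
spark.sql("select explode(split(class, ' ')) as type from student").show()
```

A custom UDTF is still the right tool when the row-expansion logic is more involved than what the built-in generator functions cover.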