Use of UDF, UDAF, and UDTF in Spark

One, UDF

Test data user.json:

{"id": 1001, "name": "foo", "sex": "man", "age": 20}
{"id": 1002, "name": "bar", "sex": "man", "age": 24}
{"id": 1003, "name": "baz", "sex": "man", "age": 18}
{"id": 1004, "name": "foo1", "sex": "woman", "age": 17}
{"id": 1005, "name": "bar2", "sex": "woman", "age": 19}
{"id": 1006, "name": "baz3", "sex": "woman", "age": 20}

1. Register custom operators through anonymous functions

Convert woman and man in user.json to female and male.

    //Create the SparkSession
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("SparkUDF")
      .enableHiveSupport()      //enable Hive support
      .getOrCreate()

    //Read the JSON file directly with SparkSession
    val userDF = spark.read.json("in/user.json")
    val sc: SparkContext = spark.sparkContext

    //Register the DataFrame as a temporary view so it can be queried with SQL
    userDF.createOrReplaceTempView("userDF")

    //Register a custom operator via an anonymous function: convert woman and man to female and male
    spark.udf.register("Sex",(sex:String)=>{
      var result="unknown"
      if (sex=="woman"){
        result="female"
      }else if(sex=="man"){
        result="male"
      }
      result
    })
    spark.sql("select Sex(sex) from userDF").show()

The results of the operation are as follows:

+------------+
|UDF:Sex(sex)|
+------------+
|        male|
|        male|
|        male|
|      female|
|      female|
|      female|
+------------+
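As a side note, a UDF does not have to go through SQL: the same logic can be wrapped with org.apache.spark.sql.functions.udf and applied in the DataFrame API. A minimal sketch (the column name sex_en is only an illustrative choice, not part of the original example):

    import org.apache.spark.sql.functions.{col, udf}

    //Wrap the same conversion logic as a DataFrame-API UDF
    val sexUdf = udf((sex: String) => sex match {
      case "woman" => "female"
      case "man"   => "male"
      case _       => "unknown"
    })

    //Apply it as a column expression instead of a SQL call
    userDF.withColumn("sex_en", sexUdf(col("sex"))).show()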

2. Register a custom operator through a named function

    //Create the SparkSession
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("SparkUDF")
      .enableHiveSupport()      //enable Hive support
      .getOrCreate()

    //Read the JSON file directly with SparkSession
    val userDF = spark.read.json("in/user.json")
    val sc: SparkContext = spark.sparkContext

    //Register the DataFrame as a temporary view so it can be queried with SQL
    userDF.createOrReplaceTempView("userDF")
    /*
    Register the custom operator through a named function.
    In Scala, methods and functions are different concepts: a method cannot be passed
    as a parameter or assigned to a variable, but a function can. The underscore
    converts a method into a function:
    */
    spark.udf.register("sex",Sex _)
    spark.sql("select Sex(sex) as A from userDF").show()
  }

  //Convert woman and man to female and male
  def Sex(sex:String): String ={
    var result="unknown"
    if (sex=="woman"){
      result="female"
    }else if(sex=="man"){
      result="male"
    }
    result
  }

The results of the operation are as follows:

+------+
|     A|
+------+
|  male|
|  male|
|  male|
|female|
|female|
|female|
+------+

Two, UDAF

1. Introduction to UDAF

First, let's explain what a UDAF (User Defined Aggregate Function) is: a user-defined aggregate function. What is the difference between an aggregate function and an ordinary function? An ordinary function takes one row of input and produces one output, while an aggregate function takes a group of rows (usually many) as input and produces a single output; in other words, it aggregates a set of values into one result.
In short: multiple rows of input produce one output.

A misunderstanding about UDAF

We may subconsciously assume that a UDAF has to be used together with group by. In fact, a UDAF can be used either with or without group by. This is easy to understand by analogy with built-in functions such as max and min in MySQL. You can write:

		select max(foo) from foobar group by bar;

This groups by the bar field and then finds the maximum value within each group; there are many groups, and the function processes each of them. You can also write:

		select max(foo) from foobar;

In this case, the entire table is treated as one group, and the maximum value is found within that group (effectively the whole table). So an aggregate function really operates on a group, and does not care how many records the group contains.

2. UDAF use

2.1 Inherit UserDefinedAggregateFunction

The routine for using UserDefinedAggregateFunction:

  1. The custom class inherits UserDefinedAggregateFunction and implements the methods in each stage

  2. Register UDAF in spark and bind a name to it

  3. Then you can use the name bound above to call in the sql statement

Below is a UDAF example that calculates an average.

First define a MyAgeAvgFunction class that extends UserDefinedAggregateFunction:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyAgeAvgFunction extends UserDefinedAggregateFunction {

  //Input data structure of the aggregate function
  override def inputSchema: StructType = {
    new StructType().add("age",LongType)
//    StructType(StructField("age",LongType)::Nil)		//equivalent to the line above
  }

  //Data structure of the aggregation buffer
  override def bufferSchema: StructType = {
    new StructType().add("sum",LongType).add("count",LongType)
//    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)		//equivalent to the line above
  }

  //Data type of the aggregate function's return value
  override def dataType: DataType = {
    DoubleType
  }

  //Whether the aggregate function is deterministic, i.e. the same input always produces the same output
  override def deterministic: Boolean = true

  //Initialize the buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0)=0L
    buffer(1)=0L
  }

  //Process one new input row
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0)=buffer.getLong(0)+input.getLong(0)
    buffer(1)=buffer.getLong(1)+1
  }

  //Merge two aggregation buffers
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    //total age
    buffer1(0)=buffer1.getLong(0)+buffer2.getLong(0)
    //count
    buffer1(1)=buffer1.getLong(1)+buffer2.getLong(1)
  }

  //Compute the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble/buffer.getLong(1)
  }
}

Then register and use it:

    val spark = SparkSession.builder()
      .appName("SparkUDAF")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val df: DataFrame = spark.read.json("in/user.json")

    //Create and register the custom UDAF function
    val myUdaf = new MyAgeAvgFunction
    spark.udf.register("myAvgAge",myUdaf)

    df.createOrReplaceTempView("userinfo")
    val resultDF: DataFrame = spark
      .sql("select sex,Round(myAvgAge(age),2) as avgage from userinfo group by sex")
    resultDF.show()
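
The same UDAF instance can also be called directly from the DataFrame API. A minimal sketch using the myUdaf value registered above (the alias avgage is just for illustration):

    import org.apache.spark.sql.functions.{col, round}

    //A UserDefinedAggregateFunction instance can be applied to columns directly
    df.groupBy("sex")
      .agg(round(myUdaf(col("age")), 2).as("avgage"))
      .show()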

Data set user.json:

{"id": 1001, "name": "foo", "sex": "man", "age": 20}
{"id": 1002, "name": "bar", "sex": "man", "age": 24}
{"id": 1003, "name": "baz", "sex": "man", "age": 18}
{"id": 1004, "name": "foo1", "sex": "woman", "age": 17}
{"id": 1005, "name": "bar2", "sex": "woman", "age": 19}
{"id": 1006, "name": "baz3", "sex": "woman", "age": 20}

The results of the operation are as follows:

+-----+------+
|  sex|avgage|
+-----+------+
|  man| 20.67|
|woman| 18.67|
+-----+------+
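
As mentioned in the introduction, a UDAF also works without group by; in that case the whole table is treated as a single group. A minimal sketch against the same view:

    //Without group by, the whole table is one group;
    //for the six ages above (20, 24, 18, 17, 19, 20) this should print 19.67
    spark.sql("select Round(myAvgAge(age), 2) as avgage from userinfo").show()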

2.2 Inherit Aggregator

There is another approach: extending the Aggregator class, which has the advantage of being strongly typed.
The code demonstration of this approach is omitted here; see:
https://www.cnblogs.com/cc11001100/p/9471859.html
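
For reference, a minimal sketch of what the typed Aggregator approach can look like, assuming Spark 2.x and a case class matching user.json (the names User, AvgBuffer, and MyAvgAggregator are illustrative, not taken from the linked post):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class User(id: Long, name: String, sex: String, age: Long)
case class AvgBuffer(var sum: Long, var count: Long)

object MyAvgAggregator extends Aggregator[User, AvgBuffer, Double] {
  //initial buffer value
  override def zero: AvgBuffer = AvgBuffer(0L, 0L)
  //fold one input row into the buffer
  override def reduce(b: AvgBuffer, u: User): AvgBuffer = { b.sum += u.age; b.count += 1; b }
  //merge two partial buffers
  override def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = { b1.sum += b2.sum; b1.count += b2.count; b1 }
  //compute the final result
  override def finish(b: AvgBuffer): Double = b.sum.toDouble / b.count
  override def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

Usage would then look like ds.select(MyAvgAggregator.toColumn.name("avgAge")).show() on a Dataset[User] obtained via spark.read.json("in/user.json").as[User] (with import spark.implicits._ in scope).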

Three, UDTF

1. Introduction to UDTF

A custom UDTF operator is implemented by extending the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDTF. A UDTF takes one row of input and produces multiple rows of output.

2. UDTF use

2.1 Inherit GenericUDTF

Create a MyUDTF class that extends GenericUDTF:

import java.util

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class MyUDTF extends GenericUDTF{

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    if(argOIs.length!=1){
      throw new UDFArgumentException("Exactly one argument must be passed in")
    }
    if (argOIs(0).getCategory!=ObjectInspector.Category.PRIMITIVE){
      throw new UDFArgumentException("The argument type does not match")
    }
    val fieldNames = new util.ArrayList[String]
    val fieldOIs = new util.ArrayList[ObjectInspector]()
    fieldNames.add("type")
    //Define the type of the output column here
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector)

    ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs)
  }

  //Input:  Hadoop Scala kafka hive hbase Oozie
  //Output: HEAD  type  String
  //Hadoop
  //Scala
  //kafka
  //hive
  //hbase
  //Oozie
  override def process(objects: Array[AnyRef]): Unit = {
    //Split the string into individual words
    val strings: Array[String] = objects(0).toString.split(" ")
    for (str <- strings){
      val tmp = new Array[String](1)
      tmp(0)=str
      forward(tmp)
    }
  }

  override def close(): Unit = {
  }
}

Then create a SparkSession with Hive support enabled and use it:

    val spark = SparkSession.builder()
      .appName("SparkUDTFDemo")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val lines: RDD[String] = sc.textFile("in/udtf.txt")
    val stuDF: DataFrame = lines.map(_.split("//"))
      .filter(x => x(1).equals("ls"))
      .map(x=> (x(0), x(1), x(2)))
      .toDF("id", "name", "class")

    stuDF.createOrReplaceTempView("student")
    spark.sql("create temporary function MyUDTF as 'shuju.MyUDTF' ")
    val resultDF: DataFrame = spark.sql("select MyUDTF(class) from student")
    resultDF.show()

Data set udtf.txt:

01//zs//Hadoop scala spark hive hbase
02//ls//Hadoop scala kafka hive hbase Oozie
03//ww//Hadoop scala spark hive sqoop

The results of the operation are as follows:

+------+
|  type|
+------+
|Hadoop|
| scala|
| kafka|
|  hive|
| hbase|
| Oozie|
+------+
