Chapter 5: Spark SQL Custom Functions

1. Classification of custom functions
Similar to custom functions in Hive, Spark can use custom functions to implement new functionality. Spark's custom functions fall into three categories (a quick sketch using built-in functions follows this list):
1. UDF (User-Defined Function)
input one row, output one row
2. UDAF (User-Defined Aggregation Function)
input multiple rows, output one row
3. UDTF (User-Defined Table-Generating Function)
input one row, output multiple rows
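As an illustration (not from the original), Spark's built-in SQL functions already exhibit all three behaviors; the sketch below assumes a SparkSession named spark is in scope:

// upper behaves like a UDF: one row in, one row out
spark.sql("SELECT upper('abc')").show()
// avg behaves like a UDAF: many rows in, one row out
spark.sql("SELECT avg(col) FROM VALUES (1), (2), (3) AS t(col)").show()
// explode behaves like a UDTF: one row in, many rows out
spark.sql("SELECT explode(array(1, 2, 3))").show()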
2. Custom UDF
Requirement:
The data in udf.txt is as follows:

Hello
abc
study
small

Use a custom UDF to convert each row of data to uppercase, i.e. a query of the form:
select value, smallToBig(value) from t_word
Code demo (the function below is registered as toUpperAdd123, which uppercases each value and appends "123"):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object Udf {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkSession
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("demo01").getOrCreate()
    // 2. Get the SparkContext
    val sc: SparkContext = spark.sparkContext
    // 3. Read the data and operate on it
    val ttRDD: RDD[String] = sc.textFile("file:///F:/Chuanzhi Podcast/Chuanzhi Professional College/Second Semester/34/05-Spark/Data/udf.txt")
    import spark.implicits._
    val UDFDS: Dataset[String] = ttRDD.toDS()
    // Register the custom function
    spark.udf.register("toUpperAdd123", (str: String) => {
      // Process the data according to business needs
      str.toUpperCase + "123"
    })
    UDFDS.createOrReplaceTempView("UDF")
    // Call the function
    spark.sql("SELECT value, toUpperAdd123(value) AS length_10 FROM UDF").show()
    sc.stop()
    spark.stop()
  }
}
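Given the four sample lines in udf.txt, the query should print something like:

+-----+---------+
|value|length_10|
+-----+---------+
|Hello| HELLO123|
|  abc|   ABC123|
|study| STUDY123|
|small| SMALL123|
+-----+---------+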

3. Custom UDAF [understand]
Requirement:
The data content of udaf.json is as follows:

{"name":"Michael","salary":3000}
{"name":"Andy","salary":4500}
{"name":"Justin","salary":3500}
{"name":"Berta","salary":4000}

The goal is to obtain the average salary.
● Extend UserDefinedAggregateFunction and override the following methods:
inputSchema: the type of the input data
bufferSchema: the data type of the intermediate results
dataType: the type of the final returned result
deterministic: whether the same input always produces the same output; generally true
initialize: specifies the initial values
update: each piece of data participates in the operation and updates the intermediate result (update is the per-partition operation)
merge: global aggregation (aggregates the results of each partition)
evaluate: computes the final result
(A worked trace of this flow on the sample data is shown below.)
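To make the flow concrete, here is an illustrative trace (not from the original) assuming the four sample records happen to be split across two partitions:

partition 1: update(salary=3000) -> (sum=3000, total=1)
             update(salary=4500) -> (sum=7500, total=2)
partition 2: update(salary=3500) -> (sum=3500, total=1)
             update(salary=4000) -> (sum=7500, total=2)
merge:       (7500 + 7500, 2 + 2) -> (sum=15000, total=4)
evaluate:    15000.toDouble / 4   -> 3750.0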
● code demonstration:

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object Udaf {
  // A class that computes the average salary: SalaryAvg
  class SalaryAvg extends UserDefinedAggregateFunction {
    // Type of the input data
    override def inputSchema: StructType = {
      StructType(StructField("input", LongType) :: Nil)
    }
    // Type of the intermediate result cache
    override def bufferSchema: StructType = {
      // sum: caches the running total
      // total: caches the running count
      StructType(List(StructField("sum", LongType), StructField("total", LongType)))
    }
    // Type of the returned data
    override def dataType: DataType = {
      DoubleType
    }
    // Whether the same input always produces the same output: true
    override def deterministic: Boolean = {
      true
    }
    /*
    List(1, 2, 3, 4, 5).reduce((a, b) => a + b)
    1. a = 1,  b = 2
    2. a = 3,  b = 3
    3. a = 6,  b = 4
    4. a = 10, b = 5
    5. a = 15
    */
    // Initialize the buffer
    override def initialize(buffer: MutableAggregationBuffer): Unit = {
      // Stores the running total
      buffer(0) = 0L // => a
      // Stores the running count
      buffer(1) = 0L // => b
    }
    // Called once per row within a partition: accumulates that partition's sum and count
    /*
    {"name":"Michael","salary":3000}
    {"name":"Andy","salary":4500}
    {"name":"Justin","salary":3500}
    {"name":"Berta","salary":4000}
    */
    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
      // Accumulate the partition's total amount
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      // Accumulate the partition's row count
      buffer(1) = buffer.getLong(1) + 1
    }
    // Merge the totals and counts of all partitions
    override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
      // Total amount across all partitions
      buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
      // Total count across all partitions
      buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
    }
    // Compute the final result: average salary = total amount / total count
    override def evaluate(buffer: Row): Any = {
      buffer.getLong(0).toDouble / buffer.getLong(1).toDouble
    }
  }

  def main(args: Array[String]): Unit = {
    // 1. Create the SparkSession
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("demo01").getOrCreate()
    val JsonDatas: DataFrame = spark.read.json("file:///F:/Chuanzhi Podcast/Chuanzhi Professional College/Second Semester/34/05-Spark/Information/udaf.json")
    JsonDatas.createOrReplaceTempView("UDAFTable")
    // Register the UDAF function
    spark.udf.register("SalaryAvg", new SalaryAvg)
    // Compute the average salary with the registered SalaryAvg
    spark.sql("select SalaryAvg(salary) from UDAFTable").show()
    // Compare with the built-in avg
    spark.sql("select avg(salary) from UDAFTable").show()
    spark.stop()
  }
}
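Since (3000 + 4500 + 3500 + 4000) / 4 = 3750.0, both queries should print the same average; the output of the custom UDAF should look something like this (the exact column header may vary by Spark version):

+-----------------+
|salaryavg(salary)|
+-----------------+
|           3750.0|
+-----------------+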

Origin blog.csdn.net/qq_45765882/article/details/105561548