UDF and UDAF in Spark SQL

Reprinted from: https://blog.csdn.net/u012297062/article/details/52227909

UDF (User-Defined Function): a user-defined function whose input is a single data record; in implementation it is an ordinary Scala function.
UDAF (User-Defined Aggregate Function): a user-defined aggregate function that acts on a collection of data, allowing custom logic on top of aggregation operations.

Under the hood, Catalyst wraps a UDF in an Expression inside Spark SQL, and the input Row is ultimately evaluated through the eval method (this Row is unrelated to the Row in DataFrame).
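The distinction can be sketched in plain Scala, without any Spark dependency: a UDF is conceptually an ordinary one-record function, while a UDAF folds over a whole collection. The names below are illustrative only, not Spark API:

```scala
object UdfVsUdafSketch {
  // "UDF": operates on a single record, here one String
  val computeLength: String => Int = _.length

  // "UDAF": operates on a whole collection, here counting its elements
  // with a fold, analogous to how a UDAF accumulates into a buffer
  def wordCount(words: Seq[String]): Int =
    words.foldLeft(0)((acc, _) => acc + 1)

  def main(args: Array[String]): Unit = {
    println(computeLength("Spark"))                      // per-record result: 5
    println(wordCount(Seq("Spark", "Hadoop", "Spark")))  // per-collection result: 3
  }
}
```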

Enough talk; on to the code.

1. Create a SparkConf object and set the runtime configuration for the Spark program

val conf = new SparkConf() // Create a SparkConf object
conf.setAppName("SparkSQLUDFUDAF") // Set the application name, visible in the monitoring UI while the program runs
// conf.setMaster("spark://DaShuJu-040:7077") // Uncomment to run the program on a Spark cluster
conf.setMaster("local[4]")

2. Create SparkContext object and SQLContext object

// Create a SparkContext object and customize the specific parameters and configuration information of Spark running by passing in a SparkConf instance   
val sc = new SparkContext(conf)  
val sqlContext = new SQLContext(sc) // build SQL context  

3. Prepare some simulated data

val bigData = Array("Spark", "Spark", "Hadoop", "Spark", "Hadoop", "Spark", "Spark", "Hadoop", "Spark", "Hadoop")

4. Create a DataFrame based on the provided data

val bigDataRDD =  sc.parallelize(bigData)  
val bigDataRDDRow = bigDataRDD.map(item => Row(item))  
val structType = StructType(Array(StructField("word", StringType, true)))  
val bigDataDF = sqlContext.createDataFrame(bigDataRDDRow,structType) 

5. Register as a temporary table

bigDataDF.registerTempTable("bigDataTable") 

6. Register the UDF through SQLContext; in Scala 2.10.x a UDF can accept at most 22 input parameters

sqlContext.udf.register("computeLength", (input: String) => input.length)
// Use the UDF directly in a SQL statement, just like a built-in SQL function
sqlContext.sql("select word, computeLength(word) as length from bigDataTable").show

7. Register the UDAF through SQLContext

sqlContext.udf.register("wordCount", new MyUDAF)
sqlContext.sql("select word, wordCount(word) as count, computeLength(word) as length" +
  " from bigDataTable group by word").show()

8. Implement the UDAF by extending the UserDefinedAggregateFunction template

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyUDAF extends UserDefinedAggregateFunction {
  // Specifies the type of the input data
  override def inputSchema: StructType = StructType(Array(StructField("input", StringType, true)))
  // The type of the intermediate buffer used during the aggregation
  override def bufferSchema: StructType = StructType(Array(StructField("count", IntegerType, true)))
  // The result type returned by the UDAF after the computation
  override def dataType: DataType = IntegerType
  // Whether the function always returns the same result for the same input; usually true
  override def deterministic: Boolean = true
  // Initializes the buffer for each group before aggregation starts
  override def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0 }
  // Called for each new input row of a group during aggregation;
  // this is the local (per-partition) step, analogous to a Combiner
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Int](0) + 1
  }
  // After the local aggregation on each node completes, merges the partial buffers globally
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Int](0) + buffer2.getAs[Int](0)
  }
  // Returns the final result of the UDAF
  override def evaluate(buffer: Row): Any = buffer.getAs[Int](0)
}
