Reprinted from: https://blog.csdn.net/u012297062/article/details/52227909
UDF: User Defined Function — a user-defined function whose input is a single data record; in terms of implementation it is just an ordinary Scala function.
UDAF: User Defined Aggregate Function — a user-defined aggregate function that acts on a collection of data, allowing custom logic on top of the aggregation operation.
In essence, a UDF is wrapped into an Expression by Catalyst inside Spark SQL, and the input Row is ultimately evaluated by the expression's eval method (this internal Row has nothing to do with the Row in a DataFrame).
Enough talk; let's go straight to the code.
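To make the "ordinary Scala function evaluated per Row" idea concrete, here is a minimal pure-Scala sketch. It does not use Spark at all: `SimpleRow`, `SimpleUdfExpression`, and its `eval` are hypothetical stand-ins that only loosely mirror what Catalyst does internally.

```scala
object UdfSketch {
  // Hypothetical stand-in for an internal row: just a sequence of column values
  type SimpleRow = Seq[Any]

  // Wraps an ordinary Scala function so it can be "eval"-ed against one row,
  // loosely mirroring how Catalyst wraps a UDF into an Expression
  case class SimpleUdfExpression(f: String => Int, columnIndex: Int) {
    def eval(row: SimpleRow): Int = f(row(columnIndex).asInstanceOf[String])
  }

  def main(args: Array[String]): Unit = {
    // The UDF body is a plain Scala function; it runs on one record at a time
    val lengthUdf = SimpleUdfExpression((s: String) => s.length, 0)
    println(lengthUdf.eval(Seq("Spark")))
  }
}
```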
1. Create the Spark configuration object SparkConf, and set the configuration information for the runtime of the Spark program
val conf = new SparkConf() // Create a SparkConf object
conf.setAppName("SparkSQLUDFUDAF") // Set the application name, visible in the monitoring UI while the program runs
// conf.setMaster("spark://DaShuJu-040:7077") // Uncomment to run the program on a Spark cluster
conf.setMaster("local[4]") // Run locally with 4 threads
2. Create SparkContext object and SQLContext object
// Create the SparkContext object, passing in the SparkConf instance to customize
// the specific parameters and configuration of the Spark run
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc) // Build the SQL context
3. Simulate the actual data used
val bigData = Array("Spark", "Spark", "Hadoop", "Spark", "Hadoop", "Spark", "Spark", "Hadoop", "Spark", "Hadoop")
4. Create a DataFrame based on the provided data
val bigDataRDD = sc.parallelize(bigData)
val bigDataRDDRow = bigDataRDD.map(item => Row(item))
val structType = StructType(Array(StructField("word", StringType, true)))
val bigDataDF = sqlContext.createDataFrame(bigDataRDDRow, structType)
5. Register as a temporary table
bigDataDF.registerTempTable("bigDataTable")
6. Register the UDF through SQLContext; in Scala 2.10.x a UDF can accept at most 22 input parameters
sqlContext.udf.register("computeLength", (input: String) => input.length)
// Use the UDF directly in a SQL statement, just like a built-in SQL function
sqlContext.sql("select word, computeLength(word) as length from bigDataTable").show
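Because the registered UDF body is just an ordinary Scala function, its behavior can be checked entirely outside Spark. A minimal sketch, reusing the same sample data as above (the `ComputeLengthSketch` object is illustrative only):

```scala
object ComputeLengthSketch {
  // The same function body that was registered as the "computeLength" UDF
  val computeLength: String => Int = (input: String) => input.length

  def main(args: Array[String]): Unit = {
    val bigData = Array("Spark", "Spark", "Hadoop", "Spark", "Hadoop",
                        "Spark", "Spark", "Hadoop", "Spark", "Hadoop")
    // Mimic "select word, computeLength(word) from bigDataTable" row by row
    bigData.foreach(word => println(s"$word -> ${computeLength(word)}"))
  }
}
```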
7. Register UDAF through SQLContext
sqlContext.udf.register("wordCount", new MyUDAF)
sqlContext.sql("select word, wordCount(word) as count, computeLength(word) as length" +
  " from bigDataTable group by word").show()
8. Implement UDAF according to the template
class MyUDAF extends UserDefinedAggregateFunction {
  // Specifies the type of the input data
  override def inputSchema: StructType =
    StructType(Array(StructField("input", StringType, true)))

  // The type of the intermediate result used during the aggregation
  override def bufferSchema: StructType =
    StructType(Array(StructField("count", IntegerType, true)))

  // The result type the UDAF returns after the computation
  override def dataType: DataType = IntegerType

  // Generally true, to guarantee consistency (the same input always yields the same output)
  override def deterministic: Boolean = true

  // Initializes the result for each group of data before the aggregation starts
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0
  }

  // How the group's aggregate is updated whenever a new value arrives;
  // this is the local aggregation step, equivalent to a Combiner
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Int](0) + 1
  }

  // After the local reduce finishes on each distributed node,
  // a global-level merge operation is performed
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Int](0) + buffer2.getAs[Int](0)
  }

  // Returns the final calculation result of the UDAF
  override def evaluate(buffer: Row): Any = buffer.getAs[Int](0)
}
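The initialize/update/merge/evaluate lifecycle above can be simulated in plain Scala without Spark: update counts rows within one partition, and merge sums the partial counts coming from different partitions. A sketch under those assumptions (the two-partition split of the sample data is chosen arbitrarily for illustration):

```scala
object WordCountSketch {
  // Per-group aggregation within one partition, as in bufferSchema: a single Int count
  def localCount(partition: Seq[String]): Map[String, Int] =
    partition.groupBy(identity).map { case (word, ws) =>
      // initialize: buffer starts at 0; update: +1 for each incoming row
      word -> ws.foldLeft(0)((buffer, _) => buffer + 1)
    }

  // merge: combine partial buffers from two partitions by summing their counts
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    (a.keySet ++ b.keySet).map { w =>
      w -> (a.getOrElse(w, 0) + b.getOrElse(w, 0))
    }.toMap

  def main(args: Array[String]): Unit = {
    // The same ten sample words as bigData, split across two "partitions"
    val part1 = Seq("Spark", "Spark", "Hadoop", "Spark", "Hadoop")
    val part2 = Seq("Spark", "Spark", "Hadoop", "Spark", "Hadoop")
    // evaluate: the merged buffer holds the final count per word
    println(merge(localCount(part1), localCount(part2)))
  }
}
```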