Using UDFs in Apache Spark

User-defined functions (UDFs) are a key feature of most SQL environments, used mainly to extend the system's built-in functionality. UDFs allow developers to enable new functions in higher-level languages such as SQL by abstracting their lower-level language implementations. Apache Spark is no exception, and offers a wide range of options for integrating UDFs with Spark SQL workflows.

In this post, we'll review simple examples of Apache Spark UDF and UDAF (user-defined aggregate function) implementations in Python, Java, and Scala. We'll also discuss the important UDF API features and integration points, including their current availability between releases. To wrap up, we'll cover some important performance considerations so that you understand the trade-offs involved in choosing to use UDFs in your application.

Spark SQL UDFs

UDFs transform values from a single row within a table to produce a single corresponding output value per row. For example, most SQL environments provide an UPPER function returning an uppercase version of the string supplied as input.

In Spark SQL, custom functions can be defined and registered as UDFs, with an associated alias that is made available to SQL queries. As a simple example, we'll define a UDF to convert temperatures in the following JSON data from degrees Celsius to degrees Fahrenheit:
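For illustration, assume temperatures.json is a small JSON-lines file of the following shape (the city, avgLow, and avgHigh field names and values are illustrative assumptions carried through the examples below):

```json
{"city": "St. John's", "avgHigh": 8.7,  "avgLow": 0.6}
{"city": "Toronto",    "avgHigh": 12.1, "avgLow": 2.4}
{"city": "Vancouver",  "avgHigh": 13.9, "avgLow": 6.1}
```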
The sample code below registers our conversion function under the SQL alias CTOF, then uses it from a SQL query to convert the temperatures for each city. For brevity, creation of the SQLContext object and other boilerplate code is kept out of the snippets; the complete listings are available from the repository linked at the end of the post.

Python
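A minimal PySpark sketch, written against the Spark 1.6-era API used throughout this post and the assumed temperatures.json data above:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.json("temperatures.json")
df.registerTempTable("citytemps")

# Register the lambda under the SQL alias CTOF; returnType defaults to
# StringType, so we pass DoubleType explicitly
sqlContext.registerFunction("CTOF",
                            lambda degreesCelsius: (degreesCelsius * 9.0 / 5.0) + 32.0,
                            DoubleType())

sqlContext.sql("SELECT city, CTOF(avgLow) AS avgLowF, CTOF(avgHigh) AS avgHighF "
               "FROM citytemps").show()
```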
Scala
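A corresponding Scala sketch (assuming an existing sqlContext, as provided by spark-shell):

```scala
val df = sqlContext.read.json("temperatures.json")
df.registerTempTable("citytemps")

// Register the anonymous function under the SQL alias CTOF;
// the return type is inferred from the function's signature
sqlContext.udf.register("CTOF", (degreesCelsius: Double) => (degreesCelsius * 9.0 / 5.0) + 32.0)

sqlContext.sql("SELECT city, CTOF(avgLow) AS avgLowF, CTOF(avgHigh) AS avgHighF FROM citytemps").show()
```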
Java
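And a Java sketch of the same conversion, using the UDF1 interface discussed below (again assuming an existing sqlContext):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

DataFrame df = sqlContext.read().json("temperatures.json");
df.registerTempTable("citytemps");

// UDF1 handles a single input parameter; the return type is declared explicitly
sqlContext.udf().register("CTOF", new UDF1<Double, Double>() {
    @Override
    public Double call(Double degreesCelsius) {
        return (degreesCelsius * 9.0 / 5.0) + 32.0;
    }
}, DataTypes.DoubleType);

sqlContext.sql("SELECT city, CTOF(avgLow) AS avgLowF, CTOF(avgHigh) AS avgHighF FROM citytemps").show();
```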
Note that Spark SQL defines UDF1 through UDF22 classes, supporting UDFs with up to 22 input parameters. Our examples above use UDF1 to handle single temperature values as input. Barring updates to the Apache Spark source code, using arrays or structs as parameters may be helpful for applications requiring more than 22 inputs; from a style perspective, this approach may also be preferred if you find yourself reaching for UDF6 or higher.

Spark SQL UDAFs

User-defined aggregate functions (UDAFs) operate on multiple rows at once and return a single value as a result, and are typically used together with a GROUP BY statement (as are the built-in COUNT or SUM). To keep the example simple, we'll implement a UDAF under the alias SUMPRODUCT that calculates the retail value of all vehicles in stock, grouped by make, given a price and an integer stock quantity:
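Assume inventory.json is a JSON-lines file along these lines (Make, RetailValue, and Stock are assumed field names, matching the query used below; the values are illustrative):

```json
{"Make": "Honda",  "Model": "Pilot",   "RetailValue": 32145, "Stock": 4}
{"Make": "Honda",  "Model": "Civic",   "RetailValue": 19575, "Stock": 11}
{"Make": "Toyota", "Model": "Corolla", "RetailValue": 19600, "Stock": 9}
```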
Currently, implementing a UDAF in Apache Spark means extending the UserDefinedAggregateFunction class, which is supported in Scala and Java. Once defined, we can instantiate and register our SumProductAggregateFunction UDAF object and use it from SQL queries under the alias SUMPRODUCT, much as we did with the CTOF UDF in the previous example.

Scala
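A sketch of the UDAF and its registration (the input schema and field names follow the assumed inventory.json data above; an existing sqlContext is assumed):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumProductAggregateFunction extends UserDefinedAggregateFunction {
  // Input for each row: a price and a quantity in stock
  def inputSchema: StructType =
    new StructType().add("price", LongType).add("quantity", LongType)
  // Intermediate state: the running sum of price * quantity
  def bufferSchema: StructType = new StructType().add("total", LongType)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer.update(0, 0L)                                        // zero the running total

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer.update(0, buffer.getLong(0) + input.getLong(0) * input.getLong(1))

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1.update(0, buffer1.getLong(0) + buffer2.getLong(0))  // combine partial sums

  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// Instantiate and register under the SQL alias SUMPRODUCT, then query
val df = sqlContext.read.json("inventory.json")
df.registerTempTable("inventory")
sqlContext.udf.register("SUMPRODUCT", new SumProductAggregateFunction)
sqlContext.sql("""SELECT Make, SUMPRODUCT(RetailValue, Stock) AS InventoryValuePerMake
                  FROM inventory GROUP BY Make""").show()
```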
Other UDF Support in Apache Spark

Spark SQL supports integration of existing Hive implementations of UDFs, UDAFs, and UDTFs (written in Java or Scala). As a brief aside, UDTFs (user-defined table functions) can return multiple columns and rows; they are beyond the scope of this post, but we may cover them in a future blog post. Integrating existing Hive UDFs is a valuable alternative to re-implementing and registering the same logic using the approaches highlighted earlier, and from a performance standpoint it is also helpful for PySpark, as discussed in the next section. Hive functions can be accessed from a HiveContext by including the JAR file containing the Hive UDF implementation via spark-submit's --jars option, and then declaring the function with CREATE TEMPORARY FUNCTION (as one would do in Hive [1] to include a UDF), as in the following example:

Hive UDF defined in Java
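A sketch of such a Hive UDF (the package name com.example.hiveudf is a placeholder; a classic Hive UDF is a class extending Hive's UDF base class with one or more evaluate methods):

```java
package com.example.hiveudf;  // placeholder package name

import org.apache.hadoop.hive.ql.exec.UDF;

// Hive resolves the evaluate() method by its argument types at call time
public class CTOF extends UDF {
  public Double evaluate(Double degreesCelsius) {
    return (degreesCelsius * 9.0 / 5.0) + 32.0;
  }
}
```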
Accessing the Hive UDF from Python
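And a PySpark sketch that declares and uses it (assuming the UDF's JAR was passed via spark-submit --jars, and the placeholder class name above):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

df = sqlContext.read.json("temperatures.json")
df.registerTempTable("citytemps")

# Declare the Hive UDF from its JAR, then call it like any SQL function
sqlContext.sql("CREATE TEMPORARY FUNCTION CTOF AS 'com.example.hiveudf.CTOF'")
sqlContext.sql("SELECT city, CTOF(avgLow) AS avgLowF, CTOF(avgHigh) AS avgHighF "
               "FROM citytemps").show()
```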
Note that unlike the UDF and UDAF implementations described above, Hive UDFs can only be invoked using Apache Spark's SQL query language; that is, they cannot be used with the DataFrame API's domain-specific language (DSL).

Alternatively, UDFs implemented in Scala and Java can be accessed from PySpark by including the implementation JAR file (via spark-submit's --jars option) and then accessing the UDF definition through private reference objects on SparkContext for the JVM and the underlying Scala or Java UDF implementations. Holden Karau discusses this approach in a wonderful talk [2]. Note that the Apache Spark private variables used in this technique are not officially intended for end users. The approach brings the additional benefit of making UDAFs (which currently must be defined in Java or Scala) usable from PySpark, as the following example using the SUMPRODUCT UDAF defined earlier in Scala demonstrates:

Scala UDAF definition
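A sketch of a registration hook on the Scala side (the object and package names are illustrative; SumProductAggregateFunction is the class defined in the UDAF section above):

```scala
import org.apache.spark.sql.SQLContext

// A small helper object whose registration hook can be invoked from
// PySpark through the py4j JVM gateway
object ScalaUDAFFromPythonExample {
  def registerUdf(sqlContext: SQLContext): Unit = {
    sqlContext.udf.register("SUMPRODUCT", new SumProductAggregateFunction)
  }
}
```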
Accessing the Scala UDAF from PySpark
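And the PySpark side, which reaches through py4j into the JVM to register the Scala UDAF (the package path mirrors the illustrative name above; note that _ssql_ctx and _jvm are private attributes, not a supported public API):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.json("inventory.json")
df.registerTempTable("inventory")

# Grab the JVM-side SQLContext and hand it to the Scala registration hook
scala_sql_context = sqlContext._ssql_ctx
sc._jvm.com.example.udaf.ScalaUDAFFromPythonExample.registerUdf(scala_sql_context)

sqlContext.sql("SELECT Make, SUMPRODUCT(RetailValue, Stock) AS InventoryValuePerMake "
               "FROM inventory GROUP BY Make").show()
```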
UDF-related features are continually being added to Apache Spark with each release; version 2.0, for example, adds support for UDFs in R. For reference, the table below summarizes the release in which each of the key features discussed in this post first became available:
[Table: Spark releases in which the UDF-related features discussed in this post (Python, Scala, Java, and R UDFs; Scala and Java UDAFs; Hive UDF integration) first became available]

Performance Considerations

It is important to understand the performance implications of Apache Spark's UDF features. Python UDFs such as our CTOF function, for example, require data to be serialized between the executor JVM and the Python interpreter running the UDF logic, which significantly reduces performance compared with UDF implementations in Java or Scala. Potential solutions to alleviate this serialization bottleneck include:

  1. Accessing a Hive UDF from PySpark, as described in the previous section. The Java UDF implementation is accessible directly by the executor JVM, so no data crosses into the Python interpreter. Note again that this approach only provides access to the UDF from the Apache Spark SQL query language.
  2. Accessing a Scala or Java UDF implementation from PySpark using the JVM-reference technique, as shown in the Scala UDAF example above.

Generally speaking, UDF logic should be as lean as possible, given that it will be called for each row. As an example, a step in the UDF logic taking 100 milliseconds to complete will quickly lead to major performance issues when scaled to 1 billion rows.
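To put that in perspective: 10^9 rows at 0.1 seconds per row is 10^8 seconds of CPU time, roughly three years on a single core; even spread perfectly across 1,000 executor cores, that is still more than a day spent inside the UDF alone.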

Another important component of Spark SQL to be aware of is the Catalyst query optimizer. Its capabilities are expanding with every release, and it can often deliver dramatic performance improvements for Spark SQL queries; however, arbitrary UDF implementation code may not be well understood by Catalyst (although future analysis of UDF bytecode [3] is being considered to address this). As such, using Apache Spark's built-in SQL query functions will often deliver the best performance and should be the first approach considered whenever introducing a UDF can be avoided. Advanced users looking to tie their code more closely to Catalyst can refer to the talk [4] by Chris Fregly, in which he uses Expression.genCode to optimize UDF code, as well as the new experimental Apache Spark 2.0 feature [5] that provides a pluggable API for custom Catalyst optimizer rules.
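As an illustration, the CTOF conversion from earlier needs no UDF at all: the same arithmetic can be expressed with built-in SQL operators, leaving Catalyst free to analyze and optimize the whole query plan (a sketch reusing the assumed citytemps table):

```python
# No Python UDF involved: the expression is evaluated entirely in the JVM,
# with no serialization to the Python interpreter, and stays visible to Catalyst
sqlContext.sql("SELECT city, avgLow * 9.0 / 5.0 + 32.0 AS avgLowF FROM citytemps").show()
```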

Conclusion

UDFs are a very useful tool when Spark SQL's built-in functionality needs to be extended. This post walked through example UDF and UDAF implementations and discussed the integration steps for leveraging existing Java Hive UDFs inside Spark SQL. UDFs can be implemented in Python, Scala, Java and (as of Spark 2.0) R, while UDAFs can be implemented in Scala and Java. When using UDFs from PySpark, the cost of data serialization must be factored in, and the two strategies discussed above should be considered to address it. Finally, we covered Spark SQL's Catalyst optimizer and the performance reasons for sticking with built-in SQL functions before introducing UDFs into a solution.

Code: https://github.com/curtishoward/sparkudfexamples
CDH Version: 5.8.0 (Apache Spark 1.6.0)
