1: Built-in functions
Hive ships with many built-in functions, similar to the function set that comes with MySQL: nvl(), sum(), avg(), etc.
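For example, built-ins can be called directly in a query. A minimal sketch, assuming a hypothetical table t_user with a nullable comment column and a numeric salary column:

```sql
-- nvl() substitutes a default value when the column is NULL;
-- sum()/avg() aggregate over all input rows
hive> select nvl(comment, 'n/a') from t_user;
hive> select sum(salary), avg(salary) from t_user;
```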
2: Hive custom UDF development
UDF (user-defined function): operates on a single row of data and produces a single row as output (e.g. mathematical functions, string functions).
UDAF (user-defined aggregate function): receives multiple input rows and produces a single output row (e.g. count, max).
Development Example
1. First, develop a Java class that extends UDF and implements an evaluate method:
package cn.itcast.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        // Pass NULL input through unchanged
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
2. Package the class into a jar and upload it to the server.
3. Add the jar to Hive's classpath:
hive>add JAR /home/hadoop/udf.jar;
4. Create a temporary function and associate it with the Java class developed above:
hive> create temporary function tolowercase as 'cn.itcast.bigdata.udf.Lower';
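Once registered, the temporary function can be called like a built-in for the rest of the session. A minimal sketch, assuming a hypothetical table t_user with a name column:

```sql
-- tolowercase() is the temporary function registered above;
-- temporary functions disappear when the session ends
hive> select tolowercase(name) from t_user;
```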
3: Hive TRANSFORM
Hive's TRANSFORM keyword provides a way to call a user-written script from SQL, which is convenient for implementing functionality that Hive lacks without writing a UDF.
Example:
CREATE TABLE u_data_new (
movieid INT,
rating INT,
weekday INT,
userid INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
add FILE weekday_mapper.py;
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (movieid , rate, timestring,uid)
USING 'python weekday_mapper.py'
AS (movieid, rating, weekday,userid)
FROM t_rating;
The content of weekday_mapper.py is as follows:
#!/usr/bin/env python
import sys
import datetime

# Read tab-separated rows from Hive on stdin, replace the unix timestamp
# with a day-of-week number, and write tab-separated rows back to stdout
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    # isoweekday(): 1 = Monday ... 7 = Sunday
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([movieid, rating, str(weekday), userid]))
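The script's per-line logic can be sanity-checked outside Hive before wiring it into TRANSFORM. A minimal sketch that wraps the same transformation in a function (the sample input line below is made up):

```python
import datetime

def map_line(line):
    # Same transformation as weekday_mapper.py, expressed as a testable function
    movieid, rating, unixtime, userid = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([movieid, rating, str(weekday), userid])

# Hypothetical row: movieid=242, rating=3, unixtime=0, userid=196
print(map_line('242\t3\t0\t196'))
```

The exact weekday printed depends on the local timezone, but the output always keeps four tab-separated fields with the timestamp replaced by a number from 1 to 7.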