Reading external files from a Hive UDF (files outside the jar)

 

I won't go into the details of how to write a UDF here; there is plenty of material about that on the Internet, so I'll just point to a reference and move on.

 
This post mainly covers two points, which were the difficulties I ran into during development:
1. How a UDF reads external resources.
2. Where those external resources should be stored.
 
Why do these two problems come up at all?
 
The purpose of writing a UDF is to add functionality the database does not have. Common functions such as count and sum are built in, but occasionally you hit a complex calculation that the database cannot express directly. One option is to pull the raw data out and post-process it in a separate program, but that is slow. The other option, which actually meets the need, is to develop a corresponding UDF and compute the result directly in the query.
 
I ran into this while computing geographic statistics from user IPs. An IP string cannot be matched against the IP library directly, so it has to be converted first: convert the IP to a bigint and compare it against the start and end ipnum values in the IP library to obtain the province id.
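As a rough sketch of this first approach (an old-style Hive UDF; the class name is my own, not from the original post), the conversion to a bigint might look like this:

```java
import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Sketch: convert a dotted IPv4 string to its numeric (bigint) value so it
 * can be compared against the start/end ipnum columns of the IP library.
 */
public class IpToLong extends UDF {

    public Long evaluate(String ip) {
        if (ip == null) {
            return null;
        }
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            return null;                      // not a dotted IPv4 address
        }
        long result = 0L;
        for (String part : parts) {
            int octet;
            try {
                octet = Integer.parseInt(part);
            } catch (NumberFormatException e) {
                return null;
            }
            if (octet < 0 || octet > 255) {
                return null;
            }
            result = (result << 8) | octet;   // a.b.c.d -> a*256^3 + b*256^2 + c*256 + d
        }
        return result;
    }
}
```

The query then joins this value against the IP library with a range condition on the start/end ipnums, and that join is exactly what turns out to be slow below.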
 
The UDF for this approach is easy to develop: it just takes a String IP and returns a long, and the database does the rest. In practice, though, it is very slow. The main reason is that the log table and the IP library differ by orders of magnitude, so the join suffers from severe data skew. A single day's data often still hadn't finished after two days of processing! That kind of efficiency is unacceptable no matter what.
 
The change
 
After thinking it over for a while, I decided to rewrite the UDF. The earlier approach was constrained mainly by my reluctance to change the format of the IP library, which limited my thinking. The second approach, used after restructuring the IP library, is:
 
Restructured IP library + binary search = new UDF.
 
But the new UDF ran into another problem: the restructured IP library has to be read inside the UDF as an external resource, something I had never dealt with before. [Rant] Searching Baidu turned up nothing workable; Google, as usual, was much more helpful. [End of rant]
 
First, external resources have to be added before the query runs. Use the command add jar [jar file] or add file [file] to register them temporarily in the Hive session.
 
Inside the UDF, reference the file by a plain local path during development, for example: String filepath = "/home/dev/test/test.txt";. Once the file has been added to the Hive session with add file, change it to the relative form String filepath = "./test.txt";, because Hive makes the added file available in each task's working directory.
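Putting the pieces together, here is a rough sketch of what the new UDF could look like. The record format of the restructured IP library (one tab-separated line per range: start num, end num, province id, sorted by start num), the jar name, and the function name are assumptions for illustration; the parts taken from the post are the add file registration, the relative ./test.txt path, and the binary search:

```java
import org.apache.hadoop.hive.ql.exec.UDF;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/*
 * Hive session (run before the query):
 *   add jar /home/dev/test/ip-udf.jar;
 *   add file /home/dev/test/test.txt;
 *   create temporary function ip2province as 'IpToProvince';
 */
public class IpToProvince extends UDF {

    // Each line of ./test.txt is assumed to be: startNum \t endNum \t provinceId,
    // sorted ascending by startNum so that binary search works.
    private long[] starts;
    private long[] ends;
    private String[] provinces;

    private void loadIpLib() throws IOException {
        List<String> lines = new ArrayList<>();
        // "./test.txt" resolves inside the task's working directory once the
        // file has been registered with "add file".
        try (BufferedReader reader = new BufferedReader(new FileReader("./test.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        starts = new long[lines.size()];
        ends = new long[lines.size()];
        provinces = new String[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            String[] cols = lines.get(i).split("\t");
            starts[i] = Long.parseLong(cols[0]);
            ends[i] = Long.parseLong(cols[1]);
            provinces[i] = cols[2];
        }
    }

    public String evaluate(String ip) throws IOException {
        if (starts == null) {
            loadIpLib();              // lazy-load the IP library once per task
        }
        Long num = toLong(ip);
        if (num == null) {
            return null;
        }
        // Binary search: find the last range whose start is <= num.
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] <= num) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        int idx = lo - 1;
        if (idx >= 0 && num <= ends[idx]) {
            return provinces[idx];
        }
        return null;                  // not covered by any range
    }

    private static Long toLong(String ip) {
        if (ip == null) return null;
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) return null;
        long result = 0L;
        for (String part : parts) {
            int octet;
            try {
                octet = Integer.parseInt(part);
            } catch (NumberFormatException e) {
                return null;
            }
            if (octet < 0 || octet > 255) return null;
            result = (result << 8) | octet;
        }
        return result;
    }
}
```

Because each lookup is now a binary search over an in-memory table, the IP library no longer has to be joined against the log table at all, which is what removes the data-skew problem.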

 

Reposted from: http://blog.sina.com.cn/s/blog_b88e09dd01014grp.html
