Extracting strings in Hive

  During data analysis, especially when analyzing web pages, part of the job is usually extracting pieces of data from strings, and for that we rely on Hive's built-in functions.

1.split function

  First, the split function. It cuts a string into pieces and returns them. The basic usage is split(string str, string pat): the first parameter, str, is the string to be cut, and the second, pat, is the pattern to cut on. The return value is an array, so to take a single piece you index into it with [number], counting from 0. Look at the example below.

hive> select split('abcdef', 'c') ;
OK
["ab","def"]
Time taken: 8.071 seconds, Fetched: 1 row(s)
-- take only the first part
hive> select split('abcdef', 'c') [0];
OK
ab
Time taken: 5.326 seconds, Fetched: 1 row(s)

  Sometimes a single cut is not enough. In that case, nest several split calls inside one another.
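
  As a rough sketch of nesting, and of one thing to watch out for: Hive treats the second argument of split as a regular expression, so characters such as . and ? have to be escaped when they are meant literally. The sample strings here are made up for illustration.

```sql
-- pat is a regular expression, so a literal dot is written as '\\.'
select split('192.168.0.1', '\\.');
-- expected: ["192","168","0","1"]

-- Nested split: keep what follows '?', then keep what precedes '&'
select split(split('p.php?k1=v1&k2=v2', '\\?')[1], '&')[0];
-- expected: k1=v1
```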

2.substr function

  substr returns the substring of a given length that starts at a given position, and it is identical to the substring function. The basic usage is substr(string A, int start, int len). Positions are counted from 1. Notably, substr(str, 0, 2) and substr(str, 1, 2) return the same result: both start at the first character. Again, look at an example.

hive>  select substr('abcde',1,2);
OK
ab
Time taken: 0.086 seconds, Fetched: 1 row(s)
hive>  select substr('abcde',0,2);
OK
ab
Time taken: 0.089 seconds, Fetched: 1 row(s)
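
  Two further forms are worth knowing, shown here as a small sketch on the same sample string: the length can be left out, and a negative start counts from the end of the string.

```sql
-- No length: everything from position 3 to the end
select substr('abcde', 3);
-- expected: cde

-- Negative start: begin 2 characters before the end
select substr('abcde', -2);
-- expected: de

-- Negative start combined with a length
select substr('abcde', -3, 2);
-- expected: cd
```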

3.parse_url function

  This function is extremely handy: it parses the structure of a URL and returns the part we want. The basic usage is parse_url(string urlString, string partToExtract [, string keyToExtract]), where the valid values of partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE and USERINFO. I will not explain each of them; you can look them up yourself. The key point is this: when the second parameter is QUERY, the third parameter can be used to return the value of one specific query key, which is the most useful way to extract URL parameters. An example illustrates it.

hive> select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1',
    > 'QUERY', 'k1');
OK
v1
Time taken: 0.428 seconds, Fetched: 1 row(s)
hive> 
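
  For reference, here is a short sketch of what some of the other partToExtract values return for the same sample URL. The expected values in the comments follow from how the URL is structured.

```sql
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'PROTOCOL');
-- expected: http

select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST');
-- expected: facebook.com

select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'PATH');
-- expected: /path1/p.php

-- QUERY without a key returns the whole query string
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY');
-- expected: k1=v1&k2=v2

select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'REF');
-- expected: Ref1
```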

4.regexp_extract function

  This function is the ultimate weapon: when the functions above cannot solve the problem, this one usually can. Using it requires some basic knowledge of regular expressions. The basic usage is regexp_extract(string subject, string pattern, int index). The first parameter is the string to be processed, the second is the regular expression, and the third is the index of the capture group to return, where 0 means the entire match. Look at an example:

hive> select regexp_extract(split(split('https://m.baidu.com.cn/7874','baidu.com.cn/')[1],'/')[0],
    > '\\d+',0);
OK
7874
Time taken: 0.059 seconds, Fetched: 1 row(s)
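
  The index parameter matters once the pattern contains capture groups. The sketch below uses a made-up sample string to show each index, then repeats the extraction above with a single regexp_extract call instead of nested split calls. The pattern in the last query is my own and is only one possible way to write it.

```sql
-- Index 0: the whole match
select regexp_extract('foothebar', 'foo(.*?)(bar)', 0);
-- expected: foothebar

-- Index 1: the first capture group
select regexp_extract('foothebar', 'foo(.*?)(bar)', 1);
-- expected: the

-- Index 2: the second capture group
select regexp_extract('foothebar', 'foo(.*?)(bar)', 2);
-- expected: bar

-- Capture the digits that follow the domain directly
select regexp_extract('https://m.baidu.com.cn/7874', 'baidu\\.com\\.cn/(\\d+)', 1);
-- expected: 7874
```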

  With these functions, I believe you should be able to handle most string extraction needs in Hive.

Origin blog.csdn.net/weixin_33933118/article/details/90779523