(Repost) Parsing a JSON array in Hive

Parsing ordinary JSON in Hive is easy: just use get_json_object.
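
As a minimal sketch of the simple case, the query below pulls one field out of a single JSON object. The sample object is borrowed from the data used later in this post. Omitting the FROM clause assumes Hive 0.13 or later; on older versions, select from a one-row helper table instead.

```sql
-- Extract one field from a plain JSON object using a JSONPath-style expression.
select get_json_object('{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"}', '$.ssid') as ssid;
-- ssid
-- MERCURY_05C4
```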

But if the field is a JSON array such as

[{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"},{"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}],

calling get_json_object on it directly returns NULL. In that case, parsing the array is hard for anyone who does not know how to write a UDF. Fortunately, Hive has a built-in explode function that makes it possible. Here is how to use explode.

explode(array)

select explode(array('A','B','C')) as col;
select tf.* from (select 0 from dual) t lateral view explode(array('A','B','C')) tf as col;
Result:
col 
C
B
A

Function description: explode takes an array as its argument and turns a column into rows. If the array has 3 elements, 3 rows are returned, each holding one array element, as shown above.

Back to
[{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"},
 {"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}],
how do we parse out bssid? The idea is to use explode to turn the original data into 2 rows
({"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"} and
 {"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}),
and then parse each row with get_json_object.

The code is as follows:
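select ss.col
from (
      -- Keep the JSON literal on one line with no whitespace between elements:
      -- '.' does not match a newline, and the '},{' pattern does not allow spaces.
      select
      split(regexp_replace(regexp_extract(
                           '[{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"},{"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}]',
                           '^\\[(.+)\\]$',1),   -- 1. strip the outer [ ]
            '\\}\\,\\{', '\\}\\|\\|\\{'),       -- 2. turn },{ into }||{
     '\\|\\|'                                   -- 3. split on ||
     ) as str
from dual) pp
lateral view explode(pp.str) ss as col ;        -- 4. one row per array element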

Result:
col 
{"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}
{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"}
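
The query above stops at the exploded rows. To finish the stated goal of extracting bssid, one possible continuation is to wrap the exploded column in get_json_object. This is only a sketch: it reuses the same sample data and assumes, like the queries above, that dual is a one-row helper table.

```sql
-- Explode the JSON array string into one row per element,
-- then read the bssid field from each element.
select get_json_object(ss.col, '$.bssid') as bssid
from (
      select split(
               regexp_replace(
                 regexp_extract(
                   '[{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"},{"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}]',
                   '^\\[(.+)\\]$', 1),            -- strip the outer [ ]
                 '\\}\\,\\{', '\\}\\|\\|\\{'),    -- turn },{ into }||{
               '\\|\\|') as str                   -- split on ||
      from dual) pp
lateral view explode(pp.str) ss as col;
-- One row per array element: 6C:59:40:21:05:C4 and AC:9C:E4:04:EE:52
```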


Note: Because the original data is a string (not a real array type), explode cannot be applied to it directly.
1. regexp_extract('xxx','^\\[(.+)\\]$',1) strips the left and right brackets from the JSON array to be parsed. Note that each bracket must be escaped with two backslashes, as in \\[.
2. regexp_replace('xxx','\\}\\,\\{', '\\}\\|\\|\\{') turns the comma between array elements into two vertical bars ||. You can choose any separator as long as it does not appear inside the array elements. The pattern matches only a literal },{ so the input must have no whitespace between elements.
3. split turns the string into an array, using the separator defined above.
4. lateral view explode processes the array returned in step 3.
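
To make steps 1 to 3 easier to follow, here is a small sketch that shows the intermediate value after each step. The shortened sample array [{"a":1},{"a":2}] is made up for illustration, and dual is again assumed to be a one-row helper table.

```sql
-- Each column applies one step to the output of the previous step.
select
  regexp_extract('[{"a":1},{"a":2}]', '^\\[(.+)\\]$', 1)         as step1,  -- {"a":1},{"a":2}
  regexp_replace('{"a":1},{"a":2}', '\\}\\,\\{', '\\}\\|\\|\\{') as step2,  -- {"a":1}||{"a":2}
  size(split('{"a":1}||{"a":2}', '\\|\\|'))                      as step3   -- 2 (array elements)
from dual;
```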
In addition, json_tuple in Hive is more convenient for parsing JSON than get_json_object.

select ss.col,rr.appid,rr.ssid,rr.bssid
from (
      -- Keep the JSON literal on one line with no leading or embedded whitespace;
      -- otherwise '^\\[' and the '},{' pattern fail to match.
      select split(regexp_replace(regexp_extract(
                        '[{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"},{"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}]',
                        '^\\[(.+)\\]$',1),
             '\\}\\,\\{', '\\}\\|\\|\\{'),
       '\\|\\|'
     ) as str
from dual) pp
lateral view explode(pp.str) ss as col 
lateral view json_tuple(ss.col,'appid','ssid','bssid') rr as appid,ssid,bssid;

Result:
col                                                                  appid  ssid          bssid
{"bssid":"AC:9C:E4:04:EE:52","appid":"10003","ssid":"and-Business"}  10003  and-Business  AC:9C:E4:04:EE:52
{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"}                  \N     MERCURY_05C4  6C:59:40:21:05:C4

The \N in the second row is a NULL: that element has no appid field. json_tuple can parse multiple fields in a single call, whereas get_json_object parses only one field per call.
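
The difference can be seen side by side on a single JSON object. The sketch below extracts the same two fields both ways. The column aliases are illustrative names, and dual is assumed to be a one-row helper table as in the queries above.

```sql
-- Same two fields, extracted two ways from one JSON object.
select get_json_object(t.j, '$.ssid')  as ssid_a,   -- one call per field
       get_json_object(t.j, '$.bssid') as bssid_a,
       rr.ssid                         as ssid_b,   -- both fields from one json_tuple call
       rr.bssid                        as bssid_b
from (select '{"bssid":"6C:59:40:21:05:C4","ssid":"MERCURY_05C4"}' as j from dual) t
lateral view json_tuple(t.j, 'ssid', 'bssid') rr as ssid, bssid;
```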

Origin http://43.154.161.224:23101/article/api/json?id=325191823&siteId=291194637