hive get_json_object json_tuple json解析详解

1.hive中处理json的两个函数

json是常见的数据接口形式,实际中使用也很广泛,下面我们看看怎么在hive中解析json格式。

hive中常用的解析json格式的函数有两个:

先看看get_json_object

> desc function extended get_json_object;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                      tab_name                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| get_json_object(json_txt, path) - Extract a json object from path                                                                                                                   |
| Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid.  |
| A limited version of JSONPath supported:                                                                                                                                            |
|   $   : Root object                                                                                                                                                                 |
|   .   : Child operator                                                                                                                                                              |
|   []  : Subscript operator for array                                                                                                                                                |
|   *   : Wildcard for []                                                                                                                                                             |
| Syntax not supported that's worth noticing:                                                                                                                                         |
|   ''  : Zero length string as key                                                                                                                                                   |
|   ..  : Recursive descent                                                                                                                                                           |
|   @   : Current object/element                                                                                                                                             |
|   ()  : Script expression                                                                                                                                                           |
|   ?() : Filter (script) expression.                                                                                                                                                 |
|   [,] : Union operator                                                                                                                                                              |
|   [start:end:step] : array slice operator                                                                                                                                           |
|                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
16 rows selected (0.579 seconds)

从上面可以看出,get_json_object输入的参数是两个,json_txt与path。
其中,json_txt就是我们需要解析的json字符串,而path表示json中的"字段"。

再看看json_tuple方法

> desc function extended json_tuple;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                  tab_name                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
1 row selected (0.549 seconds)

从上面的解释不难看出,json_tuple与get_json_object的用法很类似,唯一的区别是,json_tuple输入的是json字符串中的多个"字段",返回一个元组,元组中的所有数据类型为string。

2.最简单的json字符串解析

假设有如下一条测试数据

{"age":18, "name": "lili", "gender": "female"}

如果想要解析age字段

> select get_json_object('{"age":18, "name": "lili", "gender": "female"}', "$.age");
+------+--+
| _c0  |
+------+--+
| 18   |
+------+--+

如果使用json_tuple方法,也可以。

> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age");
+-----+--+
| c0  |
+-----+--+
| 18  |
+-----+--+

如果json串里是个数组,可以采取如下提取的方式

> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0,1,2].age") as age;
+-------------+--+
|     age     |
+-------------+--+
| [18,19,15]  |
+-------------+--+

如果只是想查询数组里第一个对象的age值,可以用下面的提取方法

> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0].age");
+------+--+
| _c0  |
+------+--+
| 18   |
+------+--+

如果想同时提取所有字段,可以用json_tuple方法。

> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age", "name", "gender");
+-----+-------+---------+--+
| c0  |  c1   |   c2    |
+-----+-------+---------+--+
| 18  | lili  | female  |
+-----+-------+---------+--+

3.解析json数组

上面的例子,都是解析单个或者确定个数的json对象。但如果不知道json数组有多少个,或者数组长度不确定时,我们可以采用下面迂回的方式来解析。

还是以上面的数组为例

[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]

如果我们想要解析出所有的age字段,该怎么办?

第一步,将"},{”中的,替换成; ,同时将数组的"[]"符号也去除:

> select regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', '');
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                     _c0                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"};{"age":19, "name": "lucy", "gender": "female"};{"age":15, "name": "mike", "gender": "male"}  |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+

第二部,将字符串用split方法分割,得到一个数组

> select split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;");
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                        _c0                                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| ["{\"age\":18, \"name\": \"lili\", \"gender\": \"female\"}","{\"age\":19, \"name\": \"lucy\", \"gender\": \"female\"}","{\"age\":15, \"name\": \"mike\", \"gender\": \"male\"}"]  |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+

第三步,使用explode方法将数据变成一列

> select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;"));
+-------------------------------------------------+--+
|                       col                       |
+-------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"}  |
| {"age":19, "name": "lucy", "gender": "female"}  |
| {"age":15, "name": "mike", "gender": "male"}    |
+-------------------------------------------------+--+

第四步,使用get_json_object解析即可。

> select get_json_object(json_data, "$.age") from (select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;")) as json_data) virtual_table;
+------+--+
| _c0  |
+------+--+
| 18   |
| 19   |
| 15   |
+------+--+

以上步骤,就完成了对json对象数组的解析。

猜你喜欢

转载自blog.csdn.net/bitcarmanlee/article/details/114307368