Hive get_json_object and json_tuple: JSON parsing explained in detail

1. Two functions for processing JSON in Hive

JSON is a common data interchange format and is widely used in practice. Let's see how to parse it in Hive.

There are two commonly used functions for parsing JSON in Hive: get_json_object and json_tuple.

First, look at get_json_object:

> desc function extended get_json_object;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                      tab_name                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| get_json_object(json_txt, path) - Extract a json object from path                                                                                                                   |
| Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid.  |
| A limited version of JSONPath supported:                                                                                                                                            |
|   $   : Root object                                                                                                                                                                 |
|   .   : Child operator                                                                                                                                                              |
|   []  : Subscript operator for array                                                                                                                                                |
|   *   : Wildcard for []                                                                                                                                                             |
| Syntax not supported that's worth noticing:                                                                                                                                         |
|   ''  : Zero length string as key                                                                                                                                                   |
|   ..  : Recursive descent                                                                                                                                                           |
|   @   : Current object/element                                                                                                                                                      |
|   ()  : Script expression                                                                                                                                                           |
|   ?() : Filter (script) expression.                                                                                                                                                 |
|   [,] : Union operator                                                                                                                                                              |
|   [start:end:step] : array slice operator                                                                                                                                           |
|                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
16 rows selected (0.579 seconds)

As the help text shows, get_json_object takes two parameters: json_txt and path.
json_txt is the JSON string to parse, and path is a JSONPath expression that selects the "field" to extract from it.
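
To make the path parameter concrete, here is a minimal sketch of the child (.) and subscript ([]) operators listed in the help text. The JSON literal and its user, name, and tags keys are made up for illustration and are not part of this article's test data.

```sql
-- Illustrative literal only: the keys user, name and tags are assumptions.
-- Child operator: walk from the root object into a nested object.
select get_json_object('{"user": {"name": "lili", "tags": ["a", "b"]}}', '$.user.name');

-- Subscript operator: take the first element of a nested array.
select get_json_object('{"user": {"name": "lili", "tags": ["a", "b"]}}', '$.user.tags[0]');

-- A key that does not exist yields NULL.
select get_json_object('{"user": {"name": "lili"}}', '$.user.age');
```

The first two queries should return lili and a as plain strings. As the help text states, an invalid JSON string also yields NULL, so a NULL result alone does not tell you whether the key was missing or the input was malformed.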

Now look at json_tuple:

> desc function extended json_tuple;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                  tab_name                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
1 row selected (0.549 seconds)

As the description shows, json_tuple is used much like get_json_object. The difference is that json_tuple takes several field names from the JSON string at once and returns a tuple, and every value in the tuple is a string.

2. Parsing a simple JSON string

Suppose we have the following test data:

{"age":18, "name": "lili", "gender": "female"}

To parse the age field:

> select get_json_object('{"age":18, "name": "lili", "gender": "female"}', "$.age");
+------+--+
| _c0  |
+------+--+
| 18   |
+------+--+

The json_tuple function works as well:

> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age");
+-----+--+
| c0  |
+-----+--+
| 18  |
+-----+--+

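If the JSON string is an array, you can extract a field from several elements as follows. Note that the help text above lists the union operator [,] as unsupported, so the subscript list may not be doing the selecting here; check the behavior on your Hive version before relying on it.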

> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0,1,2].age") as age;
+-------------+--+
|     age     |
+-------------+--+
| [18,19,15]  |
+-------------+--+

To query only the age value of the first object in the array:

> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0].age");
+------+--+
| _c0  |
+------+--+
| 18   |
+------+--+

To extract all fields at once, use json_tuple:

> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age", "name", "gender");
+-----+-------+---------+--+
| c0  |  c1   |   c2    |
+-----+-------+---------+--+
| 18  | lili  | female  |
+-----+-------+---------+--+
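
In practice the JSON usually sits in a table column rather than in a literal. A common pattern, sketched here for a hypothetical table user_log(id, json_str), is to combine json_tuple with LATERAL VIEW so the extracted fields can be selected next to ordinary columns. Because json_tuple returns strings, the sketch casts age explicitly.

```sql
-- Hypothetical table, assumed for illustration only:
--   user_log(id bigint, json_str string)
-- where json_str holds objects like {"age":18, "name": "lili", "gender": "female"}.
select t.id,
       cast(j.age as int) as age,   -- json_tuple returns strings, so cast when a number is needed
       j.name,
       j.gender
from user_log t
lateral view json_tuple(t.json_str, 'age', 'name', 'gender') j as age, name, gender;
```

When several keys are needed from the same string, this form hands the JSON to json_tuple once per row instead of calling get_json_object once per key, which is generally the cheaper option.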

3. Parsing a JSON array

The examples above parse a single JSON object or a known number of objects. If the length of the array is not known in advance, the following roundabout approach can be used.

Take the array above as an example again:

[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]

What if we want to parse out all age fields?

The first step is to replace the comma in "},{" with a semicolon, and also remove the "[" and "]" symbols of the array:

> select regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', '');
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                     _c0                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"};{"age":19, "name": "lucy", "gender": "female"};{"age":15, "name": "mike", "gender": "male"}  |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+

The second step is to split the string on the semicolon with the split function to get an array:

> select split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;");
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                        _c0                                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| ["{\"age\":18, \"name\": \"lili\", \"gender\": \"female\"}","{\"age\":19, \"name\": \"lucy\", \"gender\": \"female\"}","{\"age\":15, \"name\": \"mike\", \"gender\": \"male\"}"]  |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+

The third step is to use the explode function to turn the array elements into rows of a single column:

> select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;"));
+-------------------------------------------------+--+
|                       col                       |
+-------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"}  |
| {"age":19, "name": "lucy", "gender": "female"}  |
| {"age":15, "name": "mike", "gender": "male"}    |
+-------------------------------------------------+--+

The fourth step is to parse each row with get_json_object:

> select get_json_object(json_data, "$.age") from (select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;")) as json_data) virtual_table;
+------+--+
| _c0  |
+------+--+
| 18   |
| 19   |
| 15   |
+------+--+

These steps complete the parsing of the JSON object array.
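
The same four steps can be applied to a table column with LATERAL VIEW. The sketch below assumes a hypothetical table user_batch(id, json_arr) whose json_arr column holds a flat array of objects like the one above. It is an illustration of the technique, not a general JSON array parser.

```sql
-- Hypothetical table, assumed for illustration only:
--   user_batch(id bigint, json_arr string)
-- json_arr is expected to be a flat array such as
--   [{"age":18, "name": "lili", "gender": "female"},{"age":19, ...}]
select t.id,
       get_json_object(e.json_data, '$.age')  as age,
       get_json_object(e.json_data, '$.name') as name
from user_batch t
lateral view explode(
    split(
        regexp_replace(
            -- step 1a: turn the "},{" separators into "};{"
            regexp_replace(t.json_arr, '\\}\\,\\{', '\\}\\;\\{'),
            -- step 1b: strip every "[" and "]"
            '\\[|\\]', ''),
        -- step 2: split on the semicolon; explode (step 3) then emits one row per object
        '\\;')
) e as json_data;
```

The approach relies on the shape of the data. The first pattern matches only when no whitespace separates "}" from "{" (a pattern such as '\\}\\s*,\\s*\\{' tolerates spaces), the second removes brackets everywhere, including those of nested arrays, and a semicolon inside a string value would split an object in two. It is therefore best suited to flat objects with simple values.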

Origin blog.csdn.net/bitcarmanlee/article/details/114307368