Applications of the Map data type in Hive

Usage scenarios

Scenario 1 (the author's actual use case)

  • The company has recently been working on business logic for event-tracking (buried-point) data, where extension fields are added frequently, so the Map data type is used to store these extension fields.

Scenario 2 (other business scenarios)

  • Scenario 2.1
    In my project, an intermediate table is generated. For performance, one of its columns should ideally be an array: if the array were exploded so that each element occupied its own row, the data volume would blow up because every other column would be duplicated. First, I wanted to build this array from the upstream table. After searching the documentation for a long time, the only way I found was to cast the source column to STRING, aggregate with wm_concat, and then break the result into an ARRAY with the split function. The original type information is lost, but STRING seemed workable, so I pressed on. A later step needed the last element of the array. I tried indexing with the size function, my_array[size(my_array)], and got an error: the subscript must be a constant, but my array is not fixed-length. Is there a function that can reverse an array? No! In the end I had to give up on arrays...
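The workaround described in Scenario 2.1 can be sketched in Python (hypothetical data, not the author's code): aggregate a column into a delimited string (the wm_concat step), split it back into an array, and then take the last element, which the SQL array subscript with a non-constant index could not do.

```python
# Sketch: simulate the ODPS workaround from Scenario 2.1.
from itertools import groupby
from operator import itemgetter

rows = [("user1", "a"), ("user1", "b"), ("user1", "c"), ("user2", "x")]

result = {}
for key, group in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
    concatenated = ",".join(value for _, value in group)  # wm_concat step
    array = concatenated.split(",")                       # split() step
    result[key] = array[-1]                               # last element

print(result)  # {'user1': 'c', 'user2': 'x'}
```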

  • Scenario 2.2
    My task was to generate a curve for each advertisement: the expected numbers of impressions and clicks as the advertiser's bid goes from low to high. The most natural representation is a data structure that stores bid, impression count, and click count. However, ODPS does not support such a type, so the curve has to be encoded into a string, and every operation must decode it first and re-encode afterwards. It is cumbersome and inefficient, but there is no alternative...
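A minimal sketch of the string round-trip that Scenario 2.2 complains about (the encoding scheme and helper names here are hypothetical, not the author's): each (bid, impressions, clicks) point is serialized into a delimited string and must be decoded before every operation.

```python
# Sketch: encode a bid curve into a string and decode it back,
# standing in for the missing struct/array column type in ODPS.
def encode_curve(points):
    # points: list of (bid, impressions, clicks) tuples
    return "#".join("{}:{}:{}".format(b, i, c) for b, i, c in points)

def decode_curve(text):
    return [tuple(int(x) for x in p.split(":")) for p in text.split("#")]

curve = [(10, 500, 5), (20, 900, 12), (30, 1200, 20)]
encoded = encode_curve(curve)
assert decode_curve(encoded) == curve  # lossless round trip
```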

Tools

  • The author uses Alibaba Cloud MaxCompute, and the data is synchronized from Elasticsearch using Alibaba Cloud Data Integration in script mode. Note: because the source data is of String type and Alibaba Cloud's synchronization tool does not support converting a string to the Map data type, the ODS layer stores the data as String during synchronization, and subsequent processing converts it to the Map data type. If you use Sqoop to synchronize data, the author recommends the same approach.

  • DataX, by contrast, supports converting the String type to the Map data type during serialization.

Suggested format

  • The extension field must use String. The synchronized source-data format is: ( $aaa:11#$bbb:22#$key1:[{},{}]#$key2:[{},{}] )

    • 1.1 $ is used to mark extension fields (to improve readability)
    • 1.2 # is used to split the text data
    • 1.3 $key1:[{},{}] is a complex data format and needs to be handled separately
    • 1.4 Subsequent processing uses the Map data type
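The rules 1.1 to 1.4 above can be sketched as a small Python parser (an illustration only, assuming simple values never contain # or :): simple entries go straight into a map, while the complex $key1:[...] entries are set aside for separate handling, as point 1.3 requires.

```python
# Sketch: parse the suggested "$aaa:11#$bbb:22#$key1:[{},{}]" format.
def parse_extension(text):
    simple, complex_ = {}, {}
    for entry in text.split("#"):            # rule 1.2: '#' splits entries
        key, _, value = entry.partition(":") # first ':' splits key/value
        if value.startswith("["):            # rule 1.3: complex, handle apart
            complex_[key] = value
        else:
            simple[key] = value
    return simple, complex_

simple, complex_ = parse_extension("$aaa:11#$bbb:22#$key1:[{},{}]")
assert simple == {"$aaa": "11", "$bbb": "22"}
assert complex_ == {"$key1": "[{},{}]"}
```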

Sample scenario

  • Create table statement
CREATE TABLE test_employee ( bar MAP<STRING,STRING>);
  • Import Data
Data sample 1
insert into TABLE test_employee
select str_to_map('"$aaa":"11"#"$bbb":"22"#"$key1":[{"shopId":"9033871623535131526","vehicleNum":"1"},{"shopId":"9033871623535131526","vehicleNum":"1"}]#"$key2":[{"shopId":"9033871623535131526","vehicleNum":"2"},{"shopId":"9033871623535131526","vehicleNum":"2"}]',"#",":");

Data sample 2
insert into TABLE test_employee
select str_to_map('$aaa:11#$bbb:22#$key1:[{"shopId":"9033871623535131526","vehicleNum":"1"},{"shopId":"9033871623535131526","vehicleNum":"1"}]',"#",":");

  • Execute SQL (run it yourself to see the output; no shortcuts)
select aa.user_id,bb.col,rr.shopId,rr.vehicleNum
from 
(SELECT user_id
        --arg3
        ,split(regexp_replace(regexp_extract(
                           arg3,
                           '^\\[(.+)\\]$',1),
            '\\}\\,\\{', '}||{'),'\\|\\|'    -- ODPS format
     ) as arg3_str
    -- '\\}\\,\\{', '}||{')  third-party cloud format
FROM    prod_es_user_behavior_data_integration
WHERE   dt = 20200412
and      arg3 is not null
and      user_id in ("677132")
) aa 
-- Reason 1
-- The [{},{},{}] format cannot be parsed directly and needs separate handling
-- Be clear on how regexp_replace and regexp_extract are used
-- The matching pattern above is reusable; note the differences between ODPS and third-party cloud platforms
-- A UDF is not recommended for this step because it is inefficient
lateral view explode(aa.arg3_str) bb  as col
-- Reason 2: row-to-column conversion
-- Use json_tuple to parse the keys
lateral view json_tuple(bb.col,'shopId','vehicleNum') rr as shopId,vehicleNum
;
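The regexp_extract / regexp_replace / split chain in subquery aa can be mirrored with Python's re module (sample data shortened, not the production rows): strip the outer brackets, mark the },{ separators with }||{, then split on || so each JSON object lands in its own array slot ready for the explode.

```python
# Sketch: why "[{},{},{}]" needs preprocessing before explode.
import re

arg3 = '[{"shopId":"1","vehicleNum":"1"},{"shopId":"2","vehicleNum":"2"}]'

inner = re.search(r'^\[(.+)\]$', arg3).group(1)  # regexp_extract step
marked = inner.replace('},{', '}||{')            # regexp_replace step
arg3_str = marked.split('||')                    # split step

assert arg3_str == ['{"shopId":"1","vehicleNum":"1"}',
                    '{"shopId":"2","vehicleNum":"2"}']
```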

  • The size function returns the number of keys as an int
select size(bar)  from test_employee;
  • The map_keys function returns the keys as an array
select map_keys(bar) from test_employee;

  • The map_values function returns the values as an array
select map_values(bar) from test_employee;
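As an analogy only (not Hive's implementation), the three inspection functions above behave like the corresponding Python dict operations:

```python
# Sketch: size / map_keys / map_values as dict operations.
bar = {"$aaa": "11", "$bbb": "22"}

assert len(bar) == 2                            # size(bar)
assert list(bar.keys()) == ["$aaa", "$bbb"]     # map_keys(bar)
assert list(bar.values()) == ["11", "22"]       # map_values(bar)
```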

  • str_to_map(str, delimiter1, delimiter2)
Delimiter1 splits the text into entries (each entry is one key-value string)
Delimiter2 splits each entry into key and value
Example 1
select str_to_map('"aaa":"11"&"bbb":"22"', ':');
Incorrect output: {"11"&"bbb":NULL, "22":NULL, "aaa":NULL}

Example 2
select str_to_map('"aaa":"11"&"bbb":"22"', '&',':');
Correct output: {"aaa":"11", "bbb":"22"}

Example 3
select str_to_map('"aaa":"11","bbb":"22","key1":"[{},{},{}]"', ',',':'); -- {"aaa":"11", "bbb":"22", "key1":"[{}, {}:NULL, {}]":NULL}
Incorrect output: complex data formats cannot be parsed
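A rough Python stand-in for str_to_map (a hypothetical helper, not Hive's actual implementation) makes the three examples easier to reason about: split on delimiter1 first, then split each piece at the first delimiter2. This reproduces Example 2's correct output, and it shows why Example 3 breaks: the commas inside "[{},{},{}]" are ordinary text to the function, so the value is shattered at the first split.

```python
# Sketch: approximate str_to_map(text, delim1, delim2) semantics.
def str_to_map(text, delim1, delim2):
    result = {}
    for entry in text.split(delim1):           # delimiter1: split entries
        key, sep, value = entry.partition(delim2)  # delimiter2: split K/V
        result[key] = value if sep else None   # no delimiter2 -> NULL value
    return result

# Example 2, correct usage:
assert str_to_map('"aaa":"11"&"bbb":"22"', '&', ':') == {
    '"aaa"': '"11"', '"bbb"': '"22"'}
```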

Special usage notes

  • Note 1

  • Note 2

  • Note 3

    • Use lateral view + json_tuple to parse the keys. Note the difference between get_json_object and json_tuple: json_tuple can parse multiple fields in one call, while get_json_object parses only one field per call.
    • explode(array) takes an array as input and turns one row into multiple rows
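The json_tuple versus get_json_object distinction above can be illustrated with Python's json module (both helpers below are hypothetical stand-ins, not the Hive functions themselves):

```python
# Sketch: json_tuple extracts several keys in one pass;
# get_json_object pulls one key per call.
import json

col = '{"shopId": "9033871623535131526", "vehicleNum": "1"}'

def json_tuple(text, *keys):            # stand-in for Hive json_tuple
    obj = json.loads(text)
    return tuple(obj.get(k) for k in keys)

def get_json_object(text, path):        # stand-in, '$.key' path syntax
    return json.loads(text).get(path.lstrip("$."))

assert json_tuple(col, "shopId", "vehicleNum") == ("9033871623535131526", "1")
assert get_json_object(col, "$.vehicleNum") == "1"
```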

Benefits

  • Complex data types can be nested (the extension part can be filtered)

  • Exposure to low-frequency (rarely used) functions

Origin blog.csdn.net/a18302465887/article/details/105619727