Hive Complex Data Types

The array type

  • Create a table and load data
    When creating the table, declare the column as an array type: loaction array<string>
    Specify the separator between array elements with COLLECTION ITEMS TERMINATED BY ','
hive (wzj)> create table hive_array(
          > name string,
          > loaction array<string>)
          > row format delimited fields terminated by '\t' collection items terminated by ',';
OK
Time taken: 0.426 seconds
hive (wzj)> load data local inpath '/home/wzj/data/hive_array.txt' overwrite into table hive_array;
Loading data to table wzj.hive_array
Table wzj.hive_array stats: [numFiles=1, totalSize=77]
OK
Time taken: 0.98 seconds
0: jdbc:hive2://hadoop001:10000/data_hive> select * from hive_array;
INFO  : OK
+------------------+----------------------------------------------+--+
| hive_array.name  |             hive_array.loaction              |
+------------------+----------------------------------------------+--+
| pk               | ["beijing","shanghai","tianjin","hangzhou"]  |
| jepson           | ["changchu","chengdu","wuhan","beijing"]     |
+------------------+----------------------------------------------+--+
2 rows selected (0.385 seconds)
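Given the delimiters above, the source file /home/wzj/data/hive_array.txt presumably looks like the following (reconstructed from the query output; fields are tab-separated, array elements comma-separated):

pk	beijing,shanghai,tianjin,hangzhou
jepson	changchu,chengdu,wuhan,beijing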
  • Simple usage
    Indexing works much like indexing a Python list
0: jdbc:hive2://hadoop001:10000/data_hive> select name,loaction[0],loaction[2],size(loaction) from hive_array;
INFO  : OK
+---------+-----------+----------+------+--+
|  name   |    _c1    |   _c2    | _c3  |
+---------+-----------+----------+------+--+
| pk      | beijing   | tianjin  | 4    |
| jepson  | changchu  | wuhan    | 4    |
+---------+-----------+----------+------+--+
2 rows selected (0.201 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> 
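Unlike a Python list, negative indexes are not supported, and an out-of-range index returns NULL instead of raising an error. A quick check (a sketch, not from the original session):

select name, loaction[5] from hive_array;
-- loaction has only 4 elements per row, so this column is NULL for every row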
  • Using the array_contains function in a WHERE clause
0: jdbc:hive2://hadoop001:10000/data_hive> select * from hive_array where array_contains(loaction,'wuhan');
INFO  : OK
+------------------+-------------------------------------------+--+
| hive_array.name  |            hive_array.loaction            |
+------------------+-------------------------------------------+--+
| jepson           | ["changchu","chengdu","wuhan","beijing"]  |
+------------------+-------------------------------------------+--+
1 row selected (0.238 seconds)
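size() also works in a WHERE clause, for example to keep only rows whose array has at least four elements (a sketch, not from the original session):

select name from hive_array where size(loaction) >= 4;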

The map type

  • Create a table and load data
    Declare the map type: members map<string,string>
    Specify the separator between map entries and the separator between each key and its value:
    collection items terminated by '#'
    map keys terminated by ':';
0: jdbc:hive2://hadoop001:10000/data_hive> create table hive_map( id int, name string,members map<string,string>, age int)row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';
INFO  : OK
No rows affected (0.289 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> load data local inpath '/home/wzj/data/hive_map.txt' into table hive_map;
INFO  : OK
No rows affected (0.564 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> select * from hive_map;
INFO  : OK
+--------------+----------------+---------------------------------------------------------------+---------------+--+
| hive_map.id  | hive_map.name  |                        hive_map.members                       | hive_map.age  |
+--------------+----------------+---------------------------------------------------------------+---------------+--+
| 1            | zhangsan       | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28            |
| 2            | lisi           | {"father":"mayun","mother":"huangyi","brother":"guanyu"}      | 22            |
| 3            | wangwu         | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} | 29            |
| 4            | mayun          | {"father":"mayongzhen","mother":"angelababy"}                 | 26            |
+--------------+----------------+---------------------------------------------------------------+---------------+--+
4 rows selected (0.173 seconds)
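With fields terminated by ',', map entries by '#', and keys by ':', the source file /home/wzj/data/hive_map.txt presumably looks like this (reconstructed from the query output):

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26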
  • Simple usage
    Values are accessed by key, much like a Python dict
0: jdbc:hive2://hadoop001:10000/data_hive> select name,members['father'] as father,members['mother'] as mother from hive_map;
INFO  : OK
+-----------+--------------+-------------+--+
|   name    |    father    |   mother    |
+-----------+--------------+-------------+--+
| zhangsan  | xiaoming     | xiaohuang   |
| lisi      | mayun        | huangyi     |
| wangwu    | wangjianlin  | ruhua       |
| mayun     | mayongzhen   | angelababy  |
+-----------+--------------+-------------+--+
4 rows selected (0.23 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> select map_keys(members),map_values(members) from hive_map;
INFO  : OK
+--------------------------------+-------------------------------------+--+
|              _c0               |                 _c1                 |
+--------------------------------+-------------------------------------+--+
| ["father","mother","brother"]  | ["xiaoming","xiaohuang","xiaoxu"]   |
| ["father","mother","brother"]  | ["mayun","huangyi","guanyu"]        |
| ["father","mother","sister"]   | ["wangjianlin","ruhua","jingtian"]  |
| ["father","mother"]            | ["mayongzhen","angelababy"]         |
+--------------------------------+-------------------------------------+--+
4 rows selected (0.176 seconds)
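size() works on maps as well, returning the number of key-value pairs (a sketch, not from the original session):

select name, size(members) from hive_map;
-- zhangsan, lisi and wangwu each have 3 entries; mayun has 2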
  • Use array_contains (from the array functions) on map_keys to find everyone who has a brother, and output who that brother is
0: jdbc:hive2://hadoop001:10000/data_hive> select id,name,members['brother'] brother from hive_map where array_contains(map_keys(members),'brother');
INFO  : OK
+-----+-----------+----------+--+
| id  |   name    | brother  |
+-----+-----------+----------+--+
| 1   | zhangsan  | xiaoxu   |
| 2   | lisi      | guanyu   |
+-----+-----------+----------+--+
2 rows selected (0.175 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> 

The struct type

  • Create a table and load data
0: jdbc:hive2://hadoop001:10000/data_hive> create table hive_struct(
. . . . . . . . . . . . . . . . . . . . .> id string,
. . . . . . . . . . . . . . . . . . . . .> info struct<name:string,age:int>
. . . . . . . . . . . . . . . . . . . . .> ) row format delimited fields terminated by '#' 
. . . . . . . . . . . . . . . . . . . . .> collection items terminated by ':' ;
INFO  : OK
No rows affected (0.166 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> load data local inpath '/home/wzj/data/hive_struct.txt' into table hive_struct;
INFO  : OK
0: jdbc:hive2://hadoop001:10000/data_hive> select * from hive_struct;
INFO  : OK
+-----------------+-------------------------------+--+
| hive_struct.id  |       hive_struct.info        |
+-----------------+-------------------------------+--+
| 192.168.1.1     | {"name":"zhangsan","age":40}  |
| 192.168.1.2     | {"name":"lisi","age":50}      |
| 192.168.1.3     | {"name":"wangwu","age":60}    |
| 192.168.1.4     | {"name":"zhaoliu","age":70}   |
+-----------------+-------------------------------+--+
4 rows selected (0.131 seconds)
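With fields terminated by '#' and struct members by ':', /home/wzj/data/hive_struct.txt presumably looks like this (reconstructed from the query output):

192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70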
0: jdbc:hive2://hadoop001:10000/data_hive> select id,info.name,info.age from hive_struct;
INFO  : OK
+--------------+-----------+------+--+
|      id      |   name    | age  |
+--------------+-----------+------+--+
| 192.168.1.1  | zhangsan  | 40   |
| 192.168.1.2  | lisi      | 50   |
| 192.168.1.3  | wangwu    | 60   |
| 192.168.1.4  | zhaoliu   | 70   |
+--------------+-----------+------+--+
4 rows selected (0.161 seconds)
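Struct members can also be used in a WHERE clause with the same dot syntax (a sketch, not from the original session):

select id, info.name from hive_struct where info.age > 50;
-- should return only wangwu (60) and zhaoliu (70)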
  • Create the click_log and ad_list tables
0: jdbc:hive2://hadoop001:10000/data_hive> create table  ad_list(
. . . . . . . . . . . . . . . . . . . . .> ad_id string,
. . . . . . . . . . . . . . . . . . . . .> url string,
. . . . . . . . . . . . . . . . . . . . .> catalogs string
. . . . . . . . . . . . . . . . . . . . .> ) row format delimited fields terminated by '\t';
INFO  : OK
No rows affected (0.129 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> load data local inpath '/home/wzj/data/ad_list.txt' into table ad_list;
INFO  : OK
No rows affected (0.384 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> create table click_log(
. . . . . . . . . . . . . . . . . . . . .> cookie_id string,
. . . . . . . . . . . . . . . . . . . . .> ad_id string,
. . . . . . . . . . . . . . . . . . . . .> time string
. . . . . . . . . . . . . . . . . . . . .> ) row format delimited fields terminated by '\t';
INFO  : OK
No rows affected (0.147 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> load data local inpath '/home/wzj/data/click_log.txt' into table click_log;
INFO  : OK
No rows affected (0.354 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> select * from click_log;
INFO  : OK
+----------------------+------------------+-----------------------------+--+
| click_log.cookie_id  | click_log.ad_id  |       click_log.time        |
+----------------------+------------------+-----------------------------+--+
| 11                   | ad_101           | 2014-05-01 06:01:12.334+01  |
| 22                   | ad_102           | 2014-05-01 07:28:12.342+01  |
| 33                   | ad_103           | 2014-05-01 07:50:12.33+01   |
| 11                   | ad_104           | 2014-05-01 09:27:12.33+01   |
| 22                   | ad_103           | 2014-05-01 09:03:12.324+01  |
| 33                   | ad_102           | 2014-05-02 19:10:12.343+01  |
| 11                   | ad_101           | 2014-05-02 09:07:12.344+01  |
| 35                   | ad_105           | 2014-05-03 11:07:12.339+01  |
| 22                   | ad_104           | 2014-05-03 12:59:12.743+01  |
| 77                   | ad_103           | 2014-05-03 18:04:12.355+01  |
| 99                   | ad_102           | 2014-05-04 00:36:39.713+01  |
| 33                   | ad_101           | 2014-05-04 19:10:12.343+01  |
| 11                   | ad_101           | 2014-05-05 09:07:12.344+01  |
| 35                   | ad_102           | 2014-05-05 11:07:12.339+01  |
| 22                   | ad_103           | 2014-05-05 12:59:12.743+01  |
| 77                   | ad_104           | 2014-05-05 18:04:12.355+01  |
| 99                   | ad_105           | 2014-05-05 20:36:39.713+01  |
+----------------------+------------------+-----------------------------+--+
17 rows selected (0.179 seconds)
0: jdbc:hive2://hadoop001:10000/data_hive> select * from ad_list;
INFO  : OK
+----------------+------------------------+--------------------------------------+--+
| ad_list.ad_id  |      ad_list.url       |           ad_list.catalogs           |
+----------------+------------------------+--------------------------------------+--+
| ad_101         | http://www.google.com  | catalog8|catalog1                    |
| ad_102         | http://www.sohu.com    | catalog6|catalog3                    |
| ad_103         | http://www.baidu.com   | catalog7                             |
| ad_104         | http://www.qq.com      | catalog5|catalog1|catalog4|catalog9  |
| ad_105         | http://sina.com        | NULL                                 |
+----------------+------------------------+--------------------------------------+--+
5 rows selected (0.145 seconds)
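Since ad_list is tab-delimited, /home/wzj/data/ad_list.txt presumably looks like this (reconstructed from the output above; the third field is simply missing on the ad_105 line, which Hive reads back as NULL):

ad_101	http://www.google.com	catalog8|catalog1
ad_102	http://www.sohu.com	catalog6|catalog3
ad_103	http://www.baidu.com	catalog7
ad_104	http://www.qq.com	catalog5|catalog1|catalog4|catalog9
ad_105	http://sina.com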
  • Aggregate the distinct ad IDs visited by each cookie (use the collect_list function instead if duplicates should be kept)
0: jdbc:hive2://hadoop001:10000/data_hive> select cookie_id,collect_set(ad_id) from click_log group by cookie_id;
INFO  : OK
+------------+-------------------------------+--+
| cookie_id  |              _c1              |
+------------+-------------------------------+--+
| 11         | ["ad_101","ad_104"]           |
| 22         | ["ad_102","ad_103","ad_104"]  |
| 33         | ["ad_103","ad_102","ad_101"]  |
| 35         | ["ad_105","ad_102"]           |
| 77         | ["ad_103","ad_104"]           |
| 99         | ["ad_102","ad_105"]           |
+------------+-------------------------------+--+
6 rows selected (41.735 seconds)
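For comparison, collect_list keeps duplicates, so repeated clicks stay visible (a sketch, not from the original session):

select cookie_id, collect_list(ad_id) from click_log group by cookie_id;
-- cookie 11 would show ad_101 three times instead of once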

  • Join query
0: jdbc:hive2://hadoop001:10000/data_hive> select click.cookie_id,click.ad_id,click.amount,ad_list.catalogs from
. . . . . . . . . . . . . . . . . . . . .> (select cookie_id,ad_id ,count(1) amount from click_log group by  cookie_id,ad_id) click
. . . . . . . . . . . . . . . . . . . . .> join ad_list
. . . . . . . . . . . . . . . . . . . . .> on ad_list.ad_id = click.ad_id;
INFO  : OK
+------------------+--------------+---------------+--------------------------------------+--+
| click.cookie_id  | click.ad_id  | click.amount  |           ad_list.catalogs           |
+------------------+--------------+---------------+--------------------------------------+--+
| 11               | ad_101       | 3             | catalog8|catalog1                    |
| 11               | ad_104       | 1             | catalog5|catalog1|catalog4|catalog9  |
| 22               | ad_102       | 1             | catalog6|catalog3                    |
| 22               | ad_103       | 2             | catalog7                             |
| 22               | ad_104       | 1             | catalog5|catalog1|catalog4|catalog9  |
| 33               | ad_101       | 1             | catalog8|catalog1                    |
| 33               | ad_102       | 1             | catalog6|catalog3                    |
| 33               | ad_103       | 1             | catalog7                             |
| 35               | ad_102       | 1             | catalog6|catalog3                    |
| 35               | ad_105       | 1             | NULL                                 |
| 77               | ad_103       | 1             | catalog7                             |
| 77               | ad_104       | 1             | catalog5|catalog1|catalog4|catalog9  |
| 99               | ad_102       | 1             | catalog6|catalog3                    |
| 99               | ad_105       | 1             | NULL                                 |
+------------------+--------------+---------------+--------------------------------------+--+
14 rows selected (52.864 seconds)
  • Columns to rows (dropping the outer keyword also drops the trailing NULL row)
0: jdbc:hive2://hadoop001:10000/data_hive> select ad_id,catalog from ad_list lateral view outer explode(split(catalogs,'\\|')) t as catalog;
INFO  : OK
+---------+-----------+--+
|  ad_id  |  catalog  |
+---------+-----------+--+
| ad_101  | catalog8  |
| ad_101  | catalog1  |
| ad_102  | catalog6  |
| ad_102  | catalog3  |
| ad_103  | catalog7  |
| ad_104  | catalog5  |
| ad_104  | catalog1  |
| ad_104  | catalog4  |
| ad_104  | catalog9  |
| ad_105  | NULL      |
+---------+-----------+--+
10 rows selected (0.145 seconds)
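The join and the lateral view can be combined, for example to count clicks per catalog. This follow-up query is a sketch over the same tables, not from the original session:

select a.catalog, count(*) as clicks
from click_log c
join (
  select ad_id, catalog
  from ad_list
  lateral view outer explode(split(catalogs, '\\|')) v as catalog
) a on c.ad_id = a.ad_id
group by a.catalog;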

Source: blog.csdn.net/wzj_wp/article/details/103762014