Sword Finger Data Warehouse-Hive03

1. Review of the last lesson

2. Hive03

3. The use of various functions in Hive

1. Review of the last lesson

  • https://blog.csdn.net/SparkOnYarn/article/details/105163737
  • Rename table; drop vs. truncate when deleting; the difference between external and internal tables. A partition table maps to a single-level or multi-level directory on HDFS, and partitions are either static or dynamic (important). Load data (from the Linux local filesystem or an HDFS directory), with or without overwrite (with overwrite the data is replaced; without it, it is appended). insert overwrite (the target table structure must exist in advance) vs. create table emp3 as select ... (the table does not need to exist in advance). Data export (to the Linux local filesystem or an HDFS directory); row format sets the field delimiter, and a grep pipe can also be used to filter the data. Some simple Hive queries; aggregation functions (many rows in, one row out); group by with aggregate functions; and which statements trigger MapReduce and which do not. An insert overwrite vs. CTAS sketch follows below.
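
As a quick sketch of the insert overwrite vs. create table ... as select difference recapped above (emp_bak is an illustrative table name, not from the lesson):

-- insert overwrite: the target table must already exist
create table emp_bak like emp;
insert overwrite table emp_bak select * from emp;

-- create table ... as select: the table is created by the statement itself
create table emp3 as select * from emp;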

2. Hive03

2.1, Various joins in Hive

Various joins:
inner join: only returns rows that match the join condition

left join: the left table is the driving table; all of its rows are kept
right join: the right table is the driving table; all of its rows are kept
full join: full outer join; rows from both sides are kept

1. Requirement: using the emp table and the dept table, match the rows with the same department number:

select
e.empno,e.ename,e.deptno,d.dname
from emp as e
inner join dept as d
on e.deptno=d.deptno;

select e.empno,e.ename,e.deptno,d.dname from emp as e inner join dept as d on e.deptno=d.deptno;

Note: or is not supported in the join condition, but and is.
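
For comparison, a minimal sketch (not from the original lesson, output omitted) of the other join types against the same tables:

select e.empno, e.ename, d.dname from emp as e left join dept as d on e.deptno = d.deptno;
select e.empno, e.ename, d.dname from emp as e right join dept as d on e.deptno = d.deptno;
select e.empno, e.ename, d.dname from emp as e full join dept as d on e.deptno = d.deptno;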

2.2, Connecting to HiveServer2 with beeline

1. The following prompt appears when starting Hive:
[hadoop@hadoop001 ~]$ hive
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

  • https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
    // Hive CLI parameters are outdated, it is recommended to use beeline
    HiveServer2 is abbreviated HS2, and beeline is its client. When starting beeline it is recommended to cd into $HIVE_HOME/bin first, because $SPARK_HOME/bin also contains a beeline; otherwise which beeline runs depends on the order of the directories in the PATH environment variable.

How to use beeline + HiveServer2?
1. Start hiveServer2 first:
[hadoop@hadoop001 bin]$ ./hiveserver2

2. Then start beeline:
[hadoop@hadoop001 bin]$ ./beeline -u jdbc:hive2://hadoop001:10000/ruozedata_hive -n hadoop
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/java/jdk1.8.0_45/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
scan complete in 3ms
Connecting to jdbc:hive2://hadoop001:10000/ruozedata_hive
Connected to: Apache Hive (version 1.1.0-cdh5.16.2)
Driver: Hive JDBC (version 1.1.0-cdh5.16.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.16.2 by Apache Hive
0: jdbc:hive2://hadoop001:10000/ruozedata_hiv>

In production these two processes should not be started in the foreground; once the terminal is killed they go down with it, so it is recommended to start them in the background.
Port 10000 may already be occupied in production. How do we change it?
  • [hadoop@hadoop001 bin]$ ./hiveserver2 --hiveconf hive.server2.thrift.port=10086, then restart the beeline connection on the new port.
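
A minimal background-start sketch (the nohup command and log file name are illustrative, not from the original notes):

[hadoop@hadoop001 bin]$ nohup ./hiveserver2 --hiveconf hive.server2.thrift.port=10086 > ~/hiveserver2.log 2>&1 &
[hadoop@hadoop001 bin]$ ./beeline -u jdbc:hive2://hadoop001:10086/ruozedata_hive -n hadoop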

2.3. Complex data types in Hive:

2.3.1, Array data type

1. All elements inside an array<> must have the same data type.

hive (ruozedata_hive)> create table hive_array(name string,work_locations array<string>) row format delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY ',';
OK
Time taken: 0.364 seconds

hive (ruozedata_hive)> load data local inpath '/home/hadoop/data/hive03/hive_array.txt' overwrite into table hive_array;
Loading data to table ruozedata_hive.hive_array
Table ruozedata_hive.hive_array stats: [numFiles=1, totalSize=77]
OK
Time taken: 0.676 seconds

hive (ruozedata_hive)> select * from hive_array;
OK
hive_array.name hive_array.work_locations
pk ["beijing","shanghai","tianjin","hangzhou"]
jepson ["changchu","chengdu","wuhan","beijing"]
Time taken: 0.295 seconds, Fetched: 2 row(s)

2. Regarding the value of the array type, the array index starts from 0:
hive (ruozedata_hive)> select name, work_locations[0] from hive_array;
OK
name _c1
pk beijing
jepson changchu

// How many cities does each person work in?
jdbc:hive2://hadoop001:10086/ruozedata_hiv> select name, size(work_locations) from hive_array;
+---------+------+--+
|  name   | _c1  |
+---------+------+--+
| pk      | 4    |
| jepson  | 4    |
+---------+------+--+

// Find the people who work in tianjin:
select * from hive_array where array_contains(work_locations, 'tianjin');

For complex data types we need to know both how to store the data and how to retrieve it:
  • When creating the table, declare the extra collection delimiter; when querying, take array elements out by subscript.

2.3.2, Map data type

1. Interpreting the data:
[hadoop@hadoop001 hive03]$ cat hive_map.txt 
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

The delimiter between key and value is ':', and the delimiter between map entries is '#':
father:xiaoming
#
mother:xiaohuang
#
brother:xiaoxu,28

2. Create table statement:
create table hive_map(
	id int,name string,members map<string,string>,age int
)	ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY  '#'
MAP KEYS TERMINATED BY ':';

create table hive_map(id int,name string,members map<string,string>,age int) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';

3. Load the data:
load data local inpath '/home/hadoop/data/hive03/hive_map.txt' into table hive_map;

4. Query id, name, father, mother, and age from the hive_map table:
/ruozedata_hiv> select id, name, members['father'] as father, members['mother'] as mother, age from hive_map;
+-----+-----------+--------------+-------------+------+--+
| id  |   name    |    father    |   mother    | age  |
+-----+-----------+--------------+-------------+------+--+
| 1   | zhangsan  | xiaoming     | xiaohuang   | 28   |
| 2   | lisi      | mayun        | huangyi     | 22   |
| 3   | wangwu    | wangjianlin  | ruhua       | 29   |
| 4   | mayun     | mayongzhen   | angelababy  | 26   |
+-----+-----------+--------------+-------------+------+--+

5. Find each person's set of relations:
select id,name,map_keys(members) as relation from hive_map;
+-----+-----------+--------------------------------+--+
| id  |   name    |            relation            |
+-----+-----------+--------------------------------+--+
| 1   | zhangsan  | ["father","mother","brother"]  |
| 2   | lisi      | ["father","mother","brother"]  |
| 3   | wangwu    | ["father","mother","sister"]   |
| 4   | mayun     | ["father","mother"]            |
+-----+-----------+--------------------------------+--+

// Usage of map_keys and map_values:
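
map_keys is shown above; as a small sketch (output omitted), map_values returns the corresponding values:

select id, name, map_values(members) from hive_map;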

6. Count how many relations each person has:
select id,name,size(members) as relation from hive_map;

+-----+-----------+-----------+--+
| id  |   name    | relation  |
+-----+-----------+-----------+--+
| 1   | zhangsan  | 3         |
| 2   | lisi      | 3         |
| 3   | wangwu    | 3         |
| 4   | mayun     | 2         |
+-----+-----------+-----------+--+

7. Find the name and age of people who have a brother:
array_contains(column,'value')
select * from hive_map where array_contains(map_keys(members),'brother');

+--------------+----------------+----------------------------------------------------+---------------+--+
| hive_map.id  | hive_map.name  |                  hive_map.members                  | hive_map.age  |
+--------------+----------------+----------------------------------------------------+---------------+--+
| 1            | zhangsan       | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28            |
| 2            | lisi           | {"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22            |
+--------------+----------------+----------------------------------------------------+---------------+--+

2.3.3, Struct data type

1. The struct data is as follows:

[hadoop@hadoop001 hive03]$ cat hive_struct.txt 
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70

create table hive_struct(ip string,info struct<name:string,age:int>) row format delimited fields terminated by '#' collection items terminated by ':';

load data local inpath '/home/hadoop/data/hive03/hive_struct.txt' into table hive_struct;

select ip,info.name,info.age from hive_struct;
+--------------+-----------+------+--+
|      ip      |   name    | age  |
+--------------+-----------+------+--+
| 192.168.1.1  | zhangsan  | 40   |
| 192.168.1.2  | lisi      | 50   |
| 192.168.1.3  | wangwu    | 60   |
| 192.168.1.4  | zhaoliu   | 70   |
+--------------+-----------+------+--+

2.3.4 Practical exercise with a more complex case

1. Data preparation: click_log.txt, user click log, cookie_id, ad_id, time
[hadoop@hadoop001 hive03]$ cat click_log.txt
11 ad_101 2014-05-01 06:01:12.334+01
22 ad_102 2014-05-01 07:28:12.342+01
33 ad_103 2014-05-01 07:50:12.33+01
11 ad_104 2014-05-01 09:27:12.33+01
22 ad_103 2014-05-01 09:03:12.324+01
33 ad_102 2014-05-02 19:10:12.343+01
11 ad_101 2014-05-02 09:07:12.344+01
35 ad_105 2014-05-03 11:07:12.339+01
22 ad_104 2014-05-03 12:59:12.743+01
77 ad_103 2014-05-03 18:04:12.355+01
99 ad_102 2014-05-04 00:36:39.713+01

ad_list.txt, the ad dimension table: ad_id, url, catalogs.
Example: an ad belongs to one or more catalogs, e.g. Jingdong phones / Huawei phones.
[hadoop@hadoop001 hive03]$ cat ad_list.txt
ad_101 http://www.google.com catalog8|catalog1
ad_102 http://www.sohu.com catalog6|catalog3
ad_103 http://www.baidu.com catalog7
ad_104 http://www.qq.com catalog5|catalog1|catalog4|catalog9
ad_105 http://sina.com

2. Create and load the click_log table:
create table click_log(
cookie_id string,
ad_id string,
time string
)
row format delimited fields terminated by '\t';

create table click_log(cookie_id string,ad_id string,time string) row format delimited fields terminated by '\t';

load data local inpath '/home/hadoop/data/hive03/click_log.txt' into table click_log;

3. Create the ad_list table:
create table ad_list(
ad_id string,
list_url string,
catalogs string
)
row format delimited fields terminated by '\t';

create table ad_list(ad_id string,list_url string,catalogs string) row format delimited fields terminated by '\t';

load data local inpath '/home/hadoop/data/hive03/ad_list.txt' into table ad_list;

Requirements:

1. All distinct ad_ids visited by each person (collect_set deduplicates):
select cookie_id, collect_set(ad_id) from click_log group by cookie_id;

+------------+-------------------------------+--+
| cookie_id  |              _c1              |
+------------+-------------------------------+--+
| 11         | ["ad_101","ad_104"]           |
| 22         | ["ad_102","ad_103","ad_104"]  |
| 33         | ["ad_103","ad_102","ad_101"]  |
| 35         | ["ad_105","ad_102"]           |
| 77         | ["ad_103","ad_104"]           |
| 99         | ["ad_102","ad_105"]           |
+------------+-------------------------------+--+

2. All ad_ids visited by each person, without deduplication (collect_list keeps duplicates):
select cookie_id, collect_list(ad_id) from click_log group by cookie_id;

+------------+-----------------------------------------+--+
| cookie_id  |                   _c1                   |
+------------+-----------------------------------------+--+
| 11         | ["ad_101","ad_104","ad_101","ad_101"]   |
| 22         | ["ad_102","ad_103","ad_104","ad_103"]   |
| 33         | ["ad_103","ad_102","ad_101"]            |
| 35         | ["ad_105","ad_102"]                     |
| 77         | ["ad_103","ad_104"]                     |
| 99         | ["ad_102","ad_105"]                     |
+------------+-----------------------------------------+--+

3. Count each person's visits per ad_id:
select cookie_id, ad_id, count(1) as visit from click_log group by cookie_id, ad_id;

+------------+---------+--------+--+
| cookie_id  |  ad_id  | visit  |
+------------+---------+--------+--+
| 11         | ad_101  | 3      |
| 11         | ad_104  | 1      |
| 22         | ad_102  | 1      |
| 22         | ad_103  | 2      |
| 22         | ad_104  | 1      |
| 33         | ad_101  | 1      |
| 33         | ad_102  | 1      |
| 33         | ad_103  | 1      |
| 35         | ad_102  | 1      |
| 35         | ad_105  | 1      |
| 77         | ad_103  | 1      |
| 77         | ad_104  | 1      |
| 99         | ad_102  | 1      |
| 99         | ad_105  | 1      |
+------------+---------+--------+--+

4. Also bring out the catalogs of the ads each person visited:
look up the catalogs for each ad_id in ad_list:

select
c.cookie_id,c.ad_id,c.amount,a.catalogs
from
(select cookie_id,ad_id,count(1) amount from click_log group by cookie_id,ad_id) as c
join ad_list as a
on a.ad_id=c.ad_id;

The result is as follows:
+----------------+------------------------+--------------------------------------+--+
| ad_list.ad_id  |    ad_list.list_url    |           ad_list.catalogs           |
+----------------+------------------------+--------------------------------------+--+
| ad_101         | http://www.google.com  | catalog8|catalog1                    |
| ad_102         | http://www.sohu.com    | catalog6|catalog3                    |
| ad_103         | http://www.baidu.com   | catalog7                             |
| ad_104         | http://www.qq.com      | catalog5|catalog1|catalog4|catalog9  |
| ad_105         | http://sina.com        | NULL                                 |
+----------------+------------------------+--------------------------------------+--+

Column-to-rows transformation in Hive (explode):

Column to rows: ad_101 catalog8|catalog1
==>
ad_101 catalog8
ad_101 catalog1

// The | needs to be escaped in the split regex: '\\|'
select ad_id, catalog from ad_list lateral view outer explode(split(catalogs, '\\|')) t as catalog;

+---------+-----------+--+
|  ad_id  |  catalog  |
+---------+-----------+--+
| ad_101  | catalog8  |
| ad_101  | catalog1  |
| ad_102  | catalog6  |
| ad_102  | catalog3  |
| ad_103  | catalog7  |
| ad_104  | catalog5  |
| ad_104  | catalog1  |
| ad_104  | catalog4  |
| ad_104  | catalog9  |
| ad_105  | NULL      |
+---------+-----------+--+

The result contains a NULL row (ad_105 has no catalogs); if outer is removed from the SQL, that NULL row disappears.
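
As a sketch (output omitted), the same query without outer simply drops the ad_105 row:

select ad_id, catalog from ad_list lateral view explode(split(catalogs, '\\|')) t as catalog;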

Requirement: sort the contents of the catalogs array (sort_array sorts in ascending order):

create table ad_list_2(ad_id string,list_url string,catalogs array<string>) row format delimited fields terminated by '\t' collection items terminated by '|';

load data local inpath '/home/hadoop/data/hive03/ad_list.txt' into table ad_list_2;

select ad_id,sort_array(catalogs) from ad_list_2;

+---------+------------------------------------------------+--+
| ad_id   |                      _c1                       |
+---------+------------------------------------------------+--+
| ad_101  | ["catalog1","catalog8"]                        |
| ad_102  | ["catalog3","catalog6"]                        |
| ad_103  | ["catalog7"]                                   |
| ad_104  | ["catalog1","catalog4","catalog5","catalog9"]  |
| ad_105  | NULL                                           |
+---------+------------------------------------------------+--+

3. Functions in Hive

  • Hive functions are divided into built-in functions and user-defined functions (UDFs)

  • How to look them up:

SHOW FUNCTIONS; // List all available functions
DESCRIBE FUNCTION <function_name>; // Show how the function is used
DESCRIBE FUNCTION EXTENDED <function_name>; // Show detailed usage of the function

1. desc function explode;

+----------------------------------------------------+--+
|                      tab_name                      |
+----------------------------------------------------+--+
| explode(a) - separates the elements of array a into multiple rows, or the elements of a map into multiple rows and columns  |
+----------------------------------------------------+--+

2. Prepare a test table:
create table dual(x string);
insert into table dual values('');

select current_timestamp from dual;

select unix_timestamp() from dual;

select unix_timestamp('2019-12-21 20:19:19') from dual;

select unix_timestamp('20191221 20:19:19','yyyyMMdd HH:mm:ss') from dual;

// Usage of from_unixtime:
select from_unixtime(1576930759,'yyyyMMdd HH:mm:ss') from dual;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 20191221 20:19:19   |
+---------------------+--+

// Usage of to_date:
select to_date("2020-03-20 23:32:39") from dual;
+-------------+--+
|     _c0     |
+-------------+--+
| 2020-03-20  |
+-------------+--+

// Usage of year/month/day/hour/minute/second:
select month("2020-03-20 23:32:39") from dual;

// Usage of date_sub and date_add:
select date_sub("2020-03-20 23:32:39",7) from dual;

// Cast a string to a date or to an int:
select cast("2020-03-20" as date) from dual;
select cast("10" as int) from dual;

Special case: a NULL value cannot be converted by cast; the result is still NULL.

binary can only be cast to string
Q&A: binary cannot be cast to int directly (✗); going binary -> string -> int passes
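
A minimal sketch of these cast behaviors (assumed examples, not from the original notes):

select cast(null as int) from dual;                                    -- NULL stays NULL
select cast(cast("10" as binary) as string) from dual;                 -- binary -> string works
select cast(cast(cast("10" as binary) as string) as int) from dual;    -- binary -> string -> int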

Numeric functions:

// Keep 3 digits after the decimal point:
select round(5.13145, 3) from dual;

select ceil(5.4) from dual;

select floor(5.4) from dual;

select least(2,3,1) from dual;

String functions:

// Take the substring starting at position 5:
select substr('Facebook', 5) from dual;

// Starting at position 2, take 3 characters:
select substr('Facebook', 2, 3) from dual;

// Concatenate two strings:
select concat('abc', 'def') from dual;

// concat_ws (concatenate with a separator), also heavily used in Spark:
select concat_ws(".", "192", "168", "1", "1") from dual;

// Usage of split: note that the dot must be escaped because the delimiter is a regex:
select split("192.168.1.1", "\\.") from dual;

// Usage of upper: take 3 characters starting at position 2 and uppercase them (BCD):
select upper(substr("abcdefg", 2, 3)) from dual;

Array type:

array_contains: returns a boolean
sort_array: returns the array sorted in ascending order
size: returns the number of elements
map_keys: returns an array of the keys
map_values: returns an array of the values
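
A quick combined sketch against the hive_array table above (output omitted):

select name, sort_array(work_locations), size(work_locations), array_contains(work_locations, 'beijing') from hive_array;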

JSON type:

Example:
create table rating_json(json string);

load data local inpath '/home/hadoop/data/hive03/rating_json.txt' into table rating_json;

select json_tuple(json,'movie','rating','time','userid') as (movie,rate,time,userid) from rating_json limit 3;

+--------+-------+------------+---------+--+
| movie  | rate  |    time    | userid  |
+--------+-------+------------+---------+--+
| 1193   | 5     | 978300760  | 1       |
| 1194   | 6     | 978300770  | 2       |
| 1195   | 7     | 978300780  | 3       |
+--------+-------+------------+---------+--+

Extension: how do we convert the time column (a unix timestamp)? Wrap the json_tuple in a subquery:
select userid, movie, rate, time,
year(from_unixtime(time)) as year
from (
select json_tuple(json, 'movie', 'rating', 'time', 'userid') as (movie, rate, time, userid) from rating_json limit 3
) as j;

--> This errors because json_tuple returns time as a string, so it must first be cast to bigint. Modified as follows:
select userid,movie,rate,time,
from_unixtime(cast(time as bigint)) as ts,
year(from_unixtime(cast(time as bigint))) as year,
month(from_unixtime(cast(time as bigint))) as month
from (
select json_tuple(json,'movie','rating','time','userid') as (movie,rate,time,userid) from rating_json limit 3
) as j;

Use of parse_url_tuple:

select parse_url_tuple("http://www.ruozedata.com/bigdata/spark?cookie_id=10&a=b&c=d",'HOST','PATH','QUERY','QUERY:cookie_id') from dual;

+--------------------+-----------------+-----------------------+-----+--+
|         c0         |       c1        |          c2           | c3  |
+--------------------+-----------------+-----------------------+-----+--+
| www.ruozedata.com  | /bigdata/spark  | cookie_id=10&a=b&c=d  | 10  |
+--------------------+-----------------+-----------------------+-----+--+

Usage of isnull (checks whether a value is NULL):
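
A minimal sketch against the emp table (assumed: isnull returns true where comm is NULL):

select ename, comm, isnull(comm) from emp;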

// assert_true throws an exception if the condition is false (here, if comm is not null):
select assert_true(comm is null) from emp where ename = 'SMITH';

// elt: take the second value from the list:
select elt(2, 'dongwuzhengquan', 'tonghuashun', 'caifuzhengquan') from dual;

// nvl(value, default_value): returns default_value if value is null, otherwise returns value.
// When comm is null, -1 replaces it:
select ename, comm, nvl(comm, -1) from emp;


Origin blog.csdn.net/SparkOnYarn/article/details/105182082