HiveSQL cases

1- Describe in detail the steps, and the keywords involved, for importing a structured text file student.txt into a Hive table
• Assume student.txt has three columns: id, name, gender
• 1- Create the database: create database student_info;
• 2- Create the Hive table student

create external table student_info.student(
id string comment 'student id',
name string comment 'student name',
gender string comment 'student gender'
) comment 'student information table'
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/student';

• 3- Load data

load data local inpath '/root/student.txt' into table student_info.student;

• 4- Enter the Hive CLI and check the table structure and data

select * from student_info.student limit 10;
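To inspect the table structure itself, Hive's describe command can be used alongside the query above:

-- show column definitions and table metadata
describe formatted student_info.student;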

2- Use HQL to implement the following

• 2-1- Create the tables
• Create an employee basic information table (EmployeeInfo) with the fields: employee ID, employee name, ID card number, gender, age, department, position, hire date, departure date. The partition field is the hire date, the line separator is "\n", and the field separator is "\t". Departments include the Administration Department, Finance Department, R&D Department, and Teaching Department; the corresponding positions include administrative manager, administrative specialist, financial manager, financial specialist, R&D engineer, test engineer, implementation engineer, lecturer, teaching assistant, head teacher, etc. Time values look like: 2018-05-10 11:00:00

• Create an employee income table (IncomeInfo) with the fields: employee ID, employee name, income amount, income month, income type, salary payment time. The partition field is the salary payment time; the income type is one of four cases: salary, bonus, company benefits, and fines. Time values look like: 2018-05-10 11:00:00.

• Note: since the time values look like 2018-05-10 11:00:00, the field needs processing

• Create the employee basic information table

create external table test.employee_info(
id string comment 'employee id',
name string comment 'employee name',
identity_card string comment 'ID card number',
gender string comment 'gender',
age int comment 'age',
department string comment 'department',
post string comment 'position',
hire_date string comment 'hire date',
departure_date string comment 'departure date'
) comment 'employee basic information table'
partitioned by (day string comment 'employee hire date')
row format delimited fields terminated by '\t'
lines terminated by  '\n'
stored as textfile 
location '/user/root/employee';

• Create the employee income table

create external table test.income_info(
id string comment 'employee id',
name string comment 'employee name',
income_data string comment 'income amount',
income_month string comment 'income month',
income_type string comment 'income type',
income_datetime string comment 'salary payment time'
) comment 'employee income table'
partitioned by (day string comment 'salary payment time')
row format delimited fields terminated by '\t'
lines terminated by  '\n'
stored as textfile 
location '/user/root/income';
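A note on these partitioned external tables: files placed under the location only become visible once their partitions are registered in the metastore. A minimal sketch (partition value and path are assumed for illustration):

-- register one partition explicitly
alter table test.income_info add partition (day='2018-05-10')
location '/user/root/income/day=2018-05-10';
-- or discover all partition directories already present under the table location
msck repair table test.income_info;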

2-2 Implement with HQL: what is the company's total annual employee expense, sorted by year in descending order?
• The key is processing the time value 2018-05-10 11:00:00 with built-in functions
• Read the full income_info table and aggregate by the partition time. Because the income types include fines, the fines must be deducted from the amounts paid to employees.
• Do not use a join; traverse the data once to output the result.
• With a large data volume, aim to obtain the result in a single pass over the data

select 
    income_year,(income_data-(nvl(penalty_data,0))) as company_cost
from
(
    -- total income and total fines per year; outputs e.g. 2019 500 10
    select 
        income_year,
        sum(case when income_type!='罚款' then data_total else 0 end) as income_data,   -- '罚款' = fine
        sum(case when income_type='罚款' then data_total else 0 end) as penalty_data
    from
    (
    -- sum income amounts by year and income type
    select 
        year(to_date(income_datetime)) as income_year,
        income_type,
        sum(income_data) as data_total
    from
        test.income_info
    group by 
        year(to_date(income_datetime)) ,income_type
    ) tmp_a
    group by  tmp_a.income_year
) as  temp
order by income_year desc;
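The two inner query levels can also be collapsed into a single aggregation; a sketch of the same logic in one scan (the cast is added because income_data is declared string):

select
    year(to_date(income_datetime)) as income_year,
    sum(case when income_type!='罚款' then cast(income_data as double)
             else -cast(income_data as double) end) as company_cost   -- fines counted negative
from test.income_info
group by year(to_date(income_datetime))
order by income_year desc;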

2-3 Implement with HQL: what is the total employee expense of each department per year, sorted by year descending and by each department's expense ascending?
• Guarantee a single traversal of the income data

-- join on id to get the department and the income type
select 
    income_year,department,
    (sum(case when income_type!='罚款' then income_data else 0 end) - sum(case when income_type='罚款' then income_data else 0 end) ) as department_cost   -- '罚款' = fine
from
(
    -- first aggregate each employee's income by year and income type
    select 
        id,year(to_date(income_datetime)) as income_year,income_type,sum(income_data) as income_data
    from 
        test.income_info
    group by 
    year(to_date(income_datetime)),id,income_type
) temp_a
inner join
    test.employee_info b
on
    temp_a.id=b.id
group by
    department,income_year
order by income_year desc , department_cost asc;

2-4 Implement with HQL: find each department's total employee expense over all of history, ranked in descending order by total expense; equal values share a rank with no gaps.
• Modify the intermediate result from 2-3
• Note that all historical data is included, so the year is dropped from the grouping

select department,department_cost,dense_rank() over(order by department_cost desc) as cost_rank   -- dense_rank leaves no gaps on ties
from
(
-- join on id to get the department and the income type
select 
    department,
    (sum(case when income_type!='罚款' then income_data else 0 end) - sum(case when income_type='罚款' then income_data else 0 end) ) as department_cost
from
(
    -- first aggregate each employee's income by income type, over all years
    select 
        id,income_type,sum(income_data) as income_data
    from 
        test.income_info
    group by 
    id,income_type
) temp_a
inner join
    test.employee_info b
on
    temp_a.id=b.id
group by
    department
) tmp_c ;

2-5 Implement with HQL: create and populate the dynamic salary-change table: employee ID, employee name, current month's salary, current month's payment time, last month's salary, last month's payment time. The partition field is the current month's salary payment time.
• Dynamic partition insert seems to be the feature to use here
• Create the table first, then use insert into table ... select ...
• Employees who have resigned or just joined must be taken into account, hence a full join
• Full join two month-shifted copies of the table, then filter on whether day is null
• The month keys need the concat/year/month/to_date built-in functions
• This question deserves further thought

create external table test.income_dynamic(
id string comment 'employee id',
name string comment 'employee name',
income_data_current string comment 'current month income',
income_datetime_current string comment 'current month salary payment time',
income_data_last   string comment 'last month income',
income_datetime_last string comment 'last month salary payment time'
) comment 'employee income change table'
partitioned by (day string comment 'current month salary payment time')
row format delimited fields terminated by '\t'
lines terminated by  '\n'
stored as textfile 
location '/user/root/income_dynamic';   -- must not reuse income_info's location
-- ------------------------------------------------------------------------------
-- dynamic partition insert
-- every partition value here is dynamic, so Hive needs nonstrict mode:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- insert statement, built on a full join
insert into table test.income_dynamic partition(day)
select 
    (case when id_a is not null then id_a else id_b end ) as id,
    (case when name_a is not null then name_a else name_b end )  as name ,
    income_data,income_datetime,income_data_b,income_datetime_b,day
from
    (
    -- all salary rows, keyed by their own year-month
    -- note: month() is not zero-padded; date_format(day,'yyyyMM') would be a sturdier key
    select
        id as id_a,name as name_a,income_data,income_datetime,day,concat(year(to_date(day)),month(to_date(day))) as day_flag
    from 
        test.income_info
    where 
        income_type='薪资' ) tmp_a   -- '薪资' = salary
full outer join
    (
    -- the same salary rows with the payment date shifted forward one month,
    -- so that last month's row lines up with this month's
    select
        id as id_b,name as name_b,income_data as income_data_b,income_datetime as  income_datetime_b,concat(year(add_months(to_date(day),1)),month(add_months(to_date(day),1))) as   month_flag
    from 
        test.income_info
    where 
        income_type='薪资'
    ) tmp_b
    on 
        tmp_a.day_flag=tmp_b.month_flag
    and 
        tmp_a.id_a=tmp_b.id_b
where day is not null   -- rows without a current-month salary have no partition value
;

2-6 Implement with HQL: in terms of raises, who got a raise in May 2018, and whose raise was the largest?
● This is easiest on top of 2-5: reuse its select part, or query the table 2-5 produces; a sketch follows
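A minimal sketch of that idea, assuming the test.income_dynamic table from 2-5 has been populated (the arithmetic relies on Hive's implicit string-to-double cast):

select id, name,
       (income_data_current - income_data_last) as raise
from test.income_dynamic
where day like '2018-05%'                       -- partitions paid in May 2018
  and income_data_last is not null              -- needs a previous month to compare against
  and income_data_current > income_data_last    -- only employees whose salary went up
order by raise desc;                            -- first row has the largest raise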
Hive row/column conversion

I. Row to column

1. Problem
How can Hive transform
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6
into:
a       b       1,2,3
c       d       4,5,6
-------------------------------------------------------------------------------------------
2. Data
test.txt
a       b       1 
a       b       2 
a       b       3 
c       d       4 
c       d       5 
c       d       6
------------------------------------------------------------------------------------------- 
3. Answer
1. Create the table
drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;
-- load the data
load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;
2. Processing
select col1,col2,concat_ws(',',collect_set(col3)) 
from tmp_jiangzl_test  
group by col1,col2;
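One caveat: collect_set de-duplicates and does not guarantee order. When duplicate values must survive, collect_list is the drop-in alternative:

select col1,col2,concat_ws(',',collect_list(col3)) 
from tmp_jiangzl_test  
group by col1,col2;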
---------------------------------------------------------------------------------------
collect_set/concat_ws syntax references: https://blog.csdn.net/waiwai3/article/details/79071544
https://blog.csdn.net/yeweiouyang/article/details/41286469   [Hive] Using concat_ws to merge multiple rows into one
---------------------------------------------------------------------------------------
II. Column to row
1. Problem
How can Hive transform
a       b       1,2,3
c       d       4,5,6
into:
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6
---------------------------------------------------------------------------------------------
2. Answer
1. Create the table

drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;
2. Processing:
select col1, col2, col5
from tmp_jiangzl_test a 
lateral view explode(split(col3,',')) b AS col5;
---------------------------------------------------------------------------------------
lateral view syntax reference:
https://blog.csdn.net/clerk0324/article/details/58600284

Hive implements WordCount

1. Create the database
create database wordcount;
2. Create an external table
create external table word_data(line string) row format delimited fields terminated by ',' location '/home/hadoop/worddata';
3. Map the data into the table
load data inpath '/home/hadoop/worddata' into table word_data;
(Note: since the external table's location already points at the data directory, the table can read the files in place and this load is redundant.)
4. Assume our data lives in HDFS under /home/hadoop/worddata, a set of word files with content like:
hello man
what are you doing now
my running
hello
kevin
hi man
Running the HQL above creates the word_data table; each line of the files becomes a row stored in the column line, and select * from word_data; shows the data.

5. Following MapReduce's split step, each line must be broken into words. This uses a Hive built-in table-generating function (UDTF), explode(array): it takes an array and produces one row per element:

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;

6. View the contents of the words table
OK
hello
man
what
are
you
doing
now
my
running
hello
kevin
hi
man
split is a splitting function, like Java's split; here it splits on spaces, so after the HQL runs the words table holds one word per row.
7. Count the words with group by
    select word, count(*) from wordcount.words group by word;
wordcount.words is database.table; the word in group by word is the word column defined by create table words(word string)

Result:
are     1
doing   1
hello   2
hi      1
kevin   1
man     2
my      1
now     1
running 1
what    1
you     1
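As a small extension of step 7, the same aggregation can be sorted to list the most frequent words first:

select word, count(*) as cnt
from wordcount.words
group by word
order by cnt desc;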

Hive TopN
● rank() over()
● dense_rank() over()
● row_number() over()
● e.g. fetch the top-N order ids in a specified state; the sketch below shows how the three functions differ
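A minimal sketch on a hypothetical orders table (order_id, state, order_time are assumed column names), taking the 3 most recent orders per state:

select *
from
(
  select order_id, state, order_time,
         -- rank()       -> 1,2,2,4 : gaps after ties
         -- dense_rank() -> 1,2,2,3 : no gaps after ties
         -- row_number() -> 1,2,3,4 : ties broken arbitrarily
         row_number() over (partition by state order by order_time desc) as rn
  from orders
) t
where rn <= 3;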
● Given an order table, find the users who have only ever bought flour
eg: order: order_id, buyer_id, order_time...
● Guarantee a single traversal of the data, with O(1) work per row

select buyer_id
from
(
select buyer_id,sum(case when order_id='面粉' then 0 else 1 end) as flag   -- '面粉' = flour; flag stays 0 only if every order is flour
from `order`   -- order is a reserved word, so backquote it
group by buyer_id
) as tmp
where flag=0;
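Equivalently, the subquery can be folded into a having clause; a sketch of the same single-pass idea:

select buyer_id
from `order`
group by buyer_id
having sum(case when order_id='面粉' then 0 else 1 end) = 0;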

How many groups of mutual fans are there in the Weibo system?
● In the Weibo follower table, count how many groups of people follow each other. For example A–>B and B–>A: A and B follow each other, which counts as one group.
Table structure: id, keep_id, time... (id, keep_id can be used as a joint primary key)
● Implement with Hive

select count(*)/2 as weibo_relation_number
from
(
  select flag
  from
  (
    select concat(id,'_',keep_id) as flag from weibo_relation   -- '_' separator avoids collisions such as 1|23 vs 12|3
    union all  -- merge everything; do not de-duplicate early
    select concat(keep_id,'_',id) as flag from weibo_relation
  ) t
  group by flag
  having count(flag) = 2   -- a flag occurs twice only when both directions exist
) as tmp;

How many things did the people who bought bananas buy
● This is a classic question: how many things did the people who bought bananas buy?
● The data reuses the previous question's table structure; read it as: the people who follow c — how many distinct people do they follow in total?
● Read carefully: the followed ids must be de-duplicated when counting

select count(distinct keep_id) as total_keep_id
from weibo_relation
where id
  in
(select id from weibo_relation where keep_id='c');
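Hive's classic alternative to an in subquery is the left semi join; a sketch of the same count:

select count(distinct a.keep_id) as total_keep_id
from weibo_relation a
left semi join
  (select id from weibo_relation where keep_id='c') b
on a.id = b.id;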


Origin blog.csdn.net/w13716207404/article/details/103427995