1. Data description
id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e
(2) Field meaning
Students with id 1, 2 and 3 have each enrolled in some of the courses a, b, c, d, e, f.
Create the table:
create table t_course(id int, course string)
row format delimited fields terminated by ",";
Load the data:
load data local inpath "/home/hadoop/course/course.txt" into table t_course;
3. Requirement
Write a Hive HQL statement that produces the result below, where 1 means the course is taken and 0 means it is not:
id a b c d e f
1  1 1 1 0 1 0
2  1 0 1 1 0 1
3  1 1 1 0 1 0
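First, reorganize the data:
create table id_courses as
select t1.id as id,
       t1.course as id_courses,
       t2.course as courses
from (
    -- courses taken by each student
    select id as id, collect_set(course) as course
    from t_course
    group by id
) t1
join (
    -- every distinct course; collect_set does not guarantee element order,
    -- so sort_array is used to fix the order to a, b, c, d, e, f
    select sort_array(collect_set(course)) as course
    from t_course
) t2;  -- no ON clause: t2 has a single row, which is attached to every row of t1
collect_set(column) gathers the values of that column into a single array with duplicates removed.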
For example, to get the courses that belong to each id:
select id as id, collect_set(course) as course from t_course group by id;
Back to the problem. Take a look at the current table id_courses.
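A query along these lines would show the intermediate table; it only assumes the id_courses table created above:

```sql
-- Inspect the intermediate table: one row per student
select id, id_courses, courses
from id_courses;
```

Each row should hold the student's id, the array of courses that student takes (id_courses), and the array of all distinct courses (courses), which is identical on every row.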
Next, compare id_courses with courses to pick out the result we need:
-- courses[n] assumes the courses array is ordered a, b, c, d, e, f
select id,
       case when array_contains(id_courses, courses[0]) then 1 else 0 end as a,
       case when array_contains(id_courses, courses[1]) then 1 else 0 end as b,
       case when array_contains(id_courses, courses[2]) then 1 else 0 end as c,
       case when array_contains(id_courses, courses[3]) then 1 else 0 end as d,
       case when array_contains(id_courses, courses[4]) then 1 else 0 end as e,
       case when array_contains(id_courses, courses[5]) then 1 else 0 end as f
from id_courses;
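Indexing courses[0] through courses[5] only gives the right columns when the array happens to be in a to f order, and collect_set does not guarantee any order. As an alternative sketch (conditional aggregation, which is not the approach used above), the same matrix can be computed straight from t_course with the course names written as literals:

```sql
-- Alternative sketch: pivot with conditional aggregation.
-- max(...) is 1 if the student has at least one row for that course, else 0.
select id,
       max(case when course = 'a' then 1 else 0 end) as a,
       max(case when course = 'b' then 1 else 0 end) as b,
       max(case when course = 'c' then 1 else 0 end) as c,
       max(case when course = 'd' then 1 else 0 end) as d,
       max(case when course = 'e' then 1 else 0 end) as e,
       max(case when course = 'f' then 1 else 0 end) as f
from t_course
group by id;
```

This version needs no intermediate table, at the cost of hard-coding the course list in the query.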
A few additional notes:
array_contains(array, value) checks whether the array contains value and returns true if it does.
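A minimal illustration with a literal array, assuming a Hive version that allows select without a from clause (the column aliases has_b and has_d are made up for the example):

```sql
select array_contains(array('a', 'b', 'c'), 'b') as has_b,  -- true
       array_contains(array('a', 'b', 'c'), 'd') as has_d;  -- false
```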
case when (condition) then (value if the condition holds) else (value if it does not) end
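As a small sketch on the t_course table from this exercise, the expression below flags every row whose course is a (the alias is_a is a hypothetical name for illustration):

```sql
select id,
       course,
       case when course = 'a' then 1 else 0 end as is_a  -- 1 for course a, 0 otherwise
from t_course;
```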