hive学习06-youtube数据集

---hive练习第6天
1.hive基础:order by,sort by,distribute by,cluster by

order by  全局排序 所有的map结果提交至一个reduce里,如果在严格模式下,order by 需要指定 limit 数据条数,不然数据量巨大的情况下 
会造成崩溃无输出结果。
涉及属性:set hive.mapred.mode=nonstrict/strict
设置为严格模式后。order by 后面需要接 limit
sort by     多个reduce,reduce内部有序,不保证全局有序。
distribute by    distribute by 一般要和 sort by 一起使用,即将某一块数据归给(distribute by)某一个reducer 处理,然后在指定的 reducer 中进行 sort by 排序。
cluster by  =distribute by + sort by  (分桶字段和排序字段为同一个字段)

create table company_info (
personid string,
company string,
money float
)row format delimited fields terminated by "\t";

load data local inpath "/home/dip/company.txt"

[dip@dip005 lzm]$ cat company.txt p1 公司1 150 p2 公司2 230 p1 公司3 160 p3 公司4 30 p4 公司4 20 p4 公司6 90 p5 公司5 500 练习 #全局排序 select * from company_info order by money desc ; #reduce内部有序,如果启动多个reduce,reduce之间没有有序 select * from company_info sort by money desc; #默认第一个字段分桶,这里我们用第二个字段分桶distribute by 然后用第三个字段做sort by 2、hive基础:行转列、列转行 UDAF UDTF 2.1;列转行 concat()函数用于将多个字符串连接成一个字符串 SELECT CONCAT(id, ‘,’, name) AS con FROM info conact_ws()函数是caoncat的特殊形式,第一个参数是分隔符号。 SELECT CONCAT("_",id, name) AS con FROM info group_concat()函数返回一个字符串结果 hive中collect_list()函数和collect_set()函数将某列转为一个数组返回 [dip@dip005 lzm]$ cat person_info.tsv 孙悟空 白羊座 A 大海 射手座 A 宋宋 白羊座 B 猪八戒 白羊座 A 凤姐 射手座 A create table person_info( name string, constellation string, blood_type string) row format delimited fields terminated by "\t"; load data local inpath "/home/dip/lzm/person_info.tsv" into table person_info #请将具有相同星座和血型的人返回 select t1.base,collect_set(name) from (select name,concat_ws(",",constellation,blood_type) base from person_info)t1 group by t1.base; 将collect_set()返回的列表进行处理: select base,concat_ws("|",collect_set(name)) from (select name,concat_ws(",",constellation,blood_type) base from person_info)t1 group by base; 2.2行转列: “fields terminated by”:字段与字段之间的分隔符。 “collection items terminated by”:一个字段中各个子元素 item 的分隔符。 [dip@dip005 lzm]$ head movie_info.tsv 《疑犯追踪》 悬疑,动作,科幻,剧情 《lie to me》 悬疑,警匪,动作,心理,剧情 《战狼2》 战争,动作,灾难 create table movie_info( movie string, category array<string> )row format delimited fields terminated by "\t" collection items terminated by ","; load data local inpath "/home/dip/lzm/movie_info.tsv" into table movie_info; lateral view 与 explode函数的使用 youtube数据实战: 源数据解释:youtube_video_ori 视频表 video_id 视频id uploader 视频上传者 age 视频年龄 category 视频类别 length 视频长度 views 观看次数 rate 视频评分 ratings 流量 comments 评论数 related_ids 相关视频id (不多于20个) 用户表 uploader 视频上传者 videos 上传视频数 friends 朋友的数量 练习 一、各种 TopN 指标: 01统计视频观看数 top10 select * from youtube_video_ori order by views desc limit 10; 02统计视频类别热度 top10 (视频数最多的十个类别) 考点:把category进行 一行转多行,然后聚合统计次数 方法一: select t1.category_name,count(*) as cnt from (select video_id,category_name from youtube_video_ori lateral view explode(category) youtube_video_ori as category_name)t1 group by category_name order by cnt desc limit 10 ; 方法二: select t1.category_name ,count(*) as cnt from (select explode(category) as category_name from youtube_video_ori)t1 group by category_name order by cnt desc limit 10; 03统计视频观看数 Top20 所属类别包含这 Top20 视频的个数 ①计算观看视频数top20 select * from youtube_video_ori order by views desc limit 20; ②类别一行转多行 select video_id,category_name from (select * from youtube_video_ori order by views desc limit 20)t1 lateral view explode(category) t1 as category_name; ③统计每个类别包含视频个数视频 select category_name,count(*) from (select video_id,category_name from (select * from youtube_video_ori order by views desc limit 20)t1 lateral view explode(category) t1 as category_name)t2 group by category_name; 04统计视频观看数 Top50 所关联视频的所属类别Rank ①top前50 select * from youtube_video_ori order by views desc limit 50; ②关联视频id一行转多行: select video_id,related_ids_name from (select * from youtube_video_ori order by views desc limit 50)t1 lateral view explode(related_ids)t1 as related_ids_name; ③video_id做关联,与表youtube_video_ori关联,统计关联视频所属id select t2.related_ids_name,youtube_video_ori.category from (select video_id,related_ids_name from (select * from youtube_video_ori order by views desc limit 50)t1 lateral view explode(related_ids)t1 as related_ids_name)t2 join youtube_video_ori on t2.video_id =youtube_video_ori.video_id; ④类别一行转多行 ,为key,聚类统计数量 select category_name,count(*) as cnt from (select t2.related_ids_name,youtube_video_ori.category from (select video_id,related_ids_name from (select * from youtube_video_ori order by views desc limit 50)t1 lateral view explode(related_ids)t1 as related_ids_name)t2 join youtube_video_ori on t2.video_id =youtube_video_ori.video_id)t3 lateral view explode(category) t3 as category_name group by category_name order by cnt; 05统计每个类别中视频观看数的 Top10 查询有多少类别 hive -e "use hue_test1;select distinct(category_name) from youtube_video_ori lateral view explode(category)youtube_video_ori as category_name"| grep -v WARN>a.txt cat a.txt|while read line do hive -e "use hue_test1;select video_id,category_name,views from youtube_video_ori lateral view explode(category)youtube_video_ori as category_name where category_name='$line' order by views desc limit 10" |grep -v WARN>>b.txt done 06统计上传视频最多的用户 Top10 以及他们上传的视频 select youtube_video_ori.uploader,youtube_video_ori.video_id from ( select uploader,count(*)as up_cnt from youtube_video_ori group by uploader order by up_cnt desc limit 10)t1 join youtube_video_ori on t1.uploader = youtube_video_ori.uploader;

猜你喜欢

转载自www.cnblogs.com/students/p/10939996.html