Hive项目实战系列(3) | 业务分析

  此次博主为大家带来的是Hive项目实战系列的第三部分,也是最终部分。


  •   我们先来测试一下
0: jdbc:hive2://hadoop002:10000> select * from video_orc order by views desc limit 10;

1
  这时候出现了内存溢出的现象

  • 再次测试
0: jdbc:hive2://hadoop002:10000> select videoId,views from video_orc order by views desc limit 10;

2
由此可以得到以下结论,不要带* 否则会出现堆溢出

1. 统计视频观看数Top10

思路:使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。

最终代码:

select 
    videoId, 
    uploader, 
    age, 
    category, 
    length, 
    views, 
    rate, 
    ratings, 
    comments 
from 
    video_orc
order by 
    views 
desc limit 
    10;

3

2. 统计视频类别热度Top10

思路:
1.即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2. 我们需要按照类别group by聚合,然后count组内的videoId个数即可。
3.因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。
4.最后按照热度排序,显示前10条。

最终代码:

select 
    category_name as category, 
    count(t1.videoId) as hot 
from (
    select 
        videoId,
        category_name 
    from 
        video_orc lateral view explode(category) t_catetory as category_name) t1 
group by 
    t1.category_name 
order by 
    hot 
desc limit 
    10;

4

3. 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

思路:
1.先找到观看数最高的20个视频所属条目的所有信息,降序排列
2.把这20条信息中的category分裂出来(列转行)
3.最后查询视频分类名称和该分类下有多少个Top20的视频

最终代码:

select 
    category_name as category, 
    count(t2.videoId) as hot_with_views 
from (
    select 
        videoId, 
        category_name 
    from (
        select 
            * 
        from 
            video_orc 
        order by 
            views 
        desc limit 
            20) t1 lateral view explode(category) t_catetory as category_name) t2 
group by 
    category_name 
order by 
    hot_with_views 
desc;

5

4. 统计视频观看数Top50所关联视频的所属类别排序

思路:
1.查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1
2.将找到的50条视频信息的相关视频relatedId列转行,记为临时表t2
3. 将相关视频的id和gulivideo_orc表进行inner join操作
4. 按照视频类别进行分组,统计每组视频个数,然后排行

  • 1. 观看数前50的视频
select 
    * 
from 
    video_orc 
order by 
    views 
desc limit 
    50;

  • 2. 将相关视频的id进行列转行操作
select 
    explode(relatedId) as videoId 
from 
	t1;

  • 3. 得到两列数据,一列是category,一列是之前查询出来的相关视频id
 (select 
    distinct(t2.videoId), 
    t3.category 
from 
    t2
inner join 
    video_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name;

  • 4. 按照视频类别进行分组,统计每组视频个数,然后排行

最终代码

select 
    category_name as category, 
    count(t5.videoId) as hot 
from (
    select 
        videoId, 
        category_name 
    from (
        select 
            distinct(t2.videoId), 
            t3.category 
        from (
            select 
                explode(relatedId) as videoId 
            from (
                select 
                    * 
                from 
                    video_orc 
                order by 
                    views 
                desc limit 
                    50) t1) t2 
        inner join 
            video_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name) t5
group by 
    category_name 
order by 
    hot 
desc;

6

5. 统计每个类别中的视频热度Top10,以Music为例

思路:
1.要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放categoryId展开的数据。
2. 向category展开的表中插入数据。
3. 统计对应类别(Music)中的视频热度。

最终代码:

  • 1. 创建表类别表:
create table gulivideo_category(
    videoId string, 
    uploader string, 
    age int, 
    categoryId string, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int, 
    relatedId array<string>)
row format delimited 
fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;

  • 2. 向类别表中插入数据:
insert into table gulivideo_category  
    select 
        videoId,
        uploader,
        age,
        categoryId,
        length,
        views,
        rate,
        ratings,
        comments,
        relatedId 
    from 
        video_orc lateral view explode(category) catetory as categoryId;

  • 3. 统计Music类别的Top10(也可以统计其他)
select 
    videoId, 
    categories,
    views
from 
    gulivideo_category 
where 
    categoryId = "Music" 
order by 
    views 
desc limit
    10;

7

6. 统计每个类别中视频流量Top10,以Music为例

思路:
1.创建视频类别展开表(categoryId列转行后的表)
2.按照ratings排序即可

最终代码:

select 
    videoId,
    views,
    ratings 
from 
    gulivideo_category 
where 
    categoryId = "Music" 
order by 
    ratings 
desc limit 
    10;

8

7. 统计每个类别视频观看数Top10

思路:
1.先得到categoryId展开的表数据
2.子查询按照categoryId进行分区,然后分区内排序,并生成递增数字,该递增数字这一列起名为rank列
3.通过子查询产生的临时表,查询rank值小于等于10的数据行即可。

最终代码:

select 
    t1.* 
from (
    select 
        videoId,
        categoryId,
        views,
        row_number() over(partition by categoryId order by views desc) rank from gulivideo_category) t1 
where 
    rank <= 10;

9

  好了,关于此次实战的全部内容已经更新完毕了。


11
   \color{#FF0000}{看完就赞,养成习惯!!!} ^ _ ^ ❤️ ❤️ ❤️
  码字不易,大家的支持就是我坚持下去的动力。点赞后不要忘了关注我哦!

猜你喜欢

转载自blog.csdn.net/qq_16146103/article/details/105881999