Hive Case Study: Movie Ratings

We have the following three data files:

1. users.dat    Sample record:  2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code

2. movies.dat        Sample record: 2::Jumanji (1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, movie genres

3. ratings.dat        Sample record:  1::1193::5::978300760    (pattern: (.*)::(.*)::(.*)::(.*))
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp

1. Create the tables correctly, load the data (three tables, three data files), and verify the result

Create the users table

create table if not exists users(UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)',
'output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile location "/user/data/yingping/users";

Load the data

load data local inpath "/home/hadoop/hive_data/users.dat" into table users;

Check the data

select * from users limit 10;

Create the movies table

create table if not exists movies(MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)',
'output.format.string'='%1$s %2$s %3$s')
stored as textfile location "/user/data/yingping/movies";

Load the data

load data local inpath "/home/hadoop/hive_data/movies.dat" into table movies;

Check the data

select * from movies limit 10;

Create the ratings table

create table if not exists ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)',
'output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile location "/user/data/yingping/ratings";

Load the data

load data local inpath "/home/hadoop/hive_data/ratings.dat" into table ratings;

Check the data

select * from ratings limit 10;
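
Looking at the first ten rows only shows that some rows parsed. The queries below are a minimal sketch of two further sanity checks, assuming the three tables were created as above. With RegexSerDe, a line that the pattern fails to match typically comes back with NULL in every column, so a non-zero result from the second group of queries would point to a bad pattern or malformed input lines.

```sql
-- Row counts: compare them with the line counts of the source files
-- (for example, wc -l users.dat on the local machine).
select count(*) from users;
select count(*) from movies;
select count(*) from ratings;

-- Rows whose key columns are NULL were most likely not parsed correctly.
select count(*) from users where UserID is null;
select count(*) from movies where MovieID is null;
select count(*) from ratings where UserID is null or MovieID is null or Rating is null;
```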

2. Find the 10 movies that were rated most often, together with their rating counts (title, rating count)

select Title,count(UserID)c 
from movies m join ratings r 
on m.MovieID=r.MovieID
group by Title order by c desc limit 10; 
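
Grouping by Title alone would merge two different movies if they happened to share a title. A hedged alternative, assuming MovieID is unique in movies, is to count per MovieID first and attach the titles afterwards. The alias cnt is introduced here for illustration only.

```sql
-- Count ratings per movie ID, then join the titles.
select m.Title Title, t.cnt cnt
from (
  select MovieID, count(*) cnt
  from ratings
  group by MovieID
) t
join movies m on m.MovieID = t.MovieID
order by cnt desc limit 10;
```

Aggregating before the join also keeps the joined intermediate result small, since each movie contributes one row instead of one row per rating.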

3. Find the 10 movies with the highest average rating among female users and among male users separately (gender, title, rating)

select Gender,Title,avg(Rating)avg 
from ratings r join users u on r.UserID=u.UserID 
join movies m on r.MovieID=m.MovieID 
where Gender = 'F' 
group by Gender,Title order by avg desc limit 10;

select Gender,Title,avg(Rating)avg 
from ratings r join users u on r.UserID=u.UserID 
join movies m on r.MovieID=m.MovieID 
where Gender = 'M' 
group by Gender,Title order by avg desc limit 10;
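
The two queries above differ only in the Gender filter. As a sketch, both rankings can be produced in one pass with row_number() partitioned by Gender. The having clause is an optional addition that is not part of the original task: it drops movies with fewer than 50 ratings (an arbitrary threshold chosen for illustration) so that a movie rated once with 5 stars does not top the list.

```sql
select Gender, Title, avg_rate
from (
  select Gender, Title, avg_rate,
         -- number the movies within each gender, best average first
         row_number() over (partition by Gender order by avg_rate desc) rn
  from (
    select u.Gender Gender, m.Title Title, avg(r.Rating) avg_rate
    from ratings r
    join users u on r.UserID = u.UserID
    join movies m on r.MovieID = m.MovieID
    group by u.Gender, m.Title
    having count(*) >= 50   -- illustrative threshold, remove to match the task exactly
  ) g
) t
where rn <= 10;
```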

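4. For the movie with movieid = 2116, find the average rating in each age group (Age takes only 7 distinct values, so grouping by those values is enough) (age group, rating)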

select Age,avg(Rating)avg 
from users u join ratings r on u.UserID=r.UserID
where MovieID=2116 group by Age;  

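5. For the woman who watches the most movies (the one with the most ratings), take the 10 movies she rated highest and find the average rating of each (viewer, title, rating)

Approach: the task centers on the woman who watches the most movies, so first identify her, then take the 10 movies she rated highest, and finally compute the average rating of each of those movies across all users.

Find the woman who has rated the most movies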

select u.UserID,count(Rating)count 
from users u join ratings r on u.UserID=r.UserID
where Gender='F' 
group by u.UserID 
order by count desc limit 1;

Result

u.userid        count
1150    1302

Compute the average rating of the 10 movies

select r.MovieID,m.Title,avg(r.Rating)avg
from ratings r join
(select MovieID,Rating from ratings 
where UserID=1150 order by Rating desc limit 10
)t on r.MovieID=t.MovieID join movies m on r.MovieID=m.MovieID group by r.MovieID,m.Title;

Result

162     Crumb (1994)    4.063136456211812
904     Rear Window (1954)      4.476190476190476
951     His Girl Friday (1940)  4.249370277078086
1230    Annie Hall (1977)       4.14167916041979
1966    Metropolitan (1990)     3.6464646464646466
2330    Hands on a Hard Body (1996)     4.163043478260869
3163    Topsy-Turvy (1999)      3.7039473684210527
3307    City Lights (1931)      4.387453874538745
3671    Blazing Saddles (1974)  4.047363717605005
3675    White Christmas (1954)  3.8265682656826567
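
The second query hard-codes UserID=1150 from the first result. The two steps can be chained into one statement, sketched below with illustrative aliases (f, t, cnt, r0). Keep in mind that a rating scale of 1 to 5 produces many ties, so which 10 of her top-rated movies survive the limit is not deterministic.

```sql
select r.MovieID, m.Title, avg(r.Rating) avg_rate
from ratings r
join (
  -- the 10 movies rated highest by the most active female user
  select a.MovieID MovieID, a.Rating Rating
  from ratings a
  join (
    -- the female user with the most ratings
    select u.UserID UserID, count(*) cnt
    from users u join ratings r0 on u.UserID = r0.UserID
    where u.Gender = 'F'
    group by u.UserID
    order by cnt desc limit 1
  ) f on a.UserID = f.UserID
  order by Rating desc limit 10
) t on r.MovieID = t.MovieID
join movies m on r.MovieID = m.MovieID
group by r.MovieID, m.Title;
```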

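6. Find the year with the most good movies (rating >= 4.0) and the 10 best movies of that year

Approach: first extract the year from Title, then find the year with the most good movies, and finally find the 10 best movies of that year.

Create a table containing year, MovieID, and avg_rate (average rating)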

create table year_movie_avgrate as 
select 
substr(a.Title,-5,4) year,a.MovieID MovieID,avg(b.Rating) avg_rate 
from movies a join ratings b on a.MovieID=b.MovieID 
group by a.MovieID,substr(a.Title,-5,4);

Check the data

select * 
from year_movie_avgrate limit 10;

Result

year_movie_avgrate.year year_movie_avgrate.movieid      year_movie_avgrate.avg_rate
1995    1       4.146846413095811
1995    2       3.20114122681883
1995    3       3.01673640167364
1995    4       2.7294117647058824
1995    5       3.0067567567567566
1995    6       3.8787234042553194
1995    7       3.410480349344978
1995    8       3.014705882352941
1995    9       2.656862745098039
1995    10      3.5405405405405403
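
substr(a.Title,-5,4) relies on every title ending in a four-digit year followed by a closing parenthesis, as in Jumanji (1995). The query below is a small sketch for spot-checking that assumption; it uses Hive's built-in regexp_extract, and the column aliases are chosen for illustration only.

```sql
-- Show the positional substring next to a pattern-based extraction.
-- The two columns should agree wherever the title ends in "(yyyy)".
select Title,
       substr(Title, -5, 4) year_by_substr,
       regexp_extract(Title, '\\((\\d{4})\\)$', 1) year_by_regex
from movies limit 10;
```

The backslashes are doubled because the pattern passes through a Hive string literal before it reaches the regex engine.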

From the table above, find the year with the most movies whose average rating is at least 4.0

select 
year,count(*) totalcount 
from year_movie_avgrate 
where avg_rate >= 4.0 
group by year 
order by totalcount desc limit 1;

Result

year   totalcount
1998    27

Nest the query above as a subquery to find the 10 best movies of 1998

select 
a.year year,b.MovieID MovieID,b.avg_rate avg_rate 
from 
(select 
year,count(*) totalcount 
from year_movie_avgrate 
where avg_rate >= 4.0 
group by year 
order by totalcount desc limit 1) a 
join year_movie_avgrate b on a.year=b.year 
order by avg_rate desc limit 10;

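Result (the listing shows titles rather than IDs, so it appears to come from a variant of the query that also joins movies on MovieID)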

name    rate
Follow the Bitch (1998) 5.0
Apple, The (Sib) (1998) 4.666666666666667
Inheritors, The (Die Siebtelbauern) (1998)      4.5
Return with Honor (1998)        4.4
Saving Private Ryan (1998)      4.337353938937053
Celebration, The (Festen) (1998)        4.3076923076923075
West Beirut (West Beyrouth) (1998)      4.3
Central Station (Central do Brasil) (1998)      4.283720930232558
42 Up (1998)    4.2272727272727275
American History X (1998)       4.2265625

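7. Among the movies released in 1997, find the 10 highest-rated Comedy movies

Approach: join the year table built above with the movies table, keep the 1997 movies whose genres include comedy, sort by rating, and take the top 10.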

select 
a.year year,b.MovieID MovieID,b.Title Title,a.avg_rate avg_rate 
from year_movie_avgrate a join movies b 
on a.MovieID=b.MovieID 
where a.year="1997" and instr(lcase(b.Genres),"comedy")>0 
order by avg_rate desc limit 10; 

Result

v.id    v.name  v.rate
2324    Life Is Beautiful (La Vita è bella) (1997)      4.329861111111111
2444    24 7: Twenty Four Seven (1997)  4.0
1827    Big One, The (1997)     4.0
1871    Friend of the Deceased, A (1997)        4.0
1784    As Good As It Gets (1997)       3.9501404494382024
2618    Castle, The (1997)      3.891304347826087
1641    Full Monty, The (1997)  3.872393661384487
1564    Roseanna's Grave (For Roseanna) (1997)  3.8333333333333335
1734    My Life in Pink (Ma vie en rose) (1997) 3.825870646766169
1500    Grosse Pointe Blank (1997)      3.813380281690141

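8. For each genre in the dataset, find the 5 highest-rated movies (genre, title, average rating)

Approach: first split the movies by genre, then join with the ratings table to compute the average rating, and take the top five per genre.

Because a movie can belong to several genres separated by "|", explode the Genres field so that each genre gets its own row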

select 
MovieID,Title,tps.type Type 
from movies lateral view explode(split(Genres,"\\|")) tps as type;
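
To see what the split and explode steps do in isolation, the sketch below applies them to the sample Genres value from movies.dat. It assumes a Hive version that allows select without a from clause (0.13 or later); on older versions the same expression can be run against any one-row table.

```sql
-- split() takes a regular expression, so the pipe must be escaped:
-- an unescaped "|" is the regex alternation operator.
select explode(split("Adventure|Children's|Fantasy", "\\|")) as type;
```

This should return three rows: Adventure, Children's, and Fantasy. The lateral view in the query above pairs each of those rows with the MovieID and Title of the movie it came from.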

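Create a per-genre rating table by nesting the query above as a subquery and joining it with the ratings table. The genre names are lowercased so that differences in letter case do not split one genre into several groups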

create table type_movie_avgrate as 
select 
lower(a.Type) Type,a.Title Title,a.MovieID MovieID,avg(b.Rating) avg_rate 
from 
(select 
MovieID,Title,tps.type Type 
from movies lateral view explode(split(Genres,"\\|")) tps as type) a 
join ratings b on a.MovieID=b.MovieID 
group by lower(a.Type),a.Title,a.MovieID;

Check the data

select * from type_movie_avgrate limit 10;

Result

action  13th Warrior, The (1999)        2826    3.1586666666666665
action  3 Ninjas: High Noon On Mega Mountain (1998)     1739    1.3617021276595744
action  52 Pick-Up (1986)       2475    3.3
action  7th Voyage of Sinbad, The (1958)        3153    3.616279069767442
action  Abyss, The (1989)       1127    3.6839650145772596
action  Aces: Iron Eagle III (1992)     2817    1.64
action  Action Jackson (1988)   3710    2.254054054054054
action  Adrenalin: Fear the Rush (1996) 1383    1.5454545454545454
action  Adventures of Robin Hood, The (1938)    940     3.9735449735449735
action  African Queen, The (1951)       969     4.251655629139073

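The task asks for the 5 movies with the highest average rating in each genre. Use a window function that partitions the rows by genre and sorts each partition by average rating in descending order, then keep the first five rows of each partition

select *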
from 
(select 
Type,Title,MovieID,avg_rate,
row_number() over(distribute by Type sort by avg_rate desc) no 
from type_movie_avgrate) a where a.no<=5;

Partial result:

a.type  a.title a.movieid       a.avg_rate      a.no
action  Sanjuro (1962)  2905    4.608695652173913       1
action  Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)     2019    4.560509554140127       2
action  Godfather, The (1972)   858     4.524966261808367       3
action  Raiders of the Lost Ark (1981)  1198    4.477724741447892       4
action  Star Wars: Episode IV - A New Hope (1977)       260     4.453694416583082       5
adventure       Ulysses (Ulisse) (1954) 3172    5.0     1
adventure       Sanjuro (1962)  2905    4.608695652173913       2
adventure       Raiders of the Lost Ark (1981)  1198    4.477724741447892       3
adventure       Star Wars: Episode IV - A New Hope (1977)       260     4.453694416583082       4
adventure       Lawrence of Arabia (1962)       1204    4.401925391095066       5
animation       Close Shave, A (1995)   745     4.52054794520548        1
animation       Wrong Trousers, The (1993)      1148    4.507936507936508       2
animation       Wallace & Gromit: The Best of Aardman Animation (1996)  720     4.426940639269406       3
animation       Grand Day Out, A (1992) 1223    4.361522198731501       4
animation       Creature Comforts (1990)        3429    4.335766423357664       5
children's      Wizard of Oz, The (1939)        919     4.247962747380675       1
children's      Toy Story 2 (1999)      3114    4.218927444794953       2
children's      Toy Story (1995)        1       4.146846413095811       3
children's      Iron Giant, The (1999)  2761    4.0474777448071215      4
children's      Winnie the Pooh and the Blustery Day (1968)     1023    3.986425339366516       5
comedy  Smashing Time (1967)    3233    5.0     1
comedy  Follow the Bitch (1998) 1830    5.0     2
comedy  One Little Indian (1973)        3607    5.0     3
comedy  Close Shave, A (1995)   745     4.52054794520548        4
comedy  Wrong Trousers, The (1993)      1148    4.507936507936508       5
crime   Lured (1947)    3656    5.0     1
crime   Godfather, The (1972)   858     4.524966261808367       2
crime   Usual Suspects, The (1995)      50      4.517106001121705       3
crime   Bells, The (1926)       3517    4.5     4
crime   Double Indemnity (1944) 3435    4.415607985480944       5
documentary     Bittersweet Motel (2000)        3881    5.0     1
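
In the comedy group above, three movies tie at 5.0. row_number() assigns distinct numbers to tied rows in an unspecified order, so which tied movies land inside the cutoff can vary between runs. If tied movies should be treated equally, rank() is one option, sketched here with the window written in partition by / order by form and an illustrative alias rk.

```sql
select Type, Title, MovieID, avg_rate, rk
from (
  select Type, Title, MovieID, avg_rate,
         -- tied averages receive the same rank, so a tie at the cutoff keeps all tied rows
         rank() over (partition by Type order by avg_rate desc) rk
  from type_movie_avgrate
) a
where rk <= 5;
```

With rank(), a genre may return more than five rows when several movies share the fifth-best average.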

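9. Find the highest-rated genre for each year (year, genre, rating)

Approach: the task needs the average rating of every movie in every year, so join the per-year movie rating table built earlier with the per-genre movie rating table and average by year and genre. From that result select year, genre, and rating, and apply a window function that partitions by year and sorts by rating. Then keep only the top-rated row of each year, that is, the first one.

select *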
from 
(select 
c.year year,c.Type Type,c.avg_rate avg_rate,
row_number() over(distribute by c.year sort by c.avg_rate desc) no 
from 
(
select 
a.year year,b.Type Type,avg(a.avg_rate) avg_rate 
from year_movie_avgrate a 
join type_movie_avgrate b 
on a.MovieID=b.MovieID 
group by a.year,b.Type
) c ) d where d.no=1;

Partial result

d.year  d.type  d.avg_rate      d.no
1919    comedy  3.6315789473684212      1
1920    comedy  3.6666666666666665      1
1921    action  3.7903225806451615      1
1922    horror  3.991596638655462       1
1923    comedy  3.4444444444444446      1
1925    war     3.97008547008547        1
1926    crime   4.5     1
1927    comedy  4.368932038834951       1
1928    comedy  3.6458333333333335      1
1929    musical 3.1875  1
1930    war     4.1940298507462686      1
1931    drama   4.387453874538745       1
1932    drama   3.7752100840336134      1
1933    war     4.21043771043771        1
1934    mystery 4.239726027397261       1
1935    musical 4.147410358565737       1
1936    drama   4.239130434782608       1
1937    war     4.33939393939394        1
1938    mystery 4.185929648241206       1
1939    musical 4.247962747380675       1
1940    comedy  4.000047333684532       1
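
The query above averages the per-movie averages, so a movie with a single rating weighs as much as one with thousands. If the intent is instead the mean over all individual ratings of a genre in a year, a sketch along the following lines could be used. It is an alternative reading of the task rather than the approach taken above, and the aliases g and rn are illustrative.

```sql
select year, Type, avg_rate
from (
  select year, Type, avg_rate,
         row_number() over (partition by year order by avg_rate desc) rn
  from (
    -- average over individual ratings, grouped by release year and genre
    select g.year year, g.Type Type, avg(r.Rating) avg_rate
    from (
      select MovieID, substr(Title, -5, 4) year, lower(tps.type) Type
      from movies lateral view explode(split(Genres, "\\|")) tps as type
    ) g
    join ratings r on g.MovieID = r.MovieID
    group by g.year, g.Type
  ) c
) d
where rn = 1;
```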

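10. Find the highest-rated movie in each region and write the result to HDFS (region, title, rating)

Approach: first compute the average rating of every movie in every region (Zipcode) by joining users, ratings, and movies and grouping by region, movie ID, and title (every non-aggregated column we select must appear in the group by). From that result select region, title, and rating, and apply a window function that partitions by region and sorts by rating, then keep only the top row of each region. Finally write the output to HDFS. Note that insert overwrite directory replaces whatever the target directory already contains, so the target should be a directory reserved for this output.

insert overwrite directory "/user/data/" 
select *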
from 
(select 
d.Zipcode Zipcode,d.MovieID MovieID,d.Title Title,d.avg_rate avg_rate,row_number() over(distribute by d.Zipcode sort by d.avg_rate desc) no
from 
(
select 
a.Zipcode  Zipcode,c.MovieID MovieID,c.Title Title,
avg(b.Rating) avg_rate 
from users a join ratings b on a.UserID=b.UserID 
join movies c on b.MovieID=c.MovieID 
group by a.Zipcode,c.MovieID,c.Title) d )e where e.no=1;
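
As a hedged variant, the sketch below writes only the three requested columns as tab-separated text into a dedicated directory. The path /user/data/yingping/result_q10 is a placeholder that does not come from the original text, and the row format clause assumes Hive 0.11 or later.

```sql
-- "/user/data/yingping/result_q10" is a placeholder; use a directory that holds nothing else.
insert overwrite directory "/user/data/yingping/result_q10"
row format delimited fields terminated by '\t'
select e.Zipcode, e.Title, e.avg_rate
from (
  select d.Zipcode Zipcode, d.Title Title, d.avg_rate avg_rate,
         row_number() over (partition by d.Zipcode order by d.avg_rate desc) no
  from (
    -- average rating of every movie within every zip code
    select a.Zipcode Zipcode, c.MovieID MovieID, c.Title Title,
           avg(b.Rating) avg_rate
    from users a
    join ratings b on a.UserID = b.UserID
    join movies c on b.MovieID = c.MovieID
    group by a.Zipcode, c.MovieID, c.Title
  ) d
) e
where e.no = 1;
```

The output files can then be inspected with hdfs dfs -cat on that directory.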

Reposted from blog.csdn.net/young_0609/article/details/103000863