Movie Ratings Project (Hive)

Copyright notice: please credit the source when reposting or reusing: https://blog.csdn.net/qq_35180983/article/details/83239407

Three data files are provided:
1. users.dat
Record format: 2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code

2. movies.dat
Record format: 2::Jumanji (1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, genres

3. ratings.dat
Record format: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp
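
As a quick sanity check of the three layouts, the sample lines above can be split on the `::` separator. This is a minimal Python sketch for illustration only; the `parse_line` helper and the field-name tuples are assumptions made here, not part of the dataset or of Hive.

```python
# Field names follow the column lists given above.
USER_FIELDS = ("UserID", "Gender", "Age", "Occupation", "Zipcode")
MOVIE_FIELDS = ("MovieID", "Title", "Genres")
RATING_FIELDS = ("UserID", "MovieID", "Rating", "Timestamped")


def parse_line(line, fields):
    """Split one '::'-delimited record and pair each value with its field name."""
    values = line.rstrip("\n").split("::")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


user = parse_line("2::M::56::16::70072", USER_FIELDS)
movie = parse_line("2::Jumanji (1995)::Adventure|Children's|Fantasy", MOVIE_FIELDS)
rating = parse_line("1::1193::5::978300760", RATING_FIELDS)

print(user["Zipcode"], movie["Title"], rating["Rating"])
```

All values stay strings here; the type conversion (BigInt, Int, Double) is what the Hive table definition adds on top.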

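-- show the current database in the prompt
set hive.cli.print.current.db=true;
-- let Hive run small jobs in local mode automatically
set hive.exec.mode.local.auto=true;
-- allow queries that strict mode rejects (e.g. ORDER BY without LIMIT)
set hive.mapred.mode=nonstrict;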

Project data:
Link: https://pan.baidu.com/s/1Aq28cvfaSgdPe_TmsFqPaQ Password: nlqo

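Requirements:

Data requirements:

(1) Write a shell script to clean the data. (Hive's default delimited row format does not support multi-character field delimiters: it can split on ':' but not on '::', so an ordinary delimited table definition will not work on the raw files and a simple cleaning pass is required.)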

#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m
for i in *.dat
do
echo "$i"
sed -i "s/::/:/g" "$i"
done
echo "have finished!"

Or:

#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m
# rewrite every file under the current directory that contains "::"
sed -i "s/::/:/g" $(grep -rl "::" ./)
echo "have finished!"
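
One caveat with collapsing `::` to `:`: some titles contain a colon themselves (for example `Star Wars: Episode IV - A New Hope (1977)`, which appears in later results), so a `:`-delimited table splits such a title into extra fields. Below is a small Python sketch of the effect. The sample record is assembled here for illustration, and its genre list is abbreviated.

```python
# Illustrative record: MovieID 260 and its title appear in the results later
# in this post; the genre list here is abbreviated for the example.
raw = "260::Star Wars: Episode IV - A New Hope (1977)::Action|Adventure"

cleaned = raw.replace("::", ":")      # same substitution as sed "s/::/:/g"

fields_raw = raw.split("::")          # split on the original two-character separator
fields_cleaned = cleaned.split(":")   # split on the single ':' after cleaning

print(len(fields_raw), fields_raw[1])
print(len(fields_cleaned), fields_cleaned[1])
```

With a three-column table, the second field would then hold only `Star Wars`, which may explain the truncated `Star Wars` and `Terminator 2` rows in the question 2 output. A replacement delimiter that does not occur anywhere in the data avoids the problem.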

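(2) Alternatively, load the raw files in a form Hive can parse directly.
Note: this is handled when creating the tables (see the RegexSerDe definitions below).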

Hive tasks:

1. Create the tables correctly, load the data (three tables, three data files), and verify the result

create table users(UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/users.dat' INTO TABLE users;
select * from users limit 10;
 
create table movies(MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/movies.dat' INTO TABLE movies;
select * from movies limit 10;
 
create table ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/ratings.dat' INTO TABLE ratings;
select * from ratings limit 10;
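
The `input.regex` patterns can be checked outside Hive. The sketch below uses Python's `re` module on the sample lines from the top of this post. Python and Java regex syntax agree for these simple patterns, but the `split_with_regex` helper is only an illustration and not how RegexSerDe is implemented.

```python
import re

# The same patterns as in the serdeproperties above.
USERS_RE = re.compile(r"(.*)::(.*)::(.*)::(.*)::(.*)")
MOVIES_RE = re.compile(r"(.*)::(.*)::(.*)")
RATINGS_RE = re.compile(r"(.*)::(.*)::(.*)::(.*)")


def split_with_regex(pattern, line):
    """Return the capture groups if the whole line matches, else None."""
    match = pattern.fullmatch(line)
    return list(match.groups()) if match else None


print(split_with_regex(USERS_RE, "2::M::56::16::70072"))
print(split_with_regex(MOVIES_RE, "2::Jumanji (1995)::Adventure|Children's|Fantasy"))
print(split_with_regex(RATINGS_RE, "1::1193::5::978300760"))
# A line that was already rewritten to single ':' no longer matches.
print(split_with_regex(RATINGS_RE, "1:1193:5:978300760"))
```

The last call matters in practice: the cleaning script edits the .dat files in place, so files already processed by it would presumably not match these `::` patterns, and RegexSerDe typically returns NULL columns for rows that do not match.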

2. Find the 10 movies with the most ratings and give the rating counts (movie title, number of ratings)

Analysis:
	Tables:          movies    ratings
	Required fields: title     count(userid)

select a.title, count(b.userid) counts 
from movies a join ratings b 
on a.movieid = b.movieid 
group by a.title,b.movieid 
order by counts desc 
limit 10
;

American Beauty (1999)	3428
Star Wars	2991
Star Wars	2990
Star Wars	2883
Jurassic Park (1993)	2672
Saving Private Ryan (1998)	2653
Terminator 2	2649
Matrix, The (1999)	2590
Back to the Future (1985)	2583
Silence of the Lambs, The (1991)	2578
Time taken: 125.833 seconds, Fetched: 10 row(s)
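
The query groups by both `a.title` and `b.movieid`, so two movies that share the same title string are counted separately, which is what the three distinct `Star Wars` rows above reflect. Here is a minimal Python sketch of that difference, using made-up rating rows (the IDs and counts are hypothetical):

```python
from collections import Counter

# Hypothetical (movieid, title) pairs, one per rating; two different movies share a title.
ratings = [
    (1, "Star Wars"), (1, "Star Wars"), (1, "Star Wars"),
    (2, "Star Wars"), (2, "Star Wars"),
    (3, "Jurassic Park (1993)"),
]

# GROUP BY title, movieid: one counter entry per distinct movie.
by_movie = Counter(ratings)
# GROUP BY title only: movies with the same title collapse into one entry.
by_title = Counter(title for _, title in ratings)

print(by_movie.most_common(2))
print(by_title.most_common(1))
```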

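3. Find the 10 highest-rated movies among male and among female users respectively (gender, movie title, rating)

Analysis:
Tables: users movies ratings
Required fields: gender title avg(rating)

select a.gender,b.title,avg(c.rating) avgs,count(c.rating) counts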
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
order by avgs desc
limit 10
;

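FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'gender'
The query fails because a.gender is selected but is neither aggregated nor listed in GROUP BY. Reference:
https://blog.csdn.net/zhoujj303030/article/details/38424469

One workaround is to wrap the non-grouped columns in collect_set (adding a.gender to the GROUP BY would also work). The version below also requires at least 60 ratings per movie: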

select  collect_set(a.gender),collect_set(b.title),avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;

["F"]	["Close Shave, A (1995)"]	4.644444444444445	180
["F"]	["Wrong Trousers, The (1993)"]	4.588235294117647	238
["F"]	["Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)"]	4.572649572649572	117
["F"]	["Wallace & Gromit"]	4.563106796116505	103
["F"]	["Schindler's List (1993)"]	4.56260162601626	615
["F"]	["Shawshank Redemption, The (1994)"]	4.539074960127592	627
["F"]	["Grand Day Out, A (1992)"]	4.537878787878788	132
["F"]	["To Kill a Mockingbird (1962)"]	4.536666666666667	300
["F"]	["Creature Comforts (1990)"]	4.513888888888889	72
["F"]	["Usual Suspects, The (1995)"]	4.513317191283293	413

The results come back as arrays. Improved version, taking the first element of each:

select  collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;

F	Close Shave, A (1995)	4.644444444444445	180
F	Wrong Trousers, The (1993)	4.588235294117647	238
F	Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572649572649572	117
F	Wallace & Gromit	4.563106796116505	103
F	Schindler's List (1993)	4.56260162601626	615
F	Shawshank Redemption, The (1994)	4.539074960127592	627
F	Grand Day Out, A (1992)	4.537878787878788	132
F	To Kill a Mockingbird (1962)	4.536666666666667	300
F	Creature Comforts (1990)	4.513888888888889	72
F	Usual Suspects, The (1995)	4.513317191283293	413

select  collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'M'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;

M	Sanjuro (1962)	4.639344262295082	61
M	Godfather, The (1972)	4.583333333333333	1740
M	Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)	4.576628352490421	522
M	Shawshank Redemption, The (1994)	4.560625	1600
M	Raiders of the Lost Ark (1981)	4.520597322348094	1942
M	Usual Suspects, The (1995)	4.518248175182482	1370
M	Star Wars	4.495307167235495	2344
M	Schindler's List (1993)	4.49141503848431	1689
M	Paths of Glory (1957)	4.485148514851486	202
M	Wrong Trousers, The (1993)	4.478260869565218	644

Combine the two result sets:

select x.* from(
select  collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having count(c.rating) >= 60
order by avgs desc
limit 10)x
union all
select y.* from(
select  collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'M'
group by b.movieid,b.title
having count(c.rating) >= 60
order by avgs desc
limit 10)y;

M	Sanjuro (1962)	4.639344262295082	
M	Godfather, The (1972)	4.583333333333333	
M	Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)	4.576628352490421	
M	Shawshank Redemption, The (1994)	4.560625	
M	Raiders of the Lost Ark (1981)	4.520597322348094	
M	Usual Suspects, The (1995)	4.518248175182482	
M	Star Wars	4.495307167235495	
M	Schindler's List (1993)	4.49141503848431	
M	Paths of Glory (1957)	4.485148514851486	
M	Wrong Trousers, The (1993)	4.478260869565218	
F	Close Shave, A (1995)	4.644444444444445	
F	Wrong Trousers, The (1993)	4.588235294117647	
F	Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572649572649572	
F	Wallace & Gromit	4.563106796116505	
F	Schindler's List (1993)	4.56260162601626	
F	Shawshank Redemption, The (1994)	4.539074960127592	
F	Grand Day Out, A (1992)	4.537878787878788	
F	To Kill a Mockingbird (1962)	4.536666666666667	
F	Creature Comforts (1990)	4.513888888888889	
F	Usual Suspects, The (1995)	4.513317191283293	
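
The same per-gender ranking can be sketched in Python to make the steps explicit: group by (gender, movie), drop groups with too few ratings, then keep the best averages within each gender. The rows, the `top_by_gender` name, and the thresholds below are made up for the example.

```python
from collections import defaultdict

# Hypothetical joined rows: (gender, title, rating).
rows = [
    ("F", "A", 5.0), ("F", "A", 4.0), ("F", "B", 3.0), ("F", "B", 3.0),
    ("F", "C", 5.0),                      # only one rating: filtered out below
    ("M", "A", 2.0), ("M", "A", 3.0), ("M", "B", 4.0), ("M", "B", 5.0),
]


def top_by_gender(rows, min_count, n):
    """Average rating per (gender, title), keep groups with >= min_count ratings,
    then return the n best titles for each gender."""
    grouped = defaultdict(list)
    for gender, title, rating in rows:
        grouped[(gender, title)].append(rating)

    per_gender = defaultdict(list)
    for (gender, title), ratings in grouped.items():
        if len(ratings) >= min_count:          # HAVING count(...) >= min_count
            per_gender[gender].append((title, sum(ratings) / len(ratings)))

    return {
        # ORDER BY avgs DESC LIMIT n, applied separately to each gender
        gender: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for gender, items in per_gender.items()
    }


result = top_by_gender(rows, min_count=2, n=1)
print(result)
```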

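4. For the movie with movieid = 2116, find the average rating in each age group (there are only 7 distinct age values, so group by those 7) (age group, rating)

Analysis:
	Tables:          users   ratings
	Required fields: age     avg(rating)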

select u.age,avg(r.rating)
from ratings r
join users u on u.userid = r.userid
where r.movieid = 2116
group by u.age
order by u.age;

1	3.2941176470588234
18	3.3580246913580245
25	3.436548223350254
35	3.2278481012658227
45	2.8275862068965516
50	3.32
56	3.5

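5. For the woman who watches the most movies (the female user with the most ratings), take the 10 movies she rated highest and find their average ratings (viewer, movie title, rating)

Note: the queries from this point on use shortened column names (uid, mid, sex, zcode) and, in questions 5 and 7, a view named film_view whose definition is not shown in this post. Judging by the columns referenced, it presumably joins ratings, users, and movies, much like film_view2 in question 10.

(1) Find the female user with the most ratings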

select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a;
+--------+
| a.uid  |
+--------+
| 1150   |
+--------+

(2) Find the 10 movies that user rated highest

select u.uid,r.title,r.rating from film_view r
join 
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10;
Rewritten without the extra subquery level:
select a.uid,r.title,r.rating from film_view r
join 
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a
on r.uid = a.uid
order by r.rating desc limit 10;

+--------+----------------------------------------------------+-----------+
| u.uid  |                      r.title                       | r.rating  |
+--------+----------------------------------------------------+-----------+
| 1150   | Close Shave, A (1995)                              | 5.0       |
| 1150   | Night on Earth (1991)                              | 5.0       |
| 1150   | Trust (1990)                                       | 5.0       |
| 1150   | Rear Window (1954)                                 | 5.0       |
| 1150   | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 5.0       |
| 1150   | Being John Malkovich (1999)                        | 5.0       |
| 1150   | Roger & Me (1989)                                  | 5.0       |
| 1150   | It Happened One Night (1934)                       | 5.0       |
| 1150   | Crying Game, The (1992)                            | 5.0       |
| 1150   | Duck Soup (1933)                                   | 5.0       |
+--------+----------------------------------------------------+-----------+

(3) Find the average rating of those 10 movies (viewer, movie title, rating)
-- Large side first in the join, elapsed: 188 s

select aa.uid,bb.* from
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
join 
(select u.uid,r.title,r.rating from film_view r
 join 
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
on aa.title = bb.title;
+---------+----------------------------------------------------+---------------------+
| aa.uid  |                      bb.title                      |     bb.avgrate      |
+---------+----------------------------------------------------+---------------------+
| 1150    | Being John Malkovich (1999)                        | 4.125390450691656   |
| 1150    | Close Shave, A (1995)                              | 4.52054794520548    |
| 1150    | Crying Game, The (1992)                            | 3.7314890154597236  |
| 1150    | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 4.4498902706656915  |
| 1150    | Duck Soup (1933)                                   | 4.21043771043771    |
| 1150    | It Happened One Night (1934)                       | 4.280748663101604   |
| 1150    | Night on Earth (1991)                              | 3.747422680412371   |
| 1150    | Rear Window (1954)                                 | 4.476190476190476   |
| 1150    | Roger & Me (1989)                                  | 4.0739348370927315  |
| 1150    | Trust (1990)                                       | 4.188888888888889   |
+---------+----------------------------------------------------+---------------------+
-- Small side first in the join, elapsed: 236 s, same result
select aa.uid,bb.* from
(select u.uid,r.title,r.rating from film_view r
 join 
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
join 
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
on aa.title = bb.title;
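
The three steps of question 5 can be traced on a toy dataset. This Python sketch uses invented rows with the same columns the queries reference (`uid`, `sex`, `title`, `rating`); it mirrors the logic only and says nothing about how Hive executes the joins.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like film_view: (uid, sex, title, rating).
film_view = [
    (1, "F", "A", 5.0), (1, "F", "B", 3.0), (1, "F", "C", 4.0),
    (2, "F", "A", 4.0),
    (3, "M", "A", 2.0), (3, "M", "B", 5.0), (3, "M", "C", 1.0), (3, "M", "D", 3.0),
]

# Step 1: the female user with the most ratings.
counts = Counter(uid for uid, sex, _, _ in film_view if sex == "F")
top_uid = counts.most_common(1)[0][0]

# Step 2: her highest-rated titles (top 2 here instead of top 10).
hers = sorted((row for row in film_view if row[0] == top_uid),
              key=lambda row: row[3], reverse=True)[:2]
top_titles = [title for _, _, title, _ in hers]

# Step 3: average rating of those titles over all users.
by_title = defaultdict(list)
for _, _, title, rating in film_view:
    by_title[title].append(rating)
averages = {title: sum(by_title[title]) / len(by_title[title]) for title in top_titles}

print(top_uid, averages)
```

In the real result all ten of her picks are rated 5.0, so which ten of her 5.0 movies are returned is arbitrary unless a secondary sort key is added.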

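6. Find the 10 best movies of the year that has the most good movies (rating >= 4.0)
(1) Extract the release year: every title ends with a 6-character suffix of the form (YYYY), so the year is the 4 characters starting 5 from the end

select mid,substring(title,-5,4) from movies limit 5;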
+------+-------+
| mid  |  _c1  |
+------+-------+
| 1    | 1995  |
| 2    | 1995  |
| 3    | 1995  |
| 4    | 1995  |
| 5    | 1995  |
+------+-------+
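
`substring(title,-5,4)` starts five characters from the end and takes four, which skips the closing parenthesis. The equivalent Python slice is shown below as a sketch. `extract_year` is a name chosen for this example, and the validity check is an addition that the Hive query does not perform.

```python
def extract_year(title):
    """Return the 4-digit year from a title ending in '(YYYY)', or None."""
    year = title[-5:-1]          # same characters as substring(title, -5, 4)
    return year if year.isdigit() and title.endswith(")") else None


print(extract_year("Jumanji (1995)"))
print(extract_year("Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)"))
print(extract_year("Star Wars"))  # a truncated title has no year suffix
```

Titles that lost their suffix (for example after a delimiter mishap) would yield a non-year string in Hive, so a check of this kind may be worth adding when the year is used as a grouping key.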

(2) Join movies and ratings

create view moive_6_v as
select r.rating,m.* from ratings r
join 
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid;

First 5 rows of the join (the select above run with limit 5):
+-----------+--------+-----------------------------------------+---------+
| r.rating  | m.mid  |                 m.title                 | m.year  |
+-----------+--------+-----------------------------------------+---------+
| 5.0       | 1193   | One Flew Over the Cuckoo's Nest (1975)  | 1975    |
| 3.0       | 661    | James and the Giant Peach (1996)        | 1996    |
| 3.0       | 914    | My Fair Lady (1964)                     | 1964    |
| 4.0       | 3408   | Erin Brockovich (2000)                  | 2000    |
| 5.0       | 2355   | Bug's Life, A (1998)                    | 1998    |
+-----------+--------+-----------------------------------------+---------+

(3) Find the year with the most movies whose average rating is at least 4

create view moive_6_v_a as
select f.year,f.title,avg(f.rating) avgr from moive_6_v f
group by f.year,f.title;


select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year 
order by n desc 
limit 5;

+---------+-----+
| m.year  |  n  |
+---------+-----+
| 1998    | 27  |
| 1995    | 25  |
| 1996    | 24  |
| 1999    | 20  |
| 1994    | 20  |
+---------+-----+

(4) Find the 10 best movies of that year

select rr.title,rr.year,rr.avgrate,rr.cc from 
(select mm.title,mm.year,avg(rating)avgrate,count(*)cc
from 
(select r.rating,m.* from ratings r
join 
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid)mm
group by mm.year,mm.title having cc >=50)rr
join
(select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year 
order by n desc 
limit 1)yy
on rr.year = yy.year
order by rr.avgrate desc
limit 10;
+---------------------------------------------+----------+---------------------+--------+
|                  rr.title                   | rr.year  |     rr.avgrate      | rr.cc  |
+---------------------------------------------+----------+---------------------+--------+
| Saving Private Ryan (1998)                  | 1998     | 4.337353938937053   | 2653   |
| Celebration, The (Festen) (1998)            | 1998     | 4.3076923076923075  | 117    |
| Central Station (Central do Brasil) (1998)  | 1998     | 4.283720930232558   | 215    |
| 42 Up (1998)                                | 1998     | 4.2272727272727275  | 88     |
| American History X (1998)                   | 1998     | 4.2265625           | 640    |
| Run Lola Run (Lola rennt) (1998)            | 1998     | 4.224813432835821   | 1072   |
| Shakespeare in Love (1998)                  | 1998     | 4.127479949345715   | 2369   |
| After Life (1998)                           | 1998     | 4.088235294117647   | 102    |
| Get Real (1998)                             | 1998     | 4.088235294117647   | 68     |
| Elizabeth (1998)                            | 1998     | 4.029850746268656   | 938    |
+---------------------------------------------+----------+---------------------+--------+

7. Among movies released in 1997, find the 10 highest-rated Comedy movies
(1) Movies released in 1997

select title,rating,genres from film_view
where substring(title,-5,4)=1997
limit 10;

(2) Comedy movies released in 1997

select title,rating,genres from film_view
where substring(title,-5,4)=1997 and 
(lcase(genres) like '%comedy%')
limit 10;
+---------------------------------------+---------+------------------------------------------------+
|                 title                 | rating  |                     genres                     |
+---------------------------------------+---------+------------------------------------------------+
| Hercules (1997)                       | 4.0     | Adventure|Animation|Children's|Comedy|Musical  |
| As Good As It Gets (1997)             | 5.0     | Comedy|Drama                                   |
| Full Monty, The (1997)                | 2.0     | Comedy                                         |
| Beverly Hills Ninja (1997)            | 3.0     | Action|Comedy                                  |
| Men in Black (1997)                   | 3.0     | Action|Adventure|Comedy|Sci-Fi                 |
| Liar Liar (1997)                      | 3.0     | Comedy                                         |
| Love and Death on Long Island (1997)  | 3.0     | Comedy|Drama                                   |
| Grosse Pointe Blank (1997)            | 3.0     | Comedy|Crime                                   |
| Men in Black (1997)                   | 4.0     | Action|Adventure|Comedy|Sci-Fi                 |
| Billy's Hollywood Screen Kiss (1997)  | 4.0     | Comedy|Romance                                 |
+---------------------------------------+---------+------------------------------------------------+

(3) The 10 highest-rated

select mm.* ,f.genres from
(select m.title,avg(m.rating)avgrate,count(*)cc from 
(select title,rating,genres from film_view
where substring(title,-5,4)=1997 and 
(lcase(genres) like '%comedy%'))m
group by m.title having cc >= 50
order by avgrate desc
limit 10)mm
join movies f
on mm.title = f.title;
+----------------------------------------------------+---------------------+--------+---------------------------------+
|                      mm.title                      |     mm.avgrate      | mm.cc  |            f.genres             |
+----------------------------------------------------+---------------------+--------+---------------------------------+
| Life Is Beautiful (La Vita è bella) (1997)         | 4.329861111111111   | 1152   | Comedy|Drama                    |
| Big One, The (1997)                                | 4.0                 | 102    | Comedy|Documentary              |
| As Good As It Gets (1997)                          | 3.9501404494382024  | 1424   | Comedy|Drama                    |
| Full Monty, The (1997)                             | 3.872393661384487   | 1199   | Comedy                          |
| My Life in Pink (Ma vie en rose) (1997)            | 3.825870646766169   | 201    | Comedy|Drama                    |
| Grosse Pointe Blank (1997)                         | 3.813380281690141   | 1136   | Comedy|Crime                    |
| Men in Black (1997)                                | 3.739952718676123   | 2538   | Action|Adventure|Comedy|Sci-Fi  |
| Austin Powers: International Man of Mystery (1997) | 3.7103734439834026  | 1205   | Comedy                          |
| Billy's Hollywood Screen Kiss (1997)               | 3.6710526315789473  | 76     | Comedy|Romance                  |
| Liar Liar (1997)                                   | 3.5                 | 666    | Comedy                          |
+----------------------------------------------------+---------------------+--------+---------------------------------+
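
`lcase(genres) like '%comedy%'` is a substring test on the whole genre string. An alternative is to split on `|` and compare whole genre names, which cannot accidentally match a longer genre name containing the same letters. Here is a Python sketch of both checks (the helper names are hypothetical):

```python
def has_genre_substring(genres, wanted):
    """Substring test, like lcase(genres) like '%wanted%'."""
    return wanted.lower() in genres.lower()


def has_genre_token(genres, wanted):
    """Exact test against each '|'-separated genre name."""
    return wanted.lower() in (g.lower() for g in genres.split("|"))


print(has_genre_substring("Action|Adventure|Comedy|Sci-Fi", "comedy"))
print(has_genre_token("Action|Adventure|Comedy|Sci-Fi", "comedy"))
# The two tests differ when the wanted text is only part of a genre name.
print(has_genre_substring("Film-Noir|Thriller", "film"))
print(has_genre_token("Film-Noir|Thriller", "film"))
```

Among the genre names that appear in this post, only Comedy contains "comedy", so both tests would select the same rows here.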

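8. For each genre in the rating database, find the 5 highest-rated movies (genre, movie title, average rating)
Difficulty: taking the top 5 within each genre
(1) Explode the genre list into one row per genre

Create a new movies table: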
create table newmovies(mid int, title string,genres array<string>)row format 
delimited fields terminated by '\t' collection items terminated by ','stored as textfile;
Insert the data:
 insert into table newmovies select mid,title,split(genres,'\\|') from movies;
Explode:
 create table nnmovies(mid int, title string, genres string)row 
 format delimited fields terminated by '\t';

insert into table nnmovies select mid, title, tpf.key from newmovies t 
lateral view explode(t.genres) tpf as key;
(Exploding a map column works the same way: select id,name, tpf.mykey as key, tpf.myvalue as value 
from cdt t lateral view explode(t.piaofang) tpf as mykey, myvalue;)
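
What `split` plus `lateral view explode` does to each row can be mimicked in a few lines of Python: one output row per genre, with the other columns repeated. The first sample row reuses the movies.dat example from the top of this post, the second is hypothetical, and the `explode_genres` function is illustrative.

```python
def explode_genres(movies):
    """Yield (mid, title, genre) once per '|'-separated genre of each movie."""
    for mid, title, genres in movies:
        for genre in genres.split("|"):   # split(genres, '\\|') followed by explode
            yield mid, title, genre


movies = [
    (2, "Jumanji (1995)", "Adventure|Children's|Fantasy"),
    (3, "Some Comedy (1995)", "Comedy"),   # hypothetical single-genre row
]

exploded = list(explode_genres(movies))
for row in exploded:
    print(row)
```

Because a multi-genre movie now appears once per genre, any later count or average over the exploded table is per (movie, genre) pair rather than per movie.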

(2) Join into a view

create view film_view3 as 
(select r.*,m.title,m.genres 
from ratings r
join nnmovies m on r.mid = m.mid);

(3) The 5 highest-rated movies in each genre (genre, movie title, average rating)

	<1> Create a view with each movie's average rating, per genre
	create view movie_rate as select a.mid,a.title,a.genres,avg(rating)rate 
	from film_view3 a group by a.genres,a.mid,a.title;
	<2> Use row_number to number the movies within each genre
	create view movie_rate_order as
	select t.*,row_number() over (distribute by genres sort by rate desc) rn 
	from movie_rate t order by t.genres,t.rate desc;
	<3> Use the per-group number to take the top 5 (only 10 result rows are shown)
	select m.* from movie_rate_order m where rn <6 
	order by m.genres,m.rate desc limit 10;
+--------+----------------------------------------------------+------------+--------------------+-------+
| m.mid  |                      m.title                       |  m.genres  |       m.rate       | m.rn  |
+--------+----------------------------------------------------+------------+--------------------+-------+
| 2905   | Sanjuro (1962)                                     | Action     | 4.608695652173913  | 1     |
| 2019   | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | Action     | 4.560509554140127  | 2     |
| 858    | Godfather, The (1972)                              | Action     | 4.524966261808367  | 3     |
| 1198   | Raiders of the Lost Ark (1981)                     | Action     | 4.477724741447892  | 4     |
| 260    | Star Wars: Episode IV - A New Hope (1977)          | Action     | 4.453694416583082  | 5     |
| 3172   | Ulysses (Ulisse) (1954)                            | Adventure  | 5.0                | 1     |
| 2905   | Sanjuro (1962)                                     | Adventure  | 4.608695652173913  | 2     |
| 1198   | Raiders of the Lost Ark (1981)                     | Adventure  | 4.477724741447892  | 3     |
| 260    | Star Wars: Episode IV - A New Hope (1977)          | Adventure  | 4.453694416583082  | 4     |
| 1204   | Lawrence of Arabia (1962)                          | Adventure  | 4.401925391095066  | 5     |
+--------+----------------------------------------------------+------------+--------------------+-------+
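
The `row_number() over (distribute by genres sort by rate desc)` step numbers rows within each genre from the highest rate down; filtering on `rn` then gives a top-N per group. Below is a Python sketch of the same idea, with hypothetical rows and a hypothetical `top_n_per_group` helper:

```python
from itertools import groupby
from operator import itemgetter


def top_n_per_group(rows, n):
    """rows: (genre, title, rate). Return (genre, title, rate, rn) with rn <= n."""
    # Sort by genre, then by rate descending, so each group is contiguous and ordered.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        for rn, (genre, title, rate) in enumerate(group, start=1):  # row_number()
            if rn <= n:                                             # where rn <= n
                result.append((genre, title, rate, rn))
    return result


rows = [
    ("Action", "X", 4.1), ("Action", "Y", 4.6), ("Action", "Z", 3.9),
    ("Drama", "P", 4.4), ("Drama", "Q", 4.8),
]
top2 = top_n_per_group(rows, n=2)
for row in top2:
    print(row)
```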

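9. The highest-rated genre in each year (year, genre, rating)

(1) Create a view with year and genre
create view movie_y_g as 
(select r.*,m.title,m.genres,substring(m.title,-5,4)year 
from ratings r
join nnmovies m on r.mid = m.mid);
(2) Create the rating view
create view movie_y_g_r as
select m.year,m.genres,avg(m.rating)rate,count(*)cc from movie_y_g m
group by m.year,m.genres having cc >= 50
order by m.year,rate desc;
(3) Add a row_number
Note: the window below partitions by genres, so the result lists, for each genre, the year in which it had its highest average rating. Partitioning with distribute by year instead would give the top genre of each year, which is what the question asks for.
create view movie_y_g_r_l as
select f.*,row_number() over(distribute by genres sort by rate desc)rn
from movie_y_g_r f order by f.genres,f.rate desc;
(4) Take the first row of each group
select mm.* from movie_y_g_r_l mm
where mm.rn < 2
order by mm.year;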
+----------+--------------+---------------------+--------+--------+
| mm.year  |  mm.genres   |       mm.rate       | mm.cc  | mm.rn  |
+----------+--------------+---------------------+--------+--------+
| 1927     | Comedy       | 4.368932038834951   | 206    | 1      |
| 1931     | Drama        | 4.387453874538745   | 271    | 1      |
| 1939     | Children's   | 4.182008368200837   | 1912   | 1      |
| 1941     | Film-Noir    | 4.395973154362416   | 1043   | 1      |
| 1942     | Romance      | 4.412822049131217   | 1669   | 1      |
| 1949     | Mystery      | 4.452083333333333   | 480    | 1      |
| 1949     | Thriller     | 4.452083333333333   | 480    | 1      |
| 1952     | Musical      | 4.2836218375499335  | 751    | 1      |
| 1961     | Western      | 4.404651162790698   | 215    | 1      |
| 1962     | Adventure    | 4.3997821350762525  | 918    | 1      |
| 1963     | Sci-Fi       | 4.334664005322688   | 1503   | 1      |
| 1963     | War          | 4.425109064469219   | 2063   | 1      |
| 1972     | Crime        | 4.4660907127429805  | 2315   | 1      |
| 1974     | Horror       | 4.021985343104597   | 1501   | 1      |
| 1977     | Fantasy      | 4.453694416583082   | 2991   | 1      |
| 1977     | Action       | 4.303571428571429   | 3584   | 1      |
| 1981     | Documentary  | 4.274193548387097   | 62     | 1      |
| 1993     | Animation    | 4.0367534456355285  | 1306   | 1      |
+----------+--------------+---------------------+--------+--------+
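
To see how the choice of partition column changes the answer, the sketch below picks the best row per year and, separately, the best row per genre from the same hypothetical (year, genre, rate) rows. Names and numbers are invented for illustration.

```python
def best_per_key(rows, key_index):
    """Keep, for each distinct value at key_index, the row with the highest rate."""
    best = {}
    for row in rows:
        key = row[key_index]
        if key not in best or row[2] > best[key][2]:
            best[key] = row
    return sorted(best.values())


rows = [
    ("1997", "Comedy", 3.5), ("1997", "Drama", 4.0),
    ("1998", "Comedy", 3.0), ("1998", "Drama", 4.5),
]

per_year = best_per_key(rows, key_index=0)    # partition by year
per_genre = best_per_key(rows, key_index=1)   # partition by genres

print(per_year)
print(per_genre)
```

Partitioning by year returns one row for every year, while partitioning by genre returns one row for every genre, so the two results generally differ in both size and content.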

10. For each region (zip code), find the highest-rated movie and store the result in HDFS (region, movie title, rating)

(1) Inner-join the ratings, users, and movies tables and create a view for later use
create view film_view2 as 
(select r.*,u.zcode,m.title,m.genres 
from ratings r
join users u on r.uid = u.uid
join movies m on r.mid = m.mid);
(2) Average rating by region and movie title
create view movie_z_r as
select m.zcode,m.title,avg(m.rating)rate,count(*)cc 
from film_view2 m
group by m.zcode,m.title having cc >= 5
order by m.zcode,rate desc;
(3) Add a row number
create view movie_z_r_l as
select f.*,row_number() over(distribute by zcode sort by rate desc)rn
from movie_z_r f 
order by f.zcode,f.rate desc;
(4) Take the top row per region
create view movie_z_r_l_m as
select * from movie_z_r_l
where rn < 2
order by zcode;
First rows of the result (select * from movie_z_r_l_m):
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| movie_z_r_l_m.zcode  |            movie_z_r_l_m.title             | movie_z_r_l_m.rate  | movie_z_r_l_m.cc  | movie_z_r_l_m.rn  |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| 01002                | Star Wars: Episode IV - A New Hope (1977)  | 4.4                 | 5                 | 1                 |
| 01060                | American Beauty (1999)                     | 4.8                 | 5                 | 1                 |
| 02115                | Shawshank Redemption, The (1994)           | 4.8                 | 5                 | 1                 |
| 02134                | Star Wars: Episode IV - A New Hope (1977)  | 4.6                 | 5                 | 1                 |
| 02135                | Princess Bride, The (1987)                 | 4.6                 | 5                 | 1                 |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
(5) Store the result in HDFS
insert overwrite directory '/movie/' select * from  movie_z_r_l_m;
