We have the following three data files:
1. users.dat
Record format: 2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code
2. movies.dat
Record format: 2::Jumanji (1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, movie genres
3. ratings.dat
Record format: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp
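-- show the current database in the CLI prompt
set hive.cli.print.current.db=true;
-- let Hive run small jobs in local mode automatically
set hive.exec.mode.local.auto=true;
-- disable strict-mode checks (strict mode rejects, for example, ORDER BY without LIMIT)
set hive.mapred.mode=nonstrict;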
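Movie-review project data:
Link: https://pan.baidu.com/s/1Aq28cvfaSgdPe_TmsFqPaQ Password: nlqo
Requirements:
Data requirements:
(1) Write a shell script to clean the data. (Hive's default delimited row format does not support multi-character field delimiters: it can split on ':' but not on '::'. A table created the ordinary way therefore will not work, and the data needs a simple cleaning pass first.)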
#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m || exit 1
# Replace the two-character delimiter "::" with ":" in every .dat file, in place.
for i in *.dat
do
echo "$i"
sed -i "s/::/:/g" "$i"
done
echo "have finished!"
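Alternatively:
#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m || exit 1
# grep -rl lists every file under ./ that contains "::"; sed then edits those files in place.
sed -i "s/::/:/g" $(grep -rl "::" ./)
echo "have finished!"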
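One caveat with replacing `::` by a single `:` is that some titles contain a colon of their own (for example `Star Wars: Episode IV - A New Hope (1977)`, which appears in the results further down), so a `:`-delimited table would split such titles. A minimal sketch of a safer variant rewrites `::` to a tab instead. It assumes that no field in the data contains a tab character, which is worth checking on the real files. The function name `clean_line` and the sample record are illustrative only.

```shell
#!/bin/bash
# Sketch: turn the two-character delimiter "::" into a single tab.
# Assumption: no field in the data contains a tab character.
tab=$(printf '\t')

clean_line() {
    # Reads records on stdin, writes tab-delimited records on stdout.
    sed "s/::/${tab}/g"
}

# Illustrative record whose title contains a colon of its own.
sample='260::Star Wars: Episode IV - A New Hope (1977)::Action|Adventure|Fantasy|Sci-Fi'
cleaned=$(printf '%s\n' "$sample" | clean_line)

# Field 2 (cut splits on tabs by default) is still the complete title.
title=$(printf '%s\n' "$cleaned" | cut -f2)
printf '%s\n' "$title"
```

With data cleaned this way, the tables would be declared with `fields terminated by '\t'` rather than `':'`.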
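(2) Alternatively, load the data in a form that Hive can parse directly.
Note: this is handled when creating the tables.
Hive requirements:
1. Create the tables correctly, load the data (three tables, three files), and verify the result.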
create table users(UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/users.dat' INTO TABLE users;
select * from users limit 10;
create table movies(MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/movies.dat' INTO TABLE movies;
select * from movies limit 10;
create table ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/ratings.dat' INTO TABLE ratings;
select * from ratings limit 10;
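To see how the `input.regex` patterns above carve a record into columns, the same pattern can be tried outside Hive. The sketch below applies the movies pattern with `sed -E`. `parse_movie` is a hypothetical helper and the sample record is illustrative. sed's regex dialect is not identical to the Java regex engine Hive uses, so treat this only as an approximation.

```shell
#!/bin/bash
# Sketch: apply the pattern (.*)::(.*)::(.*) to one record and print
# the three captured groups separated by tabs.
parse_movie() {
    tab=$(printf '\t')
    sed -E "s/^(.*)::(.*)::(.*)\$/\1${tab}\2${tab}\3/"
}

# The single colon inside the title is not mistaken for a delimiter,
# because the pattern only splits on the two-character sequence "::".
printf '%s\n' '260::Star Wars: Episode IV - A New Hope (1977)::Action|Adventure|Fantasy|Sci-Fi' | parse_movie
```

Because each record contains exactly two `::` separators, the greedy `(.*)` groups have only one way to match, so the title stays intact.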
2. Find the 10 movies with the most ratings, and give the number of ratings (movie title, number of ratings)
Analysis:
Tables: movies, ratings
Required fields: title, count(userid)
select a.title, count(b.userid) counts
from movies a join ratings b
on a.movieid = b.movieid
group by a.title,b.movieid
order by counts desc
limit 10
;
American Beauty (1999) 3428
Star Wars 2991
Star Wars 2990
Star Wars 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
Time taken: 125.833 seconds, Fetched: 10 row(s)
3. Find the 10 highest-rated movies among male users and among female users (gender, movie title, rating)
Analysis:
Tables: users, movies, ratings
Required fields: gender, title, avg(rating)
select a.gender,b.title,avg(c.rating) avgs,count(c.rating) counts
from ratings c
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
order by avgs desc
limit 10
;
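FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'gender'
This fails because gender is selected without being aggregated or listed in the GROUP BY clause. For reference, see:
https://blog.csdn.net/zhoujj303030/article/details/38424469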
select collect_set(a.gender),collect_set(b.title),avg(c.rating) avgs,count(c.rating) counts
from ratings c
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;
["F"] ["Close Shave, A (1995)"] 4.644444444444445 180
["F"] ["Wrong Trousers, The (1993)"] 4.588235294117647 238
["F"] ["Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)"] 4.572649572649572 117
["F"] ["Wallace & Gromit"] 4.563106796116505 103
["F"] ["Schindler's List (1993)"] 4.56260162601626 615
["F"] ["Shawshank Redemption, The (1994)"] 4.539074960127592 627
["F"] ["Grand Day Out, A (1992)"] 4.537878787878788 132
["F"] ["To Kill a Mockingbird (1962)"] 4.536666666666667 300
["F"] ["Creature Comforts (1990)"] 4.513888888888889 72
["F"] ["Usual Suspects, The (1995)"] 4.513317191283293 413
The gender and title columns come back as sets. Improved version, taking the first element of each set:
select collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs,count(c.rating) counts
from ratings c
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;
F Close Shave, A (1995) 4.644444444444445 180
F Wrong Trousers, The (1993) 4.588235294117647 238
F Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572649572649572 117
F Wallace & Gromit 4.563106796116505 103
F Schindler's List (1993) 4.56260162601626 615
F Shawshank Redemption, The (1994) 4.539074960127592 627
F Grand Day Out, A (1992) 4.537878787878788 132
F To Kill a Mockingbird (1962) 4.536666666666667 300
F Creature Comforts (1990) 4.513888888888889 72
F Usual Suspects, The (1995) 4.513317191283293 413
select collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs,count(c.rating) counts
from ratings c
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'M'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;
M Sanjuro (1962) 4.639344262295082 61
M Godfather, The (1972) 4.583333333333333 1740
M Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.576628352490421 522
M Shawshank Redemption, The (1994) 4.560625 1600
M Raiders of the Lost Ark (1981) 4.520597322348094 1942
M Usual Suspects, The (1995) 4.518248175182482 1370
M Star Wars 4.495307167235495 2344
M Schindler's List (1993) 4.49141503848431 1689
M Paths of Glory (1957) 4.485148514851486 202
M Wrong Trousers, The (1993) 4.478260869565218 644
Combine the two results:
select x.* from(
select collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs
from ratings c
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having count(c.rating) >= 60
order by avgs desc
limit 10)x
union all
select y.* from(
select collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs
from ratings c
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'M'
group by b.movieid,b.title
having count(c.rating) >= 60
order by avgs desc
limit 10)y;
M Sanjuro (1962) 4.639344262295082
M Godfather, The (1972) 4.583333333333333
M Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.576628352490421
M Shawshank Redemption, The (1994) 4.560625
M Raiders of the Lost Ark (1981) 4.520597322348094
M Usual Suspects, The (1995) 4.518248175182482
M Star Wars 4.495307167235495
M Schindler's List (1993) 4.49141503848431
M Paths of Glory (1957) 4.485148514851486
M Wrong Trousers, The (1993) 4.478260869565218
F Close Shave, A (1995) 4.644444444444445
F Wrong Trousers, The (1993) 4.588235294117647
F Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572649572649572
F Wallace & Gromit 4.563106796116505
F Schindler's List (1993) 4.56260162601626
F Shawshank Redemption, The (1994) 4.539074960127592
F Grand Day Out, A (1992) 4.537878787878788
F To Kill a Mockingbird (1962) 4.536666666666667
F Creature Comforts (1990) 4.513888888888889
F Usual Suspects, The (1995) 4.513317191283293
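4. For the movie with movieid = 2116, find the average rating in each age group (there are only 7 distinct age values, so simply group by them) (age group, rating)
Analysis:
Tables: users, ratings
Required fields: age, avg(rating)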
select u.age,avg(r.rating)
from ratings r
join users u on u.userid = r.userid
where r.movieid = 2116
group by u.age
order by u.age;
1 3.2941176470588234
18 3.3580246913580245
25 3.436548223350254
35 3.2278481012658227
45 2.8275862068965516
50 3.32
56 3.5
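5. For the woman who likes watching movies the most (the one with the most ratings), find the average rating of each of the 10 movies she rated highest (viewer, movie title, rating)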
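(1) Find the woman who likes watching movies the most (the one with the most ratings). The queries here use a joined view film_view with short column names such as uid and sex; its definition is not shown in this post.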
select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a;
+--------+
| a.uid |
+--------+
| 1150 |
+--------+
(2) Find the 10 movies that she rated highest
select u.uid,r.title,r.rating from film_view r
join
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10;
Rewritten as:
select a.uid,r.title,r.rating from film_view r
join
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a
on r.uid = a.uid
order by r.rating desc limit 10;
+--------+----------------------------------------------------+-----------+
| u.uid | r.title | r.rating |
+--------+----------------------------------------------------+-----------+
| 1150 | Close Shave, A (1995) | 5.0 |
| 1150 | Night on Earth (1991) | 5.0 |
| 1150 | Trust (1990) | 5.0 |
| 1150 | Rear Window (1954) | 5.0 |
| 1150 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 5.0 |
| 1150 | Being John Malkovich (1999) | 5.0 |
| 1150 | Roger & Me (1989) | 5.0 |
| 1150 | It Happened One Night (1934) | 5.0 |
| 1150 | Crying Game, The (1992) | 5.0 |
| 1150 | Duck Soup (1933) | 5.0 |
+--------+----------------------------------------------------+-----------+
(3) Find the average rating of those 10 movies (viewer, movie title, rating)
-- Large table joined to small table, elapsed time: 188 s
select aa.uid,bb.* from
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
join
(select u.uid,r.title,r.rating from film_view r
join
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
on aa.title = bb.title;
+---------+----------------------------------------------------+---------------------+
| aa.uid | bb.title | bb.avgrate |
+---------+----------------------------------------------------+---------------------+
| 1150 | Being John Malkovich (1999) | 4.125390450691656 |
| 1150 | Close Shave, A (1995) | 4.52054794520548 |
| 1150 | Crying Game, The (1992) | 3.7314890154597236 |
| 1150 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 4.4498902706656915 |
| 1150 | Duck Soup (1933) | 4.21043771043771 |
| 1150 | It Happened One Night (1934) | 4.280748663101604 |
| 1150 | Night on Earth (1991) | 3.747422680412371 |
| 1150 | Rear Window (1954) | 4.476190476190476 |
| 1150 | Roger & Me (1989) | 4.0739348370927315 |
| 1150 | Trust (1990) | 4.188888888888889 |
+---------+----------------------------------------------------+---------------------+
-- Small table joined to large table, elapsed time: 236 s, same result
select aa.uid,bb.* from
(select u.uid,r.title,r.rating from film_view r
join
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
join
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
on aa.title = bb.title;
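6. Find the 10 best movies of the year that has the most good movies (rating >= 4.0)
(1) Extract the release year; the last 6 characters of the title hold the year in parentheses, e.g. (1995)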
select mid,substring(title,-5,4) from movies limit 5;
+------+-------+
| mid | _c1 |
+------+-------+
| 1 | 1995 |
| 2 | 1995 |
| 3 | 1995 |
| 4 | 1995 |
| 5 | 1995 |
+------+-------+
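To see why `substring(title,-5,4)` yields the year: a negative start position counts from the end of the string, so position -5 is the first digit after the opening parenthesis, and 4 characters from there cover the year. The same arithmetic can be sketched in shell with awk's `substr`. `year_of` is a hypothetical helper, and the approach assumes every title ends in `(YYYY)` with nothing after the closing parenthesis.

```shell
#!/bin/bash
# Sketch: mimic Hive's substring(title, -5, 4) for titles ending in "(YYYY)".
year_of() {
    # Start 5 characters from the end of the line and take 4 characters.
    printf '%s\n' "$1" | awk '{ print substr($0, length($0) - 4, 4) }'
}

year_of 'Jumanji (1995)'
year_of "One Flew Over the Cuckoo's Nest (1975)"
```

A title with trailing whitespace or a missing year would produce a wrong value, so a sanity check such as filtering on four digits may be worthwhile on the real data.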
(2) Join movies and ratings
create view moive_6_v as
select r.rating,m.* from ratings r
join
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid;
select * from moive_6_v limit 5;
+-----------+--------+-----------------------------------------+---------+
| r.rating | m.mid | m.title | m.year |
+-----------+--------+-----------------------------------------+---------+
| 5.0 | 1193 | One Flew Over the Cuckoo's Nest (1975) | 1975 |
| 3.0 | 661 | James and the Giant Peach (1996) | 1996 |
| 3.0 | 914 | My Fair Lady (1964) | 1964 |
| 4.0 | 3408 | Erin Brockovich (2000) | 2000 |
| 5.0 | 2355 | Bug's Life, A (1998) | 1998 |
+-----------+--------+-----------------------------------------+---------+
(3) Find the year with the most movies whose average rating is at least 4
create view moive_6_v_a as
select f.year,f.title,avg(f.rating) avgr from moive_6_v f
group by f.year,f.title;
select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year
order by n desc
limit 5;
+---------+-----+
| m.year | n |
+---------+-----+
| 1998 | 27 |
| 1995 | 25 |
| 1996 | 24 |
| 1999 | 20 |
| 1994 | 20 |
+---------+-----+
(4) Find the 10 best movies of that year
select rr.title,rr.year,rr.avgrate,rr.cc from
(select mm.title,mm.year,avg(rating)avgrate,count(*)cc
from
(select r.rating,m.* from ratings r
join
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid)mm
group by mm.year,mm.title having cc >=50
order by avgrate desc)rr
join
(select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year
order by n desc
limit 1)yy
on rr.year = yy.year
order by rr.avgrate desc
limit 10;
+---------------------------------------------+----------+---------------------+--------+
| rr.title | rr.year | rr.avgrate | rr.cc |
+---------------------------------------------+----------+---------------------+--------+
| Saving Private Ryan (1998) | 1998 | 4.337353938937053 | 2653 |
| Celebration, The (Festen) (1998) | 1998 | 4.3076923076923075 | 117 |
| Central Station (Central do Brasil) (1998) | 1998 | 4.283720930232558 | 215 |
| 42 Up (1998) | 1998 | 4.2272727272727275 | 88 |
| American History X (1998) | 1998 | 4.2265625 | 640 |
| Run Lola Run (Lola rennt) (1998) | 1998 | 4.224813432835821 | 1072 |
| Shakespeare in Love (1998) | 1998 | 4.127479949345715 | 2369 |
| After Life (1998) | 1998 | 4.088235294117647 | 102 |
| Get Real (1998) | 1998 | 4.088235294117647 | 68 |
| Elizabeth (1998) | 1998 | 4.029850746268656 | 938 |
+---------------------------------------------+----------+---------------------+--------+
7. Among the movies released in 1997, find the 10 highest-rated Comedy movies
(1) Find the movies released in 1997
select title,rating,genres from film_view
where substring(title,-5,4)=1997
limit 10;
(2) Find the Comedy movies released in 1997
select title,rating,genres from film_view
where substring(title,-5,4)=1997 and
(lcase(genres) like '%comedy%')
limit 10;
+---------------------------------------+---------+------------------------------------------------+
| title | rating | genres |
+---------------------------------------+---------+------------------------------------------------+
| Hercules (1997) | 4.0 | Adventure|Animation|Children's|Comedy|Musical |
| As Good As It Gets (1997) | 5.0 | Comedy|Drama |
| Full Monty, The (1997) | 2.0 | Comedy |
| Beverly Hills Ninja (1997) | 3.0 | Action|Comedy |
| Men in Black (1997) | 3.0 | Action|Adventure|Comedy|Sci-Fi |
| Liar Liar (1997) | 3.0 | Comedy |
| Love and Death on Long Island (1997) | 3.0 | Comedy|Drama |
| Grosse Pointe Blank (1997) | 3.0 | Comedy|Crime |
| Men in Black (1997) | 4.0 | Action|Adventure|Comedy|Sci-Fi |
| Billy's Hollywood Screen Kiss (1997) | 4.0 | Comedy|Romance |
+---------------------------------------+---------+------------------------------------------------+
(3) Take the 10 highest-rated
select mm.* ,f.genres from
(select m.title,avg(m.rating)avgrate,count(*)cc from
(select title,rating,genres from film_view
where substring(title,-5,4)=1997 and
(lcase(genres) like '%comedy%'))m
group by m.title having cc >= 50
order by avgrate desc
limit 10)mm
join movies f
on mm.title = f.title;
+----------------------------------------------------+---------------------+--------+---------------------------------+
| mm.title | mm.avgrate | mm.cc | f.genres |
+----------------------------------------------------+---------------------+--------+---------------------------------+
| Life Is Beautiful (La Vita è bella) (1997) | 4.329861111111111 | 1152 | Comedy|Drama |
| Big One, The (1997) | 4.0 | 102 | Comedy|Documentary |
| As Good As It Gets (1997) | 3.9501404494382024 | 1424 | Comedy|Drama |
| Full Monty, The (1997) | 3.872393661384487 | 1199 | Comedy |
| My Life in Pink (Ma vie en rose) (1997) | 3.825870646766169 | 201 | Comedy|Drama |
| Grosse Pointe Blank (1997) | 3.813380281690141 | 1136 | Comedy|Crime |
| Men in Black (1997) | 3.739952718676123 | 2538 | Action|Adventure|Comedy|Sci-Fi |
| Austin Powers: International Man of Mystery (1997) | 3.7103734439834026 | 1205 | Comedy |
| Billy's Hollywood Screen Kiss (1997) | 3.6710526315789473 | 76 | Comedy|Romance |
| Liar Liar (1997) | 3.5 | 666 | Comedy |
+----------------------------------------------------+---------------------+--------+---------------------------------+
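8. Find the 5 highest-rated movies in each genre in this rating database (genre, movie title, average rating)
The tricky part: taking the top 5 within each genre
(1) Explode the movie genres
Create a new movies table: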
create table newmovies(mid int, title string,genres array<string>)row format
delimited fields terminated by '\t' collection items terminated by ','stored as textfile;
Insert the data:
insert into table newmovies select mid,title,split(genres,'\\|') from movies;
Explode:
create table nnmovies(mid int, title string, genres string)row
format delimited fields terminated by '\t';
insert into table nnmovies select mid, title, tpf.key from newmovies t
lateral view explode(t.genres) tpf as key;
(Exploding a map: select id,name, tpf.mykey as key, tpf.myvalue as value
from cdt t lateral view explode(t.piaofang) tpf as mykey, myvalue;)
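What `lateral view explode` does to each row can be sketched outside Hive: one input row with N genres becomes N output rows that repeat the other columns. The awk snippet below is only an illustration on a sample record in the raw movies.dat format; the function name `explode_genres` is hypothetical.

```shell
#!/bin/bash
# Sketch: emulate "lateral view explode(genres)" on raw movies.dat records.
explode_genres() {
    # Input:  mid::title::genre1|genre2|...
    # Output: one tab-separated line (mid, title, genre) per genre.
    awk -F'::' '{
        n = split($3, g, "|")
        for (i = 1; i <= n; i++) print $1 "\t" $2 "\t" g[i]
    }'
}

printf '%s\n' "2::Jumanji (1995)::Adventure|Children's|Fantasy" | explode_genres
```

Each movie then contributes to the average of every genre it belongs to, which is why the same title can appear under several genres in the results.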
(2) Join into a view
create view film_view3 as
(select r.*,m.title,m.genres
from ratings r
join nnmovies m on r.mid = m.mid);
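(3) The 5 highest-rated movies in each genre (genre, movie title, average rating)
<1> Create a view holding each movie's average rating, grouped by genre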
create view movie_rate as select a.mid,a.title,a.genres,avg(rating)rate
from film_view3 a group by a.genres,a.mid,a.title;
<2> Use the row_number function to number the rows within each genre
create view movie_rate_order as
select t.*,row_number() over (distribute by genres sort by rate desc) rn
from movie_rate t order by t.genres,t.rate desc;
<3> Use the per-group row number to take the top 5 (only 10 result rows are shown)
select m.* from movie_rate_order m where rn <6
order by m.genres,m.rate desc limit 10;
+--------+----------------------------------------------------+------------+--------------------+-------+
| m.mid | m.title | m.genres | m.rate | m.rn |
+--------+----------------------------------------------------+------------+--------------------+-------+
| 2905 | Sanjuro (1962) | Action | 4.608695652173913 | 1 |
| 2019 | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | Action | 4.560509554140127 | 2 |
| 858 | Godfather, The (1972) | Action | 4.524966261808367 | 3 |
| 1198 | Raiders of the Lost Ark (1981) | Action | 4.477724741447892 | 4 |
| 260 | Star Wars: Episode IV - A New Hope (1977) | Action | 4.453694416583082 | 5 |
| 3172 | Ulysses (Ulisse) (1954) | Adventure | 5.0 | 1 |
| 2905 | Sanjuro (1962) | Adventure | 4.608695652173913 | 2 |
| 1198 | Raiders of the Lost Ark (1981) | Adventure | 4.477724741447892 | 3 |
| 260 | Star Wars: Episode IV - A New Hope (1977) | Adventure | 4.453694416583082 | 4 |
| 1204 | Lawrence of Arabia (1962) | Adventure | 4.401925391095066 | 5 |
+--------+----------------------------------------------------+------------+--------------------+-------+
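The row_number pattern used above (number the rows within each group by descending score, then keep the rows whose number is at most N) can be sketched on a few lines of text. This is a rough illustration rather than what Hive executes; `top_n_per_group` is a hypothetical helper, and the sample rows are copied from the result table above.

```shell
#!/bin/bash
# Sketch: top-N per group, mirroring row_number() over (partition by group order by score desc).
top_n_per_group() {
    # $1 = N. Input lines on stdin: group<TAB>name<TAB>score
    # Sort by group, then by score descending, then number the rows within each group.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k3,3nr |
    awk -F'\t' -v n="$1" '{
        if ($1 != prev) { rn = 0; prev = $1 }   # new group: restart numbering
        rn++
        if (rn <= n) print $0 "\t" rn           # keep only the first n rows per group
    }'
}

printf '%s\t%s\t%s\n' \
    'Action' 'Godfather, The (1972)' '4.524966261808367' \
    'Adventure' 'Sanjuro (1962)' '4.608695652173913' \
    'Action' 'Sanjuro (1962)' '4.608695652173913' \
    'Adventure' 'Ulysses (Ulisse) (1954)' '5.0' \
    'Action' 'Raiders of the Lost Ark (1981)' '4.477724741447892' |
top_n_per_group 2
```

Note that the averages here are not filtered by rating count, so a movie with very few ratings can top a genre; a `having count(*) >= ...` threshold, as used in other tasks in this post, would guard against that.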
9. Find the highest-rated movie genre of each year (year, genre, rating)
(1) Create a view with year and genre
create view movie_y_g as
(select r.*,m.title,m.genres,substring(m.title,-5,4)year
from ratings r
join nnmovies m on r.mid = m.mid);
(2) Create the rating view
create view movie_y_g_r as
select m.year,m.genres,avg(m.rating)rate,count(*)cc from movie_y_g m
group by m.year,m.genres having cc >= 50
order by m.year,rate desc;
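(3) Add a row_number to the year/genre rows. Note that the window below partitions by genres, so the final result gives, for each genre, the year in which it was rated highest; to rank genres within each year as the task states, partition by year instead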
create view movie_y_g_r_l as
select f.*,row_number() over(distribute by genres sort by rate desc)rn
from movie_y_g_r f order by f.genres,f.rate desc;
(4) Take the first row of each group
select mm.* from movie_y_g_r_l mm
where mm.rn < 2
order by mm.year;
+----------+--------------+---------------------+--------+--------+
| mm.year | mm.genres | mm.rate | mm.cc | mm.rn |
+----------+--------------+---------------------+--------+--------+
| 1927 | Comedy | 4.368932038834951 | 206 | 1 |
| 1931 | Drama | 4.387453874538745 | 271 | 1 |
| 1939 | Children's | 4.182008368200837 | 1912 | 1 |
| 1941 | Film-Noir | 4.395973154362416 | 1043 | 1 |
| 1942 | Romance | 4.412822049131217 | 1669 | 1 |
| 1949 | Mystery | 4.452083333333333 | 480 | 1 |
| 1949 | Thriller | 4.452083333333333 | 480 | 1 |
| 1952 | Musical | 4.2836218375499335 | 751 | 1 |
| 1961 | Western | 4.404651162790698 | 215 | 1 |
| 1962 | Adventure | 4.3997821350762525 | 918 | 1 |
| 1963 | Sci-Fi | 4.334664005322688 | 1503 | 1 |
| 1963 | War | 4.425109064469219 | 2063 | 1 |
| 1972 | Crime | 4.4660907127429805 | 2315 | 1 |
| 1974 | Horror | 4.021985343104597 | 1501 | 1 |
| 1977 | Fantasy | 4.453694416583082 | 2991 | 1 |
| 1977 | Action | 4.303571428571429 | 3584 | 1 |
| 1981 | Documentary | 4.274193548387097 | 62 | 1 |
| 1993 | Animation | 4.0367534456355285 | 1306 | 1 |
+----------+--------------+---------------------+--------+--------+
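10. Find the highest-rated movie title in each region (zip code) and store the result in HDFS (region, movie title, rating)
(1) Inner-join the ratings, users and movies tables into a view for later use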
create view film_view2 as
(select r.*,u.zcode,m.title,m.genres
from ratings r
join users u on r.uid = u.uid
join movies m on r.mid = m.mid);
(2) Compute the average rating by region and movie title
create view movie_z_r as
select m.zcode,m.title,avg(m.rating)rate,count(*)cc
from film_view2 m
group by m.zcode,m.title having cc >= 5
order by m.zcode,rate desc;
(3) Add row numbers
create view movie_z_r_l as
select f.*,row_number() over(distribute by zcode sort by rate desc)rn
from movie_z_r f
order by f.zcode,f.rate desc;
(4) Take the highest value
create view movie_z_r_l_m as
select * from movie_z_r_l
where rn < 2
order by zcode;
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| movie_z_r_l_m.zcode | movie_z_r_l_m.title | movie_z_r_l_m.rate | movie_z_r_l_m.cc | movie_z_r_l_m.rn |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| 01002 | Star Wars: Episode IV - A New Hope (1977) | 4.4 | 5 | 1 |
| 01060 | American Beauty (1999) | 4.8 | 5 | 1 |
| 02115 | Shawshank Redemption, The (1994) | 4.8 | 5 | 1 |
| 02134 | Star Wars: Episode IV - A New Hope (1977) | 4.6 | 5 | 1 |
| 02135 | Princess Bride, The (1987) | 4.6 | 5 | 1 |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
(5) Write the result to HDFS
insert overwrite directory '/movie/' select * from movie_z_r_l_m;