The MovieLens 1M dataset

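MovieLens helps you learn more about movies through rich data, images, and trailers. The 1M dataset used here contains movie ratings, user demographics, and movie metadata.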

!git clone https://github.com/wesm/pydata-book

!tree pydata-book/datasets/movielens

0 Importing the libraries

# Basics
import numpy as np # array handling
import pandas as pd # reading data and DataFrame operations
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical plots
from matplotlib import rcParams # global plot settings
from matplotlib.cm import rainbow # colormap

%matplotlib inline
import warnings
warnings.filterwarnings('ignore') # suppress warning messages
np.set_printoptions(precision=4) # digits after the decimal point
pd.options.display.max_rows = 10 # maximum number of rows displayed

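1 Reading the text files (distributed as a ZIP archive)

User information
Demographic data (age, occupation, ZIP code)
Records for about 6,000 users

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('pydata-book/datasets/movielens/users.dat', sep='::', header=None, names=unames, engine='python')
users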
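
Because `::` is longer than one character, pandas interprets it as a regular expression, which only the Python parsing engine supports. The sketch below shows the same call on two made-up rows in the users.dat layout (the rows and the name `demo_users` are illustrative, not taken from the dataset), and it also reads the ZIP code as a string so that leading zeros are kept.

```python
import io
import pandas as pd

# Two made-up rows in the users.dat layout (not real MovieLens records).
sample = io.StringIO("1::F::1::10::00000\n2::M::56::16::11111\n")

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is treated as a regular expression, so the
# Python engine is requested explicitly instead of relying on a fallback.
# dtype keeps the ZIP code as text so that leading zeros survive.
demo_users = pd.read_csv(sample, sep='::', header=None, names=unames,
                         engine='python', dtype={'zip': str})
print(demo_users)
```

Passing `engine='python'` explicitly also avoids the parser warning that pandas would otherwise emit when it falls back from the C engine.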

User information is in the file users.dat and is in the following format:

UserID::Gender::Age::Occupation::Zip-code

Gender is denoted by a “M” for male and “F” for female
Age is chosen from the following ranges:
- 1: “Under 18”
- 18: “18-24”
- 25: “25-34”
- 35: “35-44”
- 45: “45-49”
- 50: “50-55”
- 56: “56+”
Occupation is chosen from the following choices:
- 0: “other” or not specified
- 1: “academic/educator”
- 2: “artist”
- 3: “clerical/admin”
- 4: “college/grad student”
- 5: “customer service”
- 6: “doctor/health care”
- 7: “executive/managerial”
- 8: “farmer”
- 9: “homemaker”
- 10: “K-12 student”
- 11: “lawyer”
- 12: “programmer”
- 13: “retired”
- 14: “sales/marketing”
- 15: “scientist”
- 16: “self-employed”
- 17: “technician/engineer”
- 18: “tradesman/craftsman”
- 19: “unemployed”
- 20: “writer”
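
The age and occupation columns hold codes rather than readable labels. A minimal sketch of decoding them with `Series.map`, using the code tables listed above on a small made-up frame (the frame `demo` and the new column names are assumptions for illustration):

```python
import pandas as pd

# Code tables copied from the README excerpt above.
age_labels = {1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
              45: '45-49', 50: '50-55', 56: '56+'}
occupation_labels = {
    0: 'other or not specified', 1: 'academic/educator', 2: 'artist',
    3: 'clerical/admin', 4: 'college/grad student', 5: 'customer service',
    6: 'doctor/health care', 7: 'executive/managerial', 8: 'farmer',
    9: 'homemaker', 10: 'K-12 student', 11: 'lawyer', 12: 'programmer',
    13: 'retired', 14: 'sales/marketing', 15: 'scientist',
    16: 'self-employed', 17: 'technician/engineer',
    18: 'tradesman/craftsman', 19: 'unemployed', 20: 'writer',
}

# Made-up users, only to show the mapping step.
demo = pd.DataFrame({'user_id': [1, 2, 3],
                     'age': [1, 56, 25],
                     'occupation': [10, 16, 15]})

# map() looks each code up in the dictionary; unknown codes become NaN.
demo['age_group'] = demo['age'].map(age_labels)
demo['occupation_name'] = demo['occupation'].map(occupation_labels)
print(demo)
```

The same two `map` calls should work on the full users table, assuming its columns are named `age` and `occupation` as in the loading code.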

Movie ratings
About 1 million ratings

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('pydata-book/datasets/movielens/ratings.dat', sep='::', header=None, names=rnames, engine='python')
ratings

All ratings are contained in the file ratings.dat and are in the following format:

UserID::MovieID::Rating::Timestamp

UserIDs range between 1 and 6040
MovieIDs range between 1 and 3952
Ratings are made on a 5-star scale (whole-star ratings only)
Timestamp is represented in seconds since the epoch as returned by time(2)
Each user has at least 20 ratings
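
Since the timestamp is stored as seconds since the epoch, it can be turned into a readable datetime with `pd.to_datetime(..., unit='s')`. A small sketch on made-up ratings (the rows and the column names `rated_at` and `year` are illustrative assumptions):

```python
import pandas as pd

# Made-up ratings; 978307200 seconds after the epoch is 2001-01-01 00:00:00 UTC.
demo_ratings = pd.DataFrame({
    'user_id': [1, 1],
    'movie_id': [10, 20],
    'rating': [5, 3],
    'timestamp': [978307200, 978307200 + 86400],
})

# unit='s' tells pandas the integers are seconds, not nanoseconds.
demo_ratings['rated_at'] = pd.to_datetime(demo_ratings['timestamp'], unit='s')
demo_ratings['year'] = demo_ratings['rated_at'].dt.year

# Sanity check matching the README: whole-star ratings from 1 to 5.
assert demo_ratings['rating'].between(1, 5).all()
print(demo_ratings)
```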

Movie metadata (genres and release year)
Information on about 4,000 movies

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('pydata-book/datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')
movies

Movie information is in the file movies.dat and is in the following format:

MovieID::Title::Genres

Titles are identical to titles provided by the IMDB (including year of release)
Genres are pipe-separated and are selected from the following genres:
- Action
- Adventure
- Animation
- Children’s
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
Movies are mostly entered by hand, so errors and inconsistencies may exist
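
Because each title carries its release year in parentheses, the year can be pulled into its own column with a regular expression. The sketch below assumes the year is always the last parenthesized four-digit group in the title; the two rows and the column names `year` and `clean_title` are illustrative.

```python
import pandas as pd

# Two rows in the movies.dat layout, used only for illustration.
demo_movies = pd.DataFrame({
    'movie_id': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
    'genres': ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Capture a four-digit group in parentheses at the end of the title.
year_text = demo_movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)

# errors='coerce' turns titles without a recognizable year into NaN
# instead of raising, since hand-entered titles may be inconsistent.
demo_movies['year'] = pd.to_numeric(year_text, errors='coerce')

# Title without the trailing "(year)" part.
demo_movies['clean_title'] = (demo_movies['title']
                              .str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True))
print(demo_movies[['clean_title', 'year']])
```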

2 Data preprocessing

Merge ((ratings, users), movies)
pd.merge

data = pd.merge(pd.merge(ratings, users), movies)
data
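
When no key is given, `pd.merge` joins on the column names the two frames share (`user_id` for ratings and users, `movie_id` for the result and movies). A sketch of the same two-step merge with the keys spelled out, on made-up frames (all data and the names `r`, `u`, `m`, `merged` are assumptions for illustration):

```python
import pandas as pd

# Tiny made-up stand-ins for ratings, users and movies.
r = pd.DataFrame({'user_id': [1, 2, 2],
                  'movie_id': [10, 10, 20],
                  'rating': [5, 4, 3]})
u = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
m = pd.DataFrame({'movie_id': [10, 20],
                  'title': ['Film A (1990)', 'Film B (1995)']})

# Naming the key makes the join explicit; validate='many_to_one' raises
# if a user or movie id is unexpectedly duplicated on the right side.
merged = (r.merge(u, on='user_id', how='inner', validate='many_to_one')
           .merge(m, on='movie_id', how='inner', validate='many_to_one'))
print(merged)
```

Spelling out `on=` is optional here, but it guards against an accidental join on some other column that happens to share a name.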

Mean rating of each movie, by gender
pd.DataFrame.pivot_table # turns the values of one column into column labels

mean_ratings = data.pivot_table('rating', index='title', 
                                columns='gender', aggfunc='mean')
mean_ratings

Cell values are the mean rating of each movie
Row labels (the index) are movie titles
Column labels are genders
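
To make the reshaping concrete, here is the same `pivot_table` call on five made-up ratings (the frame `toy` and its values are illustrative). A title with no ratings from one gender gets NaN in that column.

```python
import pandas as pd

# Made-up ratings: title A is rated by both genders, title B only by men.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, mean rating in each cell.
toy_means = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')
print(toy_means)
```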

Filter out movies with fewer than 300 ratings
pd.DataFrame.groupby
DataFrameGroupBy.size

ratings_by_title = data.groupby('title').size()
ratings_by_title

pd.Series.index

active_titles = ratings_by_title.index[ratings_by_title >= 300]
active_titles

mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
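
The filter works in two steps: a boolean mask over the per-title counts selects index labels, and `.loc` then keeps only those rows. A toy version with a threshold of 3 instead of 300 (the data and the names `counts`, `popular`, `filtered` are assumptions for illustration):

```python
import pandas as pd

# Made-up ratings: title A has three ratings, title B only one.
toy = pd.DataFrame({'title': ['A', 'A', 'A', 'B'],
                    'rating': [5, 4, 3, 2]})

# Number of ratings per title.
counts = toy.groupby('title').size()

# Boolean mask applied to the index keeps only well-rated titles.
popular = counts.index[counts >= 3]

# Select the matching rows from a per-title statistic.
means = toy.groupby('title')['rating'].mean()
filtered = means.loc[popular]
print(filtered)
```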

Highest-rated movies among female and male viewers (top 10)
pd.DataFrame.sort_values

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings

top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
top_male_ratings

top_female_ratings10 = top_female_ratings['F'][:10] 
top_female_ratings10

top_male_ratings10 = top_male_ratings['M'][:10] 
top_male_ratings10

rcParams['font.size'] = 16

sns.barplot(x=top_female_ratings10.values, y=top_female_ratings10.index)
plt.xlabel('rating')
plt.title('top_female_ratings10', {'fontsize': rcParams['axes.titlesize']})

sns.barplot(x=top_male_ratings10.values, y=top_male_ratings10.index)

a = set(top_female_ratings10.index) # women's top 10
a

b = set(top_male_ratings10.index) # men's top 10
b

a & b # intersection

pd.DataFrame.loc

mean_ratings.loc[list(a & b)]

intersection = mean_ratings.loc[list(a & b)]
intersection.describe()

plt.legend

intersection.plot.barh()
plt.legend(prop={'size':8}, loc='center') # legend

Measuring the rating difference between men and women

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
mean_ratings

pd.DataFrame.sort_values

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff

Movies preferred by men
Reverse the sorted result and take the first 10 rows

sorted_by_diff[::-1][:10]
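
Since `diff` is the male mean minus the female mean, large positive values mark movies men rated higher and large negative values mark movies women rated higher. A sketch of both directions on made-up means (the frame `toy_means` and the result names are illustrative assumptions):

```python
import pandas as pd

# Made-up per-gender mean ratings for three titles.
toy_means = pd.DataFrame({'F': [4.5, 3.0, 4.0],
                          'M': [3.5, 4.2, 4.0]},
                         index=['A', 'B', 'C'])

# Positive diff: men rated the title higher; negative: women did.
toy_means['diff'] = toy_means['M'] - toy_means['F']

# Sorting in descending order is equivalent to reversing an ascending sort.
preferred_by_men = toy_means.sort_values('diff', ascending=False).head(2)

# The first rows of the ascending sort are the titles women preferred.
preferred_by_women = toy_means.sort_values('diff').head(2)
print(preferred_by_men)
print(preferred_by_women)
```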

Disagreement among viewers, regardless of gender

Standard deviation of the ratings for each movie

rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title

Keep only the active titles

rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title

rating_std_by_title.sort_values(ascending=False)[:10]
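
A high standard deviation means viewers disagree strongly about a movie, while a value near zero means nearly everyone gave the same score. The toy example below illustrates this; note that pandas computes the sample standard deviation (`ddof=1`) by default. The data and the names `spread` and `ranked` are made up for illustration.

```python
import pandas as pd

# Title A splits its audience (1 and 5 stars); title B gets uniform scores.
toy = pd.DataFrame({'title': ['A', 'A', 'B', 'B'],
                    'rating': [1, 5, 3, 3]})

# Sample standard deviation (ddof=1) of the ratings for each title.
spread = toy.groupby('title')['rating'].std()

# Most divisive titles first.
ranked = spread.sort_values(ascending=False)
print(ranked)
```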

data.groupby('genres').size()
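
Grouping on the raw `genres` column counts each pipe-separated combination (for example `Action|Comedy`) as its own group. If counts per individual genre are wanted instead, one option is to split the string and use `explode`, sketched here on made-up rows (the frame `toy` and the name `per_genre` are assumptions for illustration):

```python
import pandas as pd

# Made-up rows with pipe-separated genres, as in movies.dat.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ['Action|Comedy', 'Comedy', 'Action|Comedy|Drama'],
})

# Split each string into a list, then give every genre its own row.
exploded = toy.assign(genre=toy['genres'].str.split('|')).explode('genre')

# Count rows per individual genre rather than per combination.
per_genre = exploded.groupby('genre').size()
print(per_genre)
```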

喝醉酒的小白
