MovieLens 1M Dataset
Learn more about movies with rich data, images, and trailers.
!git clone https://github.com/wesm/pydata-book
!tree pydata-book/datasets/movielens
0 Import libraries
# Basics
import numpy as np # array handling
import pandas as pd # data loading and DataFrames
import matplotlib.pyplot as plt # plotting
import seaborn as sns
from matplotlib import rcParams # plot configuration parameters
from matplotlib.cm import rainbow # colormap
%matplotlib inline
import warnings
warnings.filterwarnings('ignore') # suppress warning messages
np.set_printoptions(precision=4) # digits after the decimal point
pd.options.display.max_rows = 10 # maximum rows displayed
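1 Read the text files (ZIP format)
User information
Demographic data (age, occupation, zip code)
Information on roughly 6,000 users
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A multi-character separator such as '::' requires the Python parsing engine
users = pd.read_table('pydata-book/datasets/movielens/users.dat', sep='::', header=None, names=unames, engine='python')
users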
User information is in the file users.dat
and is in the following format:
UserID::Gender::Age::Occupation::Zip-code
- Gender is denoted by a “M” for male and “F” for female
- Age is chosen from the following ranges:
  - 1: “Under 18”
  - 18: “18-24”
  - 25: “25-34”
  - 35: “35-44”
  - 45: “45-49”
  - 50: “50-55”
  - 56: “56+”
- Occupation is chosen from the following choices:
  - 0: “other” or not specified
  - 1: “academic/educator”
  - 2: “artist”
  - 3: “clerical/admin”
  - 4: “college/grad student”
  - 5: “customer service”
  - 6: “doctor/health care”
  - 7: “executive/managerial”
  - 8: “farmer”
  - 9: “homemaker”
  - 10: “K-12 student”
  - 11: “lawyer”
  - 12: “programmer”
  - 13: “retired”
  - 14: “sales/marketing”
  - 15: “scientist”
  - 16: “self-employed”
  - 17: “technician/engineer”
  - 18: “tradesman/craftsman”
  - 19: “unemployed”
  - 20: “writer”
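The age column stores a numeric code rather than a readable label. A minimal sketch of decoding it, assuming a small hand-made frame in place of users.dat; the names `age_labels`, `sample_users`, and `age_group` are illustrative and not part of the dataset:

```python
import pandas as pd

# Age codes and labels copied from the README excerpt above.
age_labels = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Hypothetical stand-in rows; the real frame is read from users.dat.
sample_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["F", "M", "M"],
    "age": [1, 56, 25],
})

# Series.map looks each code up in the dictionary.
sample_users["age_group"] = sample_users["age"].map(age_labels)
print(sample_users)
```

The same pattern would apply to the occupation codes with a second dictionary.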
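Movie ratings
1 million rating records
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
# A multi-character separator such as '::' requires the Python parsing engine
ratings = pd.read_table('pydata-book/datasets/movielens/ratings.dat', sep='::', header=None, names=rnames, engine='python')
ratings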
All ratings are contained in the file ratings.dat
and are in the following format:
UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
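Because the timestamp is stored as seconds since the epoch, it can be converted to a datetime before any time-based analysis. A small sketch, assuming hand-made sample rows rather than the real ratings.dat; `sample_ratings` and `rated_at` are illustrative names:

```python
import pandas as pd

# Hypothetical stand-in rows in the UserID::MovieID::Rating::Timestamp layout.
sample_ratings = pd.DataFrame({
    "user_id": [1, 1],
    "movie_id": [1193, 661],
    "rating": [5, 3],
    "timestamp": [978300760, 978302109],
})

# unit="s" tells pandas the integers are seconds since 1970-01-01 (UTC).
sample_ratings["rated_at"] = pd.to_datetime(sample_ratings["timestamp"], unit="s")
print(sample_ratings[["timestamp", "rated_at"]])
```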
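Movie metadata (genres and release year)
Information on roughly 4,000 movies
mnames = ['movie_id', 'title', 'genres']
# A multi-character separator such as '::' requires the Python parsing engine
movies = pd.read_table('pydata-book/datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')
movies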
Movie information is in the file movies.dat
and is in the following format:
MovieID::Title::Genres
- Titles are identical to titles provided by the IMDB (including year of release)
- Genres are pipe-separated and are selected from the following genres:
  - Action
  - Adventure
  - Animation
  - Children’s
  - Comedy
  - Crime
  - Documentary
  - Drama
  - Fantasy
  - Film-Noir
  - Horror
  - Musical
  - Mystery
  - Romance
  - Sci-Fi
  - Thriller
  - War
  - Western
- Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist
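Since the title carries the release year and the genres are pipe-separated, both can be pulled into their own columns. A minimal sketch under the assumption that every title ends with a four-digit year in parentheses; `sample_movies`, `year`, and `genre_list` are illustrative names, and the rows are hand-made:

```python
import pandas as pd

# Hypothetical stand-in rows in the MovieID::Title::Genres layout.
sample_movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Capture the four-digit year inside the trailing parentheses.
sample_movies["year"] = (
    sample_movies["title"]
    .str.extract(r"\((\d{4})\)\s*$", expand=False)
    .astype(int)
)

# Turn the pipe-separated string into a Python list of genres.
sample_movies["genre_list"] = sample_movies["genres"].str.split("|")
print(sample_movies[["title", "year", "genre_list"]])
```

Given the README's warning about hand-entered data, real titles that do not match the pattern would yield missing values, and the `astype(int)` step would then need adjusting.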
2 Data preprocessing
Merge ((ratings, users), movies)
pd.merge
data = pd.merge(pd.merge(ratings, users), movies)
data
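Called without an `on` argument, pd.merge joins on the column names the two frames share (here `user_id`, then `movie_id`) and performs an inner join by default. The sketch below spells those keys out on tiny hand-made frames; the `_s` frame names and the movie title are hypothetical:

```python
import pandas as pd

# Tiny stand-ins for the ratings, users, and movies tables.
ratings_s = pd.DataFrame({"user_id": [1, 2], "movie_id": [10, 10], "rating": [5, 3]})
users_s = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies_s = pd.DataFrame({"movie_id": [10], "title": ["Example Movie (2000)"]})

# Keys inferred from the overlapping column names.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the keys and join type written explicitly.
explicit = (
    ratings_s.merge(users_s, on="user_id", how="inner")
             .merge(movies_s, on="movie_id", how="inner")
)

print(implicit.equals(explicit))
```

Writing the keys explicitly guards against an accidental join on a column that merely happens to share a name.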
Compute each movie's mean rating by gender
pd.DataFrame.pivot_table # rows -> columns
mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
mean_ratings
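The values are the mean movie ratings
The row labels are movie titles (the index)
The column labels are the genders
Filter out movies with fewer than 300 ratings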
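To make the reshaping concrete, here is a small sketch showing that this kind of pivot matches a two-key groupby followed by unstack. The frame `sample` and its values are made up for illustration:

```python
import pandas as pd

# Hand-made ratings: movie A rated by both genders, movie B only by women.
sample = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "F"],
    "rating": [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, cells hold the mean rating.
by_pivot = sample.pivot_table(values="rating", index="title",
                              columns="gender", aggfunc="mean")

# The same table built from a two-key groupby, then moving gender to columns.
by_group = sample.groupby(["title", "gender"])["rating"].mean().unstack()

print(by_pivot)
```

A title with no ratings from one gender gets a missing value in that column, which is why the later filtering step matters.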
pd.DataFrame.groupby
pd.core.groupby.DataFrameGroupBy.size
ratings_by_title = data.groupby('title').size()
ratings_by_title
pd.Series.index
active_titles = ratings_by_title.index[ratings_by_title >= 300]
active_titles
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
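The filtering step can be seen end to end on a toy frame. This sketch assumes a threshold of 2 instead of 300 so the effect is visible; `sample`, `min_ratings`, and the other names are illustrative:

```python
import pandas as pd

# Hand-made ratings: A has 3 ratings, B has 1, C has 2.
sample = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 3, 2, 4, 4],
})

min_ratings = 2  # hypothetical threshold; the notebook uses 300

# Number of ratings per title.
counts = sample.groupby("title").size()

# Boolean mask on the index keeps only titles that reach the threshold.
active = counts.index[counts >= min_ratings]

# Restrict the per-title means to those titles.
means = sample.groupby("title")["rating"].mean()
filtered = means.loc[active]
print(filtered)
```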
Top 10 movies among female and male viewers
pd.DataFrame.sort_values
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings
top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
top_male_ratings
top_female_ratings10 = top_female_ratings['F'][:10]
top_female_ratings10
top_male_ratings10 = top_male_ratings['M'][:10]
top_male_ratings10
rcParams['font.size'] = 16
# Recent seaborn releases require x and y to be passed as keywords
sns.barplot(x=top_female_ratings10.values, y=top_female_ratings10.index)
plt.xlabel('rating')
plt.title('top_female_ratings10', {'fontsize': rcParams['axes.titlesize']})
sns.barplot(x=top_male_ratings10.values, y=top_male_ratings10.index)
a = set(top_female_ratings10.index) # female
a
b = set(top_male_ratings10.index) # male
b
a & b # intersection
pd.DataFrame.loc
# Recent pandas releases reject a set as an indexer, so convert it to a list
mean_ratings.loc[list(a & b)]
intersection = mean_ratings.loc[list(a & b)]
intersection.describe()
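An alternative to Python sets is to intersect the two indexes directly, which returns an Index that `.loc` accepts as is. A minimal sketch with made-up titles; `female_top`, `male_top`, and `shared` are illustrative names:

```python
import pandas as pd

# Hypothetical top lists for each gender.
female_top = pd.Index(["A", "B", "C"])
male_top = pd.Index(["B", "C", "D"])

# Titles that appear in both lists.
shared = female_top.intersection(male_top)

# A toy table of mean ratings indexed by title.
means = pd.DataFrame(
    {"F": [4.5, 4.4, 4.3, 3.9], "M": [3.8, 4.4, 4.2, 4.6]},
    index=["A", "B", "C", "D"],
)

# An Index can be passed to .loc without conversion.
print(means.loc[shared])
```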
plt.legend
intersection.plot.barh()
plt.legend(prop={'size':8}, loc='center') # legend
Measure rating disagreement
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
mean_ratings
pd.DataFrame.sort_values
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff
Movies that men prefer
Reverse the sorted result and take the first 10 rows
sorted_by_diff[::-1][:10]
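Reversing an ascending sort and slicing gives the rows with the largest `diff`. The sketch below, on made-up numbers, compares that idiom with sorting in descending order directly; `sample`, `reversed_top`, and `direct_top` are illustrative names:

```python
import pandas as pd

# Toy mean ratings by gender for three titles.
sample = pd.DataFrame(
    {"F": [4.0, 3.0, 4.5], "M": [3.0, 4.0, 4.4]},
    index=["A", "B", "C"],
)
sample["diff"] = sample["M"] - sample["F"]

# The notebook's idiom: ascending sort, reversed, then the first rows.
reversed_top = sample.sort_values(by="diff")[::-1][:2]

# Sorting in descending order and taking the head.
direct_top = sample.sort_values(by="diff", ascending=False).head(2)

print(list(reversed_top.index), list(direct_top.index))
```

When several titles share exactly the same `diff`, the two approaches may order the tied rows differently.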
Ignoring gender
Standard deviation
rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title
Filter
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title
rating_std_by_title.sort_values(ascending=False)[:10]
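Note that pandas computes the sample standard deviation (`ddof=1`) by default, so a title with a single rating yields NaN. A small sketch on made-up ratings; `sample`, `std_by_title`, and `population_std` are illustrative names:

```python
import pandas as pd

# Toy ratings: A is divisive, B is unanimous, C has a single rating.
sample = pd.DataFrame({
    "title": ["A", "A", "B", "B", "C"],
    "rating": [1, 5, 3, 3, 4],
})

# Default: sample standard deviation (divides by n - 1).
std_by_title = sample.groupby("title")["rating"].std()

# Population standard deviation (divides by n) for comparison.
population_std = sample.groupby("title")["rating"].std(ddof=0)

print(std_by_title)
print(population_std)
```

Restricting the result to titles with many ratings, as the notebook does, keeps such single-rating cases out of the ranking.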
data.groupby('genres').size()
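Grouping on the raw `genres` column counts each pipe-separated combination as its own category. If counts per individual genre are wanted instead, one possible approach is to split the string and explode it into one row per genre. A sketch on made-up rows; `sample` and the derived names are illustrative:

```python
import pandas as pd

# Toy merged rows: two ratings of an Action|Sci-Fi movie, one of a Comedy.
sample = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "genres": ["Action|Sci-Fi", "Action|Sci-Fi", "Comedy"],
})

# Counts per genre combination, as in the notebook.
combined_counts = sample.groupby("genres").size()

# Counts per individual genre: split, explode to one row per genre, then count.
per_genre_counts = (
    sample.assign(genre=sample["genres"].str.split("|"))
          .explode("genre")
          .groupby("genre")
          .size()
)

print(combined_counts)
print(per_genre_counts)
```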