MovieLens 1M Dataset

Learn more about movies with rich data, images, and trailers.

!git clone https://github.com/wesm/pydata-book

!tree pydata-book/datasets/movielens

0 Importing the libraries

# Basics
import numpy as np # array handling
import pandas as pd # data loading and DataFrames
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical plots
from matplotlib import rcParams # plot settings
from matplotlib.cm import rainbow # colormap

%matplotlib inline
import warnings
warnings.filterwarnings('ignore') # suppress warning messages
np.set_printoptions(precision=4) # digits after the decimal point
pd.options.display.max_rows = 10 # maximum number of rows displayed

1 Reading the text files (the dataset is distributed as a ZIP archive)

User information
Demographic data (age, occupation, zip code)
Roughly 6,000 users (user IDs run from 1 to 6040)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A multi-character separator such as '::' needs the Python parsing engine
users = pd.read_table('pydata-book/datasets/movielens/users.dat', sep='::', header=None, names=unames, engine='python')
users

User information is in the file users.dat and is in the following format:

UserID::Gender::Age::Occupation::Zip-code

  • Gender is denoted by a “M” for male and “F” for female

  • Age is chosen from the following ranges:

    • 1: “Under 18”
    • 18: “18-24”
    • 25: “25-34”
    • 35: “35-44”
    • 45: “45-49”
    • 50: “50-55”
    • 56: “56+”
  • Occupation is chosen from the following choices:

    • 0: “other” or not specified
    • 1: “academic/educator”
    • 2: “artist”
    • 3: “clerical/admin”
    • 4: “college/grad student”
    • 5: “customer service”
    • 6: “doctor/health care”
    • 7: “executive/managerial”
    • 8: “farmer”
    • 9: “homemaker”
    • 10: “K-12 student”
    • 11: “lawyer”
    • 12: “programmer”
    • 13: “retired”
    • 14: “sales/marketing”
    • 15: “scientist”
    • 16: “self-employed”
    • 17: “technician/engineer”
    • 18: “tradesman/craftsman”
    • 19: “unemployed”
    • 20: “writer”
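
The age column stores a code, not an actual age, so a lookup table makes the output easier to read. The following is a minimal sketch, assuming a small hand-made `sample_users` frame (hypothetical, standing in for the real `users` table); the labels are copied from the list above.

```python
import pandas as pd

# Age-group labels copied from the README excerpt above.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Hypothetical three-row sample standing in for the real `users` table.
sample_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["F", "M", "M"],
    "age": [1, 56, 25],
    "occupation": [10, 16, 15],
})

# Series.map looks each code up in the dictionary; unknown codes become NaN.
sample_users["age_group"] = sample_users["age"].map(AGE_LABELS)
print(sample_users)
```

The same pattern should work for the occupation codes with a second dictionary.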

Movie ratings
About 1 million ratings

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('pydata-book/datasets/movielens/ratings.dat', sep='::', header=None, names=rnames, engine='python')
ratings

All ratings are contained in the file ratings.dat and are in the following format:

UserID::MovieID::Rating::Timestamp

  • UserIDs range between 1 and 6040
  • MovieIDs range between 1 and 3952
  • Ratings are made on a 5-star scale (whole-star ratings only)
  • Timestamp is represented in seconds since the epoch as returned by time(2)
  • Each user has at least 20 ratings
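
Because the timestamp is stored as seconds since the epoch, it can be converted into a readable datetime. A minimal sketch, assuming a hypothetical two-row `sample_ratings` frame in the same column layout as `ratings`:

```python
import pandas as pd

# Hypothetical sample in the same column layout as `ratings`.
sample_ratings = pd.DataFrame({
    "user_id": [1, 1],
    "movie_id": [1193, 661],
    "rating": [5, 3],
    "timestamp": [978300760, 978302109],
})

# unit="s" tells pandas the integers are seconds since 1970-01-01 (UTC).
sample_ratings["rated_at"] = pd.to_datetime(sample_ratings["timestamp"], unit="s")
print(sample_ratings[["timestamp", "rated_at"]])
```

The resulting column is timezone-naive and expressed in UTC.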

Movie metadata (genres and release year)
Roughly 4,000 movies

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('pydata-book/datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')
movies

Movie information is in the file movies.dat and is in the following format:

MovieID::Title::Genres

  • Titles are identical to titles provided by the IMDB (including year of release)

  • Genres are pipe-separated and are selected from the following genres:

    • Action
    • Adventure
    • Animation
    • Children’s
    • Comedy
    • Crime
    • Documentary
    • Drama
    • Fantasy
    • Film-Noir
    • Horror
    • Musical
    • Mystery
    • Romance
    • Sci-Fi
    • Thriller
    • War
    • Western
  • Some MovieIDs do not correspond to a movie due to accidental duplicate
    entries and/or test entries

  • Movies are mostly entered by hand, so errors and inconsistencies may exist
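
Since each title ends with the release year in parentheses, the year can be pulled into its own column. A minimal sketch, assuming a hypothetical `sample_movies` frame and assuming every title follows the `Name (YYYY)` pattern (hand-entered data may not always do so):

```python
import pandas as pd

# Hypothetical sample in the same column layout as `movies`.
sample_movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Capture the four digits inside the trailing parentheses.
sample_movies["year"] = (
    sample_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
)

# Strip the trailing "(YYYY)" to get a clean title.
sample_movies["clean_title"] = sample_movies["title"].str.replace(
    r"\s*\(\d{4}\)\s*$", "", regex=True
)
print(sample_movies[["clean_title", "year"]])
```

On the full table, titles that do not match the pattern would produce missing values, so the `astype(int)` step may need to be replaced with a nullable type.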

2 Data preprocessing

Merge ((ratings, users), movies)
pd.merge

data = pd.merge(pd.merge(ratings, users), movies)
data
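
Without an `on` argument, `pd.merge` joins on the column names the two frames share. The sketch below spells the keys out on tiny hypothetical frames (`ratings_s`, `users_s`, `movies_s` are illustrative names, not part of the original notebook) and uses `validate` to check that each rating matches at most one user and one movie.

```python
import pandas as pd

# Tiny hypothetical frames mirroring the three MovieLens tables.
ratings_s = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]})
users_s = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies_s = pd.DataFrame({"movie_id": [10, 20], "title": ["A (2000)", "B (1999)"]})

# Join ratings to users on user_id, then to movies on movie_id.
# validate="many_to_one" raises if a key is duplicated on the right side.
merged = (
    ratings_s
    .merge(users_s, on="user_id", how="inner", validate="many_to_one")
    .merge(movies_s, on="movie_id", how="inner", validate="many_to_one")
)
print(merged)
```

Naming the keys explicitly avoids surprises if two tables ever share an unrelated column name.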

Average rating of each movie by gender
pd.DataFrame.pivot_table # rows -> columns

mean_ratings = data.pivot_table('rating', index='title', 
                                columns='gender', aggfunc='mean')
mean_ratings

The values are the mean ratings
The row labels (the index) are the movie titles
The column labels are the genders
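
To make the reshaping concrete, here is a minimal sketch on a hypothetical five-row `sample` frame. It also shows a `groupby` plus `unstack` spelling that should give the same table.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (title, gender, rating).
sample = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, cells hold the mean rating.
pivot = sample.pivot_table("rating", index="title", columns="gender", aggfunc="mean")

# The same aggregation written as groupby + unstack.
by_group = sample.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(pivot)
```

A title that has no ratings from one gender would show NaN in that column.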

Filter out movies with fewer than 300 ratings
pd.DataFrame.groupby
DataFrameGroupBy.size

ratings_by_title = data.groupby('title').size()
ratings_by_title

pd.Series.index

active_titles = ratings_by_title.index[ratings_by_title >= 300]
active_titles

mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
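
The count-then-filter step can be checked on a small example. A minimal sketch, assuming a hypothetical `sample` frame and a threshold of 2 in place of 300:

```python
import pandas as pd

# Hypothetical ratings: A has 3 ratings, B has 1, C has 2.
sample = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 3, 2, 4, 5],
})

# Number of ratings per title.
counts = sample.groupby("title").size()

# Keep only the titles that reach the threshold (2 here, 300 in the text).
min_ratings = 2
popular = counts.index[counts >= min_ratings]

# Boolean mask on the index selects the matching titles.
print(list(popular))
```

The resulting index can then be passed to `.loc` to select those rows from a table indexed by title.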

Top 10 highest-rated movies among female and male viewers
pd.DataFrame.sort_values

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings

top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
top_male_ratings

top_female_ratings10 = top_female_ratings['F'][:10] 
top_female_ratings10

top_male_ratings10 = top_male_ratings['M'][:10] 
top_male_ratings10

rcParams['font.size'] = 16

# Recent seaborn releases expect x and y as keyword arguments
sns.barplot(x=top_female_ratings10.values, y=top_female_ratings10.index)
plt.xlabel('rating')
plt.title('top_female_ratings10', {'fontsize': rcParams['axes.titlesize']})

sns.barplot(x=top_male_ratings10.values, y=top_male_ratings10.index)

a = set(top_female_ratings10.index) # female top 10
a

b = set(top_male_ratings10.index) # male top 10
b

a & b # intersection

pd.DataFrame.loc

mean_ratings.loc[list(a & b)] # recent pandas versions reject a set as an indexer

intersection = mean_ratings.loc[list(a & b)]
intersection.describe()

plt.legend

intersection.plot.barh()
plt.legend(prop={'size':8}, loc='center') # legend

Measuring rating disagreement

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
mean_ratings

pd.DataFrame.sort_values

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff

Movies that men rated higher than women did
Reverse the sorted result and take the first 10 rows

sorted_by_diff[::-1][:10]
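
Because `diff` is defined as M minus F, the two ends of the sorted table answer opposite questions. A minimal sketch on a hypothetical three-movie `means` table:

```python
import pandas as pd

# Hypothetical mean ratings by gender for three movies.
means = pd.DataFrame(
    {"F": [4.5, 3.0, 4.0], "M": [3.5, 4.2, 4.0]},
    index=["A", "B", "C"],
)

# Positive diff: men rated it higher. Negative diff: women rated it higher.
means["diff"] = means["M"] - means["F"]
sorted_by_diff = means.sort_values(by="diff")

# Ascending order puts the movies women preferred first...
preferred_by_women = sorted_by_diff.head(1)
# ...and reversing the rows puts the movies men preferred first.
preferred_by_men = sorted_by_diff[::-1].head(1)

print(preferred_by_women.index[0], preferred_by_men.index[0])
```

On the real table, `sorted_by_diff[:10]` would therefore list the movies women preferred most.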

Disagreement regardless of gender

Standard deviation of the ratings

rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title

Filter to the active titles

rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title

rating_std_by_title.sort_values(ascending=False)[:10]
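
The standard deviation captures how divided viewers are, which the mean alone hides. A minimal sketch, assuming a hypothetical `sample` frame in which two movies share the same mean rating:

```python
import pandas as pd

# Hypothetical ratings: A splits viewers (1s and 5s), B gets all 3s.
sample = pd.DataFrame({
    "title": ["A"] * 4 + ["B"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 3, 3],
})

# Mean and sample standard deviation of the ratings for each title.
stats = sample.groupby("title")["rating"].agg(["mean", "std"])
print(stats)
```

Both movies average 3 stars, but only A has a large spread, so sorting by `std` surfaces the most divisive titles.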

data.groupby('genres').size()
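
Grouping on `genres` counts each pipe-separated combination as its own category. To count individual genres instead, one option is to split the string and expand it to one row per genre. A minimal sketch on a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical merged rows: one row per rating, genres still pipe-separated.
sample = pd.DataFrame({
    "title": ["T1", "T1", "T2"],
    "genres": ["Animation|Comedy", "Animation|Comedy", "Comedy|Drama"],
})

# Split on the pipe, expand to one row per genre, then count each genre.
per_genre = sample["genres"].str.split("|").explode().value_counts()
print(per_genre)
```

A rating of a multi-genre movie is counted once for each of its genres, so the totals add up to more than the number of ratings.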

Reposted from blog.csdn.net/hezuijiudexiaobai/article/details/104451980