Python for Data Analysis: the MovieLens 1M Dataset

[Link to the dataset used in this analysis](http://github.com/wesm/pydata-book)

First, use pandas.read_table to load each table into a pandas.DataFrame object:

import pandas as pd

# Show fewer rows when printing DataFrames
pd.options.display.max_rows = 10

# '::' is a multi-character separator, so pandas needs the Python parsing engine;
# passing engine='python' explicitly avoids a ParserWarning.
unames = ['user_id','gender','age','occupation','zip']
users = pd.read_table('datasets/movielens/users.dat',sep = '::',header = None,names = unames,engine = 'python')

rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table('datasets/movielens/ratings.dat',sep = '::',header = None,names = rnames,engine = 'python')

mnames = ['movie_id','title','genres']
movies = pd.read_table('datasets/movielens/movies.dat',sep = '::',header = None,names = mnames,engine = 'python')
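If you don't have the dataset files at hand, here is a small sketch of the same loading step that reads from an in-memory string instead of users.dat. The two sample rows and the name sample_users are made up for illustration; only the '::'-delimited layout and the column names follow the original.

```python
import io
import pandas as pd

# Hypothetical sample rows in the same '::'-delimited layout as users.dat
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is handled by the Python engine
sample_users = pd.read_table(io.StringIO(sample), sep='::', header=None,
                             names=unames, engine='python')
print(sample_users)
```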

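Next, merge the ratings table with the users table, then merge that result with the movies table. pandas infers the join keys from the column names the two tables share: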

data = pd.merge(pd.merge(ratings,users),movies)
print(data)
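To see what the key inference does, here is a minimal sketch on tiny made-up tables (ratings_s, users_s and movies_s are hypothetical stand-ins with the same key columns as the real tables). It compares the implicit merge with a version where the keys are spelled out via on=.

```python
import pandas as pd

# Tiny made-up tables that share key columns with the real ones
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['Movie A', 'Movie B']})

# Implicit form: join keys are the overlapping column names
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Explicit form: the same joins with the keys named
explicit = ratings_s.merge(users_s, on='user_id').merge(movies_s, on='movie_id')

print(implicit)
print(implicit.equals(explicit))
```

Naming the keys is a little more verbose, but it keeps the merge from silently changing if another shared column is added to one of the tables later.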

Use the pivot_table method to get the mean rating of each movie, broken down by gender:

mean_ratings = data.pivot_table('rating',index = 'title',columns='gender',aggfunc='mean')
print(mean_ratings[:5])
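A rough sketch of what this pivot computes, on a made-up five-row table (toy is hypothetical data, not from MovieLens). The same numbers can be had with groupby followed by unstack, which may make the reshaping easier to picture.

```python
import pandas as pd

# Made-up ratings: two movies, rated by viewers of both genders
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'rating': [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, cells hold the mean rating
pivoted = toy.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

# Same aggregation written as groupby + unstack
grouped = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(pivoted)
print(grouped)
```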

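Next, filter out movies with fewer than 250 ratings. Calling size() on the data grouped by title returns a Series whose elements are the group sizes, one per title. The index of the titles with at least 250 ratings can then be used to select the matching rows from mean_ratings: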

ratings_by_title = data.groupby('title').size()
print(ratings_by_title[:10])
active_titles = ratings_by_title.index[ratings_by_title >= 250]
print(active_titles)
mean_ratings = mean_ratings.loc[active_titles]
print(mean_ratings)
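Here is the same count-then-filter pattern as a small sketch on made-up data, with a threshold of 2 instead of 250. The names toy, counts, popular and means are hypothetical; this version applies the boolean mask to the Series itself and then takes its index, which selects the same titles.

```python
import pandas as pd

# Made-up ratings: 'A' has 3 ratings, 'B' has 1, 'C' has 2
toy = pd.DataFrame({'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
                    'rating': [5, 4, 3, 2, 4, 5]})

# Number of ratings per title
counts = toy.groupby('title').size()

# Titles with at least 2 ratings
popular = counts[counts >= 2].index

# Mean rating per title, restricted to the popular titles
means = toy.groupby('title')['rating'].mean()
print(means.loc[popular])
```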

To see the top movies among female viewers, sort by the F column in descending order:

top_female_ratings = mean_ratings.sort_values(by = 'F',ascending = False)
print(top_female_ratings[:10])

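To find the movies that divide male and female viewers the most, one approach is to add a column to mean_ratings holding the difference between the two means: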

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

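Sorting by 'diff' orders the movies by rating gap. Since diff is M minus F, the default ascending order puts the most negative values first, which are the movies women preferred: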

sorted_by_diff = mean_ratings.sort_values(by = 'diff')
print(sorted_by_diff[:10])

Reversing the row order and slicing off the top 10 rows gives the movies that men preferred but women rated lower:

print(sorted_by_diff[::-1][:10])
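To make the sign convention concrete, a small sketch with three made-up movies (toy_means and its values are hypothetical). The first row after an ascending sort is the one women rated highest relative to men, and the first row of the reversed frame is the opposite.

```python
import pandas as pd

# Made-up mean ratings per gender for three movies
toy_means = pd.DataFrame({'F': [4.5, 3.0, 4.0],
                          'M': [3.5, 4.2, 4.0]},
                         index=['Movie A', 'Movie B', 'Movie C'])

# Positive diff: men rated it higher; negative diff: women rated it higher
toy_means['diff'] = toy_means['M'] - toy_means['F']

by_diff = toy_means.sort_values(by='diff')

female_pref = by_diff.index[0]        # most negative diff
male_pref = by_diff[::-1].index[0]    # most positive diff

print(by_diff)
print(female_pref, male_pref)
```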

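Suppose instead you want the movies that provoked the most disagreement among viewers, independent of gender. Disagreement can be measured by the variance or standard deviation of the ratings: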

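# Standard deviation of ratings, grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# Keep only the active titles (at least 250 ratings)
rating_std_by_title = rating_std_by_title.loc[active_titles]
# Sort in descending order so the most divisive movies come first
print(rating_std_by_title.sort_values(ascending = False)[:10])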
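A quick sketch of why standard deviation works as a measure of disagreement, on made-up data (toy and std_by_title are hypothetical names). Movie 'A' gets polarized ratings of 1 and 5, while 'B' gets ratings that nearly agree. Note that pandas' std() computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up ratings: 'A' is polarizing, 'B' is rated consistently
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation (ddof=1) of the ratings per title
std_by_title = toy.groupby('title')['rating'].std()

# The most divisive title comes first
print(std_by_title.sort_values(ascending=False))
```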

Reposted from blog.csdn.net/weixin_43303087/article/details/84037242