[Tools] A detailed introduction to the MovieLens dataset

MovieLens dataset

The MovieLens dataset contains ratings that many users gave to many movies, together with movie metadata and user attributes.

Download link

http://files.grouplens.org/datasets/movielens/

Introduction

Let's take the ml-100k dataset as an example.

The main files are u.data (ratings), u.item (movie information), and u.user (user information).

After downloading and extracting the archive, the files have the following meanings:

  • allbut.pl - A script that generates training and test sets in which all but n of each user's ratings go into the training data.

  • mku.sh - A shell script that generates all the u data sets (the .base / .test splits) from u.data.

  • u.data - The full data set: 100,000 ratings of 1,682 movies from 943 users. Each user rated at least 20 movies. Users and movies are numbered consecutively starting from 1. The rows are in random order.

  • Tab-separated list: user id | item id | rating | timestamp. Timestamps are Unix seconds since 1/1/1970 UTC.

  • u.genre - The list of genres.

  • u.info - The number of users, movies, and ratings in the u.data data set.

  • u.item - Movie information.

  • Field list (the README calls it tab-separated, but the file itself uses | as the delimiter): movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children’s | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western

  • The last 19 fields are genre flags: 1 means the movie belongs to that genre, 0 means it does not. A movie can belong to several genres at once.

  • The movie ids are the same ones used in u.data.

  • u.occupation - The list of occupations.

  • u.user - Demographic information about the users.

  • Field list (again delimited by | in the file itself): user id | age | gender | occupation | zip code

  • The user ids are the same ones used in u.data.

  • u1.base, u1.test, ..., u5.base, u5.test - 80% / 20% splits of u.data into training and test data. The test sets of u1, ..., u5 are disjoint, so for 5-fold cross-validation you can repeat the experiment on each training/test pair and average the results. These data sets can be generated from u.data by mku.sh.

  • ua.base, ua.test, ub.base, ub.test - Splits of u.data into a training set and a test set, with exactly 10 ratings per user in the test set. ua.test and ub.test are disjoint. These data sets can be generated from u.data by mku.sh.
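
To make the file formats above concrete, here is a minimal sketch of how the three main files could be parsed with nothing but the Python standard library. It is only an illustration, not an official loader: the function names (parse_ratings, parse_items, parse_users) are made up for this example, the sample rows are small stand-ins in the documented layout, and the sketch assumes that u.data is tab-separated while u.item and u.user are separated by |.

```python
import csv
import io
from datetime import datetime, timezone

# Genre flag names, in the order of the last 19 fields of u.item.
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_ratings(lines):
    """Parse u.data-style rows: user id, item id, rating, timestamp (tab-separated)."""
    ratings = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        user_id, item_id, rating, timestamp = row
        ratings.append({
            "user_id": int(user_id),
            "item_id": int(item_id),
            "rating": int(rating),
            # Timestamps are Unix seconds, so convert them to UTC datetimes.
            "rated_at": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
        })
    return ratings


def parse_items(lines):
    """Parse u.item-style rows (assumed |-separated): 5 info fields + 19 genre flags."""
    items = []
    for row in csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        flags = row[5:24]  # the last 19 fields are 0/1 genre flags
        items.append({
            "movie_id": int(row[0]),
            "title": row[1],
            "release_date": row[2],
            "genres": [name for name, flag in zip(GENRES, flags) if flag == "1"],
        })
    return items


def parse_users(lines):
    """Parse u.user-style rows (assumed |-separated): id, age, gender, occupation, zip."""
    users = []
    for row in csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        user_id, age, gender, occupation, zip_code = row
        users.append({
            "user_id": int(user_id),
            "age": int(age),
            "gender": gender,
            "occupation": occupation,
            "zip_code": zip_code,  # kept as text: zip codes are not always numeric
        })
    return users


# Illustrative in-memory rows standing in for the real files.
sample_ratings = io.StringIO("196\t242\t3\t881250949\n")
sample_items = io.StringIO(
    "1|Toy Story (1995)|01-Jan-1995||http://example.invalid/toy-story"
    "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0\n"
)
sample_users = io.StringIO("1|24|M|technician|85711\n")

print(parse_ratings(sample_ratings)[0]["rating"])
print(parse_items(sample_items)[0]["genres"])
print(parse_users(sample_users)[0]["occupation"])
```

With the real data you would pass an open file instead of the in-memory samples, for example open("ml-100k/u.item", encoding="latin-1", newline=""), assuming the archive was extracted into a directory named ml-100k. The latin-1 encoding matters for u.item because some movie titles contain accented characters. Since u1.base to u5.test and ua/ub share the u.data layout, parse_ratings should work on those files as well.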

Origin blog.csdn.net/weixin_51656605/article/details/111937937