How to build a simple recommendation system with Python?

We covered the basics of recommendation systems in an earlier article. In this article, we explain how to build a simple recommendation system with Python.

The data set used here is MovieLens, compiled by the GroupLens research team at the University of Minnesota. It is published in several sizes, including versions with 1 million, 10 million and 20 million ratings. MovieLens also has a website where we can register, write reviews and get movie recommendations. Now let's get hands-on.

In this article, we will use MovieLens to build a simple item-based recommendation system. Before we begin, the first thing to do is import pandas and NumPy.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Next, we use pandas read_csv() to load the data set. The data set is tab-delimited, so we pass \t to the sep parameter. We then use the names parameter to supply the column names.

df = pd.read_csv('u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
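To see how the sep and names arguments behave without downloading anything, here is a minimal sketch that parses a few tab-separated rows from an in-memory string. The sample rows and the name sample_df are invented for illustration; only the column layout follows the description above.

```python
import io

import pandas as pd

# Three made-up rows in a tab-separated layout with no header line
# (illustrative values, not real MovieLens records).
sample = "1\t50\t5\t881250949\n1\t172\t4\t881250950\n2\t50\t3\t891717742\n"

# sep="\t" splits each line on tabs; names supplies the column headers
# because the data itself has none.
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)
print(sample_df)
```

Without the names argument, pandas would treat the first data row as the header, so one rating would silently be lost.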

Next, we look at the first few rows to check the data we are working with.

df.head()

It would be much nicer to see the movie titles rather than just their IDs. Let's load the movie titles and merge them with this data set.

movie_titles = pd.read_csv('Movie_Titles')
movie_titles.head()

Since the item_id column is the same in both data sets, we can merge them on this column.

df = pd.merge(df, movie_titles, on='item_id')
df.head()
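The merge step can be sketched on toy data. The two small frames below (toy_ratings, toy_titles) and their movie names are hypothetical; the point is only to show how a shared item_id column lines the rows up.

```python
import pandas as pd

# Hypothetical toy data: two users rating two movies.
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "item_id": [10, 20, 10], "rating": [5, 3, 4]}
)
toy_titles = pd.DataFrame(
    {"item_id": [10, 20], "title": ["Movie A (1997)", "Movie B (1997)"]}
)

# pd.merge defaults to an inner join: it keeps the rows whose item_id
# appears in both frames and attaches the matching title to each rating.
toy_merged = pd.merge(toy_ratings, toy_titles, on="item_id")
print(toy_merged)
```

Every rating row gains a title column, so the number of rows stays the same as long as every item_id has a title.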

The columns in the data set represent:

user_id - the ID of the user who rated the movie.

item_id - the ID of the movie.

rating - the rating the user gave the movie, between 1 and 5.

timestamp - the time at which the movie was rated.

title - the title of the movie.

The describe or info commands give a brief summary of the data set. This is important if you really want to understand the data you are working with.

df.describe()

As we can see, the data set has 100,003 records, the average rating is about 3.52 and the maximum rating is 5.

Now we'll create a dataframe that contains the average rating and the number of ratings for each movie. These ratings will later be used to calculate the correlation between movies. Correlation is a statistical measure of the extent to which two or more variables fluctuate together. The higher the correlation coefficient between two movies, the more similar they are.

The following example uses the Pearson correlation coefficient, a number between -1 and 1. A value of 1 indicates a positive linear correlation, -1 a negative correlation, and 0 no linear correlation. In other words, movies with zero correlation are not similar at all.
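As a rough illustration of what the coefficient measures, the sketch below computes it by hand with NumPy. The pearson helper and the three rating vectors are made up for this example; they are not part of the original tutorial.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered @ y_centered) / np.sqrt(
        (x_centered @ x_centered) * (y_centered @ y_centered)
    )


# Made-up ratings from four users for three hypothetical movies.
movie_a = [5, 4, 3, 2]
movie_b = [4, 3, 2, 1]  # rises and falls exactly in step with movie_a
movie_c = [2, 3, 4, 5]  # moves exactly opposite to movie_a

print(pearson(movie_a, movie_b))  # close to 1
print(pearson(movie_a, movie_c))  # close to -1
```

Note that movie_b scores one point lower than movie_a for every user yet still correlates perfectly: Pearson correlation looks at how ratings move together, not at their absolute level.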

We will use the pandas groupby functionality to create the dataframe. We group the data set by title and take the mean to get the average rating for each movie.

ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()

Next we create a number_of_ratings column so that we can see how many ratings each movie received. With this in place, we can look at the relationship between a movie's average rating and the number of ratings it got. A 5-star movie may well have been rated by just one person, in which case calling it a 5-star movie is statistically unsound.

Therefore, when building a recommendation system, we need to set a threshold on the number of ratings. We can create the new column with the pandas groupby functionality: group by the title column, then use the count function to count the ratings for each movie. After that, we can use the head() function to view the new dataframe.

ratings['number_of_ratings'] = df.groupby('title')['rating'].count()
ratings.head()
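The reason for tracking the count alongside the mean shows up clearly on a tiny made-up example. The frame toy and the movie names below are hypothetical.

```python
import pandas as pd

# Hypothetical ratings: "Movie A" has three ratings, "Movie B" only one.
toy = pd.DataFrame(
    {
        "title": ["Movie A", "Movie A", "Movie A", "Movie B"],
        "rating": [4, 5, 3, 5],
    }
)

# Average rating per title, then the number of ratings per title.
toy_stats = pd.DataFrame(toy.groupby("title")["rating"].mean())
toy_stats["number_of_ratings"] = toy.groupby("title")["rating"].count()
print(toy_stats)
```

Here "Movie B" ends up with a higher average than "Movie A", but that average rests on a single rating, which is exactly the situation a threshold is meant to guard against.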

Next, we use the pandas plotting functionality to draw a histogram that shows the distribution of ratings:

import matplotlib.pyplot as plt
# Jupyter notebook magic: render plots inline
%matplotlib inline
ratings['rating'].hist(bins=50)

We can see that most movies have an average rating between 2.5 and 4. The number_of_ratings column can be visualized in a similar way.

ratings['number_of_ratings'].hist(bins=60)

It is clear from the histogram above that most movies have very few ratings. The movies with the most ratings are the very famous ones.

Now let us look at the relationship between a movie's average rating and its number of ratings. We can draw a scatter plot with seaborn, using the jointplot() function.

import seaborn as sns
sns.jointplot(x='rating', y='number_of_ratings', data=ratings)

The chart shows a positive relationship between a movie's average rating and its number of ratings: the more ratings a movie receives, the higher its average rating tends to be.

Create a simple item-based recommendation system

Next we will quickly create a simple item-based recommendation system.

First, we need to convert the data set into a matrix with the movie titles as the columns, user_id as the index and the ratings as the values. This gives us a dataframe whose columns are movie titles and whose rows are user IDs. Each column holds all the ratings that users gave to one movie. A NaN rating means that the user did not rate that movie.

We will use this matrix to calculate the correlation between the ratings of a single movie and the ratings of all the other movies. The matrix can be built with pandas pivot_table.

movie_matrix = df.pivot_table(index='user_id', columns='title', values='rating')
movie_matrix.head()
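The reshaping can be sketched on a handful of made-up rows. The frame toy and the names in it are hypothetical; the pivot_table call mirrors the one above.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
        "rating": [5, 3, 4, 2],
    }
)

# One row per user, one column per title. Cells with no rating become NaN.
toy_matrix = toy.pivot_table(index="user_id", columns="title", values="rating")
print(toy_matrix)
```

User 2 never rated "Movie B" and user 3 never rated "Movie A", so those cells come out as NaN. On real data most cells look like this, because each user rates only a small fraction of all movies.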

Next, let's find the movies with the most ratings and pick two of them to work with. We use pandas sort_values with ascending set to False so that the most-rated movies come first, then use the head() function to see the ten movies with the most ratings.

ratings.sort_values('number_of_ratings', ascending=False).head(10)

Suppose a user has watched Air Force One (1997) and Contact (1997), and we want to recommend other, similar movies based on that viewing history. We can do this by computing the correlation between the ratings of these two movies and the ratings of every other movie in the data set. The first step is to pull each movie's user ratings out of movie_matrix.

AFO_user_rating = movie_matrix['Air Force One (1997)']
contact_user_rating = movie_matrix['Contact (1997)']

Each result is indexed by user_id and shows the rating that user gave to the movie.

AFO_user_rating.head()
contact_user_rating.head()

We can compute the correlation between two sets of ratings with the pandas corrwith function. This step gives us the correlation between each movie's ratings and the ratings of Air Force One (1997).

similar_to_air_force_one = movie_matrix.corrwith(AFO_user_rating)

We can see that the correlation between Air Force One (1997) and Till There Was You (1997) is 0.867. This indicates a strong similarity between the two movies.

similar_to_air_force_one.head()
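The corrwith step can be tried out on a small made-up matrix. Everything in toy_matrix below is hypothetical; "Movie D" is included to show what happens when a movie shares only one rater with the reference movie.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix; NaN means "not rated".
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5, 4, 2, 1, np.nan],
        "Movie B": [4, 5, 1, 2, 3],
        "Movie C": [1, 2, 5, 4, np.nan],
        "Movie D": [np.nan, np.nan, np.nan, 5, 4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Correlate every column with the ratings of "Movie A". For each pair,
# only the users who rated both movies are taken into account.
similar_to_a = toy_matrix.corrwith(toy_matrix["Movie A"])
print(similar_to_a)
```

"Movie B" comes out positively correlated with "Movie A" and "Movie C" negatively. "Movie D" shares a single rater with "Movie A", which is not enough to compute a correlation, so its entry is NaN.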

Following the same steps, we can also calculate the correlation between the ratings of Contact (1997) and the ratings of the other movies:

similar_to_contact = movie_matrix.corrwith(contact_user_rating)

We find a very strong correlation (0.904) between Contact (1997) and Till There Was You (1997).

similar_to_contact.head()

As mentioned earlier, not every user rated every movie, so the matrix has many missing values. To make the results look cleaner, we drop these null values and convert the correlation results into dataframes.

corr_contact = pd.DataFrame(similar_to_contact, columns=['Correlation'])
corr_contact.dropna(inplace=True)
corr_contact.head()

corr_AFO = pd.DataFrame(similar_to_air_force_one, columns=['correlation'])
corr_AFO.dropna(inplace=True)
corr_AFO.head()

The two dataframes above show the movies most similar to Contact (1997) and Air Force One (1997). However, there is a problem: some of these movies have very few ratings and may end up being recommended simply because one or two users gave them a 5-star rating.

This problem can be solved by setting a threshold on the number of ratings. From the earlier histogram, we saw that the number of ratings drops off sharply from 100 onward. We can use this as the threshold, although other suitable values are worth considering too. To apply it, we join each of the two dataframes with the number_of_ratings column from the ratings dataframe.

corr_AFO = corr_AFO.join(ratings['number_of_ratings'])
corr_contact = corr_contact.join(ratings['number_of_ratings'])
corr_AFO.head()
corr_contact.head()

Now we can get the movies most similar to Air Force One (1997), limited to movies with more than 100 ratings. We then sort them by the correlation column and view the top 10.

corr_AFO[corr_AFO['number_of_ratings'] > 100].sort_values(by='correlation', ascending=False).head(10)

We notice that Air Force One (1997) correlates most strongly with itself, which is not surprising. The next most similar movie to Air Force One (1997) is Hunt for Red October, with a correlation coefficient of 0.554.

Clearly, changing the threshold on the number of ratings gives different results from the earlier approach. Limiting the number of ratings gives us better results.
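The effect of the threshold can be sketched on made-up numbers. The frame toy_corr, its values and the helper top_similar are all hypothetical; the filter and sort mirror the expression used above.

```python
import pandas as pd

# Hypothetical correlation results joined with rating counts.
toy_corr = pd.DataFrame(
    {
        "correlation": [1.0, 0.95, 0.55, 0.40],
        "number_of_ratings": [431, 2, 227, 150],
    },
    index=pd.Index(
        ["Seed Movie", "Obscure Movie", "Movie X", "Movie Y"], name="title"
    ),
)


def top_similar(corr_df, min_ratings, n=10):
    """Hypothetical helper: keep movies with more than min_ratings ratings,
    then return the n most correlated ones."""
    popular = corr_df[corr_df["number_of_ratings"] > min_ratings]
    return popular.sort_values(by="correlation", ascending=False).head(n)


print(top_similar(toy_corr, min_ratings=100))  # "Obscure Movie" is filtered out
print(top_similar(toy_corr, min_ratings=1))    # "Obscure Movie" ranks second
```

With a threshold of 100, the movie backed by only two ratings disappears from the list. With a threshold of 1 it ranks directly below the seed movie despite having almost no evidence behind it.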

Now we repeat the procedure above to see the movies most correlated with Contact (1997):

corr_contact[corr_contact['number_of_ratings'] > 100].sort_values(by='Correlation', ascending=False).head(10)

The movie most similar to Contact (1997) is Philadelphia (1993), with a correlation coefficient of 0.446 and 137 ratings. So if someone likes Contact (1997), we can recommend the movies above to them.

This is a very simple way to build a recommendation system, and it does not meet industry standards. One way to improve it is to build a memory-based collaborative filtering system. In that case, we would split the data into training and test sets and use a technique such as cosine similarity to compute the similarity between movies. An alternative is to build a model-based collaborative filtering system. Either way, the model can then be evaluated with a technique such as Root Mean Squared Error (RMSE).
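The two techniques named above can be sketched in a few lines of NumPy. This is not a collaborative filtering system, only the two building blocks. The helpers cosine_similarity and rmse, the rating vectors, and the use of 0 for "not rated" are all assumptions made for this illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))


# Made-up ratings from four users; 0 stands in for "not rated"
# (an assumption of this sketch, not something the tutorial prescribes).
movie_a = [5, 4, 0, 1]
movie_b = [4, 5, 0, 2]
print(cosine_similarity(movie_a, movie_b))

# Made-up predictions compared against held-out test ratings.
print(rmse(predicted=[4.0, 3.0, 5.0], actual=[5.0, 3.0, 4.0]))
```

A cosine similarity near 1 means the two movies were rated in a similar pattern, while a lower RMSE on the test set means the predicted ratings are closer to the real ones.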

Github: mwitiderrick/simple-recommender-

Source: How to Build a Simple Recommender System in Python

(Compiled and published by Xianjian, 4Paradigm.)

Origin blog.51cto.com/13945147/2427200