TMDB Movie Dataset Analysis

In this report, I use Python to analyse trends in the movie market.

Packages: pandas, NumPy, Matplotlib, seaborn, json

IDE: PyCharm

Major questions:

  • How have movie genres changed over time?
  • How do Universal Pictures and Paramount Pictures compare?
  • How do movies based on novels compare with original movies?

1. Data import and cleaning

import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style('darkgrid')

moviesdf = pd.read_csv('movies.csv')
movdf = pd.read_csv('credits.csv')

1) Fill missing values

# Locate the rows whose release date is missing
null = moviesdf['release_date'].isnull()
moviesdf.loc[null, :]
# Fill the gaps with a placeholder date
moviesdf['release_date'] = moviesdf['release_date'].fillna('2017-11-01')

2) Convert data type

Date

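# Assign the whole column so that its dtype becomes datetime64;
# values that cannot be parsed are turned into NaT by errors='coerce'
moviesdf['release_date'] = pd.to_datetime(moviesdf['release_date'],
                                          format='%Y-%m-%d',
                                          errors='coerce')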
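
To make the effect of errors='coerce' concrete, here is a minimal sketch on a few made-up values (the sample series is hypothetical and not taken from the dataset). Entries that do not match the format become NaT instead of raising an error, so they can be counted before going further.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset
sample = pd.Series(['2009-12-10', 'not a date', None])

parsed = pd.to_datetime(sample, format='%Y-%m-%d', errors='coerce')

# Entries that could not be parsed are NaT and can be counted
n_bad = int(parsed.isna().sum())

# The .dt accessor is available once the dtype is datetime64
first_year = parsed.dt.year.iloc[0]
```

Checking the number of NaT values after conversion is a cheap way to confirm that the placeholder fill covered every missing date.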

JSON into strings

# Each of these columns holds a JSON array of objects. Keep only the
# 'name' fields and store them as the string form of a Python list.
def names_as_str(raw):
    return str([item['name'] for item in json.loads(raw)])

for col in ['genres', 'keywords', 'production_companies', 'production_countries']:
    moviesdf[col] = moviesdf[col].apply(names_as_str)
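
The conversion above can be checked on a single cell. The sketch below assumes a cell value shaped like the ones in these columns (a JSON array of objects with a "name" field); the helper name names_from_json and the sample string are illustrative only.

```python
import json

def names_from_json(raw):
    """Return the 'name' field of every object in a JSON array string."""
    return [item['name'] for item in json.loads(raw)]

# Hypothetical cell value in the same shape as the JSON columns
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

names = names_from_json(raw_genres)
empty = names_from_json('[]')  # a movie with no genres yields an empty list
```

Note that some names contain spaces ("Science Fiction"), which matters for any later string-based splitting.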

2. Data processing and visualising

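Summarise genres as lists

# Turn "['Action', 'Science Fiction']" back into a list of names.
# Splitting on ', ' keeps the spaces inside multi-word genre names.
moviesdf['genres'] = moviesdf['genres'].str.strip('[]').str.replace("'", '', regex=False)
moviesdf['genres'] = moviesdf['genres'].str.split(', ')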

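# Flatten all genre lists into one list and count each genre
list1 = []
for i in moviesdf['genres']:
    list1.extend(i)
# value_counts() already sorts in descending order; keep the top 10.
# to_frame('total') names the column explicitly, so the result does not
# depend on the default column name, which differs between pandas versions.
genres = pd.Series(list1).value_counts().head(10).to_frame('total')
genres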
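
As an alternative to flattening the lists by hand, pandas can do the same count with explode(). This is a minimal sketch on a toy frame (the data is made up), assuming each cell of the column is a list of genre names.

```python
import pandas as pd

# Toy frame: each cell is a list of genre names (hypothetical data)
toy = pd.DataFrame({'genres': [['Drama', 'Comedy'], ['Drama'], ['Action', 'Drama']]})

# explode() gives one row per (movie, genre) pair; value_counts() tallies them
counts = toy['genres'].explode().value_counts()

# Keep the ten most frequent genres in a one-column frame named 'total'
top = counts.head(10).to_frame('total')
```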

1) Bar plot: number of movies per genre

f,ax=plt.subplots(figsize=(12,10))
g=sns.barplot(y=genres.index,x="total",data=genres,palette="Blues_d",ax=ax)
plt.show()

Figure_1.png

2) Q1: How have movie genres changed over time?

# Extract the release year; pd.to_datetime is a no-op if the column is
# already datetime64 and guarantees that the .dt accessor is available
moviesdf['year'] = pd.to_datetime(moviesdf['release_date']).dt.year
moviesdf['year'].head()

min_year = moviesdf['year'].min()
max_year = moviesdf['year'].max()

liste_genres = set()
for s in moviesdf['genres']:
    liste_genres = set().union(s, liste_genres)
liste_genres = list(liste_genres)
liste_genres

# One row per genre, one column per year, all counts starting at 0
genre_df = pd.DataFrame(0, index=liste_genres, columns=range(min_year, max_year + 1))
# Add 1 to the (genre, year) cell for every genre of every movie
for genre_list, yr in zip(moviesdf['genres'], moviesdf['year']):
    for genre in genre_list:
        genre_df.loc[genre, yr] += 1

genre_df
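
The same genre-by-year table can also be built without an explicit loop. The sketch below uses a toy frame (made-up data) and assumes the genres column holds lists. Unlike the table above, pd.crosstab only creates columns for years that actually occur in the data.

```python
import pandas as pd

# Toy frame with a year and a list of genres per movie (hypothetical data)
toy = pd.DataFrame({
    'year': [1999, 1999, 2000],
    'genres': [['Drama', 'Comedy'], ['Drama'], ['Action']],
})

# One row per (movie, genre) pair
long = toy.explode('genres')

# Rows are genres, columns are years, cells are movie counts
table = pd.crosstab(long['genres'], long['year'])
```

If a continuous year axis is needed for plotting, the result could be reindexed over the full range of years with a fill value of 0.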

plt.figure(figsize=(15,8))
plt.plot(genre_df.T)
plt.title('Number of movies per genre by year')
plt.xticks(range(1910,2020,5))
plt.legend(genre_df.index)
plt.show()

Figure_2.png

*The number of movies in each genre increases over time, with rapid growth from 1975 to 1995.

*After 1995, the number of dramas, comedies and thrillers increased dramatically.

3) Q2: How do Universal Pictures and Paramount Pictures compare?

plt.figure(figsize = (7,4))
two = ['Universal Pictures', 'Paramount Pictures']
num = [77015832,70100000]
plt.bar(np.arange(len(two)), num, color = 'c', width = 0.1, align = 'center')
plt.ylabel('revenue')
plt.xticks(np.arange(len(two)), two)
plt.title('Universal Pictures vs Paramount Pictures')
plt.grid(True)
plt.show()

Figure_4.png

*As of 2017, Universal Pictures has slightly higher revenue than Paramount Pictures.
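
The two revenue figures in the plot are hard-coded, and the post does not show how they were obtained. One possible way to compute a comparable figure is sketched below on toy data. It assumes that production_companies holds a list of names per movie and that a numeric revenue column exists; the helper company_revenue and the choice between mean and sum are illustrative, not the author's method.

```python
import pandas as pd

# Toy data (hypothetical): company lists and revenue per movie
toy = pd.DataFrame({
    'production_companies': [
        ['Universal Pictures'],
        ['Paramount Pictures', 'Universal Pictures'],
        ['Paramount Pictures'],
    ],
    'revenue': [100, 50, 30],
})

def company_revenue(df, company, how='mean'):
    """Aggregate revenue over all movies that list the given company."""
    mask = df['production_companies'].apply(lambda names: company in names)
    return df.loc[mask, 'revenue'].agg(how)

universal = company_revenue(toy, 'Universal Pictures')
paramount = company_revenue(toy, 'Paramount Pictures')
universal_total = company_revenue(toy, 'Universal Pictures', how='sum')
```

A co-produced movie counts for every company it lists, so per-company totals computed this way can overlap.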

4) Q3: How do movies based on novels compare with original movies?

keylist = ['based on novel','original']
nums = [197,4606]
plt.figure(figsize=(7, 4))
plt.bar(np.arange(len(keylist)), nums, color = 'c' , width = 0.1, align = 'center')
plt.ylabel('Amount',fontsize = 12)
plt.xticks(np.arange(len(keylist)), keylist,fontsize = 12)
plt.title('Original VS Based on novel',fontsize = 14)
plt.grid(True)
plt.show()

Figure_3.png

*As of 2017, most movies are original rather than based on a novel.
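
The counts 197 and 4606 are also hard-coded. They add up to 4803, which suggests that "original" here means every movie without the 'based on novel' keyword, although the post does not say so. Under that assumption, and assuming the keywords column holds a list of keyword strings per movie, the split could be computed as in this sketch on made-up data.

```python
import pandas as pd

# Toy data (hypothetical): a list of keywords per movie
toy = pd.DataFrame({'keywords': [['based on novel', 'love'], ['space'], []]})

# True for movies tagged with the 'based on novel' keyword
is_novel = toy['keywords'].apply(lambda kws: 'based on novel' in kws)

n_novel = int(is_novel.sum())
n_other = int((~is_novel).sum())  # treated as "original" under the assumption above
```

Movies with no keywords at all fall into the "original" group under this definition, which may overstate that count.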

Origin www.cnblogs.com/zfkepic/p/12208083.html