In this report, I use Python to analyse trends in the movie market.
Packages: pandas, NumPy, Matplotlib, seaborn, json
IDE: PyCharm
Major questions:
- How do movie genres change over time?
- How do Universal Pictures and Paramount Pictures compare?
- How do movies based on novels compare with original movies?
1. Data import and cleaning
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')

moviesdf = pd.read_csv('movies.csv')
movdf = pd.read_csv('credits.csv')
1) Fill missing values
# Inspect the rows with a missing release date, then fill them with a placeholder date
null = moviesdf["release_date"].isnull()
moviesdf.loc[null, :]
moviesdf['release_date'] = moviesdf['release_date'].fillna('2017-11-01')
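Before filling, it can help to count how many rows are affected, since every filled row will later be counted under 2017 in any per-year statistic. A minimal sketch on a made-up toy frame (the `toy` data and the `n_missing` name are illustrative, not part of the original dataset):

```python
import pandas as pd

# Toy data standing in for movies.csv (hypothetical values)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": ["2010-05-01", None, "1999-12-31"],
})

# Count missing release dates before filling them
n_missing = toy["release_date"].isnull().sum()

# Fill with the same placeholder date used in the report
toy["release_date"] = toy["release_date"].fillna("2017-11-01")
```

If `n_missing` is small relative to the dataset, the placeholder should have little effect on the yearly trends.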
2) Convert data types
Date
# Assign the column directly so that it takes the datetime64 dtype
moviesdf['release_date'] = pd.to_datetime(moviesdf['release_date'], format='%Y-%m-%d', errors='coerce')
JSON into strings
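With `errors='coerce'`, any value that does not match the format becomes `NaT` instead of raising an error, so it is worth checking how many values were coerced. A small sketch on made-up strings (the `dates` series is illustrative only):

```python
import pandas as pd

# Hypothetical date strings, one of them malformed
dates = pd.Series(["2010-05-01", "not a date", "1999-12-31"])

# Malformed entries are turned into NaT rather than raising
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Number of entries that could not be parsed
n_bad = parsed.isna().sum()
```

A non-zero `n_bad` on the real data would mean some release dates were silently dropped by the conversion.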
# For each JSON column, keep only the 'name' fields and store them as the string form of a list
for col in ['genres', 'keywords', 'production_companies', 'production_countries']:
    moviesdf[col] = moviesdf[col].apply(json.loads)
    for index, i in zip(moviesdf.index, moviesdf[col]):
        l = []
        for j in range(len(i)):
            l.append(i[j]['name'])
        moviesdf.loc[index, col] = str(l)
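The same extraction can be sketched as a small helper applied to a column. This is an alternative sketch, not the code used in this report: `names_from_json` is a hypothetical helper, the toy JSON strings are made up, and the result is kept as real Python lists rather than as strings.

```python
import json
import pandas as pd

def names_from_json(raw):
    """Return the 'name' field of every object in a JSON-encoded list."""
    return [item["name"] for item in json.loads(raw)]

# Toy column in the same shape as the raw 'genres' column (hypothetical values)
toy = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[]',
    ]
})

# Each cell becomes a list of names
toy["genres"] = toy["genres"].apply(names_from_json)
```

Keeping lists avoids having to strip brackets and quotes again later, at the cost of cells that no longer print as plain strings.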
2. Data processing and visualising
Summarise genres in a list
# Turn the stringified lists back into Python lists of genre names
moviesdf['genres'] = moviesdf['genres'].str.strip('[]').str.replace(' ', '').str.replace("'", '')
moviesdf['genres'] = moviesdf['genres'].str.split(',')

# Flatten all genre lists and count how often each genre occurs
list1 = []
for i in moviesdf['genres']:
    list1.extend(i)
genres = pd.Series(list1).value_counts().sort_values(ascending=False)
genres[:10]

# Keep the ten most frequent genres in a DataFrame with a single 'total' column
genres = genres[:10].to_frame(name='total')
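If the genres are already stored as lists, the flatten-and-count step can be sketched with `explode` and `value_counts`. The toy frame below is made up for illustration, and this is one possible approach rather than the one used above:

```python
import pandas as pd

# Toy data: one list of genre names per movie (hypothetical values)
toy = pd.DataFrame({"genres": [["Action", "Drama"], ["Drama"], ["Comedy", "Drama"]]})

# One row per (movie, genre) pair, then count each genre
counts = toy["genres"].explode().value_counts()

# Top ten genres as a DataFrame with a 'total' column
top = counts.head(10).rename("total").to_frame()
```

Unlike the string-based approach, this keeps multi-word names such as "Science Fiction" intact, because no spaces are removed.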
1) Bar plot: movie genres and their counts
f, ax = plt.subplots(figsize=(12, 10))
g = sns.barplot(y=genres.index, x="total", data=genres, palette="Blues_d", ax=ax)
plt.show()
2) Q1: How do movie genres change over time?
# Extract the release year of each movie
years = []
for x in moviesdf["release_date"]:
    year = x.year
    years.append(year)
Years = pd.Series(years)
moviesdf['year'] = Years
moviesdf['year'].head()

min_year = moviesdf['year'].min()
max_year = moviesdf['year'].max()

# Collect the set of all genre names
liste_genres = set()
for s in moviesdf['genres']:
    liste_genres = set().union(s, liste_genres)
liste_genres = list(liste_genres)
liste_genres

# Genre-by-year table of movie counts, initialised to zero
genre_df = pd.DataFrame(index=liste_genres, columns=range(min_year, max_year + 1))
genre_df = genre_df.fillna(value=0)

# Add one to the (genre, year) cell for every genre of every movie
year = np.array(moviesdf['year'])
z = 0
for i in moviesdf['genres']:
    split_genre = list(i)
    for j in split_genre:
        genre_df.loc[j, year[z]] = genre_df.loc[j, year[z]] + 1
    z += 1
genre_df

# One line per genre
plt.figure(figsize=(15, 8))
plt.plot(genre_df.T)
plt.title('Number of movies per genre by year')
plt.xticks(range(1910, 2020, 5))
plt.legend(genre_df.index)
plt.show()
*The number of movies in each genre increases over time, with a boom from 1975 to 1995.
*After 1995, dramas, comedies and thrillers increased dramatically.
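The genre-by-year count table behind this plot can also be sketched with `explode` and `pd.crosstab`. The toy frame is made up, and this is an alternative sketch under the assumption that genres are stored as lists:

```python
import pandas as pd

# Toy data: release year and list of genres per movie (hypothetical values)
toy = pd.DataFrame({
    "year": [1990, 1990, 1995],
    "genres": [["Action", "Drama"], ["Drama"], ["Comedy"]],
})

# One row per (movie, genre) pair
long = toy.explode("genres")

# Rows are genres, columns are years, cells are movie counts
genre_by_year = pd.crosstab(long["genres"], long["year"])
```

Note that `crosstab` only creates columns for years that occur in the data, so years with no movies would have to be added with `reindex` if a continuous axis is needed.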
3) Q2: How do Universal Pictures and Paramount Pictures compare?
plt.figure(figsize=(7, 4))
two = ['Universal Pictures', 'Paramount Pictures']
num = [77015832, 70100000]
plt.bar(np.arange(len(two)), num, color='c', width=0.1, align='center')
plt.ylabel('revenue')
plt.xticks(np.arange(len(two)), two)
plt.title('Universal Pictures VS Paramount Pictures')
plt.grid(True)
plt.show()
*Up to 2017, Universal Pictures had slightly higher revenue than Paramount Pictures.
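The two revenue figures in the plot are entered by hand, and the report does not show how they were derived. One way such per-studio figures could be computed is sketched below. It assumes a numeric `revenue` column and company names stored as lists, and it uses the mean per movie, which is a guess rather than the confirmed statistic. The toy values are made up.

```python
import pandas as pd

# Toy data: production companies and revenue per movie (hypothetical values)
toy = pd.DataFrame({
    "production_companies": [
        ["Universal Pictures"],
        ["Paramount Pictures", "Universal Pictures"],
        ["Paramount Pictures"],
    ],
    "revenue": [100, 50, 30],
})

studios = ["Universal Pictures", "Paramount Pictures"]

# One row per (movie, company) pair; co-productions count for both studios
long = toy.explode("production_companies")

# Mean revenue per movie for the two studios of interest
mean_revenue = (
    long[long["production_companies"].isin(studios)]
    .groupby("production_companies")["revenue"]
    .mean()
)
```

Replacing `.mean()` with `.sum()` or `.median()` would give other defensible comparisons, so the statistic should be stated alongside the chart.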
4) Q3: How do movies based on novels compare with original movies?
keylist = ['based on novel', 'original']
nums = [197, 4606]
plt.figure(figsize=(7, 4))
plt.bar(np.arange(len(keylist)), nums, color='c', width=0.1, align='center')
plt.ylabel('Amount', fontsize=12)
plt.xticks(np.arange(len(keylist)), keylist, fontsize=12)
plt.title('Original VS Based on novel', fontsize=14)
plt.grid(True)
plt.show()
*Up to 2017, most movies were original rather than based on a novel.
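The counts 197 and 4606 are entered by hand in the plot. A count of this kind could be obtained from the keywords column as sketched below. The sketch assumes keywords are stored as lists and that "original" means any movie without the `based on novel` keyword, which is an assumption and not confirmed by the report. The toy data is made up.

```python
import pandas as pd

# Toy data: list of keywords per movie (hypothetical values)
toy = pd.DataFrame({
    "keywords": [["based on novel", "war"], ["robot"], []],
})

# True where the movie carries the 'based on novel' keyword
is_novel = toy["keywords"].apply(lambda kws: "based on novel" in kws)

n_novel = int(is_novel.sum())      # movies tagged as based on a novel
n_other = int((~is_novel).sum())   # all remaining movies
```

Under this definition, movies adapted from comics, plays or true stories would also be counted as "original", so the label deserves some caution.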