Python Data Analysis Case 12: Analysis and Visualization of Netflix Movie and TV Show Data

Background

Netflix is one of the most popular media and video streaming platforms. Its catalog holds more than 8,000 movies and TV shows, and as of mid-2021 it had over 200 million subscribers worldwide.

I (the blogger) also watch a lot of American TV series. Highly rated shows such as "Stranger Things" and "Sex Education" are all on Netflix.

For Netflix's catalog, we can analyze the split between movies and TV shows, release years, countries, genres, ratings, keywords in the descriptions, and so on, using descriptive statistics and visualization. From this we can see which genres are most common, which countries contribute the most titles, and so on.

Note: this article does not involve advanced or complex mathematical models; its core is descriptive analysis and visualization of the data.


 

About the dataset

This tabular dataset comes from Kaggle: "Netflix Movies and TV Shows".

It contains a list of all movies and TV shows available on Netflix, with details such as cast, directors, ratings, release years, duration, and more.

Readers who find it inconvenient to register a Kaggle account can leave an email address in the comments to request the dataset from the blogger.


Data reading and cleaning

Import commonly used packages for data analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

plt.rcParams['font.sans-serif'] = 'SimHei'    # font that can display Chinese characters
plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly

Next, read the dataset into a pandas DataFrame, drop every column whose values are all blank, set the first column (the show ID) as the index, and view the first five rows.

# Note: 'ANSI' is a Windows-only codec alias; on other systems an explicit codec such as 'utf-8' may be needed.
df=pd.read_csv('netflix_titles.csv',encoding='ANSI').dropna(how='all',axis=1).set_index('show_id')
df.head()

As the preview shows, the data is mainly text.
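If `encoding='ANSI'` raises a `LookupError` (the alias only exists on Windows), one option is to try a few codecs in turn. The sketch below is only an illustration under assumptions: the helper name `read_csv_with_fallback` and the codec list are hypothetical, and the tiny demo file stands in for the real dataset.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return the first DataFrame that parses cleanly."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except (UnicodeDecodeError, LookupError) as err:
            last_error = err  # remember the failure and try the next codec
    raise last_error


# Demo on a tiny stand-in file (not the real Netflix data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="cp1252") as fh:
        fh.write("show_id,title\ns1,Café\n")
    demo = read_csv_with_fallback(demo_path).set_index("show_id")

print(demo)
```

Here the file is not valid UTF-8, so the first attempt fails and the second codec is used. With the real file, the right codec depends on how the CSV was saved.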


Introduction to the data variables

Variable descriptions

'type': whether the title is a movie or a TV show; categorical variable

'title': the name of the title; text variable

'director': the director's name; text variable

'cast': the names of all cast members; text variable

'country': the country of distribution; categorical variable

'date_added': the date the title was added to Netflix; datetime variable

'release_year': the year the title was originally released; time variable

'rating': the TV/movie content rating; categorical variable

'duration': the total duration; categorical variable

'listed_in': the genres the title is listed in; multi-valued categorical variable

'description': a synopsis of the title; text variable


View all variable information for the data

df=df.infer_objects()
print(df.shape)
df.info()

From the output above, the data has 8,798 rows and 11 variables. Some variables have missing values, which are handled below.


Data cleaning

Visualize missing values

# inspect missing values
import missingno as msno
msno.matrix(df)

 

The director column has many missing values, and the cast and country columns have some as well. Because the directors and cast of each title are unique text data, the mean or mode cannot be used to fill them, so we replace the null values with 'No Data'.

Missing countries are filled with the country that appears most often in the existing data, and rows with missing values in the other columns are simply dropped.
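As a numeric complement to the missing-value matrix, the share of missing values per column can also be computed directly. This is a minimal sketch on a made-up toy frame (the values are not from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Netflix table; the values are invented for illustration.
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "cast": ["x", "y", np.nan, "z"],
    "country": ["United States", "India", "United States", np.nan],
})

# isna() marks missing cells; the column mean is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `df` would give the exact missing share of each real column.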

Fill in missing values

df['country'] = df['country'].fillna(df['country'].mode()[0])
# Assign the result back rather than calling fillna(inplace=True) on a column selection
df['cast'] = df['cast'].fillna('No Data')
df['director'] = df['director'].fillna('No Data')
df.dropna(inplace=True)
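To see what each of these steps does, here is a small sketch on an invented three-row frame (the column names follow the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "director": ["A", np.nan, "B"],
    "country": ["United States", np.nan, "United States"],
    "rating": ["TV-MA", "TV-14", np.nan],
})

filled = toy.copy()
# Categorical column: fill with the most frequent value (the mode).
filled["country"] = filled["country"].fillna(filled["country"].mode()[0])
# Free-text column: fill with a placeholder string.
filled["director"] = filled["director"].fillna("No Data")
# Anything still missing (here the rating in the last row) drops the row.
filled = filled.dropna()
print(filled)
```

Note that filling the country with the mode inflates the count of the most frequent country, which is worth keeping in mind when reading the country charts later.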

Remove duplicates

df.drop_duplicates(inplace=True)

Convert the time variable to datetime format

For later analysis, the year and month in which each title was added to Netflix are extracted as categorical variables.

# Strip stray whitespace first so that every date string parses the same way
df["date_added"] = pd.to_datetime(df['date_added'].str.strip())
df['year_added'] = df['date_added'].dt.year
df['month_name_added']=df['date_added'].dt.month_name()
df['release_year']=df['release_year'].astype('int')
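The date handling can be checked on a few sample strings. This sketch assumes the dates look like "September 25, 2021" (the explicit `format` string is my assumption about the file, not something stated in the post) and shows why stripping whitespace helps:

```python
import pandas as pd

toy = pd.DataFrame({
    # The second value has a leading space, which a strict format would reject.
    "date_added": ["September 25, 2021", " August 4, 2017", "January 1, 2020"],
})

toy["date_added"] = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["year_added"] = toy["date_added"].dt.year
toy["month_name_added"] = toy["date_added"].dt.month_name()
print(toy)
```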

View data information again

df.info()

In the end, 8,774 samples remain, no variable has missing values, and all variable types are correct, so the analysis and visualization can proceed.


Analysis and visualization

Share of movies and TV shows on Netflix

plt.figure(figsize=(2,2),dpi=180)
p1=df.type.value_counts()
plt.pie(p1,labels=p1.index,autopct="%1.3f%%",shadow=True,explode=(0.2,0),colors=['royalblue','pink']) # shadow on; explode is each wedge's offset from the center
plt.title("Share of movies and TV shows on Netflix")
plt.show()

Movies make up the larger share of Netflix titles, nearly 70%, while TV shows account for about 30%.
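The percentages on the pie chart can also be read off as numbers with `value_counts(normalize=True)`. A minimal sketch on ten invented rows:

```python
import pandas as pd

# Ten made-up titles: 7 movies and 3 TV shows.
types = pd.Series(["Movie"] * 7 + ["TV Show"] * 3, name="type")

# normalize=True returns proportions instead of counts.
share = types.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Applied to `df.type`, the same call gives the exact split behind the chart.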

Analysis of Netflix titles by country

import squarify
p2=df.country.value_counts()[:15]
fig = plt.figure(figsize = (8,4),dpi=256)
ax = fig.add_subplot(111)
plot = squarify.plot(sizes = p2, # area of each rectangle
                     label = p2.index, # labels
                     #color = colors, # custom colors
                     alpha = 0.8, # transparency
                     value = p2, # value labels
                     edgecolor = 'white', # border color
                     linewidth =0.1 # border width
                    )
# set the title and its font size
ax.set_title('Top 15 countries by number of Netflix titles',fontsize = 22)
# hide the axes
ax.axis('off')
# turn off the top and right ticks
ax.tick_params(top = False, right = False)
# show the figure
plt.show()

 

Because Netflix is an American company, the United States has by far the most titles, accounting for almost half of the total. It is followed by India, the United Kingdom, Japan, South Korea, and Canada, which also have relatively many titles.

(Only the top 15 countries are shown, because the chart becomes cluttered with too many countries.)

Movies versus TV shows in the top 10 countries by number of Netflix titles

# True for rows whose country is one of the 10 countries with the most titles
df_bool=df.country.astype('str').isin(p2.index[:10])
p3=pd.crosstab(df[df_bool].type,df[df_bool].country,normalize='columns').T.sort_values(by='TV Show')
m =np.arange(len(p3))
plt.figure(figsize = (8,4),dpi=256)
plt.bar(x=m, height=p3.iloc[:,0], label=p3.columns[0], width=0.3,alpha=0.5, hatch='.',color='orange')
plt.bar(x=m , height=p3.iloc[:,1], label=p3.columns[1], bottom=p3.iloc[:,0],width=0.3,alpha=0.5,hatch='*',color='lime')
plt.xticks(range(len(p3)),p3.index,fontsize=10,rotation=30)
plt.legend()
plt.ylabel('Proportion')
plt.title("Movies vs. TV shows in the top 10 countries by number of Netflix titles")
plt.show()

 

Among the ten countries with the most Netflix titles, India has a very high proportion of movies, followed by Egypt and the United States.

TV shows account for a relatively high proportion in South Korea, Japan, and the United Kingdom.

This suggests that Netflix titles from India, Egypt, and the United States lean towards movies, while those from South Korea, Japan, and the United Kingdom lean towards TV shows.

(Only the top 10 countries are shown, because with more countries the chart gets cluttered and the country names overlap.)
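The stacked bars are built from a column-normalized cross table, so each country's movie and TV shares sum to 1. A small sketch with two invented countries shows the shape of that table:

```python
import pandas as pd

# Made-up rows: India is mostly movies, Japan mostly TV shows.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie",
             "TV Show", "TV Show", "TV Show", "Movie"],
    "country": ["India"] * 4 + ["Japan"] * 4,
})

# normalize='columns' turns each country's counts into proportions;
# .T puts countries on the rows, and sorting orders them by TV share.
shares = pd.crosstab(toy["type"], toy["country"], normalize="columns").T
shares = shares.sort_values(by="TV Show")
print(shares)
```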

Rating analysis of movies and TV shows

p4=df.rating.value_counts()
plt.figure(figsize = (6,3),dpi=256)
sns.barplot(x=p4.index,y=p4)
plt.ylabel('Count')
plt.xlabel('Rating')
plt.xticks(fontsize=10,rotation=45)
plt.title("Number of Netflix titles by rating")
plt.show()

Most titles are rated TV-MA or TV-14, that is, suitable for adults or for viewers aged 14 and over.

df_bar=pd.crosstab(df.type,df.rating).T.sort_values(by='Movie',ascending=False).unstack().reset_index().rename(columns={0:'number'})
plt.subplots(figsize = (10,4),dpi=128)
sns.barplot(x=df_bar.rating,y=df_bar.number,hue=df_bar.type,palette = "copper")

TV-MA, TV-14, and TV-PG cover both movies and TV shows, whereas R and PG titles are all movies.

Rating analysis by country

df_heatmap=df[df_bool].groupby('country')['rating'].value_counts().unstack().sort_index().fillna(0).astype(int).T
# convert each country's counts to proportions so that every column sums to 1
df_heatmap=df_heatmap/df_heatmap.sum()
plt.subplots(figsize = (8,6),dpi=256)
sns.heatmap(df_heatmap,annot=True,square=True,annot_kws={'size':6,'weight':'bold', 'color':'royalblue'},fmt='.2f',cmap='cubehelix_r')
plt.title('Netflix title ratings by country')
plt.show()

 

The figure above shows intuitively that the vast majority of Netflix titles are rated TV-MA or TV-14, consistent with the earlier conclusion.

By country, titles from Canada, France, Mexico, Spain, the United Kingdom, and the United States are more often rated for adult audiences (TV-MA).

Titles from Egypt, India, Japan, and South Korea are more often rated suitable for ages 14 and over (TV-14).

This matches the conventional view that titles from Western countries in Europe and North America are more permissive, while those from Asian countries such as India, Japan, and South Korea are more conservative.
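The per-country rating shares behind the heatmap can also be obtained in one step with a grouped `value_counts(normalize=True)`. This is an alternative sketch on invented rows, not the code used in the post:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["US", "US", "US", "US", "India", "India"],
    "rating": ["TV-MA", "TV-MA", "TV-MA", "TV-14", "TV-14", "TV-14"],
})

# Proportion of each rating within each country; ratings that never occur
# in a country are filled with 0. Rows are ratings, columns are countries.
rating_share = (
    toy.groupby("country")["rating"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    .T
)
print(rating_share)
```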

Analysis of the year titles were added to Netflix

plt.figure(figsize=(8,3.5),dpi=128)
colors=['tomato','orange','royalblue','lime','pink']
for i, mtv in enumerate(df['type'].value_counts().index):
    mtv_rel = df[df['type']==mtv]['year_added'].value_counts().sort_index()
    plt.plot(mtv_rel.index, mtv_rel, color=colors[i], label=mtv)
    plt.fill_between(mtv_rel.index, 0, mtv_rel, color=colors[i], alpha=0.8)
plt.legend()  # draw the legend once, after both series are plotted
plt.ylabel('Number of titles added')
plt.xlabel('Year')
plt.title('Number of titles added to Netflix per year')
plt.show()

Since 2014 the number of titles added to Netflix has grown explosively, peaking in 2019.

After 2019, likely because of the COVID-19 pandemic, the number of titles added shows a slow downward trend.
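To confirm the peak year numerically rather than by eye, the yearly counts can be tabulated by type. A minimal sketch on a handful of invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "year_added": [2018, 2019, 2019, 2019, 2020, 2020],
})

# One row per year, one column per type; missing combinations become 0.
per_year = toy.groupby(["year_added", "type"]).size().unstack(fill_value=0)
# Year with the largest total number of added titles.
peak_year = per_year.sum(axis=1).idxmax()
print(per_year)
print("peak year:", peak_year)
```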

Analysis of the month titles were added

plt.figure(figsize=(5,5),dpi=128)
colors=['tomato','orange','royalblue','lime','pink','brown']

p5=df.month_name_added.value_counts()
plt.pie(p5,labels=p5.index,autopct="%1.3f%%",shadow=True,explode=(0.2,0.1,0.08,0.06,0.04,0.02,0,0,0,0,0,0),colors=colors) # shadow on; explode is each wedge's offset from the center
plt.title('Netflix titles by month added')
plt.show()

 

Titles are added fairly evenly across the months. July and December see somewhat more additions, which coincides with the summer and winter holidays in the West.

February and March are the months with the fewest additions.
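`value_counts()` sorts months by frequency, which makes seasonal patterns harder to read. One option, sketched here on invented data, is to reindex the counts into calendar order:

```python
import pandas as pd

months = pd.Series(["July", "December", "July", "February", "December", "July"])

# English month names in calendar order, generated by pandas itself.
month_order = pd.date_range("2021-01-01", periods=12, freq="MS").month_name()

# Reorder the counts January..December; months with no titles become 0.
counts = months.value_counts().reindex(month_order, fill_value=0)
print(counts)
```

A bar chart of `counts` would then show the months left to right in calendar order.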

Age of movies and TV shows when added to Netflix

df_age=df.assign(age=df.year_added-df.release_year)[['type','age']]
plt.figure(figsize=(3,4),dpi=128)
sns.boxplot(x='type',y='age',width=0.8,data=df_age,orient="v") 
plt.show()

 

For most movies and TV shows, the original release year and the year added to Netflix are close. The median gap is about 2 to 3 years and is slightly larger for movies, which may reflect that good movies stay in circulation longer than TV shows.

Both movies and TV shows have many outliers on the high side, probably because Netflix has added many older classics to its catalog.
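The medians read off the box plot can be computed directly. This sketch uses six invented titles; "age" is the year added minus the release year, as in the post:

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "year_added": [2019, 2020, 2021, 2019, 2020, 2021],
    "release_year": [2010, 2018, 2020, 2018, 2020, 2020],
})

# Age of each title at the time it was added to the platform.
age = toy.assign(age=toy["year_added"] - toy["release_year"])
# Median and maximum age per type.
summary = age.groupby("type")["age"].agg(["median", "max"])
print(summary)
```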

Genre analysis

# Note: splitting on ',' leaves a leading space on every genre after the first (e.g. ' Dramas')
p6=df.assign(kind=df.listed_in.str.split(',')).explode('kind')['kind'].value_counts()[:15]
plt.figure(figsize=(10,4),dpi=128)
sns.barplot(y=p6.index,x=p6,orient="h")
plt.xlabel('Number of titles')
plt.ylabel('Genre')
plt.xticks(fontsize=10,rotation=45)
plt.title("Number of Netflix titles by genre")
plt.show()

The most common genre on Netflix is international movies, followed by dramas, comedies, action and adventure, and documentaries.
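One caveat with `str.split(',')`: genres after the first keep a leading space, so the same genre can be counted under two labels. The sketch below, on three invented rows, compares the naive split with a whitespace-stripped one:

```python
import pandas as pd

toy = pd.DataFrame({
    "listed_in": ["Dramas, International Movies", "Comedies, Dramas", "Dramas"],
})

# Naive split: 'Dramas' and ' Dramas' end up as different labels.
naive = toy["listed_in"].str.split(",").explode().value_counts()

# Stripping whitespace after exploding merges them into one label.
clean = toy["listed_in"].str.split(",").explode().str.strip().value_counts()

print(naive)
print(clean)
```

On the real data the stripped version may change the ranking of the genres, so it is worth comparing both before drawing conclusions.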

Titles from the United States only

p7=df.assign(kind=df.listed_in.str.split(',')).explode('kind').where(lambda d:d.country=='United States').dropna()['kind'].value_counts()[:12]
plt.figure(figsize=(5,5),dpi=128)
plt.pie(p7,labels=p7.index,autopct="%1.2f%%",shadow=True,explode=(0.15,0.1,0.08,0.06,0.04,0.02,0,0,0,0,0,0),colors=['c', 'b', 'g', 'tomato', 'm', 'y', 'lime', 'w','orange','pink','grey','tan'])
plt.title('Genres of Netflix titles produced in the United States')
plt.show()

The pie chart shows that among Netflix titles from the United States, documentaries are the most common, followed by dramas, comedies, family movies, and independent movies.

Analysis of directors and actors of Netflix titles

# [1:11] skips the first entry, presumably the 'No Data' placeholder, and keeps the next ten
p8=df.assign(directo=df.director.str.split(',')).explode('directo')['directo'].value_counts()[1:11]
p9=df.assign(cas=df.cast.str.split(',')).explode('cas')['cas'].value_counts()[1:11]

plt.subplots(1,2,figsize=(12,5),dpi=128)
plt.subplot(121)
sns.barplot(y=p8.index,x=p8,orient="h")
plt.ylabel('Director')
plt.xlabel('Number of titles directed',fontsize=14)
plt.title("(a) Top 10 directors by number of Netflix titles")

plt.subplot(122)
sns.barplot(y=p9.index,x=p9,orient="h")
plt.ylabel('Actor')
plt.xlabel('Number of titles appeared in',fontsize=14)
plt.title("(b) Top 10 actors by number of Netflix titles")
#plt.legend()
plt.tight_layout()
plt.show()

The figure above shows the ten directors with the most Netflix titles and the ten actors who appear in the most titles. (I only recognize the names, not the people...) (Only the top 10 are shown, because the chart looks cluttered with too many names.)
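Slicing with `[1:11]` relies on the 'No Data' placeholder being the most frequent value. A slightly more explicit variant, sketched here with invented names, filters the placeholder out before counting:

```python
import pandas as pd

toy = pd.DataFrame({
    "cast": ["No Data", "Ann Lee, Bo Kim", "Bo Kim",
             "No Data", "No Data", "Ann Lee, Bo Kim, Cy Roe"],
})

# One row per name, with surrounding whitespace removed.
names = toy["cast"].str.split(",").explode().str.strip()
# Drop the placeholder explicitly, then keep the two most frequent names.
top = names[names != "No Data"].value_counts().head(2)
print(top)
```

This gives the same kind of ranking even if the placeholder is not the top entry.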

Word cloud of Netflix movie titles

The background uses the Netflix logo

 

from wordcloud import WordCloud
from PIL import Image
import matplotlib
# Mask image that gives the word cloud its shape ('wf.png' is the author's local Netflix logo file)
mask = np.array(Image.open('wf.png'))

# Custom colour map based on the Netflix palette
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#221f1f', '#b20710'])
text = str(list(df['title'])).replace(',', '').replace('[', '').replace("'", '').replace(']', '').replace('.', '')
wordcloud = WordCloud(background_color = 'white', width = 500,  height = 200,colormap=cmap, max_words = 150, mask = mask).generate(text)
plt.figure( figsize=(9,5),dpi=1028)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

 

The most frequent words in the titles of Netflix movies and TV shows include 'LOVE', 'World', 'Day', 'Life', and 'Girl'.

Word cloud of Netflix title descriptions

text2=str(list(df['description'])).replace(',', '').replace('[', '').replace("'", '').replace(']', '').replace('.', '')
wordcloud = WordCloud(background_color = 'white', width = 500,  height = 200,colormap='coolwarm', max_words =30).generate(text2)
plt.figure( figsize=(8,4),dpi=512)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

The most frequent words in the descriptions of Netflix movies and TV shows include 'life', 'family', 'love', 'find', and 'new'.
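A word cloud shows relative frequency only visually. To get actual counts, the words can be tallied with `collections.Counter`. This is a rough sketch: the titles and the stop-word list are invented, and the tokenization is deliberately simple.

```python
import re
from collections import Counter

# Made-up titles standing in for df['title'].
titles = ["Love Is Blind", "My Love", "A Day in the Life", "Life", "World of Love"]
# Tiny hand-picked stop-word list (an assumption, not the wordcloud default).
stopwords = {"a", "in", "is", "my", "of", "the"}

# Lowercase each title, pull out the words, and drop stop words.
words = [
    word
    for title in titles
    for word in re.findall(r"[a-z']+", title.lower())
    if word not in stopwords
]
top_words = Counter(words).most_common(2)
print(top_words)
```

The same loop over `df['description']` would give counts for the description text.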


Summary

By analyzing data on more than 8,000 Netflix movies and TV shows, we can draw the following conclusions:

1. Movies make up the larger share of Netflix titles, nearly 70%, while TV shows account for about 30%.

2. Because Netflix is an American company, the United States has the most titles, almost half of the total, followed by India, the United Kingdom, Japan, South Korea, and Canada, which also have relatively many titles.

3. Netflix titles from India, Egypt, and the United States lean towards movies, while those from South Korea, Japan, and the United Kingdom lean towards TV shows.

4. Most Netflix titles are rated TV-MA or TV-14, that is, suitable for adults or for viewers aged 14 and over.

5. The country of a title is related to its rating: titles from Western countries in Europe and North America are more permissive, while those from Asian countries such as India, Japan, and South Korea are more conservative.

6. Starting in 2014 the number of titles added to Netflix grew explosively, peaking in 2019. After 2019, likely because of the COVID-19 pandemic, the number of titles added shows a slow downward trend.

7. Titles are added fairly evenly across the months. July and December see somewhat more additions, which coincides with the summer and winter holidays in the West. February and March are the months with the fewest additions.

8. For most titles, the original release year and the year added to Netflix are close, with a slightly larger gap for movies, which may reflect that good movies stay in circulation longer than TV shows. Both movies and TV shows have many outliers on the high side, probably because Netflix has added many older classics to its catalog.

9. The most common genre on Netflix is international movies, followed by dramas, comedies, action and adventure, and documentaries.

10. Among Netflix titles from the United States, documentaries are the most common, followed by dramas, comedies, family movies, and independent movies.

11. We identified the ten directors with the most Netflix titles and the ten actors who appear in the most titles.

12. The most frequent words in the titles include 'LOVE', 'World', 'Day', 'Life', and 'Girl'.

13. The most frequent words in the descriptions include 'life', 'family', 'love', 'find', and 'new'.


Since this article does not use complicated mathematical models, the conclusions are not very advanced, but they are still useful and meaningful. Excel cannot produce charts like these... The main thing to take away is the plotting techniques. After all, attractive figures and sound conclusions are the point of visualization.

Origin blog.csdn.net/weixin_46277779/article/details/128125922