[Data analysis] A deep dive into bad domestic films with Python (2): which movie genres have the most bad films?

Requirements:

① Classify the movies by the "type" field ('类型') and work out which genres each movie belongs to

② Organize the data by genre, summarize the proportion of bad films in each genre, and select the TOP 20

③ Plot the bad-film proportion of the TOP 20 genres as a scatter plot: the x-axis is the genre, the y-axis is the proportion of bad films, and the point size reflects the number of samples

** Draw the chart with bokeh

** Sort the genres in descending order of bad-film proportion

Hints:

① Delete the rows whose "type" field is null

② A movie can have several types, so each of its types has to be identified, and the movie has to be counted once for every genre it belongs to. For example:

If a movie's type is "Comedy / Romance", the movie must be counted when calculating the bad-film proportion of both the "Comedy" and the "Romance" genres

③ Remember to strip the whitespace from the type field

④ When setting the point size in the bokeh chart, take the square root to narrow the gaps in the data → size = count ** 0.5 * coefficient
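Hint ④ is easier to see with a few numbers. The sketch below uses made-up counts and an assumed coefficient of 2 to show how the square root narrows the gap between small and large genres, so the largest points do not swamp the chart.

```python
# Made-up sample counts for three hypothetical genres
counts = [4, 100, 2500]
coefficient = 2  # assumed scaling factor, chosen for illustration

# size = count ** 0.5 * coefficient
sizes = [count ** 0.5 * coefficient for count in counts]

print(sizes)                      # [4.0, 20.0, 100.0]
print(max(counts) / min(counts))  # 625.0: spread of the raw counts
print(max(sizes) / min(sizes))    # 25.0: spread after the square root
```

The raw counts differ by a factor of 625, while the point sizes differ only by a factor of 25.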

1 Preparation

import os
os.chdir(r'C:\Users\86177\Desktop')
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

from bokeh.models import ColumnDataSource,HoverTool
from bokeh.plotting import figure,show,output_file

Import the relevant libraries, set the working directory, suppress warning messages, and make plots display directly in the notebook (%matplotlib inline).

2 Filtering the data

The goal is to find out which genres the bad films belong to. Two fields matter here. The first is the Douban score ('豆瓣评分'), which decides whether a film is bad; that step was handled in the previous article, and the data comes from data. The second is the movie type ('类型'), which distinguishes the genres. The rows with a missing 'type' value therefore have to be removed from data to obtain the target data.

2.1 Viewing the movie 'type' data

data_load = pd.read_excel('moviedata.xlsx')
data = data_load[data_load['豆瓣评分'] >0]
data['类型']

-> output is: [screenshot not included]
As the hints suggested, the 'type' field contains spaces, ' / ' separators, and missing values.

2.2 Handling missing values and separators

Before running a for loop over a large amount of data, try the operation on a single record first.

data[data['类型'].isnull() == False]['类型'][0].strip().split(' / ')

-> The output is: [ 'story', 'children']

typelst = [x for x in data[data['类型'].isnull() == False]['类型'].str.strip().str.split(' / ')]
typelst

-> The output is: [screenshot not included]
A list comprehension on its own does not produce the data in the form I want, because it yields one list per movie (a list of lists). So I fall back to the plain loop and do it step by step. The code below also shows the difference between .append() and .extend().

typelst = []
for x in data[data['类型'].isnull() == False]['类型'].str.strip().str.split(' / '):
    # typelst.append(x)  # with append, these three statements would be equivalent to the single comprehension above
    typelst.extend(x)  # this is where append and extend differ
typelst = list(set(typelst))
print('There are {} movie types in total:\n\n{}'.format(len(typelst),typelst))

-> output is:

There are 35 movie types in total:

['Music', 'ghost', 'Comedy', 'sport', 'crime', 'action', 'costume', 'read', 'Mystery', 'adventure', 'drama', 'News', 'biography', 'terrorist', 'Erotica', 'stagecraft', 'martial arts', 'science fiction', 'Thriller', 'history', 'dance', 'talk', 'disaster', 'story', 'film noir', 'family', 'children', 'gay', 'love', 'war', 'fantastic', 'animation', 'reality show', 'western', 'documentary']
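The difference between append and extend is easiest to see on a toy list. The genre names below are made up and only stand in for the split values of the '类型' column.

```python
# Toy stand-in for the split type values: one list of genres per movie
rows = [['Drama', 'Children'], ['Comedy', 'Romance'], ['Comedy']]

appended = []
extended = []
for genres in rows:
    appended.append(genres)  # adds the whole list as a single element
    extended.extend(genres)  # adds each genre as its own element

print(appended)  # [['Drama', 'Children'], ['Comedy', 'Romance'], ['Comedy']]
print(extended)  # ['Drama', 'Children', 'Comedy', 'Romance', 'Comedy']

# set() drops the duplicate 'Comedy'; sorted() makes the order reproducible
unique_genres = sorted(set(extended))
print(unique_genres)  # ['Children', 'Comedy', 'Drama', 'Romance']
```

Because a set has no fixed order, list(set(...)) can return the types in a different order from run to run; sorting is one way to get a stable list.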

2.3 (1) Number and proportion of bad films for each genre

Note that heading (1) differs from heading (2). The bad-film data, data_lp, was already produced in the previous article, so that code is not repeated here; we look directly at the contents of data_lp.

data_lp

-> The output is: [screenshot not included]
Section 2.2 collected all the types across the whole movie data set. data_lp, however, was only filtered for bad films in the previous article, and the missing 'type' values among the bad films were not handled. So before working out the types of the bad films, the missing values inside data_lp have to be dealt with first, using code similar to the above.

data_lp['类型'].isna().sum()
# There are 31 missing values in total, so the processed data should have 515 rows

-> output is: 31

data_lp = data_lp[data_lp['类型'].notnull()].reset_index(drop=True)
data_lp

-> output is: [screenshot not included]
With the filtering done, the top-20 bad-film data is produced in three steps, wrapped in three functions: lp_one_info(), lp_info(), and lp_top_20().

def lp_one_info(data,type_i):
    dic = {}
    dic['lp_name'] = type_i
    dic['lp_num'] = len(data[data['类型'].str.contains(type_i)])
    dic['lp_rate'] ='{:.2f}%'.format(len(data[data['类型'].str.contains(type_i)])/len(data)*100)
    return dic

lp_one_info() is the building block for lp_info(). It is first tried on a single value as a trial run; type_i is one element of typelst, for example:

lp_one_info(data_lp,typelst[0])

-> output is: { 'lp_name': 'Reality', 'lp_num': 0, 'lp_rate': '0.00%'}
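lp_one_info() relies on Series.str.contains, which checks each row for the genre name as a substring. The sketch below uses a made-up Series with English genre names, not the real data. It illustrates why the missing values are removed first: for a missing row, str.contains returns NaN rather than True or False unless na=False is passed. Passing regex=False makes the genre name match as literal text.

```python
import numpy as np
import pandas as pd

# Made-up type strings; the real column is data_lp['类型']
types = pd.Series(['Comedy / Romance', 'Drama', np.nan, 'Romance / War'])

# Default behaviour: the missing row yields NaN, not a boolean
raw_mask = types.str.contains('Romance')
print(raw_mask.isna().sum())  # 1

# na=False treats missing rows as "no match"; regex=False matches literally
mask = types.str.contains('Romance', na=False, regex=False)
match_count = int(mask.sum())
share = match_count / len(types)  # denominator includes the missing row

print(match_count)  # 2
print(share)        # 0.5
```

Note that 'Comedy / Romance' and 'Romance / War' both count toward 'Romance', which is the multi-genre counting the hints ask for.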

def lp_info(data,ls):
    lp_ls = []
    for type_i in ls:
        # use the data argument rather than the global data_lp
        lp_ls.append(lp_one_info(data, type_i))
    return lp_ls

lp_info() calls lp_one_info() for each element of typelst in turn and stores the results in a list. The results are as follows:

lp_info(data_lp,typelst)

-> output is: [screenshot not included]

def lp_top_20(lp_ls):
    df_lp= pd.DataFrame(lp_ls)
    df_lp.sort_values(by='lp_num',inplace = True,ascending = False)
    df_lp_20 = df_lp[:20].set_index(np.arange(1,21))
    return df_lp_20

lp_top_20() sorts the genres by the number of bad films and returns the top 20 directly. Running it gives:

lp_top_20(lp_info(data_lp,typelst))

-> output is: [screenshot not included]

2.3 (2) Number and proportion of bad films within each genre

Think carefully about which genres are prone to bad films. That conclusion rests on the share of bad films among all the films of each genre: the larger the share, the more the genre tends toward bad films. The comparison has to be made against the full data rather than only within the sample data_lp. Only part of lp_one_info() needs to change, which shows the convenience of wrapping code in functions. The last of the three functions is split apart this time, because its result is used in later steps and is easier to work with when it is not wrapped in a function.

def lp_one_info(data,type_i):
    dic = {}
    data_i = data[data['类型'].str.contains(type_i)]
    data_lp = data_i[data_i['豆瓣评分'] < 4.3]
    dic['lp_name'] = type_i
    dic['lp_num'] = len(data_lp)
    dic['type_num'] = len(data_i)
    # dic['lp_rate'] = '{:.2f}'.format(len(data_lp)/len(data_i))  # formatting to a string is why bokeh later drew no chart; keep lp_rate numeric
    dic['lp_rate'] =len(data_lp)/len(data_i)
    return dic

def lp_top_20(data,ls):
    lp_ls = []
    for type_i in ls:
        lp_ls.append(lp_one_info(data, type_i))
    return lp_ls

df_lp= pd.DataFrame(lp_top_20(data,typelst))
df_lp.sort_values(by='lp_rate',inplace = True,ascending = False)
df_lp_20 = df_lp[:20].set_index(np.arange(1,21))
   
df_lp_20
# For the data that ends up in the final DataFrame, avoid wrapping it in a function where possible

-> output is: [screenshot not included]
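As an alternative to looping over typelst, the same per-genre figures can be computed by splitting each type string into a list, giving every (movie, genre) pair its own row with DataFrame.explode, and aggregating per genre. This is only a sketch: the toy frame and its English column names 'type' and 'score' are assumptions for illustration (the real columns are '类型' and '豆瓣评分'), and the 4.3 threshold is the one used in lp_one_info() above.

```python
import pandas as pd

# Toy data; column names are illustrative stand-ins for the real ones
movies = pd.DataFrame({
    'type': ['Comedy / Romance', 'Comedy', ' Drama ', 'Romance / Drama'],
    'score': [3.5, 4.0, 7.2, 8.1],
})

# Strip whitespace, split on the separator, one row per (movie, genre) pair
exploded = (
    movies.assign(genre=movies['type'].str.strip().str.split(' / '))
          .explode('genre')
)
exploded['is_bad'] = exploded['score'] < 4.3

# Per genre: number of bad films, number of films, and their ratio
summary = (
    exploded.groupby('genre')['is_bad']
            .agg(lp_num='sum', type_num='count', lp_rate='mean')
            .sort_values('lp_rate', ascending=False)
)
print(summary)
```

On this toy data 'Comedy' has 2 bad films out of 2, 'Romance' 1 out of 2, and 'Drama' 0 out of 2, and lp_rate stays numeric, which matters for the plotting step.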

3 Plotting with Bokeh

df_lp_20["size"] = df_lp_20['type_num']**0.5*2 
# Set the point size

source = ColumnDataSource(df_lp_20)

ls_type = df_lp_20['lp_name'].tolist()
hover = HoverTool(tooltips = [("Sample count","@type_num"),
                              ("Bad-film ratio","@lp_rate")])

output_file("1.html")
# Note: Bokeh 3.x removed plot_width/plot_height; use width/height there
p = figure(x_range = ls_type, plot_width = 900, plot_height = 500, title = "Bad-film ratio by movie genre",
          tools = [hover, 'reset, xwheel_zoom, pan, crosshair, box_select'])

p.circle(x = 'lp_name', y ="lp_rate",source = source, size = "size", line_color = "black",
        line_dash = [6,4], fill_color = "red",fill_alpha = 0.5)

show(p)

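# Why did the chart stay blank with y='lp_rate' while y='lp_num' and y='type_num' both rendered?
# After a long debugging session, the cause turned out to be in lp_one_info(): lp_rate must not be formatted into a string, otherwise it cannot be read as a number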

-> output is: [screenshot not included]
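The blank chart described in the comment above comes down to data types: '{:.2f}'.format(...) returns a str, and a numeric axis needs numbers. Below is a minimal sketch with made-up values. It keeps the numeric column for plotting and adds a separate formatted column for display only; the column name lp_rate_label is hypothetical.

```python
import pandas as pd

# Made-up ratios for three hypothetical genres
df = pd.DataFrame({'lp_name': ['A', 'B', 'C'],
                   'lp_rate': [0.25, 0.125, 0.5]})

# Formatting produces strings, which cannot be placed on a numeric axis
formatted = df['lp_rate'].map('{:.2f}'.format)
print(type(formatted.iloc[0]))       # <class 'str'>

# Keep lp_rate numeric for the y-axis; use a separate column as a label
df['lp_rate_label'] = df['lp_rate'].map('{:.2%}'.format)
print(df['lp_rate'].dtype)           # float64
print(df['lp_rate_label'].tolist())  # ['25.00%', '12.50%', '50.00%']

# If only the formatted strings are available, pd.to_numeric converts them back
restored = pd.to_numeric(formatted)
print(restored.dtype)                # float64
```

Checking df.dtypes before building the ColumnDataSource is a quick way to catch this kind of problem.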

Origin blog.csdn.net/lys_828/article/details/104086110