[Machine learning notes] Analysis of TED speech data with Python (updated)

Analyzing TED speech data with Python

First, prepare the TED speech data set. The data set and its documentation can be obtained from the following resource:

https://www.datafountain.cn/datasets/11

The data set contains 2 files:

  • ted_main.csv
    contains the main information about each speech, including the title, the main speaker, the description, the number of views, the number of comments, the ratings, etc.
  • transcripts.csv
    contains the URL of each speech and its official English transcript.

(1) The file ted_main.csv contains 17 fields and 2550 rows; each row represents one TED speech. The fields are as follows:

No. | Field name         | Data type | Field description
1   | name               | String    | The official name of the speech (main speaker + title)
2   | title              | String    | Speech title
3   | description        | String    | A short description of the speech content
4   | main_speaker       | String    | Main speaker
5   | speaker_occupation | String    | Main speaker's occupation
6   | num_speaker        | Integer   | Number of speakers
7   | duration           | Integer   | Duration of the speech, in seconds
8   | event              | String    | TED / TEDx event where the speech was held
9   | film_date          | Integer   | Date the speech was filmed (Unix timestamp)
10  | published_date     | Integer   | Date the speech was published (Unix timestamp)
11  | comments           | Integer   | Number of comments
12  | tags               | String    | Tags / topics associated with the speech
13  | languages          | Integer   | Number of languages in which the speech is available
14  | ratings            | String    | A list of dictionaries, one per rating category (e.g. inspiring, fascinating, surprising)
15  | related_talks      | String    | A list of dictionaries, each a recommended related speech worth watching
16  | url                | String    | URL link to the speech
17  | views              | Integer   | Number of views

(2) The file transcripts.csv contains 2 fields and 2467 rows; each row represents one TED speech. The fields are as follows:

No. | Field name | Data type | Field description
1   | url        | String    | URL link to the speech
2   | transcript | String    | Official English transcript (subtitles) of the speech
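
Since the two files share the url field, they can be joined together, and the ratings column is stored as a stringified Python list of dictionaries that needs to be parsed before use. A minimal sketch, assuming the two CSV files sit in a local data/ directory (adjust the paths to your own setup):

import ast
import pandas as pd

# assumed local paths -- adjust to wherever the two CSV files are stored
ted = pd.read_csv("data/ted_main.csv")
transcripts = pd.read_csv("data/transcripts.csv")

# the files share the 'url' field; an inner merge keeps only the speeches
# that appear in both files
merged = pd.merge(ted, transcripts, on='url', how='inner')
print(merged.shape)

# 'ratings' is a stringified list of dicts; ast.literal_eval turns it back
# into a real Python list so the rating counts can be inspected
merged['ratings'] = merged['ratings'].apply(ast.literal_eval)
print(merged.loc[0, 'ratings'][:2])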

Directions of exploration (continuously updated)

The TED speech data set can be explored from the following aspects:

  • What type of speech is the most popular?
  • What do the most-viewed speeches have in common in their content?
  • What is the most popular theme in TED?

Structural analysis

  • How many new speeches are added to TED each year?
  • What is the distribution of speech types?
  • What is the distribution of the lecture duration?
  • What is the distribution of speakers' professions?
  • What is the distribution of speech views?
  • What is the distribution of speech comments?

Value Analysis

  • What are the characteristics of high-view speeches? (Theme, content, duration)
  • What are the characteristics of a highly discussed speech? (Degree of discussion = number of comments / page views)

Group behavior analysis

  • What is the relationship between page views and discussion?
  • What is the relationship between page views and speech duration?
  • What is the ideal duration of different types of speech?

 


Data analysis process

In Jupyter Notebook, first load the data:

%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns  # importing seaborn overrides matplotlib's default plotting style
import json
from pandas.io.json import json_normalize  # in newer pandas versions, use pd.json_normalize instead
# load the data set
df = pd.read_csv("C:\\Machine-Learning-with-Python-master\\data\\ted_main.csv")
# check the number of rows and columns
print("The data set has {} rows and {} columns".format(df.shape[0], df.shape[1]))

The data set contains 2550 rows and 17 columns of data.

Next, convert the date columns from Unix timestamps to a readable date format:

 

import datetime
df['film_date'] = df['film_date'].apply(lambda x: datetime.datetime.fromtimestamp( int(x)).strftime('%d-%m-%Y'))
df['published_date'] = df['published_date'].apply(lambda x: datetime.datetime.fromtimestamp( int(x)).strftime('%d-%m-%Y'))
df.head()
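
An equivalent, vectorized alternative to the apply-based conversion above is pandas' own to_datetime (assuming the columns still hold the raw integer timestamps). Note that fromtimestamp uses the local timezone while unit='s' is interpreted as UTC, so the resulting dates can differ by one day near midnight:

# vectorized conversion: interpret the values as Unix timestamps (seconds, UTC)
# and format them as 'dd-mm-yyyy' strings, as above
df['film_date'] = pd.to_datetime(df['film_date'], unit='s').dt.strftime('%d-%m-%Y')
df['published_date'] = pd.to_datetime(df['published_date'], unit='s').dt.strftime('%d-%m-%Y')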

Next, sort by number of views and extract the top 15 rows:

# top 15 talks sorted by number of views
pop_talks = df[['title','main_speaker','views','film_date']].sort_values('views',ascending=False)[:15]
pop_talks

 

The most viewed talk is "Do schools kill creativity?" by Ken Robinson, with 47,227,110 views (filmed 25-02-2006).

# take the first three letters of main_speaker and store them in a new column 'abbr'
pop_talks["abbr"] = pop_talks['main_speaker'].apply(lambda x: x[:3])
pop_talks.head()

sns.set_style("whitegrid")
plt.figure(figsize=(10,6))
sns.barplot(x='abbr',y='views',data=pop_talks)

 

sns.distplot(df['views'])

 

sns.distplot(df[df['views'] < 0.4e7]['views'])

sns.distplot(df[(df['views'] > 0.5e4)&(df['views'] < 0.4e7)]['views'])  # boolean indexing with multiple conditions

 

df['views'].describe()

The average number of views of a TED talk is about 1.6 million and the median is about 1.12 million, which shows how popular TED talks are. We also notice that most talks have fewer than 4 million views; we will use this value as the cut-off point in the later plots.
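
A quick check of these numbers (a small sketch, not part of the original analysis):

# verify the summary statistics and the 4-million cut-off mentioned above
print(df['views'].mean(), df['views'].median())
print((df['views'] < 4e6).mean())  # fraction of talks with fewer than 4 million views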

sns.distplot(df['comments'])

 

sns.distplot(df[df['comments'] < 500]['comments'])

sns.jointplot(x='views', y='comments', data=df)

df['comments'].describe()

On average, each talk has about 191.5 comments. Assuming the comments are constructive criticism, we can conclude that the TED community is highly engaged in discussing the talks. The standard deviation of the comment counts is large, in fact larger than the mean, which suggests this measure is sensitive to outliers; we will plot the distribution to examine its shape. The minimum number of comments on a talk is 2 and the maximum is 6404, a range of 6402. The very low minimum may simply belong to a recently published talk.
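
These figures can be read directly from the summary statistics; a small sketch to make the outlier argument explicit:

# the standard deviation of the comment counts exceeds the mean,
# which is why this measure is so sensitive to outliers
c = df['comments'].agg(['mean', 'std', 'min', 'max'])
print(c)
print("range:", c['max'] - c['min'])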

df[['views', 'comments']].corr()  # correlation matrix

 

As the scatter plot and correlation matrix show, the correlation coefficient is slightly greater than 0.5, indicating a moderate to strong correlation between the two quantities. As mentioned above, this result is expected. Now let's check the views and comments of the 10 most commented-on talks.

df[['title', 'main_speaker','views', 'comments']].sort_values('comments', ascending=False).head(10)

df['dis_quo'] = df['comments']/df['views']  # add a new column 'dis_quo' (comments per view)

 

# top 10 talks by comment-to-view ratio
df[['title', 'main_speaker','views', 'comments', 'dis_quo', 'film_date']].sort_values('dis_quo', ascending=False).head(10)

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
df['month'] = df['film_date'].apply(lambda x: month_order[int(x.split('-')[1]) - 1])
df['month'].head()
month_df = pd.DataFrame(df['month'].value_counts()).reset_index()
month_df.columns = ['month', 'talks']
sns.barplot(x='month',y='talks',data=month_df,order=month_order)

February is clearly the most popular filming month, while August and January are the least popular. February's popularity is largely due to the official TED conference, which is held in February.

# use datetime.date's strftime method to get the weekday name of a date
def getday(x):
    day, month, year = (int(i) for i in x.split('-'))    
    answer = datetime.date(year, month, day).strftime("%A")
    return answer[:3]
# alternative: use datetime.date's weekday() method to get the day-of-week index, then look up the name in day_order
def getday2(x):
    day, month, year = (int(i) for i in x.split('-'))    
    answer = datetime.date(year, month, day).weekday()
    return day_order[answer]
df['day'] = df['film_date'].apply(getday)  # add a new 'day' column
day_df = pd.DataFrame(df['day'].value_counts()).reset_index()
day_df.columns = ['day', 'talks']
sns.barplot(x='day', y='talks', data=day_df, order=day_order)

As the plot above shows, Wednesday and Thursday are the most popular filming days, and Sunday is the least popular. It seems most talks are filmed in the middle of the week, leaving the weekend free.

df['year'] = df['film_date'].apply(lambda x: x.split('-')[2])
year_df = pd.DataFrame(df['year'].value_counts().reset_index())
year_df.columns = ['year', 'talks']
plt.figure(figsize=(18,5))
sns.pointplot(x='year', y='talks', data=year_df)
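
One caveat: value_counts() orders the years by frequency, so the x-axis of the plot above is not chronological. Sorting year_df first gives a cleaner timeline (the years are four-digit strings, so string order matches chronological order):

# sort the years so the point plot reads left to right in time
year_df = year_df.sort_values('year')
plt.figure(figsize=(18, 5))
sns.pointplot(x='year', y='talks', data=year_df)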

 

Reference blog post: https://www.jianshu.com/p/585019d60572

 
