[Machine learning notes] Analysis of TED speech data with Python (updated)

Analyzing TED speech data with Python

First, prepare the TED speech data set. The data set and its documentation can be obtained from the following resource:

https://www.datafountain.cn/datasets/11

The data set contains 2 files:

  • ted_main.csv
    contains the main information about each speech, including the title, the main speaker, the description, the number of views, the number of comments, the ratings, etc.
  • transcripts.csv
    contains the URL of each speech and its official English transcript.

(1) The file ted_main.csv contains 17 fields and 2550 rows; each row represents one TED speech. The fields are as follows:

No. | Field name         | Data type | Field description
1   | name               | String    | The official name of the speech (main speaker + title)
2   | title              | String    | Speech title
3   | description        | String    | A short description of the speech content
4   | main_speaker       | String    | Main speaker
5   | speaker_occupation | String    | Main speaker's occupation
6   | num_speaker        | Integer   | Number of speakers
7   | duration           | Integer   | Duration of the speech, in seconds
8   | event              | String    | TED / TEDx event where the speech was held
9   | film_date          | Integer   | Date the speech was filmed (Unix timestamp)
10  | published_date     | Integer   | Date the speech was published (Unix timestamp)
11  | comments           | Integer   | Number of comments
12  | tags               | String    | Tags / topics associated with the speech
13  | languages          | Integer   | Number of languages in which the speech is available
14  | ratings            | String    | A list of dictionaries, one per rating category (e.g. inspiring, fascinating, surprising)
15  | related_talks      | String    | A list of dictionaries, each a recommended related speech worth watching
16  | url                | String    | URL link to the speech
17  | views              | Integer   | Number of views

(2) The file transcripts.csv contains 2 fields and 2467 rows; each row represents one TED speech. The fields are as follows:

No. | Field name | Data type | Field description
1   | url        | String    | URL link to the speech
2   | transcript | String    | Official English transcript (subtitles) of the speech
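
Since the two files share the url field, they can be joined together, and the ratings column is stored as a stringified Python list of dictionaries that needs to be parsed before use. A minimal sketch, assuming the two CSV files sit in a local data/ directory (adjust the paths to your own setup):

import ast
import pandas as pd

# assumed local paths -- adjust to wherever the two CSV files are stored
ted = pd.read_csv("data/ted_main.csv")
transcripts = pd.read_csv("data/transcripts.csv")

# the files share the 'url' field; an inner merge keeps only the speeches
# that appear in both files
merged = pd.merge(ted, transcripts, on='url', how='inner')
print(merged.shape)

# 'ratings' is a stringified list of dicts; ast.literal_eval turns it back
# into a real Python list so the rating counts can be inspected
merged['ratings'] = merged['ratings'].apply(ast.literal_eval)
print(merged.loc[0, 'ratings'][:2])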

Directions of exploration (continuously updated)

The TED speech data set can be explored from the following aspects:

  • What type of speech is the most popular?
  • What do the most-viewed speeches have in common in their content?
  • What is the most popular theme in TED?

Structural analysis

  • How many new speeches are added to TED each year?
  • What is the distribution of speech types?
  • What is the distribution of the lecture duration?
  • What is the distribution of speakers' professions?
  • What is the distribution of speech views?
  • What is the distribution of speech comments?

Value Analysis

  • What are the characteristics of high-view speeches? (Theme, content, duration)
  • What are the characteristics of a highly discussed speech? (Degree of discussion = number of comments / page views)

Group behavior analysis

  • What is the relationship between page views and discussion?
  • What is the relationship between page views and speech duration?
  • What is the ideal duration of different types of speech?

 


Data analysis process

In Jupyter Notebook, first load the data:

%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns  # importing seaborn overrides matplotlib's default plotting style
import json
from pandas.io.json import json_normalize  # in newer pandas versions, use pd.json_normalize instead
# load the data set
df = pd.read_csv("C:\\Machine-Learning-with-Python-master\\data\\ted_main.csv")
# check the number of rows and columns
print("The data set has {} rows and {} columns".format(df.shape[0], df.shape[1]))

The data set contains 2550 rows and 17 columns of data.

Next, convert the date columns from Unix timestamps to a readable date format:

 

import datetime
df['film_date'] = df['film_date'].apply(lambda x: datetime.datetime.fromtimestamp( int(x)).strftime('%d-%m-%Y'))
df['published_date'] = df['published_date'].apply(lambda x: datetime.datetime.fromtimestamp( int(x)).strftime('%d-%m-%Y'))
df.head()
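
An equivalent, vectorized alternative to the apply-based conversion above is pandas' own to_datetime (assuming the columns still hold the raw integer timestamps). Note that fromtimestamp uses the local timezone while unit='s' is interpreted as UTC, so the resulting dates can differ by one day near midnight:

# vectorized conversion: interpret the values as Unix timestamps (seconds, UTC)
# and format them as 'dd-mm-yyyy' strings, as above
df['film_date'] = pd.to_datetime(df['film_date'], unit='s').dt.strftime('%d-%m-%Y')
df['published_date'] = pd.to_datetime(df['published_date'], unit='s').dt.strftime('%d-%m-%Y')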

Next, sort by number of views and extract the top 15 rows:

# top 15 talks sorted by number of views
pop_talks = df[['title','main_speaker','views','film_date']].sort_values('views',ascending=False)[:15]
pop_talks

 

The most viewed talk is "Do schools kill creativity?" by Ken Robinson, with 47,227,110 views (filmed 25-02-2006).

# take the first three letters of main_speaker and store them in a new column 'abbr'
pop_talks["abbr"] = pop_talks['main_speaker'].apply(lambda x: x[:3])
pop_talks.head()

sns.set_style("whitegrid")
plt.figure(figsize=(10,6))
sns.barplot(x='abbr',y='views',data=pop_talks)

 

sns.distplot(df['views'])

 

sns.distplot(df[df['views'] < 0.4e7]['views'])

sns.distplot(df[(df['views'] > 0.5e4)&(df['views'] < 0.4e7)]['views'])  # boolean indexing with multiple conditions

 

df['views'].describe()

The average number of views of a TED talk is about 1.6 million and the median is about 1.12 million, which shows how popular TED talks are. We also notice that most talks have fewer than 4 million views; we will use this value as the cut-off point in the later plots.
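
A quick check of these numbers (a small sketch, not part of the original analysis):

# verify the summary statistics and the 4-million cut-off mentioned above
print(df['views'].mean(), df['views'].median())
print((df['views'] < 4e6).mean())  # fraction of talks with fewer than 4 million views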

sns.distplot(df['comments'])

 

sns.distplot(df[df['comments'] < 500]['comments'])

sns.jointplot(x='views', y='comments', data=df)

df['comments'].describe()

On average, each talk has about 191.5 comments. Assuming the comments are constructive criticism, we can conclude that the TED community is highly engaged in discussing the talks. The standard deviation of the comment counts is large, in fact larger than the mean, which suggests this measure is sensitive to outliers; we will plot the distribution to examine its shape. The minimum number of comments on a talk is 2 and the maximum is 6404, a range of 6402. The very low minimum may simply belong to a recently published talk.
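
These figures can be read directly from the summary statistics; a small sketch to make the outlier argument explicit:

# the standard deviation of the comment counts exceeds the mean,
# which is why this measure is so sensitive to outliers
c = df['comments'].agg(['mean', 'std', 'min', 'max'])
print(c)
print("range:", c['max'] - c['min'])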

df[['views', 'comments']].corr()  # correlation matrix

 

As the scatter plot and correlation matrix show, the correlation coefficient is slightly greater than 0.5, indicating a moderate to strong correlation between the two quantities. As mentioned above, this result is expected. Now let's check the views and comments of the 10 most commented-on talks.

df[['title', 'main_speaker','views', 'comments']].sort_values('comments', ascending=False).head(10)

df['dis_quo'] = df['comments']/df['views']  # add a new column 'dis_quo' (comments per view)

 

# top 10 talks by comment-to-view ratio
df[['title', 'main_speaker','views', 'comments', 'dis_quo', 'film_date']].sort_values('dis_quo', ascending=False).head(10)

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
df['month'] = df['film_date'].apply(lambda x: month_order[int(x.split('-')[1]) - 1])
df['month'].head()
month_df = pd.DataFrame(df['month'].value_counts()).reset_index()
month_df.columns = ['month', 'talks']
sns.barplot(x='month',y='talks',data=month_df,order=month_order)

February is clearly the most popular filming month, while August and January are the least popular. February's popularity is largely due to the official TED conference, which is held in February.

# use datetime.date's strftime method to get the weekday name of a date
def getday(x):
    day, month, year = (int(i) for i in x.split('-'))    
    answer = datetime.date(year, month, day).strftime("%A")
    return answer[:3]
# alternative: use datetime.date's weekday() method to get the day-of-week index, then look up the name in day_order
def getday2(x):
    day, month, year = (int(i) for i in x.split('-'))    
    answer = datetime.date(year, month, day).weekday()
    return day_order[answer]
df['day'] = df['film_date'].apply(getday)  # add a new 'day' column
day_df = pd.DataFrame(df['day'].value_counts()).reset_index()
day_df.columns = ['day', 'talks']
sns.barplot(x='day', y='talks', data=day_df, order=day_order)

As the plot above shows, Wednesday and Thursday are the most popular filming days, and Sunday is the least popular. It seems most talks are filmed in the middle of the week, leaving the weekend free.

df['year'] = df['film_date'].apply(lambda x: x.split('-')[2])
year_df = pd.DataFrame(df['year'].value_counts().reset_index())
year_df.columns = ['year', 'talks']
plt.figure(figsize=(18,5))
sns.pointplot(x='year', y='talks', data=year_df)
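
One caveat: value_counts() orders the years by frequency, so the x-axis of the plot above is not chronological. Sorting year_df first gives a cleaner timeline (the years are four-digit strings, so string order matches chronological order):

# sort the years so the point plot reads left to right in time
year_df = year_df.sort_values('year')
plt.figure(figsize=(18, 5))
sns.pointplot(x='year', y='talks', data=year_df)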

 

Reference blog post: https://www.jianshu.com/p/585019d60572

 
