Gathering Data

1. Web Scraping

Use Beautiful Soup to extract data from each HTML file.

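Create an empty list, df_list, to which dictionaries will be appended.
Loop through each movie's Rotten Tomatoes HTML file in the rt_html folder.
Open each HTML file and pass it into a file handle named file.
Use pd.DataFrame() to convert df_list into a DataFrame named df.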

from bs4 import BeautifulSoup
import os
import pandas as pd

df_list = []
folder_name = 'rt_html'
for movie_html in os.listdir(folder_name):
	with open(os.path.join(folder_name, movie_html)) as file:
		soup = BeautifulSoup(file, 'lxml')
		title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
		audience_score = soup.find('div', class_='audience-score master').find('span').contents[0][:-1]
		num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
		num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
		df_list.append({'title': title,
						'audience_score': int(audience_score),
						'number_of_audience_ratings': int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns=['title', 'audience_score', 'number_of_audience_ratings'])
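
The string clean-up inside the loop can be checked in isolation. The sketch below applies the same slicing and replacement to hypothetical sample strings; the values are made up for illustration and are not taken from a real Rotten Tomatoes page.

```python
# Hypothetical raw strings, shaped like the text the loop above extracts.
raw_title = 'Example Movie (2000) - Rotten Tomatoes'
raw_score = '72%'
raw_count = '\n  32,313  '

suffix = ' - Rotten Tomatoes'
# Slice off the fixed site suffix from the <title> text.
title = raw_title[:-len(suffix)]
# Drop the trailing percent sign, then convert to int.
audience_score = int(raw_score[:-1])
# Trim whitespace and remove thousands separators before converting.
num_audience_ratings = int(raw_count.strip().replace(',', ''))

print(title, audience_score, num_audience_ratings)
```

Because the slices are positional, they only work while the page keeps the same suffix and the trailing percent sign. If the layout changes, int() will raise a ValueError instead of silently storing bad data.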

2. Downloading Files from the Internet

import requests
import os

folder_name = 'XXX'              # name of the folder to save the file in
if not os.path.exists(folder_name):
	os.makedirs(folder_name)

url = '…'
response = requests.get(url)

with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
	file.write(response.content)
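
The path-handling part of this download can be tried offline. In the sketch below, the URL is hypothetical and a short bytes literal stands in for response.content, so no network request is made. A temporary directory is used to avoid touching the working folder.

```python
import os
import tempfile

# Hypothetical URL, used only to show how the file name is derived.
url = 'https://example.com/files/sample.txt'
# Placeholder bytes standing in for response.content.
payload = b'placeholder bytes'

base_dir = tempfile.mkdtemp()
folder_name = os.path.join(base_dir, 'downloads')
# exist_ok=True makes the call safe even if the folder already exists.
os.makedirs(folder_name, exist_ok=True)

# The last path segment of the URL becomes the local file name.
file_name = url.split('/')[-1]
file_path = os.path.join(folder_name, file_name)

# Binary mode writes the bytes exactly as received.
with open(file_path, mode='wb') as file:
    file.write(payload)

print(file_name)
```

In a real download it is usually worth calling response.raise_for_status() before writing. Otherwise an error page could be saved as if it were the file.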

3. Opening Text Files with glob

import glob
import pandas as pd

df_list = []
for ebert_review in glob.glob('ebert_reviews/*.txt'):
	with open(ebert_review, encoding='utf-8') as file:
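		title = file.readline()[:-1]           # line 1 of the review is the title (drop the trailing newline)
		review_url = file.readline()[:-1]        # line 2 of the review is the review URL (drop the trailing newline)
		review_text = file.read()          # everything after line 2 is the review text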
		df_list.append({'title': title,
						'review_url': review_url,
						'review_text': review_text})

df = pd.DataFrame(df_list, columns = ['title', 'review_url', 'review_text'])
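
To try the same reading pattern without the ebert_reviews folder, the sketch below writes two hypothetical review files to a temporary directory and parses them into a list of dictionaries. The file names and contents are invented for illustration. It uses rstrip('\n') instead of [:-1] as an alternative.

```python
import glob
import os
import tempfile

# Build a throwaway folder with two hypothetical review files.
folder = tempfile.mkdtemp()
samples = {
    'a.txt': 'Movie A\nhttps://example.com/a\nFirst line.\nSecond line.\n',
    'b.txt': 'Movie B\nhttps://example.com/b\nOnly line.\n',
}
for name, text in samples.items():
    with open(os.path.join(folder, name), 'w', encoding='utf-8') as file:
        file.write(text)

rows = []
# sorted() gives a stable order; glob itself makes no ordering promise.
for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
    with open(path, encoding='utf-8') as file:
        title = file.readline().rstrip('\n')       # line 1: title
        review_url = file.readline().rstrip('\n')  # line 2: review URL
        review_text = file.read()                  # everything after line 2
    rows.append({'title': title,
                 'review_url': review_url,
                 'review_text': review_text})

print(rows[0]['title'], len(rows))
```

The [:-1] slice removes the last character whatever it is, so it would cut off a real character on a line that has no trailing newline. rstrip('\n') removes only newline characters. The resulting rows list can be passed to pd.DataFrame() in the same way as df_list.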

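4. Querying an API (the wptools library)

For MediaWiki, wptools is a current and readable Python library. Below is an example of wptools working with the E.T. Wikipedia page:

To get a $\color{red}{page}$ object:

import wptools

page = wptools.page('E.T._the_Extra_Terrestrial')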

To fetch all of the data, use the .get() method:

page = wptools.page('E.T._the_Extra_Terrestrial').get()

Or, if the page object has already been assigned to $\color{red}{page}$, fetch its data afterwards:

page.get()

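To access the attributes of $\color{red}{page}$, use its .data attribute, which is a dictionary rather than a method. For example, to get the list of image data on the page:

page.data['image']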

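5. Working with JSON

Accessing JSON data works just like accessing Python dictionaries and lists, because JSON objects are interpreted as dictionaries and JSON arrays are interpreted as lists.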

(1) JSON arrays

To access the first image in the image attribute (which is a JSON array):

page.data['image'][0]

(2) JSON objects

To access the director key in the infobox attribute (which is a JSON object):

page.data['infobox']['director']
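
The mapping from JSON arrays to lists and from JSON objects to dictionaries can be tried with the standard-library json module. The JSON text below is a hypothetical stand-in for this kind of data; its keys and values are invented for illustration and are not the actual output of wptools.

```python
import json

# Hypothetical JSON text shaped like the data described above.
raw = '''
{
  "image": [
    {"file": "poster.jpg", "kind": "infobox-image"},
    {"file": "still.jpg", "kind": "page-image"}
  ],
  "infobox": {"director": "Steven Spielberg", "released": "1982"}
}
'''
data = json.loads(raw)

# A JSON array becomes a list, so it is indexed by position.
first_image = data['image'][0]
# A JSON object becomes a dict, so it is indexed by key.
director = data['infobox']['director']

print(type(data['image']).__name__, type(data['infobox']).__name__)
print(first_image['file'], director)
```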

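6. Gathering Data with Databases and SQL

Connect to a database in Python. Use SQLAlchemy to connect to a SQLite database.
Store the data from a pandas DataFrame in the connected database, using the DataFrame's .to_sql method.
Import data from the connected database into a pandas DataFrame, using pandas' read_sql function.

Connecting a database and pandas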

import pandas as pd

df = pd.read_csv('bestofrt_master.csv')
df.head(3)

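(1) Connecting to the database

from sqlalchemy import create_engine
# Create the SQLAlchemy engine and an empty bestofrt database.
# A relative SQLite file path needs three slashes after "sqlite:".
engine = create_engine('sqlite:///bestofrt.db')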

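(2) Storing a pandas DataFrame in the database

Store the cleaned master dataset (loaded from bestofrt_master.csv as df) in the database.

# Store the cleaned master DataFrame ('df') in a table named master in bestofrt.db
df.to_sql('master', engine, index=False)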

(3) Reading the database back into a pandas DataFrame

Read the newly stored data in the database back into a pandas DataFrame.

df_gather = pd.read_sql('SELECT * FROM master', engine)
df_gather.head(3)
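
To see the write-then-read round trip in its barest form, here is a sketch that uses only the standard-library sqlite3 module and an in-memory database. It is not how pandas implements to_sql or read_sql; it only illustrates the same idea of storing rows in a table named master and selecting them back. The column names and rows are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(':memory:')

# Hypothetical rows standing in for the master dataset.
rows = [('Movie A', 90), ('Movie B', 85), ('Movie C', 70)]

# Create the table and insert the rows with parameterized SQL.
conn.execute('CREATE TABLE master (title TEXT, audience_score INTEGER)')
conn.executemany('INSERT INTO master VALUES (?, ?)', rows)
conn.commit()

# Read everything back with the same SELECT * query used above.
gathered = conn.execute('SELECT * FROM master').fetchall()
conn.close()

print(gathered[:3])
```

One practical note on the pandas version: by default, df.to_sql raises a ValueError if the table already exists. Re-running the storing step therefore needs if_exists='replace' or if_exists='append'.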