Python Web Scraping (5): Hands-On [2. Scraping the Maker Lab (requests + bs4)]

Goal: scrape the lecture information from the Maker Lab website.

Output table: lecture title, speaker, affiliation, talk time, venue, content summary, speaker bio

Tools: requests + bs4

Check the crawler policy (robots.txt):

http://127.0.0.1/lab/robots.txt

(The Maker Lab is a site I wrote myself, so it has no anti-scraping measures.)
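The robots.txt check can also be done in code. Below is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the helper name is_allowed are made up for illustration; the real file on the site may say something different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


lecture_ok = is_allowed(SAMPLE_ROBOTS, '*', 'http://127.0.0.1/lab/lecture')
admin_ok = is_allowed(SAMPLE_ROBOTS, '*', 'http://127.0.0.1/admin/users')
print(lecture_ok, admin_ok)
```

Against a live site, calling set_url('http://127.0.0.1/lab/robots.txt') followed by read() on the parser would load the real file instead of the sample string.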

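Inspecting the page at http://127.0.0.1/lab/lecture shows that each lecture title sits inside an a tag.

Clicking a title opens that lecture's detail page; the link is the href of the same a tag.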
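To make the observation concrete, here is a small sketch that pulls both the title and the link out of each a tag. The sample markup and the function name extract_lecture_links are assumptions for illustration; the real listing page may be structured differently.

```python
from urllib.parse import urljoin

import bs4

# Hypothetical listing markup; the real page may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/lab/lectureContent/17">Lecture A</a></li>
  <li><a href="/lab/lectureContent/18">Lecture B</a></li>
  <li><a>Anchor without a link</a></li>
</ul>
"""


def extract_lecture_links(html, base_url):
    """Return (title, absolute_url) pairs for every <a> that has an href."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):  # skip anchors without href
        title = a.get_text(strip=True)
        links.append((title, urljoin(base_url, a['href'])))
    return links


links = extract_lecture_links(SAMPLE_HTML, 'http://127.0.0.1/lab/lecture')
for title, link in links:
    print(title, link)
```

urljoin handles both site-relative and absolute href values, which is a little safer than concatenating the host name by hand.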

Code:

import requests
import bs4

# Fetch the listing page
url = 'http://127.0.0.1/lab/lecture'
r = requests.get(url)
r.encoding = r.apparent_encoding
html = r.text

# Parse the page and collect the lecture titles
soup = bs4.BeautifulSoup(html, 'html.parser')
titleList = soup.find_all('a')
lecture = []
for i in titleList:
    lecture.append(i.string)

# In a notebook this displays the list; in a script, use print(lecture)
lecture

Result:

# Take one lecture-detail link and fetch that page
content_url = 'http://127.0.0.1/lab/lectureContent/17'
req = requests.get(content_url)
req.encoding = req.apparent_encoding
content = req.text

# Parse the detail page and list the child nodes of the section tag
soup_new = bs4.BeautifulSoup(content, 'html.parser')
soup_new.section.contents

Result:
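With html.parser, .contents typically returns whitespace text nodes interleaved with the child tags whenever the page source has line breaks between elements. That would explain why the full code below reads only the odd indexes (j[1], j[3], ...). A minimal sketch on a made-up fragment (the real page's markup is an assumption here):

```python
import bs4

# Hypothetical detail-page fragment, for illustration only.
html = """<section>
<p>Speaker: A</p>
<p>Affiliation: B</p>
</section>"""

section = bs4.BeautifulSoup(html, 'html.parser').section

# .contents mixes newline text nodes with the <p> tags
contents = section.contents
print(len(contents), [type(node).__name__ for node in contents])

# Direct child tags only, without the whitespace nodes
tags_only = section.find_all(True, recursive=False)
print([tag.get_text() for tag in tags_only])
```

Selecting the child tags with find_all(True, recursive=False) gives consecutive indexes (0, 1, 2, ...) and does not shift if the whitespace in the page source changes.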

Full code:

import requests
import bs4
import pandas as pd

# Fetch the listing page
url = 'http://127.0.0.1/lab/lecture'
r = requests.get(url)
r.encoding = r.apparent_encoding
html = r.text

# Parse the page and collect the a tags (one per lecture)
soup = bs4.BeautifulSoup(html, 'html.parser')
aList = soup.find_all('a')
lecture = []
for i in aList:
    # i.string is the lecture title; the href leads to the detail page
    content_url = 'http://127.0.0.1' + i.attrs['href']
    req = requests.get(content_url)
    req.encoding = req.apparent_encoding
    content = req.text
    # Parse the detail page (speaker, affiliation, ...)
    soup_new = bs4.BeautifulSoup(content, 'html.parser')
    # Walk the child nodes of the section tag
    j = soup_new.section.contents
    lecture.append([i.string, j[1].string, j[3].string, j[5].string, j[7].string, j[9].string, j[13].string])

# Write the table: convert the list to a DataFrame, then call .to_csv
# Columns: lecture, speaker, affiliation, talk time, venue, content summary, speaker bio
table = pd.DataFrame(data=lecture, columns=["讲座", "报告人", "单位", "报告时间", "报告地点", "内容简介", "报告人简介"])
table.to_csv('D:/1.csv', index=False)

Result:

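Improvements:

The scraper works, but a few issues remain:

1. The content summary and speaker bio were not scraped

The likely cause is that these elements contain <br/>: a tag with more than one child has no single string, so .string returns None.

Fix: extract the text with split: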

str(str(j[9]).split('>')[2]).split('<')[0]

This first converts the whole <span>...</span> element to a string, then splits twice to extract the content.
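The behaviour of .string can be checked on a small fragment. The markup below is a guess at what such a span might look like, not the site's actual HTML. The sketch also shows get_text(), a Beautiful Soup method that collects all text inside a tag and may be a sturdier alternative to splitting on angle brackets, since the split approach stops at the next tag.

```python
import bs4

# Hypothetical fragment: a span whose text is broken up by <br/> tags.
html = '<span>Summary:<br/>First line<br/>Second line</span>'
span = bs4.BeautifulSoup(html, 'html.parser').span

# More than one child, so there is no single string to return
print(span.string)

# The split-based extraction keeps only the text up to the next tag
split_text = str(str(span).split('>')[2]).split('<')[0]
print(split_text)

# get_text() gathers every text piece inside the tag
full_text = span.get_text(separator='\n', strip=True)
print(full_text)
```

Whether the label line should be kept or dropped afterwards depends on the real markup, so that step is left out here.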

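2. The speaker, affiliation, and talk time values carry redundant text

The scraped strings still include the field label before the full-width colon (：).

Fix: split on the colon and keep the second part: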

j[1].string.split('：')[1]
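One caveat with split('：')[1]: it raises IndexError when the colon is missing, and it drops everything after a second full-width colon (for example inside a time). A slightly more defensive sketch using str.partition follows; the helper name strip_label and the sample values are made up for illustration.

```python
def strip_label(text, sep='：'):
    """Drop a leading 'label：' prefix; return the text unchanged if sep is absent."""
    if text is None:  # .string may be None for tags with several children
        return ''
    _, found, value = text.partition(sep)  # splits at the first sep only
    return value.strip() if found else text.strip()


# Hypothetical field values
print(strip_label('报告人：张三'))
print(strip_label('报告时间：2020-01-01 14：00'))
print(strip_label('no label here'))
```

Because partition splits only at the first separator, a value that itself contains a full-width colon stays intact.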

3. Code cleanup

Split the code into functions so that later projects can reuse it (optional).
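One possible way to split the scraper into functions is sketched below. All function names are hypothetical, and the parsing assumes the fields are direct child tags of section, which may not match the real page. Keeping the parsing functions separate from the network call means they can be tried on saved HTML without hitting the site.

```python
from urllib.parse import urljoin

import bs4
import requests

BASE_URL = 'http://127.0.0.1/lab/lecture'


def fetch_html(url, timeout=10):
    """Download a page and return its decoded text."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # fail loudly on 4xx/5xx responses
    r.encoding = r.apparent_encoding
    return r.text


def parse_listing(html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for each <a> with an href."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), urljoin(base_url, a['href']))
            for a in soup.find_all('a', href=True)]


def parse_detail(html):
    """Return the text of each direct child tag of <section>, in order."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    if soup.section is None:
        return []
    return [child.get_text(strip=True)
            for child in soup.section.find_all(True, recursive=False)]


def scrape(base_url=BASE_URL):
    """Fetch the listing, then every detail page; return one row per lecture."""
    rows = []
    for title, link in parse_listing(fetch_html(base_url), base_url):
        rows.append([title] + parse_detail(fetch_html(link)))
    return rows
```

Calling scrape() with the site running would return a list of rows that can be passed to pd.DataFrame in the same way as the lecture list.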

Final code:

import requests
import bs4
import pandas as pd

# Fetch the listing page
url = 'http://127.0.0.1/lab/lecture'
r = requests.get(url)
r.encoding = r.apparent_encoding
html = r.text

# Parse the page and collect the a tags (one per lecture)
soup = bs4.BeautifulSoup(html, 'html.parser')
aList = soup.find_all('a')
lecture = []
for i in aList:
    # i.string is the lecture title; the href leads to the detail page
    content_url = 'http://127.0.0.1' + i.attrs['href']
    req = requests.get(content_url)
    req.encoding = req.apparent_encoding
    content = req.text
    # Parse the detail page (speaker, affiliation, ...)
    soup_new = bs4.BeautifulSoup(content, 'html.parser')
    # Walk the child nodes of the section tag
    j = soup_new.section.contents
    # The content summary and speaker bio need extra work:
    # convert the tag to a string, then split twice to pull out the text
    m = str(str(j[9]).split('>')[2]).split('<')[0]
    n = str(str(j[13]).split('>')[2]).split('<')[0]
    # For the other fields, drop the label before the full-width colon
    lecture.append([i.string, j[1].string.split('：')[1], j[3].string.split('：')[1], j[5].string.split('：')[1], j[7].string.split('：')[1], m, n])

# Write the table: convert the list to a DataFrame, then call .to_csv
# Columns: lecture, speaker, affiliation, talk time, venue, content summary, speaker bio
table = pd.DataFrame(data=lecture, columns=["讲座", "报告人", "单位", "报告时间", "报告地点", "内容简介", "报告人简介"])
table.to_csv('D:/1.csv', index=False)

Result:
