1. Crawler Design
This work implements a web crawler in Python to scrape the target data, using Beautiful Soup to parse the HTML. Beautiful Soup is an HTML/XML parsing library whose main job is locating and extracting data from HTML/XML documents, serving a role similar to regular expressions. Because Beautiful Soup loads the whole document and builds the full DOM tree, its time and memory overhead are relatively high and its performance is lower than lxml's. Beautiful Soup exposes the same interface over several underlying parsers, but the parsers themselves differ: parsing the same document with different parsers may produce differently structured trees. This system parses pages with the lxml parser, a Python parsing library that supports both HTML and XML, also supports XPath, and is very fast.
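As a minimal, self-contained sketch of the Beautiful Soup interface used throughout this crawler (the HTML fragment is made up, and the standard-library 'html.parser' stands in for 'lxml' so the example has no extra dependency):

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for a downloaded page
html = '<div class="info"><h1><strong>Dr. Wang</strong></h1><span>Chief Physician</span></div>'

# 'html.parser' is the stdlib fallback; the crawler itself passes 'lxml'
soup = BeautifulSoup(html, 'html.parser')

name = soup.find('strong').text               # tag-based search
title = soup.select('div.info span')[0].text  # CSS-selector search
print(name, title)  # → Dr. Wang Chief Physician
```

Both search styles appear in the crawler code below: `find_all` for tag/attribute lookups and `select` for long CSS-selector paths copied from the browser's developer tools.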
The hospital and doctor information pages are shown in Figure 1 and Figure 2.
Figure 1 Hospital information page
Figure 2 Doctor information page
2. URL Analysis
A URL (Uniform Resource Locator) is the address of a page on a website.
(1) First, analyzing the hospital pages shows that their URLs follow this pattern:
URL = "https://www.guahao.com/hospital/" + hospital_id
(2) Next, the department page URLs follow this pattern:
URL = "https://www.guahao.com/department/" + department_id + "?isStd="
(3) Doctor page URLs:
URL = "https://www.guahao.com/expert/" + doctor_id + "?hospDeptId=" + department_id + "&hospitalId=" + hospital_id
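The three patterns above can be assembled by plain string concatenation; the IDs below are placeholders, not real records:

```python
# Placeholder IDs for illustration; real IDs are harvested from the list pages
hospital_id = 'hosp-0001'
department_id = 'dept-0002'
doctor_id = 'doc-0003'

hospital_url = 'https://www.guahao.com/hospital/' + hospital_id
department_url = 'https://www.guahao.com/department/' + department_id + '?isStd='
doctor_url = ('https://www.guahao.com/expert/' + doctor_id
              + '?hospDeptId=' + department_id
              + '&hospitalId=' + hospital_id)
print(doctor_url)
```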
(4) The doctor comment lists are protected by an anti-crawling measure: the URL of every comment page carries a sign parameter and a timestamp parameter:
URL = "https://www.guahao.com/commentslist/e-" + doctor_id + "/all-0?pageNo=" + pageNum + "&sign=" + sign_value + "&timestamp=" + timestamp_value
Inspection shows that the sign and timestamp values needed for the next page's URL are already present in two <input> tags on the current page, as shown in Figure 3.
Figure 3 <input> tags carrying the sign and timestamp parameters
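Reading the two values out of the current page can be sketched as follows; the markup below is an assumed shape (in particular the name attributes are illustrative; the real page's attributes may differ):

```python
from bs4 import BeautifulSoup

# Assumed markup; the actual attribute names on the site may differ
html = '''
<input type="hidden" name="sign" value="abc123"/>
<input type="hidden" name="timestamp" value="1555550000"/>
'''
soup = BeautifulSoup(html, 'html.parser')
sign_value = soup.find('input', {'name': 'sign'})['value']
timestamp_value = soup.find('input', {'name': 'timestamp'})['value']

doctor_id, pageNum = 'doc-0003', 2  # placeholders
next_url = ('https://www.guahao.com/commentslist/e-' + doctor_id
            + '/all-0?pageNo=' + str(pageNum)
            + '&sign=' + sign_value
            + '&timestamp=' + timestamp_value)
print(next_url)
```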
3. DOM Analysis
For an HTML document, the desired information under a specific tag can be located by following the parent-child relationships of the DOM. Taking the doctor information page (Figure 4) as an example, the doctor's data sits inside <div class="grid-section expert-card fix-clear">, with the name, title, department, and other fields in its child tags.
Figure 4 Source of the doctor information page
The crawl proceeds as follows: starting from a city, obtain the city's hospital list and thus each hospital's URL; from a hospital's information page, collect all of its departments and clinics; a clinic page yields the doctor list; finally, following each doctor's URL to the detail page allows the doctor's full information to be parsed.
Comment pages are reached from the doctor detail page. During comment crawling, after a few pages the site starts responding with "评价信息不存在" ("comment information does not exist", Figure 5), which analysis attributes to the site's anti-crawling protection. Adding "User-Agent" and "Cookie" headers to each request makes the crawler "pretend" to be a normal browser.
Figure 5 Error response
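One way to attach the "User-Agent" and "Cookie" headers with requests is sketched below; the header values are placeholders that would in practice be copied from a real browser session via the developer tools:

```python
import requests

# Placeholder values; copy real ones from the browser's developer tools
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Cookie': 'SESSION=placeholder',
}

def fetch(url):
    # Every request carries the headers so the server sees a browser-like client
    return requests.get(url, headers=headers, timeout=10)
```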
When scraping, some doctors are missing part of their information (biography, comments, specialties, etc.), and these cases must be checked for individually. In addition, hospital lists, doctor lists, and comments are all paginated, so the crawler must walk the pages and, in particular, detect the last page. In the comment list, for example, every page except the last contains an <a> tag linking to the next page (Figure 6); checking whether the page contains that tag tells the crawler whether it has reached the end.
Figure 6 Next-page <a> tag
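The last-page check can be sketched as follows, using the next-page anchor's class that appears in the crawler code; the two page fragments are illustrative:

```python
from bs4 import BeautifulSoup

# Illustrative fragments: one page with a next-page link, one without
page_with_next = '<div class="pagers"><a class="next J_pageNext_gh" href="#">next</a></div>'
last_page = '<div class="pagers"></div>'

def has_next_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    # When the next-page <a> tag is absent, this is the last page
    return len(soup.find_all('a', class_='next J_pageNext_gh')) > 0

print(has_next_page(page_with_next), has_next_page(last_page))  # → True False
```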
4. Database Design
The scraped doctor records for Beijing are stored in a single MySQL table, created as follows:
SET FOREIGN_KEY_CHECKS=0;
-- ----------------------------
-- Table structure for t_doctor_beijing
-- ----------------------------
DROP TABLE IF EXISTS `t_doctor_beijing`;
CREATE TABLE `t_doctor_beijing` (
`doc_id` int(11) NOT NULL AUTO_INCREMENT,
`doc_name` varchar(30) DEFAULT NULL,
`doc_sex` varchar(6) DEFAULT NULL,
`doc_title` varchar(50) DEFAULT NULL,
`doc_hospital_name` varchar(50) DEFAULT NULL,
`doc_department_name` varchar(50) DEFAULT NULL,
`doc_clinic_name` varchar(50) DEFAULT NULL,
`doc_labels` varchar(100) DEFAULT NULL,
`doc_score` float(11,1) DEFAULT NULL,
`doc_hits` int(11) DEFAULT NULL,
`doc_impression` varchar(200) DEFAULT NULL,
`doc_goodat` varchar(2500) DEFAULT NULL,
`doc_about` varchar(2500) DEFAULT NULL,
`doc_hospital_number` varchar(50) NOT NULL,
`doc_clinic_number` varchar(50) NOT NULL,
`doc_number` varchar(50) NOT NULL,
PRIMARY KEY (`doc_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
5. Crawler Code
(1) Collect the IDs of the hospitals in a given city.
import requests
from bs4 import BeautifulSoup

# Output file for the harvested hospital IDs
target = open('hospitalUrl_Beijing.txt', 'w', encoding='utf8')

def writeFile(urls):
    for url in urls:
        target.write('\'' + url + '\',\r\n')

def getHospitalUrl():
    reslist = []
    # City-level hospital list; the page number is appended to this base URL
    BaseUrl = 'https://www.guahao.com/hospital/3/北京/all/不限/p'
    i = 1
    while True:
        print('page ' + str(i))
        pageUrl = BaseUrl + str(i)
        html = requests.get(pageUrl)
        html.encoding = 'utf-8'
        soup = BeautifulSoup(html.text, 'lxml')
        data = soup.select('#g-cfg > div.g-grid2-l > div.g-hospital-items.to-margin > ul > li')
        for item in data:
            hos = list(item.find_all('a', class_='cover-bg seo-anchor-text'))
            hosUrl = hos[0]['monitor-hosp-id']  # hospital ID carried in the anchor tag
            reslist.append(hosUrl)
        # Stop when the page has no next-page link, i.e. this is the last page
        if len(list(soup.find_all('a', class_='next J_pageNext_gh'))) <= 0:
            break
        i = i + 1
    print(reslist)
    return reslist

if __name__ == '__main__':
    urls = getHospitalUrl()
    writeFile(urls)
(2) From each hospital's URL, collect the hospital's doctor information and store it in MySQL.
import requests
from bs4 import BeautifulSoup
import pymysql

db = pymysql.Connect(
    host="localhost",
    port=3306,
    user="root",
    password="123456",
    db="databeijing",
    charset="utf8"
)

# Doctor record
class Doctor:
    name = None        # name
    title = ""         # professional title
    hospital = None    # hospital
    department = None  # department
    clinic = None      # clinic
    score = 0          # rating
    hits = 0           # visit count
    labels = []        # specialty labels
    s_labels = ""
    impression = ""
    goodat = '暂无'    # specialties ('暂无' = none yet)
    about = '暂无'     # biography
    number = ''

def writeToDatabase(doctors, hos_id, cli_id):  # write doctor records to MySQL
    cur = db.cursor()  # get a cursor
    sql = 'INSERT INTO t_doctor_beijing(doc_name, doc_title, doc_hospital_name, doc_department_name, doc_clinic_name, doc_labels, doc_score,' \
          ' doc_hits, doc_impression, doc_goodat, doc_about, doc_hospital_number, doc_clinic_number, doc_number) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
    for doctor in doctors:
        cur.execute(sql, (
            doctor.name, doctor.title, doctor.hospital, doctor.department, doctor.clinic, doctor.s_labels,
            doctor.score, doctor.hits, doctor.impression, doctor.goodat, doctor.about, hos_id, cli_id, doctor.number))
    db.commit()  # commit

################## get_detail_doctor ########################
# Fetch a doctor's detail page by URL and parse the doctor's information
def get_detail_doctor(doc_url, doctors, hospital_name, department_name, clinic_name):
    try:
        strHtml_detail = requests.get(doc_url)
        strHtml_detail.encoding = 'utf-8'
        soup_detail = BeautifulSoup(strHtml_detail.text, 'lxml')
        data_detail = soup_detail.select('#g-cfg > div.grid-group > div > div')  # detail-page blocks
        doc = Doctor()
        info = list(data_detail[0].children)[3]  # basic-information block
        try:
            status = list(list(data_detail[1].children)[3].children)  # rating / visit-count block
        except:
            print('ERROR: status')
        doc.name = info.h1.strong.text  # name
        titles = info.h1.find_all('span')  # titles ('点赞' is the "like" button, skipped)
        for title in titles:
            if title.text.strip() != '点赞':
                doc.title = doc.title + title.text.strip()
        if len(doc.title) <= 0:
            doc.title = '无'  # '无' = none
        doc.hospital = hospital_name
        doc.department = department_name
        doc.clinic = clinic_name
        try:
            if status[1].a.strong.text.strip() == '.0':  # rating
                doc.score = 0
            else:
                doc.score = status[1].a.strong.text.strip()
        except:
            doc.score = 0
        try:
            num = status[3].strong.text.strip()  # visit count ('暂无' = none, '万' = x10000)
            if num == '暂无':
                doc.hits = 0
            elif '万' in num:
                doc.hits = float(num[0:len(num) - 1]) * 10000
            else:
                doc.hits = num
        except:
            try:
                doc.hits = int(list(soup_detail.find_all('p', class_='user-count'))[0].span.text.strip())
            except:
                doc.hits = 0
        lab = list(info.find_all("div", class_="keys")[0].children)  # specialty labels
        i = 0
        if len(lab) > 0:
            for label in lab:
                if i % 2 == 1:  # every other child node is a label tag
                    tmp = label.text.strip()
                    doc.s_labels = doc.s_labels + '/' + tmp
                i = i + 1
        goodat = list(list(info.find_all("div", class_="goodat"))[0].find_all('a'))
        if len(goodat) > 0:
            doc.goodat = goodat[0]['data-description'].strip()  # specialties
        else:
            goodat = list(list(info.find_all("div", class_="goodat"))[0].find_all('span'))
            if len(goodat) > 0:
                doc.goodat = goodat[0].text.strip()  # specialties
        about = list(list(info.find_all("div", class_="about"))[0].find_all('a'))
        if len(about) > 0:
            doc.about = about[0]['data-description'].strip()  # biography
        else:
            about = list(list(info.find_all("div", class_="about"))[0].find_all('span'))
            if len(about) > 0:
                doc.about = about[0].text.strip()  # biography
        data_impression = soup_detail.select('#comment-filter > div:nth-child(1) > a.active > ul > li')
        if len(data_impression) > 0:
            for item in data_impression:
                doc.impression = doc.impression + item.text.strip() + '/'
        # The doctor ID is the URL path after the fixed 30-character prefix
        if '?' in doc_url:
            i = doc_url.index('?')
            doc.number = doc_url[30:i]
        else:
            doc.number = doc_url[30:]
        output_detail_doctor(doc)
        doctors.append(doc)
    except:
        print('ERROR')

############### get_urls_simple ###########################
# Collect the URLs of all doctors under a clinic
# (doctor list fits on one page)
def get_urls_simple(data, doctors, hospital_name, department_name, clinic_name):
    for item in data:
        url_detail = item.dl.dt.a['href']
        get_detail_doctor(url_detail, doctors, hospital_name, department_name, clinic_name)

############## get_urls_more ###########################
# Collect the URLs of all doctors under a clinic
# (list exceeds the first page, so the "more" page must be visited)
def get_urls_more(cli_url, doctors, hospital_name, department_name, clinic_name):
    strHtml = requests.get(cli_url)
    strHtml.encoding = 'utf-8'
    soup = BeautifulSoup(strHtml.text, 'lxml')
    if len(soup.find_all('div', class_='pagers')) > 0:  # paginated doctor list
        i = 1
        while True:
            url_page = cli_url + '?pageNo=' + str(i)
            strHtml = requests.get(url_page)
            strHtml.encoding = 'utf-8'
            soup_page = BeautifulSoup(strHtml.text, 'lxml')
            data = soup_page.select('#g-cfg > div.results > div.list > div')  # doctor list
            for item in data:  # doctors on this page
                try:
                    url_detail = item.div.dl.dt.a['href']
                    get_detail_doctor(url_detail, doctors, hospital_name, department_name, clinic_name)
                except:
                    print('ERROR.100: get_urls_more')
            # Last page reached when no next-page link remains
            if len(soup_page.find_all('a', {'monitor': 'page,page,page_down'})) <= 0:
                break
            i = i + 1
    else:  # no pagination needed
        data = soup.select('#g-cfg > div.results > div.list > div')  # doctor list
        for item in data:  # doctors on this page
            try:
                url_detail = item.div.dl.dt.a['href']
                get_detail_doctor(url_detail, doctors, hospital_name, department_name, clinic_name)
            except:
                print('ERROR.100: get_urls_more')

########################### get_list_doctor ############################
# Collect the URLs of all doctors under a clinic
def get_list_doctor(cli_url, doctors, hospital_name, department_name, clinic_name):
    strHtml = requests.get(cli_url)
    strHtml.encoding = 'utf-8'
    soup = BeautifulSoup(strHtml.text, 'lxml')
    data = soup.select('#anchor > div.g-hddoctor-list.g-clear.js-tab-content > div')  # doctor list
    if soup.find_all('div', class_='more'):
        get_urls_more(soup.find_all('div', class_='more')[0].a['href'], doctors, hospital_name, department_name, clinic_name)
    else:
        get_urls_simple(data, doctors, hospital_name, department_name, clinic_name)

################### output_detail_doctor ##############################
# Print one doctor's details
def output_detail_doctor(doctor):
    print(doctor.name + ' : ' + doctor.hospital + ' : ' + doctor.department + ' : ' + doctor.clinic + ' : ' + str(doctor.number) + ' : ' + str(doctor.hits))
    print('============================================================================================')

#################### output_list_doctors ###########################
# Print details of every doctor in the list
def output_list_doctors(doctors):
    for doctor in doctors:
        output_detail_doctor(doctor)

#################### get_department ###############################
# From a hospital URL, collect every department and every clinic under it
def get_department(hos_id, hospital_name):
    hos_url = 'https://www.guahao.com/hospital/' + hos_id
    strhtml = requests.get(hos_url)
    strhtml.encoding = 'utf-8'
    soup = BeautifulSoup(strhtml.text, 'lxml')
    data = soup.select('#departments > div.grid-content > ul > li')
    for department in data:  # departments
        department_name = department.label.text.strip()  # department name
        print('=================================================================================================')
        print(department_name)
        clinics = department.p.find_all('span')  # clinics under this department
        for clinic in clinics:
            doctors = []  # collected doctor records
            if clinic.a.get('title') == None:
                clinic_name = clinic.a.text.strip()  # clinic name
            else:
                clinic_name = clinic.a.get('title')  # clinic name
            cli_id = clinic.a['monitor-div-id']  # clinic ID
            clinic_url = clinic.a['href']  # clinic URL
            print('    ' + clinic_name)
            get_list_doctor(clinic_url, doctors, hospital_name, department_name, clinic_name)
            writeToDatabase(doctors, hos_id, cli_id)

############# main ######################
if __name__ == '__main__':
    urls = [
        '002d5d5e-b917-4c5a-8fc0-f394c293fd0a000',  # ID of the hospital to crawl
    ]
    BaseUrl = 'https://www.guahao.com/hospital/'
    for hos_id in urls:
        url_hospital = BaseUrl + hos_id
        strhtml = requests.get(url_hospital)
        strhtml.encoding = 'utf-8'
        soup = BeautifulSoup(strhtml.text, 'lxml')
        data = soup.select('#g-cfg > div.grid-group > section > div.info > div.detail.word-break')
        hospital_name = data[0].h1.strong.a.text.strip()  # hospital name
        print(hospital_name)
        get_department(hos_id, hospital_name)