Basic workflow of a web scraper

1. Load the page into a BeautifulSoup object

from bs4 import BeautifulSoup

with open('D:/xxxxx/the_blah.html', 'r') as web_data:
    soup = BeautifulSoup(web_data, 'lxml')  # parse the local HTML file with the lxml parser

2. Select the page elements

images = soup.select('body > div.main-content > ul > li > img')
titles = soup.select('body > div.main-content > ul > li > h3 > a')
infos = soup.select('body > div.main-content > ul > li > p')
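The selectors above use `>`, the CSS child combinator, which matches direct children only. The sketch below illustrates the difference from the descendant combinator (a space). The HTML snippet is made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <p> directly under <li>, one nested deeper inside a <div>
html = '<div class="main-content"><ul><li><p>direct</p><div><p>nested</p></div></li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

# '>' only matches a <p> that is a direct child of <li>
direct = soup.select('div.main-content > ul > li > p')
# a space matches a <p> at any depth below <li>
any_depth = soup.select('div.main-content li p')

direct_texts = [p.get_text() for p in direct]
any_depth_texts = [p.get_text() for p in any_depth]
print(direct_texts)     # ['direct']
print(any_depth_texts)  # ['direct', 'nested']
```

select() always returns a list, so an empty list means the selector matched nothing.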

3. Extract the specific information from each element

for image, title, info in zip(images, titles, infos):
    data = {
        'title': title.get_text(),  # text content of the tag
        'image': image.get('src'),  # value of the tag's src attribute
        'info': info.get_text()
    }
    print(data)
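The three steps can be put together into one runnable sketch. The sample HTML, the extract_items helper, and the use of 'html.parser' in place of 'lxml' are assumptions made here so the example runs without a local file or extra parser; the selectors are the ones used above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the local HTML file
SAMPLE_HTML = """
<html><body>
<div class="main-content">
  <ul>
    <li><img src="images/0001.jpg"><h3><a href="#">First article</a></h3><p>Short description one</p></li>
    <li><img src="images/0002.jpg"><h3><a href="#">Second article</a></h3><p>Short description two</p></li>
  </ul>
</div>
</body></html>
"""

def extract_items(html):
    """Parse the page, select the elements, and build one dict per list item."""
    # Step 1: load the page into a BeautifulSoup object
    soup = BeautifulSoup(html, 'html.parser')
    # Step 2: select the elements with CSS selectors
    images = soup.select('body > div.main-content > ul > li > img')
    titles = soup.select('body > div.main-content > ul > li > h3 > a')
    infos = soup.select('body > div.main-content > ul > li > p')
    # Step 3: pull out the text and attributes of each element
    return [
        {'title': title.get_text(), 'image': image.get('src'), 'info': info.get_text()}
        for image, title, info in zip(images, titles, infos)
    ]

items = extract_items(SAMPLE_HTML)
for item in items:
    print(item)
```

Note that zip() stops at the shortest list, so if one selector matches fewer elements than the others, the extra items are dropped silently. Comparing the lengths of the three lists first is a cheap sanity check.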


Reposted from www.cnblogs.com/onlyhold/p/8997594.html