Brother Cat Teaches You Web Scraping 033--First Look at Scraping-BeautifulSoup-Homework

BeautifulSoup parsers

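| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(text, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(text, "lxml") | Fast; tolerant of malformed documents | Requires an external C library |
| lxml XML parser | BeautifulSoup(text, "xml") | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | BeautifulSoup(text, "html5lib") | Generates valid HTML5 documents | Slow; requires the external html5lib package |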
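
Because lxml and html5lib are optional installs, a script that hard-codes one of them fails on a machine that lacks it. The sketch below shows one possible way to fall back to the built-in parser. The helper name `make_soup` and its `parsers` argument are made up for this illustration; they are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    `make_soup` is a hypothetical helper for this post, not a bs4 API.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not available here; try the next one.
            continue
    raise FeatureNotFound(
        "none of the requested parsers is installed: " + ", ".join(parsers)
    )


soup = make_soup("<p>Hello, <b>parser</b></p>")
print(soup.b.text)  # parser
```

Different parsers can build different trees from broken HTML, so a fallback like this is only reasonable when the pages are well formed enough that the choice of parser does not change what `find` returns.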

Homework 1: Scrape the articles and save them locally (one HTML file per article)

wordpress-edu-3autumn.localprod.forc.work

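import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://wordpress-edu-3autumn.localprod.forc.work/'
# Parse the index page; each post title is an <h2 class="entry-title"> wrapping a link.
index = BeautifulSoup(requests.get(BASE_URL).text, 'html.parser')
for heading in index.find_all('h2', class_='entry-title'):
    link = heading.find('a')
    title = link.text.strip()
    print(title)
    # Fetch the article before opening the file, so a failed request leaves no empty file.
    # The 'lxml' parser requires the lxml package to be installed.
    article = BeautifulSoup(requests.get(link['href']).text, 'lxml')
    with open('{}.html'.format(title), 'w', encoding='utf8') as file:
        file.write(str(article.find('div', class_='entry-content')))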
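
The script above uses each post title directly as a file name. Titles can contain characters such as `/`, `:` or `?`, which are not valid in file names on some systems. A minimal sketch of one way to clean a title first, using only the standard library; the function name `safe_filename` and its defaults are assumptions made for this example.

```python
import re


def safe_filename(title, extension=".html", max_length=100):
    """Turn an arbitrary post title into a file name (hypothetical helper)."""
    # Replace path separators, Windows-reserved characters and control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title)
    # Leading or trailing spaces and dots cause trouble on Windows.
    cleaned = cleaned.strip(" .")
    if not cleaned:
        cleaned = "untitled"
    return cleaned[:max_length] + extension


print(safe_filename("What/Why: a post?"))  # What_Why_ a post_.html
```

In the homework script this would be passed to `open()` in place of the plain `'{}.html'.format(...)` call.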

Homework 2: Scrape the book titles and prices under each category and save them to books.txt

books.toscrape.com

Final result (the code, followed by a sample of books.txt):

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://books.toscrape.com/'
home = BeautifulSoup(requests.get(BASE_URL).text, 'html.parser')
with open('books.txt', 'w', encoding='utf8') as file:
    # The nested <ul> inside the sidebar holds one <li> per category.
    for category in home.find('ul', class_='nav nav-list').find('ul').find_all('li'):
        file.write(category.text.strip() + '\n')
        res = requests.get(BASE_URL + category.find('a')['href'])
        res.encoding = 'utf8'  # decode as UTF-8 so the pound sign in prices is read correctly
        page = BeautifulSoup(res.text, 'html.parser')
        # Each book is one <li> grid cell; only the first page of a category is read.
        for book in page.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3'):
            title = book.find('h3').find('a')['title']
            price = book.find('p', class_='price_color').text
            print(title)
            file.write('\t"{}" {}\n'.format(title, price))
Travel
	"It's Only the Himalayas" £45.17
	"Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond" £49.43
	"See America: A Celebration of Our National Parks & Treasured Sites" £48.87
	"Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel" £36.94
	"Under the Tuscan Sun" £37.33
	"A Summer In Europe" £44.34
	"The Great Railway Bazaar" £30.54
	"A Year in Provence (Provence #1)" £56.88
	"The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2)" £23.21
	"Neither Here nor There: Travels in Europe" £38.95
	"1,000 Places to See Before You Die" £26.08
Mystery
	"Sharp Objects" £47.82
	"In a Dark, Dark Wood" £19.63
	"The Past Never Ends" £56.50
	"A Murder in Time" £16.64
	"The Murder of Roger Ackroyd (Hercule Poirot #4)" £44.10
	"The Last Mile (Amos Decker #2)" £54.21
	"That Darkness (Gardiner and Renner #1)" £13.92
	"Tastes Like Fear (DI Marnie Rome #3)" £10.69
	"A Time of Torment (Charlie Parker #14)" £48.35
	"A Study in Scarlet (Sherlock Holmes #1)" £16.73
	"Poisonous (Max Revere Novels #3)" £26.80
	"Murder at the 42nd Street Library (Raymond Ambler #1)" £54.36
	"Most Wanted" £35.28
	"Hide Away (Eve Duncan #20)" £11.84
	"Boar Island (Anna Pigeon #19)" £59.48
	"The Widow" £27.26
	"Playing with Fire" £13.71
	"What Happened on Beale Street (Secrets of the South Mysteries #2)" £25.37
	"The Bachelor Girl's Guide to Murder (Herringford and Watts Mysteries #1)" £52.30
	"Delivering the Truth (Quaker Midwife Mystery #1)" £20.89
Historical Fiction
	"Tipping the Velvet" £53.74
	"Forever and Forever: The Courtship of Henry Longfellow and Fanny Appleton" £29.69
	"A Flight of Arrows (The Pathfinders #2)" £55.53
	"The House by the Lake" £36.95
	"Mrs. Houdini" £30.25
	"The Marriage of Opposites" £28.08
	"Glory over Everything: Beyond The Kitchen House" £45.84
	"Love, Lies and Spies" £20.55
	"A Paris Apartment" £39.01
	"Lilac Girls" £17.28
	"The Constant Princess (The Tudor Court #1)" £16.62
	"The Invention of Wings" £37.34
	"World Without End (The Pillars of the Earth #2)" £32.97
	"The Passion of Dolssa" £28.32
	"Girl With a Pearl Earring" £26.77
	"Voyager (Outlander #3)" £21.07
	"The Red Tent" £35.66
	"The Last Painting of Sara de Vos" £55.55
	"The Guernsey Literary and Potato Peel Pie Society" £49.53
	"Girl in the Blue Coat" £46.83
......
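
In the sample output, Mystery and Historical Fiction each stop at exactly 20 titles, which suggests that only the first page of each category was read. Below is a sketch of how the next page could be located. It assumes the pager marks its link with `<li class="next">`, as in the sample markup; check this against the real page before relying on it. The function name `next_page_url` is made up for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page.

    Assumes the pager looks like <li class="next"><a href="...">next</a></li>.
    """
    soup = BeautifulSoup(html, "html.parser")
    item = soup.find("li", class_="next")
    link = item.find("a") if item is not None else None
    if link is None or not link.get("href"):
        return None
    # The href is relative to the current page, so resolve it against that URL.
    return urljoin(current_url, link["href"])


sample = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
print(next_page_url(
    sample,
    "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
))
# http://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html
```

A category loop could then keep requesting pages while `next_page_url` returns a URL, writing the books from each page before moving on.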


Reposted from: https://juejin.im/post/5cfc4adb51882512a675faf0


Reposted from blog.csdn.net/weixin_34296641/article/details/93180082