Web crawlers: a crawler is an automated program that extracts information from website content. Crawlers are widely used in search engines, data mining, and other fields.
The basic workflow of a web crawler: download a page → extract data from the page → extract links from the page.
Scrapy is an open source web crawler framework written in Python. Its features: easy to use, cross-platform, flexible, and easy to extend.
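The three steps of this workflow can be sketched as a simple crawl loop. The following is a minimal, illustrative sketch in plain Python; `fetch` and the `PAGES` dictionary are hypothetical stand-ins for a real HTTP download, used here only so the loop structure is visible:

```python
import re
from collections import deque

# Hypothetical stand-in for the network: a real crawler would issue
# HTTP requests here instead of reading from a dictionary.
PAGES = {
    'http://example.com/': '<a href="http://example.com/a">A</a>',
    'http://example.com/a': '<p>leaf page</p>',
}

def fetch(url):
    return PAGES.get(url, '')

def crawl(start_url):
    seen, queue, results = set(), deque([start_url]), {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)                                  # 1. download the page
        results[url] = html                                # 2. extract/store data
        for link in re.findall(r'href="([^"]+)"', html):   # 3. extract links
            queue.append(link)
    return results

pages = crawl('http://example.com/')
print(sorted(pages))
```

The `seen` set prevents re-downloading a page reached by two different links; frameworks like Scrapy perform this deduplication for you.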
Installation
Environment: macOS 10.14, Python 3

pip3 install scrapy

After a successful installation, run scrapy -h to see the available commands.
Creating a project

scrapy startproject tutorial

This generates the following files:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Writing your first spider
A spider is, in fact, just a class. Create the file quotes_spider.py in the tutorial/spiders directory:
import scrapy

class QuotesSpider(scrapy.Spider):
    # spider name; must be unique
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
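The filename logic in parse can be traced by hand: splitting the URL on "/" and taking the second-to-last piece yields the page number, because the trailing slash makes the last piece an empty string.

```python
url = 'http://quotes.toscrape.com/page/1/'
parts = url.split("/")
# ['http:', '', 'quotes.toscrape.com', 'page', '1', '']
page = parts[-2]
filename = 'quotes-%s.html' % page
print(filename)  # quotes-1.html
```

So the two URLs in start_requests produce quotes-1.html and quotes-2.html.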
From the root directory of the project, run scrapy crawl quotes. The console displays the crawl progress.
Result: two new HTML files appear, which means the pages were crawled successfully.
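This spider only saves raw HTML; the "extract data" step of the workflow still remains. In a real Scrapy project you would use response.css(...) or response.xpath(...) for that, but the idea can be sketched with the standard library's HTMLParser on a snippet shaped like the quotes site's markup (the class names and sample text below are assumptions for illustration):

```python
from html.parser import HTMLParser

# Snippet shaped like quotes.toscrape.com markup (structure assumed).
HTML = '''
<div class="quote">
  <span class="text">Simplicity is the ultimate sophistication.</span>
</div>
<div class="quote">
  <span class="text">Stay hungry, stay foolish.</span>
</div>
'''

class QuoteExtractor(HTMLParser):
    """Collects the text of every <span class="text"> element."""
    def __init__(self):
        super().__init__()
        self.in_text = False
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and ('class', 'text') in attrs:
            self.in_text = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_text = False

    def handle_data(self, data):
        if self.in_text:
            self.quotes.append(data)

parser = QuoteExtractor()
parser.feed(HTML)
print(parser.quotes)
```

Scrapy's selectors do the same job far more concisely, e.g. response.css('span.text::text').getall().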
Reference: https://docs.scrapy.org/en/1.6/intro/tutorial.html
Reproduced from: https://www.jianshu.com/p/90ded0d8787f