Introduction to the Scrapy framework:
Writing a crawler involves a lot of work: sending network requests, parsing data, storing data, countering anti-crawler mechanisms (switching IP proxies, setting request headers, and so on), and making asynchronous requests. Writing all of this from scratch every time wastes a lot of effort, so Scrapy packages these basics up well, and writing crawlers on top of it becomes much more efficient (in both crawling efficiency and development efficiency). In practice, companies dealing with crawling at any real scale use the Scrapy framework to solve it.
Scrapy framework modules and their functions:

- Scrapy Engine: the core of the Scrapy framework. It is responsible for the communication and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
- Spider: sends the links to be crawled to the engine; when the data requested from the other modules finally comes back, the engine passes it to the spider, and the spider parses out the data it wants. This is the part we developers write ourselves, because which links to crawl and which data on a page we need are decided by the programmer.
- Scheduler: receives the requests sent over by the engine, arranges and organizes them in a certain order, and is responsible for scheduling those requests.
- Downloader: receives the download requests passed over by the engine, downloads the corresponding data from the network, and returns it to the engine.
- Item Pipeline: saves the data passed over from the Spider. Where exactly to save it depends on the developer's own needs.
- Downloader Middlewares: middleware that can extend the communication between the engine and the downloader.
- Spider Middlewares: middleware that can extend the communication between the engine and the spider.
Scrapy architecture diagram: see the data-flow figure in the official documentation.
Scrapy Quick Start
Installation and documentation:
- Installation: install with `pip install scrapy`.
- Scrapy official documentation: http://doc.scrapy.org/en/latest
- Scrapy Chinese documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
Note:

- On Ubuntu, you need to install the following dependencies before installing scrapy: `sudo apt-get install python3-dev build-essential python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev`, and then install it with `pip install scrapy`.
- If Windows reports the error `ModuleNotFoundError: No module named 'win32api'`, it can be resolved with the following command: `pip install pypiwin32`.
Getting Started:
Create a project:
To create a project with the Scrapy framework, you need to use a command. First, change into the directory where you want the project stored, then create it with the following command:
scrapy startproject [project_name]
Directory Structure Description:
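For example, a project created with `scrapy startproject qsbk` produces a skeleton like the one below; this is the standard layout scrapy generates, with the top-level names following the project name:

```
qsbk/
├── scrapy.cfg
└── qsbk/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```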
The roles of the key files are described below:
- items.py: defines the models used to store the data the crawler scrapes.
- middlewares.py: stores the various middleware.
- pipelines.py: stores the `items` models, for example saving them to local disk.
- settings.py: the crawler's configuration (request headers, how often to send requests, the IP proxy pool, and so on).
- scrapy.cfg: the project configuration file.
- spiders package: all spiders are stored inside it.
Using the Scrapy framework to crawl Qiushibaike jokes:
1. Create a spider with the command: scrapy genspider qsbk "qiushibaike.com"
This creates a spider named qsbk, restricted to crawling pages under the qiushibaike.com domain, and places the generated file in the project's spiders package.
Spider code analysis:

```python
import scrapy


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        pass
```
In fact, we could write this code by hand instead of using the command; the command just saves us the trouble. To create a Spider, you must define a class that inherits from scrapy.Spider, and then define three attributes and one method in that class:
- name: the spider's name. The name must be unique.
- allowed_domains: the allowed domain names. The spider only crawls pages under these domains; pages outside them are automatically ignored.
- start_urls: the URLs the spider starts crawling from.
- parse: the engine throws the data fetched by the downloader back to the spider, and the spider receives it through the parse method. This signature is a fixed convention. The method has two roles: the first is to extract the desired data; the second is to generate the URLs for the next requests (see the sketch after this list).
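A minimal sketch of those two roles; the site, XPath, and CSS selectors here are illustrative assumptions, not taken from a real page:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Role 1: extract the desired data and hand it to the engine.
        for title in response.xpath('//h2/text()').getall():
            yield {'title': title}

        # Role 2: generate the next request; the engine schedules it
        # and its response comes back to this same parse method.
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```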
2. Modify the settings.py code:
Before running the spider, remember to change the settings in settings.py. Two settings are strongly recommended:
- ROBOTSTXT_OBEY: set it to False. The default is True, which means the crawler obeys the robots protocol: when crawling, scrapy first fetches the site's robots.txt file and stops crawling anything the protocol does not allow.
- DEFAULT_REQUEST_HEADERS: add a User-Agent. This tells the server that the request is a normal browser request, not a crawler (see the sketch after this list).
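A sketch of the two changes in settings.py; the User-Agent string here is just an example browser header:

```python
# settings.py

# Stop obeying robots.txt rules.
ROBOTSTXT_OBEY = False

# Send browser-like headers with every request.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}
```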
3. The complete spider code:
The spider itself:

```python
import scrapy
from abcspider.items import QsbkItem


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        outerbox = response.xpath("//div[@id='content-left']/div")
        items = []
        for box in outerbox:
            author = box.xpath(".//div[contains(@class,'author')]//h2/text()").extract_first().strip()
            content = box.xpath(".//div[@class='content']/span/text()").extract_first().strip()
            item = QsbkItem()
            item["author"] = author
            item["content"] = content
            items.append(item)
        return items
```
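Equivalently, parse can yield each item as soon as it is built instead of collecting them into a list; this is the more common Scrapy idiom and delivers the same items to the pipeline:

```python
    # Alternative parse method for the QsbkSpider above.
    def parse(self, response):
        for box in response.xpath("//div[@id='content-left']/div"):
            item = QsbkItem()
            item["author"] = box.xpath(".//div[contains(@class,'author')]//h2/text()").extract_first().strip()
            item["content"] = box.xpath(".//div[@class='content']/span/text()").extract_first().strip()
            yield item
```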
items.py:

```python
import scrapy


class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
```
pipelines.py:

```python
import json


class AbcspiderPipeline(object):
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Collect every item the spider yields.
        self.items.append(dict(item))
        print("=" * 40)
        return item

    def close_spider(self, spider):
        # When the spider closes, dump all collected items to a JSON file.
        with open('qsbk.json', 'w', encoding='utf-8') as fp:
            json.dump(self.items, fp, ensure_ascii=False)
```
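Note that the pipeline only runs if it is enabled in settings.py. A minimal sketch, assuming the project package is named abcspider as the imports above suggest:

```python
# settings.py

# The number (0-1000) sets the order when several pipelines are enabled.
ITEM_PIPELINES = {
    'abcspider.pipelines.AbcspiderPipeline': 300,
}
```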
4. Run the scrapy project:
To run a scrapy project, enter the directory where the project lives in the terminal, and then run scrapy crawl [spider_name] to start the specified spider.
If you do not want to type the command on the command line every time, you can write the command into a file and then run that file from pycharm. For example, create a new file named start.py and put the following code in it:
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl qsbk".split())
```
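cmdline.execute receives the same argument list that the scrapy command-line tool would, so running start.py from the project root is equivalent to typing scrapy crawl qsbk in the terminal.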