Scrapy is an excellent crawler framework. With Scrapy you can easily build a powerful crawler system: you only need to focus on the crawling rules and on how to process the captured data. This article introduces Scrapy fundamentals and some advanced applications through hands-on examples.
1. Scrapy basics
1.1 Introduction to Scrapy
Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl web sites and extract structured data from pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing. Scrapy mainly consists of the following 6 parts.
- Scrapy Engine: processes the data flow of the entire system and triggers various events.
- Scheduler: takes the next URL to fetch from the URL queue.
- Downloader: downloads web resources from the Internet.
- Spiders: receive the raw data from the downloader for further processing, for example extracting the information of interest with XPath.
- Item Pipeline: receives data from the spiders for further processing, for example saving it to a database or a text file.
- Middleware: the Scrapy framework contains many middlewares, such as downloader middleware and spider middleware. Middlewares act like filters sandwiched between the other parts, intercepting the data stream and applying special processing.
The workflow of the above parts can be described using the process shown in the figure below.
The process can be described as follows:
- The spider builds Requests objects from the configured start URLs ⇒ spider middleware ⇒ engine ⇒ scheduler
- The scheduler sends Requests ⇒ engine ⇒ downloader middleware ⇒ downloader
- The downloader sends the request and obtains Responses ⇒ downloader middleware ⇒ engine ⇒ spider middleware ⇒ spider
- The spider extracts URL addresses and assembles them into Requests ⇒ spider middleware ⇒ engine ⇒ scheduler, repeating step 2
- The spider extracts data ⇒ engine ⇒ item pipeline, which processes and saves the data
Note:
- The Chinese in the figure was added for ease of understanding
- The green lines in the figure represent the transfer of data
- Note the position of the middlewares in the figure, which determines their role
- Note the position of the engine: all the other modules are independent of one another and interact only with the engine
The specific role of each module in Scrapy:
1.2 Scrapy installation and configuration
Before using Scrapy you need to install it. If you use the Anaconda Python development environment, you can install Scrapy with the following command.
conda install scrapy
If you use a standard Python development environment, you can install Scrapy with the following command.
# On Windows the install command is as follows; add --user to avoid insufficient user permissions:
pip install --user -i http://pypi.douban.com/simple --trusted-host pypi.douban.com Scrapy
On every platform we recommend installing Scrapy in a virtual environment. The author takes Windows as an example; the steps are as follows:
(1) Create a new virtual environment
(2) Install Scrapy in the virtual environment
After installation, enter the following statement; if no exception is thrown, Scrapy has been installed successfully.
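The statement appears in the original as a figure; a minimal equivalent check (assuming the virtual environment is active) is simply to import the package:

import scrapy
print(scrapy.version_info)  # e.g. (2, 11, 2); no exception means Scrapy is installed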
1.3 Grabbing web resources with Scrapy Shell
Scrapy provides a shell that is the equivalent of a Python REPL environment; you can use the Scrapy Shell to test Scrapy code. On Windows, open a command-line window and execute the scrapy shell command to enter the Scrapy Shell.
The Scrapy Shell is an environment similar to the Python REPL: any Python code can be executed in it, with Scrapy support added on top. For example, entering 10 + 20 in the Scrapy Shell and pressing Enter outputs 30, as shown below:
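The figure is not reproduced here; the session looks roughly like this (the prompt style depends on whether IPython is installed):

In [1]: 10 + 20
Out[1]: 30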
Scrapy mainly uses XPath to filter the content of HTML pages. So what is XPath? It is a path-based technique for filtering code such as HTML; XPath is discussed in more detail later. There is no need to know the details of XPath yet, because Chrome can automatically generate the XPath for an HTML node.
Now let's get a first taste of XPath. Start the Chrome browser, go to the Taobao home page, and choose the Inspect (检查) command from the page's context menu. In the debug window that pops up, select the Elements tab, click the black arrow button at the left of the Elements toolbar, and move the mouse over the 聚划算 item in the Taobao home navigation bar, as shown below.
At this point, the HTML code in the Elements tab is automatically positioned at the tag containing 聚划算. Then choose Copy ⇒ Copy XPath from the right-click menu, as shown in the figure, to copy the XPath of the current tag.
Obviously, the tag containing the 聚划算 text is an a tag, and the copied XPath of the a tag is as follows:
/html/body/div[3]/div/ul[1]/li[2]/a
From this XPath expression you can basically guess how XPath works: it walks down the tag hierarchy level by level until it reaches the final a tag. A step such as li[...] means the parent tag contains more than one li tag, and the index in [...] starts from 1.
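A minimal sketch that reproduces this 1-based indexing with Scrapy's own Selector class (the HTML here is illustrative, not the real Taobao markup):

from scrapy.selector import Selector

html = '<ul><li><a>天猫</a></li><li><a>聚划算</a></li><li><a>天猫超市</a></li></ul>'
sel = Selector(text=html)
# li[2] selects the second li tag -- XPath indexes start at 1, not 0
print(sel.xpath('//ul/li[2]/a/text()').extract()[0])  # 聚划算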
You can now test this XPath in Chrome: click the Console tab and enter the following code in the Console to filter out the a tag that contains 聚划算.
$x('/html/body/div[3]/div/ul[1]/li[2]/a')
If you want to filter out the 聚划算 text contained in the a tag, use XPath's text() function.
$x('/html/body/div[3]/div/ul[1]/li[2]/a/text()')
The figure shows the result of executing this code in the Console. It is not expanded here, because Chrome lists a lot of auxiliary information, most of which is not very useful.
To test this in the Scrapy Shell, restart the Scrapy Shell with the following command.
scrapy shell https://www.taobao.com
In the Scrapy Shell, use the response.xpath method to test the XPath.
response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').extract()
The above code outputs a list. If you want to return the 聚划算 string directly, use the following code:
response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').extract()[0]
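Note that indexing with [0] raises IndexError when nothing matches; depending on your Scrapy version, extract_first() is a safer alternative because it returns None instead:

# returns '聚划算', or None if the XPath matches nothing
response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').extract_first()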
From the code surrounding the a tag that contains 聚划算, you can see that li[1] corresponds to 天猫 and li[3] to 天猫超市, so the following two lines of code extract 天猫 and 天猫超市 respectively.
# outputs "天猫"
response.xpath('/html/body/div[3]/div/ul[1]/li[1]/a/text()').extract()[0]
# outputs "天猫超市"
response.xpath('/html/body/div[3]/div/ul[1]/li[3]/a/text()').extract()[0]
Entering the above 4 statements in the Scrapy Shell produces the output shown below:
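The output appears in the original as a figure; based on the values quoted in the text, the session looks roughly like this (actual results depend on the live Taobao page):

In [1]: response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').extract()
Out[1]: ['聚划算']

In [2]: response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').extract()[0]
Out[2]: '聚划算'

In [3]: response.xpath('/html/body/div[3]/div/ul[1]/li[1]/a/text()').extract()[0]
Out[3]: '天猫'

In [4]: response.xpath('/html/body/div[3]/div/ul[1]/li[3]/a/text()').extract()[0]
Out[4]: '天猫超市'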
2. Use Scrapy to write web crawlers
2.1 Creating and using a Scrapy project
The Scrapy framework provides the scrapy command for creating Scrapy projects. You can use the following command to create a Scrapy project named myscrapy.
scrapy startproject myscrapy
The crawler file is then created through a command. The crawler file is the main working code file: the crawling logic for a website is usually written in it. The commands are as follows:
cd myscrapy
scrapy genspider first_spider www.jd.com
The generated directories and files are as follows:
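The original shows the result in a figure; a freshly generated project typically has the following layout (minor differences are possible across Scrapy versions):

myscrapy/
    scrapy.cfg            # deployment configuration
    myscrapy/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            first_spider.py   # generated by scrapy genspider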
The genspider command generates a first_spider.py script file in the spiders directory. It is a Spider program that specifies the URLs of the web resources to crawl. The sample code is as follows:
import scrapy

class FirstSpiderSpider(scrapy.Spider):
    name = 'first_spider'  # the Spider's name; this name is needed to start it with Scrapy
    allowed_domains = ['www.jd.com']
    start_urls = ['http://www.jd.com/']  # the URLs of the web resources to crawl

    # called once for every crawled URL; the response parameter can run XPath filters over the tags
    def parse(self, response):
        # output a log message
        self.log('hello world')
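To preview what parse is typically used for, here is a minimal sketch (the XPath selector is illustrative, not part of the generated code) in which parse extracts the page title and hands it to the item pipeline:

    def parse(self, response):
        # extract the page title with XPath and yield it as a dict item
        title = response.xpath('//title/text()').extract_first()
        yield {'title': title}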
Now, in a terminal, go to the top-level myscrapy directory, then execute the following command to run Scrapy.
scrapy crawl first_spider
The result of execution is shown in the figure below:
After Scrapy runs, hello world appears among the debug messages in the output, which shows that the parse method ran and therefore that the web resource at the specified URL was fetched successfully.
2.2 Debugging Scrapy source code in PyCharm
To run and debug the web crawler directly from Python, create a main.py file in the myscrapy root directory (the file name can be anything), then enter the following code.
from scrapy.cmdline import execute
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# to run a different crawler, just change the command strings below
execute(["scrapy", "crawl", "first_spider"])
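One advantage of this approach, by the way, is that the crawler now starts from an ordinary Python script: you can set breakpoints in first_spider.py (for example inside the parse method) and launch main.py with PyCharm's debugger to step through the crawl.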
Now run the main.py script file. PyCharm's Run window shows the output in the figure, and hello world can again be seen in the log output.
2.3 Using external tools to run Scrapy in PyCharm
Section 2.2 prepared a main.py file for running the Scrapy program; in essence, it simply executes the scrapy command. But having to write a main.py file into the PyCharm project every time a Scrapy project is created is troublesome. To run Scrapy programs in PyCharm more conveniently, you can use a PyCharm external tool that runs the scrapy command.
PyCharm's external tools allow external commands to be executed from within PyCharm with a click. First choose the File ⇒ Settings command in PyCharm to open the Settings dialog box.
Click the Tools ⇒ External Tools node on the left, and a list of external tools will be displayed on the right, as shown in the figure below:
After clicking the + button, the Create Tool dialog box shown below will pop up.
In the Create Tool dialog box, you usually need to fill in the following fields:
- Name: the name of the external tool, in this case runscrapy; any other name works as well.
- Description: a description of the external tool; it can be filled in freely and is equivalent to a comment in a program.
- Program: the program to execute, in this case C:\Users\AmoXiang\Envs\spider\Scripts\scrapy.exe, which points to the absolute path of the scrapy command. Readers should change this to the path of the scrapy file on their own machine.
- Arguments: the command-line arguments passed to the program, in this case crawl $FileNameWithoutExtension$, where $FileNameWithoutExtension$ is a PyCharm environment variable representing the name of the currently selected file without its extension; if the currently selected file is first_spider.py, the value of $FileNameWithoutExtension$ is first_spider.
- Working directory: the working directory, in this case $FileDir$/../.., where $FileDir$ represents the directory of the currently selected file. Since all the crawler code of a Scrapy project lives in the spiders directory, you need to select a crawler script file (a .py file) in the spiders directory in order to run it with the external tool. In the project generated by scrapy, the spiders directory sits at the innermost level, so the working directory is usually set two levels up; hence the value of Working directory is $FileDir$/../.. rather than $FileDir$/.. or $FileDir$.
After adding the external tool, select a crawler file in the spiders directory, such as first_spider.py, then choose the External Tools ⇒ runscrapy command from the context menu to run first_spider.py; the console will output the same information as before.