Scrapy framework in practice (1): crawling well-known technical article websites


Scrapy is an excellent crawler framework. With Scrapy you can easily build a powerful crawler system and focus only on the crawling rules and on how to process the captured data. This article introduces introductory Scrapy knowledge and some advanced applications through hands-on practice.

1. Scrapy basics

1.1 Introduction to Scrapy

Scrapy is a fast, high-level screen scraping and web crawling framework for Python, used to crawl websites and extract structured data from their pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing. Scrapy mainly consists of the following six parts.

  1. Scrapy Engine: processes the data flow of the entire system and triggers various events.
  2. Scheduler: fetches a URL from the URL queue.
  3. Downloader: downloads web resources from the Internet.
  4. Spiders (crawlers): receive raw data from the downloader for further processing, for example using XPath to extract the information of interest.
  5. Item Pipeline: receives data from the spiders for further processing, for example saving it to a database or a text file (see the sketch after this list).
  6. Middleware: the Scrapy framework contains many middlewares, such as downloader middleware and spider middleware. These middlewares act like filters sandwiched between the different parts, intercepting the data flow and applying special processing to it.
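
To make the Item Pipeline concrete: a pipeline is just a class with a process_item method that Scrapy calls for every item the spiders produce. Below is a minimal sketch, assuming the items can be converted to dictionaries; the class name and the output file name items.jl are illustrative. To activate such a pipeline, it must also be listed in the project's ITEM_PIPELINES setting.

import json


class JsonWriterPipeline:
    """A minimal pipeline sketch: append every item to a JSON-lines file."""

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item; must return the item (or raise DropItem)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item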

The workflow of the above parts can be described using the process shown in the figure below.

[Figure: Scrapy architecture and data flow]
The process can be described as follows:

  1. The spider constructs Requests from its start URLs ⇒ spider middleware ⇒ engine ⇒ scheduler
  2. The scheduler returns Requests ⇒ engine ⇒ downloader middleware ⇒ downloader
  3. The downloader sends the request and obtains a Response ⇒ downloader middleware ⇒ engine ⇒ spider middleware ⇒ spider
  4. The spider extracts URLs and assembles new Requests ⇒ spider middleware ⇒ engine ⇒ scheduler, then step 2 repeats
  5. The spider extracts data ⇒ engine ⇒ item pipeline, which processes and saves the data

Note:

  1. The Chinese in the figure was added for ease of understanding
  2. The green lines in the figure represent the transfer of data
  3. Pay attention to the position of the middleware in the figure, which determines its role
  4. Pay attention to the position of the engine: all the other modules are independent of each other and interact only with the engine

The specific role of each module in Scrapy:

[Figure: the specific role of each Scrapy module]

1.2 Scrapy installation and configuration

Scrapy documentation: https://docs.scrapy.org/

Before using Scrapy, you need to install it. If the reader uses the Anaconda Python development environment, Scrapy can be installed with the following command.

conda install scrapy

If the reader uses a standard Python development environment, Scrapy can be installed with the following command.

# The Windows install command is as follows; add --user in case the user lacks sufficient permissions:
pip install --user -i http://pypi.douban.com/simple --trusted-host pypi.douban.com Scrapy

It is recommended to install Scrapy in a virtual environment on all platforms. The author uses Windows as an example here; the steps are as follows:

(1) Create a new virtual environment

[Figure: creating a new virtual environment]
(2) Install Scrapy in the virtual environment

[Figure: installing Scrapy in the virtual environment]
After installation, enter the following statement; if no exception is thrown, Scrapy has been installed successfully.

[Figure: verifying the Scrapy installation]
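
The statement in the figure is simply an import of Scrapy. A minimal check, run from the Python interpreter inside the virtual environment, might look like this (printing the version is optional but confirms which release was installed):

import scrapy

# if the import raises no exception, Scrapy is installed; print the version to confirm
print(scrapy.__version__)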

1.3 Grabbing web resources with the Scrapy Shell

Scrapy provides a Shell that is equivalent to a Python REPL environment, and you can use this Scrapy Shell to test Scrapy code. On Windows, open a command window and execute the scrapy shell command to enter the Scrapy Shell.

[Figure: entering the Scrapy Shell]

The Scrapy Shell is similar to a Python REPL environment: any Python code can be executed in it, with Scrapy support added on top. For example, enter 10 + 20 in the Scrapy Shell and press Enter, and 30 is output, as shown below:

[Figure: evaluating 10 + 20 in the Scrapy Shell]

Scrapy mainly uses XPath to filter the content of HTML pages. So what is XPath? It is a path-like technology for filtering HTML code. XPath is discussed in more detail later, so for now there is no need to know its details, because Chrome can automatically generate the XPath of a node in the HTML code.

Now let's get a first feel for what XPath is. Start the Chrome browser, go to the Taobao home page, click the 检查 (Inspect) command in the page's context menu, select the Elements tab in the pop-up debug window, then click the black arrow button to the left of Elements and move the mouse over 聚划算 in the Taobao home-page navigation bar, as shown below.

[Figure: inspecting the 聚划算 element with Chrome DevTools]
At this point, the HTML code in the Elements tab is automatically positioned at the tag containing 聚划算. Then select Copy ⇒ Copy XPath from the right-click menu, as shown in the figure, which copies the XPath of the current tag.

[Figure: the Copy ⇒ Copy XPath menu command]
Obviously, the text 聚划算 is contained in an a tag. The copied XPath of that a tag is as follows:

/html/body/div[3]/div/ul[1]/li[2]/a

From this XPath you can basically guess how XPath works: it specifies the final a tag through its hierarchy of parent tags. A step such as li[2] indicates that the parent tag has more than one li child, and the index inside [...] starts from 1.
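
To illustrate the 1-based indexing, here is a simplified, hypothetical HTML fragment (not Taobao's actual markup) filtered with Scrapy's Selector class:

from scrapy.selector import Selector

html = """
<ul>
  <li><a>天猫</a></li>
  <li><a>聚划算</a></li>
  <li><a>天猫超市</a></li>
</ul>
"""

# li[2] selects the second <li> child, because XPath indexes start at 1
print(Selector(text=html).xpath('//ul/li[2]/a/text()').get())  # 聚划算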

You can test this XPath in Chrome right now: click the Console tab and enter the following code in the Console to filter out the a tag containing 聚划算.

$x('/html/body/div[3]/div/ul[1]/li[2]/a')

If you want to filter out the 聚划算 text contained in the a tag, use XPath's text() function.

$x('/html/body/div[3]/div/ul[1]/li[2]/a/text()')

The figure below shows the result of executing these in the Console. It is not captured from the very beginning, because Chrome lists a lot of auxiliary information, most of which is not very useful.

[Figure: result of executing the XPath expressions in the Chrome Console]
To test this in the Scrapy Shell, use the following command to restart the Scrapy Shell.

scrapy shell https://www.taobao.com

[Figure: the Scrapy Shell started with the Taobao URL]
In the Scrapy Shell, use the response.xpath method to test the XPath.

response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').extract()

The above code outputs a list. If you want to return 聚划算 directly, use the following code:

response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').extract()[0]

From the code around the a tag containing 聚划算, you can see that li[1] corresponds to 天猫 and li[3] corresponds to 天猫超市, so the following two lines of code extract 天猫 and 天猫超市 respectively.

# 输出 "天猫"
response.xpath('/html/body/div[3]/div/ul[1]/li[1]/a/text()').extract()[0]
# 输出 "天猫超市"
response.xpath('/html/body/div[3]/div/ul[1]/li[3]/a/text()').extract()[0]

Entering the above 4 statements in the Scrapy Shell produces the output shown below:

[Figure: output of the four XPath statements in the Scrapy Shell]
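
As a side note, newer Scrapy versions also provide .get() and .getall() on selectors: .getall() is equivalent to extract(), while .get() behaves like extract()[0] except that it returns None instead of raising an error when nothing matches. For example:

# also outputs "聚划算"
response.xpath('/html/body/div[3]/div/ul[1]/li[2]/a/text()').get()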

2. Use Scrapy to write web crawlers

2.1 Create and use Scrapy project

The Scrapy framework provides the scrapy command for creating Scrapy projects. You can use the following command to create a Scrapy project named myscrapy.

scrapy startproject myscrapy

[Figure: creating the myscrapy project with scrapy startproject]
The crawler file is created with a command. The crawler file is the main code file: the crawling logic for a website is usually written in it. The command is as follows:

cd myscrapy
scrapy genspider first_spider www.jd.com

[Figure: generating first_spider with scrapy genspider]
The generated directories and files are as follows:

[Figure: directory structure of the generated project]
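
For reference, the generated structure typically looks like the following (a standard Scrapy project layout; the exact files may vary slightly between Scrapy versions):

myscrapy/
    scrapy.cfg                # project deployment configuration
    myscrapy/
        __init__.py
        items.py              # item definitions
        middlewares.py        # downloader / spider middlewares
        pipelines.py          # item pipelines
        settings.py           # project settings
        spiders/
            __init__.py
            first_spider.py   # generated by scrapy genspider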
A first_spider.py script file is generated in the spiders directory. It is a Spider program, in which you specify the URLs of the web resources to crawl. The sample code is as follows:

import scrapy


class FirstSpiderSpider(scrapy.Spider):
    name = 'first_spider'  # the Spider's name; Scrapy is started with this name
    allowed_domains = ['www.jd.com']
    start_urls = ['http://www.jd.com/']  # the URLs of the web resources to crawl

    # This method is called once for each crawled URL; the response parameter
    # can be used to filter tags with XPath
    def parse(self, response):
        # output a log message
        self.log('hello world')

Now go to the top-level myscrapy directory in a terminal and execute the following command to run Scrapy.

scrapy crawl first_spider

The result of execution is shown in the figure below:

[Figure: output of running scrapy crawl first_spider]
After running Scrapy, hello world appears among the Debug messages in the output, which shows that the parse method ran and therefore that the web resource at the specified URL was fetched successfully.
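
In a real crawler, parse usually extracts data and follows links instead of just logging, mirroring steps 4 and 5 of the workflow in section 1.1. Below is a minimal sketch; the XPath expressions are illustrative and not specific to www.jd.com:

import scrapy


class FirstSpiderSpider(scrapy.Spider):
    name = 'first_spider'
    allowed_domains = ['www.jd.com']
    start_urls = ['http://www.jd.com/']

    def parse(self, response):
        # step 5: extracted data goes spider -> engine -> item pipeline
        yield {'title': response.xpath('//title/text()').get()}
        # step 4: new Requests go spider middleware -> engine -> scheduler
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)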

2.2 Debug Scrapy source code in Pycharm

In order to run and debug the web crawler directly in PyCharm, you need to create a main.py file (the file name can be anything) in the myscrapy root directory and then enter the following code.

from scrapy.cmdline import execute

import os
import sys

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# To run a different crawler, just modify the command in the string list below
execute(["scrapy", "crawl", "first_spider"])

Now execute the main.py script file. The Run window of PyCharm shows the output in the figure below, and hello world can also be seen in the logged information.

[Figure: Run window output in PyCharm]

2.3 Use extension tools to run Scrapy programs in PyCharm

Section 2.2 created a main.py file for running the Scrapy program; in essence, it executes the scrapy command to run the Scrapy program. However, having to write a main.py file in the PyCharm project every time a Scrapy project is created, just to run the Scrapy program, is very troublesome. To run Scrapy programs more conveniently in PyCharm, you can use a PyCharm extension tool to run them through the scrapy command.

PyCharm's extension tools allow PyCharm to execute external commands with a click. First click the File ⇒ Settings command in PyCharm to open the Settings dialog box.

[Figure: the PyCharm Settings dialog]
Click the Tools ⇒ External Tools node on the left, and a list of extended tools will be displayed on the right, as shown in the figure below:

[Figure: the External Tools list in Settings]
After clicking the button shown below, the Create Tool dialog box will pop up.

[Figure: the Create Tool dialog]
In the Create Tool dialog box, you usually need to fill in the following contents:

  1. Name: the name of the extension tool, in this case runscrapy; it can be any other name as well.
  2. Description: a description of the extension tool. It can be filled in freely and is equivalent to a comment on the program.
  3. Program: the program to be executed. In this case it is C:\Users\AmoXiang\Envs\spider\Scripts\scrapy.exe, the absolute path of the scrapy command. Readers should change this to the path of the scrapy file on their own machine.
  4. Arguments: the command-line parameters passed to the program being executed. In this case it is crawl $FileNameWithoutExtension$, where $FileNameWithoutExtension$ is a PyCharm environment variable representing the name of the currently selected file without its extension; for example, when first_spider.py is selected, the value of $FileNameWithoutExtension$ is first_spider.
  5. Working directory: the working directory, in this case $FileDir$/../.., where $FileDir$ is the directory containing the currently selected file. Since all the crawler code of a Scrapy project lives in the spiders directory, you must select a crawler script file (.py file) in the spiders directory to run the crawler with the extension tool. In a project generated by the scrapy command, the spiders directory is the innermost layer, so the working directory usually needs to go up two levels. Therefore, the value of Working directory can be $FileDir$/../.. or $FileDir$/..

[Figure: the Create Tool dialog with the settings filled in]
After adding the extension tool, select a crawler file in the spiders directory, such as first_spider.py, and then click the External Tools ⇒ runscrapy command in the context menu to run first_spider.py; the console will output the same information as before.

[Figure: running first_spider through the extension tool]

Origin: blog.csdn.net/xw1680/article/details/108702939