Scraping data with Scrapy

Scrapy is a fast, high-level web scraping and crawling framework written in Python for crawling web sites and extracting structured data from their pages. It is versatile and can be used for data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as follows (the diagram below is taken from the Internet):

[Scrapy architecture diagram]

Scrapy mainly includes the following components:

  • Engine: processes the data flow of the entire system and triggers events when certain actions occur.
  • Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again.
  • Downloader: downloads web content and returns it to the spiders.
  • Spiders: this is where the main work happens; spiders define the parsing rules for a specific domain or set of pages.
  • Item pipeline: responsible for processing the items extracted from pages by the spiders; its main tasks are cleansing, validating and storing the data. Once a page has been parsed by a spider, the extracted items are sent to the pipeline and processed by its components in sequence.
  • Downloader middleware: a hook framework between the Scrapy engine and the downloader; it mainly processes the requests and responses passed between them (a minimal sketch follows this list).
  • Spider middleware: a hook framework between the Scrapy engine and the spiders; its main job is to process the spiders' response input and request output.
  • Scheduler middleware: middleware between the Scrapy engine and the scheduler; it processes the requests and responses passed between the engine and the scheduler.
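
As an illustration of one of these hook points, here is a minimal sketch of a downloader middleware that tags every outgoing request with a custom header. The class name and header are made up for this example; only the process_request/process_response hook methods come from Scrapy's downloader middleware interface.

class CustomHeaderMiddleware(object):
    # Illustrative downloader middleware: adds a header to every request
    # and passes responses through unchanged.

    def process_request(self, request, spider):
        # Called for each request before it reaches the downloader.
        request.headers['X-Crawled-By'] = spider.name
        return None  # returning None lets processing continue as usual

    def process_response(self, request, response, spider):
        # Called for each response before it is handed to the spider.
        return response

To enable such a middleware, its class path would be added to the DOWNLOADER_MIDDLEWARES setting in settings.py.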

With Scrapy, collecting online data becomes easy: the framework already does most of the heavy lifting for us, so very little has to be developed from scratch.

1. Installation

Install Python

The latest version of Scrapy at the time of writing is 0.22.2, which requires Python 2.7, so you need to install Python 2.7 first. I am testing on a CentOS server here; since the system ships with its own Python, check the installed version first.

Check the Python version:

$ python -V
Python 2.6.6

Upgrade to Python 2.7:

$ wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
$ tar xf Python-2.7.6.tar.xz
$ cd Python-2.7.6
$ ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
$ make && make altinstall

Create a symbolic link so that the system's default python points to Python 2.7:

$ mv /usr/bin/python /usr/bin/python2.6.6 
$ ln -s /usr/local/bin/python2.7 /usr/bin/python 

Check the python version again:

$ python -V
Python 2.7.6

Install setuptools

Use wget to fetch the setuptools bootstrap script and pipe it to Python:

$ wget https://bootstrap.pypa.io/ez_setup.py -O - | python

Install zope.interface

$ easy_install zope.interface

Install Twisted

Scrapy uses the Twisted asynchronous networking library to handle network communication, so Twisted must be installed.

Before installing Twisted, install gcc:

$ yum install gcc -y

Then, install twisted via easy_install:

$ easy_install twisted

If the following error occurs:

$ easy_install twisted
Searching for twisted
Reading https://pypi.python.org/simple/twisted/
Best match: Twisted 14.0.0
Downloading https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
Processing Twisted-14.0.0.tar.bz2
Writing /tmp/easy_install-kYHKjn/Twisted-14.0.0/setup.cfg
Running Twisted-14.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kYHKjn/Twisted-14.0.0/egg-dist-tmp-vu1n6Y
twisted/runner/portmap.c:10:20: error: Python.h: No such file or directory
twisted/runner/portmap.c:14: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:31: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:45: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘PortmapMethods’
twisted/runner/portmap.c: In function ‘initportmap’:
twisted/runner/portmap.c:55: warning: implicit declaration of function ‘Py_InitModule’
twisted/runner/portmap.c:55: error: ‘PortmapMethods’ undeclared (first use in this function)
twisted/runner/portmap.c:55: error: (Each undeclared identifier is reported only once
twisted/runner/portmap.c:55: error: for each function it appears in.)

This means the Python headers are missing; install python-devel and run easy_install again:

$ yum install python-devel -y
$ easy_install twisted

If the following exception occurs:

error: Not a recognized archive type: /tmp/easy_install-tVwC5O/Twisted-14.0.0.tar.bz2

Download the Twisted source package and install it manually:

$ wget https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
$ tar -vxjf Twisted-14.0.0.tar.bz2
$ cd Twisted-14.0.0
$ python setup.py install

Install pyOpenSSL

Install some dependencies first:

$ yum install libffi libffi-devel openssl-devel -y

Then, install pyOpenSSL via easy_install:

$ easy_install pyOpenSSL

Install Scrapy

Install some dependencies first:

$ yum install libxml2 libxslt libxslt-devel -y

Finally, install Scrapy itself:

$ easy_install scrapy

2. Using Scrapy

After the installation succeeds, you can get familiar with Scrapy's basic concepts and usage, and study dirbot, the Scrapy example project.

The dirbot project is located at https://github.com/scrapy/dirbot and includes a README file that describes its contents in detail. If you are familiar with git, you can check out the source code; otherwise you can download it as a tarball or zip archive via the Downloads link.

The following example shows how to create a crawler project with Scrapy.

Create a new project

Before scraping, you need to create a new Scrapy project. Go to a directory where you want to save your code and execute:

$ scrapy startproject tutorial

This command will create a new directory tutorial in the current directory with the following structure:

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

These files are:

  • scrapy.cfg: the project configuration file
  • tutorial/: the project's Python module; your code is imported from here
  • tutorial/items.py: the project's items file
  • tutorial/pipelines.py: the project's pipelines file
  • tutorial/settings.py: the project's settings file
  • tutorial/spiders/: the directory where spiders live

Define Item

Items are the containers that hold the scraped data. They work like Python dictionaries but provide extra protection, for example rejecting fields that were never declared, which protects against typos.

An item is declared by subclassing scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).
We model the data we want to collect from dmoz.org as an item. For example, we want each site's name, URL and description, so we define fields for these three properties. To do this, edit the items.py file in the tutorial directory; our Item class looks like this:

from scrapy.item import Item, Field 
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

It might seem like boilerplate at first, but defining items lets other Scrapy components know exactly what data your spider produces.
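
As a quick illustration of the dictionary-like behaviour mentioned above, the snippet below (a minimal sketch, assuming the DmozItem class defined earlier) shows that declared fields can be set and read like dict keys, while assigning to an undeclared field raises an error:

from tutorial.items import DmozItem

item = DmozItem(title='Example site', link='http://example.com/')
item['desc'] = 'A placeholder description'   # declared field: fine
print item['title']                          # prints: Example site

# Assigning to a field that was never declared raises a KeyError;
# this is the protection against typos mentioned above.
try:
    item['titel'] = 'oops'
except KeyError:
    print 'undeclared field rejected'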

Write a crawler (Spider)

Spiders are user-written classes that scrape information from a domain (or group of domains). They define an initial list of URLs to download, how to follow links, and how to parse page content to extract items.

To create a spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

  • name: the identifier of the crawler. It must be unique; different crawlers must use different names.
  • start_urls: the list of URLs the crawler starts from. The first downloads happen against these URLs, and further URLs are derived from the pages they return.
  • parse(): the crawler's callback method. It is called with the Response object returned for each URL; the response is its only argument.

This method is responsible for parsing the returned data, extracting the scraped data (as items) and finding more URLs to follow.

Create DmozSpider.py in the tutorial/spiders directory

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Run the project

$ scrapy crawl dmoz

This command starts the crawler for the dmoz.org domain; the last argument, dmoz, is the value of the name attribute defined in DmozSpider.py.

XPath selectors

Scrapy uses a mechanism called XPath selectors, based on XPath expressions. If you want to learn more about selectors and other mechanisms, check the documentation.

Here are some examples of XPath expressions and their meanings:

  • /html/head/title: selects the <title> element inside the <head> element of the HTML document
  • /html/head/title/text(): selects the text content of the aforementioned <title> element
  • //td: selects all <td> elements
  • //div[@class="mine"]: selects all <div> elements that have a class="mine" attribute

These are just a few simple examples, but XPath is actually very powerful. If you want to learn more about XPath, we recommend this XPath tutorial.

To make XPaths convenient to use, Scrapy provides the Selector class, whose main methods are listed below (a short usage sketch follows the list):

  • xpath(): returns a list of selectors, each representing a node matched by the XPath expression passed as argument.
  • extract(): returns the data selected by the selector as unicode strings.
  • re(): returns a list of unicode strings extracted by applying the regular expression passed as argument.
  • css(): returns a list of selectors, each representing a node matched by the CSS expression passed as argument.
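
A minimal sketch of these methods, run against a small inline HTML fragment (the HTML string below is made up purely for illustration):

from scrapy.selector import Selector

# A made-up HTML fragment, just to exercise the Selector methods.
html = '<ul><li><a href="http://example.com/">Example</a> - a sample site</li></ul>'
sel = Selector(text=html)

print sel.xpath('//ul/li/a/text()').extract()      # [u'Example']
print sel.xpath('//ul/li/a/@href').extract()       # [u'http://example.com/']
print sel.css('a::text').extract()                 # [u'Example'], the CSS equivalent
print sel.xpath('//ul/li/text()').re(r'-\s*(.*)')  # [u'a sample site']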

Extract data

Using a selector sel built from the response (for example in the scrapy shell), we can select each <li> element:

sel.xpath('//ul/li') 

And then the site description:

sel.xpath('//ul/li/text()').extract()

Website Title:

sel.xpath('//ul/li/a/text()').extract()

Website link:

sel.xpath('//ul/li/a/@href').extract()

As mentioned, each xpath() call returns a list of selectors, so we can chain xpath() calls to dig into deeper nodes. We will use that here:

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

Use Item

The scrapy.item.Item calling interface is similar to a Python dict; an Item contains multiple scrapy.item.Field attributes, much like a Django Model.

Items are typically used in the spider's parse method to hold the parsed data.

Finally, modify the spider class to store the scraped data in Items (this version uses the Website item from the dirbot project). The code is as follows:

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s([^\n]*?)\\n')
            items.append(item)

        return items

Now, you can run the project again to see the results:

$ scrapy crawl dmoz

Using Item Pipelines

Item pipelines are configured through the ITEM_PIPELINES setting in settings.py, which defaults to [] (similar to Django's MIDDLEWARE_CLASSES).
The items returned from a spider's parse method are processed in turn by each pipeline class listed in ITEM_PIPELINES.

An Item Pipeline class must implement the following methods:

  • process_item(item, spider): called for each item by every pipeline component. It must return a scrapy.item.Item instance or raise a scrapy.exceptions.DropItem exception; when the exception is raised, the item is not processed by any further pipelines. Parameters:
    • item (Item object): the item returned by the parse method
    • spider (BaseSpider object): the spider that scraped this item

The following two methods can optionally be implemented as well (a minimal pipeline sketch follows this list):

  • open_spider(spider): called when the spider is opened. Parameter: spider (BaseSpider object), the spider that was started.
  • close_spider(spider): called when the spider is closed. Parameter: spider (BaseSpider object), the spider that was closed.
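
As a minimal sketch (the class name and filtering rule are made up for illustration and assume the DmozItem fields defined earlier; they are not from the original article), a pipeline that drops items with an empty description could look like this:

from scrapy.exceptions import DropItem

class DropEmptyDescriptionPipeline(object):
    # Illustrative pipeline: keep only items that carry a description.

    def process_item(self, item, spider):
        if item.get('desc'):
            return item
        raise DropItem('Missing description in %s' % item.get('link'))

It would then be enabled by listing its class path in ITEM_PIPELINES in settings.py, for example:

ITEM_PIPELINES = ['tutorial.pipelines.DropEmptyDescriptionPipeline']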

Save scraped data

The easiest way to save the scraped data is through Feed exports; the command is as follows:

$ scrapy crawl dmoz -o items.json -t json

In addition to JSON, the JSON Lines, CSV and XML formats are also supported out of the box, and more formats can be added through the feed exporter interface.
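
For example (assuming the same dmoz spider), the other built-in formats can be selected with the -t option in the same way:

$ scrapy crawl dmoz -o items.csv -t csv
$ scrapy crawl dmoz -o items.xml -t xml
$ scrapy crawl dmoz -o items.jl -t jsonlines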

This approach is sufficient for small projects; for more complex data you may need to write an Item Pipeline to process it.

All scraped items are saved, in JSON format, in the newly generated items.json file.

Summary

The above walks through the process of creating a crawler project; you can follow the same steps to practise on your own. As a further learning example, you can also refer to this article: a Chinese Scrapy tutorial that crawls cnbeta.

The crawler code from that article is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
 
from cnbeta.items import CnbetaItem
 
class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']
 
    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm', )),
             callback='parse_page', follow=True),
    )
 
    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item

A few things to note:

  • The crawler class inherits from CrawlSpider and defines rules; the rule specifies that links matching /articles/.*\.htm will be followed.
  • The class does not implement the parse method; instead, the callback parse_page is set in the rule. Refer to the documentation to learn more about how CrawlSpider is used. (The CnbetaItem imported above comes from the project's items.py; a minimal sketch follows this list.)
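
The CnbetaItem class is not shown in the original article; a minimal definition matching the two fields used above (title and url) might look like this:

from scrapy.item import Item, Field

class CnbetaItem(Item):
    title = Field()
    url = Field()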

3. Learning Materials

I first came across Scrapy because I wanted to crawl some Zhihu data, so I started by searching for relevant material and other people's implementations.

Several people on GitHub have implemented Zhihu crawling to varying degrees; I found the following repositories:

Other resources:

Examples of Scrapy interacting with JavaScript:

There are still some topics I need to sort out:

  • How to log in before crawling data
  • How to filter with rules
  • How to crawl data recursively
  • Scrapy parameter tuning and optimization
  • How to implement distributed crawling

4. Summary

This post collects my notes from learning Scrapy over the past few days. It was written with reference to several online articles, for which I am grateful, and I hope it is helpful to you. If you have any thoughts, please leave a comment; if you like this article, please share it. Thank you!

 

Reprinted from: http://blog.javachen.com/2014/05/24/using-scrapy-to-cralw-data.html
