Scrapy is a fast, high-level web crawling and scraping framework written in Python for crawling web sites and extracting structured data from their pages. It is versatile and can be used for data mining, monitoring, and automated testing.
- Official homepage: http://www.scrapy.org/
- Chinese documentation: Scrapy 0.22 documentation
- GitHub project homepage: https://github.com/scrapy/scrapy
Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as follows:
Scrapy mainly includes the following components:
- Engine: handles the data flow of the whole system and triggers events.
- Scheduler: accepts requests from the engine, enqueues them, and returns them when the engine asks for the next request.
- Downloader: downloads web page content and returns it to the spiders.
- Spiders: where the main work happens; each spider defines the parsing rules for a specific domain or set of pages.
- Item pipeline: processes the items extracted from pages by the spiders; its main tasks are cleaning, validating, and storing data. After a page is parsed by a spider, the resulting items are sent to the pipeline and processed through several stages in sequence.
- Downloader middleware: a hook framework between the Scrapy engine and the downloader that mainly processes the requests and responses passing between them.
- Spider middleware: a hook framework between the Scrapy engine and the spiders that mainly processes the spiders' response input and request output.
- Scheduler middleware: middleware between the Scrapy engine and the scheduler that processes the requests and responses sent between them.
Scrapy makes it easy to collect online data; it already does most of the heavy lifting, so we do not have to build this plumbing ourselves.
1. Installation
Install Python
The latest version of Scrapy is 0.22.2, which requires Python 2.7, so you need to install Python 2.7 first. I am testing on a CentOS server here; since the system ships with Python, check the installed version first.
Check the Python version:
$ python -V
Python 2.6.6
Upgrade version to 2.7:
$ wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
$ tar xf Python-2.7.6.tar.xz
$ cd Python-2.7.6
$ ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
$ make && make altinstall
Create a symbolic link so that the system's default python points to Python 2.7:
$ mv /usr/bin/python /usr/bin/python2.6.6
$ ln -s /usr/local/bin/python2.7 /usr/bin/python
Check the python version again:
$ python -V
Python 2.7.6
Install setuptools
Use wget to download and run the setuptools bootstrap script:
$ wget https://bootstrap.pypa.io/ez_setup.py -O - | python
Install zope.interface
$ easy_install zope.interface
Install Twisted
Scrapy uses the Twisted asynchronous networking library to handle network communication, so Twisted needs to be installed.
Before installing Twisted, install gcc:
$ yum install gcc -y
Then, install twisted via easy_install:
$ easy_install twisted
If the following error occurs:
$ easy_install twisted
Searching for twisted
Reading https://pypi.python.org/simple/twisted/
Best match: Twisted 14.0.0
Downloading https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
Processing Twisted-14.0.0.tar.bz2
Writing /tmp/easy_install-kYHKjn/Twisted-14.0.0/setup.cfg
Running Twisted-14.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kYHKjn/Twisted-14.0.0/egg-dist-tmp-vu1n6Y
twisted/runner/portmap.c:10:20: error: Python.h: No such file or directory
twisted/runner/portmap.c:14: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:31: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:45: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘PortmapMethods’
twisted/runner/portmap.c: In function ‘initportmap’:
twisted/runner/portmap.c:55: warning: implicit declaration of function ‘Py_InitModule’
twisted/runner/portmap.c:55: error: ‘PortmapMethods’ undeclared (first use in this function)
twisted/runner/portmap.c:55: error: (Each undeclared identifier is reported only once
twisted/runner/portmap.c:55: error: for each function it appears in.)
Please install python-devel and run again:
$ yum install python-devel -y
$ easy_install twisted
If the following exception occurs:
error: Not a recognized archive type: /tmp/easy_install-tVwC5O/Twisted-14.0.0.tar.bz2
Please download and install it manually:
$ wget https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
$ tar -vxjf Twisted-14.0.0.tar.bz2
$ cd Twisted-14.0.0
$ python setup.py install
Install pyOpenSSL
Install some dependencies first:
$ yum install libffi libffi-devel openssl-devel -y
Then, install pyOpenSSL via easy_install:
$ easy_install pyOpenSSL
Install Scrapy
Install some dependencies first:
$ yum install libxml2 libxslt libxslt-devel -y
Finally, install Scrapy again:
$ easy_install scrapy
2. Using Scrapy
After the installation succeeds, you can pick up Scrapy's basic concepts and usage by studying the example project dirbot.
The dirbot project is located at https://github.com/scrapy/dirbot and includes a README file that describes its contents in detail. If you are familiar with git, you can check out its source code; otherwise you can download it in tarball or zip format by clicking Downloads.
The following example describes how to use Scrapy to create a crawler project.
Create a new project
Before scraping, you need to create a new Scrapy project. Go to a directory where you want to save your code and execute:
$ scrapy startproject tutorial
This command will create a new directory tutorial in the current directory with the following structure:
.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
These files are:
- scrapy.cfg: the project configuration file
- tutorial/: the project's Python module; code will be imported from here
- tutorial/items.py: the project's items file
- tutorial/pipelines.py: the project's pipelines file
- tutorial/settings.py: the project's settings file
- tutorial/spiders/: the directory where spiders are placed
Define Item
Items are the containers that will hold the scraped data. They work like Python dictionaries but provide extra protection: assigning to an undeclared field raises an error, which catches misspellings.
Items are declared by creating a scrapy.item.Item subclass and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).
We model the data we want from dmoz.org as an item: we want each site's name, URL, and description, so we define fields for these three properties. To do this, edit the items.py file in the tutorial directory so that our Item class looks like this:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
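The "extra protection" that Items provide over plain dicts can be illustrated with a minimal, self-contained sketch. This is NOT Scrapy's actual implementation (Scrapy uses metaclasses internally); it only demonstrates the behavior: declared Fields are the only keys an Item accepts, so a misspelled field name fails immediately instead of silently storing bad data.

```python
# Minimal stand-in for scrapy.item.Item / Field -- illustrative only,
# NOT Scrapy's real implementation.
class Field(dict):
    """Placeholder for scrapy.item.Field; in Scrapy it holds field metadata."""

class Item(dict):
    @property
    def fields(self):
        # Collect the class attributes that were declared as Field objects.
        return {name for name, value in vars(type(self)).items()
                if isinstance(value, Field)}

    def __setitem__(self, key, value):
        # Reject any key that was not declared as a Field.
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

item = DmozItem()
item['title'] = 'Example site'    # declared field: accepted
try:
    item['titel'] = 'oops'        # misspelled field: rejected
except KeyError as exc:
    print('rejected:', exc)
```

Real Scrapy Items achieve the same effect; the point is simply that a typo raises a KeyError instead of creating a stray key.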
It might seem confusing at first, but defining these items lets you know what your items are when you use other Scrapy components.
Write a crawler (Spider)
Spiders are user-written classes that scrape information from a domain (or group of domains). They define an initial list of URLs to download, how to follow links, and how to parse page content to extract items.
To create a spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:
- name: the spider's identifier. It must be unique, so different spiders must be given different names.
- start_urls: the list of URLs the spider starts crawling from, so the first pages downloaded will be these; subsequent URLs are derived from the data in these starting pages.
- parse(): the spider's callback method. When called, the Response object downloaded from each URL is passed as its only parameter. This method is responsible for parsing the returned data, extracting scraped data (as items), and following further URLs.
Create DmozSpider.py in the tutorial/spiders directory:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
Run the project
$ scrapy crawl dmoz
This command starts the spider crawling from the dmoz.org domain; the third argument, dmoz, is the value of the name attribute defined in DmozSpider.py.
XPath selectors
Scrapy uses a mechanism based on XPath expressions called XPath selectors. To learn more about selectors and other extraction mechanisms, see the documentation.
Here are some examples of XPath expressions and their meanings:
- /html/head/title: selects the <title> element below <head> in the HTML document
- /html/head/title/text(): selects the text content of that <title> element
- //td: selects all <td> elements
- //div[@class="mine"]: selects all div elements that have the attribute class="mine"
These are just a few simple examples, but XPath is actually very powerful. If you want to learn more about XPath, we recommend this XPath tutorial.
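The simpler of these expressions can be tried with nothing but Python's standard library. The sketch below uses xml.etree.ElementTree, which supports only a small XPath subset (it needs the relative .// form and has no text() step), whereas Scrapy's selectors support full XPath; the sample document here is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small, well-formed document to try the expressions on.
html = """
<html>
  <head><title>My Page</title></head>
  <body>
    <table><tr><td>cell 1</td><td>cell 2</td></tr></table>
    <div class="mine">keep me</div>
    <div class="other">skip me</div>
  </body>
</html>
"""
root = ET.fromstring(html)

# /html/head/title -> the <title> element below <head>
print(root.find('head/title').text)                      # My Page

# //td -> all <td> elements (ElementTree needs the relative .// form)
print([td.text for td in root.findall('.//td')])         # ['cell 1', 'cell 2']

# //div[@class="mine"] -> all divs carrying class="mine"
print([d.text for d in root.findall(".//div[@class='mine']")])  # ['keep me']
```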
To make XPath easy to use, Scrapy provides the Selector class, which has three main methods:
- xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as a parameter.
- extract(): returns a unicode string containing the data matched by the selector.
- re(): returns a list of unicode strings extracted by applying the regular expression given as a parameter.
Extract data
We can select each <li> element with:
sel.xpath('//ul/li')
And then the site description:
sel.xpath('//ul/li/text()').extract()
Website Title:
sel.xpath('//ul/li/a/text()').extract()
Website link:
sel.xpath('//ul/li/a/@href').extract()
As mentioned, each xpath() call returns a list of selectors, so we can chain xpath() calls to dig into deeper nodes. We will use this feature, like so:
sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc
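Outside of a Scrapy shell, the same nested-selection idea can be sketched with the standard library on a dmoz-style HTML fragment (the URLs and text here are made up for illustration): select each <li>, then dig into its <a> child for the title and link, and take the trailing text as the description.

```python
import xml.etree.ElementTree as ET

# A dmoz-style list; the sites and descriptions are invented examples.
html = """
<ul class="directory-url">
  <li><a href="http://example.org/a">Site A</a> - first example site</li>
  <li><a href="http://example.org/b">Site B</a> - second example site</li>
</ul>
"""
root = ET.fromstring(html)

results = []
for li in root.findall('li'):          # like sel.xpath('//ul/li')
    a = li.find('a')
    title = a.text                     # like site.xpath('a/text()')
    link = a.get('href')               # like site.xpath('a/@href')
    desc = (a.tail or '').strip()      # like site.xpath('text()')
    results.append((title, link, desc))
    print(title, link, desc)
```

In Scrapy, extract() would give you lists of unicode strings instead of nodes, but the nesting pattern is the same.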
Use Item
scrapy.item.Item has a calling interface similar to Python's dict, and an Item contains multiple scrapy.item.Field fields. This is analogous to Django's Model.
Items are typically used in a spider's parse method to hold the parsed data.
Finally, modify the crawler class and use Item to save the data. The code is as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s([^\n]*?)\\n')
            items.append(item)

        return items
Now, you can run the project again to see the results:
$ scrapy crawl dmoz
Using Item Pipelines
Set ITEM_PIPELINES in settings.py; it defaults to an empty list [], similar to Django's MIDDLEWARE_CLASSES setting.
The Item objects returned from a spider's parse method are processed in turn by each pipeline class listed in ITEM_PIPELINES.
An Item Pipeline class must implement the following methods:
- process_item(item, spider): called for each item pipeline component; it must return a scrapy.item.Item instance or raise a scrapy.exceptions.DropItem exception. When the exception is raised, the item is not processed by any further pipeline. Parameters:
  - item (Item object) – the item returned by the spider's parse method
  - spider (BaseSpider object) – the spider that scraped this item

The following two methods can also be implemented:
- open_spider(spider): called when the spider is opened. Parameter: spider (BaseSpider object) – the spider that has started running.
- close_spider(spider): called when the spider is closed. Parameter: spider (BaseSpider object) – the spider that has been closed.
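As a sketch of what such a pipeline might look like: the class name ValidateWebsitePipeline is hypothetical, a plain dict stands in for the Website item, and DropItem is a local stand-in for scrapy.exceptions.DropItem so the example is self-contained.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

class ValidateWebsitePipeline(object):   # hypothetical pipeline name
    def process_item(self, item, spider):
        # Drop items that are missing a URL; clean up the description.
        if not item.get('url'):
            raise DropItem('missing url in %r' % item)
        item['description'] = item.get('description', '').strip()
        return item

pipeline = ValidateWebsitePipeline()
good = pipeline.process_item(
    {'url': 'http://example.org', 'description': '  a site '}, spider=None)
print(good['description'])   # a site
```

In a real project this class would live in pipelines.py and be listed in ITEM_PIPELINES in settings.py so Scrapy calls it for each item.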
Save scraped data
The easiest way to save the scraped data is through Feed exports; the command is as follows:
$ scrapy crawl dmoz -o items.json -t json
In addition to JSON, the JSON Lines, CSV, and XML formats are also supported, and you can add other formats through the feed exporter interface.
For small projects this method is usually sufficient; for more complex data you may need to write an Item Pipeline to process it.
All scraped items will be saved in the newly generated items.json file in JSON format.
Summary
The above describes the process of creating a crawler project; you can follow the same steps to practice on your own. As a further learning example, you can also refer to this article: scrapy Chinese tutorial (crawling the cnbeta example).
The crawler code in this article is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem

class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
It should be noted:
- The spider class inherits from CrawlSpider and defines rules: the allow pattern /articles/.*\.htm specifies that matching links will be followed.
- The class does not implement a parse method; instead, the callback parse_page is set in the rule. See the documentation for more on how CrawlSpider is used.
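The allow pattern in the rule above is an ordinary regular expression that the link extractor searches against each candidate URL. A quick check of which cnbeta-style URLs it would follow (the sample URLs are illustrative):

```python
import re

# The allow pattern from the Rule above.
pattern = re.compile(r'/articles/.*\.htm')

matched = [url for url in [
    'http://www.cnbeta.com/articles/12345.htm',   # under /articles/ -> followed
    'http://www.cnbeta.com/topics/linux.htm',     # not under /articles/ -> ignored
] if pattern.search(url)]
print(matched)   # ['http://www.cnbeta.com/articles/12345.htm']
```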
3. Learning Materials
I came into contact with Scrapy because I wanted to crawl some Zhihu data. I started by searching for related material and for other people's implementations.
Several people on GitHub have implemented Zhihu crawling to varying degrees. I found the following repositories:
- https://github.com/KeithYue/Zhihu_Spider implements logging in with username and password before crawling data, see zhihu_spider.py for the code .
- https://github.com/immzz/zhihu-scrapy uses selenium to download and execute javascript code.
- https://github.com/tangerinewhite32/zhihu-stat-py
- https://github.com/Zcc/zhihu mainly crawls the top answers of specified topics as well as user profiles, and includes login code.
- https://github.com/pelick/VerticleSearchEngine is based on crawled academic resources and provides search, recommendation, visualization, and sharing. It uses Scrapy, MongoDB, Apache Lucene/Solr, Apache Tika, and other technologies.
- https://github.com/geekan/scrapy-examples Some examples of scrapy, including examples of getting Douban data, linkedin, Tencent recruitment data, etc.
- https://github.com/owengbs/deeplearning implements paging to get topics.
- https://github.com/gnemoug/distribute_crawler A distributed web crawler implemented with scrapy, redis, mongodb, and graphite: the underlying storage is a mongodb cluster, distribution is handled with redis, and crawler status display is implemented with graphite.
- https://github.com/weizetao/spider-roach A simple implementation of a distributed directional scraping cluster.
- https://github.com/scrapinghub/portia A visual crawler based on Scrapy. It provides a web page for visual operation: you only need to click on the data you want to extract on the page.
- https://github.com/binux/pyspider If you don't like Scrapy, you can try pyspider. It lets you write and debug scripts in a web interface, monitor execution status, and view history and results. You can try the online demo: Dashboard - pyspider.
Other resources:
- http://www.52ml.net/tags/Scrapy collects many articles about Scrapy; recommended reading
- Use Python Requests to scrape Zhihu user information
- Use the scrapy framework to crawl your own blog posts
- Scrapy digs a little deeper
- Using python, scrapy to write (customized) crawler experience, information, miscellaneous.
- Scrapy easily customizes web crawler
- How to let Spider automatically crawl Douban group page in scrapy
Examples of scrapy interacting with JavaScript:
- Crawling js interactive table data with scrapy framework
- scrapy + selenium parse javascript instance
There are still some knowledge points to be sorted out:
- How to log in first and then crawl data
- How to use rules to filter
- How to crawl data recursively
- Parameter setting and optimization of scrapy
- How to implement distributed crawling
4. Summary
The above are notes and knowledge I organized while learning Scrapy over the past few days. I wrote this article with reference to several online articles, for which I would like to express my thanks. I hope it is helpful to you; if you have any thoughts, please leave a comment, and if you like this article, please help share it. Thank you!
Originally published at: http://blog.javachen.com/2014/05/24/using-scrapy-to-cralw-data.html