Python Crawler(1) - Scrapy Introduction

>python --version
Python 2.7.13

>pip --version
pip 9.0.1 from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (python 2.7)

>pip install scrapy

https://docs.scrapy.org/en/latest/intro/overview.html
The first example from the overview, saved as quotes_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
    name="quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
       
        # Follow the pagination link once per page, after the loop over quotes
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Run the spider and write the scraped items to quotes.json
>scrapy runspider quotes_spider.py -o quotes.json


https://docs.scrapy.org/en/latest/intro/tutorial.html
Start a New Project
>scrapy startproject tutorial

First spider, saved as tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s' % filename)
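
The filename in parse comes from the second-to-last piece of the URL once it is split on "/". A small standalone sketch of that slicing, with no Scrapy needed; page_filename is a hypothetical helper name, not part of the tutorial:

```python
def page_filename(url):
    """Build the output filename the same way parse() does."""
    # Splitting 'http://quotes.toscrape.com/page/1/' on '/' ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))
print(page_filename('http://quotes.toscrape.com/page/2/'))
```

The slicing relies on the trailing slash: for a URL ending in /page/1 the same index would pick up 'page' instead of the number.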

Run the Project
>scrapy crawl quotes

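A shortcut for start_requests: define a start_urls class attribute instead, and Scrapy builds the initial requests itself, using parse as the default callback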
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

The Scrapy shell fetches the page and lets you try selectors against the response interactively
>scrapy shell 'http://quotes.toscrape.com/page/1'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x104c3db90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x104c3d110>
[s]   spider     <DefaultSpider 'default' at 0x10582e550>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

>response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]

>response.css('title::text').extract()
[u'Quotes to Scrape']

>response.css('title::text').extract_first()
u'Quotes to Scrape'

>response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'

>quote = response.css("div.quote")[0]
>title = quote.css("span.text::text").extract_first()
>title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'

>for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}

Update the spider's parse method to yield the extracted data
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
            'tags': quote.css('div.tags a.tag::text').extract(),
        }

Write the scraped items to a JSON file
>scrapy crawl quotes -o quotes.json
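
The exported quotes.json is a JSON array with one object per yielded item, so it can be read back with the standard json module. A minimal sketch, assuming the item shape from the parse method above (text, author, tags); the SAMPLE string is a made-up stand-in for the file contents and count_by_author is a hypothetical helper:

```python
import json
from collections import Counter

# Stand-in for the contents of quotes.json (placeholder items, not real output).
SAMPLE = '''[
  {"text": "quote one", "author": "Albert Einstein", "tags": ["change", "world"]},
  {"text": "quote two", "author": "J.K. Rowling", "tags": ["choices"]}
]'''


def count_by_author(items):
    """Count how many scraped quotes each author has."""
    return Counter(item['author'] for item in items)


# For a real export, replace this with:
#   with open('quotes.json') as f:
#       items = json.load(f)
items = json.loads(SAMPLE)
counts = count_by_author(items)
print(counts['Albert Einstein'])
```

One thing to watch for with the Scrapy version used here: -o appends to an existing file rather than overwriting it, so running the crawl twice without deleting quotes.json first leaves a file that is no longer valid JSON.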

>response.css('li.next a::attr(href)').extract_first()
u'/page/2/'

Follow the link to the next page (at the end of parse)
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

Alternatively, response.follow accepts the relative URL directly, so the urljoin call is not needed

if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
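
Both variants resolve the relative href against the URL of the current page. That resolution behaves like the standard library's urljoin; the sketch below shows only this step outside Scrapy, with the page URL written in by hand as an assumption:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in this post

page_url = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'  # the value extracted from li.next a::attr(href)

# A leading slash makes the href replace the whole path of the page URL.
next_url = urljoin(page_url, next_href)
print(next_url)
```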

Author Spider
import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = [ 'http://quotes.toscrape.com/' ]

    def parse(self, response):
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

>scrapy crawl author -o authors.json
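
One caveat with extract_with_css: extract_first() returns None when the selector matches nothing, and calling strip() on None raises AttributeError. A possible guard, written as a plain function so it runs without Scrapy; strip_or_default is a hypothetical name:

```python
def strip_or_default(value, default=''):
    """Strip surrounding whitespace, or return default when value is None."""
    if value is None:
        return default
    return value.strip()


print(repr(strip_or_default('  Albert Einstein\n')))
print(repr(strip_or_default(None)))
```

Inside the spider, the helper would wrap the selector call, for example strip_or_default(response.css(query).extract_first()).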

Pass arguments to the spider with -a
>scrapy crawl quotes -o quotes-humor.json -a tag=humor

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)
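
Each -a name=value pair is handed to the spider's constructor and, by default, stored as an attribute, which is why getattr(self, 'tag', None) works. A stand-in sketch of that mechanism without Scrapy; ArgHolder and build_start_url are hypothetical names that only mimic the behaviour:

```python
class ArgHolder(object):
    """Stand-in for a spider: keeps keyword arguments as attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def build_start_url(spider):
    """Same URL logic as start_requests() above."""
    url = 'http://quotes.toscrape.com/'
    tag = getattr(spider, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    return url


print(build_start_url(ArgHolder(tag='humor')))
print(build_start_url(ArgHolder()))
```

Without -a tag=... the attribute is missing, getattr falls back to None, and the crawl starts from the site root.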




