Python Crawler(1) - Scrapy Introduction
>python --version
Python 2.7.13
>pip --version
pip 9.0.1 from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (python 2.7)
>pip install scrapy
https://docs.scrapy.org/en/latest/intro/overview.html
The first example, quotes_spider.py:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Run the spider and write the scraped items to quotes.json:
>scrapy runspider quotes_spider.py -o quotes.json
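The parse method above yields two kinds of results: dicts, which Scrapy collects as scraped items, and requests, which it schedules and passes back to the callback. The toy model below imitates that loop with the standard library only. It is a sketch of the idea, not how Scrapy is implemented; Follow, PAGES, parse and crawl are hypothetical names, and the page contents are placeholders rather than real quotes.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for a request that should be followed later.
Follow = namedtuple('Follow', 'path')

# Hypothetical in-memory "site": placeholder quotes plus an optional next link.
PAGES = {
    '/tag/humor/': {
        'quotes': ['placeholder quote 1', 'placeholder quote 2'],
        'next': '/tag/humor/page/2/',
    },
    '/tag/humor/page/2/': {
        'quotes': ['placeholder quote 3'],
        'next': None,
    },
}


def parse(path):
    """Yield one item per quote, then a Follow if there is a next page."""
    page = PAGES[path]
    for text in page['quotes']:
        yield {'text': text}
    next_page = page['next']
    if next_page is not None:
        yield Follow(next_page)


def crawl(start_path):
    """Run parse on every pending page, collecting items and queuing follows."""
    items = []
    pending = deque([start_path])
    while pending:
        for result in parse(pending.popleft()):
            if isinstance(result, Follow):
                pending.append(result.path)
            else:
                items.append(result)
    return items


items = crawl('/tag/humor/')
print(len(items))  # 3
```

The crawl stops on its own once a page has no next link, which is the same reason the real spider stops at the last page.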
https://docs.scrapy.org/en/latest/intro/tutorial.html
Start a New Project
>scrapy startproject tutorial
The first spider, saved as quotes_spider.py in the project's spiders directory:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s' % filename)
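The file name in parse comes from splitting the URL on "/" and taking the second-to-last piece. A small standalone sketch of that expression follows; page_filename is a hypothetical helper name, not part of Scrapy.

```python
def page_filename(url):
    """Build the file name the same way parse() does."""
    # For a URL ending in "/", the last piece is empty and the
    # second-to-last piece is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(page_filename('http://quotes.toscrape.com/page/2/'))  # quotes-2.html
# Without the trailing slash the same index picks up "page" instead.
print(page_filename('http://quotes.toscrape.com/page/1'))   # quotes-page.html
```

The last call shows why the start URLs in the spider keep their trailing slash.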
Run the Project
>scrapy crawl quotes
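A shortcut for start_requests: define a start_urls class attribute instead, and Scrapy builds the initial requests from it with parse as the default callback.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]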
The Scrapy shell fetches a page and lets you try selectors against the response interactively:
>scrapy shell 'http://quotes.toscrape.com/page/1'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x104c3db90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x104c3d110>
[s]   spider     <DefaultSpider 'default' at 0x10582e550>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True])   Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                    Fetch a scrapy.Request and update local objects
[s]   shelp()                       Shell help (print this help)
[s]   view(response)                View response in a browser
>response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
>response.css('title::text').extract()
[u'Quotes to Scrape']
>response.css('title::text').extract_first()
u'Quotes to Scrape'
>response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'
>quote = response.css("div.quote")[0]
>title = quote.css("span.text::text").extract_first()
>title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
>for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}
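The \u201c and \u201d escapes in the output above are the curly quotation marks that wrap each quote on the page; the shell shows them escaped because it prints the repr of the string. If the surrounding quotes are not wanted, one option is to strip them, as in this minimal sketch. The variable names are illustrative, and whether to strip at all is a choice the original spider does not make.

```python
text = (u'\u201cThe world as we have created it is a process of our '
        u'thinking. It cannot be changed without changing our thinking.\u201d')

# strip() with an argument removes any of the listed characters
# from both ends of the string and leaves the middle untouched.
plain = text.strip(u'\u201c\u201d')

print(plain.startswith(u'The world'))  # True
print(plain.endswith(u'thinking.'))    # True
```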
Update the spider's parse method to extract the data instead of saving the page:
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Write the scraped items to a JSON file:
>scrapy crawl quotes -o quotes.json
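The resulting file is a JSON array with one object per scraped item, so it can be read back with the standard json module. The sketch below uses an inline sample instead of the real file, built from the two items printed in the shell session above; the exact escaping and key order in a real quotes.json may differ, and the variable names are illustrative.

```python
import json
from collections import Counter

# Inline sample shaped like quotes.json (assumption: a JSON array of objects
# with the keys yielded by parse). Raw string so json parses the \u escapes.
sample = r'''[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}
]'''

items = json.loads(sample)

# Collect the authors and count how often each tag appears.
authors = [item['author'] for item in items]
tag_counts = Counter(tag for item in items for tag in item['tags'])

print(len(items))              # 2
print(tag_counts['thinking'])  # 1
```

With a real crawl, json.load on the opened quotes.json file would take the place of json.loads(sample).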
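Find Next Page
In the Scrapy shell, the link to the next page can be read from the li.next element:
>response.css('li.next a::attr(href)').extract_first()
u'/page/2/'
In the spider, follow that link at the end of parse: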
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Alternatively, response.follow accepts a relative URL directly, so the urljoin step is not needed:
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
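The href extracted from the page is relative ('/page/2/'), which is why it has to be resolved against the URL of the current page before it can be requested. The sketch below shows that resolution with the standard library's urljoin, on the assumption that response.urljoin behaves the same way for a page like this one (no base tag); it does not use Scrapy itself.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

page_url = 'http://quotes.toscrape.com/page/1/'
next_page = '/page/2/'

# A path starting with "/" replaces the whole path of the current URL.
absolute = urljoin(page_url, next_page)
print(absolute)  # http://quotes.toscrape.com/page/2/

# A URL that is already absolute is returned unchanged.
unchanged = urljoin(page_url, 'http://quotes.toscrape.com/page/3/')
print(unchanged)  # http://quotes.toscrape.com/page/3/
```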
Author Spider
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Follow the link to each author page.
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # Follow the pagination link.
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
>scrapy crawl author -o authors.json
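One thing to watch in extract_with_css: extract_first() returns None when the selector matches nothing, and calling .strip() on None raises AttributeError. A minimal sketch of a more forgiving variant is below. strip_or_default is a hypothetical helper, not part of Scrapy; inside the spider it would wrap the value returned by response.css(query).extract_first().

```python
def strip_or_default(value, default=u''):
    """Return value with surrounding whitespace removed, or default for None."""
    if value is None:
        return default
    return value.strip()


# A matched value is stripped as before.
print(repr(strip_or_default(u'  Albert Einstein \n')))
# A missing value falls back to the default instead of raising.
print(repr(strip_or_default(None)))
```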
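Receive Parameters
Pass arguments on the command line with -a; Scrapy sets each one as an attribute on the spider instance, where it can be read with getattr: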
>scrapy crawl quotes -o quotes-humor.json -a tag=humor
    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)
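The getattr call is what makes the tag argument optional. The sketch below reproduces the URL-building logic without Scrapy so both cases can be tried directly. FakeSpider and start_url are hypothetical stand-ins; the only assumption carried over from Scrapy is that -a arguments end up as instance attributes.

```python
class FakeSpider(object):
    """Hypothetical stand-in: keyword arguments become instance attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        # getattr with a default avoids AttributeError when no tag was given.
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


print(FakeSpider().start_url())             # http://quotes.toscrape.com/
print(FakeSpider(tag='humor').start_url())  # http://quotes.toscrape.com/tag/humor
```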
References:
https://www.debrice.com/building-a-simple-crawler/
https://gist.github.com/debrice/a34563fb078d9d2d15e8
https://scrapy.org/
https://medium.com/python-pandemonium/develop-your-first-web-crawler-in-python-scrapy-6b2ee4baf954