Python Scrapy framework

Scrapy architecture
(Scrapy architecture diagram)

Scrapy Installation
1. Install Python 3.6.
2. Install Scrapy:
Open http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and download the Twisted .whl file that matches your Python,
for example Twisted-17.5.0-cp36-cp36m-win_amd64.whl,
where cp36 is the Python version
and amd64 means 64-bit (check whether your Python install is 32- or 64-bit). Then run:
pip install d:\xxx\Twisted-18.7.0-cp36-cp36m-win_amd64.whl
followed by pip install scrapy.
Run scrapy version to check the Scrapy installation info.

Scrapy workflow:
1. Create a new project (project)
2. Define the data to extract (items)
3. Build the crawler (spider)
4. Store the content (pipeline)

  1. Create a project
scrapy startproject myproject
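
This creates a project skeleton roughly like the following (the exact layout can vary a little between Scrapy versions):

myproject/
    scrapy.cfg            # deploy/configuration file
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py      # item pipelines go here
        settings.py       # project settings
        spiders/          # spiders go here
            __init__.py
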
  2. Define the target data (items)

items.py

import scrapy
class TorrentItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    size = scrapy.Field()
  3. Write the spider
    Here you will need to use XPath:
XPath
/html/body/h2/text()
//img[@class="f"]
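
For instance, the two expressions above could be tried from a parse() callback or the scrapy shell like this (the class name "f" is just the example value above):

# text of the <h2> directly under <body>
heading = response.xpath('/html/body/h2/text()').extract()
# every <img class="f">, pulling out its src attribute
img_src = response.xpath('//img[@class="f"]/@src').extract()

The full CrawlSpider example below (mininova) puts XPath to use:
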
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import TorrentItem  # assuming TorrentItem is the item defined in myproject/items.py above
class MininovaSpider(CrawlSpider):

    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent

A simpler Spider example (the classic dmoz tutorial spider); its parse() just saves each page body to a file:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

A Douban Top 250 example with Scrapy:

# -*- coding: utf-8 -*-
import scrapy
from firstScrapyProject.items import DoubanmovieItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for info in response.xpath('//div[@class="item"]'):
            item = DoubanmovieItem()
            item['rank'] = info.xpath('div[@class="pic"]/em/text()').extract()
            item['title'] = info.xpath('div[@class="pic"]/a/img/@alt').extract()
            item['link'] = info.xpath('div[@class="pic"]/a/@href').extract()
            item['star'] = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
            item['rate'] = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[4]/text()').extract()
            item['quote'] = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            yield item

        # pagination: follow the "next page" link (kept outside the item loop)
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, callback=self.parse)
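
The spider imports DoubanmovieItem from firstScrapyProject.items, which is not shown in the post; a minimal sketch, assuming it only declares the fields the spider fills in:

import scrapy

class DoubanmovieItem(scrapy.Item):
    # fields assumed from the spider above
    rank = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    star = scrapy.Field()
    rate = scrapy.Field()
    quote = scrapy.Field()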

A scrapy.Spider has three essential members: name, start_urls, and parse().
Run the spider to perform the crawl:

scrapy crawl mininova -o scraped_data.json

Feed exports can produce JSON, CSV, and XML; if you need something more complex, write an item pipeline.
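
A minimal sketch of such a pipeline (class name and output file are illustrative, not from the original post); it would go in the project's pipelines.py and be enabled via ITEM_PIPELINES in settings.py:

import json

class JsonWriterPipeline:
    # writes every scraped item to a JSON Lines file
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item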

Selectors have four basic methods (each is documented in detail in the API docs):
xpath(): takes an XPath expression and returns a SelectorList of every matching node.
css(): takes a CSS expression and returns a SelectorList of every matching node.
extract(): serializes the selected node(s) to unicode strings and returns them as a list.
re(): extracts data using the given regular expression and returns a list of unicode strings.
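
For example, inside a parse() callback (the expressions here are illustrative):

titles = response.xpath('//h2/text()').extract()           # XPath -> list of strings
links = response.css('a::attr(href)').extract()            # CSS -> list of strings
items = response.xpath('//div[@class="item"]')             # SelectorList; can be queried again
sizes = response.xpath('//p/text()').re(r'(\d+\.\d+) MB')  # regex -> list of matched strings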

Create a new spider

scrapy genspider mydomain mydomain.com
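
This generates a spider skeleton roughly like the one below (the exact template depends on the Scrapy version):

import scrapy

class MydomainSpider(scrapy.Spider):
    name = 'mydomain'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    def parse(self, response):
        pass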

Feed exports / item pipelines
A single parse() call can yield several items (and follow-up requests):

import scrapy
from myproject.items import MyItem
class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)  # explicit Selector; response.xpath()/.css() below is equivalent
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

yield returns a value just like return, but it also remembers where it stopped; the next iteration resumes from that position.

# joining a relative URL onto a base URL
from urllib.parse import urljoin

x = urljoin('http://www.baidu.com', '../index.html')
print(x)  # http://www.baidu.com/index.html

A simple way to understand yield in Python:
Common usage: the yield keyword turns the function that contains it into a generator, which you can then iterate over: for x in fun(param).

In my understanding, yield works like pause and resume.
Inside a function, when execution reaches a yield statement the function pauses and returns the value of the yield expression; on the next call it resumes right after that yield, and so on until the function finishes.
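
A tiny example of that pause-and-resume behaviour:

def count_up(n):
    i = 0
    while i < n:
        yield i      # pause here and hand i to the caller
        i += 1       # execution resumes from this line on the next iteration

for x in count_up(3):
    print(x)         # prints 0, 1, 2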

Going further:
next() is very similar to send(): both resume the generator and give you the value of the next yield expression, but send() can additionally pass a value back into the generator.
yield from: a function that wraps another generator with yield from is itself a generator; it delegates to the sub-generator.
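
A small sketch of next()/send() and yield from (the names are illustrative):

def echo():
    received = None
    while True:
        # send(value) makes this yield expression evaluate to value
        received = yield received

def wrapper():
    yield from echo()   # wrapper() is itself a generator, delegating to echo()

g = wrapper()
next(g)                 # advance to the first yield (equivalent to g.send(None))
print(g.send(42))       # prints 42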

Other useful Python modules:
BeautifulSoup
Requests
PIT module
the lxml module
the pandas library
Pillow (crop and save images)
schedule
Things I personally find especially easy to use:
with open() for file operations
format() for building strings
split() and slicing
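
For example:

# with open() closes the file automatically, even on errors
with open('data.txt', 'w', encoding='utf-8') as f:
    f.write('hello scrapy\n')

# format() builds strings from values
msg = 'page {} of {}'.format(2, 10)

# split() turns a string into a list
parts = 'a,b,c'.split(',')   # ['a', 'b', 'c']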

Origin blog.csdn.net/beyondxiaohu15/article/details/83111381