Learning the Scrapy framework by crawling a novel

I. Background: I have recently been learning Python crawler techniques and find them very interesting. Hand-writing crawlers felt inefficient, and the crawler community offers more advanced tools, so I decided to try the Scrapy crawler framework.

II. Environment: CentOS 7, Python 3.7, Scrapy 1.7.3

III. Scrapy principles in brief:

1. Scrapy framework components: the engine, the scheduler, the downloader (including downloader middleware), the spider component (spiders, including spider middleware), and the item pipelines for output.

2. Scrapy workflow:

(1) The engine takes the spider's crawl requests and submits them to the scheduler, which arranges the order of the tasks.

(2) The scheduler hands the queued download tasks back to the engine, which dispatches them (as Request objects) to the downloader for execution.

(3) The downloader fetches the content and returns a Response, which the engine passes back to the spider component for parsing. (This step is where most of the manual work lies: what to crawl and how to parse it must be designed and configured by hand.)

(4) The data extracted by the spider component is handed to the item pipelines component for output, which can be JSON, a database, CSV and similar formats. (This step depends on the specific needs; the output scheme can be controlled manually.)

(5) This process repeats until all scheduled crawling tasks are completed; a minimal sketch of the loop follows below.
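
To make the loop concrete, here is a minimal, self-contained spider sketch. It is not the project built in section IV; the site quotes.toscrape.com, the spider name and the field names are only illustrative assumptions. It shows the same round trip: start requests are queued by the scheduler, the downloader returns responses, parse() yields items to the pipelines or feed export, and newly yielded requests keep the cycle going until nothing is left.

# -*- coding: utf-8 -*-
# Minimal sketch of the Scrapy workflow (illustrative site and field names).
# Run with: scrapy runspider demo_spider.py -o demo.json
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://quotes.toscrape.com/']    # (1)(2) requests for these URLs are queued by the scheduler

    def parse(self, response):    # (3) the downloader's response comes back to the spider here
        for quote in response.css('div.quote'):
            yield {    # (4) yielded items flow to the item pipelines / feed export
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:    # (5) new requests keep the cycle going until the crawl is done
            yield response.follow(next_page, callback=self.parse)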

IV. Scrapy project build process (using the crawl of the novel "Hanxiang" on the Biquge website www.xbiquge.la as the example)

1. Create a new crawler project

(base) [python@ELK ~]$ scrapy startproject hanxiang
New Scrapy project 'hanxiang', using template directory '/home/python/miniconda3/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /home/python/hanxiang

You can start your first spider with:
    cd hanxiang
    scrapy genspider example example.com
(base) [python@ELK ~]$ tree hanxiang
hanxiang
├── hanxiang
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

2. Enter the project directory and generate the spider file

(base) [python@ELK ~]$ cd hanxiang
(base) [python@ELK hanxiang]$ scrapy genspider Hanxiang www.xbiquge.la/15/15158    # Note: the spider name must not be the same as the project name; do not add the http:// or https:// prefix, and do not put a trailing / on the link.
Created spider 'Hanxiang' using template 'basic' in module:
  hanxiang.spiders.Hanxiang

(base) [python@ELK hanxiang]$ tree
.
├── hanxiang
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   └── settings.cpython-37.pyc
│   ├── settings.py
│   └── spiders
│       ├── Hanxiang.py
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-37.pyc
└── scrapy.cfg

Among the files in the newly created project, the key one is the spider program file Hanxiang.py.

3. Write the spider and the related configuration files

(1) Modify the settings.py file

(base) [python@ELK hanxiang]$ vi hanxiang/settings.py

# -*- coding: utf-8 -*-
# Scrapy settings for hanxiang project

BOT_NAME = 'hanxiang'
SPIDER_MODULES = ['hanxiang.spiders']
NEWSPIDER_MODULE = 'hanxiang.spiders'

ROBOTSTXT_OBEY = False    # do not obey the site's robots.txt restrictions when crawling

...

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'hanxiang.pipelines.HanxiangPipeline': 300,    # enable the pipeline; 300 is its priority, in the range 0-1000
}
...


FEED_EXPORT_ENCODING = 'utf-8'    # set the export encoding
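
The original post changes only the settings shown above. As an optional extra (not part of the original configuration; the values below are assumptions), it is common to throttle the crawl a little when hitting a public site:

# Optional additions to settings.py (illustrative values, not in the original post):
DOWNLOAD_DELAY = 1                      # wait 1 second between requests to the site
CONCURRENT_REQUESTS_PER_DOMAIN = 4      # limit parallel requests to the same domain
USER_AGENT = 'Mozilla/5.0'              # some sites reject the default Scrapy user agent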

(2) Modify items.py (it defines what content is crawled)

(base) [python@ELK hanxiang]$ vi hanxiang/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class HanxiangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()             # link of the novel chapter to fetch
    preview_page = scrapy.Field()    # link of the previous chapter
    next_page = scrapy.Field()       # link of the next chapter
    content = scrapy.Field()         # content of the chapter
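
A scrapy.Item behaves like a dict restricted to the declared fields; a tiny usage sketch follows (the values are only placeholders, the real population happens in Hanxiang.py below):

item = HanxiangItem()
item['url'] = 'http://www.xbiquge.la/15/15158/'   # declared field: accepted
item['content'] = 'chapter text ...'
# item['title'] = '...'                           # undeclared field: would raise KeyError
print(dict(item))    # an Item can be converted to a plain dict for inspection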

(3) Write the pipelines.py program, which controls the output. (Goal: write the four crawled fields into a MySQL database to make subsequent processing easier.)

(base) [python@ELK hanxiang]$ vi hanxiang/pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
import pymysql
from twisted.enterprise import adbapi
from pymysql import cursors


class HanxiangPipeline(object):    # the class name is generated automatically; do not change it.

    def __init__(self):    # class initialization: connect to the database and create the hanxiang table for the novel data.
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '',
            'database': 'novels',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None
        self.cursor.execute("drop table if exists hanxiang")
        self.cursor.execute("create table hanxiang (id int unsigned auto_increment not null primary key, url varchar(50) not null, preview_page varchar(50), next_page varchar(50), content TEXT not null) charset=utf8")

    def process_item(self, item, spider):    # this method name is also generated automatically and cannot be changed.
        self.cursor.execute(self.sql, (item['url'], item['preview_page'], item['next_page'], item['content']))    # execute the SQL command to write the crawled data into the hanxiang table.
        self.conn.commit()    # commit is required, otherwise the table contents will not be updated.
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
                insert into hanxiang(id, url, preview_page, next_page, content) values(null, %s, %s, %s, %s)
                """
            return self._sql
        return self._sql
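
The __init__ method above assumes that a MySQL database named novels already exists and that the connection parameters match a local MySQL server; the account and empty password used here are assumptions. A one-off preparation sketch using pymysql:

# One-off setup: create the 'novels' database that HanxiangPipeline expects.
# Host/user/password are assumptions; adjust them to your MySQL server.
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='')
try:
    with conn.cursor() as cursor:
        cursor.execute("create database if not exists novels default character set utf8")
    conn.commit()
finally:
    conn.close()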

(4) Write the main spider program Hanxiang.py and define what to crawl.

(base) [python@ELK hanxiang]$ vi hanxiang/spiders/Hanxiang.py

# -*- coding: utf-8 -*-
import scrapy
from hanxiang.items import HanxiangItem


class HanxiangSpider(scrapy.Spider):    # the spider class name is generated automatically
    name = 'Hanxiang'
    #allowed_domains = ['www.xbiquge.la/15/15158']    # auto-generated restriction on the crawl range (domain)
    allowed_domains = ['xbiquge.la']    # relax the page restriction; otherwise requests to deeper pages would need the dont_filter=True parameter in scrapy.Request.

    def start_requests(self):    # this method name cannot be changed
        start_urls = ['http://www.xbiquge.la/15/15158/']    # the start pages must be stored as a list, and the variable name cannot be changed.
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)    # call the parse method as a generator (yield), which is very efficient.

    def parse(self, response):    # this method name cannot be changed
        dl = response.css('#list dl dd')    # the chapter list of the novel
        for dd in dl:
            self.url_c = "http://www.xbiquge.la" + dd.css('a::attr(href)').extract()[0]    # build the full link of each chapter.
            #print(self.url_c)
            #yield scrapy.Request(self.url_c, callback=self.parse_c, dont_filter=True)
            yield scrapy.Request(self.url_c, callback=self.parse_c)    # call parse_c as a generator (yield) to get each chapter's link, previous-chapter link, next-chapter link and content.
            #print(self.url_c)

    def parse_c(self, response):
        item = HanxiangItem()
        item['url'] = response.url
        item['preview_page'] = "http://www.xbiquge.la" + response.css('div .bottem1 a::attr(href)').extract()[1]    # previous-chapter link
        item['next_page'] = "http://www.xbiquge.la" + response.css('div .bottem1 a::attr(href)').extract()[3]    # next-chapter link; the selector mirrors the previous-chapter one
        title = response.css('.con_top::text').extract()[4]    # chapter title
        contents = response.css('#content::text').extract()    # chapter body text fragments
        text = ''
        for content in contents:
            text = text + content
        #print(text)
        item['content'] = title + "\n" + text.replace('\15', '\n')    # combine the chapter title and body into the content field; \15 is the octal escape for ^M and needs to be replaced with a newline.
        yield item    # hand the item to the pipelines output module as a generator (yield).

4. Run the spider:

(base) [python@ELK hanxiang]$ scrapy runspider hanxiang/spiders/Hanxiang.py
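
After the crawl finishes, the stored chapters can be checked directly in MySQL. A quick verification sketch, with connection parameters mirroring the assumptions in pipelines.py:

# Quick check of the crawl results stored by HanxiangPipeline.
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='',
                       database='novels', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("select count(*) from hanxiang")
        print("chapters stored:", cursor.fetchone()[0])
        cursor.execute("select url from hanxiang limit 3")
        for (url,) in cursor.fetchall():
            print(url)
finally:
    conn.close()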

 


Source: www.cnblogs.com/sfccl/p/11401782.html