"Learning scrapy framework climb fiction" to further improve

First, the improvement goals:

1, for ease of use, turn the novel's pinyin or English name, the Chinese name used for the output file, and the url of the first chapter into variables, so that crawling a different novel only requires changing these parameters.

2, modify the settings.py file to configure logging of debug information, which makes troubleshooting easier.

3, change the character-set encoding to handle pages containing emoji symbols, which previously made it impossible to write the crawled content into the data table (see the snippet below).
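
A quick illustration of goal 3 (a sketch added here, not from the original post): MySQL's legacy utf8 charset stores at most three bytes per character, which covers ordinary Chinese characters, but emoji are four-byte UTF-8 sequences, which is why the table must use utf8mb4:

# Why MySQL's legacy "utf8" cannot store emoji: it allows at most 3 bytes
# per character, while emoji are 4-byte UTF-8 sequences.
print(len("中".encode("utf-8")))          # 3 -- an ordinary CJK character fits in utf8
print(len("\U0001F600".encode("utf-8")))  # 4 -- an emoji needs utf8mb4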

Second, the implementation process

1, modify the pipelines.py file:

(python) [root@localhost xbiquge]# vi xbiquge/pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
import time
import pymysql
from twisted.enterprise import adbapi
from pymysql import cursors

class XbiqugePipeline(object):
    # class initialization: connect to the database (the novel's table itself is built in open_spider)
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': 'password',
            'database': 'novels',
            'charset': 'utf8mb4'  # utf8mb4 avoids insert errors on pages with emoji: MySQL's utf8 stores at most 3 bytes per character, enough for ordinary characters, but emoji take 4 bytes
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None
        self.name_novel = "heifen"  # pinyin or English name of the novel; also the name of the storage table
        self.url_firstchapter = "http://www.xbiquge.la/43/43474/19425948.html"  # link to the first chapter of the novel
        self.name_txt = "Wife's No.1 Anti-Fan"  # Chinese title of the novel; the output file is named after it

    # spider start
    def open_spider(self, spider):
        self.createtable()  # initialize the novel storage table when the spider starts
        return

    # create the table
    def createtable(self):
        self.cursor.execute("drop table if exists " + self.name_novel)
        self.cursor.execute("create table " + self.name_novel + " (id int unsigned auto_increment not null primary key, url varchar(50) not null, preview_page varchar(50), next_page varchar(50), content TEXT not null) charset=utf8mb4")
        return

    def process_item(self, item, spider):
        self.cursor.execute(self.sql, (item['url'], item['preview_page'], item['next_page'], item['content']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
                insert into """ + self.name_novel + """(id, url, preview_page, next_page, content) values(null, %s, %s, %s, %s)
                """
            return self._sql
        return self._sql

    # read the novel's chapters from the database and write them to a txt file
    def content2txt(self):
        self.cursor.execute("select count(*) from " + self.name_novel)
        record_num = self.cursor.fetchall()[0][0]
        print(record_num)
        counts = record_num
        url_c = "\"" + self.url_firstchapter + "\""
        start_time = time.time()  # record the start time of the extraction run
        f = open(self.name_txt + ".txt", mode='w', encoding='utf-8')  # open the output file, named after the novel's title
        for i in range(counts):
            sql_c = "select content from " + self.name_novel + " where url=" + url_c  # build the sql command that fetches one chapter's content
            self.cursor.execute(sql_c)
            record_content_c2a0 = self.cursor.fetchall()[0][0]  # fetch the chapter content
            record_content = record_content_c2a0.replace(u'\xa0', u'')  # strip the special \xc2\xa0 (non-breaking space) characters
            f.write('\n')
            f.write(record_content + '\n')
            f.write('\n\n')
            sql_n = "select next_page from " + self.name_novel + " where url=" + url_c  # build the sql command that fetches the next chapter's link
            self.cursor.execute(sql_n)
            url_c = "\"" + self.cursor.fetchall()[0][0] + "\""  # assign the next chapter's link to url_c, ready for the next loop iteration
        f.close()
        print(time.time() - start_time)
        print(self.name_txt + ".txt" + " has been generated!")
        return

    # when the spider closes, call content2txt to generate the txt file
    def close_spider(self, spider):
        self.content2txt()
        return
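
As an aside, the sql strings in content2txt embed url_c by hand-adding \" quotes. A safer equivalent (a sketch against the same table layout, not part of the original post) passes the url as a query parameter and lets pymysql handle the quoting, so the embedded quote characters become unnecessary:

# sketch: parameterized form of the chapter lookup in content2txt;
# the table name must still be concatenated (placeholders cannot stand
# in for identifiers), but the url value is passed safely as a parameter
sql_c = "select content from " + self.name_novel + " where url=%s"
self.cursor.execute(sql_c, (url_c,))
record_content = self.cursor.fetchall()[0][0]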

2, the spider file:

(python) [root@localhost xbiquge]# vi xbiquge/spiders/heifen.py  # the spider file can be copied and reused; there is no need to run the scrapy genspider command again

# -*- coding: utf-8 -*-
import scrapy
from xbiquge.items import XbiqugeItem

class SancunSpider(scrapy.Spider):  # generated by the "scrapy genspider sancun www.xbiquge.la" command; when fetching a different novel, a class name like this does not need to be changed
    name = 'heifen'  # each spider needs its own distinct name here
    allowed_domains = ['www.xbiquge.la']
    #start_urls = ['http://www.xbiquge.la/10/10489/']

    def start_requests(self):
        start_urls = ['http://www.xbiquge.la/43/43474/']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        dl = response.css('#list dd')  # extract the chapter-link elements
        for dd in dl:
            self.url_c = "http://www.xbiquge.la" + dd.css('a::attr(href)').extract()[0]  # build the full link of each chapter
            #print(self.url_c)
            #yield scrapy.Request(self.url_c, callback=self.parse_c, dont_filter=True)
            yield scrapy.Request(self.url_c, callback=self.parse_c)  # yield a request for each chapter; parse_c extracts the chapter url, the previous- and next-chapter links, and the chapter content
            #print(self.url_c)

    def parse_c(self, response):
        item = XbiqugeItem()
        item['url'] = response.url
        item['preview_page'] = "http://www.xbiquge.la" + response.css('div .bottem1 a::attr(href)').extract()[1]
        item['next_page'] = "http://www.xbiquge.la" + response.css('div .bottem1 a::attr(href)').extract()[3]  # index reconstructed: the original line was cut off; the next-chapter anchor follows the previous-chapter anchor in the .bottem1 block
        title = response.css('.con_top::text').extract()[4]
        contents = response.css('#content::text').extract()
        text = ''
        for content in contents:
            text = text + content
        #print(text)
        item['content'] = title + "\n" + text.replace('\15', '\n')  # combine the chapter title and body into content; \15 is the octal escape for ^M (carriage return) and must be replaced with newlines
        yield item  # yield the item to the pipelines module for output
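
The CSS selectors above are tied to www.xbiquge.la's page layout. If the site's markup changes, the selectors and list indexes can be re-verified interactively with scrapy shell before editing the spider (a usage sketch, using the same selector as parse above):

(python) [root@localhost xbiquge]# scrapy shell "http://www.xbiquge.la/43/43474/"
>>> response.css('#list dd a::attr(href)').extract()[0]   # should print the first chapter's relative link
>>> exit()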

3, modify the settings file:

  (python) [root@localhost xbiquge]# vi xbiquge/settings.py

...

ROBOTSTXT_OBEY = False

...

ITEM_PIPELINES = {
    'xbiquge.pipelines.XbiqugePipeline': 300,
}

...

FEED_EXPORT_ENCODING = 'utf-8'
LOG_LEVEL = 'DEBUG'
LOG_FILE =  './myspiders.log'

4, the items.py file:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class XbiqugeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    preview_page = scrapy.Field()
    next_page = scrapy.Field()
    content = scrapy.Field()

Third, using the spider to crawl a different novel:

1, copy the spider file: cp heifen.py xueba.py

2, in the new spider file, modify the spider name (name) and the url of the novel's index page (start_urls):

(1) change name = 'heifen' to name = 'xueba';

(2) change start_urls = ['http://www.xbiquge.la/43/43474/'] to start_urls = ['http://www.xbiquge.la/19/19639/']

3, modify the three variables in pipelines.py: self.name_novel, self.url_firstchapter and self.name_txt, for example as in the sketch below.
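
For the xueba novel the three lines might become the following; the first-chapter url and the Chinese title are placeholders, since the original post does not state them:

        self.name_novel = "xueba"  # pinyin name; also the storage table name
        self.url_firstchapter = "http://www.xbiquge.la/19/19639/xxxxxxxx.html"  # placeholder: fill in the real url of the first chapter
        self.name_txt = "..."  # placeholder: the novel's Chinese title, used to name the txt output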

4, run the spider (from the /root/xbiquge directory): scrapy runspider xbiquge/spiders/xueba.py

When the run completes, the generated txt file of the novel can be found in the current directory (/root/xbiquge). The spider's debug information can be viewed in /root/xbiquge/myspiders.log.
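
Because LOG_LEVEL is set to DEBUG, the log grows quickly; ordinary shell tools are enough to inspect it, for example (commands are illustrative, not from the original post):

(python) [root@localhost xbiquge]# grep -i error myspiders.log
(python) [root@localhost xbiquge]# tail -n 50 myspiders.log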
