Use scrapy for persistent storage

scrapy framework

  • Architecture

    • Crawler folder: spiders
    • Pipeline file: pipelines.py
    • Middleware file: middlewares.py
    • item module: items.py
    • Settings file: settings.py
  • Related commands (a combined example follows this outline)

    • Create a scrapy project: scrapy startproject project_name
    • Create a crawler file:
      • cd project_name
      • scrapy genspider spider_name www.xxx.com
    • Execute the crawler file: scrapy crawl spider_name
    • Run without viewing logs: scrapy crawl spider_name --nolog
    • Project initialization:
      • Configure settings:
        • UA spoofing
        • Disable the robots.txt protocol
        • Add the log level setting LOG_LEVEL
    • Persistent storage
      • Storage via the terminal command
        • scrapy crawl spider_name -o xxx.csv (json, xml, csv ...)
      • Pipeline based persistent storage
        • File storage
        • MySql storage
        • Redis storage 
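Putting those commands together, a minimal end-to-end run looks roughly like this (the project and spider names here are placeholders, not the ones used in the examples below):

scrapy startproject demo_pro
cd demo_pro
scrapy genspider demo_spider www.xxx.com
scrapy crawl demo_spider --nolog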

Requirement 1: use scrapy to crawl the jokes and their authors from the Budejie homepage

1. First create a crawler project; on the command line, enter:

scrapy startproject budejiepro

Go to the project directory and create a crawler file named budejie

cd budejiepro
scrapy genspider budejie www.xxx.com

This generates a budejie.py crawler file in the project's spiders folder.

2. Before starting the crawler, we first change `ROBOTSTXT_OBEY = True` in the settings.py file to **False**, otherwise scrapy will refuse access according to the robots.txt protocol. We also need to set the UA. This is the standard preparation before crawling any website.

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

3. We open the crawler file budejie.py and start writing the crawl. The starting URL generated when we created the spider is www.xxx.com, so we need to change it here by hand (or configure the correct URL from the beginning). In this class, allowed_domains lists the domain names the spider is allowed to request automatically: scrapy will only send requests for URLs under these domains. start_urls is the list of starting URLs; scrapy automatically sends requests to them and receives the response data. The parse method is used to parse the data, and that is where our data analysis happens next.
- We locate the tags where the joke text and the author sit. There is no tedious description here; it is the same as the analysis step of an ordinary crawler.


However, it should be noted that when we print the author and joke text we crawled, each value is wrapped in a Selector object, and those Selectors come back inside a list, so we need to use extract() to pull out the text:

author = li.xpath('./div[1]/div[2]/a/text()').extract()[0]
author = li.xpath('./div[1]/div[2]/a/text()').extract_first()
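In short (this is standard scrapy behavior, not specific to this project):

# extract() returns a list of the text strings from the matched Selectors, so [0] picks the first one
# extract_first() returns the first string directly, or None if nothing matched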

 


Problem: persistent storage. Right now the data only appears in the terminal; if the returned data is to be stored persistently, it needs to be encapsulated into a dictionary.

# -*- coding: utf-8 -*-
import scrapy

class BudejieSpider(scrapy.Spider):
    name = 'budejie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.budejie.com/text/']

    def parse(self, response):
        """
        Data parsing is implemented here.
        :param response:
        :return:
        """
        li_list = response.xpath('//div[@class="j-r-list"]/ul/li')
        name_list = list()
        for li in li_list:
            # The xpath method returns a list of Selector objects; use extract() to get the text inside them.
            # If we are sure the returned list contains only one Selector, we can use extract_first() to get the text directly.
            # author = li.xpath('./div[1]/div[2]/a/text()').extract()[0]
            author = li.xpath('./div[1]/div[2]/a/text()').extract_first()
            content = li.xpath('./div[2]/div/a/text()').extract_first()
            # If the returned data is to be stored persistently, it needs to be encapsulated in a dictionary
            dic = {
                "author": author,
                "content": content
            }
            name_list.append(dic)

        return name_list
budejie.py
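Since parse returns a list of dictionaries, the terminal-command-based persistence from the outline can be applied directly; for example (the output file name here is just an illustration):

scrapy crawl budejie -o budejie.csv --nolog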

 

Requirement 2: use the scrapy pipeline to persistently store the title and img_url of each image on the Xiaohua (campus beauty) site

- Crawling approach (ignoring the basic configuration and other routine operations)

  1. Crawling the Xiaohua site really follows the same analysis idea as the Budejie crawl above; the difficulty this time is how to persist the data through a pipeline. The Budejie example above used persistence based on a terminal command: the data is encapsulated into dictionaries and returned, and the terminal command writes the file in the specified format. Now we can simply use the efficient and convenient persistence mechanism that the scrapy framework has designed for us.

  2. Pipeline-based persistence involves two .py files:
    items.py (the data structure template file, which defines the data fields)
    pipelines.py (the pipeline file, which receives item objects and performs the persistence)
  3. Persistence workflow:
    1. After the crawler file crawls the website data, encapsulate the data into an item object.
    2. Use the yield keyword in the crawler file to submit the item object to the pipeline for the persistence operation.
    3. In the process_item method of the pipeline file, receive the item object submitted by the crawler file, then write the persistence code that stores the data held in the item object.
    4. Enable the pipeline in the settings.py configuration file.

# First, configure the settings.py file: UA spoofing, disable the robots.txt protocol, and the log settings
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# This step enables the pipeline in the configuration file
ITEM_PIPELINES = {
    'xiaohuapro.pipelines.XiaohuaproPipeline': 300,
}
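If the MysqlPipeline defined later in pipelines.py should also run, it has to be registered here as well. A sketch, assuming the class names used below; the lower the number, the earlier that pipeline handles each item:

ITEM_PIPELINES = {
    'xiaohuapro.pipelines.XiaohuaproPipeline': 300,
    'xiaohuapro.pipelines.MysqlPipeline': 301,
}

The spider file xiaohua.py then looks like this: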
import scrapy
from xiaohuapro.items import XiaohuaproItem

class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/meinvxiaohua/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')

        for li in li_list:
            title = li.xpath('./a[2]/text() | ./a[2]/b/text()').extract_first()
            img_url = "http://www.521609.com" + li.xpath('./a[1]/img/@src').extract_first()
            print(title, img_url)
            # Import the XiaohuaproItem class from the items.py file and instantiate an item object
            item = XiaohuaproItem()
            # The crawled data needs to be stored in the attributes of the object.
            # In fact, the item object can be understood as a dictionary: the data is stored through dictionary-style access.
            item["title"] = title
            item["img_url"] = img_url

            # How do we submit the item object to the pipeline? This uses yield generator knowledge ~~
            # We only need to use the yield keyword to submit the item object to the pipeline.
            # This operation must be placed inside the loop so that multiple items are submitted to the pipeline.
            yield item
xiaohua.py
import scrapy

class XiaohuaproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    img_url = scrapy.Field()
items.py
import pymysql
import redis


class XiaohuaproPipeline(object):
    f = None

    def open_spider(self, spider):
        print("Start crawler")
        self.f = open('./xiaohua.txt', "w", encoding="utf-8")

    def process_item(self, item, spider):
        """
        The pipeline persists the data here.
        Since the crawler file submits data to the pipeline many times, this method is executed many times.
        :param item: an object instantiated from the scrapy.Item subclass in the items.py file
        :param spider:
        :return:
        """
        title = item["title"]
        img_url = item["img_url"]

        self.f.write(title + ":" + img_url + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()
        print("End crawler")


class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print("Start crawler")
        self.conn = pymysql.Connect(host="10.0.3.156", port=3306, user="root", password="", db="qishi8", charset="utf8")

    def process_item(self, item, spider):
        """
        The pipeline persists the data here.
        Since the crawler file submits data to the pipeline many times, this method is executed many times.
        :param item: an object instantiated from the scrapy.Item subclass in the items.py file
        :param spider:
        :return:
        """
        self.cursor = self.conn.cursor()
        title = item["title"]
        img_url = item["img_url"]
        sql = "insert into xiaohua values('{}','{}')".format(title, img_url)
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            # rollback is a method of the connection, not the cursor
            self.conn.rollback()

        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
        print("End crawler")
pipelines.py
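The outline at the top also lists Redis storage, and pipelines.py already imports redis, but no Redis pipeline is shown. Here is a minimal sketch of what one could look like, assuming a local Redis server and the redis-py package (the class name RedisPipeline and the list key "xiaohua" are illustrative, not from the original project); like the other pipelines, it would also need to be registered in ITEM_PIPELINES:

import json

import redis


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        # assumes a Redis server is reachable at 127.0.0.1:6379
        self.conn = redis.Redis(host="127.0.0.1", port=6379)

    def process_item(self, item, spider):
        dic = {
            "title": item["title"],
            "img_url": item["img_url"],
        }
        # push each record onto a Redis list; the key name "xiaohua" is illustrative
        self.conn.lpush("xiaohua", json.dumps(dic))
        return item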

 
