Day 43 Crawler: the scrapy framework

scrapy framework

scrapy

  • What is a framework?
    • Simply put, a framework is a highly reusable project template that integrates many common functions and can be applied to different project requirements. It can be thought of as a semi-finished project.
  • How to learn a framework?
    • For newcomers or junior programmers, it is enough at first to master what the framework does and how to use its various features. The underlying implementation and principles can be studied step by step later.
  • What is scrapy?
    • Scrapy is an application framework written for crawling website data and extracting structured data. It is well known and very powerful, and it integrates many functions out of the box (high-performance asynchronous downloading, queuing, distribution, parsing, persistence, etc.). When learning the framework, the key is to learn its characteristics and how to use each of its features.

Basic use of scrapy

Environment installation:
Linux and macOS: pip install scrapy


Windows system:
  1. pip install wheel
  2. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  3. Install Twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
  4. pip install pywin32
  5. pip install scrapy
Test: run the scrapy command in the terminal; if no error is reported, the installation succeeded.
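As an additional quick check (assuming scrapy was installed into the currently active environment), you can also print the installed version:

scrapy version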

The process of using scrapy:
1. Create a project: scrapy startproject ProName
2. Enter the project directory: cd ProName
3. Create a crawler file: scrapy genspider spiderName www.xxx.com
4. Write the relevant code (the generated project layout is sketched below)
5. Run the project: scrapy crawl spiderName
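
For reference, a project created this way has roughly the following layout; spiderName.py is the crawler file generated by scrapy genspider:

ProName/
    scrapy.cfg            # project deployment configuration
    ProName/
        __init__.py
        items.py          # data structure templates (Item classes)
        middlewares.py    # downloader and spider middlewares
        pipelines.py      # persistence pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            spiderName.py # crawler file created by scrapy genspider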

Code example:

The crawler file under the spiders directory:

import scrapy


class FirstSpider(scrapy.Spider):
    # name: the unique identifier of the crawler (source file) within the project
    name = 'first'

    # allowed_domains: limits the domains to which requests may be sent (usually commented out)
    # allowed_domains = ['www.xxx.com']

    # start_urls: the initial list of URLs; scrapy automatically sends requests to each of them
    start_urls = ['http://www.baidu.com/', 'http://www.sogou.com']

    # parse: used for data extraction; the response parameter is the response object
    # corresponding to each successful request
    def parse(self, response):
        print(response)

    # Run:
    #   scrapy crawl first           # normal execution
    #   scrapy crawl first --nolog   # hides log output, but also hides error messages (not recommended)
    # A better option is to set the log level in settings.py: LOG_LEVEL = 'ERROR'
    # Note: the robots.txt protocol is obeyed by default; for now it can be turned off by changing
    # ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False in settings.py

settings.py

# The relevant settings are as follows:
# Line 19: disguise the identity of the request carrier (UA spoofing)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
# Line 22: ignore (do not obey) the robots.txt protocol
ROBOTSTXT_OBEY = False
# Only print ERROR-level log messages
LOG_LEVEL = 'ERROR'

 

scrapy's high-performance persistent storage operations

Persistent storage based on terminal instructions

Based on terminal instructions:

Requirement: only the return value of the parse method can be stored, and only into a local file

Note: the file type used for this kind of persistent storage can only be one of: json, jsonlines, jl, csv, xml, marshal, pickle

Instruction: scrapy crawl xxx -o filePath

Advantages: simple, efficient and convenient

Disadvantages: strong limitations (data can only be stored in files with specified suffixes)

Make sure the parse method of the crawler file returns an iterable object (usually a list of dictionaries); the return value can then be written to a file in a given format via a terminal instruction for persistence.

 

Execution instructions:
Export the output in the specified format; the crawled data is written to files of different formats for storage (a concrete example follows the spider code below):
scrapy crawl spiderName -o xxx.json
scrapy crawl spiderName -o xxx.xml
scrapy crawl spiderName -o xxx.csv

 

import scrapy


class QiubaiproSpider(scrapy.Spider):
    name = 'qiubaiPro'
    allowed_domains = ['www.qiushibaike.com/text/']
    start_urls = ['https://www.qiushibaike.com/text/']

    # Persistent storage based on terminal instructions
    def parse(self, response):
        # Extract the author and content of each post
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        data_list = []
        for div in div_list:
            # xpath returns a list, but its elements are Selector objects
            # extract() pulls out the string stored in the Selector object's data attribute
            user = div.xpath('./div/a[2]/h2/text()')[0].extract()
            # Calling extract() on the list itself extracts the string from every Selector in it
            # user = div.xpath('./div/a[2]/h2/text()').extract_first()
            info = div.xpath('./a/div/span//text()').extract()
            info = ' '.join(info)

            data = {
                'user': user,
                'info': info
            }
            data_list.append(data)
        return data_list
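
With this parse in place, the terminal-instruction-based export can be run with a command along these lines (the output file name here is just an example):

scrapy crawl qiubaiPro -o ./qiubai.csv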

Pipeline-based persistent storage operations

The scrapy framework integrates efficient and convenient persistence functionality that we can use directly. To use it, first get to know the following two files:
items.py: the data structure template file, which defines the data attributes.
pipelines.py: the pipeline file, which receives the data (items) and performs the persistence operations.

Persistence process:

1. Parse the data

2. Define related properties in the item class

3. Encapsulate the parsed data into item-type objects

4. Use the yield keyword to submit the items object to the pipelines for persistence.

5. In the process_item method of the pipeline file, receive the item object submitted by the crawler file, then write the persistence code that stores the data held in the item object

6. Enable the pipeline in the settings.py configuration file

 

Code demo:

qiubaiPro.py

import scrapy
from qiubai.items import QiubaiItem


class QiubaiproSpider(scrapy.Spider):
    name = 'qiubaiPro'
    allowed_domains = ['www.qiushibaike.com/text/']
    start_urls = ['https://www.qiushibaike.com/text/']

    # (The terminal-instruction-based parse shown earlier is omitted here.)

    # Persistent storage based on pipelines
    def parse(self, response):
        # Extract the author and content of each post
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
            # xpath returns a list, but its elements are Selector objects
            # extract() pulls out the string stored in the Selector object's data attribute
            user = div.xpath('./div/a[2]/h2/text()')[0].extract()
            # Calling extract() on the list itself extracts the string from every Selector in it
            # user = div.xpath('./div/a[2]/h2/text()').extract_first()
            info = div.xpath('./a/div/span//text()').extract()
            info = ' '.join(info)

            # Encapsulate the parsed data into an item object and submit it to the pipelines
            item = QiubaiItem()
            item['user'] = user
            item['info'] = info
            yield item

items.py

import scrapy


class QiubaiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    user = scrapy.Field()
    info = scrapy.Field()

pipelines.py

import pymysql


class QiubaiPipeline(object):
    fp = None

    # Override the parent-class method: called only once, when the crawler starts
    def open_spider(self, spider):
        print('crawler started...')
        self.fp = open('./qiubai/file/qiubai.txt', 'w', encoding='utf-8')

    # Dedicated to handling item-type objects
    # This method receives the item objects submitted by the crawler file
    # and is called once for every item object it receives
    def process_item(self, item, spider):
        user = item['user']
        info = item['info']
        self.fp.write(user + info + '\n')
        return item

    # Override the parent-class method: called only once, when the crawler ends
    def close_spider(self, spider):
        print('crawler finished...')
        self.fp.close()


# Each pipeline class in the pipeline file stores the data to one platform or carrier
class QiubaiPipeline_DB(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('writing to the database...')
        self.conn = pymysql.Connect(host='192.168.214.23', port=3306, user='root', password='123456', db='srcapy',
                                    charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into src_qiushi values(%s, %s)', (item['user'], item['info']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('database write finished')
        self.cursor.close()
        self.conn.close()
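
The DB pipeline above assumes that the srcapy database already contains a src_qiushi table with two string columns. A minimal sketch of creating that table with pymysql might look like this; the column names and types are assumptions inferred from the INSERT statement, not given in the original:

import pymysql

# Assumed schema for src_qiushi; adjust the column types to the real data as needed.
conn = pymysql.Connect(host='192.168.214.23', port=3306, user='root',
                       password='123456', db='srcapy', charset='utf8')
cursor = conn.cursor()
cursor.execute('create table if not exists src_qiushi (`user` varchar(255), info text)')
conn.commit()
cursor.close()
conn.close()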

settings.py

BOT_NAME = 'qiubai'

SPIDER_MODULES = ['qiubai.spiders']
NEWSPIDER_MODULE = 'qiubai.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 OPR/67.0.3575.115 (Edition B2)'

ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

ITEM_PIPELINES = {
   'qiubai.pipelines.QiubaiPipeline': 300,
   'qiubai.pipelines.QiubaiPipeline_DB': 200,
}
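
Note on the pipeline priorities: the integer values in ITEM_PIPELINES determine the order in which the pipelines run, with lower numbers running first. In this configuration QiubaiPipeline_DB (200) receives each item before QiubaiPipeline (300), and the return item at the end of process_item is what passes the item on to the next pipeline.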

 


Origin www.cnblogs.com/ysging/p/12709993.html