Scrapy crawler tutorial

Table of contents

1. Introduction

1.1. What is Scrapy

1.2. Structured data

1.3. Installation

2. Using Scrapy

2.1. Create a scrapy project

2.2. Create a crawler file

2.3. Run the crawler code

2.4. Hands-on practice

2.4.1. Scrapy project structure

2.4.2. Attributes and methods of response

2.4.3. Components of the Scrapy architecture

2.4.4. How Scrapy works

3. Scrapy shell

3.1. What is the Scrapy shell

3.2. Installation

3.3. Usage

3.3.1. Entering the Scrapy shell

3.3.2. Syntax

4. CrawlSpider

4.1. Introduction

4.2. Hands-on practice

5. Data storage

6. Log information and log level

7. Scrapy POST requests

8. Proxies


1. Introduction

1.1. What is Scrapy

        Scrapy is an application framework written to crawl websites and extract structured data. It can be used for a wide range of tasks, including data mining, information processing, and archiving historical data.

1.2. Structured data

        Data in which every record shares the same structure is called structured data: for example, a page of job listings where each entry has the same fields (title, company, salary).

1.3. Installation
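
        Scrapy can be installed with pip from the terminal:

pip install scrapy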

2. Using Scrapy

2.1. Create a scrapy project

        1) Enter in the terminal: scrapy startproject <project name>

              Note : The project name cannot start with a digit and cannot contain Chinese characters

        2) A new Scrapy project will then appear in the current directory
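
        For example, with a hypothetical project name scrapy_demo:

scrapy startproject scrapy_demo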

2.2. Create a crawler file

        1) Enter the spiders folder and create a crawler file: scrapy genspider <crawler name> <domain>

scrapy genspider baidu www.baidu.com

               Note : 1) The crawler file must be created inside the spiders folder

                          2) The domain does not need the http:// prefix; Scrapy adds it automatically

        2) Created successfully: a baidu.py file now appears in the spiders folder

         3) Contents of the baidu.py file:

                Note : If the requested URL ends with .html , the trailing " / " must be removed from start_urls

import scrapy

class BaiduSpider(scrapy.Spider):

    # Spider name: used when running the crawler (scrapy crawl baidu)
    name = 'baidu'

    # Domains the spider is allowed to visit
    allowed_domains = ['www.baidu.com']

    # Starting URL, i.e. the first address requested:
    # start_urls = 'http://' + allowed_domains + '/'
    start_urls = ['http://www.baidu.com/']

    # response is the object returned after the page has been fetched,
    # analogous to: response = urllib.request.urlopen(request)
    def parse(self, response):
        pass

2.3. Run the crawler code

        1) Comment out ROBOTSTXT_OBEY in the settings.py file so the robots.txt rules are not enforced

        2) Run: scrapy crawl <crawler name>
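
        For the baidu spider created above:

scrapy crawl baidu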

2.4. Hands-on practice

2.4.1. Scrapy project structure

The structure of a Scrapy project:
    --project name
      --project name
        --spiders folder (stores the crawler files)
            --__init__.py
            --tc.py (custom crawler file; the core crawling logic lives here)
        --__init__.py
        --items.py (defines the data structure, i.e. which fields the crawled data contains)
        --middlewares.py (middleware, e.g. for proxies)
        --pipelines.py (pipelines that process the downloaded data)
        --settings.py (configuration file: robots protocol, UA definition, etc.)

2.4.2. Attributes and methods of response

Method                      Effect
response.text               Get the response body as a string
response.body               Get the response body as binary data
response.xpath()            Run an XPath query directly on the response content
response.extract()          Extract the data from a selector object
response.extract_first()    Extract the first item from a selector list
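
A minimal sketch showing these response methods inside a spider's parse method; the XPath expression and printed fields are illustrative assumptions, not from the original post:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        print(response.url)       # the URL that was fetched
        print(response.status)    # HTTP status code
        # An XPath query returns a selector list
        links = response.xpath('//a/text()')
        # First matched string, or None if nothing matched
        print(links.extract_first())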

2.4.3. Components of the Scrapy architecture
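
Scrapy has five core components: the engine (coordinates the data flow between all the other parts), the scheduler (queues and deduplicates requests), the downloader (fetches the pages), the spiders (parse responses and yield items or new requests), and the item pipeline (post-processes and stores items). Downloader middlewares and spider middlewares sit between these components and can modify requests and responses as they pass through.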

2.4.4. How Scrapy works
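
In outline: the spider hands its start URLs to the engine; the engine passes the requests to the scheduler, which queues them and returns them one at a time; each request then travels through the downloader middlewares to the downloader, which fetches the page and returns a response; the engine routes the response to the spider's parse method; whatever parse yields is either an item (forwarded to the pipeline) or a new request (sent back to the scheduler), and the cycle repeats until no requests remain.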

3. Scrapy shell

3.1. What is the Scrapy shell
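
The Scrapy shell is an interactive terminal that lets you try crawling and parsing code against a live page without running the full spider, which makes it convenient for testing and debugging XPath and CSS expressions.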

3.2. Installation
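
The shell ships with Scrapy itself, so no separate installation is required; if IPython is installed (pip install ipython), Scrapy uses it automatically and the shell gains a more powerful interactive interface.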

3.3. Usage

3.3.1. Entering the Scrapy shell

(1) scrapy shell www.baidu.com

(2) scrapy shell http://www.baidu.com

(3) scrapy shell "http://www.baidu.com"

(4) scrapy shell "www.baidu.com"

3.3.2. Syntax

Response object:

        response.body

        response.text

        response.url

        response.status

Response parsing:

        response.xpath()

        response.extract_first()

        response.css()
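
A sketch of a shell session combining the above; the XPath expression is an illustrative assumption:

scrapy shell "http://www.baidu.com"

>>> response.status
>>> response.url
>>> response.xpath('//title/text()').extract_first()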

4. CrawlSpider

4.1. Introduction
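
CrawlSpider is a subclass of scrapy.Spider that adds a rules mechanism: each Rule pairs a LinkExtractor, which matches links on the crawled pages, with a callback, so the spider can follow links such as pagination automatically instead of yielding every request by hand.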

4.2. Hands-on practice
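
A minimal CrawlSpider sketch; the domain, the link pattern, and the yielded field are illustrative assumptions rather than the original post's target site:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DemoCrawlSpider(CrawlSpider):
    name = 'demo_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Follow every link whose URL matches the regex, parse each matched
    # page with parse_item, and keep following links from those pages
    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Yield one item per crawled page
        yield {'url': response.url}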

5. Data storage
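
Scrapy's built-in feed exports can write the yielded items straight to a file; the output file name here is an arbitrary choice:

scrapy crawl baidu -o result.json

For custom storage (for example, a database), implement process_item in pipelines.py and enable the pipeline through ITEM_PIPELINES in settings.py.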

6. Log information and log level
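
Scrapy's log levels, from most to least severe, are CRITICAL, ERROR, WARNING, INFO, and DEBUG (the default). In settings.py, LOG_LEVEL raises the threshold and LOG_FILE redirects the log to a file, for example:

LOG_LEVEL = 'WARNING'
LOG_FILE = 'logdemo.log'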

7. Scrapy POST requests
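
A minimal sketch of sending a POST request with scrapy.FormRequest; the URL and form data are illustrative assumptions:

import scrapy

class PostSpider(scrapy.Spider):
    name = 'post_demo'

    # start_urls always issues GET requests, so override start_requests
    # and yield a FormRequest to send the POST data instead
    def start_requests(self):
        url = 'https://httpbin.org/post'
        data = {'kw': 'spider'}
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)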

8. Proxies
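
A proxy is usually set in a downloader middleware by writing to request.meta; the middleware name and proxy address below are placeholder assumptions:

# In middlewares.py
class ProxyMiddleware:
    def process_request(self, request, spider):
        # Route every outgoing request through the proxy server
        request.meta['proxy'] = 'http://127.0.0.1:7890'

# Enable it in settings.py (project path and priority are placeholders):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.ProxyMiddleware': 543,
# }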

 
