[python crawler notes] scrapy

Table of contents

create new project

how to use scrapy

A complete case application of scrapy

Introduction to css selectors

The basic syntax of css selectors

How to use css in scrapy



create new project

Install scrapy

 pip install scrapy
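Once it is installed, you can verify that the command-line tool is available:

scrapy version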

This command first creates a folder named after the project, and then creates a scrapy project inside that folder. This step is the starting point of all subsequent code.

scrapy startproject <project name>

create new project

scrapy startproject my_scrapy

Create the first scrapy crawler file, pm:

scrapy genspider pm imspm.com

To run project commands, you must first enter the my_scrapy folder; the project can only be controlled from inside the project directory.

 cd my_scrapy

At this point a file named pm.py appears in the spiders folder, with the following content:

import scrapy


class PmSpider(scrapy.Spider):
    name = 'pm'
    allowed_domains = ['imspm.com']
    start_urls = ['http://imspm.com/']

    def parse(self, response):
        pass

Test that the scrapy crawler runs
Use the command scrapy crawl <spider>, where spider is the name of the crawler file generated above. If content like the following appears, the crawler has loaded correctly.
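For the pm spider generated above, that is:

scrapy crawl pm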

2022-11-12 15:27:02 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: my_scrapy)

how to use scrapy

The scrapy workflow is very simple:

  1. Collect the source code of the first page;
  2. Parse the source code of the first page and get the link to the next page;
  3. Request the source code of the next page;
  4. Parse that source code and get the link to the page after it;
  5. […]
  6. During the process, after the target data is extracted, it is saved.
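In code, that loop typically takes the shape below (a minimal sketch with placeholder selectors and URLs; the complete case that follows fills in the real ones):

import scrapy


class LoopSpider(scrapy.Spider):
    name = 'loop'
    start_urls = ['http://example.com/']  # placeholder start page

    def parse(self, response):
        # steps 1 and 2: parse the current page and extract the target data
        for row in response.css('.item'):  # placeholder selector
            yield {'title': row.css('::text').get()}
        # steps 3 and 4: request the next page and parse it with the same callback
        next_url = response.css('.next::attr(href)').get()  # placeholder selector
        if next_url:
            yield scrapy.Request(url=next_url, callback=self.parse)  # assumes an absolute URL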

A complete scrapy case application

> scrapy startproject my_project 爬虫
> cd 爬虫
> scrapy genspider pm imspm.com

Get the project structure as follows:


  • scrapy.cfg: configuration file path and deployment configuration;
  • items.py: the structure of the target data;
  • middlewares.py: middleware file;
  • pipelines.py: pipeline file;
  • settings.py: configuration information.

When the crawler runs, 7 requests are issued, because www is not included in pm.py by default; after adding it, the number of requests drops to 4.

The pm.py file now looks like this:

import scrapy


class PmSpider(scrapy.Spider):
    name = 'pm'
    allowed_domains = ['www.imspm.com']
    start_urls = ['http://www.imspm.com/']

    def parse(self, response):
        print(response.text)

Here parse is the callback invoked once a response for the address in start_urls has been obtained; it prints the page source directly via the .text attribute of the response parameter.
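Besides response.text, the response object exposes a few other attributes that are often useful at this stage (a small sketch; the printed values depend on the site):

    def parse(self, response):
        print(response.url)      # the URL that was actually fetched
        print(response.status)   # HTTP status code, e.g. 200
        print(response.headers)  # the response headers
        print(response.text[:100])  # first 100 characters of the page source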

After obtaining the source code, it is necessary to parse and store the source code.
Before storing, a data structure needs to be defined manually. This is done in the items.py file; the class name in the generated code is changed from MyProjectItem to ArticleItem.

import scrapy

class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # article title
    url = scrapy.Field()  # article URL
    author = scrapy.Field()  # author
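An ArticleItem behaves much like a dict, which is how the spider will fill it in below; a quick sketch with made-up values:

item = ArticleItem()
item['title'] = 'sample title'  # fields are assigned with dict-style access
item['url'] = 'http://www.imspm.com/sample'  # made-up URL
item['author'] = 'sample author'
print(dict(item))  # an Item converts cleanly to a plain dict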

Modify the parse function in the pm.py file and add webpage-parsing logic. The operations are similar to pyquery, and you can pick them up by reading the code directly.

    def parse(self, response):
        # print(response.text)
        list_item = response.css('.list-item-default')
        # print(list_item)
        for item in list_item:
            title = item.css('.title::text').extract_first()  # get the text directly
            url = item.css('.a_block::attr(href)').extract_first() # get the attribute value
            author = item.css('.author::text').extract_first()  # get the text directly
            print(title, url, author)

The response.css method returns a list of selector objects, which can be iterated over, and the css method can be called on each object in it.

  • item.css('.title::text'): gets the text inside the tag;
  • item.css('.a_block::attr(href)'): gets the tag's attribute value;
  • extract_first(): returns the first match in the list;
  • extract(): returns all matches as a list.
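As a note, in the Scrapy version used here (2.6), get() and getall() are the preferred modern aliases for extract_first() and extract(); both spellings work:

title = item.css('.title::text').get()      # same as extract_first()
titles = item.css('.title::text').getall()  # same as extract()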

In pm.py, import the ArticleItem class from items.py (for example from my_project.items import ArticleItem, using the project name from the step above), and then modify the code as follows:

    def parse(self, response):
        # print(response.text)
        list_item = response.css('.list-item-default')
        # print(list_item)
        for i in list_item:
            item = ArticleItem()
            title = i.css('.title::text').extract_first()  # get the text directly
            url = i.css('.a_block::attr(href)').extract_first()  # get the attribute value
            author = i.css('.author::text').extract_first()  # get the text directly
            # print(title, url, author)
            # assign values to the item
            item['title'] = title
            item['url'] = url
            item['author'] = author
            yield item

Now, when the scrapy crawler runs, the scraped items are printed in the log.


At this point, a single-page crawler is complete.

Next, modify the parse function again so that, after parsing the first page, it goes on to parse the data of the second page.

    def parse(self, response):
        # print(response.text)
        list_item = response.css('.list-item-default')
        # print(list_item)
        for i in list_item:
            item = ArticleItem()
            title = i.css('.title::text').extract_first()  # get the text directly
            url = i.css('.a_block::attr(href)').extract_first()  # get the attribute value
            author = i.css('.author::text').extract_first()  # get the text directly
            # print(title, url, author)
            # assign values to the item
            item['title'] = title
            item['url'] = url
            item['author'] = author
            yield item
        next = response.css('.nav a:nth-last-child(2)::attr(href)').extract_first()  # get the link to the next page
        # print(next)
        # generate another request
        yield scrapy.Request(url=next, callback=self.parse)

In the above code, the variable next (note that it shadows Python's built-in next) holds the address of the next page, obtained through response.css. Please focus on learning the css selector.
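As a side note, response.follow is a convenient alternative to scrapy.Request for this step: it also accepts relative URLs and resolves them against the current page, which scrapy.Request does not. A minimal sketch of the same step:

        next_url = response.css('.nav a:nth-last-child(2)::attr(href)').extract_first()
        if next_url:  # stop when no next-page link is found
            yield response.follow(next_url, callback=self.parse)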

Introduction to css selectors

  • A selector in css is a pattern for picking out the elements that need styling. Whether css exerts one-to-one, one-to-many, or many-to-one control over the elements of an html page, it always does so through css selectors.

The basic syntax of css selectors

  • Class selector: matches on the element's class attribute; for example, .box selects elements with class="box";
  • ID selector: matches on the element's id attribute; for example, #box selects the element with id="box";
  • Element selector: selects document elements directly; for example, p selects all p elements and div selects all div elements;
  • Attribute selector: selects elements that carry a certain attribute; for example, *[title] selects all elements with a title attribute, and a[href] selects all a elements with an href attribute;
  • Descendant selector: selects elements that are descendants of another element; for example, li a selects all a elements under any li;
  • Child element selector: selects elements that are direct children of another element; for example, h1 > strong selects all strong elements whose parent is h1;
  • Adjacent sibling selector: selects an element that immediately follows another element with the same parent; for example, h1 + p selects every p element immediately following an h1 element;
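These rules can be tried directly from Python with scrapy's standalone Selector, outside of a running spider (a minimal sketch on made-up HTML):

from scrapy.selector import Selector

html = '''
<div class="box" title="intro"><h1>Head <strong>word</strong></h1></div>
<ul><li><a href="/a">first</a></li><li><a href="/b">second</a></li></ul>
'''
sel = Selector(text=html)
print(sel.css('.box').get())               # class selector
print(sel.css('*[title]').get())           # attribute selector
print(sel.css('li a::text').getall())      # descendant selector -> ['first', 'second']
print(sel.css('h1 > strong::text').get())  # child selector -> 'word'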

How to use css in scrapy

Take the a element as an example

  • response.css('a'): returns a list of selector objects;
  • response.css('a').extract(): returns the HTML of the a tags as a list of strings;
  • response.css('a::text').extract_first(): returns the text of the first a tag;
  • response.css('a::attr(href)').extract_first(): returns the value of the href attribute of the first a tag;
  • response.css('a[href*=image]::attr(href)').extract(): returns the href values of all a tags whose href contains image;
  • response.css('a[href*=image] img::attr(src)').extract(): returns the src attributes of the img tags inside those a tags.
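A quick way to see these in action is the same standalone Selector (made-up HTML again):

from scrapy.selector import Selector

html = '<a href="/image/1.html"><img src="/image/1_thumb.jpg"></a><a href="/text/2.html">two</a>'
sel = Selector(text=html)
print(sel.css('a::attr(href)').extract())                     # ['/image/1.html', '/text/2.html']
print(sel.css('a[href*=image]::attr(href)').extract_first())  # '/image/1.html'
print(sel.css('a[href*=image] img::attr(src)').extract())     # ['/image/1_thumb.jpg']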

yield scrapy.Request(url=next, callback=self.parse) creates another request, whose callback function is parse itself.
If you want to save the running results, just run the following command.

scrapy crawl pm -o pm.json

If you want each record stored on its own line, use the command scrapy crawl pm -o pm.jl instead.


The generated file also supports the csv, xml, marshal, and pickle formats; you can try them yourself.
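For example, the output format is inferred from the file extension (marshal and pickle produce binary files):

scrapy crawl pm -o pm.csv
scrapy crawl pm -o pm.xml
scrapy crawl pm -o pm.marshal
scrapy crawl pm -o pm.pickle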

Next, let's use the data pipeline.
Open the pipelines.py file, rename the class MyProjectPipeline to TitlePipeline, and then write the following code:

from scrapy.exceptions import DropItem


class TitlePipeline:
    def process_item(self, item, spider):  # strip spaces from the title
        if item["title"]:
            item["title"] = item["title"].strip()
            return item
        else:
            raise DropItem("invalid data")

This code removes the leading and trailing spaces from the title; items without a title are dropped.

After writing it, you need to enable the ITEM_PIPELINES configuration in the settings.py file.

ITEM_PIPELINES = {
   'my_project.pipelines.TitlePipeline': 300,
}

300 is the pipeline's execution priority (lower values run first, in the range 0-1000), and it can be modified as needed. Run the crawler code again, and you will find that the spaces around the titles have been removed.

At this point, a basic crawler of scrapy has been written.



Origin blog.csdn.net/m0_51933492/article/details/127820920