Take "Station B" as a practical case! Teach you to master the necessary framework "Scrapy" for crawlers

1. Introduction

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial use. If you have any questions, please contact us.



For a crawler developer, mastering a crawler framework is an essential skill, so if you are a beginner I would recommend Scrapy.

I won't dwell on what Scrapy is or what it is for (a quick search will turn up plenty of introductions). Time is precious, so let's get straight to the practical part: a real case that walks you through using Scrapy.

Next, we will practice on a real target: Bilibili ("Station B")!

2. Getting started with Scrapy in practice

1. Environment preparation

Install scrapy


pip install scrapy

The scrapy library can be installed directly with the command above.
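
To quickly confirm the installation (a small check that is not part of the original write-up), you can print the installed version from Python:


import scrapy

# if the import succeeds, Scrapy is available in the current environment
print(scrapy.__version__)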

2. Create a Scrapy project


scrapy startproject Bili

The command above creates a crawler project named Bili (here, the project is created on the desktop).

Project structure

Bili
  ├── Bili
  │   ├── __init__.py
  │   ├── items.py
  │   ├── middlewares.py
  │   ├── pipelines.py
  │   ├── __pycache__
  │   ├── settings.py
  │   └── spiders
  │       ├── __init__.py
  │       └── __pycache__
  └── scrapy.cfg

The role of each file

  • scrapy.cfg: The overall configuration file of the project, usually without modification.
  • Bili: The project's Python package; the program imports the project's code from here.
  • Bili/items.py: Defines the Item classes used by the project. An Item class is a DTO (Data Transfer Object) that usually just declares a number of fields; you define it yourself.
  • Bili/pipelines.py: The project's pipeline file, responsible for processing the crawled items; you write this yourself.
  • Bili/settings.py: The project's configuration file, where project-related settings are made.
  • Bili/spiders: This directory holds the project's spiders, which are responsible for crawling the information the project is interested in.

3. Decide what to crawl


https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2

Taking the link above (a Bilibili search page) as an example, we will crawl each video's title and link (url).
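
As a side note (not from the original post), the %E8%AF%BE%E7%A8%8B in the link is just the URL-encoded form of the keyword 课程 ("course"), so you can build such a search URL yourself with the standard library:


from urllib.parse import urlencode

# build the Bilibili search URL for a keyword and page number
params = {"keyword": "课程", "page": 2}
search_url = "https://search.bilibili.com/all?" + urlencode(params)
print(search_url)  # https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2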

4. Define each class in the project

Items class


import scrapy

class BiliItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # video title
    title = scrapy.Field()
    # video link
    url = scrapy.Field()

The fields we crawl are the video's title and link (url), so the item defines the two fields title and url.
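
If it helps, here is a tiny illustration (with made-up sample values) of how such an Item is used; it behaves like a dict that only accepts its declared fields:


from Bili.items import BiliItem

item = BiliItem()
item['title'] = 'some video title'
item['url'] = '//www.bilibili.com/video/xxxx'
# an Item can be converted to a plain dict for printing or serialization
print(dict(item))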

Define spider class

The spider class is where you define the page-parsing rules. A newly created Scrapy project does not contain a spider, so you need to create it yourself.

Scrapy provides the scrapy genspider command for creating spiders. The syntax of the command is as follows:

scrapy genspider [options] <name> <domain>

Enter the Bili directory in the command line window, and then execute the following command to create a Spider:


scrapy genspider lyc "bilibili.com"

 

After running the command above, you will find a lyc.py file in the Bili/spiders directory of the Bili project.
Edit lyc.py


import scrapy
from Bili.items import BiliItem

class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://search.bilibili.com/all?keyword=课程&page=2']

    # parsing method called for each downloaded response
    def parse(self, response):
        # match each video entry in the search result list
        for jobs_primary in response.xpath('//*[@id="all-list"]/div[1]/ul/li'):
            # create a fresh item per entry so earlier results are not overwritten
            item = BiliItem()
            item['title'] = jobs_primary.xpath('./a/@title').extract()
            item['url'] = jobs_primary.xpath('./a/@href').extract()
            # yield (not return) so the loop keeps producing items
            yield item
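
If the XPath does not match for you (Bilibili's page layout can change over time), it can help to test selectors interactively with Scrapy's shell before editing the spider:


scrapy shell "https://search.bilibili.com/all?keyword=课程&page=2"

At the shell prompt, you can then try expressions such as response.xpath('//*[@id="all-list"]/div[1]/ul/li/a/@title').extract() and see what they return.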

Modify the pipeline class

This class performs the final processing of the crawled items, and is generally responsible for writing the data to a file or database.
Here we simply print it to the console.


from itemadapter import ItemAdapter

class BiliPipeline:
    def process_item(self, item, spider):
        print("title:", item['title'])
        print("url:", item['url'])
        # return the item so any later pipelines can keep processing it
        return item

Modify settings.py


BOT_NAME = 'Bili'

SPIDER_MODULES = ['Bili.spiders']
NEWSPIDER_MODULE = 'Bili.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Bili (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# configure the default request headers
DEFAULT_REQUEST_HEADERS = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Bili.pipelines.BiliPipeline': 300,
}
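
The original post stops at the settings above. If you want to crawl more gently, Scrapy also provides throttling settings you could add to settings.py, for example:


# optional politeness settings (not part of the original tutorial)
# wait between requests to the same site
DOWNLOAD_DELAY = 1
# limit parallel requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# let Scrapy adjust the delay automatically based on server load
AUTOTHROTTLE_ENABLED = True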

The basic skeleton of our Scrapy project is now complete; let's run it and see.

Run the project


scrapy crawl lyc
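
If you also want to save the scraped items to a file, Scrapy's built-in feed export can do that straight from the command line, for example:


scrapy crawl lyc -o result.json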

 

But this only crawls a single page of content; we can also follow the next page.
Update lyc.py with the following code:


import scrapy
from Bili.items import BiliItem

class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://search.bilibili.com/all?keyword=课程&page=2']

    # parsing method called for each downloaded response
    def parse(self, response):
        # match each video entry in the search result list
        for jobs_primary in response.xpath('//*[@id="all-list"]/div[1]/ul/li'):
            # create a fresh item per entry so earlier results are not overwritten
            item = BiliItem()
            item['title'] = jobs_primary.xpath('./a/@title').extract()
            item['url'] = jobs_primary.xpath('./a/@href').extract()
            # yield (not return) so the loop keeps producing items
            yield item

        # URL of the page that was just crawled
        url = response.request.url
        # page + 1: bump the trailing page number by rewriting the last character
        new_link = url[0:-1]+str(int(url[-1])+1)
        # send another request to fetch the next page, parsed by this same method
        yield scrapy.Request(new_link, callback=self.parse)
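
Note that the last-digit trick above only works while the page number stays small (for example page=19 would become page=110). A more robust sketch of just the pagination step, using only the standard library, could be:


from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def next_page_url(url):
    # parse the current URL, increment its "page" query parameter, and rebuild it
    parts = urlparse(url)
    query = parse_qs(parts.query)
    page = int(query.get('page', ['1'])[0])
    query['page'] = [str(page + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# inside parse() you would then yield:
# yield scrapy.Request(next_page_url(response.request.url), callback=self.parse)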

When you run the spider again, it will now crawl page after page.
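
Because the spider now keeps requesting the next page, you may want to cap a test run. Scrapy's built-in close-spider extension can stop the crawl after a number of pages, for example:


scrapy crawl lyc -s CLOSESPIDER_PAGECOUNT=5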

3. Summary

1. Using the real case of Bilibili ("Station B"), we created a Scrapy project by hand, parsed the web pages, and finally crawled the data and printed (or saved) it.
2. This is suitable for beginners getting started with Scrapy; feel free to bookmark it, work through it, and learn.
