Scrapy crawler operation for beginners: a super-detailed case to get you started

This article walks through a Scrapy crawler case from scratch, saving the results as a local JSON file, and introduces the role of each generated file along the way. It is suitable for beginners. Complete, runnable code is linked at the end of the article.

1. The website to crawl

We intend to crawl http://www.itcast.cn/channel/teacher.shtml and extract the name, title, and profile of every teacher on the page.
The page structure can be inspected by right-clicking "Inspect" or pressing F12 (Fn+F12 on some laptops) in the browser to open the debugging panel. The Chrome extension XPath Helper is also recommended; it makes testing XPath expressions much easier.
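You can also try XPath expressions from the command line with Scrapy's interactive shell. A minimal sketch, assuming the page is reachable and has the structure used later in this article:

scrapy shell "http://www.itcast.cn/channel/teacher.shtml"
>>> response.xpath("//div[@class='li_txt']/h3/text()").extract_first()
>>> exit()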

2. The detailed steps of crawling

1. Create a crawler project

The command to create a crawler project is as follows:
scrapy startproject <project name>
Here our command is: scrapy startproject ITcast
The newly created project folder appears on the desktop. Its initial contents are as follows:
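For current Scrapy versions, scrapy startproject ITcast generates this layout:

ITcast/
    scrapy.cfg
    ITcast/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py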
Here is the role of each file:

scrapy.cfg: the project's configuration file
spiders/: the folder where we write our spider files
__init__.py: usually an empty file, but it must exist; a directory without __init__.py is just a directory, not a Python package
items.py: the project's item definitions, which declare the structured fields that hold the crawled data
middlewares.py: the project's middleware file
pipelines.py: the project's pipeline file
settings.py: the project's settings file

2. Create a crawler file

Enter the project we just created (cd ITcast), then create a spider file. The command is as follows:
scrapy genspider <spider name> <allowed domain>
The generated file is where we will write our crawler code.
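For this project, a command consistent with the spider code in step 5 would be (the spider name itcast appears there; using itcast.cn as the allowed domain is an assumption based on that code):

scrapy genspider itcast itcast.cn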
At this point, the newly created file itcast.py appears in the spiders folder.

3. Write items.py

This file defines the specific content we want to crawl; it is roughly equivalent to the fields of a database table, or a POJO class in Java.

import scrapy

class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # teacher's name
    name = scrapy.Field()
    # teacher's title
    title = scrapy.Field()
    # teacher's profile
    info = scrapy.Field()
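An Item behaves much like a Python dict. A quick illustrative check in a Python shell (the value is hypothetical):

item = ItcastItem()
item['name'] = '某老师'   # assign a hypothetical value to a declared field
print(item)               # prints dict-style: {'name': '某老师'}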

4. Configure settings.py

This is the project's configuration file. Modify it as follows:
First, since this crawl is only for learning purposes, we choose not to comply with the robots.txt protocol; find ROBOTSTXT_OBEY and change it:

ROBOTSTXT_OBEY = False

Second, uncomment the ITEM_PIPELINES setting to get:

ITEM_PIPELINES = {
    'ITcast.pipelines.ItcastPipeline': 300,
}

With these two changes, setting the flag to False and uncommenting the pipeline, our configuration is complete. The value 300 is the pipeline's priority: when several pipelines are enabled, lower numbers run first, and values are conventionally kept in the 0-1000 range.
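Note that we never edit pipelines.py in this tutorial: the results are saved later with the -o flag, so the default pass-through pipeline that Scrapy generates is enough. For reference, it looks roughly like this:

class ItcastPipeline(object):
    def process_item(self, item, spider):
        # pass each item through unchanged; the -o feed export handles saving
        return item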

5. Write itcast.py

import scrapy
from ITcast.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    # spider name: the parameter used to launch the crawl  *required
    name = 'itcast'
    # optional: the domains the spider is allowed to crawl
    # (note: domains only, not full URLs)
    allowed_domains = ['itcast.cn']
    # initial URL list; the spider's first batch of requests comes from here
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']
    
    # parse the response; //div[@class='li_txt'] is XPath syntax, worth learning
    def parse(self, response):
        node_list = response.xpath("//div[@class='li_txt']")
        items = []  # holds all the item objects
        for node in node_list:
            # create an item object to store the information
            item = ItcastItem()
            # note: xpath() returns selector objects, not text;
            # .extract() converts them into a list of Unicode strings
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()
            
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            items.append(item)
        
        return items  # hand the items back to the engine
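Building up a list and returning it works, but the more idiomatic Scrapy style is to yield each item as soon as it is ready, so the engine can process results incrementally. A sketch of the same parse method in that style:

    def parse(self, response):
        for node in response.xpath("//div[@class='li_txt']"):
            item = ItcastItem()
            # extract_first() returns the first matching string (or None)
            item['name'] = node.xpath("./h3/text()").extract_first()
            item['title'] = node.xpath("./h4/text()").extract_first()
            item['info'] = node.xpath("./p/text()").extract_first()
            yield item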

6. Start crawling and save as JSON

Next, start crawling and export the information. The command is as follows:
scrapy crawl <spider name> -o <file name>.json
Here my command is: scrapy crawl itcast -o itcast.json
(Note that scrapy crawl takes the spider name from itcast.py, not the project name.)
The result can also be stored in CSV format: scrapy crawl <spider name> -o <file name>.csv
You can see that the crawled information is exported in JSON format, and a local file itcast.json appears under the spiders folder.
Opening itcast.json, you can see that the Chinese text is stored as Unicode escape sequences (\uXXXX):
To read it, paste the contents into an online JSON parser such as https://www.json.cn/:
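Alternatively, Scrapy (version 1.2 and later) can write readable UTF-8 directly instead of \uXXXX escapes by adding one line to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'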
At this point, the entire Scrapy crawl is done!

The complete source code is available at:
https://github.com/zmk-c/scrapy/tree/master/scrapy_itcast
As a bonus, here is a video tutorial for getting started with Scrapy: https://www.bilibili.com/video/BV1jx411b7E3


Original article: blog.csdn.net/qq_40169189/article/details/107580965