A simple crawler based on the Scrapy framework

  The complete code for this project is linked at the end of the article

Table of contents

1. Environment installation

2. Create a scrapy project in cmd

3. Create the main crawler file under the spiders package in cmd

4. Writing the Scrapy files

4.1 Use XPath to locate the content to crawl

4.2 Write test.py, the main crawler file, under spiders

4.3 Write items.py

4.4 Write settings.py

4.5 Write pipelines.py to save the crawled content

5. Run the crawler in cmd


1. Environment installation

    Activate your own virtual environment (activate <environment name>), then install Scrapy from the Tsinghua mirror:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy
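
    If the installation succeeds, the scrapy command becomes available in the environment; a quick sanity check is to print its version (the exact output depends on the release you installed):

scrapy version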

2. Create a scrapy project in cmd

scrapy startproject <project name>

     Choose the project name yourself; a single word works best

For example, if I am in a folder on the F drive of my computer, I can type cmd into the folder's address bar and press Enter to open a command prompt already at that path

 1. Enter activate pytorch (pytorch is the name of my virtual environment)

 2. Enter scrapy startproject CSDN (CSDN is the name of the crawler project; pick a name that suits your own situation)

 3. A CSDN folder now appears under this path; open it with PyCharm

 You can see the file structure under the CSDN folder

spiders package: the directory where the crawler code lives; the main crawler file will be created in this package later

items.py: defines the item objects (the data fields) for the project

middlewares.py: the project's middleware file, containing hooks that process requests and responses on their way in and out

pipelines.py: the project's pipeline file, which post-processes and saves scraped items

settings.py: the project's settings file
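
For reference, a freshly generated project typically has this layout (shown for a project named CSDN; Scrapy versions may differ in minor files):

CSDN/
    scrapy.cfg            # project configuration entry point
    CSDN/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py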

3. Create the main crawler file under the spiders package in cmd

   Make sure to cd into the spiders directory first, then enter the following:

scrapy genspider test https://www.csdn.net

    This creates a crawler file named test (choose the name to your liking) and restricts crawling to the https://www.csdn.net domain. The https:// prefix on the final domain argument can be written or omitted. (Again, remember to cd into the spiders directory first.)

      You can see that a new test.py file has appeared in the spiders folder.
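
      The freshly generated test.py is only a skeleton; it looks roughly like this (the exact boilerplate varies slightly between Scrapy versions):

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        pass

      Sections 4.2 to 4.5 fill this skeleton in.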

4. Writing the Scrapy files

4.1 Use XPath to locate the content to crawl

  For example, suppose I want to crawl the article titles on the CSDN homepage. Open the relevant CSDN page.

   Select one of the titles, right-click, and choose "Inspect".

    From the source shown in the developer tools you can see roughly where each title sits in the page's HTML; we need an expression that locates every title.

The XPath Helper plug-in for the Google Chrome browser is useful here: it lets you test an XPath expression and see what it matches. You can search for and install it yourself.

     Next, write the XPath by hand (the plug-in can locate elements quickly, but I still recommend handwriting it, which deepens your understanding of XPath), then verify the handwritten expression in the plug-in. This article does not cover how to write XPath from scratch; I suggest finding a video tutorial on XPath.

    You can see that the expression matches every title on the home page.
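
    If you prefer to verify without the browser plug-in, Scrapy's interactive shell works too (the selector below is the one this project uses; if CSDN has changed its markup, or blocks the default user agent, the result may be empty):

scrapy shell https://www.csdn.net
>>> response.xpath("//span[@class='blog-text']/text()").extract()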

4.2 Write test.py, the main crawler file, under spiders

# -*- coding: utf-8 -*-
import scrapy
from CSDN.items import CsdnItem


class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        # every title on the home page sits in a <span class="blog-text">
        node_list = response.xpath("//span[@class='blog-text']")
        for node in node_list:
            item = CsdnItem()
            # extract() returns a list of matched strings
            title = node.xpath("./text()").extract()
            item['title'] = title[0]
            yield item  # hand the item to the pipeline
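
A small robustness note: extract() returns a list, so title[0] raises an IndexError if a node has no text. In recent Scrapy versions, get() is a safer accessor that returns None instead; a minimal variant of the loop body:

title = node.xpath("./text()").get()  # None if the node has no text
if title:
    item['title'] = title
    yield item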

4.3 Write items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CsdnItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # the article title
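
A CsdnItem behaves like a dict with a fixed set of allowed keys (assigning a key not declared as a Field raises KeyError), which is exactly what the pipeline below relies on when it calls dict(item). A quick illustration:

item = CsdnItem()
item['title'] = 'hello'
print(dict(item))  # {'title': 'hello'}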

4.4 Write settings.py

  Only a few parts of this file need to be modified:

1. First, since this is just for learning, we do not need to obey the robots.txt protocol, so find ROBOTSTXT_OBEY (around line 20) and change it to:

ROBOTSTXT_OBEY = False

2. Disguise the crawler as a browser by adding a User-Agent (insert the following line on any blank line, for example just above the ROBOTSTXT_OBEY line)

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'  # disguise the crawler as a browser

3. Uncomment the ITEM_PIPELINES setting (around lines 65-67), i.e. the following three lines:

ITEM_PIPELINES = {
   'CSDN.pipelines.CsdnPipeline': 300,
}
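
Taken together, the modified parts of settings.py look roughly like this (line positions are approximate and differ between Scrapy versions):

# settings.py -- only the modified parts shown
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'  # disguise the crawler as a browser

ROBOTSTXT_OBEY = False  # do not obey robots.txt (for learning only)

ITEM_PIPELINES = {
    'CSDN.pipelines.CsdnPipeline': 300,  # lower numbers run earlier
}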

4.5 Write pipelines.py to save the crawled content

# -*- coding: utf-8 -*-
# useful for handling different item types with a single interface
import json


class CsdnPipeline:
    def __init__(self):
        # open the output file once when the pipeline is created ("w" truncates any old file)
        self.f = open("csdn.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each item as one JSON object per line
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.f.close()
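
As an aside: for a simple dump like this, Scrapy's built-in feed export can replace the custom pipeline entirely. Running the crawler with -o writes items straight to a file whose format is inferred from the extension (this bypasses CsdnPipeline; keep the pipeline if you want the one-JSON-object-per-line layout above):

scrapy crawl test -o csdn.json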

   Any file not mentioned above can be left at its defaults.

5. Run the crawler in cmd

       Run in the root directory of the project:

scrapy crawl test

      You can also create a run.py in PyCharm so that you don't need to run the crawler from cmd; its content is as follows:

# -*- coding: utf-8 -*-
from scrapy import cmdline

# equivalent to typing "scrapy crawl test" on the command line
cmdline.execute('scrapy crawl test'.split())

      Remember to change test to the name you gave your own spider.

      run.py should sit in the project's root directory, alongside scrapy.cfg.

      After it runs, an extra csdn.json file appears in the project directory; opening it shows the scraped titles, one JSON object per line.

    This article is a most-basic crawler tutorial, and some details may not be covered; please forgive me. Lately I have had very little free time, so some readers' questions may not get answered promptly. Please bear with me!

   Code address:

Link: https://pan.baidu.com/s/1KgVpMFx8JGo5dOD0X6s6Zw?pwd=5555&_at_=1656090501842 
Extraction code: 5555
