Some pitfalls encountered when using Scrapy: crawling a web page returns a 403 error

Here are some of the pitfalls I ran into today while learning to crawl the web with Scrapy.

Normal: DEBUG: Crawled (200) <GET http://www.techbrood.com/> (referer: None)

Error: DEBUG: Crawled (403) <GET http://www.techbrood.com/> (referer: None)

1. URL error

At first I read the Scrapy documentation, then wrote the following code based on it:

import scrapy

class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # save the raw page under the last directory name in the URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Running it returned a 403 error. I searched for the cause for a long time, since nothing else seemed wrong. By process of elimination I finally focused on the URLs, and entering them in a browser confirmed it: both URLs were indeed dead.


Solution: first check that your URLs are correct; this is a crucial step. After I changed the URL, the page crawled successfully.

# -*- coding:utf-8 -*-
import scrapy

class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"
    # keep allowed_domains consistent with the site actually being crawled
    allowed_domains = ["blog.csdn.net"]
    start_urls = [
        "https://blog.csdn.net/weixin_41931602/article/details/80199750",
    ]

    def parse(self, response):
        # response.body is raw bytes, so open the file in binary mode
        with open("TY.html", "wb") as f:
            f.write(response.body)
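Incidentally, when debugging a 403 it can help to let the error response reach parse() instead of having it dropped by Scrapy's HttpErrorMiddleware. A minimal sketch, assuming a placeholder spider name (handle_httpstatus_list is a standard Spider attribute):

import scrapy

class StatusSpider(scrapy.Spider):
    # hypothetical spider for inspecting error responses
    name = "status_demo"
    start_urls = ["http://www.techbrood.com/"]
    handle_httpstatus_list = [403]  # let 403 responses reach parse() instead of being filtered

    def parse(self, response):
        # log the status code so a 403 is visible immediately
        self.logger.info("Crawled %s with status %s", response.url, response.status)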

2. Indentation issues

While following the public Scrapy video course from the "dark horse programmer" (黑马程序员) training program, I also typed the code incorrectly, and the HTML source of the crawled page never appeared.

#scrapy genspider itcast "itcast.cn"
# -*- coding: utf-8 -*-
# (Python 2 code, as written in the tutorial)
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml#ajavaee",)

# BUG: parse is defined at module level, outside the class; be sure to indent it
def parse(self, response):
    with open("teacher.html", "w") as f:
        f.write(response.text)
    print response.text

Reading it over and over, it didn't seem to have a problem, and the URL was correct this time! After checking for a long time, I found that I had actually written the parse function outside the class. No wonder no page content was ever written: the running code never called that function.

By the way, parse(self, response) is the parsing method called after each initial URL has been downloaded; the Response object returned for that URL is passed in as its only argument. Its main jobs are the following (a minimal sketch follows the list):

  1. Parse the returned page data (response.body) and extract structured data (generate items).
  2. Generate Requests for the next URLs to crawl.
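To make those two jobs concrete, here is a minimal sketch in the style of the official Scrapy tutorial, run against the public practice site quotes.toscrape.com; the spider name and CSS selectors are my own illustration, not from the original post:

import scrapy

class QuotesSpider(scrapy.Spider):
    # hypothetical spider showing the two jobs of parse()
    name = "quotes_demo"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # job 1: extract structured data from the downloaded page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # job 2: generate the request for the next page, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)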

Solution: be sure to indent the parse function so it is a method of the spider class.
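For reference, here is the same spider as above with parse correctly indented as a method of the class (still the tutorial's Python 2 code):

# -*- coding: utf-8 -*-
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml#ajavaee",)

    # parse is now inside the class, so Scrapy calls it for each downloaded page
    def parse(self, response):
        with open("teacher.html", "w") as f:
            f.write(response.text)
        print response.text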

3. The website has anti-crawler measures. I haven't run into this yet, but when using a crawler to collect data you should consider in advance whether the target site has anti-crawler protection; your code may need to use a proxy or otherwise disguise its requests.

Solution: set a User-Agent in the request headers.
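As a minimal sketch of that solution (the spider name and User-Agent string here are placeholders, not from the original post), Scrapy's standard USER_AGENT setting can be overridden per spider via custom_settings so requests no longer announce themselves as a bot:

import scrapy

class UaSpider(scrapy.Spider):
    # hypothetical spider; only the User-Agent handling matters here
    name = "ua_demo"
    start_urls = ["http://www.techbrood.com/"]

    # replace Scrapy's default "Scrapy/x.y (+https://scrapy.org)" User-Agent,
    # which many sites recognize and block with a 403
    custom_settings = {
        "USER_AGENT": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/90.0 Safari/537.36"),
    }

    def parse(self, response):
        self.logger.info("Crawled %s with status %s", response.url, response.status)

The same USER_AGENT key can also be set project-wide in settings.py.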

(The rest will be added later)
