Some pitfalls I ran into today while learning to crawl the web with Scrapy
Normal: DEBUG: Crawled (200) <GET http://www.techbrood.com/> (referer: None)
Error: DEBUG: Crawled (403) <GET http://www.techbrood.com/> (referer: None)
1. URL error
At first I read the Scrapy documentation and wrote the following code based on it:
import scrapy


class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Then I got a 403 error back. I searched for the cause for a long time, since everything else looked fine. By process of elimination I finally focused on the URLs themselves, and pasting them into a browser confirmed it: both URLs were indeed wrong.
Solution: first check that your URLs are correct; this is a critical step. The page came through after I changed them:
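Before pointing Scrapy at a list of URLs, it can help to weed out the obviously malformed ones. This is a minimal sketch using only the standard library's `urllib.parse`; `looks_valid` is a hypothetical helper name, and note it only catches structural problems (a missing scheme or host), not dead pages, which still need a browser check as described above.

```python
from urllib.parse import urlparse

def looks_valid(url):
    """Cheap sanity check: a crawlable URL needs an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(looks_valid("http://www.dmoz.org/"))     # True
print(looks_valid("www.dmoz.org/Computers"))   # False: no scheme, so no netloc
```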
# -*- coding:utf-8 -*-
import scrapy


class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"
    # allowed_domains should match the site you are actually crawling
    allowed_domains = ["csdn.net"]
    start_urls = [
        "https://blog.csdn.net/weixin_41931602/article/details/80199750",
    ]

    def parse(self, response):
        # response.body is bytes, so open the file in binary mode
        with open("TY.html", "wb") as f:
            f.write(response.body)
2. Indentation issues
While following the free Scrapy video from the "dark horse programmer" (Itheima) course, I typed the code along and still got it wrong: the HTML source of the crawled page never appeared.
# scrapy genspider itcast "itcast.cn"
# -*- coding: utf-8 -*-
# Note: this is Python 2 code, as used in the tutorial video
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from mySpider.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    # The trailing comma is needed, otherwise this is a string, not a tuple
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml#ajavaee",)

    # Be sure to indent the parse function inside the class
    def parse(self, response):
        with open("teacher.html", "w") as f:
            f.write(response.text)
        print response.text
I read it over and over and couldn't see a problem; the URL was correct this time! After checking for a long time, I found I had actually written the parse function outside the class. No wonder no page content was written out: Scrapy never called the function when the spider ran.
By the way, parse(self, response) is the default callback method. It is called after each initial URL has finished downloading, with the Response object returned for that URL passed as its only argument. Its main jobs are:
- Parsing the returned page data (response.body) and extracting structured data (generating items)
- Generating Requests for the next pages to follow
Solution: Be sure to indent the parse function
3. Anti-crawler measures. I haven't run into this myself yet, but before crawling a site you should consider in advance whether it has anti-crawler protections, and write your code to disguise the crawler accordingly (for example, by using a proxy).
Solution: construct a browser-like User-Agent in the request headers.
(The rest will be added later)