[Scrapy-02] Crawler Development Tips and a Case Study for Image Websites

1. The main techniques used.

——The settings used to bypass anti-crawling measures live mainly in settings.py. This case uses three of them.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
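If you would rather keep these overrides next to the spider instead of in the project-wide settings.py, Scrapy also accepts them per spider through the custom_settings class attribute. A minimal sketch, assuming a hypothetical spider named images:

import scrapy


class ImageSpider(scrapy.Spider):
    name = "images"
    # Spider-level overrides; these take precedence over the values in settings.py.
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # shortened for brevity
        "ROBOTSTXT_OBEY": False,
        "COOKIES_ENABLED": False,
    }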

——Some websites use lazy loading, so their content cannot be crawled just by requesting the home page. We need to find the URL that the page requests when it lazy-loads, request that URL manually, and then parse its response; a sketch follows the snippet below.

# Get the channel links directly from the API here
start_urls = ['xxx']
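A minimal sketch of what such a spider might look like, assuming a hypothetical JSON endpoint; the domain, the channels and url field names, and the CSS selector are illustrative, not the original site's:

import json

import scrapy


class ChannelSpider(scrapy.Spider):
    # Start from the lazy-loading API instead of the home page.
    name = "channels"
    start_urls = ["https://www.example.com/api/channels?page=1"]

    def parse(self, response):
        # The endpoint answers with JSON rather than HTML, so parse it directly.
        data = json.loads(response.text)
        for channel in data.get("channels", []):
            # Follow each channel link with a normal Scrapy request.
            yield scrapy.Request(channel["url"], callback=self.parse_channel)

    def parse_channel(self, response):
        # Collect the image URLs from the channel page (selector is illustrative).
        for img_url in response.css("img::attr(src)").getall():
            yield {"image_url": response.urljoin(img_url)}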

——Sometimes the lazy-loading response is JSON, and it contains many escaped \/ sequences. In that case we can clean them up with Python's string replace method, as in the example after the snippet below.

# The links we get need their escape characters handled
cateurl = cateurl.replace("\\/", "/")  # write "\\/" so Python does not warn about the invalid "\/" escape
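For context, a small sketch of the clean-up on a made-up response fragment (the cateurl key and the URL are illustrative); note that json.loads also unescapes \/ when the whole body is parsed as JSON:

import json
import re

# Hypothetical raw response text with escaped slashes, as a lazy-loading API might return it.
raw = '{"cateurl": "https:\\/\\/www.example.com\\/channel\\/cats"}'

# Option 1: parse the JSON properly; the standard library unescapes "\/" for us.
cateurl = json.loads(raw)["cateurl"]

# Option 2 (the approach above): pull the URL out of the raw text with a regex,
# then strip the escapes manually with str.replace.
match = re.search(r'"cateurl":\s*"([^"]+)"', raw)
if match:
    cateurl = match.group(1).replace("\\/", "/")

print(cateurl)  # https://www.example.com/channel/cats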

——The pictures are downloaded with urllib.request.urlretrieve, shown below, so pay attention when importing the package: import urllib alone does not make urllib.request available. A fuller sketch follows the two lines below.

import urllib.request

# Fetch the file at url and save it to the local path filename
urllib.request.urlretrieve(url, filename)
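As a rough sketch of how the download step can be wired up (the helper name, the images directory, and the filename scheme are assumptions, not the original case's choices):

import os
import urllib.request


def save_image(img_url, out_dir="images"):
    # Create the output directory once; exist_ok makes reruns harmless.
    os.makedirs(out_dir, exist_ok=True)
    # Derive a local file name from the last path segment of the URL.
    filename = os.path.join(out_dir, img_url.rstrip("/").split("/")[-1])
    # urlretrieve fetches the URL and writes the body straight to disk.
    urllib.request.urlretrieve(img_url, filename)
    return filename

Scrapy's own ImagesPipeline is an alternative that handles file naming and deduplication for you, but the case here keeps things simple with urlretrieve.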

2. The specific website information has been removed from this case; the crawler is for learning purposes only.

Download address: Use Scrapy to crawl and download all the pictures of an image website
