Python crawler: crawling Mercedes-Benz pictures from Autohome with the Scrapy framework (hands-on)

Let's first look at the result of using the Scrapy framework to crawl the Autohome pictures of the Mercedes-Benz A-Class.

In the result, each subfolder under the images directory contains the pictures for the corresponding category. (The screenshot is not reproduced here.)

1) Open a cmd window and change to the directory where you want to keep the crawler code. Here I go to the python_spider folder on drive E.


C:\Users\15538>e:
E:\>cd python_spider
E:\python_spider>scrapy startproject bc
After the project is created successfully, enter the folder:
E:\python_spider>cd bc
E:\python_spider\bc>scrapy genspider bcA级 autohome.com.cn
If the spider is created successfully, you will see the corresponding confirmation output.

2) Open PyCharm and import the project. The layout looks like this:

Among the files, start.py is one you need to create yourself; see the layout sketch below.
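Since the screenshot is not included here, this is a sketch of the typical layout after scrapy startproject and scrapy genspider, with start.py added by hand (I place it at the project root next to scrapy.cfg; putting it inside the inner bc package also works, since Scrapy searches parent directories for scrapy.cfg):

bc/
├── scrapy.cfg
├── start.py              # created by hand in step 3
└── bc/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── bcA级.py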

3) Create a start.py file so the crawler can be run more conveniently (straight from PyCharm, without typing the scrapy command each time)

from scrapy import cmdline
cmdline.execute("scrapy crawl bcA级".split())
# At this point, start.py is set up

4) Modify settings.py

Around line 22, set ROBOTSTXT_OBEY from True to False.
Uncomment DEFAULT_REQUEST_HEADERS and add a 'User-Agent' entry.
Uncomment ITEM_PIPELINES so the pipeline written in step 7 is enabled (a sketch of all three changes follows).
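A minimal sketch of the edited entries, assuming the default settings.py generated by scrapy startproject; the User-Agent string is only an example, any current browser UA will do:

# Don't obey robots.txt, otherwise requests may be filtered out
ROBOTSTXT_OBEY = False

# Uncomment the default headers and add a browser User-Agent
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
}

# Uncomment so the pipeline from step 7 actually runs
ITEM_PIPELINES = {
    'bc.pipelines.BcPipeline': 300,
}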

5) Open the main spider file bcA级.py and write the crawler

# -*- coding: utf-8 -*-
import scrapy

class Bca级Spider(scrapy.Spider):
	name = 'bcA级'
	allowed_domains = ['autohome.com.cn']  # a plain domain, not a full URL
	start_urls = ['https://car.autohome.com.cn/pic/series/4764.html#pvareaid=3454438']
	# start_urls needs to be changed: open the Autohome site, browse cars by brand --> Mercedes-Benz --> A-Class --> real-shot pictures,
	# then copy that page's address and replace the generated start_urls value with it.
	def parse(self, response):
		# Extract with XPath; we don't need the panoramic-view block, so the slice starts at index 1
		uiboxs = response.xpath("//div[@class='uibox']")[1:]
		for uibox in uiboxs:
			title = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
			urls = uibox.xpath(".//ul/li/a/img/@src").getall()
			# the src values are not absolute URLs, so join each with the response URL
			urls = list(map(lambda url: response.urljoin(url), urls))
			
			# The following two lines are added only after items.py is written in step 6;
			# to keep the explanation clear, they are left commented out for now.
			"""
			item = BcItem(title=title,urls=urls)
			yield item
			"""

6) Write items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BcItem(scrapy.Item):
	# Only these two fields need to be defined
	title = scrapy.Field()
	urls = scrapy.Field()

Go back to bcA级.py and add this import at the top:

from bc.items import BcItem

After that, uncomment the last two lines in the parse method of bcA级.py. (When the spider was actually written, the import and those two lines were added at the same time; I only split them apart here to make the explanation clearer.)

7) Write pipelines.py

import os
from urllib import request

class BcPipeline(object):
	def __init__(self):
		self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)),'images')
		if not os.path.exists(self.path):
			os.mkdir(self.path)
	def process_item(self, item, spider):
		title = item['title']
		urls = item['urls']
		title_path = os.path.join(self.path,title)
		if not os.path.exists(title_path):
			os.mkdir(title_path)
		for url in urls:
			# Name each picture after the tail of its URL: the image addresses share the same prefix,
			# so split on '_' and take the last piece as the file name.
			image_name = url.split("_")[-1]
			# Download the picture into the title_path directory with urllib.request.urlretrieve.
			request.urlretrieve(url, os.path.join(title_path, image_name))
		return item

8) Run start.py to crawl
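If everything is wired up correctly, running start.py (e.g. python start.py, or directly from PyCharm) downloads the pictures. Under the assumptions above, the result directory should look roughly like this; the subfolder names come from the titles extracted by the spider, so the exact names depend on the page:

bc/
└── images/
    ├── <title 1>/
    │   ├── ....jpg
    │   └── ...
    ├── <title 2>/
    │   └── ...
    └── ...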


