Using the Scrapy framework to crawl Taoche's used-car list and detail pages, demonstrating multi-level requests in Scrapy


Taoche (淘车网): https://www.taoche.com/

After choosing a city and a brand, the URL changes accordingly, e.g. https://chongqing.taoche.com/volkswagen/ for Chongqing + Volkswagen.

Create the Scrapy project:

scrapy startproject scrapyProject

Create the spider:

scrapy genspider s_taoche taoche.com
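
After these two commands, the project layout looks roughly like this (a sketch of the default Scrapy layout; main.py is a run script added by hand in step 7, and its placement next to scrapy.cfg is an assumption):

scrapyProject/
    scrapy.cfg
    main.py                 # added manually later (step 7)
    scrapyProject/
        items.py            # step 2
        pipelines.py        # step 5
        settings.py         # step 6
        spiders/
            s_taoche.py     # the spider created by genspider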

Table of Contents

1. Request the list pages

(1) Analyze the URL pattern

(2) First-level request URLs

(3) Pagination

2. items.py

3. Parse the list page

4. Parse the detail page

5. pipelines.py

6. settings.py

(1) Do not obey the robots.txt protocol

(2) Enable the pipeline

(3) Output to a log file

7. Run the project

Method one: cmd -> scrapy crawl s_taoche

Method two: main.py


1. Request the list pages

(1) Analyze the URL pattern

https://{}.taoche.com/{}/

The first {} is filled with the city code and the second {} with the brand code.

Getting the city and brand codes:

Method one: the city and brand codes have already been collected and written into a code file, so they can be used directly.

Method two: request https://www.taoche.com/, build the element tree, extract the cities and brands with XPath, splice them into URLs, and request those URLs again.

Because method two adds a lot of extra steps, I'll go with method one; a sketch of such a code file is shown below.
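
A minimal sketch of what that code file might look like — the module name and the values are assumptions; only the names CITY_CODE and CAR_CODE_LIST are taken from the spider code below:

# config.py (hypothetical module that the spider imports from)
# City codes are the taoche.com subdomains, brand codes are the path segments.
CITY_CODE = ['beijing', 'shanghai', 'chongqing']        # example values only
CAR_CODE_LIST = ['volkswagen', 'audi', 'bmw', 'benz']   # example values only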

(2) First-level request URLs

s_taoche.py

import scrapy
# Import paths below assume the default project layout; CITY_CODE / CAR_CODE_LIST
# come from the pre-built code file described in (1)
from scrapyProject.config import CITY_CODE, CAR_CODE_LIST
from scrapyProject.items import TaoCheItem


class STaocheSpider(scrapy.Spider):
    name = 's_taoche'
    allowed_domains = ['taoche.com']
    # First-level request: one start URL per city/brand combination,
    # e.g. 'https://chongqing.taoche.com/volkswagen/'
    start_urls = []
    # 1. Build the list-page URL for every city and every car brand
    for city in CITY_CODE:
        for car in CAR_CODE_LIST:
            # url = 'https://{}.taoche.com/{}/'.format(city, car)
            url = f'https://{city}.taoche.com/{car}/'
            start_urls.append(url)

    def parse(self, response):
        pass

(3) Pagination

The URL that was just requested can be read from response.url.

Find the maximum page number on the list page, iterate from 1 to that maximum, and splice the page number onto the URL to get the complete paginated URLs.

These URLs are a second-level request, so they cannot be written into start_urls directly; instead we define a second callback function and yield requests to it, written in the same style as parse.

The code:

    def parse(self, response):
        # 1. Get the maximum page number from the first list page
        max_page1 = response.xpath('//div[@class="paging-box the-pages"]/div/a[last()-1]/text()').extract()
        max_page = self.get_value(max_page1)
        # Second-level request: cannot go into start_urls, so yield Requests with a callback
        # 2. Page through the list, e.g. 'https://chongqing.taoche.com/volkswagen/?page=3'
        for i in range(1, int(max_page) + 1):
            # url = response.url + '?page={}'.format(i)
            url = response.url + f'?page={i}'
            # 3. Wrap the request and hand it to our own list-page callback (yield suspends here)
            yield scrapy.Request(url=url, callback=self.parse_taoche)

    def get_value(self, value):
        # Return the first element of an XPath result, or 1 if the result is empty
        if value:
            value = value[0]
        else:
            value = 1
        return value

2. items.py

Define the fields to crawl in the items.py file.

List page:

title (title); resisted_date (registration date); mileage (mileage); city (city); price (original price); sail_price (sale price); detail_url (detail-page URL)

Detail page:

displacement (engine displacement); transmission (gearbox); brand_type (brand and model); loc_od_lic (license plate location); oil_wear (fuel consumption); engine (engine);

tree_high (length/width/height); drive_way (drive type); body_type (body type); che_level (vehicle class); trunk_cao (trunk capacity)

import scrapy


# Before writing the rest of the project, define all the fields first

class TaoCheItem(scrapy.Item):
    # List page
    title = scrapy.Field()  # title
    resisted_date = scrapy.Field()  # registration date
    mileage = scrapy.Field()  # mileage
    city = scrapy.Field()  # city
    price = scrapy.Field()  # original price
    sail_price = scrapy.Field()  # sale price
    detail_url = scrapy.Field()  # detail-page URL
    # Detail page
    displacement = scrapy.Field()  # engine displacement
    transmission = scrapy.Field()  # gearbox
    brand_type = scrapy.Field()  # brand and model
    loc_od_lic = scrapy.Field()  # license plate location
    oil_wear = scrapy.Field()  # fuel consumption
    engine = scrapy.Field()  # engine
    tree_high = scrapy.Field()  # length/width/height
    drive_way = scrapy.Field()  # drive type
    body_type = scrapy.Field()  # body type
    che_level = scrapy.Field()  # vehicle class
    trunk_cao = scrapy.Field()  # trunk capacity

3. Parse the list page

Find the list-page fields you want to crawl yourself.

The detail-page URL is inside the title's a tag.

After obtaining the detail-page link, request detail_url again:

scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'data': item}, encoding='utf-8',
                                 dont_filter=True)

dont_filter=True tells Scrapy not to filter this URL out as a duplicate request.

The code is as follows:

# Callback that parses the list page
    def parse_taoche(self, response):
        # 1. Get every car entry on the page
        car_info_list = response.xpath('//ul[@class="gongge_ul"]/li')
        for car in car_info_list:
            title = car.xpath('./div[2]/a/span/text()').extract()  # title
            title = self.get_value(title)
            resisted_date = car.xpath('./div[2]/p/i[1]/text()').extract()  # registration date
            resisted_date = self.get_value(resisted_date)
            mileage = car.xpath('./div[2]/p/i[2]/text()').extract()  # mileage
            mileage = self.get_value(mileage)
            city = car.xpath('./div[2]/p/i[3]//text()').extract()  # city
            city = ''.join([i.strip() for i in city])
            # city = self.get_value(city)
            price = car.xpath('./div[2]/div[1]/i[3]/text()').extract()  # original price
            price = self.get_value(price)
            sail_price = car.xpath('./div[2]/div[1]/i[2]//text()').extract()
            sail_price = ''.join(sail_price)  # sale price
            detail_url = car.xpath('./div[2]/a/@href').extract()  # detail-page url
            detail_url = 'https:' + self.get_value(detail_url)
            # 2. Instantiate the item
            item = TaoCheItem()
            # 3. Assign the list-page fields
            item['title'] = title
            item['resisted_date'] = resisted_date
            item['mileage'] = mileage
            item['city'] = city
            item['price'] = price
            item['sail_price'] = sail_price
            item['detail_url'] = detail_url
            # Request the detail page, passing the item along via meta;
            # dont_filter=True disables the duplicate-URL filter for this request
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'data': item}, encoding='utf-8',
                                 dont_filter=True)

4. Parse the detail page

The fields needed from the detail page are found in the parameter and summary sections of the page.

I won't walk through the exact XPath expressions; they are easy to find.

The code is as follows:

# Callback that parses the detail page
    def parse_detail(self, response):
        # print(response.url)
        li_list = response.xpath('//div[@class="row parameter-configure"]/div[2]/ul')[0]
        displacement = li_list.xpath('./li[1]//text()').extract()
        displacement = ''.join([i.strip() for i in displacement])  # engine displacement
        oil_wear = li_list.xpath('./li[2]//text()').extract()
        oil_wear = ''.join(oil_wear)  # fuel consumption
        tree_high = li_list.xpath('./li[3]//text()').extract()
        tree_high = ''.join(tree_high)  # length/width/height
        body_type = li_list.xpath('./li[4]//text()').extract()
        body_type = ''.join(body_type)  # body type
        trunk_cao = li_list.xpath('./li[5]//text()').extract()
        trunk_cao = ''.join(trunk_cao)  # trunk capacity
        ul_box = response.xpath('//div[@class="row parameter-configure"]/div[1]/ul')[0]
        brand_type = ul_box.xpath('./li[1]/span//text()').extract()
        brand_type = ''.join(brand_type)  # brand and model
        loc_od_lic = ul_box.xpath('./li[2]/span//text()').extract()
        loc_od_lic = ''.join(loc_od_lic)  # license plate location
        engine = ul_box.xpath('./li[3]/span//text()').extract()
        engine = ''.join(engine)  # engine
        drive_way = ul_box.xpath('./li[4]/span//text()').extract()
        drive_way = ''.join(drive_way)  # drive type
        che_level = ul_box.xpath('./li[5]/span//text()').extract()
        che_level = ''.join(che_level).strip()  # vehicle class
        transmission = response.xpath('//div[@class="summary-attrs"]/dl[3]/dd/text()').extract()[0]
        transmission = transmission.split('/')[1]  # gearbox
        item = response.meta['data']
        item['displacement'] = displacement
        item['transmission'] = transmission
        item['brand_type'] = brand_type
        item['loc_od_lic'] = loc_od_lic
        item['oil_wear'] = oil_wear
        item['engine'] = engine
        item['tree_high'] = tree_high
        item['drive_way'] = drive_way
        item['body_type'] = body_type
        item['che_level'] = che_level
        item['trunk_cao'] = trunk_cao
        # print(item)
        yield item

5. pipelines.py

Save the data.

import json


class TaoChePipeline(object):
    def process_item(self, item, spider):
        # Append each item as one JSON line; the with-block makes sure the file is closed
        with open('taoche.txt', 'a', encoding='utf-8') as fp:
            json.dump(dict(item), fp, ensure_ascii=False)
            fp.write('\n')
        return item

6. settings.py

Things to change:

(1) Do not obey the robots.txt protocol

(2) Enable the pipeline

(3) Output to a log file
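
A sketch of the corresponding settings.py changes — the pipeline path is assumed from the project name, and the choice of the four log settings is an assumption based on the explanation in step 7:

# settings.py (only the relevant entries are shown)

# (1) do not obey robots.txt
ROBOTSTXT_OBEY = False

# (2) enable the item pipeline (path assumed from the project name scrapyProject)
ITEM_PIPELINES = {
    'scrapyProject.pipelines.TaoChePipeline': 300,
}

# (3) write log output to a file (an assumed set of the "four" log settings referred to below)
LOG_ENABLED = True
LOG_FILE = 'taoche.log'
LOG_LEVEL = 'DEBUG'
LOG_ENCODING = 'utf-8'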

7. Run the project

Method one: cmd -> scrapy crawl s_taoche

Method two: main.py

from scrapy import cmdline

# Use only one of the two lines below; execute() runs the command and does not return
cmdline.execute("scrapy crawl s_taoche --nolog".split())  # first line: suppress all log output
cmdline.execute("scrapy crawl s_taoche".split())          # second line: logging controlled by settings.py

Explanation of the first line:

With --nolog, no log output appears in the console or in the taoche.log file, regardless of whether LOG_FILE, LOG_ENABLED and the other two log settings are set in settings.py.

Explanation of the second line:

If LOG_FILE, LOG_ENABLED and the other two log settings are set in settings.py, the command line prints no log output, but the log content is saved to taoche.log.

If they are not set, the log output is printed on the command line.

               Log settings in settings.py?   Log output in console?   Saved to log file?
Line 1         yes or no                      no                       no
Line 2         yes                            no                       yes
Line 2         no                             yes                      no

 

 
