Python Web Scraping in Practice, Autohome Part 3: Car Models

1. Crawl Logic Analysis

Core idea

1. Split the crawl into two layers: flow control and content parsing
1) Flow control dispatches the requests for the three sale states: on sale (在售), coming soon (即将销售), and discontinued (停售)
2) Content parsing loops over the model list on the current page and issues the pagination requests (see the sketch after this list)
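
Below is a minimal sketch of this two-layer split (hypothetical names FlowSketchSpider and parse_models, used only for illustration). The real specSpider at the end of this post fleshes it out and adds guards so the tab the landing page already shows is not requested a second time.

import scrapy

class FlowSketchSpider(scrapy.Spider):
    name = "flowSketch"
    start_urls = ["https://car.autohome.com.cn/price/series-4741.html"]

    def parse(self, response):
        # flow control: the landing page already shows one sale state,
        # so parse it in place, then dispatch the other state tabs
        yield from self.parse_models(response)
        for href in response.css(".tab-nav ul li a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_models)

    def parse_models(self, response):
        # content parsing: loop over the model list, then follow pagination
        for li in response.css(".interval01-list > li"):
            yield {"spec_id": li.attrib.get("data-value")}
        next_href = response.css(".page a:last-child::attr(href)").get()
        if next_href and "javascript" not in next_href:
            yield response.follow(next_href, callback=self.parse_models)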

2. Sale-Status Analysis on the Listing Page

<div class="tab-nav border-t-no"> 
   <!--状态tab、排序--> 
   <div class="brandtab-cont-sort">
    <a href="/price/series-4741-0-2-0-0-0-0-1.html" class="ma-r15  current">最热门<i class="icon10 icon10-up"></i></a>
    <a href="/price/series-4741-1-2-0-0-0-0-1.html">按价格<span class="icon-cont"><i class="icon10 icon10-sjt"></i><i class="icon10 icon10-sjb"></i></span></a>
   </div>
   <ul data-trigger="click">
    <li class="disabled"><span title="在售是指官方已经公布售价且正式在国内销售的车型">在售</span></li>
    <li class="current"><a href="/price/series-4741-0-2-0-0-0-0-1.html" data-toggle="tab" data-target="#brandtab-2" rel="nofollow" target="_self" title="即将销售是指近期即将在国内销售的车型">即将销售</a></li>
    <li class="disabled"><span title="停售是指厂商已停产并且经销商处已无新车销售的车型">停售</span></li>
   </ul> 
  </div>
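
Note that a disabled tab (a state with no models) renders a plain <span>, while a state with its own listing page carries an <a href>; the spider below relies on exactly that difference to tell which state the landing page belongs to. A standalone sketch using parsel (the selector library behind Scrapy), run against a trimmed copy of the fragment above:

from parsel import Selector

html = '''<div class="tab-nav border-t-no"><ul>
 <li class="disabled"><span>在售</span></li>
 <li class="current"><a href="/price/series-4741-0-2-0-0-0-0-1.html">即将销售</a></li>
 <li class="disabled"><span>停售</span></li>
</ul></div>'''

links = {'在售': '-1', '即将销售': '-1', '停售': '-1'}
for li in Selector(text=html).css(".tab-nav").xpath("ul/li"):
    a = li.xpath("a")
    if a:  # only states with their own listing page carry a link
        links[a.xpath("text()").extract_first()] = a.xpath("@href").extract_first()
print(links)  # {'在售': '-1', '即将销售': '/price/series-4741-0-2-0-0-0-0-1.html', '停售': '-1'}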

3. Model-List Analysis on the Listing Page

  <ul class="interval01-list">
   <li data-value="33885">
    <div class="interval01-list-cars">
     <div class="interval01-list-cars-infor">
      <p id="p33885"><a href="//www.autohome.com.cn/spec/33885/#pvareaid=2042128" target="_blank">2018款 基本型</a></p>
      <p></p>
      <p><span></span><span></span></p>
     </div>
    </div>
    <div class="interval01-list-attention">
     <div class="attention">
      <span id="spgzd33885" class="attention-value" style="width:100%"></span> 
     </div>
    </div>
    <div class="interval01-list-guidance">
     <div>
      <a href="//j.autohome.com.cn/pcplatform/staticpage/loan/index.html?specid=33885&amp;pvareaid=2020994" target="_blank" title="购车费用计算"><i class="icon16 icon16-calendar"></i></a> 16.99万
     </div>
    </div>
    <div class="interval01-list-lowest">
     <div>
      <span class="red price-link">暂无报价</span> 
      <a class="js-xj btn btn-mini btn-blue btn-disabled" id="pxj33885" name="pxjs4741" target="_blank">询价</a>
     </div>
    </div>
    <div class="interval01-list-related">
     <div>
      <span id="spspk33885">口碑</span>
      <a class="fn-hide" target="blank" id="spk33885">口碑</a> 
      <a href="/pic/series-s33885/4741.html#pvareaid=100678" target="_blank">图片</a> 
      <span id="spsps33885">视频</span>
      <a class="fn-hide" target="blank" id="sps33885">视频</a> 
      <span>配置</span> 
     </div>
    </div></li>
  </ul>
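
Each <li> carries the model id in its data-value attribute, and the name link sits inside a <p> whose id is "p" plus that id; the href is protocol-relative and ends in a "#pvareaid=..." tracking fragment. A minimal standalone sketch of the extraction, mirroring what extractSpecItem in the spider below does:

from parsel import Selector

html = '''<ul class="interval01-list">
 <li data-value="33885">
  <p id="p33885"><a href="//www.autohome.com.cn/spec/33885/#pvareaid=2042128">2018款 基本型</a></p>
 </li>
</ul>'''

for li in Selector(text=html).css(".interval01-list").xpath("li"):
    specId = li.xpath("@data-value").extract_first()
    nameLink = li.css("#p" + specId).xpath("a")
    name = nameLink.xpath("text()").extract_first()
    # prepend the scheme, then drop the tracking fragment and trailing slash
    link = ("https:%s" % nameLink.xpath("@href").extract_first()).split("#")[0].rstrip("/")
    print(specId, name, link)  # 33885 2018款 基本型 https://www.autohome.com.cn/spec/33885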

4. Pagination Analysis for the Discontinued Tab

<div class="page">
   <a class="page-item-prev page-disabled" href="javascript:void(0)">上一页</a>
   <a href="javascript:void(0);" class="current">1</a>
   <a href="/price/series-65-0-3-0-0-0-0-2.html">2</a>
   <a href="/price/series-65-0-3-0-0-0-0-3.html">3</a>
   <a href="/price/series-65-0-3-0-0-0-0-4.html">4</a>
   <a href="/price/series-65-0-3-0-0-0-0-5.html">5</a>
   <a href="/price/series-65-0-3-0-0-0-0-6.html">6</a>
   <a class="page-item-next" href="/price/series-65-0-3-0-0-0-0-2.html">下一页</a>
  </div>
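
The last <a> in the pager is always the next-page link, and on the final page its href degrades to javascript:void(0); that is exactly what the nextPage.find("java") check in the spider detects. A trimmed standalone sketch:

from parsel import Selector

html = '''<div class="page">
 <a class="page-item-prev page-disabled" href="javascript:void(0)">上一页</a>
 <a href="javascript:void(0);" class="current">1</a>
 <a href="/price/series-65-0-3-0-0-0-0-2.html">2</a>
 <a class="page-item-next" href="/price/series-65-0-3-0-0-0-0-2.html">下一页</a>
</div>'''

pager = Selector(text=html).css(".page")
if pager:
    nextHref = pager.xpath("a")[-1].xpath("@href").extract_first()
    if nextHref.find("java") == -1:  # a real link, not javascript:void(0)
        print("https://car.autohome.com.cn" + nextHref)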

5. Scraping the Guide Price and User Score

Since the manufacturer's guide price and the user score only appear on the model detail page, the model table has to be updated later, while the model configurations are being crawled.

1. Guide-price HTML fragment

<span class="factoryprice">厂商指导价:12.08<em>万元</em></span> 

2. HTML fragment with a user score

       <div class="spec-content"> 
        <div class="koubei-con"> 
         <div class="koubei-left"> 
          <div class="koubei-data"> 
           <span>网友评分:<a href="//k.autohome.com.cn/spec/33409/#pvareaid=3454572" class="scroe">4.28分</a></span> 
           <span>口碑印象:<a href="//k.autohome.com.cn/spec/33409/#pvareaid=3454573" class="count">5人参与评价</a></span> 
          </div> 
          <div class="koubei-tags"> 
           <a href="//k.autohome.com.cn/spec/33409/?summarykey=530800&amp;g#pvareaid=3454574" class="athm-tags athm-tags--blue">油耗满意</a> 
           <a href="//k.autohome.com.cn/spec/33409/?summarykey=457074&amp;g#pvareaid=3454574" class="athm-tags athm-tags--default">胎噪很硬</a> 
          </div> 
         </div> 
         <div class="koubei-right"> 
          <p class="koubei-user"> <span> <a href="javascript:void(0)" id="koubei_user" data-userid="32543097" target="_blank"> </a> <i>发表</i> </span> <span> <b title="2018款 118i 时尚型">2018款 118i 时尚型</b> <i>口碑</i> </span> <span><em>车主已追加口碑</em></span> </p> 
          <p class="koubei-info"> <span>裸车价:<em>15.9万</em></span> <span>购车时间:<em>2018年3月</em></span> <span> 耗 电 量: <em>暂无</em> </span> </p> 
          <p class="koubei-list"> <a href="//k.autohome.com.cn/spec/33409/view_2013088_1.html#pvareaid=3454575"> 【最满意的一点】 1、颜值。这个毋庸置疑,尤其是车头真长,显得整车就没有那么小了,而且蓝色虽然已经烂大街了,不过真的...<i>详细 &gt;</i> </a> </p> 
         </div> 
        </div> 
       </div> 

3. HTML fragment without a user score

  <div class="spec-content"> 
   <!-- 空数据 --> 
   <div class="koubei-blank"> 
    <p>本车型暂无优秀口碑,发表优秀口碑赢丰富好礼</p> 
    <p><a href="//k.autohome.com.cn/form/carinput/add/31960#pvareaid=3454571" class="athm-btn athm-btn--mini athm-btn--blue-outline">发表口碑</a></p> 
   </div> 
  </div> 
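
Only the first variant contains a .koubei-data block, so its presence doubles as the has-score test. A standalone sketch covering both variants (the helper name extract_score is made up for illustration):

from parsel import Selector

def extract_score(sel):
    # .koubei-data only exists when the model has reviews;
    # the empty variant renders .koubei-blank instead
    data = sel.css(".koubei-data")
    if not data:
        return 0
    text = data.xpath("span/a")[0].xpath("text()").extract_first()  # e.g. "4.28分"
    return float(text.rstrip("分"))

withScore = Selector(text='<div class="koubei-data"><span>网友评分:<a class="scroe">4.28分</a></span></div>')
noScore = Selector(text='<div class="koubei-blank"><p>本车型暂无优秀口碑</p></div>')
print(extract_score(withScore), extract_score(noScore))  # 4.28 0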

Core spider code:


import re
import scrapy
import pymysql
from ..mySqlUtils import MySqlUtils
from ..items import SpecItem,SeriesItem
from ..pipelines import SpecPipeline

# Spider that requests the model (spec) list under each series and
# completes the model records in the database
class specSpider(scrapy.Spider):
    name = "specSpider"
    https = "https:%s"
    host = "https://car.autohome.com.cn%s"
    count = 0
    ruleId = 2  # crawl policy: 1 = only crawl models missing from the database, 2 = update everything
    chexingIdSet = None  # set of already-crawled model ids loaded from the database

    # Parse one page of the model list and hand the items to the pipeline
    def parseSpec(self, response):
        # parse
        seriesParams = response.meta['seriesParams']
        specList = self.extractSpecItem(response)
        # save to the database
        for specItem in specList:
            yield specItem

        # after parsing the current page, check for pagination; if a next
        # page exists, request it and parse it back into this method
        pageData = response.css(".page")
        if pageData:
            # the last <a> in the pager is the "next page" link
            pageList = pageData.xpath("a")
            nextPage = pageList[len(pageList) - 1].xpath("@href").extract_first()
            # on the last page the link degrades to javascript:void(0)
            if nextPage.find("java") == -1:
                pageLink = self.host % nextPage
                request = scrapy.Request(url=pageLink, callback=self.parseSpec)
                request.meta['seriesParams'] = seriesParams  # (brand id, series id)
                yield request



    # Parse the series page: series summary first, then the model list and state dispatch
    def parse(self, response):

        seriesItem = SeriesItem()
        seriesParams = response.meta['seriesParams']

        # parse the series summary block
        seriesData = response.css(".lever-ul").xpath("*")
        # vehicle class, e.g. '级\xa0\xa0别:中型SUV'
        lever = seriesData[0].xpath("string(.)").extract_first()
        lever = lever.split(":")[1].strip()
        # manufacturer's guide price range
        minPrice = 0
        maxPrice = 0
        seriesDataRight = response.css(".main-lever-right").xpath("*")
        price = seriesDataRight[0].xpath("span/span/text()").extract_first()
        if price.find("-") != -1:
            price = price.rstrip("万")
            price = price.split("-")
            minPrice = price[0]
            maxPrice = price[1]
        # user score
        userScore = 0
        userScoreStr = seriesDataRight[1].xpath("string(.)").extract_first()
        if re.search(r'\d+', userScoreStr) is not None:
            userScore = userScoreStr.split(":")[1]
        # save the series summary to the database
        seriesItem['minMoney'] = minPrice
        seriesItem['maxMoney'] = maxPrice
        seriesItem['score'] = userScore
        seriesItem['jibie'] = lever
        seriesItem['chexiID'] = seriesParams[1]
        # self.log(seriesItem)
        yield seriesItem

        # parse the model list on the current page
        specList = self.extractSpecItem(response)
        # self.log(specList)
        # save to the database
        for specItem in specList:
            yield specItem


        # Dispatch the remaining sale states.
        # Crawl logic:
        #   1. read the three sale states: on sale (在售), coming soon (即将销售), discontinued (停售)
        #   2. for each state that has models, decide whether the current page already is that state
        #   3. parse the data of the current state
        #   4. keep requesting while pagination exists

        # 1.1 the three state links
        sellingLink = '-1'   # on sale
        sellWaitLink = '-1'  # coming soon
        sellStopLink = '-1'  # discontinued
        # 1.2 read the three states; a disabled tab renders a <span>, a tab with models carries an <a>
        statusData = response.css(".tab-nav.border-t-no")
        statusList = statusData.xpath("ul/li")
        for statusItem in statusList:
            status = statusItem.xpath("a")
            if status:
                statusDes = status.xpath("text()").extract_first()
                link = status.xpath("@href").extract_first()
                if statusDes == '在售':
                    sellingLink = link
                if statusDes == '即将销售':
                    sellWaitLink = link
                if statusDes == '停售':
                    sellStopLink = link
        # self.log("-------------------------->status")
        statusPrint = (sellingLink, sellWaitLink, sellStopLink)
        # self.log(statusPrint)

        # 2.1 when the on-sale state exists, it is the landing page itself and was parsed above
        # 2.2 handle the coming-soon state
        if sellWaitLink != '-1':
            # if the on-sale state has a link, the current page is the on-sale tab and the
            # coming-soon tab must be requested; otherwise we landed on it directly and it
            # has already been parsed above
            if sellingLink != '-1':
                request = scrapy.Request(url=self.host % sellWaitLink, callback=self.parseSpec)
                request.meta['seriesParams'] = seriesParams  # (brand id, series id)
                yield request
        # 2.3 handle the discontinued state
        if sellStopLink != '-1':
            # if the on-sale or coming-soon state has a link, we did not land on the
            # discontinued tab directly and it must be requested before parsing; otherwise
            # it is the current page, already parsed, and only pagination remains
            if sellingLink != '-1' or sellWaitLink != '-1':
                request = scrapy.Request(url=self.host % sellStopLink, callback=self.parseSpec)
                request.meta['seriesParams'] = seriesParams  # (brand id, series id)
                yield request
            else:
                # follow pagination if it exists
                pageData = response.css(".page")
                if pageData:
                    # the last <a> in the pager is the "next page" link
                    pageList = pageData.xpath("a")
                    nextPage = pageList[len(pageList) - 1].xpath("@href").extract_first()
                    # on the last page the link degrades to javascript:void(0)
                    if nextPage.find("java") == -1:
                        pageLink = self.host % nextPage
                        request = scrapy.Request(url=pageLink, callback=self.parseSpec)
                        request.meta['seriesParams'] = seriesParams  # (brand id, series id)
                        yield request


    def start_requests(self):
        self.chexingIdSet = MySqlUtils.parseToChexingIdSet(MySqlUtils.querySpec())
        # read the series table from the database to get the per-series listing links
        seriesItems = MySqlUtils.querySeriesLink()
        # seriesItems=["https://car.autohome.com.cn/price/series-4171.html"] # test: discontinued
        # seriesItems=["https://car.autohome.com.cn/price/series-4887.html"] # test: coming soon, model id 35775
        # resume crawling from the checkpoint
        waitingCrawlItems = list()
        for seriesId in SpecPipeline.waitingCrawlSeriesIdSet:
            for item in seriesItems:
                if seriesId == item[1]:
                    waitingCrawlItems.append(item)
                    break
        # waitingCrawlItems = MySqlUtils.findChexiInChexiSet(seriesItems, SpecPipeline.waitingCrawlSeriesIdSet)

        for item in waitingCrawlItems:
            # track the series crawled so far
            SpecPipeline.crawledSeriesCount += 1
            SpecPipeline.crawledSeriesIdSet.add(item[1])
            url = item[2]
            # url = item
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['seriesParams'] = (item[0], item[1])  # (brand id, series id)
            # request.meta['seriesParams'] = ('122', '4887')  # (brand id, series id)
            yield request







    # Helper: extract the model items from one listing page
    def extractSpecItem(self, response):
        seriesParams = response.meta['seriesParams']
        specDataGroups = response.css(".interval01-list")
        specList = list()
        for specDataGroup in specDataGroups:
            for specDataItem in specDataGroup.xpath("li"):
                # model id
                specId = specDataItem.xpath("@data-value").extract_first()
                specNameData = specDataItem.css("#p" + specId).xpath("a")
                # model name
                specName = specNameData.xpath("text()").extract_first()
                # model link (the href is protocol-relative)
                specLink = self.https % specNameData.xpath("@href").extract_first()
                # strip the "#pvareaid=..." tracking fragment and the trailing slash
                hashPos = specLink.find("#")
                if hashPos != -1:
                    specLink = specLink[0:hashPos].rstrip("/")
                specItem = SpecItem()
                specItem['pinpaiID'] = seriesParams[0]
                specItem['chexiID'] = seriesParams[1]
                specItem['chexingID'] = specId
                specItem['name'] = specName
                specItem['url'] = specLink
                specItem['sqlType'] = '1'
                # self.log(specItem)
                # count newly discovered models
                if specId not in self.chexingIdSet:
                    SpecPipeline.addSpecCount += 1
                # with ruleId == 1, only new models are saved; existing ones are skipped
                if self.ruleId == 1:
                    if specId in self.chexingIdSet:
                        continue

                self.log("yieldCount:%d" % self.count)
                # queue the model for the database pipeline
                self.count += 1
                specList.append(specItem)

        return specList




    # (legacy) parse and batch-save models straight to the database, bypassing the item pipeline
    # def parseSellingSpec(self,response):
    #     print(">>>>>>>>>>>>>>>>>>>>>>>>>>>parseSellingSpec")
    #     t=type(response)
    #     self.log(t)
    #     # parse
    #     seriesParams = response.meta['seriesParams']
    #     specDataGroups = response.css(".interval01-list")
    #     self.log(seriesParams)
    #     self.log(specDataGroups)
    #     specList=list()
    #     for specDataGroup in specDataGroups:
    #         for specDataItem in specDataGroup.xpath("li"):
    #             # model id
    #             specId = specDataItem.xpath("@data-value").extract_first()
    #             specNameData = specDataItem.css("#p" + specId).xpath("a")
    #             # model name
    #             specName = specNameData.xpath("text()").extract_first()
    #             # model link
    #             specLink = self.https % specNameData.xpath("@href").extract_first()
    #             pingpaiID=seriesParams[0]
    #             chexiID=seriesParams[1]
    #             chexingID=specId
    #             specItem=(chexingID,pingpaiID,chexiID,specName,specLink)
    #             specList.append(specItem)
    #             self.log(specItem)
    #             # save the model to the database
    #             self.count += 1
    #             self.log("saveCount:%d" % self.count)
    #             # yield specItem  # yield only works in callbacks registered via scrapy.Request
    #     # save the batch to the database with MySqlUtils
    #     MySqlUtils.insertSpecItemList(specList)

    # (legacy) parse the user score and guide price from the model detail page
    # def parseScoreAndPrice(self,response):
    #     # the model item is passed along via request meta
    #     specItem=response.meta['specItem']
    #     # parse the user score
    #     scoreData = response.css(".koubei-data")
    #     score=0
    #     if scoreData:
    #         score = scoreData.xpath("span/a")[0].xpath("text()").extract_first()
    #         score = score[0:score.find("分")]
    #     # parse the guide price
    #     priceData = response.css(".factoryprice")
    #     price=0
    #     if priceData:
    #         price = priceData.xpath("text()").extract_first()
    #         price=price.split(":")[1]
    #     specItem['money']=price
    #     specItem['score']=score
    #     self.log(specItem)
    #     # save the model info to the database
    #     yield specItem
