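Python code for scraping mobile phone information from Yesky (天极网)

Notes on some problems I ran into and what I learned along the way:

1. Garbled Chinese characters when fetching the page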

response.encoding = 'gb2312'

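Set the encoding to match the charset declared in the head of the site's source code. I also tried utf-8, but the text was still garbled.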
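As a rough sketch of that idea, the snippet below reads the charset out of a page's meta tag and decodes the raw bytes with it. `charset_from_html` is a hypothetical helper of my own (it is not part of requests), and the sample page is made up for the demonstration.

```python
import re

def charset_from_html(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named near the top of the page, or `default`."""
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default

# A made-up page, encoded the way a gb2312 site would serve it.
raw = '<html><head><meta charset="gb2312"></head><body>手机</body></html>'.encode("gb2312")

encoding = charset_from_html(raw)
text = raw.decode(encoding)                      # decoded with the declared charset
garbled = raw.decode("utf-8", errors="replace")  # decoded with the wrong charset
```

With requests, the same value could be assigned before reading the text, for example `response.encoding = charset_from_html(response.content)`.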

2. How to strip whitespace when extracting text with XPath

# normalize-space() does this
xpath('.//dl[@class="nav bdb1"]/dt/a[3]/text()') # " 标题" (leading space kept)
xpath('normalize-space(.//dl[@class="nav bdb1"]/dt/a[3]/text())') # "标题"
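One thing to keep in mind: when normalize-space() is given several nodes, XPath 1.0 only uses the first one. For strings that are already in Python (for example the items of a `.getall()` result), the same cleanup can be sketched in plain Python. `normalize_space` here is a hypothetical helper written for this note, not a parsel function.

```python
import re

# XPath 1.0 treats only space, tab, CR and LF as whitespace.
_XPATH_WS = re.compile(r"[ \t\r\n]+")

def normalize_space(value):
    """Trim the ends and collapse inner whitespace runs, like XPath normalize-space()."""
    if value is None:  # nothing matched: return "" instead of None
        return ""
    return _XPATH_WS.sub(" ", value).strip(" ")

cleaned = [normalize_space(s) for s in [" 标题", "a \t\n b ", None]]
```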

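3. Matching by text in XPath

At first I located a specific tag by position to get the data, but the tables on this site's product detail pages are not all the same: the structure or the row order differs from page to page, so the extracted values did not line up with their fields. For example, the query for the processor returned the Android version. I first thought about writing several rules for the different page layouts, but merging the data that way would also be a hassle. So I match on the label text directly and take the value from there.

# If every detail page has the same structure, locating by position works

# If the structure differs, use a text-matching query instead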

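XPath: matching by text

# exact match
//div[text()="text"]
# fuzzy (substring) match; with several text nodes only the first one is checked
//div[contains(text(),"text")]
# match on an attribute value (put the class substring between the quotes)
//div[contains(@class, "")]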
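To make the difference between the two approaches concrete, here is a small sketch with two made-up tables whose rows are in a different order. So that it runs without extra packages it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()` or `contains()`), so the label is matched with the `tr[th='...']` form instead. The sample data and the helper names are my own assumptions.

```python
import xml.etree.ElementTree as ET

# Two made-up parameter tables with the same fields in a different order.
PAGE_A = "<table><tr><th>CPU</th><td>chip A</td></tr><tr><th>OS</th><td>Android 10</td></tr></table>"
PAGE_B = "<table><tr><th>OS</th><td>Android 9</td></tr><tr><th>CPU</th><td>chip B</td></tr></table>"

def by_position(xml: str, row: int) -> str:
    """Take the value from a fixed row number (breaks when rows move)."""
    return ET.fromstring(xml).findtext("./tr[{}]/td".format(row), default="")

def by_label(xml: str, label: str) -> str:
    """Take the value from the row whose <th> text equals `label`."""
    return ET.fromstring(xml).findtext("./tr[th='{}']/td".format(label), default="")

cpu_by_position = [by_position(page, 1) for page in (PAGE_A, PAGE_B)]
cpu_by_label = [by_label(page, "CPU") for page in (PAGE_A, PAGE_B)]
```

The positional lookup returns the operating system for the second page, while the label lookup returns the processor for both, which is the mismatch described above.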

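XPath: finding parent, sibling, and child elements

# all following siblings (like jQuery nextAll)
//div[text()="text"]/following-sibling::*
# the next div sibling (like jQuery next)
//div[text()="text"]/following-sibling::div[1]

# all preceding siblings (like jQuery prevAll)
//div[text()="text"]/preceding-sibling::*
# the previous div sibling (like jQuery prev)
//div[text()="text"]/preceding-sibling::div[1]

# parent element
//div[text()="text"]/..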

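Below is the code that scrapes the phone information. I'm a beginner and not good at optimizing yet, so if you spot problems, feel free to point them out and we can learn together.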

import requests
import parsel
import time
import json

# Silence the InsecureRequestWarning triggered by verify=False
requests.packages.urllib3.disable_warnings()
# 1. Work out the correct URL and how the pages are served
result = []
t1 = time.time()
for page in range(1,182): # loop over the list pages
    print('===================== Scraping page {} ============='.format(page))
    url = "http://product.yesky.com/mobilephone/list{}.html#page".format(page)
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
    # 2. Send the request; the returned data can be text, image URLs, css, js
    response = requests.get(url=url,headers=headers, verify=False)
    response.encoding = 'gb2312'
    html_data = response.text
    # 3. Parse the data
    selector = parsel.Selector(html_data)
    lis = selector.xpath('//div[@class="list blue"]') # one node per product in the list

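    for li in lis: # loop over the products on this page
        href_url = li.xpath('.//h2/a/@href').get() # link to the product's own page
        if not href_url: # no link found: skip instead of failing on None + str
            continue
        href_url = href_url+'param.shtml'
        #print(href_url)
        # 2. Request the product's parameter page
        response = requests.get(url=href_url, headers=headers, verify=False)
        response.encoding = 'gb2312' # fixes the garbled text; use the same charset the site declares
        c_html_data = response.text

        # 3. Parse the data
        selector = parsel.Selector(c_html_data)
        c_lis = selector.xpath('//body[@class="body"]') # outermost reference node on the detail page
        #print(c_html_data)
        # normalize-space() trims the whitespace around the extracted text and gives "" when
        # nothing matches; without it .get() returns None, which becomes null in the JSON
        for c_li in c_lis:
            brand = c_li.xpath('normalize-space(.//dl[@class="nav bdb1"]/dt/a[3]/text())').get()  # brand
            name = c_li.xpath('normalize-space(.//div[@class="pro_name gray1"]/h1/text())').get()  # name
            name = name.split("(")[0] # keep only the part before "("
            if brand != '苹果' and name != "${productname}参数": # skip Apple and pages whose title is still the raw template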
                # "model" replaces the original variable name "type", which shadowed the builtin
                model = c_li.xpath('normalize-space(.//div[@class="mainparam"]/table[4]/tr/th[contains(text(),"手机昵称")]/following-sibling::td[1]/text())').get()  # network access model (taken from the "手机昵称" row)
                resolution = c_li.xpath('normalize-space(.//div[@class="mainparam"]/table[2]/tr/th[contains(text(),"分辨率")]/following-sibling::td[1]/text())').get()  # resolution
                system = c_li.xpath('normalize-space(.//div[@class="mainparam"]/table[2]/tr/th[contains(text(),"操作系统版本")]/following-sibling::td[1]/text())').get()  # operating system
                cpu = c_li.xpath('normalize-space(.//div[@class="mainparam"]/table[2]/tr/th[contains(text(),"CPU型号")]/following-sibling::td[1]/text())').get()  # cpu
                ppi = c_li.xpath('normalize-space(.//div[@class="mainparam"]/table[6]/tr/th[contains(text(),"屏幕像素密度")]/following-sibling::td[1]/text())').get()  # ppi
                if ppi == "": # on some pages the label sits inside a link in the th
                    ppi = c_li.xpath('normalize-space(.//div[@class="mainparam"]/table[6]/tr/th/a[contains(text(),"屏幕像素密度")]/../following-sibling::td[1]/text())').get()  # ppi
                ram = c_li.xpath('normalize-space(.//div[@class="mainparam"]/table[2]/tr/th[contains(text(),"RAM容量")]/following-sibling::td[1]/text())').get()  # RAM
                one = {
                    'brand': brand,
                    'name': name,
                    'type': model,
                    'resolution': resolution,
                    'system': system,
                    'cpu': cpu,
                    'ppi': ppi,
                    'ram': ram,
                }

                result.append(one)
                print('Record downloaded')
    # Rewrite the JSON file after every page so finished pages are kept
    with open('tjw.json', 'w', encoding='utf-8') as file:
        file.write(json.dumps(result, indent=2, ensure_ascii=False))
    print("Elapsed:", time.time() - t1)
    time.sleep(10) # pause between list pages
    #break  # break out here; otherwise the loop ran twice and the second pass had no data, which left blank rows in the table and objects with all-null values in the JSON
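Since the per-field queries differ only in the label, one possible way to shorten them is to build the expression from the label. This is only a sketch and I have not run it against the site: `param_xpath` and `FIELDS` are hypothetical names, the expression drops the `table[N]` index on the assumption that each label appears once inside `div.mainparam`, and it uses `contains(., ...)` so that a label wrapped in a link inside the `th` is matched too.

```python
def param_xpath(label: str) -> str:
    """Build an XPath returning the trimmed text of the <td> that follows
    the <th> containing `label`, anywhere inside div.mainparam."""
    if '"' in label:  # the label is placed inside a double-quoted XPath string
        raise ValueError("label must not contain a double quote")
    return (
        'normalize-space(.//div[@class="mainparam"]'
        '//th[contains(., "{}")]'
        '/following-sibling::td[1])'
    ).format(label)

# Output key -> label text as it appears on the detail page (labels taken from the code above).
FIELDS = {
    "resolution": "分辨率",
    "system": "操作系统版本",
    "cpu": "CPU型号",
    "ppi": "屏幕像素密度",
    "ram": "RAM容量",
}

expressions = {key: param_xpath(label) for key, label in FIELDS.items()}
```

Inside the loop this could then be used as `one = {key: c_li.xpath(expr).get() for key, expr in expressions.items()}`. Note that `normalize-space(td)` returns the text of the whole cell including nested tags, which is slightly different from the `td[1]/text()` form used above.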
