How to Strip Web Page Noise and Extract Data (01): Qunar

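1. Requirements

  • Today's goal is to scrape flight data from Qunar (去哪儿网). This data is valuable, so it is heavily protected: the raw data is hard to obtain, and the rendered data is deliberately obfuscated as well.
  • It is difficult, but for a spider demon that has been cultivating for a thousand years, no data is beyond reach.
  • Watch how I spin the web and catch the prey... er, I mean, scrape the data.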

2. Environment

  • Python 3.6.1
  • OS: Windows 7
  • IDE: PyCharm
  • Chrome browser installed
  • chromedriver configured (environment variable set)
  • selenium 3.7.0

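3. Site analysis

3.1. Analyzing the page requests

  • Inspecting the requests shows that the page's own HTML is very small; almost all of the data arrives through AJAX requests.
  • Now look at the JSON those AJAX requests return: the price cannot be found in any of the responses, so it is hidden very deep.
  • Don't lose heart, though. We still have one last trump card, a peerless sword: selenium. Once drawn, it lays waste to everything.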

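3.2. Analyzing the price data itself

  • Since we have decided to scrape with selenium, we need to look at how the data is presented after rendering.
  • First, by copying: the price cannot be copied faithfully. The page displays 3028, but the copied text is 232803, which shows the data has been obfuscated.
  • Second, by inspecting the element to see how the obfuscation works: the page first places four digits at fixed positions, then covers two of them with other digits stacked on top, so the digits underneath become invisible. Copying therefore picks up every digit, 232803, while the user sees 3028.
  • In summary, for a four-digit fare the page first renders four i tags, then uses two absolutely positioned b tags, shifted by an offset, to cover the i tags that deliberately show wrong digits. Only the visual result is the correct price. Once we know how the coat was put on, taking it off is easy.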
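
The covering trick described above can be modelled without a browser. The following is only a sketch: `visible_digits` and `copied_text` are hypothetical helper names, and the slot positions of the two overlay digits are inferred from the numbers quoted in this section (2328 underneath, 3028 on screen), not read from the page.

```python
def visible_digits(base, overlays):
    """Return the digits a user sees after overlays cover the base digits.

    base:     string of decoy digits rendered first, one per slot
    overlays: list of (slot_index, digit) pairs drawn on top, in DOM order
    """
    slots = list(base)
    for slot_index, digit in overlays:
        # An element painted later at the same position hides the earlier one
        slots[slot_index] = digit
    return "".join(slots)


def copied_text(base, overlays):
    """Copying picks up every text node in DOM order, hidden or not."""
    return base + "".join(digit for _, digit in overlays)


base = "2328"
overlays = [(1, "0"), (0, "3")]  # assumed slots, inferred from 2328 -> 3028
print(copied_text(base, overlays))     # 232803
print(visible_digits(base, overlays))  # 3028
```

The real page gives no slot indices, only pixel positions, which is why the implementation in section 4 works with element coordinates instead.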

4. Implementation


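# Important: do not resize the browser window by hand while the program is running.
# Resizing changes the element coordinates the program reads and corrupts the result!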

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time

# Extract the real price from the obfuscated price markup
def handlePrice(priceSelector):
    # x coordinate where the whole price area starts
    xStart = priceSelector.location.get("x")
    # Rendered width of the price area
    finalWidth = priceSelector.size.get("width")
    # One slot per pixel of the displayed width, initialised to -1.
    # The "+ 1" keeps index finalWidth itself valid.
    originalArray = [-1 for i in range(0, finalWidth + 1)]
    print(f"len = {len(originalArray)}")
    firstLevelLst = priceSelector.find_elements_by_xpath(".//b")
    for firstLevel in firstLevelLst:
        secondLevelLst = firstLevel.find_elements_by_xpath("./i")
        # If this b tag has i children, process the child digits (the base layer)
        if secondLevelLst:
            for secondLevel in secondLevelLst:
                # Offset relative to the price area, i.e. the start index in the array
                xStartSecond = secondLevel.location.get("x") - xStart
                # Width occupied by this digit
                xWidthSecond = secondLevel.size.get("width")
                # Store the digit text at its start index
                originalArray[xStartSecond] = secondLevel.text
                # Reset the rest of the span this digit occupies to -1 (clamped to the array length)
                for i in range(xStartSecond + 1, min(xStartSecond + xWidthSecond, len(originalArray))):
                    originalArray[i] = -1
                print(f"secondX = {xStartSecond}, secondLevel = {secondLevel.text}, xWidthSecond = {xWidthSecond}")
        else:
            # Offset relative to the price area, i.e. the start index in the array
            xStartFirst = firstLevel.location.get("x") - xStart
            # Width occupied by this digit
            xWidthFirst = firstLevel.size.get("width")
            # Overwrite the base digit at this position with the overlay digit
            originalArray[xStartFirst] = firstLevel.text
            # Reset the rest of the span this digit occupies to -1 (clamped to the array length)
            for i in range(xStartFirst + 1, min(xStartFirst + xWidthFirst, len(originalArray))):
                originalArray[i] = -1
            print(f"firstX = {xStartFirst}, firstLevel = {firstLevel.text}, xWidthFirst = {xWidthFirst}")
    # Collect the marked slots from originalArray, left to right
    finalPrice = ""
    for elem in originalArray:
        if elem != -1:
            print(f"elem = {elem}", end=', ')
            finalPrice += str(elem.strip())
    return int(finalPrice) if finalPrice != "" else 0


if __name__ == "__main__":
    chrome_options = webdriver.ChromeOptions()
    # Load the XPath Helper extension to make debugging easier
    extension_path = 'D:/extension/XPath-Helper_v2.0.2.crx'
    chrome_options.add_extension(extension_path)

    # Start the browser. No driver path is passed because chromedriver.exe is already set up through the environment variable
    browser = webdriver.Chrome(chrome_options=chrome_options)
    browser.maximize_window()

    qunaerUrl = "https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E6%8B%89%E8%90%A8&searchArrivalAirport=%E6%B7%B1%E5%9C%B3&searchDepartureTime=2018-05-13&searchArrivalTime=2018-05-18&nextNDays=0&startSearch=true&fromCode=LXA&toCode=SZX&from=near_flight&lowestPrice=null"
    browser.get(qunaerUrl)
    # To keep the logic simple, just wait 10 seconds here.
    # A more precise approach is to wait until some landmark element on the page has loaded.
    time.sleep(10)

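    # Take special care here: do not extract the price from page_source.
    # page_source is only an HTML string. Even where it reflects the current DOM,
    # it carries no layout information: no element coordinates and no rendered widths.
    # handlePrice needs exactly those values, and they are only available from live
    # elements through location and size.
    # We usually reach for selenium precisely because the raw source fetched with
    # requests or scrapy does not contain usable data.
    # Many tutorials say selenium's built-in locators are very slow and recommend running
    # re or xpath over page_source instead. Take that advice with a grain of salt.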
    # pageSource = browser.page_source
    # print(f"{pageSource}")

    allAirLst = browser.find_elements_by_xpath("//div[@class='b-airfly']")
    for airInfo in allAirLst:
        name = airInfo.find_element_by_xpath(".//div[@class='air']/span").text
        startTime = airInfo.find_element_by_xpath(".//div[@class='sep-lf']/h2").text
        endTime = airInfo.find_element_by_xpath(".//div[@class='sep-rt']/h2").text
        priceSelector = airInfo.find_element_by_xpath(".//span[@class='prc_wp']")
        finalPrice = handlePrice(priceSelector)
        print(f"\n####{name}           {startTime}           {endTime}          price:{finalPrice}")
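
The comment in the script notes that waiting for a landmark element is more precise than a fixed `time.sleep(10)`. A minimal sketch of that idea follows. `wait_until` is a hypothetical helper written in plain Python so it can run without a browser; in real use, the `WebDriverWait` and `expected_conditions` imports already present in the script are the more usual choice.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Hypothetical use in the script above, in place of time.sleep(10):
# allAirLst = wait_until(
#     lambda: browser.find_elements_by_xpath("//div[@class='b-airfly']"),
#     timeout=30,
# )
```

Because `find_elements_by_xpath` returns an empty list when nothing matches, the lambda stays falsy until at least one flight row exists, and the list of rows is returned as soon as it does.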

5. Results

  • Printed output:
E:\Miniconda\python.exe E:/PyCharmCode/myDocument/qunaerwang.py
len = 73
secondX = 0, secondLevel = 0, xWidthSecond = 18
secondX = 18, secondLevel = 6, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 4, xWidthSecond = 18
firstX = 18, firstLevel = 0, xWidthFirst = 18
firstX = 54, firstLevel = 8, xWidthFirst = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
firstX = 36, firstLevel = 2, xWidthFirst = 18
elem = 3, elem = 0, elem = 2, elem = 8, 
####西藏航空           09:20           16:30          price:3028
len = 73
secondX = 0, secondLevel = 5, xWidthSecond = 18
secondX = 18, secondLevel = 1, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 6, xWidthSecond = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
firstX = 54, firstLevel = 0, xWidthFirst = 18
firstX = 18, firstLevel = 1, xWidthFirst = 18
elem = 3, elem = 1, elem = 0, elem = 0, 
####中国国航           09:20           16:30          price:3100
len = 73
secondX = 0, secondLevel = 6, xWidthSecond = 18
secondX = 18, secondLevel = 9, xWidthSecond = 18
secondX = 36, secondLevel = 6, xWidthSecond = 18
secondX = 54, secondLevel = 0, xWidthSecond = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
firstX = 18, firstLevel = 1, xWidthFirst = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
elem = 3, elem = 1, elem = 0, elem = 0, 
####山东航空           09:20           16:30          price:3100
len = 73
secondX = 0, secondLevel = 4, xWidthSecond = 18
secondX = 18, secondLevel = 1, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 4, xWidthSecond = 18
firstX = 54, firstLevel = 0, xWidthFirst = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
elem = 3, elem = 1, elem = 0, elem = 0, 
####深圳航空           09:20           16:30          price:3100
len = 73
secondX = 0, secondLevel = 1, xWidthSecond = 18
secondX = 18, secondLevel = 9, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 1, xWidthSecond = 18
firstX = 54, firstLevel = 9, xWidthFirst = 18
firstX = 18, firstLevel = 4, xWidthFirst = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
elem = 1, elem = 4, elem = 0, elem = 9, 
####厦门航空           12:10           20:20          price:1409
len = 55
secondX = 0, secondLevel = 9, xWidthSecond = 18
secondX = 18, secondLevel = 7, xWidthSecond = 18
secondX = 36, secondLevel = 7, xWidthSecond = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
elem = 9, elem = 7, elem = 0, 
####东方航空           13:30           00:10          price:970
len = 55
secondX = 0, secondLevel = 1, xWidthSecond = 18
secondX = 18, secondLevel = 7, xWidthSecond = 18
secondX = 36, secondLevel = 5, xWidthSecond = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
firstX = 0, firstLevel = 9, xWidthFirst = 18
firstX = 18, firstLevel = 2, xWidthFirst = 18
elem = 9, elem = 2, elem = 0, 
####东方航空           17:30           12:10          price:920
len = 73
secondX = 0, secondLevel = 1, xWidthSecond = 18
secondX = 18, secondLevel = 9, xWidthSecond = 18
secondX = 36, secondLevel = 5, xWidthSecond = 18
secondX = 54, secondLevel = 5, xWidthSecond = 18
firstX = 18, firstLevel = 1, xWidthFirst = 18
firstX = 36, firstLevel = 3, xWidthFirst = 18
firstX = 54, firstLevel = 2, xWidthFirst = 18
elem = 1, elem = 1, elem = 3, elem = 2, 
####西部航空           18:10           01:50          price:1132
len = 73
secondX = 0, secondLevel = 7, xWidthSecond = 18
secondX = 18, secondLevel = 0, xWidthSecond = 18
secondX = 36, secondLevel = 7, xWidthSecond = 18
secondX = 54, secondLevel = 5, xWidthSecond = 18
firstX = 0, firstLevel = 1, xWidthFirst = 18
firstX = 36, firstLevel = 3, xWidthFirst = 18
firstX = 54, firstLevel = 7, xWidthFirst = 18
elem = 1, elem = 0, elem = 3, elem = 7, 
####西藏航空           22:00           09:35          price:1037
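
The log above can be replayed offline to check the array logic. The sketch below feeds the logged offsets, widths and digits of two records into `rebuild_price`, a hypothetical helper that mirrors what `handlePrice` does once the measurements are known. The width passed in is the logged `len` minus one.

```python
def rebuild_price(width, records):
    """Replay (x_offset, char_width, text) records in DOM order.

    width:   rendered width of the price area in pixels
    records: measurements as printed in the log, base digits first
    """
    pixels = [None] * (width + 1)
    for x_offset, char_width, text in records:
        # Clear everything this character covers, then mark its left edge
        for i in range(x_offset, min(x_offset + char_width, len(pixels))):
            pixels[i] = None
        pixels[x_offset] = text
    digits = "".join(p for p in pixels if p is not None)
    return int(digits) if digits else 0


# First record in the log: len = 73, so width = 72
first_record = [
    (0, 18, "0"), (18, 18, "6"), (36, 18, "0"), (54, 18, "4"),  # base i tags
    (18, 18, "0"), (54, 18, "8"), (0, 18, "3"), (36, 18, "2"),  # overlay b tags
]
print(rebuild_price(72, first_record))  # 3028

# Seventh record in the log: len = 55, so width = 54
seventh_record = [
    (0, 18, "1"), (18, 18, "7"), (36, 18, "5"),  # base i tags
    (36, 18, "0"), (0, 18, "9"), (18, 18, "2"),  # overlay b tags
]
print(rebuild_price(54, seventh_record))  # 920
```

Both values match the prices printed in the log, which suggests the later elements really do win at each position.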

Reposted from blog.csdn.net/zwq912318834/article/details/80243056