25. Scraping product data from Qunar (去哪儿网), part 1


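1. Analyze the page first
Page address: http://touch.qunar.com/
The target is the independent travel (自由行) channel under the vacation (度假) section.
Inspecting the network traffic shows the XHR request that loads the data for a given city.

Request URL: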

https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

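The URL is assembled from query parameters. Every sequence beginning with % is a percent-encoded Chinese string, that is, the escaped form of the data.

Decoded URL: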

https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=广州&query=厦门自由行&dappDealTrace=false&mobFunction=扩展自由行&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=厦门自由行&limit=0,24&includeAD=true&qsact=search
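The relationship between the encoded and decoded URLs can be checked with the standard library. This is a minimal sketch using only `urllib.parse`; the two encoded values are copied from the request URL above.

```python
from urllib.parse import quote, unquote

# Percent-encoded values copied from the captured request URL.
encoded_dep = "%E5%B9%BF%E5%B7%9E"
encoded_query = "%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C"

# unquote() decodes %XX sequences as UTF-8 by default.
dep = unquote(encoded_dep)
query = unquote(encoded_query)
print(dep, query)

# quote() goes the other way, turning the Chinese text back into %XX form.
print(quote(dep), quote(query))
```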

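Breaking the URL down:

dep parameter: the departure city (I am in Guangzhou, so it was set to 广州 by geolocation)

query and originalquery parameters: the destination

(So changing just these request parameters is enough to traverse all the product listings; each combination of departure city and destination returns different data.)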
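One way to vary only the departure and destination is to build the URL from a parameter dictionary. The sketch below is an illustration, not the site's documented API: `build_list_url` is a hypothetical helper, the fixed parameter values are copied from the captured request, and treating `limit` as "offset,page size" is an assumption. Note that `urlencode` also escapes the comma in `limit` as `%2C`, which decodes to the same value.

```python
from urllib.parse import urlencode

BASE_URL = "https://touch.dujia.qunar.com/list"


def build_list_url(dep, query, offset=0, page_size=24):
    """Build the list URL for one departure/destination pair (hypothetical helper)."""
    params = {
        "modules": "list,bookingInfo,activityDetail",
        "dep": dep,                      # departure city
        "query": query,                  # destination search text
        "dappDealTrace": "false",
        "mobFunction": "扩展自由行",
        "cfrom": "zyx",
        "it": "dujia_hy_destination",
        "date": "",
        "needNoResult": "true",
        "originalquery": query,          # same value as query in the captured URL
        "limit": "{},{}".format(offset, page_size),  # assumed: offset,page size
        "includeAD": "true",
        "qsact": "search",
    }
    # urlencode() percent-encodes non-ASCII text as UTF-8.
    return BASE_URL + "?" + urlencode(params)


print(build_list_url("广州", "厦门自由行"))
```

Passing the same dictionary to `requests.get(BASE_URL, params=params)` would let requests do the encoding instead.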

Opening the URL directly in a browser shows the raw response data.

2. Get the values for the departure (dep) parameter
Request address: https://touch.dujia.qunar.com/p/public/dep
# Get the city parameters
import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dict = html.json()
for i in dict['data']:
    for j in dict['data'][i]:
        print(j)

The script prints each departure city on its own line.
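The nested loops can also be collapsed into a single list, which is handy when the cities are needed later. This sketch assumes, as the loops above imply, that `data` maps a group key to a list of city names. `flatten_departures` is a hypothetical helper and the sample is made up, not real API output.

```python
from itertools import chain


def flatten_departures(payload):
    """Return every departure city from a depCities-style payload as one list."""
    groups = payload.get("data", {})
    # Each value is assumed to be a list of city names; chain them in order.
    return list(chain.from_iterable(groups.values()))


# Made-up sample shaped like the structure the loops above iterate over.
sample = {"data": {"G": ["广州", "桂林"], "X": ["厦门"]}}
print(flatten_departures(sample))
```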

3. Get the destination parameters for each departure city

import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dict = html.json()
# Get the departure city parameters
for i in dict['data']:
    for j in dict['data'][i]:
        print(j)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)
        html2 = requests.get(link_url)
        dict2 = html2.json()
        c_list = []
        # Get the destination parameters
        for k in dict2['data']:
            for l in k['subModules']:
                for m in l['items']:
                    city = m['query']
                    # Deduplicate: keep each destination only once
                    if city not in c_list:
                        c_list.append(city)
        print(c_list)

As the output shows, a single departure city maps to many destinations.
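Because the same destination can appear in several modules, the deduplication step matters. The sketch below isolates it in a hypothetical helper, `extract_destinations`, and uses a dictionary to keep first-seen order. The key names (`data`, `subModules`, `items`, `query`) come from the code above; the sample payload is made up for illustration.

```python
def extract_destinations(payload):
    """Collect unique 'query' values from an arriveRecommend-style payload."""
    seen = {}  # dicts keep insertion order, so first-seen order is preserved
    for module in payload.get("data", []):
        for sub in module.get("subModules", []):
            for item in sub.get("items", []):
                seen.setdefault(item["query"], None)
    return list(seen)


# Made-up sample: 厦门 appears twice and should be kept only once.
sample = {
    "data": [
        {
            "subModules": [
                {"items": [{"query": "厦门"}, {"query": "三亚"}]},
                {"items": [{"query": "厦门"}]},
            ]
        }
    ]
}
print(extract_destinations(sample))
```

A dictionary lookup is constant time, whereas `city not in c_list` scans the whole list on every check; for a few dozen destinations either approach is fine.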

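4. Get the product list

The dep and query parameters are now in hand. The next step is to request the JSON data and analyze how its URL changes, together with the important routeCount parameter: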
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

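The limit parameter changes as well: its first value advances in multiples of 24 with each request. Reading routeCount tells us how many of these URLs to request.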
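The paging arithmetic can be sketched on its own before wiring it into the requests. This assumes `limit` has the form "offset,page size" and that routeCount is the total number of results; both are inferred from the observed URLs, not from any documentation. `page_limits` is a hypothetical helper.

```python
def page_limits(route_count, page_size=24):
    """Return the limit values needed to cover route_count results."""
    return [
        "{},{}".format(offset, page_size)
        for offset in range(0, route_count, page_size)
    ]


# 50 results need three requests: offsets 0, 24 and 48.
print(page_limits(50))
```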
import requests
import urllib.request  # import the submodule explicitly; plain "import urllib" does not guarantee it is loaded
import random, time
url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dict = html.json()
# Get the departure city parameters
for i in dict['data']:
    for j in dict['data'][i]:
        print(j)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)

        # Sleep for a random interval between requests
        time.sleep(random.randint(1,2))

        html2 = requests.get(link_url)
        dict2 = html2.json()
        c_list = []
        # Get the destination parameters
        for k in dict2['data']:
            for l in k['subModules']:
                for m in l['items']:
                    city = m['query']
                    if city not  in c_list:
                        c_list.append(city)
        # print(c_list)

        # Sleep for a random interval between requests
        time.sleep(random.randint(1,2))

        # Request the data
        for c in c_list:
            # Build the request URL from the loop variable c
            # (the original used city, which always held the last destination)
            url3 = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit=0,24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c))
            A = url3.replace('https://touch.dujia.qunar.com','')
            # print(A)
            headers = {
                'cookie': 'QN48=tc_e1b5f5bb4d76a018_16730073949_ad75; csrfToken=d27163582839d6b8cbcb53110ed67077; QN300=organic; QN1=ezu0pVvzuB9qeVd2w90fAg==; _RF1=119.129.117.7; _RSG=AZ4soQG2oI5YMrcq1P6et8; _RDG=283bf2bcd3461d22ef1d94f9276d7c9b85; _RGUID=54b20906-b2d8-48ca-8de8-1990749b55a2; QN205=organic; QN234=home_free_t; _pk_ref.1.8600=%5B%22%22%2C%22%22%2C1542699072%2C%22http%3A%2F%2Ftouch.qunar.com%2F%22%5D; _pk_ses.1.8600=*; QN57=15427010307400.44337198739421924; QN58=1542701030742%7C1542701078367%7C4; QN233=dujia_hy_destination; _pk_id.1.8600=5f2ca9d25160d431.1542699072.1.1542705039.1542699072.; QN243=165',
                'referer': 'https://touch.dujia.qunar.com/p/list?dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&et=&it=dujia_hy_destination',
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
            }

            html3 = requests.get(url=url3,headers=headers)
            print(url3)
            print(html3.json())
            # # Read the routeCount value
            # num = int(html3.json()['data']['limit']['routeCount'])
            #
            # # Each page returns only 24 items
            # for n in range(0,num,24):
            #     url4 ='https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit={},24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c),n)
            #
            #     # Sleep for a random interval between requests
            #     time.sleep(random.randint(1, 2))
            #
            #     html4 = requests.get(url=url4,headers=headers)
            #     result = html4.json()
            #     print(result)


Reposted from www.cnblogs.com/lvjing/p/9990608.html