[Crawler Study Notes 4] A large-scale Python crawler example: crawling product data from an e-commerce website (1)

Goal: retrieve the complete "free travel" product list


Links and websites used:

E-commerce website: https://www.qunar.com/

Online encoding converter: https://tool.oschina.net/encode?type=4

The links below were obtained by observing and analyzing the site's requests (part one):

Destinations available from a departure city: https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep=%E5%8C%97%E4%BA%AC&exclude=&extensionImg=255,175

Departure city list:
https://touch.dujia.qunar.com/depCities.qunar

Products from a departure city to a destination:

https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%8C%97%E4%BA%AC&query=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=true&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&width=480&height=320&quality=90&limit=0,20&includeAD=true&qsact=search&filterTagPlatform=mobile_touch

One: Observe page characteristics and analyze data

Open https://www.qunar.com/ in the browser.

Press F12 to open the developer tools.

Then press Ctrl + Shift + M to toggle the device toolbar (or click the device icon in the developer tools toolbar).


The page is now in mobile emulation mode. The mobile site returns its data as JSON, which is easier to process.


(1) Click "Free Travel", then click the search box. The JSON response appears as a tree structure in the developer tools.


(2) Switch to the Headers tab. Two pieces of information matter:

Request URL: the link used to reach the server and obtain the data.

Request Method: determines how the request and its parameters are sent. The common methods are GET and POST.

GET is used to retrieve data. Its parameters travel in the URL, so for a public endpoint like this one, accessing the URL is enough to get the data back.

POST carries its content in the request body and is typically used to submit or modify data, although it can also serve queries. Whether authorization is required depends on the server, not on the method.
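
The difference between the two methods can be seen without sending anything over the network. The sketch below uses the standard library's `urllib.request.Request` only to build request objects. The POST endpoint and its payload are hypothetical and exist purely for illustration.

```python
from urllib.request import Request

# A request without a body defaults to GET: everything the server
# needs is contained in the URL itself.
get_req = Request("https://touch.dujia.qunar.com/depCities.qunar")

# Attaching a body (data=...) makes urllib default to POST.
# This endpoint and payload are hypothetical, for illustration only.
post_req = Request("https://example.com/api/orders", data=b"city=Sanya")

print(get_req.get_method())   # method of the body-less request
print(post_req.get_method())  # method of the request with a body
print(post_req.data)          # the request content carried by POST
```

All the links analyzed in this article use GET, which is why they can be opened directly in a browser or fetched with a single `requests.get` call.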

From this we get:

Request URL, with the trailing callback parameter removed: https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep=%E5%8C%97%E4%BA%AC&exclude=&extensionImg=255,175

Request Method: GET
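
Removing the callback parameter can also be done in code instead of by hand. This is a minimal sketch using only the standard library. The helper name `drop_param` and the callback value `jsonp_1` are made up for the example, since the original callback value is not shown here.

```python
from urllib.parse import urlsplit, urlunsplit

def drop_param(url, name):
    """Return url without the query parameter called name (hypothetical helper)."""
    parts = urlsplit(url)
    # Work on the raw "key=value" pairs so the existing percent-encoding is untouched.
    kept = [pair for pair in parts.query.split("&") if pair.split("=", 1)[0] != name]
    return urlunsplit(parts._replace(query="&".join(kept)))

# The callback value below is an assumed placeholder.
raw = ("https://touch.dujia.qunar.com/golfz/sight/arriveRecommend"
       "?dep=%E5%8C%97%E4%BA%AC&exclude=&extensionImg=255,175&callback=jsonp_1")

clean = drop_param(raw, "callback")
print(clean)
```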


(3) Click any recommended city (Sanya here).

Look at the Request URL: the sequences starting with % are percent-encoded Chinese characters. A URL may only contain ASCII characters, so Chinese text is converted to UTF-8 bytes, each written as % followed by two hexadecimal digits, before it is sent to the server.

Paste the link into the online encoding converter to decode it:

https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%8C%97%E4%BA%AC&query=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=true&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&width=480&height=320&quality=90&limit=0,20&includeAD=true&qsact=search&filterTagPlatform=mobile_touch

Comparing the decoded link with the page in the browser shows:

dep = Beijing (departure city)

query = Sanya free travel (destination plus the "free travel" keyword)

originalquery = Sanya free travel (same value as query)
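
The same decoding can be done locally with Python's standard library instead of the online tool. The sketch below uses a shortened copy of the product link that keeps only a few of its parameters.

```python
from urllib.parse import urlsplit, parse_qs, quote, unquote

# Shortened copy of the product link (only some parameters kept).
url = ("https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail"
       "&dep=%E5%8C%97%E4%BA%AC"
       "&query=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C"
       "&originalquery=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C"
       "&limit=0,20")

# parse_qs splits the query string and decodes each percent-encoded value.
params = parse_qs(urlsplit(url).query)
for key in ("dep", "query", "originalquery"):
    print(key, "=", params[key][0])

# quote and unquote convert a single string in either direction.
encoded = quote("北京")
decoded = unquote("%E5%8C%97%E4%BA%AC")
print(encoded, decoded)
```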


(4) To get all "free travel" products, first get all departure cities, because the available products differ by departure city.

Open the departure city selector on the page and find the request that returns the departure list.

The link to the departure city list is:

https://touch.dujia.qunar.com/depCities.qunar


Two: Workflow analysis

(1) Get the departure city list

(2) Get the destination list for each departure city

(3) Get the product list for each departure city and destination

(4) Store the data
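
The four steps can be sketched as a small pipeline before writing the real crawler. This is only an outline under stated assumptions: `fetch_json`, `fetch_products` and `store` are hypothetical callables passed in by the caller, and the canned responses in the dry run merely imitate the response shapes that the code later in this article relies on.

```python
DEP_URL = "https://touch.dujia.qunar.com/depCities.qunar"
ARRIVE_URL = ("https://touch.dujia.qunar.com/golfz/sight/arriveRecommend"
              "?dep={}&exclude=&extensionImg=255,175")

def get_departures(fetch_json):
    """Step 1: flatten the grouped departure list into one list of city names."""
    data = fetch_json(DEP_URL)["data"]
    return [city for group in data.values() for city in group]

def get_destinations(fetch_json, dep):
    """Step 2: collect the distinct destination queries for one departure city."""
    found = []
    for module in fetch_json(ARRIVE_URL.format(dep))["data"]:
        for sub in module["subModules"]:
            for item in sub["items"]:
                if item["query"] not in found:
                    found.append(item["query"])
    return found

def crawl(fetch_json, fetch_products, store):
    """Steps 3 and 4: fetch products per (departure, destination) and store them."""
    for dep in get_departures(fetch_json):
        for query in get_destinations(fetch_json, dep):
            store({"dep": dep, "arrive": query, "result": fetch_products(dep, query)})

# Dry run with canned responses (made-up data, no network access).
def fake_fetch_json(url):
    if url == DEP_URL:
        return {"data": {"b": ["北京"], "s": ["上海"]}}
    return {"data": [{"subModules": [{"items": [{"query": "三亚自由行"},
                                                {"query": "三亚自由行"}]}]}]}

records = []
crawl(fake_fetch_json, lambda dep, query: [], records.append)
print(records)
```

Passing the fetch and store functions in as arguments keeps the flow testable offline. In the real crawler they would wrap `requests.get(...).json()` and a MongoDB insert.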


Three: Build a category tree

(1) Get the departure city list

In the response, each key under "data" holds a list of departure city names.



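import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'  # link to the departure city list

resp = requests.get(url)  # named resp so the built-in str is not shadowed
dep_dic = resp.json()

# each key under "data" holds a list of departure cities
for dep_item in dep_dic["data"]:
    for dep in dep_dic["data"][dep_item]:
        print(dep)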



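(2) Get the destination list

Extend the code to fetch the destinations for each departure city.

In this response, "data" is a list of modules. Each module has "subModules", each sub-module has "items", and each item carries the destination in its "query" field.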

import requests
import time

url = 'https://touch.dujia.qunar.com/depCities.qunar'  # link to the departure city list
resp = requests.get(url)
dep_dic = resp.json()
for dep_item in dep_dic["data"]:
    for dep in dep_dic["data"][dep_item]:
        print(dep)

        # newly added code
        url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(dep)
        time.sleep(1)  # pause between requests
        resp = requests.get(url)
        arrive_dic = resp.json()
        arrive_city = []  # all destinations reachable from the current departure city
        for arr_item in arrive_dic["data"]:
            for arr_item_1 in arr_item["subModules"]:
                for query in arr_item_1["items"]:
                    if query["query"] not in arrive_city:  # skip duplicate destinations
                        arrive_city.append(query["query"])
        print(arrive_city)

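
The `not in arrive_city` check scans the whole list for every item. A possible alternative, sketched here on made-up sample data with the same shape as the response, keeps a set for the membership test and a list for the order. The function name `extract_queries` is hypothetical.

```python
def extract_queries(arrive_dic):
    """Return the distinct "query" values in first-seen order."""
    seen = set()    # fast membership test
    ordered = []    # keeps the order in which destinations first appear
    for module in arrive_dic["data"]:
        for sub in module["subModules"]:
            for item in sub["items"]:
                query = item["query"]
                if query not in seen:
                    seen.add(query)
                    ordered.append(query)
    return ordered

# Made-up sample that imitates the structure used by the code above.
sample = {
    "data": [
        {"subModules": [{"items": [{"query": "三亚自由行"}, {"query": "丽江自由行"}]}]},
        {"subModules": [{"items": [{"query": "三亚自由行"}]}]},
    ]
}

destinations = extract_queries(sample)
print(destinations)
```

For the small lists returned per departure city the difference is negligible, so this is a matter of style more than performance.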


Four: Get the product list

import requests
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)  # connect to the local MongoDB server
book_qunar = client['qunar']  # database named "qunar"
sheet_qunar = book_qunar['sheet_qunar']  # collection named "sheet_qunar" in that database

# product list link: {0} = departure city, {1} = destination query, {2} = offset of the first product
list_url = ('https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail'
            '&dep={0}&query={1}&dappDealTrace=true'
            '&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C'
            '&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true'
            '&originalquery={1}&width=480&height=320&quality=90&limit={2},20'
            '&includeAD=true&qsact=search&filterTagPlatform=mobile_touch')

url = 'https://touch.dujia.qunar.com/depCities.qunar'  # link to the departure city list
resp = requests.get(url)
dep_dic = resp.json()
for dep_item in dep_dic["data"]:
    for dep in dep_dic["data"][dep_item]:
        print(dep)
        url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(dep)
        time.sleep(1)
        resp = requests.get(url)
        arrive_dic = resp.json()
        arrive_city = []  # all destinations reachable from the current departure city
        for arr_item in arrive_dic["data"]:
            for arr_item_1 in arr_item["subModules"]:
                for query in arr_item_1["items"]:
                    if query["query"] not in arrive_city:  # skip duplicate destinations
                        arrive_city.append(query["query"])

        for item in arrive_city:
            # request the first page only to read the total number of products
            time.sleep(1)
            resp = requests.get(list_url.format(dep, item, 0))
            routeCount = int(resp.json()["data"]["limit"]["routeCount"])

            for limit in range(0, routeCount, 20):  # fetch the products 20 at a time
                time.sleep(1)
                resp = requests.get(list_url.format(dep, item, limit))

                # document stored for each page of products
                result = {
                    'date': time.strftime('%Y-%m-%d', time.localtime(time.time())),
                    'dep': dep,
                    'arrive': item,
                    'limit': limit,
                    'result': resp.json()
                }

                sheet_qunar.insert_one(result)

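
The paging logic and the long product link can be checked offline. The sketch below is one possible way to build the link from a parameter dictionary with the standard library, not the method used above. The names `page_offsets` and `build_list_url` are hypothetical. Note that `urlencode` writes the comma in `limit` as `%2C`, whereas the hand-written link keeps it literal. The server is assumed to decode both forms the same way, but that has not been verified here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

PAGE_SIZE = 20  # products per request, as in limit=<offset>,20

def page_offsets(route_count, page_size=PAGE_SIZE):
    """Offsets of the first product on each page: 0, 20, 40, ..."""
    return list(range(0, route_count, page_size))

def build_list_url(dep, query, offset, page_size=PAGE_SIZE):
    """Build the product list link from unencoded values (hypothetical helper)."""
    params = {
        "modules": "list,bookingInfo,activityDetail",
        "dep": dep,
        "query": query,
        "dappDealTrace": "true",
        "mobFunction": "扩展自由行",
        "cfrom": "zyx",
        "it": "dujia_hy_destination",
        "date": "",
        "needNoResult": "true",
        "originalquery": query,
        "width": 480,
        "height": 320,
        "quality": 90,
        "limit": "{},{}".format(offset, page_size),
        "includeAD": "true",
        "qsact": "search",
        "filterTagPlatform": "mobile_touch",
    }
    # urlencode percent-encodes every value, including the Chinese text.
    return "https://touch.dujia.qunar.com/list?" + urlencode(params)

offsets = page_offsets(45)   # 45 products need three requests
print(offsets)

url = build_list_url("北京", "三亚自由行", 20)
print(url)

# Decode the link again to confirm the values survived the round trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["dep"], decoded["limit"])
```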


Origin blog.csdn.net/weixin_45260385/article/details/108928548