Goal: Get the complete list of "free travel" products
Links and websites used:
E-commerce website: https://www.qunar.com/
Online encoding conversion: https://tool.oschina.net/encode?type=4
The following links were obtained by observing and analyzing the data (Part One):
Destinations available from a departure city: https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep=%E5%8C%97%E4%BA%AC&exclude=&extensionImg=255,175
Link to the departure city list:
https://touch.dujia.qunar.com/depCities.qunar
Products from a departure city to a destination:
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%8C%97%E4%BA%AC&query=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=true&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&width=480&height=320&quality=90&limit=0,20&includeAD=true&qsact=search&filterTagPlatform=mobile_touch
One: Observe page characteristics and analyze data
Open https://www.qunar.com/ in the browser.
Press F12 to open the developer tools.
Then press Ctrl + Shift + M (or click the icon circled in red in the figure).
The browser now emulates a mobile device. The mobile site returns its data in JSON format, which is easier to process.
(1) Click "Free Travel" and then click the search box. The tree structure of the returned data can then be observed, as shown in the figure below.
(2) Switch to the Headers tab, which contains two important pieces of information (circled in red):
Request URL: the link used to access the server and obtain data.
Request Method: determines how the request is made and how its parameters are sent. The common request methods are GET and POST.
Of these:
The GET method is used to query data. Its parameters travel in the URL itself, so for an endpoint like this one the data is returned as soon as the URL is accessed.
The POST method sends its parameters in a request body rather than in the URL, and the server responds according to that body. It is typically used when the client submits or modifies data, and such endpoints often also require authorization.
From this we can conclude:
Request URL (with the trailing callback parameter removed): https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep=%E5%8C%97%E4%BA%AC&exclude=&extensionImg=255,175
Request Method: GET
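To make the GET request concrete, here is a minimal sketch that rebuilds the Request URL above from its parts using only the standard library. The helper name build_recommend_url is hypothetical; the parameter names are taken from the URL above. The two Request objects are never sent. They only illustrate that a request without a body defaults to GET, while one with a body defaults to POST.

```python
from urllib.parse import urlencode, quote
from urllib.request import Request

BASE = "https://touch.dujia.qunar.com/golfz/sight/arriveRecommend"


def build_recommend_url(dep):
    """Return the arriveRecommend URL for one departure city (hypothetical helper)."""
    params = {"dep": dep, "exclude": "", "extensionImg": "255,175"}
    # quote with safe="," leaves the comma unescaped, as in the observed URL,
    # and percent-encodes the Chinese city name as UTF-8 bytes.
    return BASE + "?" + urlencode(params, quote_via=quote, safe=",")


url = build_recommend_url("北京")
print(url)

# Nothing is sent here; the objects only describe a request.
get_request = Request(url)                    # parameters in the URL, no body
post_request = Request(BASE, data=b"dep=x")   # parameters in a request body
print(get_request.get_method(), post_request.get_method())
```

Building the query string with urlencode instead of string concatenation means the city name is encoded in one place, which helps when the departure city changes on every iteration.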
(3) Click any recommended city (Sanya is used here).
Looking at the Request URL, the sequences that begin with % are percent-encoded (URL-encoded) Chinese text. A URL can only carry ASCII characters, so Chinese characters are converted to their UTF-8 bytes, each written as %XX, before being submitted to the server.
Open the online encoding conversion tool to decode the link:
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%8C%97%E4%BA%AC&query=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=true&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&width=480&height=320&quality=90&limit=0,20&includeAD=true&qsact=search&filterTagPlatform=mobile_touch
Comparing the decoded link with the page in the browser shows that:
dep = 北京 (Beijing, the departure city)
query = 三亚自由行 (Sanya free travel, the destination)
originalquery = 三亚自由行 (Sanya free travel, the destination)
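The same decoding can be done locally instead of with the online tool. The sketch below uses Python's standard urllib.parse module on a shortened copy of the link above (only some of its parameters are kept, for readability).

```python
from urllib.parse import urlsplit, parse_qs, unquote

# Shortened copy of the observed link; the remaining parameters are omitted.
url = (
    "https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail"
    "&dep=%E5%8C%97%E4%BA%AC&query=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C"
    "&originalquery=%E4%B8%89%E4%BA%9A%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,20"
)

# parse_qs splits the query string and percent-decodes every value as UTF-8.
params = parse_qs(urlsplit(url).query)
for name in ("dep", "query", "originalquery", "modules", "limit"):
    print(name, "=", params[name][0])

# A single encoded fragment can be decoded directly with unquote.
decoded_dep = unquote("%E5%8C%97%E4%BA%AC")
print(decoded_dep)
```

Note that %2C decodes to a comma, so modules is simply the comma-separated list list,bookingInfo,activityDetail.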
(4) To get all "free travel" products, you must first get all departure cities, because different departure cities offer different products.
Choose a departure city:
Find the departure city list.
This gives the link to the departure city list:
https://touch.dujia.qunar.com/depCities.qunar
Two: Workflow analysis
(1) Get the departure city list
(2) Get the destination list
(3) Get the product list
(4) Store the data
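The four steps nest naturally: every departure city has its own destinations, and every pair of departure city and destination has its own products. As a rough sketch of that shape only, the function below takes each step as a callable. The name crawl, its parameters, and the stand-in data are all hypothetical and are not part of the site's API.

```python
def crawl(get_departures, get_destinations, get_products, store):
    """Run the four workflow steps; each step is supplied as a callable."""
    stored = 0
    for dep in get_departures():                        # step 1
        for arrive in get_destinations(dep):            # step 2
            for product in get_products(dep, arrive):   # step 3
                store({"dep": dep, "arrive": arrive, "product": product})  # step 4
                stored += 1
    return stored


# Stand-in steps with made-up data, so the sketch runs without network or database.
records = []
count = crawl(
    get_departures=lambda: ["A", "B"],
    get_destinations=lambda dep: ["X"] if dep == "A" else ["X", "Y"],
    get_products=lambda dep, arrive: [dep + "-" + arrive + "-1"],
    store=records.append,
)
print(count, records[0])
```

Passing the steps in as callables keeps the loop structure testable on its own; in a real run each callable would wrap one of the HTTP requests or the database insert.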
Three: Build a category tree
(1) Get the departure city list
Code:
import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'  # link to the departure city list
response = requests.get(url)
dep_dic = response.json()
for dep_item in dep_dic["data"]:
    for dep in dep_dic["data"][dep_item]:
        print(dep)
Output: (screenshot omitted)
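The nested loop in this step implies that the "data" field is a mapping whose values are lists of city names. Under that assumption, the cities can be collected into a single list for later steps. The helper name flatten_departures and the sample response are hypothetical; the real response may contain other fields.

```python
def flatten_departures(dep_dic):
    """Collect every departure city from a depCities-style response into one list."""
    cities = []
    for group in dep_dic["data"].values():
        cities.extend(group)
    return cities


# Hypothetical sample, shaped like the structure the loop in this step assumes.
sample = {"data": {"b": ["北京", "保定"], "s": ["上海", "三亚"]}}
departures = flatten_departures(sample)
print(departures)
```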
(2) Get the destination list
Get the destinations for each departure city by extending the previous code.
Code:
import requests
import time

url = 'https://touch.dujia.qunar.com/depCities.qunar'  # link to the departure city list
response = requests.get(url)
dep_dic = response.json()
for dep_item in dep_dic["data"]:
    for dep in dep_dic["data"][dep_item]:
        print(dep)
        # newly added code
        url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(dep)
        time.sleep(1)  # pause between requests
        response = requests.get(url)
        arrive_dic = response.json()
        arrive_city = []  # all destinations reachable from the current departure city
        for arr_item in arrive_dic["data"]:
            for arr_item_1 in arr_item["subModules"]:
                for query in arr_item_1["items"]:
                    if query["query"] not in arrive_city:  # skip destinations already recorded
                        arrive_city.append(query["query"])
        print(arrive_city)
Output: (screenshot omitted)
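The de-duplication in this step checks membership in a list, which rescans the list for every item. A possible variation, sketched below, tracks seen names in a set while still preserving first-seen order. The function name extract_destinations and the sample response are hypothetical; only the keys "data", "subModules", "items", and "query" come from the code in this step.

```python
def extract_destinations(arrive_dic):
    """Return the unique destination names, in first-seen order."""
    seen = set()
    destinations = []
    for module in arrive_dic["data"]:
        for sub_module in module["subModules"]:
            for entry in sub_module["items"]:
                name = entry["query"]
                if name not in seen:  # set lookup instead of scanning the list
                    seen.add(name)
                    destinations.append(name)
    return destinations


# Hypothetical sample with one repeated destination.
sample = {
    "data": [
        {"subModules": [{"items": [{"query": "三亚"}, {"query": "丽江"}]}]},
        {"subModules": [{"items": [{"query": "三亚"}, {"query": "厦门"}]}]},
    ]
}
arrive_city = extract_destinations(sample)
print(arrive_city)
```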
Four: Get the product list
import requests
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)  # connect to the local MongoDB server
book_qunar = client['qunar']  # database named "qunar"
sheet_qunar = book_qunar['sheet_qunar']  # collection named "sheet_qunar" in that database

url = 'https://touch.dujia.qunar.com/depCities.qunar'  # link to the departure city list
response = requests.get(url)
dep_dic = response.json()
for dep_item in dep_dic["data"]:
    for dep in dep_dic["data"][dep_item]:
        print(dep)
        url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(dep)
        time.sleep(1)
        response = requests.get(url)
        arrive_dic = response.json()
        arrive_city = []  # all destinations reachable from the current departure city
        for arr_item in arrive_dic["data"]:
            for arr_item_1 in arr_item["subModules"]:
                for query in arr_item_1["items"]:
                    if query["query"] not in arrive_city:  # skip destinations already recorded
                        arrive_city.append(query["query"])
        for item in arrive_city:
            # first request: only used to read the total number of products
            url = 'https://touch.dujia.qunar.com/' \
                  'list?modules=list%2CbookingInfo%2' \
                  'CactivityDetail&dep={}&query={}&' \
                  'dappDealTrace=true&mobFunction=%E' \
                  '6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1' \
                  '%E8%A1%8C&cfrom=zyx&it=dujia_hy_dest' \
                  'ination&date=&needNoResult=true&origina' \
                  'lquery={}&width=480&height' \
                  '=320&quality=90&limit=0,' \
                  '20&includeAD=true&qsact=search&' \
                  'filterTagPlatform=mobile_touch'.format(dep, item, item)
            time.sleep(1)
            response = requests.get(url)
            routeCount = int(response.json()["data"]["limit"]["routeCount"])  # number of products
            for limit in range(0, routeCount, 20):  # fetch the products 20 at a time
                url = 'https://touch.dujia.qunar.com/' \
                      'list?modules=list%2CbookingInfo%2' \
                      'CactivityDetail&dep={}&query={}&' \
                      'dappDealTrace=true&mobFunction=%E' \
                      '6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1' \
                      '%E8%A1%8C&cfrom=zyx&it=dujia_hy_dest' \
                      'ination&date=&needNoResult=true&origina' \
                      'lquery={}&width=480&height' \
                      '=320&quality=90&limit={},' \
                      '20&includeAD=true&qsact=search&' \
                      'filterTagPlatform=mobile_touch'.format(dep, item, item, limit)
                time.sleep(1)
                response = requests.get(url)
                # structure of the stored record
                result = {
                    'date': time.strftime('%Y-%m-%d'),  # today's date, local time
                    'dep': dep,
                    'arrive': item,
                    'limit': limit,
                    'result': response.json()
                }
                sheet_qunar.insert_one(result)
Output: (screenshot omitted)
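The long product URL appears twice in the code for this part and differs only in the limit offset. One way to avoid the duplication, sketched below under the assumption that the server treats %2C and a literal comma alike, is to build the query string from a dictionary and generate the page offsets separately. The names build_list_url and page_offsets are hypothetical; the parameter names and values are copied from the observed link.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

LIST_URL = "https://touch.dujia.qunar.com/list"


def build_list_url(dep, query, offset, page_size=20):
    """Build the product list URL for one page (hypothetical helper)."""
    params = {
        "modules": "list,bookingInfo,activityDetail",
        "dep": dep,
        "query": query,
        "dappDealTrace": "true",
        "mobFunction": "扩展自由行",
        "cfrom": "zyx",
        "it": "dujia_hy_destination",
        "date": "",
        "needNoResult": "true",
        "originalquery": query,
        "width": 480,
        "height": 320,
        "quality": 90,
        "limit": "{},{}".format(offset, page_size),
        "includeAD": "true",
        "qsact": "search",
        "filterTagPlatform": "mobile_touch",
    }
    # urlencode percent-encodes Chinese text and commas (as %2C) automatically.
    return LIST_URL + "?" + urlencode(params)


def page_offsets(route_count, page_size=20):
    """Return the starting offset of every page needed to cover route_count items."""
    return list(range(0, route_count, page_size))


offsets = page_offsets(45)
second_page = build_list_url("北京", "三亚自由行", offsets[1])
print(offsets)
print(second_page)
```

With these helpers the inner loop reduces to one call per offset, and the first page already fetched to read routeCount could be stored directly rather than requested a second time.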