Crawling the China Resources YouChao (有巢) Apartment App with Python + Fiddler

Foreword

As the mobile end becomes more convenient, more and more companies no longer run a web site at all and instead build a mobile app or mini-program. Case in point: searching Baidu, I could not find an official website for China Resources YouChao apartments, but it turns out the brand has its own app and mini-program. So how do you get data out of an app or mini-program? All my previous crawling targeted web pages, so crawling app data was a fresh attempt. Checking online, the mainstream approach is to intercept the data the server returns with a packet-capture tool such as Charles or Fiddler, which yields the request address and the raw JSON data. This article shows how to do it with Fiddler. As the saying goes, sharpening the axe does not delay the chopping of firewood, so let's begin with installation and setup.

Fiddler Installation and Setup

  1. Fiddler download and installation
    Go to the Fiddler official website https://www.telerik.com/download/fiddler , fill in your intended use, email, and country, agree to the end-user license agreement, and click Download. After installation, open Fiddler from the Start menu; a dialog will pop up, and you can simply click Yes.
  2. Fiddler settings
    On the main page, click Tools on the menu bar and select Options. Two tabs matter. On the HTTPS tab, check Capture HTTPS CONNECTs and Decrypt HTTPS traffic, then click Actions on the right and choose Trust Root Certificate to install the CA certificate. On the Connections tab, set "Fiddler listens on port" to 8888 (meaning Fiddler listens on port 8888) and check Allow remote computers to connect, then click OK to confirm. Finally, hover over Online at the top right of the main window and your computer's IP address will appear; note it down, as it will be needed later. That completes the PC-side Fiddler setup; next come the phone-side settings.
  3. Phone-side settings
    To capture phone traffic with Fiddler, first make sure the phone and the computer are on the same network; connecting both to the same router works. The demonstration here uses an iPhone 6, and other phones are similar. In the Wi-Fi settings, set the HTTP proxy to Manual and fill in the IP address and port (8888) from the previous step. Then open the phone's browser, visit http://<your-PC-ip>:8888, and download the FiddlerRoot security certificate. On iPhone you also need to go into Settings -> General to install the profile and trust FiddlerRoot.
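Once the proxy is configured, it can help to confirm from the PC side that requests actually flow through Fiddler before touching the app. A minimal sketch, assuming Fiddler is listening on port 8888 as configured above (the 127.0.0.1 address is an assumption; substitute your PC's LAN IP when probing from another device):

```python
import requests

# Assumed Fiddler proxy address; replace 127.0.0.1 with your PC's LAN IP
# if you run this from a machine other than the one hosting Fiddler.
FIDDLER_PROXY = "http://127.0.0.1:8888"
proxies = {"http": FIDDLER_PROXY, "https": FIDDLER_PROXY}

def probe(url="http://example.com"):
    """Send one request through Fiddler; it should show up in the session list."""
    # verify=False matches the decrypted-HTTPS setup described above,
    # since Fiddler re-signs traffic with its own root certificate.
    resp = requests.get(url, proxies=proxies, verify=False, timeout=10)
    return resp.status_code
```

If probe() returns 200 and a new session appears in Fiddler's list, the proxy chain is working and the phone can be pointed at the same address.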

Everything is ready, so let the capturing begin: open the YouChao app on your phone, and Fiddler will capture a large amount of HTTPS traffic. Open the sessions one by one to see whether they contain the data you want; once you spot the pattern, you can also filter out the hosts you don't need.
Think about it: when you crawl the web, you first analyze the structure of the page, find the URL to crawl, and then parse out the data you want layer by layer. An app, unlike the web, does not hand you a URL in an address bar, and that is precisely where Fiddler comes in: it reveals the URL (API endpoint) behind the app's data and whether the request method is GET or POST. If it is POST, you also have to construct the request form data. Once all of this checks out, you can start writing code.

The complete code

# -*- coding: utf-8 -*-
"""
project_name:youtha_spider
@author: 帅帅de三叔
Created on Thu Sep 19 10:27:16 2019
"""
import requests, json  # HTTP requests and JSON handling
import pymysql  # MySQL database driver
db = pymysql.connect(host="localhost", user="root", password="123456",
                     database="youchao", charset="utf8mb4")  # connect to the database
cursor = db.cursor()  # cursor for executing SQL
cursor.execute("drop table if exists youtha")  # rebuild the table on each run
c_sql = """create table youtha(
          city varchar(8),
          project_name varchar(20),
          address varchar(50))Engine=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=UTF8MB4"""
cursor.execute(c_sql)  # create table youtha
url = 'https://ris.crland.com.cn/api/public/app/project/querAllPro'  # API url found via Fiddler

header={"Host": "ris.crland.com.cn",
"Content-Type": "application/json",
"Cookie": "dtCookie=2$4039DDB3F6634C1FE36E57EB5E779E43; SESSIONID_HAP=811c1e06-3afc-4968-b3a5-cd4bb67497c8",
"Connection": "keep-alive",
"Accept": "*/*",
"User-Agent": "crlandRent/3.3 (iPhone; iOS 12.4.1; Scale/2.00)",
"Accept-Language": "zh-Hans;q=1",
"Accept-Encoding": "br, gzip, deflate"}  # request headers copied from the capture
for cityid in ["1","9","15","3","11","22","8","2","14","6","32"]:  # list of city ids
    data_form = {"equipmentId":"C2CCEB9A-4EA2-4F5A-9275-A6D3ECBAF95F","cityId":cityid,"pageSize":"20","token":"","page":"1"}  # request form data
    res = requests.post(url, data=json.dumps(data_form), headers=header, verify=False)  # POST request
    response_dict = json.loads(res.text)  # parse the JSON response into a dict
    project_list = response_dict["data"]["projectAll"]  # list of apartment projects
    for project in project_list:  # loop over the projects
        city = project["cityName"]  # city
        project_name = project["projectName"]  # project name
        address = project["address"]  # address
        print(city, project_name, address)  # progress check
        insert_data = "insert into youtha(city,project_name,address) values(%s,%s,%s)"  # parameterized insert
        available_data = [city, project_name, address]  # row values to insert
        cursor.execute(insert_data, available_data)  # execute the insert
        db.commit()  # commit the transaction

Code Reading

You will find that crawling an app is no different from crawling the web; the code is even simpler than web scraping, because an app mainly returns data through an API in JSON format, which is already very well structured. The key to capturing app data is using the packet-capture tool to work out what the data URL looks like, constructing the request header, and determining whether the request method is POST or GET. If it is POST, build the request form data_form, and remember to convert that form with the json.dumps() function into JSON, because the request header specifies "Content-Type": "application/json". The data that comes back is also in JSON format and can be consumed directly; I like to first turn it into a dict with json.loads() and then pull out the fields, but that is a personal preference.
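The json.dumps()/json.loads() round trip described above can be seen in isolation. A standalone sketch using only the standard library (the form values are copied from the capture; the sample response is a made-up minimal example of the API's shape):

```python
import json

# The API declares "Content-Type: application/json", so the POST body must
# be a JSON string, not a form-encoded dict: json.dumps() does that conversion.
data_form = {"equipmentId": "C2CCEB9A-4EA2-4F5A-9275-A6D3ECBAF95F",
             "cityId": "1", "pageSize": "20", "token": "", "page": "1"}
body = json.dumps(data_form)  # dict -> JSON string, ready to send as the body

# On the way back, json.loads() turns the JSON response text into a dict.
sample_response = '{"data": {"projectAll": []}}'
parsed = json.loads(sample_response)
print(parsed["data"]["projectAll"])  # -> []
```

Note that requests can also do the serialization implicitly: requests.post(url, json=data_form) converts the dict and sets the Content-Type header in one step, so data=json.dumps(data_form) is simply the explicit spelling of the same thing.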

Screenshot of the Results


Pitfalls Encountered and Solutions

  1. Opening the app on the phone reports a network connection failure: on the phone, go to Settings -> General -> About -> Certificate Trust Settings and switch on trust for the Fiddler certificate.
  2. Copying the url into a browser shows error code 405
    because the YouChao endpoint is a POST request, while the browser address bar can only issue a GET. Carefully check your request headers, construct data_form, and use the json.dumps() function to serialize it to JSON:
data_form = {"equipmentId":"C2CCEB9A-4EA2-4F5A-9275-A6D3ECBAF95F","cityId":cityid,"pageSize":"20","token":"","page":"1"} # request form data
json.dumps(data_form) # serialize to JSON
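To make the GET-versus-POST distinction concrete, here is a small sketch that builds (but does not send) the POST the browser could not make, using the endpoint and form fields captured earlier; preparing the request rather than sending it keeps the example inspectable offline:

```python
import json
import requests

URL = "https://ris.crland.com.cn/api/public/app/project/querAllPro"

def build_request(city_id):
    """Prepare (but do not send) the POST that a browser's GET cannot make."""
    payload = {"equipmentId": "C2CCEB9A-4EA2-4F5A-9275-A6D3ECBAF95F",
               "cityId": city_id, "pageSize": "20", "token": "", "page": "1"}
    req = requests.Request("POST", URL,
                           headers={"Content-Type": "application/json"},
                           data=json.dumps(payload))
    return req.prepare()

prepared = build_request("1")
# prepared.method is "POST" and prepared.body holds the JSON string;
# pasting URL into an address bar issues a GET instead, hence the 405.
```

Sending it is then requests.Session().send(prepared, verify=False); issuing a plain GET to the same URL reproduces the 405.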

Disclaimer

This Python crawler is shared for learning and exchange only. If it causes any offense, please let me know and it will be removed.

Further Reading
Crawling Qingke (青客) apartment listings by city
Crawling UOKO (优客逸家) apartment listings
Crawling Jianfang apartment listings
Crawling Jiangyu apartment rental listings
Fetching Weibo content by keyword
Crawling the Jiazhangbang (家长帮) forum boards
Crawling BIG+ Bijia International Community centralized apartment projects
Fetching China Merchants Shekou centralized apartment project information



Origin blog.csdn.net/zengbowengood/article/details/101070190