How to crawl a website based on Flex technology

 

Adobe Flex is a set of technologies built on the Flash platform for developing and deploying RIAs (Rich Internet Applications). Many browser-based online games are developed with Flex.

Flash (the client) can communicate with the server either over HTTP (with message formats such as plain text, XML, JSON, or AMF) or over raw sockets; HTTP is the more common choice.

For Flex applications that use text, XML, or JSON message formats, the analysis and crawling methods are the same as those of Ajax-based websites:

Use an HTTP capture tool (for example, Fiddler or Firebug) to analyze the structure of each request (request method, URL, and related parameters) and the order in which the requests are made. Then write a program that sends requests in the same format, and parse and extract the returned data.
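As a rough illustration of that workflow, the sketch below rebuilds a captured JSON request with Python's standard library and parses a response body. The endpoint URL, the parameter names (page, pageSize), and the "rows" key in the response are hypothetical placeholders, not taken from any real site; substitute whatever the capture tool actually shows.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, standing in for a URL read from a capture.
API_URL = 'http://example.com/api/list'


def build_request(page, page_size):
    """Rebuild the captured POST request with the same headers and JSON body."""
    # 'page' and 'pageSize' are assumed parameter names for illustration only.
    payload = json.dumps({'page': page, 'pageSize': page_size}).encode('utf-8')
    headers = {'Content-Type': 'application/json'}
    return Request(API_URL, data=payload, headers=headers, method='POST')


def extract_rows(raw_body):
    """Parse a JSON response body and return its list of records."""
    # The 'rows' key is an assumption about the response layout.
    return json.loads(raw_body.decode('utf-8')).get('rows', [])


def fetch_page(page, page_size=50):
    """Send the simulated request and return the extracted records."""
    with urlopen(build_request(page, page_size), timeout=30) as resp:
        return extract_rows(resp.read())
```

Keeping the request construction and the response parsing in separate functions makes it possible to check both against captured samples before any traffic is sent to the server.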

This article focuses on crawling Flex applications that use the AMF message format, taking the "China Agricultural Products Wholesale Market Information Publicity System" (hereinafter the "Agricultural Products System") as an example. Its URL is http://jgsb.agri.gov.cn/flexapps/hqApp.swf . Our goal is to capture the latest agricultural product price information that this system publishes in real time.

AMF (Action Message Format, https://en.wikipedia.org/wiki/Action_Message_Format ) is a binary encoding commonly used for communication between Flash and the server. It is efficient to transmit and can be carried over HTTP. Many Flash web games use this message format.

We can capture AMF packets directly with Fiddler, but the POST body appears garbled, because AMF uses its own binary encapsulation and encoding. The "plain text" form is visible only after the data has been correctly decoded and parsed as AMF. One capture tool that supports AMF parsing is Charles (like Fiddler, it is an HTTP capture tool that works as an HTTP proxy). It parses AMF messages natively, without any additional plug-ins or patches.
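To see why the raw body looks garbled, it helps to encode a couple of values by hand. The following is a minimal sketch of two AMF3 building blocks: the variable-length U29 integer and an inline string. It covers only non-negative integers and strings sent by value (no reference table), and the function names are made up for this example; it is not a replacement for a real AMF library.

```python
def encode_u29(value):
    """Encode an integer in [0, 2**29) as an AMF3 U29 (1 to 4 bytes)."""
    if not 0 <= value < 0x20000000:
        raise ValueError('U29 holds values from 0 to 2**29 - 1')
    if value < 0x80:
        return bytes([value])
    if value < 0x4000:
        return bytes([(value >> 7) | 0x80, value & 0x7F])
    if value < 0x200000:
        return bytes([(value >> 14) | 0x80,
                      ((value >> 7) & 0x7F) | 0x80,
                      value & 0x7F])
    # Four-byte form: the first three bytes carry 7 bits each, the last a full 8.
    return bytes([(value >> 22) | 0x80,
                  ((value >> 15) & 0x7F) | 0x80,
                  ((value >> 8) & 0x7F) | 0x80,
                  value & 0xFF])


def encode_amf3_string(text):
    """Inline string: marker 0x06, U29 of (byte length << 1 | 1), UTF-8 bytes."""
    raw = text.encode('utf-8')
    return b'\x06' + encode_u29((len(raw) << 1) | 1) + raw


def encode_amf3_integer(value):
    """Small non-negative integer: marker 0x04 followed by a U29."""
    if not 0 <= value < 0x10000000:
        raise ValueError('this sketch only handles 0 to 2**28 - 1')
    return b'\x04' + encode_u29(value)


if __name__ == '__main__':
    # Type markers and length prefixes are interleaved with the text,
    # which is why a plain HTTP viewer cannot display the body cleanly.
    print(encode_amf3_string('getHqSearchData').hex())
    print(encode_amf3_integer(50).hex())
```

A tool such as Charles walks these markers and length prefixes in reverse to rebuild the readable tree of values.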

The analysis steps are as follows:

1) Start Charles.

2) Open the "Agricultural Products System" in a browser and click through to the next page.

3) The corresponding request appears in Charles. Switch to the AMF tab, where the request and response AMF data are shown parsed into plain text (similar to JSON). Analyze the request and response data to understand their structure and the meaning of each parameter, in preparation for writing the program.

The figure below shows the POST data of the HTTP request.

[Figure: AMF data parsed by Charles]

We need to analyze the meaning of the parameters in the request and the structure of the message. The body is a message of type RemotingMessage, and the message body has three parameters:

a) The first parameter is a com.itown.kas.pfsc.report.po.HqPara object with four properties (marketInfo, breedInfoDl, breedInfo, province), all of which are null. They correspond to the four query options on the web page (Flash): province, market, category, and category name.

b) The second parameter controls the current page number.

c) The third parameter controls the number of records displayed per page; the default is 15. The collection code later in this article uses 50.
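To make the three positional parameters concrete, the sketch below models them as plain Python values. The helper name make_body is hypothetical, and the dictionary is only a stand-in for the typed HqPara object: a real request has to send an instance of a class registered under the server-side type name. The page number and page size are passed as strings, mirroring the collection code.

```python
# Property names as listed in the analysis of the first parameter.
HQ_PARA_FIELDS = ('marketInfo', 'breedInfoDl', 'breedInfo', 'province')


def make_body(page, page_size=15, **filters):
    """Return [query options, page number, page size] for one request."""
    unknown = set(filters) - set(HQ_PARA_FIELDS)
    if unknown:
        raise ValueError('unknown query option(s): %s' % ', '.join(sorted(unknown)))
    # Options that are not supplied stay None, like the nulls seen in the capture.
    query = {name: filters.get(name) for name in HQ_PARA_FIELDS}
    return [query, str(page), str(page_size)]


if __name__ == '__main__':
    # Bodies for the first three pages, 50 records per page.
    for page in range(1, 4):
        print(make_body(page, page_size=50))
```

Only the second element changes from page to page, so crawling further pages amounts to repeating the same request with an incremented page number.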

We skip the analysis of the other, non-critical parameters and of the response data.

PyAMF is a Python implementation of an AMF encoder and decoder. Unfortunately, the project's official website is no longer accessible, so its documentation and sample programs cannot be viewed, which makes development considerably harder. Fortunately, a Google search (yes, not Baidu) turns up some reference material and examples from other people's projects.

The source code follows:

# coding: utf-8
# agri.gov.cn_amf_client.py
# Data crawler for http://jgsb.agri.gov.cn/flexapps/hqApp.swf
# refer: http://www.site-digger.com/html/articles/20160418/121.html

# This is a Python 3 script. The original PyAMF was written for Python 2,
# so a Python 3 port (for example Py3AMF) is assumed to provide pyamf here.
import uuid
import pyamf
from pyamf import remoting
from pyamf.flex import messaging
from urllib.request import Request, urlopen

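class HqPara:
    """Query parameters."""
    def __init__(self):
        self.marketInfo = None
        self.breedInfoDl = None
        self.breedInfo = None
        # NOTE: spelled 'provice' here, while the analysis above lists 'province';
        # use the spelling that Charles shows for the server-side class.
        self.provice = None

# https://en.wikipedia.org/wiki/Action_Message_Format
# ActionScript equivalent: registerClassAlias("personTypeAlias", Person);
# Register the custom body parameter type so that the type name
# com.itown.kas.pfsc.report.po.HqPara is sent to the server along with the
# object (otherwise the server may reject the parameter with a
# Client.Message.Deserialize.InvalidType error).
pyamf.register_class(HqPara, alias='com.itown.kas.pfsc.report.po.HqPara')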

# Build a flex.messaging.messages.RemotingMessage
msg = messaging.RemotingMessage(messageId=str(uuid.uuid1()).upper(),
                                clientId=str(uuid.uuid1()).upper(),
                                operation='getHqSearchData',
                                destination='reportStatService',
                                timeToLive=0,
                                timestamp=0)
# First: query parameters; second: page number; third: records per page (the default is only 15)
msg.body = [HqPara(), '1', '50']
msg.headers['DSEndpoint'] = None
msg.headers['DSId'] = str(uuid.uuid1()).upper()
# Encode the data according to the AMF protocol
req = remoting.Request('null', body=(msg,))
env = remoting.Envelope(amfVersion=pyamf.AMF3)
env.bodies = [('/1', req)]
data = bytes(remoting.encode(env).read())

# Submit the request
url = 'http://jgsb.agri.gov.cn/messagebroker/amf'
# The Host header must match the host in the URL
headers = {'Content-Type': 'application/x-amf', 'Host': 'jgsb.agri.gov.cn'}
http_req = Request(url, data=data, headers=headers)

# Read the raw response
resp_data = urlopen(http_req).read()

# Decode the AMF-encoded response
resp = remoting.decode(resp_data)
for i, record in enumerate(resp.bodies[0][1].body.body[0]):
    print(i, record['farmProduceName'],
          record['marketName'],
          record['maxPrice'],
          record['minPrice'],
          record['averagePrice'],
          record['producAdd'],
          record['reportMan'])

Screenshot of the running result:

[Figure: Output of the Flex AMF crawler program]
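Printing to the console is enough for a first check, but captured records are usually saved for later use. The sketch below writes records to CSV text with the standard library. It assumes each record behaves like a mapping, as the record['...'] lookups in the program suggest; the helper name records_to_csv is hypothetical, and the sample record holds placeholder values, not real output from the system.

```python
import csv
import io

# Field names taken from the print loop in the program above.
FIELDS = ['farmProduceName', 'marketName', 'maxPrice', 'minPrice',
          'averagePrice', 'producAdd', 'reportMan']


def records_to_csv(records):
    """Return CSV text with a header row and one line per record."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(FIELDS)
    for record in records:
        # Missing or null fields become empty cells instead of the text 'None'.
        row = [record.get(name) for name in FIELDS]
        writer.writerow(['' if value is None else value for value in row])
    return buf.getvalue()


if __name__ == '__main__':
    # Placeholder record for illustration only.
    sample = [{'farmProduceName': 'product-A', 'marketName': 'market-B',
               'maxPrice': 2.5, 'minPrice': 1.5, 'averagePrice': 2.0,
               'producAdd': None, 'reportMan': 'reporter-C'}]
    print(records_to_csv(sample))
```

In the program above, the list passed to such a helper would be resp.bodies[0][1].body.body[0].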

Special note: This article is for technical exchange only. Please do not use the techniques described here for illegal purposes; otherwise you bear all the consequences. If you believe we have infringed your legal rights, please contact us so that we can resolve it.


Origin blog.csdn.net/someby/article/details/108651516