[Python web crawler study notes D:01] - JS obfuscation and encryption

Brief introduction

After learning to crawl data from static pages, the next step is learning to crawl data from dynamic pages.

  • What are dynamic pages?

    Sometimes the result of fetching a page with requests differs from what the browser shows: the data displays normally in the browser, but it is missing from the response that requests receives. This is because requests only retrieves the raw HTML document, whereas the page in the browser is the result of JavaScript processing that data. The data can come from several sources: it may be loaded via Ajax, embedded in the HTML document, or generated by JavaScript using a specific algorithm.
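To make the difference concrete, here is a minimal offline sketch. The HTML snippet, the `result` id, and the helper names are made up for illustration; the string stands in for the raw document that requests would receive. Because nothing executes the script, the element that the browser would fill in stays empty.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML: the <div> is empty until the script runs in a browser.
RAW_HTML = """
<html><body>
  <div id="result"></div>
  <script>
    document.getElementById("result").textContent = "hello";
  </script>
</body></html>
"""


class DivText(HTMLParser):
    """Collect the text found inside <div id="result">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "result":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def static_text(html):
    """Return the text a static parser sees inside the result <div>."""
    parser = DivText()
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The parser never runs the script, so the <div> has no text.
print(repr(static_text(RAW_HTML)))
```

A browser would show "hello" for the same document, which is the gap this post is about.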

This post documents how to crawl a page whose rendering relies on obfuscated, encrypted JavaScript.

Example site: Youdao Translate, http://fanyi.youdao.com/

Finding the problem

  • The usual way to build a crawler
  1. Open Chrome developer tools, refresh the page, and analyze it

    Enter '萌新' and click Translate. With XHR selected in the panel on the right, an XHR request appears once the text has been submitted; click it to view the details.
    The POST request address is: http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
    Now look at which parameters the request sends.

    The parameter i is the text we want translated. So we copy the request headers and form parameters directly and use them to build a request.

  2. Build the request with requests

    import requests

    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": "[email protected]; JSESSIONID="
                  "aaaBxpJhsD9bZgfYbsJax; OUTFOX_SEARCH_USER_ID_NCOO=2138649720.2208343;"
                  " ___rl__test__cookies=1581141334922",
        "Referer": "http://fanyi.youdao.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    }
    url = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
    form_data = {
        "i": '萌新',
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": '15811630953552',
        "sign": '95f6b7ff43ba04c257097dabd115645e',
        "ts": '1581163095355',
        "bv": 'd6c3cd962e29b66abe48fcb8f4dd7f7d',
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_CLICKBUTTION"
    }
    
    response = requests.post(url=url, headers=headers, data=form_data)
    print(response.text)
    

    Output:

    The server returned an error code!
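A small helper can make this failure explicit instead of printing the raw body. This is only a sketch: the `errorCode` field name and the value 50 are assumptions about the response body, because the original screenshot of the output is not available. The `translateResult` path is the one used by the complete script in this post, and `extract_translation` is a hypothetical name.

```python
import json


def extract_translation(body):
    """Return the translated text, or raise if the server sent an error code.

    Assumption: a rejected request answers with JSON such as {"errorCode": 50}.
    """
    data = json.loads(body)
    code = data.get("errorCode", 0)
    if code != 0:
        raise ValueError(f"server rejected the request, errorCode={code}")
    return data["translateResult"][0][0]["tgt"]


# Hypothetical response bodies used only to exercise the helper.
ok_body = '{"errorCode": 0, "translateResult": [[{"src": "abc", "tgt": "xyz"}]]}'
print(extract_translation(ok_body))
```

With a helper like this, `extract_translation(response.text)` fails loudly when the parameters are rejected.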

Analyzing the problem

  • Trying different translations shows that the parameters salt, sign, ts, and bv change from request to request. So how are these parameters generated?

  • Finding the JS file

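    The data request points to the JS file fanyi.min.js:1, so we click on that file.

    This opens a screen full of JavaScript code that is unreadable as it stands, so we click the {} button in the lower-left corner to expand it.

  • Use Chrome's breakpoint feature

    In the box, add the POST address of the earlier data request: http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
    Then click the translate button again. When the breakpoint is hit, half of the screen turns gray. The line number now points to line 7570. Expand Call Stack on the right and click
    t.translate in it; the parameters we need are there.

  • Digging deeper

    Hover the mouse over generateSaltSign and a link appears. Click it to jump to line 8363.

  • We have now found the origin of the parameters salt, sign, ts, and bv.

    Click line 8363; the blue marker that appears means a breakpoint has been set. Then click the button shown in the original figure and click the translate button once more.

    The steps above reveal where the four parameters come from: they are computed by this JavaScript code. Next, we build these four parameters in Python.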

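Solving the problem

  • Parameter: ts (in the JavaScript code: r = "" + (new Date).getTime())
def get_ts():
    # getTime() returns whole milliseconds, so truncate before converting to str
    ts = str(int(time.time() * 1000))
    return ts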
  • Parameter: bv (in the JavaScript code: t = n.md5(navigator.appVersion))
def get_bv():
    appVersion = "5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    m = hashlib.md5()
    m.update(appVersion.encode("utf-8"))
    bv = m.hexdigest()
    return bv
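In this post the appVersion string is exactly the User-Agent header with its leading "Mozilla/" removed. Assuming that relationship holds for the browser being imitated (it matches the Chrome strings used here, but treat it as an assumption elsewhere), bv can be derived from the same User-Agent that is sent in the headers, so the two cannot drift apart. `bv_from_user_agent` is a hypothetical helper name.

```python
import hashlib

# The User-Agent used in the request headers of this post.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36")


def bv_from_user_agent(user_agent):
    """md5 of the User-Agent with its "Mozilla/" prefix removed.

    Assumption: navigator.appVersion equals the User-Agent minus "Mozilla/".
    """
    prefix = "Mozilla/"
    if not user_agent.startswith(prefix):
        raise ValueError("unexpected User-Agent format")
    app_version = user_agent[len(prefix):]
    return hashlib.md5(app_version.encode("utf-8")).hexdigest()


print(bv_from_user_agent(USER_AGENT))
```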
  • Parameter: salt (in the JavaScript code: i = r + parseInt(10 * Math.random(), 10))
def get_salt(ts):
    # salt is the ts string (r in the JavaScript) followed by one random digit 0-9
    salt = ts + str(random.randint(0, 9))
    return salt
  • Parameter: sign (in the JavaScript code: n.md5("fanyideskweb" + e + i + "n%A-rKaT5fb[Gy?;N5@Tj"))
def get_sign(myinput, salt):
    # salt must be the same value that is sent in the form, so it is passed in
    a = "fanyideskweb"
    b = myinput
    c = salt
    d = "n%A-rKaT5fb[Gy?;N5@Tj"
    str_data = a + b + c + d

    m = hashlib.md5()
    m.update(str_data.encode("utf-8"))
    sign = m.hexdigest()
    return sign
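The form values captured in the first attempt are consistent with the salt formula: the captured salt is the captured ts followed by a single digit. A quick check, using only values copied from that request (`split_salt` is a hypothetical helper name):

```python
def split_salt(salt):
    """Split a captured salt into its (ts, digit) parts."""
    if len(salt) < 2 or not salt.isdigit():
        raise ValueError("salt should be a string of digits")
    return salt[:-1], int(salt[-1])


# Values copied from the request captured earlier in this post.
captured = {"salt": "15811630953552", "ts": "1581163095355"}

ts_part, digit = split_salt(captured["salt"])
print(ts_part == captured["ts"], digit)  # prints: True 2
```

This also shows why ts has to be a whole number of milliseconds: a float string would no longer match this shape.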
  • Complete code:
import hashlib
import json
import random
import time

import requests


def get_ts():
    # JavaScript: r = "" + (new Date).getTime(), i.e. whole milliseconds
    ts = str(int(time.time() * 1000))
    return ts


def get_bv():
    # JavaScript: t = n.md5(navigator.appVersion)
    appVersion = "5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    m = hashlib.md5()
    m.update(appVersion.encode("utf-8"))
    bv = m.hexdigest()
    return bv


def get_salt(ts):
    # JavaScript: i = r + parseInt(10 * Math.random(), 10)
    salt = ts + str(random.randint(0, 9))
    return salt


def get_sign(myinput, salt):
    # JavaScript: n.md5("fanyideskweb" + e + i + "n%A-rKaT5fb[Gy?;N5@Tj")
    a = "fanyideskweb"
    d = "n%A-rKaT5fb[Gy?;N5@Tj"
    str_data = a + myinput + salt + d

    m = hashlib.md5()
    m.update(str_data.encode("utf-8"))
    sign = m.hexdigest()
    return sign


def get_request(myinput='萌新'):
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": "[email protected]; JSESSIONID="
                  "aaaBxpJhsD9bZgfYbsJax; OUTFOX_SEARCH_USER_ID_NCOO=2138649720.2208343;"
                  " ___rl__test__cookies=1581141334922",
        "Referer": "http://fanyi.youdao.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    }
    # Note: this is the /translate path, whereas the request captured earlier
    # was sent to /translate_o.
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

    # Compute ts and salt once, so that sign is built from the salt actually sent
    ts = get_ts()
    salt = get_salt(ts)
    form_data = {
        "i": myinput,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": salt,
        "sign": get_sign(myinput, salt),
        "ts": ts,
        "bv": get_bv(),
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_CLICKBUTTION"
    }

    response = requests.post(url=url, headers=headers, data=form_data)
    print("Translation result: " + str(json.loads(response.text)['translateResult'][0][0]['tgt']))


if __name__ == '__main__':
    get_request()
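One pitfall is worth checking before sending anything: sign is an md5 over the text and the salt, so the salt in the form must be the very value that went into sign. Generating the salt twice gives two different values and a sign that no longer matches. The sketch below is a local self-check under the formulas described in this post. `expected_sign` and `form_is_consistent` are hypothetical helper names, and the secret string is the one quoted in this post, which the site may have changed since.

```python
import hashlib

CLIENT = "fanyideskweb"
SECRET = "n%A-rKaT5fb[Gy?;N5@Tj"  # constant quoted in this post; may be outdated


def expected_sign(text, salt):
    """md5 of client + text + salt + secret, as in the JavaScript formula."""
    raw = CLIENT + text + salt + SECRET
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def form_is_consistent(form):
    """True if salt is ts plus one digit and sign matches the text and salt."""
    salt_ok = form["salt"][:-1] == form["ts"] and form["salt"][-1:].isdigit()
    sign_ok = form["sign"] == expected_sign(form["i"], form["salt"])
    return salt_ok and sign_ok


# A form whose sign was computed from the same salt that is sent.
good = {"i": "萌新", "ts": "1581163095355", "salt": "15811630953552"}
good["sign"] = expected_sign(good["i"], good["salt"])

# The same form, but with a salt regenerated after the sign was computed.
bad = dict(good, salt="15811630953557")

print(form_is_consistent(good), form_is_consistent(bad))  # prints: True False
```

The check runs entirely offline, so it only confirms internal consistency, not that the server accepts the request.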

Going further

Now that we know how these parameters are generated, we can package the script into an executable .exe file as a small translation tool and share it with friends for learning and discussion!
Reference: A simple way to convert a Python .py file into an executable .exe file


Origin blog.csdn.net/Dchanong_/article/details/104227315