Brief introduction
Having learned to crawl data from static pages, we now move on to crawling data from dynamic pages.
- What are dynamic pages?
Sometimes the result of fetching a page with requests differs from what the browser shows: the data is displayed normally in the browser, but it is missing from what requests returns. This is because requests fetches only the raw HTML document, whereas the page in the browser is the result of JavaScript processing the data. That data can come from several sources: it may be loaded via Ajax, it may be embedded in the HTML document, or it may be generated by JavaScript using a specific algorithm.
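To make the difference concrete, here is a minimal sketch that needs no network access. The HTML string is a made-up stand-in for a raw response: the result container is empty, and only a script (which requests never executes) would fill it in. The names RAW_HTML, ResultExtractor, and extract_result are illustrative, not part of any site.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, as an HTTP client would receive it. The <div> is
# empty; a browser would run the script and fill it in, requests would not.
RAW_HTML = """
<html><body>
  <div id="result"></div>
  <script>
    document.getElementById("result").textContent = "translated text";
  </script>
</body></html>
"""


class ResultExtractor(HTMLParser):
    """Collects the text found inside <div id="result">."""

    def __init__(self):
        super().__init__()
        self.inside_result = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "result":
            self.inside_result = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_result = False

    def handle_data(self, data):
        # Script source is also reported as data, but by then we have
        # already left the result <div>, so it is ignored.
        if self.inside_result:
            self.text += data


def extract_result(html):
    parser = ResultExtractor()
    parser.feed(html)
    return parser.text.strip()


# The container is empty in the raw document, so nothing is extracted.
print(repr(extract_result(RAW_HTML)))
```

The text only exists after the script has run, which is why parsing the raw document finds nothing in the container.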
This article documents how to crawl a page whose request parameters are obfuscated and encrypted by JavaScript.
Example site: Youdao Translate: http://fanyi.youdao.com/
Finding the problem
- Building the crawler the usual way
- Open the Chrome developer tools, refresh the page, and analyze it
Enter '萌新' and click Translate. In the XHR column on the right, an XHR request appears once the text has been submitted; click it to view the details.
The POST request address is: http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
Now look at the parameters sent with the request. The parameter i is the text we want translated. So we simply copy the request headers and the request parameters to construct a request.
- Constructing the request with requests
import requests

# Headers copied from the browser request
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": "[email protected]; JSESSIONID="
              "aaaBxpJhsD9bZgfYbsJax; OUTFOX_SEARCH_USER_ID_NCOO=2138649720.2208343;"
              " ___rl__test__cookies=1581141334922",
    "Referer": "http://fanyi.youdao.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
url = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
# Form parameters copied verbatim from the captured request
form_data = {
    "i": '萌新',
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": '15811630953552',
    "sign": '95f6b7ff43ba04c257097dabd115645e',
    "ts": '1581163095355',
    "bv": 'd6c3cd962e29b66abe48fcb8f4dd7f7d',
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_CLICKBUTTION"
}
response = requests.post(url=url, headers=headers, data=form_data)
print(response.text)
Result:
The server returned an error code instead of a translation!
Analyzing the problem
- When we try different translations, we find that the parameters salt, sign, ts, and bv change. So how are these parameters generated?
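One way to spot the changing parameters is to compare the form data of two captured requests. The sketch below is only an illustration: the first capture reuses a few of the values shown earlier, while the second capture and the helper name changed_params are made up for the example.

```python
def changed_params(first, second):
    """Return the sorted names of parameters whose values differ."""
    keys = set(first) | set(second)
    return sorted(k for k in keys if first.get(k) != second.get(k))


# A few fields from the request captured earlier in this article.
first_capture = {
    "i": "萌新",
    "client": "fanyideskweb",
    "salt": "15811630953552",
    "sign": "95f6b7ff43ba04c257097dabd115645e",
    "ts": "1581163095355",
    "doctype": "json",
}

# Hypothetical second capture: placeholder values, not real traffic.
second_capture = {
    "i": "hello",
    "client": "fanyideskweb",
    "salt": "15811631000007",
    "sign": "00000000000000000000000000000000",
    "ts": "1581163100000",
    "doctype": "json",
}

print(changed_params(first_capture, second_capture))
```

Fields that stay constant across captures can be copied as they are; the ones reported here have to be regenerated for every request.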
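- Locating the JS file
At this point we can see that the data request points to the JS file fanyi.min.js:1, so we click on that file.
A screen full of JavaScript code appears, and it is unreadable as it stands; click the {} button in the lower-left corner to expand (pretty-print) it.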
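- Using Chrome's breakpoint feature
In the box, add the POST address of the earlier data request: http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
Then click the translate button again. When the breakpoint is hit, half of the screen turns gray, and the line number jumps to line 7570. Expand Call Stack on the right and click t.translate in it; the parameters we need are there.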
- Digging deeper
Hover the mouse over generateSaltSign; a link appears, as shown in the figure. Click it to jump to line 8363.
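- At this point we have found the source of the parameters salt, sign, ts, and bv.
Click line 8363; a blue marker appears, which means a breakpoint has been set. Then click the button shown in the figure above, and click the translate button once more.
Through the steps above, we have traced the four parameters to their source: they are computed by the JavaScript code shown in the figure above. Next, we construct these four parameters in Python.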
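Solving the problem
- Parameter: ts (in the JavaScript code: r = "" + (new Date).getTime())
def get_ts():
    # Millisecond timestamp as an integer string, like (new Date).getTime()
    ts = str(int(time.time() * 1000))
    return ts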
- Parameter: bv (in the JavaScript code: t = n.md5(navigator.appVersion))
def get_bv():
    # The browser's navigator.appVersion string
    appVersion = "5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    m = hashlib.md5()
    m.update(appVersion.encode("utf-8"))
    bv = m.hexdigest()
    return bv
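The appVersion string above is the same as the User-Agent header used earlier with the leading "Mozilla/" removed, which is how Chrome typically reports navigator.appVersion. Assuming that relationship holds, bv can be derived from the User-Agent instead of being hard-coded. The helper names below are illustrative.

```python
import hashlib

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
)


def app_version_from_user_agent(user_agent):
    """Strip the leading 'Mozilla/' (assumed Chrome behaviour)."""
    prefix = "Mozilla/"
    if not user_agent.startswith(prefix):
        raise ValueError("unexpected User-Agent format")
    return user_agent[len(prefix):]


def bv_from_user_agent(user_agent):
    """MD5 hex digest of the derived appVersion string."""
    app_version = app_version_from_user_agent(user_agent)
    return hashlib.md5(app_version.encode("utf-8")).hexdigest()


print(app_version_from_user_agent(USER_AGENT))
print(bv_from_user_agent(USER_AGENT))
```

Deriving bv this way keeps it in step with the User-Agent header if that header is ever changed.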
- Parameter: salt (in the JavaScript code: i = r + parseInt(10 * Math.random(), 10))
def get_salt(ts):
    # ts is the string returned by get_ts(); append one random digit from 0 to 9
    salt = ts + str(random.randint(0, 9))
    return salt
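The values captured in the browser earlier fit this rule: the salt 15811630953552 is the ts 1581163095355 followed by a single digit. A small check, using only those captured values (the function name is illustrative):

```python
def is_salt_consistent(ts, salt):
    """True if salt is ts followed by exactly one decimal digit."""
    return (
        len(salt) == len(ts) + 1
        and salt.startswith(ts)
        and salt[-1].isdigit()
    )


# Values from the request captured in the developer tools.
captured_ts = "1581163095355"
captured_salt = "15811630953552"

print(is_salt_consistent(captured_ts, captured_salt))
```

A salt built from a floating-point timestamp string would fail this check, because it would contain a decimal point and extra digits.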
- Parameter: sign (in the JavaScript code: n.md5("fanyideskweb" + e + i + "n%A-rKaT5fb[Gy?;N5@Tj"))
def get_sign(myinput, salt):
    # salt must be the same value that is sent in the "salt" form field
    a = "fanyideskweb"
    b = myinput
    c = salt
    d = "n%A-rKaT5fb[Gy?;N5@Tj"
    str_data = a + b + c + d
    m = hashlib.md5()
    m.update(str_data.encode("utf-8"))
    sign = m.hexdigest()
    return sign
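A quick way to test the reconstruction is to recompute the sign from the captured input and salt and compare it with the captured sign. The sketch below only performs the comparison and assumes no particular outcome: a match requires that the key string in the script is the same one that was in use when the request was captured. The function name compute_sign is illustrative.

```python
import hashlib


def compute_sign(text, salt):
    """MD5 of client name + text + salt + key string, as in the JavaScript."""
    raw = "fanyideskweb" + text + salt + "n%A-rKaT5fb[Gy?;N5@Tj"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Values from the request captured in the developer tools.
captured_text = "萌新"
captured_salt = "15811630953552"
captured_sign = "95f6b7ff43ba04c257097dabd115645e"

recomputed = compute_sign(captured_text, captured_salt)
print("recomputed:", recomputed)
print("matches captured sign:", recomputed == captured_sign)
```

If the two values differ, the key string or one of the inputs does not correspond to the captured request and should be re-read from the JavaScript at the breakpoint.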
- Complete code:
import hashlib
import json
import random
import time

import requests


def get_ts():
    # Millisecond timestamp as an integer string, like (new Date).getTime()
    ts = str(int(time.time() * 1000))
    return ts


def get_bv():
    # MD5 of the browser's navigator.appVersion string
    appVersion = "5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    m = hashlib.md5()
    m.update(appVersion.encode("utf-8"))
    bv = m.hexdigest()
    return bv


def get_salt(ts):
    # The timestamp followed by one random digit from 0 to 9
    salt = ts + str(random.randint(0, 9))
    return salt


def get_sign(myinput, salt):
    # The sign must be computed from the same salt that is sent in the form
    a = "fanyideskweb"
    b = myinput
    c = salt
    d = "n%A-rKaT5fb[Gy?;N5@Tj"
    str_data = a + b + c + d
    m = hashlib.md5()
    m.update(str_data.encode("utf-8"))
    sign = m.hexdigest()
    return sign


def get_request():
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": "[email protected]; JSESSIONID="
                  "aaaBxpJhsD9bZgfYbsJax; OUTFOX_SEARCH_USER_ID_NCOO=2138649720.2208343;"
                  " ___rl__test__cookies=1581141334922",
        "Referer": "http://fanyi.youdao.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    }
    # The POST address captured in the developer tools
    url = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
    text = '萌新'
    # Generate ts and salt once so that salt, sign, and ts agree with each other
    ts = get_ts()
    salt = get_salt(ts)
    form_data = {
        "i": text,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": salt,
        "sign": get_sign(text, salt),
        "ts": ts,
        "bv": get_bv(),
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_CLICKBUTTION"
    }
    response = requests.post(url=url, headers=headers, data=form_data)
    print("Translation result: " + str(json.loads(response.text)['translateResult'][0][0]['tgt']))


if __name__ == '__main__':
    get_request()
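Because the server answers with an error code when the parameters are rejected, indexing translateResult directly can fail with a KeyError. The sketch below separates the parsing step so the failure is reported clearly. The translateResult[0][0]['tgt'] path comes from the code above; the sample payloads, including the errorCode field name and its values, are assumptions made for illustration.

```python
import json


def extract_translation(response_text):
    """Return the first translated segment, or raise if none is present."""
    data = json.loads(response_text)
    if "translateResult" not in data:
        raise RuntimeError("no translation in response: " + json.dumps(data))
    return data["translateResult"][0][0]["tgt"]


# Hypothetical payloads; the exact fields of real responses are assumed.
ok_payload = json.dumps(
    {"translateResult": [[{"src": "hello", "tgt": "你好"}]], "errorCode": 0}
)
error_payload = json.dumps({"errorCode": 50})

print(extract_translation(ok_payload))
try:
    extract_translation(error_payload)
except RuntimeError as exc:
    print("request rejected:", exc)
```

In the complete code, the final print could call extract_translation(response.text) instead of indexing the parsed JSON inline.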
Going further
Now that we know how these parameters are generated, we can package the script into an executable .exe file as a small translation tool and share it with friends for learning and exchange!
Reference: A simple way to package a Python .py file into an executable .exe file
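Before packaging, it helps to give the script a command-line entry point so the text to translate is not hard-coded. This is a minimal sketch using argparse from the standard library. The translate argument is a hypothetical hook: a real build would pass in a function that sends the request and returns the translated text.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Translate text from the command line."
    )
    parser.add_argument("text", help="text to translate")
    return parser


def main(argv=None, translate=None):
    """Parse arguments, translate the text, print and return the result."""
    args = build_parser().parse_args(argv)
    if translate is None:
        # Placeholder used when no real translation function is supplied.
        def translate(text):
            return "(no translator wired in) " + text
    result = translate(args.text)
    print(result)
    return result


# Demonstration with a stand-in translator that upper-cases its input.
main(["hello"], translate=str.upper)
```

In a packaged tool, the last line would become main() so that the arguments are read from the command line.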