Python 3 crawler: cracking the JS encryption on the China Land Market Network, explained in detail

Compiled in late October 2020, dedicated to those who refuse to settle for ordinary

For more crawler knowledge, please check: https://blog.csdn.net/weixin_45316122/article/details/109840745

Libraries commonly used to run JS from Python:
PyV8 installation notes: https://my.oschina.net/u/854530/blog/853808

Node.js installation and configuration in detail: https://zhuanlan.zhihu.com/p/77594251

Installing PyExecJS, plus a summary of several ways to call JS from Python: https://zhuanlan.zhihu.com/p/165585592

Tip: PhantomJS is no longer supported by Selenium, and Selenium is no longer the best choice anyway. Pyppeteer is a better fit: it is asynchronous by design and is better at getting past browser detection.

 

First install PyExecJS:

pip install PyExecJS

PhantomJS is also needed later; see the PhantomJS installation reference in section Two.

Target URL for this crawl: http://www.landchina.com/default.aspx?tabid=226

 

Table of Contents

One: Analyze JS encryption

Two: the process of solving JS

Three: overall code


One: Analyze JS encryption

Capturing packets with Fiddler shows that the next request goes to http://www.landchina.com/default.aspx?tabid=226&security_verify_data=313533362c383634 (note that this link is generated by JS).
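The security_verify_data value looks opaque, but it appears to be plain hex-encoded ASCII. A minimal sketch for decoding it (the helper name decode_verify_data is my own and not part of the site or of any library):

```python
def decode_verify_data(hex_value: str) -> str:
    """Decode a hex string (two hex digits per character) back to ASCII text."""
    return bytes.fromhex(hex_value).decode("ascii")


# Value taken from the captured link above.
decoded = decode_verify_data("313533362c383634")
print(decoded)  # 1536,864
```

The decoded text has the shape width,height, which is consistent with the screen-size logic in the page's JS analyzed in section Two.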

 

The rough flow is therefore: first request http://www.landchina.com/default.aspx?tabid=226 and get the JS; parse the JS to obtain the cookie and the verify_url; then use requests.session() to visit the verify_url produced by the JS (in my case http://www.landchina.com/default.aspx?tabid=226&security_verify_data=313533362c383634); and finally request http://www.landchina.com/default.aspx?tabid=226 again with the cookie information.

 

Two: the process of solving JS

Direct request:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests
url = 'http://www.landchina.com/default.aspx?tabid=226'
headers = {
           "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
           }
 
html = requests.get(url, headers=headers, verify=False)
print("text:", html.text)

with open('data2.html', 'w', encoding='utf8') as f:
    f.write(html.text + '\n')
print("data2.html saved!")

Open data2.html to see the JS:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <meta http-equiv="Cache-Control" content="no-store, no-cache, must-revalidate, post-check=0, pre-check=0"/><meta http-equiv="Connection" content="Close"/>
    <script type="text/javascript">
        function stringToHex(str){
            var val="";for(var i = 0; i < str.length; i++){if(val == "")val = str.charCodeAt(i).toString(16);else val += str.charCodeAt(i).toString(16);}return val;
        }
        function YunSuoAutoJump(){
            var width =screen.width; var height=screen.height; var screendate = width + "," + height;var curlocation = window.location.href;if(-1 == curlocation.indexOf("security_verify_")){ 
                document.cookie="srcurl=" + stringToHex(window.location.href) + ";path=/;";}self.location = "/default.aspx?tabid=226&security_verify_data=" + stringToHex(screendate);
        }
    </script><script>setTimeout("YunSuoAutoJump()", 50);
</script>
</head><!--2019-06-20 14:08:07--></html>

Note the document.cookie="srcurl=..." statement: this is where the JS sets the cookie.

Use PyExecJS to execute the stringToHex() function. The code is as follows:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import execjs
jstext = '''
function stringToHex(str){var val="";for(var i = 0; i < str.length; i++){if(val == "")
val = str.charCodeAt(i).toString(16);else val += str.charCodeAt(i).toString(16);}return val;}
'''
ctx = execjs.compile(jstext)  # compile the JS code
a = ctx.call("stringToHex","9999")
print(a)

Or select the Node.js runtime explicitly:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import execjs
import execjs.runtime_names
jstext = '''
function stringToHex(str){var val="";for(var i = 0; i < str.length; i++){if(val == "")
val = str.charCodeAt(i).toString(16);else val += str.charCodeAt(i).toString(16);}return val;}
'''
node = execjs.get(execjs.runtime_names.Node)
ctx = node.compile(jstext)  # compile the JS code
a = ctx.call("stringToHex","9999")
print(a)

Execution result: 39393939
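For plain ASCII input, stringToHex() only concatenates the hexadecimal character code of each character, so a JS runtime is not strictly required for this step. A pure-Python port might look like the sketch below (the name string_to_hex is my own; for characters outside the Basic Multilingual Plane the result would differ from JS, because charCodeAt works on UTF-16 code units):

```python
def string_to_hex(text: str) -> str:
    """Concatenate the lowercase hex code of every character, like the page's stringToHex()."""
    return "".join(format(ord(ch), "x") for ch in text)


print(string_to_hex("9999"))  # 39393939
```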

Next, try to execute the YunSuoAutoJump() function (with its source added to jstext):

node = execjs.get(execjs.runtime_names.Node)
ctx = node.compile(jstext)
a = ctx.call("YunSuoAutoJump")

Running this directly raises an error: execjs._exceptions.ProgramError: ReferenceError: screen is not defined

screen cannot be found: it is a property of the browser's window object, which does not exist in Node.js. A runtime with a browser environment is needed, and PhantomJS had not been configured yet (reference: js reverse decryption web crawler + phantomjs installation tutorial + download PhantomJS).
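An alternative to switching runtimes is to stub the missing browser objects before the page's JS runs. The sketch below only builds the JS source text; build_shimmed_js, runJump and the default screen size are my own assumptions, not part of the site or of PyExecJS, and I have not run the result against the live site.

```python
import json

# Minimal stand-ins for the browser globals that YunSuoAutoJump() touches.
SHIM_TEMPLATE = """
var screen = {{width: {width}, height: {height}}};
var window = {{location: {{href: {href}}}}};
var document = {{cookie: ""}};
var self = {{location: ""}};
"""

# Hypothetical helper that calls the page function and reports what it set.
RUNNER = """
function runJump() {
    YunSuoAutoJump();
    return {cookie: document.cookie, location: self.location};
}
"""


def build_shimmed_js(page_js, href, width=1536, height=864):
    """Return the page's JS wrapped with stub globals and the runJump() helper."""
    shim = SHIM_TEMPLATE.format(
        width=int(width), height=int(height), href=json.dumps(href)
    )
    return shim + page_js + RUNNER


# With PyExecJS and Node.js installed, this could be used roughly as:
#   import execjs
#   ctx = execjs.get(execjs.runtime_names.Node).compile(build_shimmed_js(jstext, url))
#   print(ctx.call("runJump"))
```

Because the stubs carry the real page URL in window.location.href, the cookie produced this way would encode that URL rather than whatever location the runtime itself reports.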

Configure the PhantomJS runtime, then modify the JS and execute it:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import execjs,os
import execjs.runtime_names
# os.environ["EXECJS_RUNTIME"] = "Node"
print("Current runtime:", execjs.get().name)  # this value depends on your environment
 
jstext = '''
function stringToHex(str){var val="";for(var i = 0; i < str.length; i++){if(val == "")
    val = str.charCodeAt(i).toString(16);else val += str.charCodeAt(i).toString(16);}return val;}
 
function YunSuoAutoJump(){ var width =screen.width; var height=screen.height; var screendate = width + "," + height;
    var curlocation = window.location.href;if(-1 == curlocation.indexOf("security_verify_")){ 
    document.cookie="srcurl=" + stringToHex(window.location.href) + ";path=/;";
    fcookie="srcurl=" + stringToHex(window.location.href) + ";path=/;";    // added: a variable that records the cookie
    }self.location = "/default.aspx?tabid=226&security_verify_data=" + stringToHex(screendate);
    return fcookie                                                         // added: return the cookie
;}'''
os.environ["EXECJS_RUNTIME"] = "PhantomJS"
print("Runtime after change:", execjs.get().name)
ctx = execjs.compile(jstext)
print(ctx.call("YunSuoAutoJump"))

Note: I added two lines to the JS here:

fcookie="srcurl=" + stringToHex(window.location.href) + ";path=/;";    // added: a variable that records the cookie

return fcookie // added: return the cookie

 

With the modification worked out, use requests to fetch the page and apply the same changes to the downloaded JS automatically. The code is as follows:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests,re,execjs,os
url = 'http://www.landchina.com/default.aspx?tabid=226'
headers = {
           "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
           }
html = requests.get(url, headers=headers, verify=False)
jstext = re.findall(r'<script type="text/javascript">(.+?)</script>',html.text)[0]
print("jstext:",jstext)
os.environ["EXECJS_RUNTIME"] = "PhantomJS"
jstext = jstext.replace('";path=/;";','";path=/;";fcookies="srcurl=" + stringToHex(window.location.href) + ";path=/;";')
jstext = jstext.replace('stringToHex(screendate);','stringToHex(screendate);return fcookies;')
print("jstext:",jstext)
ctx = execjs.compile(jstext)
print(ctx.call("YunSuoAutoJump"))

Output result:

jstext: function stringToHex(str){var val="";for(var i = 0; i < str.length; i++){if(val == "")val = str.charCodeAt(i).toString(16);else val += str.charCodeAt(i).toString(16);}return val;}function YunSuoAutoJump(){ var width =screen.width; var height=screen.height; var screendate = width + "," + height;var curlocation = window.location.href;if(-1 == curlocation.indexOf("security_verify_")){ document.cookie="srcurl=" + stringToHex(window.location.href) + ";path=/;";}self.location = "/default.aspx?tabid=226&security_verify_data=" + stringToHex(screendate);}
jstext: function stringToHex(str){var val="";for(var i = 0; i < str.length; i++){if(val == "")val = str.charCodeAt(i).toString(16);else val += str.charCodeAt(i).toString(16);}return val;}function YunSuoAutoJump(){ var width =screen.width; var height=screen.height; var screendate = width + "," + height;var curlocation = window.location.href;if(-1 == curlocation.indexOf("security_verify_")){ document.cookie="srcurl=" + stringToHex(window.location.href) + ";path=/;";fcookies="srcurl=" + stringToHex(window.location.href) + ";path=/;";}self.location = "/default.aspx?tabid=226&security_verify_data=" + stringToHex(screendate);return fcookies;}
srcurl=66696c653a2f2f2f433a2f55736572732f41444d494e497e312f417070446174612f4c6f63616c2f54656d702f657865636a7375753575783872652e6a73;path=/;
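It is worth checking what the returned cookie actually contains. The sketch below parses the printed string with the standard library's http.cookies.SimpleCookie and hex-decodes the value (the variable names are my own):

```python
from http.cookies import SimpleCookie

# Cookie string exactly as printed in the output above.
raw = (
    "srcurl=66696c653a2f2f2f433a2f55736572732f41444d494e497e312f"
    "417070446174612f4c6f63616c2f54656d702f657865636a737575357578"
    "3872652e6a73;path=/;"
)

cookie = SimpleCookie()
cookie.load(raw)
morsel = cookie["srcurl"]

# The value is hex-encoded ASCII, two hex digits per character.
decoded = bytes.fromhex(morsel.value).decode("ascii")
print("path:", morsel["path"])
print("decoded srcurl:", decoded)
```

The decoded value starts with file:/// and ends in .js, which looks like the temporary script file that the runtime was given, not the landchina.com page URL. That is what window.location.href evaluated to in this run, so a cookie built this way does not match what a real browser would set.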

The JS also contains a self.location = "..." statement, which makes the browser navigate to the verification URL (if this is unfamiliar, see: the usage of location.href in js):

self.location = "/default.aspx?tabid=226&security_verify_data=" + stringToHex(screendate);
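Since both the cookie and the jump target are plain hex encodings, they can also be computed without any JS runtime. The following is a sketch under assumptions: build_verify_request and string_to_hex are my own names, the 1536x864 screen size is simply the one seen in the captured link, and the hard-coded path mirrors the JS shown above, so it would need updating if the site changes its script.

```python
from urllib.parse import urljoin


def string_to_hex(text: str) -> str:
    """Python port of the page's stringToHex() for ASCII input."""
    return "".join(format(ord(ch), "x") for ch in text)


def build_verify_request(page_url: str, width: int = 1536, height: int = 864):
    """Return (verify_url, cookies) the way YunSuoAutoJump() would compute them."""
    screen_data = f"{width},{height}"
    path = "/default.aspx?tabid=226&security_verify_data=" + string_to_hex(screen_data)
    verify_url = urljoin(page_url, path)  # resolve the absolute path against the site
    cookies = {"srcurl": string_to_hex(page_url)}
    return verify_url, cookies


verify_url, cookies = build_verify_request("http://www.landchina.com/default.aspx?tabid=226")
print(verify_url)
print(cookies)
```

The returned dict has the shape that requests accepts for its cookies argument, and here srcurl encodes the real page URL.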

Three: overall code

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests,re,execjs,os
session = requests.session()
url = 'http://www.landchina.com/default.aspx?tabid=226'
headers = {
           "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
           }
html = session.get(url, headers=headers, verify=False)
jstext = re.findall(r'<script type="text/javascript">(.+?)</script>',html.text)[0]
print("JS extracted by regex:", jstext)
os.environ["EXECJS_RUNTIME"] = "PhantomJS"          # make execjs use the PhantomJS runtime
jstext1 = jstext.replace('";path=/;";','";path=/;";fcookies="srcurl=" + stringToHex(window.location.href) + ";path=/;";')
jstext1 = jstext1.replace('stringToHex(screendate);','stringToHex(screendate);return fcookies;')
print("Modified JS:", jstext1)
ctx = execjs.compile(jstext1)           # compile the JS
cookie1 = ctx.call("YunSuoAutoJump")    # run YunSuoAutoJump() from the JS
print("cookie1 obtained from the JS:", cookie1)
 
jstext2 = jstext.replace('stringToHex(screendate);','stringToHex(screendate);verify_url= "/default.aspx?tabid=226&security_verify_data=" + stringToHex(screendate);return verify_url;')
print("JS modified again to return verify_url:", jstext2)
ctx = execjs.compile(jstext2)
verify_url = "http://www.landchina.com"+ctx.call("YunSuoAutoJump")
print("verify_url:",verify_url)

# Parse the "name=value;path=/;" string into a dict, keeping only the name=value pair
name, value = cookie1.split(';')[0].split('=', 1)
cookie1 = {name: value}

html = session.get(verify_url)
# html = requests.get(verify_url, headers=headers, verify=False, cookies=cookie1)
print("Response to the verify_url request:", html.text)
html = session.get(url, headers=headers, verify=False)
print("Final page, length:", len(html.text))
print("Final page, html:", html.text)

Result:

Once the page has been fetched successfully, what remains is data cleaning, URL extraction, and storage.
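As a starting point for the URL-extraction step, here is a minimal sketch that uses only the standard library. The sample HTML and the detail.aspx link are made up for illustration; the real page structure is not shown in this article, so the tag to target would have to be adjusted after inspecting the fetched page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.links.append(urljoin(self.base_url, href))


# Hypothetical snippet standing in for the fetched page.
sample_html = '<table><tr><td><a href="detail.aspx?id=1">Item</a><a name="top">x</a></td></tr></table>'

collector = LinkCollector("http://www.landchina.com/default.aspx?tabid=226")
collector.feed(sample_html)
print(collector.links)
```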

 

Reference article: Crawler-Cracking the website to generate cookies through js encryption (1)

This article is __Songsong's first write-up. It means a lot to me: it should be the first entry in my crawler notes.

 

 


Origin blog.csdn.net/weixin_45316122/article/details/109841844