Python crawler: JS encryption and setCookie

Foreword

When crawling some websites, the response is not the expected HTML but a long string of unformatted JS, for example:

var arg1='38B18065C640DD60B8A3AD8BFA4DE2D694EDD37C';
var _0x4818=['\x63\x73\..
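
A crawler first has to notice that it received this script instead of HTML. The following is a minimal sketch of one way to do that, assuming the script always contains an assignment of the form var arg1='...'; the names ARG1_PATTERN and extract_arg1 are made up for illustration.

```python
import re
from typing import Optional

# Assumption: the challenge script always assigns a hex string to arg1,
# as in the sample response shown above.
ARG1_PATTERN = re.compile(r"arg1\s*=\s*'([0-9A-Fa-f]+)'")


def extract_arg1(body: str) -> Optional[str]:
    """Return the arg1 value if body looks like the JS challenge, else None."""
    match = ARG1_PATTERN.search(body)
    return match.group(1) if match else None


challenge = "var arg1='38B18065C640DD60B8A3AD8BFA4DE2D694EDD37C';"
print(extract_arg1(challenge))                       # the 40-character arg1 value
print(extract_arg1("<html><body>ok</body></html>"))  # None: an ordinary page
```

Returning None for ordinary pages lets the caller use the same function both as a detector and as an extractor.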

The full response looks like this:
[Figure: screenshot of the raw JS response]

Decryption process

Formatting JS

The strings in the JS are only hidden behind hexadecimal escapes such as \x50; paste the script into https://tool.lu/js to decode and format it.
[Figure: the formatted JS]
The formatted code shows what happens when the website is requested. While the HTML is loading, a check is made for an acw_sc__v2 entry in the cookie. If it is missing, the JS calls reload(x), which executes setCookie() to store the computed x as the acw_sc__v2 cookie, and the page is reloaded. If the cookie is still missing after the reload, the process repeats until it exists. A request that carries this cookie finally gets the real HTML.
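
The hexadecimal escapes can also be undone locally. The sketch below is only an illustration, not a replacement for the online formatter: it rewrites \xNN sequences to the characters they stand for and does nothing else (no formatting, no resolving of _0x55f3() calls). The function name decode_hex_escapes and the shortened sample line are assumptions made for the example.

```python
import re

# Matches a backslash, the letter x, and exactly two hex digits.
HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_hex_escapes(js_source: str) -> str:
    """Replace each hex escape with the character whose code it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), js_source)


# A made-up, shortened line using the two escapes visible in the sample response.
sample = r"var _0x4818=['\x63\x73']"
print(decode_hex_escapes(sample))  # var _0x4818=['cs']
```

The output is meant for reading only: a decoded character can be a quote or a backslash, so the result is not guaranteed to be valid JS.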

Find x

Next, we need to find where reload() is called and what is passed in as x. Searching the script gives:

setTimeout('reload(arg2)', 0x2);

Find arg2

We already know that x is arg2, so the next step is to find out how arg2 is generated:

var _0x23a392 = arg1[_0x55f3('0x19', 'Pg54')]();
arg2 = _0x23a392[_0x55f3('0x1b', 'z5O&')](_0x5e8b26);

Then find the assignment to _0x5e8b26:

var _0x5e8b26 = _0x55f3('0x3', 'jS1Y');

At first sight this code is confusing: what are names like _0x23a392 and _0x55f3? The next step answers that.

Decrypt obfuscation function

_0x55f3() is an obfuscated function and _0x23a392 is an obfuscated variable; apart from their unreadable names, they work like the functions and variables we normally write. You can find the definition of _0x55f3() in the formatted JS, but reading it line by line is not realistic. A simpler approach is to print the function's return values in the developer tools console.

First open the Chrome console with breakpoint debugging enabled and visit the website. Execution pauses at the script's debugger breakpoint and does not continue.
[Figure: execution paused at the debugger breakpoint]

Then print the values in the console:
[Figure: console output for the printed values]

With those values substituted, the code from the arg2 step above becomes:

var arg1 = '37EAB765A2E11E6F44CF4E4B95B3EADA60ED2AEB';
arg2 = arg1['unsbox']()['hexXor']('3000176000856006061501533003690027800375')

This means: arg1 calls unsbox(), and hexXor() is then called on the return value with the argument "3000176...". The result is arg2, i.e. x, which is also the value of acw_sc__v2 in the cookie.

unsbox and hexXor

The final task is to work out what these two methods do. We know that arg1 is a string, and the JS contains code like this:

String['prototype'][_0x55f3('0x14', 'Z*DM')] = function()

Printing _0x55f3('0x14', 'Z*DM') in the console gives 'unsbox'. hexXor can be found the same way.
[Figure: console output showing the decoded method name]
The code of the two methods is as follows:

// This notation is equivalent to String.prototype.hexXor
String['prototype']['hexXor'] = function(_0x4e08d8) {
  var _0x5a5d3b = '';
  for (var _0xe89588 = 0x0; _0xe89588 < this[_0x55f3('0x8', ')hRc')] && _0xe89588 < _0x4e08d8[_0x55f3('0xa', 'jE&^')]; _0xe89588 += 0x2) {
    var _0x401af1 = parseInt(this[_0x55f3('0xb', 'V2KE')](_0xe89588, _0xe89588 + 0x2), 0x10);
    var _0x105f59 = parseInt(_0x4e08d8[_0x55f3('0xd', 'XMW^')](_0xe89588, _0xe89588 + 0x2), 0x10);
    var _0x189e2c = (_0x401af1 ^ _0x105f59)[_0x55f3('0xf', 'W1FE')](0x10);
    if (_0x189e2c[_0x55f3('0x11', 'MGrv')] == 0x1) {
      _0x189e2c = '0' + _0x189e2c;
    }
    _0x5a5d3b += _0x189e2c;
  }
  return _0x5a5d3b;
};
String['prototype']['unsbox'] = function() {
  var _0x4b082b = [0xf, 0x23, 0x1d, 0x18, 0x21, 0x10, 0x1, 0x26, 0xa, 0x9, 0x13, 0x1f, 0x28, 0x1b, 0x16, 0x17, 0x19, 0xd, 0x6, 0xb, 0x27, 0x12, 0x14, 0x8, 0xe, 0x15, 0x20, 0x1a, 0x2, 0x1e, 0x7, 0x4, 0x11, 0x5, 0x3, 0x1c, 0x22, 0x25, 0xc, 0x24];
  var _0x4da0dc = [];
  var _0x12605e = '';
  for (var _0x20a7bf = 0x0; _0x20a7bf < this['length']; _0x20a7bf++) {
    var _0x385ee3 = this[_0x20a7bf];
    for (var _0x217721 = 0x0; _0x217721 < _0x4b082b[_0x55f3('0x16', 'aH*N')]; _0x217721++) {
      if (_0x4b082b[_0x217721] == _0x20a7bf + 0x1) {
        _0x4da0dc[_0x217721] = _0x385ee3;
      }
    }
  }
  _0x12605e = _0x4da0dc['join']('');
  return _0x12605e;
};

Python implementation

Rename the obfuscated variables, step through the code in the debugger, convert the hexadecimal literals to decimal (for example 0x0 = 0), and implement the two methods unsbox and hexXor in Python. On each request to the website, use a regular expression to extract arg1, call the two methods to generate acw_sc__v2, put it in the cookie, and request again.
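
One possible Python port of the two JS methods is sketched below. It is not the author's exact implementation. The permutation table is copied from the unsbox() code above, and the XOR key is the string passed to hexXor in the decoded call; the assumption is that the site you crawl uses the same table and key, so check them against the script you actually receive. The names hex_xor and make_cookie are illustrative.

```python
# Permutation table copied from the JS unsbox(); values are 1-based positions.
UNSBOX_TABLE = [
    0xf, 0x23, 0x1d, 0x18, 0x21, 0x10, 0x1, 0x26, 0xa, 0x9,
    0x13, 0x1f, 0x28, 0x1b, 0x16, 0x17, 0x19, 0xd, 0x6, 0xb,
    0x27, 0x12, 0x14, 0x8, 0xe, 0x15, 0x20, 0x1a, 0x2, 0x1e,
    0x7, 0x4, 0x11, 0x5, 0x3, 0x1c, 0x22, 0x25, 0xc, 0x24,
]

# Assumption: the key seen in the decoded hexXor() call; it may differ per site.
HEX_XOR_KEY = "3000176000856006061501533003690027800375"


def unsbox(value: str) -> str:
    """Reorder characters: output slot j takes the character at position table[j]."""
    return "".join(value[pos - 1] for pos in UNSBOX_TABLE if pos <= len(value))


def hex_xor(left: str, right: str) -> str:
    """XOR two hex strings two digits at a time, returning lowercase hex."""
    out = []
    for i in range(0, min(len(left), len(right)), 2):
        byte = int(left[i:i + 2], 16) ^ int(right[i:i + 2], 16)
        out.append(format(byte, "02x"))  # zero-pad, like the JS '0' + value branch
    return "".join(out)


def make_cookie(arg1: str, key: str = HEX_XOR_KEY) -> str:
    """Compute the acw_sc__v2 value: arg1.unsbox().hexXor(key)."""
    return hex_xor(unsbox(arg1), key)


print(make_cookie("37EAB765A2E11E6F44CF4E4B95B3EADA60ED2AEB"))
```

The nested loops of the JS version collapse into a single lookup because the table is a permutation of 1..40, so each output slot has exactly one source position. Using format(byte, "02x") avoids the '0x' prefix that Python's hex() would add.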

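Part of the code is as follows:

import re
import requests

# url and header are defined elsewhere; response is the first reply (the JS page).
# unsbox() and hexXor() are the Python versions of the two JS methods.
arg1 = re.search(r"arg1='([0-9A-Z]+)'", response.text).group(1)
cookie = hexXor(unsbox(arg1)).replace('0x', '')
cookies = {'acw_sc__v2': cookie}
response = requests.get(url, headers=header, cookies=cookies)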
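
Because the site may answer with the JS page again, the request can be wrapped in a small retry loop. The sketch below is an assumption about how one might structure it, not the author's code: fetch and make_cookie are hypothetical callables supplied by the caller, which keeps the loop independent of any HTTP library.

```python
import re
from typing import Callable, Dict

# Assumption: the JS page always assigns a hex string to arg1.
ARG1_PATTERN = re.compile(r"arg1\s*=\s*'([0-9A-Fa-f]+)'")


def fetch_past_challenge(
    url: str,
    fetch: Callable[[str, Dict[str, str]], str],
    make_cookie: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Request url, answering the JS page with a computed cookie until HTML arrives."""
    cookies: Dict[str, str] = {}
    for _ in range(max_attempts):
        body = fetch(url, cookies)
        match = ARG1_PATTERN.search(body)
        if match is None:
            return body  # no arg1 in the body, so treat it as the real page
        # arg1 changes between responses, so recompute the cookie every time.
        cookies = {"acw_sc__v2": make_cookie(match.group(1))}
    raise RuntimeError("still receiving the JS page after %d attempts" % max_attempts)
```

With requests, fetch could be something like lambda u, c: requests.get(u, headers=header, cookies=c).text, and make_cookie would combine the Python versions of unsbox and hexXor.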

Knowledge points - Prototype

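1. All String instances inherit from String.prototype; any change to String.prototype affects every String instance.
2. For example, the length property available on every string returns the length of the string.
3. prototype holds the properties and methods shared by all String values. Add a method with String.prototype.xxx = function() {} and call it with str.xxx().
4. String.prototype.hexXor is equivalent to String['prototype']['hexXor'], and str['hexXor'] is equivalent to str.hexXor.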

Other methods

The steps above decrypt the JS. If you only need a little data from the website and it is not worth spending that much time, you can instead visit the site in a browser, open the console, copy the value of the acw_sc__v2 cookie, and send it with your own requests. The cookie is valid for 30 minutes.
[Figure: the acw_sc__v2 cookie in the browser developer tools]
You can test it with curl on Linux/Mac:

curl http://xxxx -H "Cookie: acw_sc__v2=<pasted value>" -H "User-Agent: Chrome/54.0 (Windows NT 10.0)"
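
The same manual test can be written in Python. This is a rough equivalent of the curl command using only the standard library; build_request and fetch are illustrative names, and the cookie value is whatever you copied from the browser.

```python
import urllib.request


def build_request(url: str, cookie_value: str) -> urllib.request.Request:
    """Build a GET request carrying the copied acw_sc__v2 cookie."""
    return urllib.request.Request(
        url,
        headers={
            "Cookie": "acw_sc__v2=" + cookie_value,
            "User-Agent": "Chrome/54.0 (Windows NT 10.0)",
        },
    )


def fetch(url: str, cookie_value: str, timeout: float = 10.0) -> str:
    """Send the request and return the decoded response body."""
    request = build_request(url, cookie_value)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch(url, pasted_value) should return the real HTML for as long as the copied cookie is still valid.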

Epilogue

When I first tried to decrypt this JS, I gave up after a day: there was little material online, and I only needed to crawl the data temporarily, so it did not seem to matter. A few days later curiosity brought me back, and by trial and error I arrived at the approach described here. While developing crawlers I have also run into font encryption, nested eval-based JS encryption and more, and I will write those up when I have spare time. Let's share solutions and learn together.

Origin blog.csdn.net/CatchLight/article/details/108473902#comments_26608874