Hands-On Introduction to JS Reverse Engineering for Crawlers (1)

The target website for this article is given below in Base64-encoded form (any Base64 decoder will reveal it):

aHR0cHM6Ly9uZXdyYW5rLmNuLw==
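
If you do not have a decoding tool at hand, Python's standard library can decode the string. This is only a convenience sketch; the variable names are my own.

```python
import base64

encoded = "aHR0cHM6Ly9uZXdyYW5rLmNuLw=="

# b64decode returns bytes, so decode them as UTF-8 to get the URL as text.
site_url = base64.b64decode(encoded).decode("utf-8")
print(site_url)
```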

The goal is to crawl the news items in the site's news section.

Click to the next page and inspect the request in the browser's developer tools. The request URL and its parameters are easy to spot.

The POST request parameters are keyword, pageNumber, pageSize, nonce, and xyz.

Apart from the page number, the parameters that change between requests are nonce and xyz. Our goal is to work out how these two parameters are generated, that is, to find the functions that produce them. We can then either reimplement those functions in Python, or extract the JS code and call it from Python, and pass the generated values in the POST request to make the crawler work.

Reverse-engineering process:
1. Search for the parameter name nonce in the developer tools. The search returns results like the following:

2. Open any of the results, pretty-print the code, and locate the parameter in the JS source.

3. The positions of both parameters are now easy to see, so set a breakpoint there. Note that nonce is assigned from i, and the line above shows that i comes from the j function.

4. Set a breakpoint at line 658 and start debugging. The debugger shows a link to the j function; click it to jump to the source code of j, which is the logic that generates the nonce parameter.

Anyone with some experience in Java, JS, or C can see that this code randomly generates a 9-character string made up of digits and letters.

At this point, create a new JS file (for example newrank.js) in your editor (I use VS Code; Notepad also works) and copy the JS code above into it. That takes care of the first parameter. (You could also write an equivalent random function directly in Python.)
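
For readers who prefer the Python route, here is a minimal sketch of such a random function. The name make_nonce and the ALPHABET constant are my own, and the alphabet (digits plus lowercase letters) is an assumption; check the character set used by the site's j function and adjust it if it differs.

```python
import random
import string

# Assumption: digits plus lowercase letters. Verify against the site's j function.
ALPHABET = string.digits + string.ascii_lowercase


def make_nonce(length: int = 9, alphabet: str = ALPHABET) -> str:
    """Return a random string of `length` characters drawn from `alphabet`."""
    return "".join(random.choices(alphabet, k=length))


print(make_nonce())
```

Keeping the alphabet as a parameter makes it easy to match whatever set of characters the original JS uses.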

5. Continue debugging to find how the second parameter, xyz, is generated.
xyz is the return value of the d function. Stepping through shows that d is in fact the b function, as shown in the figure below.

Clicking through to the b function gives the following result:

This function simply computes the MD5 hash of its argument, nothing more. If the code is hard to follow, you can extract the JS directly, since it is just a function. When extracting, be sure to copy everything the function depends on; otherwise it will not run. This is a common pitfall for beginners, as you will find out when you test it yourself. Copy the JS code of this function into the JS file created earlier and save it.
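
Since the function is plain MD5, the same value can be computed with Python's hashlib instead of the extracted JS. A minimal sketch (the helper name md5_hex is my own), assuming the input is encoded as UTF-8 and the result is wanted as a lowercase hex string:

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of `text` (UTF-8 encoded) as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


print(md5_hex("abc"))
```

Comparing the output of this helper with the output of the extracted b function for the same input is a quick way to confirm that nothing was missed during extraction.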

6. Next, find the argument passed to the function that generates xyz. Repeating the debugging steps shows that the d function is called with h, and the line above shows that h is the string '/xdnphb/index/getMedia?AppKey=joker&keyword=&pageNumber=<page number>&pageSize=10' concatenated with '&nonce=' and the value of the nonce parameter.
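
The construction of h and the resulting xyz value can be sketched in Python as follows. The function names build_sign_source and make_xyz are hypothetical, and the sketch assumes, as described in step 5, that the signing function is plain MD5 over the UTF-8 bytes of h.

```python
import hashlib

# Path and fixed query parameters as observed in the debugger.
PATH = "/xdnphb/index/getMedia"


def build_sign_source(page: int, nonce: str) -> str:
    """Build the string h that is hashed to produce xyz."""
    return (
        f"{PATH}?AppKey=joker&keyword=&pageNumber={page}"
        f"&pageSize=10&nonce={nonce}"
    )


def make_xyz(page: int, nonce: str) -> str:
    """Hash h with MD5 (assumed to match the site's b function)."""
    return hashlib.md5(build_sign_source(page, nonce).encode("utf-8")).hexdigest()


# "abc123def" is a made-up nonce used only for illustration.
print(build_sign_source(3, "abc123def"))
print(make_xyz(3, "abc123def"))
```

Note that the nonce used to build h must be the same value that is sent in the request, so generate it once per request and reuse it.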

We have now found the functions that generate both parameters and the source of their inputs.
All that remains is to call those functions from the newly created JS file, generate the two parameters, and pass them in the POST request.

Note how the JS code is called from Python; see the complete crawler code below for details.

import requests
import pprint, time
import execjs
import hashlib

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
    "referer": "https://www.newrank.cn/public/news.html?",
}

# newrank.js is the JS file we created; it holds the two functions
# extracted from the site's JS source.
with open(r'D:\pythoncode\JS\newrank.js', encoding='utf-8') as f:
    js = f.read()
    ctx = execjs.compile(js)

for page in range(1, 21):
    # Call the JS function j to generate the first parameter, nonce.
    nonce = ctx.call('j')
    xyz = f'/xdnphb/index/getMedia?AppKey=joker&keyword=&pageNumber={page}&pageSize=10&nonce=' + nonce
    # Call the JS function b to generate the second parameter, xyz.
    xyz = ctx.call('b', xyz)
    # xyz can also be computed directly with Python's MD5:
    # xyz = hashlib.md5(xyz.encode(encoding='utf-8')).hexdigest()
    data = {
        'keyword': '',
        'pageNumber': str(page),
        'pageSize': '10',
        'nonce': nonce,
        'xyz': xyz,
    }
    # print(nonce, xyz)
    response = requests.post('https://www.newrank.cn/xdnphb/index/getMedia', headers=headers, data=data)
    print(response.status_code)
    # print(response.text)
    response_data = response.json()['value']
    # pprint.pprint(response_data)
    for item in response_data:
        print('Title:', item['title'], 'Published:', item['public_time'])
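
The loop above indexes the response directly, which raises an error if the server returns a body without a value field (for example when the signature is rejected). Below is a slightly more defensive sketch of the parsing step. The helper name extract_news is hypothetical and the sample body is made up for illustration; only the value, title, and public_time keys come from the crawler code.

```python
import json
from typing import List, Tuple


def extract_news(response_text: str) -> List[Tuple[str, str]]:
    """Return (title, public_time) pairs from a JSON response body."""
    payload = json.loads(response_text)
    # Fall back to an empty list if "value" is missing or null.
    items = payload.get("value") or []
    return [(item.get("title", ""), item.get("public_time", "")) for item in items]


# Made-up sample body, shaped like the fields the crawler reads.
sample = '{"value": [{"title": "Example headline", "public_time": "2022-01-01 08:00:00"}]}'
for title, published in extract_news(sample):
    print("Title:", title, "Published:", published)
```

In the crawler, the same helper could be called as extract_news(response.text).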

Seeing the crawl results below is quite satisfying. Feel free to leave a comment to discuss! If you need the JS code, leave a comment as well.

Origin blog.csdn.net/weixin_45387160/article/details/122333002