Notes: JavaScript Reverse Engineering for Crawlers

Common Browser Debugging Techniques


Panel introduction

Elements panel:

  • Used to view or modify the HTML nodes of the current page, including their attributes, CSS properties, and event listeners.
  • Changes to HTML and CSS are applied and displayed immediately.

Console panel:

  • Used to view debug logs and exception messages.
  • You can also enter JavaScript code here and run it directly, which is convenient for debugging.

Sources panel:

  • Used to view the page's HTML, JavaScript, and CSS source files.
  • You can also debug JavaScript here, for example by adding or editing breakpoints and watching how variables change.

Network panel:

  • Used to view the network requests made while the page loads, including each request and its response.

Performance panel:

  • Used to record and analyze the page's runtime activity, such as CPU usage and rendering performance.

Memory panel:

  • Used to record and analyze the page's memory usage, such as changes in memory consumption over time and the memory allocated to JavaScript objects and HTML nodes.

Application panel:

  • Used to inspect the resources loaded by the website, such as storage, caches, fonts, and images; some of these can also be modified or deleted.

Lighthouse panel (formerly Audits):

  • Used to analyze web apps and pages, collecting modern performance metrics and offering guidance on developer best practices.

View node events

Test site: https://spa2.scrape.center/

Right-click the pagination button and choose Inspect to open the Elements panel. There you can view the HTML source of the current page and the node that corresponds to any piece of content.

  1. Click the Styles tab on the right to see the CSS styles of the selected node.
  2. Click the Computed tab on the right to see the node's box model and its final computed CSS styles.
  3. Click the Event Listeners tab on the right to see the events bound to each node (all of these are native JavaScript events):
    • change: fires when the value of an HTML element changes
    • click: fires when the user clicks an HTML element
    • mouseover: fires when the user moves the mouse over an HTML element
    • mouseout: fires when the user moves the mouse away from an HTML element
    • keydown: fires when the user presses a key
    • load: fires when the browser finishes loading the page

A button usually has a click event bound to it, with the handling logic defined in JavaScript: clicking the button runs the corresponding JavaScript code.

Example: select the node that switches to page 2, and its bound events appear in the Event Listeners tab on the right.

chunk-vendors.77daf991.js:7 shows the file and line number of the event handler. Clicking it jumps to that location in the Sources panel. The code there is usually minified and hard to read, but the Sources panel can pretty-print it: click "{}" in the lower left corner.


Breakpoint debugging

  1. Set a breakpoint at the desired position by clicking the line number (a blue arrow appears). When the corresponding event fires, the browser pauses at the breakpoint and waits for you to debug.
  2. Step through the code and watch the call stack and variable values in the side panels to follow the execution logic at that location.

Example:

  1. Set a breakpoint (an arrow appears on the line number, and the Breakpoints tab on the right updates its list)
  2. This breakpoint is known to handle the click event of the pagination button, so click the page 2 button to trigger it
  3. The page now shows a "Paused in debugger" notice: the browser has stopped executing at the breakpoint
  4. The code pauses at line 4446. The callback parameter e, which represents the click event, is a PointerEvent. In the Scope panel on the right you can see the value of every variable; for example, Local lists the local variables of the current method, and each variable can be expanded to inspect its properties
  5. Note the method o. Besides Local, the scope also has a Closure section (Jr), where you can see the definition of o and the parameters it receives
  6. You can add o.apply to the Watch panel (expand it and click FunctionLocation to jump to its source code)
  7. You can switch to the Console panel, enter any JavaScript code, run it, and see the result. Any variable accessible in the current context can be referenced directly; for example, to see the first element of arguments, type arguments[0]
  8. Four important buttons in the Sources panel (all of them step through the code, but each behaves differently):
    1. Step over next function call (runs statement by statement; the most used), shortcut: F10
    2. Step into next function call (enters the body of the called method), shortcut: F11
    3. Step out of current function (leaves the current method), shortcut: Shift+F11
    4. Step (single step), shortcut: F9

Observe the call stack

While debugging, each time you press F10 and land at a new position, you can check the Call Stack panel on the right to see the whole chain of calls (here execution is currently in the ct method, which was called by ot, which was called by pt; click an entry to jump to the corresponding code).

This is often very useful: you can trace back through the execution flow of a piece of logic and quickly find a way in.


Resume JavaScript execution

Click the blue Resume script execution button (shortcut: F8). The browser runs straight to the next breakpoint; if there are no more breakpoints, it returns to its normal state.


Ajax breakpoint

Besides setting breakpoints manually via a DOM node's event listeners, you can also use an Ajax (XHR/fetch) breakpoint, which pauses execution whenever a matching Ajax request is sent.

Example:

  1. Check the Ajax request in the Network panel (here, click pagination button 2 and note the URL the content is requested from)
  2. Remove all previous breakpoints, switch to the Sources panel, expand XHR/fetch Breakpoints, click the + button, and enter a fragment of the URL seen earlier, for example /api/movie
  3. Click pagination button 3 to trigger the Ajax request for page 3; the page pauses at the breakpoint.
  4. The Sources panel shows that execution paused at the moment the Ajax request is finally sent, that is, at the send method of the underlying XMLHttpRequest: d is an XMLHttpRequest, and the page's code calls its send method;
  5. Walking back through the Call Stack, you find the onFetchData method, which involves three parameters: limit, offset, and token;
  6. To remove the breakpoint, uncheck it under XHR/fetch Breakpoints

Rewrite JavaScript files

Principle: JavaScript is downloaded from the corresponding server and executed in the browser.

Ineffective modification:

Editing the JavaScript file directly in the browser: the added code disappears as soon as the page is refreshed.

Effective modification:

  • Use the ReRes browser extension, or a proxy server such as Charles or Fiddler.
  • Use the browser's built-in developer tools (Overrides):
    1. Using the Ajax breakpoint method, find the place where the Ajax request is constructed, and determine that the parameter a received by the callback contains the result of the Ajax request
    2. The goal now is to print the response to the console whenever the Ajax request succeeds;
    3. Open the Overrides tab, click the + button, and create and select a folder named ChromeOverrides to hold all the JavaScript files you want to change;
    4. Copy the entire (pretty-printed) content of the JavaScript file located earlier, add the code in a local text editor, and paste it back into the JavaScript file (the original file, before pretty-printing):
      1. Edit the JavaScript source in the local editor
      2. A file with unsaved changes is marked with * after its name
      3. After saving, a small dot appears under the file icon
      4. At the same time, a new JavaScript file is created in the Overrides tab to replace the original
    5. You can now remove all breakpoints and refresh the page. The response is printed in the console: the variable a is output, its data field holds the Ajax response, and the change survives a refresh
    6. Further JavaScript logic can be added later, such as sending the variable a to a remote server through an API and having the server save it. That completes the process of intercepting the Ajax response directly and saving the data.
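
Step 6 only mentions the receiving server in passing. As a rough sketch of that side, assuming the injected JavaScript POSTs the variable a as JSON to a local endpoint: the handler below accepts the POST, answers the CORS preflight a browser would send first, and keeps the payloads in memory. The names CollectHandler, SAVED, and start_collector are made up for illustration, and the in-memory list stands in for whatever storage you would really use.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

SAVED = []  # hypothetical in-memory store; replace with a file or database


class CollectHandler(BaseHTTPRequestHandler):
    def _cors(self):
        # Let a script running on another origin call this server.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the CORS preflight sent before a cross-origin JSON POST.
        self.send_response(204)
        self._cors()
        self.end_headers()

    def do_POST(self):
        # Read the JSON body, store it, and report how many items are saved.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        SAVED.append(payload)
        body = json.dumps({"saved": len(SAVED)}).encode("utf-8")
        self.send_response(200)
        self._cors()
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the terminal quiet


def start_collector(port=0):
    """Start the collector in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), CollectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_collector(8000) would listen on http://127.0.0.1:8000/; the overridden JavaScript file could then send the variable a there with a JSON POST.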

Use of JavaScript Hooks

Hooking is a technique that, while a program is running, overrides one of its methods and adds custom code before and after the original method is called.
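
The idea is not specific to JavaScript. As a small illustration in Python, the sketch below wraps base64.b64encode so that every call is recorded while the original behavior is preserved. The hook helper and the calls list are hypothetical names chosen for this example.

```python
import base64
import functools


def hook(obj, attr, calls):
    """Replace obj.attr with a wrapper that records each call and delegates to the original."""
    original = getattr(obj, attr)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)    # the original behavior is unchanged
        calls.append((args, kwargs, result))  # custom code added after the call
        return result

    setattr(obj, attr, wrapper)
    return original  # returned so the caller can undo the hook


calls = []
original = hook(base64, "b64encode", calls)
token = base64.b64encode(b"hello")      # goes through the wrapper
setattr(base64, "b64encode", original)  # restore the original function
```

After this runs, calls holds the arguments and return value of the hooked call, which is the same information the JavaScript hook exposes through console.log and debugger.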

Tampermonkey is a browser extension that automatically runs user-defined JavaScript when the browser loads a page. Anything that can be done in JavaScript can be done with Tampermonkey, such as automatic crawling, automatic page modification, and automatic event handling. This makes it useful in JavaScript reverse engineering for analyzing encrypted and obfuscated code (see the Tampermonkey documentation).

Example:

  1. Log in with a username and password at https://login1.scrape.center/
  2. The Network panel shows that clicking the "Login" button sends a POST request to https://login1.scrape.center/ whose body is a token string that looks Base64-encoded. This suggests that the site encodes the username and password before submitting them to the server for verification;
  3. Opening the Sources panel shows that the page code is obfuscated. There are two ways to find where the token is generated:

Ajax breakpoint

  1. Since this request is an Ajax request, it can be caught with an XHR breakpoint (entering the domain name, login1.scrape.center, as the match text is enough)
  2. Log in again; the breakpoint triggers, and you can work through the call stack step by step to the entry point, which turns out to be the onSubmit method
  3. The top of the stack at this breakpoint also contains a number of Promise-related frames. What we really want is the place where the username and password are processed and then Base64-encoded, and these request-related calls have little to do with that entry point

Hook

Background: in browser JavaScript, Base64 encoding is done with the btoa method, so btoa is the method to hook here.

  1. Create a new Tampermonkey script
  2. Enter the following script:
    // ==UserScript==
    // @name         HookBase64
    // @namespace    https://login1.scrape.center/
    // @version      0.1
    // @description  Hook Base64 encode function
    // @author       Hugo
    // @match        https://login1.scrape.center/
    // @grant        none
    // ==/UserScript==
    
    (function() {
        'use strict'
        function hook(object, attr) {
            var func = object[attr]
            object[attr] = function () {
                console.log('hooked', object, attr)
                var ret = func.apply(object, arguments)
                debugger
                return ret
            }
        }
        hook(window, 'btoa')
    })()
    
    
    // The UserScript header comes first. Its most important fields are @name and @match, which give the script name and the URL it applies to.
    
    // Next comes the hook method with two positional parameters, object and attr: the script hooks the attr property of object.
    // To hook the alert method, pass window as object and the string 'alert' as attr.
    // In this example the target is btoa, so the call passes window and 'btoa' (in JavaScript, Base64 encoding is done with btoa).
    
    // var func = object[attr]
    // First save object[attr] in a variable, so that calling func later performs the site's original behavior.
    
    // object[attr] = function () {
    // Then overwrite object[attr] with a new method.
    
    // var ret = func.apply(object, arguments)
    // Inside the new method, func.apply calls the original method again, so the site's existing behavior is unaffected.
    
    // console.log('hooked', object, attr)
    // debugger
    // Custom code can now be added before and after the call to func, for example logging to the console with console.log or pausing with debugger.
    // debugger is a JavaScript keyword dedicated to breakpoint debugging.
    
    // hook(window, 'btoa')
    // Finally, call hook with the window object and the string 'btoa'.
  3. Choose File and save (shortcut: Cmd+S), return to the login page, and refresh it; the script is now active on the current page.
  4. Enter the username and password and log in again. Execution pauses at the debugger line, so the hook worked, which shows that the JavaScript code does call btoa during execution.
  5. The Console panel now also shows the window object and the btoa attribute that were logged, confirming the hook.
  6. Check the call stack again: the Promise-related frames are gone, and the chain of calls leading to btoa is clearly visible. It also leads to the same onSubmit method found with the Ajax breakpoint.
  7. The Scope panel also shows the arguments variable. arguments holds the parameters passed to btoa (the username and password serialized as a JSON string), and ret is the value returned by btoa (that is, the token parameter of the Ajax request).
  8. Next, add a breakpoint to verify the flow. For example, add a breakpoint on the line that calls the encode method, click the blue button to resume JavaScript execution past the breakpoint set by the Tampermonkey script, and click the login button again; the code pauses at the newly added breakpoint. You can then enter this.form in the Watch panel to check that it holds the username and password entered in the form.
  9. Click Step over (F10) to jump to the place hooked by the Tampermonkey script. The return value at this point is the token.
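
Steps 7 to 9 suggest that the token is simply the Base64 encoding of the JSON-serialized form. A minimal Python sketch of that relationship follows; the field names username and password are an assumption based on the description above and should be checked against the arguments value seen in the debugger. make_token and decode_token are illustrative names.

```python
import base64
import json


def make_token(username, password):
    # Assumed field names; compact separators mimic JSON.stringify output.
    payload = json.dumps(
        {"username": username, "password": password},
        separators=(",", ":"),
    )
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")


def decode_token(token):
    # Reverse the process to inspect a token captured in the Network panel.
    return json.loads(base64.b64decode(token).decode("utf-8"))
```

Decoding a captured token with decode_token is a quick way to confirm the hypothesis before relying on make_token. Note that btoa only accepts Latin-1 characters, so the two can differ for non-ASCII credentials.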

How the infinite debugger works and how to bypass it

Basic principle: debugger is a JavaScript keyword dedicated to breakpoint debugging, and website developers also use it to stop people from debugging their pages when writing crawlers.

Test site: https://antispider8.scrape.center/

Site behavior: as soon as the developer tools are opened, execution pauses at a breakpoint. Clicking the resume button only lands on the breakpoint again, in an endless loop. The code shows that this is done with setInterval, which executes the debugger statement once per second. (Similar tricks use infinite for loops, infinite while loops, infinite recursion, and so on.)

Workarounds: disable the breakpoint, or replace the file

Disable the breakpoint

△ Deactivating all breakpoints works, but then you cannot set breakpoints anywhere else for debugging, so it is not a good solution.

A more targeted option: right-click the line number of the debugger statement, select "Never pause here", and click the blue resume button. The page no longer enters the infinite debugger mode.

In this example you can also add a conditional breakpoint, which pauses only when its expression evaluates to true (for instance, when a variable exceeds a given value). Since this is an infinite loop with no particular variable to test, you can simply use the expression false, which means the breakpoint never triggers; the effect is the same as "Never pause here".

Replace the file

In the Overrides tab, enable local overrides, copy the pretty-printed source into a local editor, delete or comment out the debugger keyword, and copy the modified code back into the site's JavaScript file. Once the override is saved, refresh the page: it no longer enters the infinite debugger mode.
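
When the file is large, removing the debugger statements by hand is tedious. The following is a rough Python sketch that does it with a regular expression before the file is pasted back as an override. It is a plain text substitution, not a JavaScript parser, so it would also alter the word debugger inside string literals or comments; strip_debugger is an illustrative name.

```python
import re

# Match the standalone keyword, but not property access (obj.debugger)
# or longer identifiers (mydebugger, $debugger, debuggerX).
DEBUGGER_RE = re.compile(r"(?<![\w$.])debugger(?![\w$])\s*;?")


def strip_debugger(source):
    # Replace each debugger statement with an empty statement, so that
    # constructs such as `if (x) debugger; else y()` stay syntactically valid.
    return DEBUGGER_RE.sub(";", source)
```

Substituting an empty statement rather than deleting the text is a deliberate choice: deleting could leave an if or else without a body.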


Use Python to simulate JavaScript execution

Python does not necessarily have libraries equivalent to the ones the JavaScript code uses, so rewriting the code completely in Python is often difficult. In that case, you can have Python execute the JavaScript directly and take the result.

Test site: https://spa7.scrape.center/

Features: each card shows an encrypted string that is tied to the player's information, and the string is different for every player.

Goal: work out the algorithm behind this encrypted string and reproduce its generation in a program.

Environment:

  • Install the Python library for executing JavaScript: pyexecjs
  • Install a JavaScript runtime: Node.js (official site: nodejs.org)

Confirm the environment:

import execjs
print(execjs.get().name)

# Output: Node.js (V8)

Website Analysis:

  1. In the Sources panel, locate the logic that generates the encrypted string. (The css folder holds the site's styles, the img folder holds images, and the js folder holds main.js plus the JavaScript libraries the page references.)
  2. main.js first declares a list of player information and then calls the encryption algorithm on it. The getToken method takes the information of a single player (one element of that list) as its parameter. It turns this.key (a fixed string) into a key, extracts the player's name and Base64-encodes it, concatenates the encoded name with the birthday, height, and weight, encrypts the result with DES using the key, and returns the result.
  3. The encryption depends on the crypto-js library, which the site references directly. Once the crypto-js JavaScript file has run, CryptoJS is injected into the browser's global environment, so other methods can use the CryptoJS object directly.

Simulated call:

Two things need to be executed in the simulation:

  • The JavaScript in crypto-js.min.js, which declares the CryptoJS object
  • The definition of getToken, which declares the getToken method

Start with getToken: copy it, create a new js file, paste it in, and rewrite it.

After rewriting (the original is a component method that reads the key from this.key):

function getToken(player) {
    let key = CryptoJS.enc.Utf8.parse('fipFfVsZsTda94hJNKJfLoaqyqMZFFimwLt')
    const {name, birthday, height, weight} = player
    let base64Name = CryptoJS.enc.Base64.stringify(CryptoJS.enc.Utf8.parse(name))
    let encrypted = CryptoJS.DES.encrypt(`${base64Name}${birthday}${height}${weight}`, key, {
        mode: CryptoJS.mode.ECB,
        padding: CryptoJS.pad.Pkcs7
    })
    return encrypted.toString()
}

Main changes:

  1. Take the method out of methods and declare it as a standalone function
  2. Use the key string directly instead of this.key

Running this function requires the CryptoJS object. Calling it on its own raises a "CryptoJS is not defined" error, so crypto-js.min.js has to be executed as well.

To do that, copy all the code in crypto-js.min.js into the same JS file as the rewritten getToken code.

Next, use the PyExecJS library to run it. The code is as follows:

import execjs
import json

# Information for a single player, used as the test input
item = {
    'name': '尼科拉-约基奇',
    'image': 'jokic.png',
    'birthday': '1995-02-19',
    'height': '213cm',
    'weight': '128.8KG'
}

# Path of the JS file created earlier
file = 'crypto.js'
# Get the JavaScript runtime
node = execjs.get()
# compile returns a JavaScript context object; think of it as a context
# in which CryptoJS and getToken have already been declared
with open(file, encoding='utf-8') as f:
    ctx = node.compile(f.read())

# Build a JavaScript call expression with the player info embedded as JSON
js = f"getToken({json.dumps(item, ensure_ascii=False)})"
print(js)
# Evaluate the expression in the context to simulate running the JavaScript
result = ctx.eval(js)
print(result)

Running this as it stands produces an error: CryptoJS is not defined.

The cause lies in the first two lines of crypto-js.min.js, which declare a self-executing JavaScript function: a function that is declared and then immediately called.

!function(t, e) {
    "object" == typeof exports 
    ? module.exports = exports = e() 
    : "function" == typeof define && define.amd 
    ? define([], e) 
    : t.CryptoJS = e()
    ...
}

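// The function defined in crypto-js.min.js takes two parameters, t and e: t is this (the window object in a browser), and e is a function that defines the core of CryptoJS.

// In a browser, neither exports nor define exists, so both conditions are false and t.CryptoJS = e() is what runs.
// This attaches the CryptoJS object to this, which is the browser's global window object, so it can be used directly afterwards.

// When run locally (in the Node.js-based JavaScript environment), the exports object exists (it is used to export definitions), so the first condition is true and module.exports = exports = e() is what runs.
// This exports the result of e() as a whole; e is the function that follows, which defines all the encryption implementations, in other words the entire library.

// The key point: as a result, no CryptoJS object is declared and nothing is attached to the global object, which causes the "not defined" error later.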

In short: because the browser and local environments differ, running crypto-js.min.js locally neither declares a CryptoJS object nor attaches one to the global object, hence the "not defined" error.

The simplest fix is to declare a CryptoJS variable yourself and assign e() to it, which initializes CryptoJS.

var CryptoJS;
!function(t, e) {
    CryptoJS = e();
    "object" == typeof exports ? module.exports = exports = e() : "function" == typeof define && define.amd ? define([], e) : t.CryptoJS = e()

Run the Python script again: it now generates the encrypted string.


Use Node.js to simulate JavaScript execution

Simulated execution

Copy the entire content of the site's crypto-js.min.js and save it locally (here as crypto2.js). This is the dependency.

Then create a new main.js file. This is the script that holds the algorithm itself.

const CryptoJS = require("./crypto2");
// Use Node.js's require to import crypto2.js and assign the result to CryptoJS, which initializes the CryptoJS object

function getToken(player) {
    // Encode the string with enc.Utf8 to produce the key
    let key = CryptoJS.enc.Utf8.parse("fipFfVsZsTda94hJNKJfLoaqyqMZFFimwLt");
    // Extract the individual fields from the player object
    const { name, birthday, height, weight } = player;
    // Encode name with enc.Utf8 and enc.Base64
    let base64Name = CryptoJS.enc.Base64.stringify(CryptoJS.enc.Utf8.parse(name));
    // Join the encoded name with the other player fields and encrypt with DES using the key
    let encrypted = CryptoJS.DES.encrypt(
        `${base64Name}${birthday}${height}${weight}`,
        key,
        {
            mode: CryptoJS.mode.ECB,
            padding: CryptoJS.pad.Pkcs7,
        }
    );
    // Return the final encrypted result as a string
    return encrypted.toString();
}

const player = {
    name: "凯文-杜兰特",
    image: "durant.png",
    birthday: "1989-09-29",
    height: "208cm",
    weight: "108.9KG"
}
// Call getToken with player and print the result
console.log(getToken(player))

Run node main.js on the command line to execute the script and get the result.

Why did the local crypto-js.min.js file have to be modified when Python was used to run the JavaScript, but not when Node.js runs it directly?

!function(t, e) {
    "object" == typeof exports 
    ? module.exports = exports = e() 
    : "function" == typeof define && define.amd 
    ? define([], e) 
    : t.CryptoJS = e()
    ...
}

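// Look again at the first two lines of crypto-js.min.js.

// In a browser, what actually runs is t.CryptoJS = e(), which creates the global CryptoJS object from the result of e(), because the first two conditions are both false (a browser has no exports or define objects).
// Locally, what actually runs is module.exports = exports = e(), which exports the result of e(), because the first condition is already true (the Node.js-based JavaScript environment has an exports object).

// In the Python setup, however, nothing receives that export, so the CryptoJS object is never initialized inside the script and later calls fail with a "not defined" error. That is why the script had to be modified to initialize CryptoJS by hand.
// In Node.js, the require function works together with exports: it loads the script and assigns the exported result to the CryptoJS variable, which initializes CryptoJS. After that, the methods of its DES, enc, and other objects can be called for encryption and encoding.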

Build a service to connect Python and Node.js

Idea:

Use Node.js to expose the algorithm above as an HTTP service. Python can then call the service directly, passing the player information in the request body and receiving the encrypted string in the HTTP response.

The HTTP service is built with express (the most popular HTTP server framework for Node.js). To install it, run this in the folder that contains main.js: npm i express

Rewrite main.js as follows:

const CryptoJS = require("./crypto2");
// Use Node.js's require to import crypto2.js and assign the result to CryptoJS, which initializes the CryptoJS object
const express = require("express");
// Import the express module (the HTTP server framework)
const app = express();
// Create an express instance and assign it to app
const port = 3000;
// Port the server listens on
app.use(express.json());
// Parse JSON request bodies into req.body

function getToken(player) {
    // ... the encryption logic is unchanged ...
}

app.post("/", (req, res) => {
    // Handle POST requests: parse the request and return a response
    const data = req.body;
    // The request body is the player data
    res.send(getToken(data));
    // Respond with the result of calling getToken on data
});

app.listen(port, () => {
    // Listen for requests on port 3000
    console.log(`Example app listening on port ${port}!`)
    // Log a message once the server is up
});

Run the script with node main.js so that the express service listens on local port 3000.

Then create a Python file that calls the API with requests and passes in the player data:

import requests

data = {
    "name": "凯文-杜兰特",
    "image": "durant.png",
    "birthday": "1989-09-29",
    "height": "208cm",
    "weight": "108.9KG"
}

url = 'http://localhost:3000'
response = requests.post(url, json=data)
print(response.text)

Simulate JavaScript execution in the browser environment

Suppose you have found an encryption algorithm in the browser and want to obtain the final token. Simulating the JavaScript with Python or Node.js comes down to two steps:

  • Download all the dependent libraries locally
  • Use PyExecJS or Node.js to load the libraries and call the encryption method

Two problems often come up:

  • Environment differences: Node.js has no global window object; its global object is global instead. If the JavaScript file references window anywhere, it cannot run directly in Node.js (the references must first be rewritten to use global);
  • Finding dependencies: extracting every JavaScript library that the encryption method needs can take a lot of time, because the method will not run locally if even one is missing.

Consider this: all the logic, environment, and libraries that the encryption method depends on are already loaded in the browser. It would be ideal to run the JavaScript (that is, the encryption method) directly in the browser environment.

Environment: playwright is used here for browser-assisted reverse engineering.

Install the playwright library: pip install playwright

Install the browser binaries: playwright install

Test site: https://spa2.scrape.center/

Website Analysis:

  1. The Network panel shows that turning pages requests a URL whose path contains /api/movie;
  2. Right-click the pagination button and choose Inspect;
  3. Open Event Listeners, expand click, and click the file link after ul.el-pager;
  4. This opens the (index) file under the js folder in the Sources panel; in the right-hand pane, add an XHR breakpoint for /api/movie;
  5. Click the next page button to trigger the breakpoint; the code pauses at d.send(u);
  6. In the Call Stack pane on the right, walk back to onFetchData (this code involves limit, offset, and token, three fields closely tied to pagination);
  7. Remove the XHR breakpoint, set a regular breakpoint here, resume, and turn the page again;
  8. The breakpoint triggers again, and the code pauses at line 170, where the breakpoint was just set;
  9. Hover over the variables to see their values; for example, this.page is 3 and this.limit is 10. Click Step over next function call (F10);
  10. this.$store.state.url.index is the string "/api/movie" and a is 20. Click Step over next function call (F10) again;
  11. At this point a little analysis is possible:
    • this.limit is limit, a constant 10, meaning each page holds 10 items
    • a is offset, the pagination offset: 0 for the first page, 10 for the second, and so on. Formula: a = (this.page - 1) * this.limit
    • e is token, computed as e = Object(i["a"])(this.$store.state.url.index, a). Both arguments are known, so Object(i["a"]) must be the core encryption logic
  12. Next, trace the i["a"] method
  13. The code is as follows:
        "7d92": function(t, e, r) {
            "use strict";
            r("6b54");
            var n = r("3452");
            function i() {
                for (var t = Math.round((new Date).getTime() / 1e3).toString(), e = arguments.length, r = new Array(e), i = 0; i < e; i++)
                    r[i] = arguments[i];
                r.push(t);
                var o = n.SHA1(r.join(",")).toString(n.enc.Hex)
                  , c = n.enc.Base64.stringify(n.enc.Utf8.parse([o, t].join(",")));
                return c
            }
            e["a"] = i
        },
  14. At this point you can see that the encryption logic combines a timestamp, SHA1, Base64, and list operations, which is fairly involved; analyzing it in depth would take more time.
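
For comparison with the browser-based approach that follows, here is a rough Python port of the function i shown in step 13, together with the offset formula from step 11. It assumes the snippet is the complete logic and has not been checked against the server here. make_token and page_params are illustrative names, and the timestamp parameter exists only to make the function testable.

```python
import base64
import hashlib
import math
import time


def make_token(*args, timestamp=None):
    """Port of i(): SHA1 over the comma-joined arguments plus a timestamp, then Base64."""
    # The original uses Math.round(new Date().getTime() / 1e3);
    # floor(x + 0.5) mimics JavaScript's rounding.
    if timestamp is None:
        timestamp = math.floor(time.time() + 0.5)
    t = str(timestamp)
    parts = [str(a) for a in args] + [t]
    digest = hashlib.sha1(",".join(parts).encode("utf-8")).hexdigest()
    return base64.b64encode(f"{digest},{t}".encode("utf-8")).decode("ascii")


def page_params(page, limit=10):
    # offset = (page - 1) * limit, as derived in step 11
    offset = (page - 1) * limit
    return {
        "limit": limit,
        "offset": offset,
        "token": make_token("/api/movie", offset),
    }
```

page_params(3) would yield the query parameters for page 3. A hand port like this is only practical because the function is short; with heavier obfuscation the browser-based approach below scales better.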

Solutions:

  • The browser has already loaded the context and the dependent libraries, so the page's own method can be called directly;
  • To call a method that is local to a module, it only needs to be attached to the global window object;
  • The easiest way to attach it here is to modify the source code directly;
  • Because the source code is served to the browser, playwright's request interception can be used to replace any file you want with a modified copy.

In practice:

Copy the entire JavaScript file that contains onFetchData (saved as chunk.js) and modify the code:

    ...
var a = (this.page - 1) * this.limit
  , e = Object(i["a"])(this.$store.state.url.index, a);
window.encrypt = Object(i["a"]);
    ...

// Attach Object(i["a"]) to the encrypt property (the name is arbitrary) of the global window object.
// After that, calling window.encrypt is equivalent to calling Object(i["a"]).

Write a Python script:

from playwright.sync_api import sync_playwright
import requests

BASE_URL = 'https://spa2.scrape.center'
# Request URL template (taken from the Network panel)
INDEX_URL = BASE_URL + '/api/movie?limit={limit}&offset={offset}&token={token}'
MAX_PAGE = 5
LIMIT = 10

playwright = sync_playwright().start()
# Launch a headless browser with playwright
browser = playwright.webkit.launch()
# Open a new page
page = browser.new_page()
# Key route: the first argument is a glob matching the URL of the file the
# site normally loads (the ** prefix covers the scheme and host); the second
# uses route.fulfill to serve the modified local JS file instead
page.route(
    "**/js/chunk-10192a00.243cb8b7.js",
    lambda route: route.fulfill(path="./chunk.js")
)
page.goto(BASE_URL)


def get_token(offset):
    # Run extra JavaScript inside the page: call window.encrypt with the
    # fixed path /api/movie and the varying offset, and return its result
    result = page.evaluate('''() => {
        return window.encrypt("%s", "%s")
    }''' % ('/api/movie', offset))
    return result


for i in range(MAX_PAGE):
    # Loop over MAX_PAGE pages: compute the offset and fetch a token for it
    offset = i * LIMIT
    token = get_token(offset)
    # Fill in the URL template to build the request URL
    index_url = INDEX_URL.format(limit=LIMIT, offset=offset, token=token)
    response = requests.get(index_url)
    print('response', response.json())

# Release the browser and the playwright driver
browser.close()
playwright.stop()

Run the Python script to get web page data.

Origin blog.csdn.net/weixin_58695100/article/details/123581341