Scraping website data in the browser console and exporting it to an Excel file (with a look at Content Security Policy, CSP)

Table of contents

1. Target content

2. Observe the structure of the web page

3. Realize data crawling and downloading in the browser console

4. It is forbidden to load external script files (CSP)

5. For more information about CSP:


1. Target content

        The target is a lottery website. The goal is to collect the draw date and the winning numbers of each draw, arrange the data as a table, and export the table to an Excel file.

        One page holds 55 - 26 + 1 = 30 rows of data, and the plan is to collect five pages.
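        As a quick sanity check on the figures above, the expected total can be worked out in the console. This is only a sketch; the variable names are illustrative and not part of the original script.

```javascript
// Row counts taken from the text above; the variable names are illustrative.
const rowsPerPage = 55 - 26 + 1; // 30 rows on one page
const pagesToFetch = 5;
const expectedRows = rowsPerPage * pagesToFetch; // 150 rows in total

console.log(expectedRows);
```

        Once the scrape has finished, the number of collected rows can be compared against this value to confirm that no page was skipped or read twice.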

2. Observe the structure of the web page

        The draw date can be reached through tbody > tr, as the second child node of each row (childNodes[1]);

        The winning numbers of a draw can be read in order through .qiu > .qiu-item-small;

        Finally, look at how the "next page" control at the bottom of the page handles its click event.
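        The date and number selectors noted above can be turned into a small row reader. The sketch below is an illustration only: buildRecord and readRow are hypothetical names, and it assumes each row exposes its date at childNodes[1] and its numbers as .qiu-item-small elements inside .qiu, as described in the text.

```javascript
// Hypothetical helper: build one record from the date text and the ball texts.
function buildRecord(dateText, ballTexts) {
  const record = { time: dateText.trim() };
  ballTexts.forEach((text, index) => {
    record[`ball_${index + 1}`] = text.trim();
  });
  return record;
}

// Hypothetical helper: read one <tr> using the structure observed above.
function readRow(tr) {
  const dateText = tr.childNodes[1].innerText;
  const balls = tr.querySelectorAll('.qiu .qiu-item-small');
  return buildRecord(dateText, Array.from(balls, (el) => el.innerText));
}
```

        In the console, Array.from(document.querySelectorAll('tbody tr'), readRow) would then produce one record per row of the current page.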

 

        Clicking a page number or the "next page" button does not change the URL. Instead, a piece of JS code runs and re-renders the contents of the table. To collect the data of each page, you can therefore grab either the element for a specific page or the "next page" element, then construct and trigger a click event on it.

        Here we take the "next page" element and trigger its click event repeatedly to collect the data of several consecutive pages.
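        The click-then-wait loop described above can be sketched with async/await. This is a minimal illustration under stated assumptions, not the script used in this article: sleep, collectPages, readPage and goNext are hypothetical names, and the fixed delay assumes the table re-renders within that time.

```javascript
// Hypothetical helper: pause for `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: collect `pages` pages of rows.
// readPage() returns the rows currently rendered; goNext() clicks "next page".
async function collectPages(readPage, goNext, pages, delayMs = 100) {
  const rows = [];
  for (let page = 1; page <= pages; page++) {
    rows.push(...readPage());
    if (page < pages) {
      goNext();
      await sleep(delayMs); // give the table time to re-render
    }
  }
  return rows;
}
```

        In the browser, goNext would be a function that clicks the "next page" element and readPage a function that reads the rows of the table. Passing both in as parameters keeps the loop independent of the page's markup.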

3. Realize data crawling and downloading in the browser console

 

// Load alasql, which is used here to process the data and export it to Excel
let alasqlScript = document.createElement('script');
alasqlScript.src = "https://cdn.jsdelivr.net/npm/[email protected]/dist/alasql.min.js";
document.body.appendChild(alasqlScript);

// Holds the final data
let resultArr = [];

// Main function: current page index, index of the last page needed, callback (exports the data)
function action(currentPage, needPage, callback){
    // One .qiu container and one <tr> per draw
    const boxArr = document.querySelectorAll('tbody .qiu');
    const line = document.querySelectorAll("tbody tr");

    for(let i = 0 ; i < boxArr.length ; i ++){
        // Build the object for a single row
        let obj = {
            time:line[i].childNodes[1].innerText,
            ball_1:boxArr[i].childNodes[0].innerText,
            ball_2:boxArr[i].childNodes[1].innerText,
            ball_3:boxArr[i].childNodes[2].innerText,
            ball_4:boxArr[i].childNodes[3].innerText,
            ball_5:boxArr[i].childNodes[4].innerText,
            ball_6:boxArr[i].childNodes[5].innerText,
            ball_7:boxArr[i].childNodes[6].innerText,
        };
        resultArr.push(obj);
    }
    // Not enough pages collected yet, so keep fetching
    if(needPage > currentPage){
        // Trigger the click on "next page" so the table re-renders with the next page's data
        document.querySelector('.layui-laypage-next').click();
        // Wait for the table to re-render before collecting again; if the page is slow,
        // the new data may not have arrived yet and the old data would be read instead
        setTimeout(()=>{
            action(currentPage + 1, needPage, callback);
        },100);
    }else{
        // Run the callback to export the data
        callback();
    }
}

// Collect 5 pages of data (page indices 0 through 4)
action(0, 4, ()=>{
    // Write the data in resultArr to an XLS spreadsheet named ballData.xls
    alasql('select * into XLS("ballData.xls",{headers:true}) from ?',[resultArr]);
})

        Run the code. A save dialog pops up for the generated spreadsheet; choose a location to store it.

        Open the spreadsheet file and, when prompted, confirm that you trust its contents.

 

4. It is forbidden to load external script files (CSP)

        Looking back at the method above for scraping page data in the browser console and saving it locally, one key step is loading an external JS file: the alasql library is imported to help organize the data and save it to a local file.

        Of course, you can also do without any extra library: once the final resultArr is ready, print it to the console and copy and paste it into a local file by hand. That is more cumbersome, though.
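        A middle ground between loading a library and copying by hand is to build a CSV file with plain browser APIs, which Excel can open. The following is only a sketch: toCsv and downloadCsv are hypothetical helpers, the file name is arbitrary, and the rows are assumed to be flat objects that all share the keys of the first row.

```javascript
// Hypothetical helper: turn an array of flat objects into CSV text.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const escapeCell = (value) => {
    const text = String(value ?? '');
    // Quote cells containing commas, quotes, or line breaks; double inner quotes.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = [headers.map(escapeCell).join(',')];
  for (const row of rows) {
    lines.push(headers.map((key) => escapeCell(row[key])).join(','));
  }
  return lines.join('\r\n');
}

// Browser-only part: offer the CSV text as a file download.
function downloadCsv(rows, filename = 'ballData.csv') {
  const blob = new Blob([toCsv(rows)], { type: 'text/csv;charset=utf-8' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  link.click();
  // Release the temporary object URL once the click has been dispatched.
  setTimeout(() => URL.revokeObjectURL(link.href), 0);
}
```

        Calling downloadCsv(resultArr) in the console would then save the collected rows without fetching any external script.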

        How can a site stop users from loading external script files through the console, and so limit what they can do? This can be achieved by configuring CSP (Content Security Policy) for the website.

        CSP is essentially a whitelist (allowlist) mechanism. The developer tells the client explicitly which external resources may be loaded and executed. Enforcement is carried out entirely by the browser; developers only need to provide the configuration.

        There are two ways to configure CSP: through the Content-Security-Policy response header, or in a meta tag of the web page. The two are largely equivalent, although a few directives (for example frame-ancestors, report-uri and sandbox) are not supported in a meta tag.

        The first way is to set the response header (for example in the Nginx configuration, or in the headers returned by the backend):

// Only JS files from the current origin may be loaded; no plugin resources are trusted
Content-Security-Policy: script-src 'self'; object-src 'none';
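        As an illustration of the backend option, the header above could be set from a Node.js server. Node.js is an assumption made for this sketch (the article names only Nginx and backend response headers in general), and buildCspValue and applyCsp are hypothetical helpers; the only API relied on is the response object's setHeader method.

```javascript
// Hypothetical helper: serialize a directive map into a CSP header value.
function buildCspValue(directives) {
  return Object.entries(directives)
    .map(([name, sources]) => `${name} ${sources.join(' ')}`)
    .join('; ');
}

// Hypothetical helper: set the header on a Node.js http.ServerResponse-like object.
function applyCsp(res, directives) {
  res.setHeader('Content-Security-Policy', buildCspValue(directives));
}

// Same policy as the header shown above.
const policy = { 'script-src': ["'self'"], 'object-src': ["'none'"] };
console.log(buildCspValue(policy)); // script-src 'self'; object-src 'none'
```

        Inside a request handler, applyCsp(res, policy) would be called before the response body is sent.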

        The second way is configured by the front end when the web page is developed:

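// Only script files from the current origin may be loaded; images may be loaded from anywhere
// default-src sets the default for every directive, * allows any source; it has the lowest priority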
<meta http-equiv="Content-Security-Policy" content="default-src *; script-src 'self'; img-src *;">
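        To make the whitelist idea and the default-src fallback concrete, here is a heavily simplified sketch of how a script URL might be checked against a policy string. It is not how browsers actually implement CSP (nonces, hashes, schemes, wildcard hosts and paths are all ignored), and parsePolicy and isScriptAllowed are hypothetical names.

```javascript
// Simplified sketch of source-list matching; all names are hypothetical.
function parsePolicy(policy) {
  const directives = {};
  for (const part of policy.split(';')) {
    const [name, ...sources] = part.trim().split(/\s+/);
    if (name) directives[name.toLowerCase()] = sources;
  }
  return directives;
}

function isScriptAllowed(policy, pageOrigin, scriptUrl) {
  const directives = parsePolicy(policy);
  // script-src wins when present; default-src is only the fallback.
  const sources = directives['script-src'] ?? directives['default-src'];
  if (sources === undefined) return true; // no applicable directive: no restriction
  const scriptOrigin = new URL(scriptUrl, pageOrigin).origin;
  return sources.some((source) => {
    if (source === '*') return true;
    if (source === "'self'") return scriptOrigin === pageOrigin;
    if (source === "'none'") return false;
    return source === scriptOrigin; // exact origin such as https://cdn.example.com
  });
}
```

        With the meta policy above, a script served from the page's own origin passes the check, while one from a CDN does not, because script-src 'self' overrides the permissive default-src *.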

        To try this out, add a <meta> tag carrying a CSP policy to the site by hand, and then make a simple attempt to load the library from the console:

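// Inject a CSP <meta> tag that only trusts the current origin
let meta = document.createElement('meta');
meta.httpEquiv = "Content-Security-Policy";
meta.content = `default-src 'self';script-src 'self';object-src 'none';`;
document.head.appendChild(meta);


// Try to load alasql from the CDN again; the policy above should now block the request
let alasqlScript = document.createElement('script');
alasqlScript.src = "https://cdn.jsdelivr.net/npm/[email protected]/dist/alasql.min.js";
document.body.appendChild(alasqlScript);

// On a page where alasql was not loaded earlier, this throws a ReferenceError
alasql;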
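        Rather than reading the console by eye, the outcome of the attempt can also be observed in code. The sketch below wraps script injection in a Promise; loadScript is a hypothetical helper, and it assumes the browser fires the script element's error event when the policy blocks the request, which is the usual behaviour for blocked or failed script loads.

```javascript
// Hypothetical helper: inject a <script> and report whether it loaded.
function loadScript(src, doc = document) {
  return new Promise((resolve, reject) => {
    const script = doc.createElement('script');
    script.onload = () => resolve(src);
    // A blocked or failed request is reported through the error event.
    script.onerror = () => reject(new Error(`Failed to load ${src}`));
    script.src = src;
    doc.body.appendChild(script);
  });
}
```

        Usage in the console could look like loadScript(url).then(() => console.log('loaded'), (err) => console.warn(err.message)). The same helper also avoids calling a library before it has finished loading, since the Promise only resolves after the load event.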

        

 

5. For more information about CSP:

[1] CSP (Content Security Policy) in HTTP - Nuggets

[2] Analysis of Web content security strategy - Zhihu

Origin blog.csdn.net/hao_13/article/details/130755627