Using Node.js with Python to crack X-Ca-Signature and grab CSDN blog points data

The scarcest resource in the world is time.

At some point I realized, with horror, that a third of my life's 30,000 days had already passed. I had always told myself I was too young for anything to matter, so I let the time slip away, and now I am ashamed of it. I have achieved nothing so far, and a line keeps coming to mind: may I never regret the years I wasted, nor be ashamed of having done nothing.

These days I am an enthusiast interested in everything, knowing that the past cannot be undone but the future can still be pursued. There are grand ambitions in my chest, yet the road ahead is long and I often feel powerless. All I can do is pick one thing and write some blogs, to make up, a little, for those disappointing younger years.

Preparation

Install Node.js, install Fiddler, and configure HTTPS capture. (Later I realized Chrome's Network panel can capture this directly; Fiddler's HTTPS capture is mainly useful for mobile devices.)

Node.js has plenty of online tutorials and is very easy to install. Configuring HTTPS in Fiddler is a bit more involved: you need to generate and trust Fiddler's root certificate. There are tutorials online for this too; it took me a long time to get it working, and even then I could only capture HTTPS from Chrome; Firefox still wouldn't work.

The reason for using Node.js is simply that I already had it installed, and installing crypto-js with npm is convenient. That library can be used to reproduce CSDN's X-Ca-Signature parameter. The algorithm behind X-Ca-Signature is HmacSHA256; if you understand the principle, you can just as well implement it directly in Python. HmacSHA256 itself is worth a separate blog post, but since crypto-js is at hand, I'll just use it.
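As an aside, HMAC-SHA256 also ships in Python's standard library, so the primitive itself needs no external dependency. A minimal sketch that should produce the same Base64 digest as crypto-js's HmacSHA256 plus enc.Base64:

import base64
import hashlib
import hmac

# HMAC-SHA256 over a message, Base64-encoded: the same operation
# crypto-js performs with HmacSHA256(...).toString(enc.Base64)
def hmac_sha256_base64(message: str, secret: str) -> str:
    digest = hmac.new(secret.encode("utf-8"),
                      message.encode("utf-8"),
                      hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")

print(hmac_sha256_base64("message", "secret"))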

Install crypto-js somewhere on the system, then create a JS file that simulates X-Ca-Signature. The path I used is D:\nodejsProject\csdn_blog_static:

npm install crypto-js



The points interface

Log in to CSDN and go to Data Watch -> Blog Data.

After setting up Fiddler's HTTPS capture (Chrome's Network panel works too), open the Blog Data page under Data Watch; the captured request is shown in the figure.
[figure: the captured request]

For an ordinary website, copying the request headers is enough to simulate a logged-in session and pull data from the interface. CSDN's request, however, carries two values that change every time and can only be used once: X-Ca-Nonce and X-Ca-Signature. These two have to be computed afresh for every request.

After persistent keyword searching in Chrome's DevTools, I finally traced the computation of these two values to the key JS file: app.js (presumably the business code; chunk_vendors.js is presumably third-party libraries bundled together). Both files are minified and obfuscated. Chrome's built-in Pretty print makes them easier to read, but the mangled variable and method names cannot be restored, so it takes patience.

After some time in the debugger, I found the generation method of X-Ca-Nonce:
[figure: the deobfuscated X-Ca-Nonce generator]
and the method that produces X-Ca-Signature:
[figure: the deobfuscated X-Ca-Signature method, named b]
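Incidentally, that nonce generator is the well-known inline UUID v4 snippet (the "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx" pattern, reproduced in csdn_HmacSHA256.js below). If you ever port the whole thing to Python, the standard library produces the same format:

import uuid

nonce = str(uuid.uuid4())  # same 8-4-4-4-12 hex format, version 4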

The b method ultimately calls:

var R = s.a.HmacSHA256(h, a), S = R.toString(s.a.enc.Base64);

The parameter h of HmacSHA256 is built from part of the request headers plus X-Ca-Nonce; a is a fixed appSecret value (which presumably will change at some point). After some trial and error, I managed to write the corresponding JS code, given in full below.

At this point, we can start writing our crawling code.


The code

nodejs

csdn_HmacSHA256.js:

var Crypto = require('crypto-js');

// X-Ca-Nonce: the classic inline UUID v4 generator
var nonceFunc = function() {
    return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, function(e) {
        var n = 16 * Math.random() | 0,
            t = "x" === e ? n : 3 & n | 8;
        return t.toString(16);
    });
};
var nonce = nonceFunc(), appSecret = "9znpamsyl2c7cdrr9sas0le9vbc3r6ba", h = "";

// String to sign: method, Accept, then three empty fields (the X-Ca-* headers
// look like Alibaba Cloud API Gateway's signing scheme, in which these slots
// would be Content-MD5, Content-Type and Date, all empty for this GET)
h += "".concat("GET", "\n");
h += "".concat("application/json, text/plain, */*", "\n");
h += "".concat("", "\n");
h += "".concat("", "\n");
h += "".concat("", "\n");
// signed headers, then the request path
h += "x-ca-key:203803574\n"; // for now this also appears to be a fixed value
h += "x-ca-nonce:" + nonce + "\n";
h += "/blog-console-api/v1/data/blog_statistics";

// HMAC-SHA256 over the string to sign, Base64-encoded
var hash = Crypto.HmacSHA256(h, appSecret);
var hashInBase64 = Crypto.enc.Base64.stringify(hash);
console.log(nonce);
console.log(hashInBase64);
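Running node csdn_HmacSHA256.js prints the nonce on the first line and the signature on the second, which is exactly the order the Python script below reads them in.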

python

CsdnKiramario.py:

import urllib.request
import urllib.parse
import os
import gzip
import json
import datetime
from openpyxl import load_workbook

class CsdnKiramario(object):

    def __init__(self):
        # run the nodejs script that computes the nonce and the signature
        output = os.popen('node D:\\nodejsProject\\csdn_blog_static\\csdn_HmacSHA256.js')
        # read the script's two console lines; not very rigorous
        nonce = output.readline()
        signature = output.readline()

        # build the request; the Cookie must be your own logged-in cookie
        targetUrl = 'https://bizapi.csdn.net/blog-console-api/v1/data/blog_statistics'
        headers = {
            'Host': 'bizapi.csdn.net',
            'Connection': 'keep-alive',
            'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
            'X-Ca-Signature-Headers': 'x-ca-key,x-ca-nonce',
            'X-Ca-Signature': signature.strip(),
            'X-Ca-Nonce': nonce.strip(),
            'sec-ch-ua-mobile': '?0',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
            'Accept': 'application/json, text/plain, */*',
            'X-Ca-Key': '203803574',
            'Origin': 'https://mp.csdn.net',
            'Sec-Fetch-Site': 'same-site',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Dest': 'empty',
            'Referer': 'https://mp.csdn.net/console/dataWatch/analysis/allarticle',
            'Accept-Encoding': 'gzip, deflate',  # only gzip is handled below, so do not advertise br
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cookie': 'replace with your own cookie'
        }
        self.req = urllib.request.Request(url=targetUrl, method='GET', headers=headers)

    def run(self):
        # fetch the interface
        with urllib.request.urlopen(self.req) as f:
            res = f.read()

        # the response body is gzip-compressed JSON
        ret = gzip.decompress(res).decode("utf-8")
        ret = json.loads(ret)

        # ret['data'] is a list of {name, num} pairs; index them by name
        data = ret['data']
        analysDict = {}
        for statistic in data:
            name = statistic['name']
            num = statistic['num']
            analysDict[name] = num

        # append one row to the blog spreadsheet; apart from 日期 (date), the
        # header names must match the Chinese 'name' values the API returns
        # (total articles, followers, likes, comments, page views, points,
        # favorites, overall rank)
        analysHeader = ["日期", "文章总数", "粉丝数", "点赞数", "评论数", "访问量", "积分", "收藏数", "总排名"]
        sourcePath = "D:\\麦芒\\私域计划\\博客\\博客数据.xlsx"
        wb = load_workbook(sourcePath)
        ws_active = wb['Sheet']
        row = []
        for headerName in analysHeader:
            if headerName == "日期":  # the date column gets today's date
                row.append(datetime.date.today().strftime("%Y/%m/%d"))
            else:
                row.append(analysDict[headerName])
        ws_active.append(row)
        wb.save(sourcePath)


if __name__ == "__main__":
    instance = CsdnKiramario()
    instance.run()

After execution, today's statistics are appended as a new row to the spreadsheet.
[figure: execution result]


More ways

If you don't want to run it by hand, you can add the Python script to the Windows Task Scheduler and have it run once a day.
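Another improvement worth mentioning: since HMAC-SHA256 and UUID generation are both in Python's standard library, the Node.js dependency can be dropped entirely. A sketch of the equivalent computation, with the appSecret, x-ca-key and string-to-sign layout copied from the Node script above (CSDN may change any of them at any time):

import base64
import hashlib
import hmac
import uuid

def make_nonce_and_signature():
    # same UUID v4 format the obfuscated JS produces
    nonce = str(uuid.uuid4())
    app_secret = "9znpamsyl2c7cdrr9sas0le9vbc3r6ba"
    # string to sign, line for line as in csdn_HmacSHA256.js
    h = "GET\n"
    h += "application/json, text/plain, */*\n"
    h += "\n\n\n"  # the three empty fields
    h += "x-ca-key:203803574\n"
    h += "x-ca-nonce:" + nonce + "\n"
    h += "/blog-console-api/v1/data/blog_statistics"
    digest = hmac.new(app_secret.encode("utf-8"), h.encode("utf-8"),
                      hashlib.sha256).digest()
    return nonce, base64.b64encode(digest).decode("ascii")

With this, CsdnKiramario.__init__ could call the sketch's make_nonce_and_signature() directly instead of shelling out to node through os.popen.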

Reference blogs
https://blog.csdn.net/DylanYuan/article/details/81533105
https://www.cnblogs.com/ruiy/p/6422586.html
