08-02 Cracking Hongshu Novel Network: crawling a novel site (6 layers of anti-scraping, including JS encryption, dynamic JS rendering, CSS anti-scraping, etc.)

 


 

Cracking the encryption of Hongshu Novel Network

1.1 Target site URL

https://www.hongshu.com/content/3052/3317-98805.html

Taking one specific chapter as an example, we will crack this site's encryption and crawl the full text of the novel.

1.2 Site analysis

1.2.1 Target resource analysis

Our goal is the novel's text, so first check whether the response to a direct request for https://www.hongshu.com/content/3052/3317-98805.html contains the data we want.

Disappointingly, the response body does not contain the chapter text.

So where does the text come from? Most likely the page issues a second request via ajax, receives JSON data, and then renders it into the page.

With that idea in mind, go looking for the JSON responses.

Sure enough, two responses look suspicious:

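Suspicious field in the first response:
'key': it takes little imagination to guess that this is definitely useful

Suspicious fields in the second response:
content: encrypted; the name alone suggests it is the encrypted novel text
other: also encrypted; its purpose is not obvious yet, but it is almost certainly tied to the novel text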

Next, see how to simulate these two requests.

Both are POST requests, and the form data is simple: bid, jid, and cid all come straight from the URL.

https://www.hongshu.com/content/3052/3317-98805.html
https://www.hongshu.com/content/{bid}/{jid}-{cid}.html

Also, I had never logged in, so the requests do not need a cookie.
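The URL pattern above can be turned into the form fields with a small parser. The following is a minimal sketch using only the standard library. The names CHAPTER_URL_RE, parse_chapter_url, and build_payloads are illustrative and not from the site or the original post, and the pattern assumes all three ids are purely numeric. The method values getchptkey and getchpcontent are the ones used in the code section of this post.

```python
import re

# Pattern for chapter URLs of the form /content/{bid}/{jid}-{cid}.html
# (assumption: the three ids are always digits)
CHAPTER_URL_RE = re.compile(
    r"https://www\.hongshu\.com/content/(?P<bid>\d+)/(?P<jid>\d+)-(?P<cid>\d+)\.html"
)


def parse_chapter_url(url):
    """Return the bid, jid and cid fields of a chapter URL as strings."""
    match = CHAPTER_URL_RE.fullmatch(url)
    if match is None:
        raise ValueError(f"not a chapter URL: {url!r}")
    return match.groupdict()


def build_payloads(ids):
    """Build the form data for the key request and the content request."""
    key_form = {"method": "getchptkey", "bid": ids["bid"], "cid": ids["cid"]}
    content_form = {"method": "getchpcontent", "bid": ids["bid"],
                    "cid": ids["cid"], "jid": ids["jid"]}
    return key_form, content_form


ids = parse_chapter_url("https://www.hongshu.com/content/3052/3317-98805.html")
print(ids)
```

Keeping the parsing separate from the requests makes it easy to loop over many chapter URLs later.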

1.2.2 Cracking the decryption algorithm

The English word for this operation is "decrypt", so do a full-text search for it in the site's scripts. (You can also try other keywords, such as content, other, or bookajax.do.)

Happily, two results match. Open each one and read the code.

Going through them one by one, you reach the code that handles the response, with some familiar faces in it: data.content, data.other, and key.

Clearly, this is the decryption algorithm.

To confirm, set a breakpoint there and inspect the values.

That confirms it: this is the decryption algorithm for content.

Using breakpoints, step into each call and copy out all of the JS decryption code.

Copying everything out yields the following functions (the bodies are long, so they are elided here; the full code is in section 1.3.5):

function base64decode(str) {
    .......
}


function hs_decrypt(str, key) {
    .......
}


function long2str(v, w) {
    .......
}


function utf8to16(str) {
    .......
}

Then take a look at the decrypted result.

Run the decryption call in the browser console and check the output.

The content comes out. Very happy.

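1.2.3 Cracking the missing characters

The happiness does not last, though. A careful reader will notice that the novel text is missing characters.

That makes other the prime suspect: something must be going on in there.

Decrypting other shows that the result is a pile of JS code.

Unsurprisingly, this JS code must be related to the missing characters.

Most likely the JS manipulates the HTML and renders the missing content onto the page.

How to verify this? Easily: manually create an html file, copy both the html code and the JS code into it, and run it to see the result.

There is one catch: what content decrypts to is escaped HTML (entities such as &lt;p&gt;), not real tags. So the entities first have to be converted into tags, which is easy: let the browser render content once.

After running it, you get real html tags.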
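If a browser round trip is not wanted for this step, the same conversion can probably be done with Python's standard html.unescape. This is a minimal sketch, assuming the decrypted content contains only standard HTML entities. The sample string is modeled on the fragment quoted in the comments of section 1.3.8; the rest of its markup is made up for illustration.

```python
from html import unescape

# Sample modeled on the decrypted output: escaped markup, not real tags.
escaped = "&lt;p&gt;&lt;span class='context_kw9'&gt;&lt;/span&gt;text&lt;/p&gt;"

# Convert the entities back into real tags.
tags = unescape(escaped)
print(tags)
```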

Now verify the relationship between other and content by putting the unescaped content and the decrypted JS from other into the same html file.

Run it, and all of the content appears.

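1.2.4 Cracking the CSS anti-scraping: JS injection

However, although we now know what other does and the content is rendered, there is still a problem. A careful reader will notice that the previously missing characters have been placed in the CSS content property of the ::before pseudo-elements.

In that case, calling a parsing library directly on the page will never yield the missing characters.

What to do? Don't panic, there is still JS injection.

When the browser engine renders content and other, we can inject a piece of JS whose logic is:

1. Use JS to locate every element that has a ::before pseudo-element
2. Read the value of its CSS content property (the missing characters)
3. Insert the missing characters between the tags (innerText)

Based on this, the following JS can be written for injection: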

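var element_list = document.querySelectorAll('#divChpContent span');
for (var i = 0; i < element_list.length; i++) {
    // Computed value of the content property of the :before pseudo-element
    var content = window.getComputedStyle(
        element_list[i], ':before'
    ).getPropertyValue('content');
    // String.prototype.trim() takes no arguments, so strip the surrounding quotes explicitly
    element_list[i].innerText = content.replace(/^"|"$/g, '');
}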
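getComputedStyle returns the content property as a quoted string, or as a keyword such as none when the pseudo-element has no content. If the raw values are collected and cleaned on the Python side instead of in the injected script, a helper like the following could be used. This is only a sketch; clean_css_content is a hypothetical name and not part of the original code.

```python
def clean_css_content(value):
    """Strip the quotes from a computed CSS `content` value.

    Returns an empty string for the keywords `none` and `normal`,
    which mean that the pseudo-element carries no text.
    """
    value = value.strip()
    if value in ("none", "normal"):
        return ""
    # Remove one pair of matching surrounding quotes, if present.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


print(clean_css_content('"字"'))
```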

1.3 Code

1.3.1 Imports

from requests_html import HTMLSession, HTML    # HTML parses local html strings
import execjs   # PyExecJS: runs JS code from Python through an external runtime such as Node.js

1.3.2 Define the spider class

class Spider():
    def __init__(self):
        self.session = HTMLSession()
        self.book_info_api = 'https://www.hongshu.com/bookajax.do'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
        }

        # Take one specific chapter as an example
        self.book_url = 'https://www.hongshu.com/content/3052/3317-98805.html'
        # Extract bid, jid and cid from the URL with the template search of requests_html
        book_url_search = HTML(html=self.book_url).search('https://www.hongshu.com/content/{bid}/{jid}-{cid}.html')
        self.bid = book_url_search['bid']
        self.jid = book_url_search['jid']
        self.cid = book_url_search['cid']

1.3.3 Get the key needed for decryption

    def get_key(self):
        data = {
            'method': 'getchptkey',
            'bid': self.bid,
            'cid': self.cid,
        }

        r = self.session.post(url=self.book_info_api, data=data, headers=self.headers)
        res = r.json()
        # The API reports success with this exact message ("chapter content fetched successfully")
        if res.get('msg') == '获取章节内容成功':
            return res.get('key')
        else:
            print('Failed to get the key')
1.3.4 Get the encrypted novel content: content and other

    def get_book_info(self):
        data = {
            'method': 'getchpcontent',
            'bid': self.bid,
            'cid': self.cid,
            'jid': self.jid,
        }

        r = self.session.post(url=self.book_info_api, data=data, headers=self.headers)
        res = r.json()
        # Same success message as in get_key
        if res.get('msg') == '获取章节内容成功':
            return {'content': res.get('content'), 'other': res.get('other')}
        else:
            print('Failed to get the content')
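Both request methods print a message and implicitly return None on failure, so an error only surfaces later, when decryption is attempted. One possible alternative is to validate the JSON in one place and raise immediately. The following is a sketch: extract_fields and ChapterFetchError are hypothetical names, and the success message is the one the methods above compare against.

```python
# Success message checked by the spider ("chapter content fetched successfully")
SUCCESS_MSG = '获取章节内容成功'


class ChapterFetchError(RuntimeError):
    """Raised when the API response does not carry the expected data."""


def extract_fields(res, *fields):
    """Return the requested fields of a decoded JSON response as a dict."""
    if res.get('msg') != SUCCESS_MSG:
        raise ChapterFetchError(f"unexpected msg: {res.get('msg')!r}")
    missing = [name for name in fields if name not in res]
    if missing:
        raise ChapterFetchError(f"missing fields: {missing}")
    return {name: res[name] for name in fields}


# Example with a fake response; real responses would come from r.json().
fake = {'msg': SUCCESS_MSG, 'content': 'abc', 'other': 'def'}
print(extract_fields(fake, 'content', 'other'))
```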

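1.3.5 Create the decryption JS file

In the project directory, create a JS file named decrypt.js and put the extracted decryption functions in it. Note that hs_decrypt calls a helper named str2long that is not among the functions listed below; it has to be copied from the site's script as well, otherwise the call fails with a ReferenceError: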

function base64decode(str) {
    var c1, c2, c3, c4, base64DecodeChars = new Array(-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,62,-1,-1,-1,63,52,53,54,55,56,57,58,59,60,61,-1,-1,-1,-1,-1,-1,-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,-1,-1,-1,-1,-1,-1,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,-1,-1,-1,-1,-1);
    var i, len, out;
    len = str.length;
    i = 0;
    out = "";
    while (i < len) {
        do {
            c1 = base64DecodeChars[str.charCodeAt(i++) & 0xff];
        } while (i < len && c1 == -1);if (c1 == -1)
            break;
        do {
            c2 = base64DecodeChars[str.charCodeAt(i++) & 0xff];
        } while (i < len && c2 == -1);if (c2 == -1)
            break;
        out += String.fromCharCode((c1 << 2) | ((c2 & 0x30) >> 4));
        do {
            c3 = str.charCodeAt(i++) & 0xff;
            if (c3 == 61)
                return out;
            c3 = base64DecodeChars[c3];
        } while (i < len && c3 == -1);if (c3 == -1)
            break;
        out += String.fromCharCode(((c2 & 0XF) << 4) | ((c3 & 0x3C) >> 2));
        do {
            c4 = str.charCodeAt(i++) & 0xff;
            if (c4 == 61)
                return out;
            c4 = base64DecodeChars[c4];
        } while (i < len && c4 == -1);if (c4 == -1)
            break;
        out += String.fromCharCode(((c3 & 0x03) << 6) | c4);
    }
    return out;
}


function hs_decrypt(str, key) {
    if (str == "") {
        return "";
    }
    var v = str2long(str, false);
    var k = str2long(key, false);
    var n = v.length - 1;
    var z = v[n - 1]
      , y = v[0]
      , delta = 0x9E3779B9;
    var mx, e, q = Math.floor(6 + 52 / (n + 1)), sum = q * delta & 0xffffffff;
    while (sum != 0) {
        e = sum >>> 2 & 3;
        for (var p = n; p > 0; p--) {
            z = v[p - 1];
            mx = (z >>> 5 ^ y << 2) + (y >>> 3 ^ z << 4) ^ (sum ^ y) + (k[p & 3 ^ e] ^ z);
            y = v[p] = v[p] - mx & 0xffffffff;
        }
        z = v[n];
        mx = (z >>> 5 ^ y << 2) + (y >>> 3 ^ z << 4) ^ (sum ^ y) + (k[p & 3 ^ e] ^ z);
        y = v[0] = v[0] - mx & 0xffffffff;
        sum = sum - delta & 0xffffffff;
    }
    return long2str(v, true);
}


function long2str(v, w) {
    var vl = v.length;
    var sl = v[vl - 1] & 0xffffffff;
    for (var i = 0; i < vl; i++) {
        v[i] = String.fromCharCode(v[i] & 0xff, v[i] >>> 8 & 0xff, v[i] >>> 16 & 0xff, v[i] >>> 24 & 0xff);
    }
    if (w) {
        return v.join('').substring(0, sl);
    } else {
        return v.join('');
    }
}


function utf8to16(str) {
    var out, i, len, c;
    var char2, char3;
    out = "";
    len = str.length;
    i = 0;
    while (i < len) {
        c = str.charCodeAt(i++);
        switch (c >> 4) {
        case 0:
        case 1:
        case 2:
        case 3:
        case 4:
        case 5:
        case 6:
        case 7:
            out += str.charAt(i - 1);
            break;
        case 12:
        case 13:
            char2 = str.charCodeAt(i++);
            out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
            break;
        case 14:
            char2 = str.charCodeAt(i++);
            char3 = str.charCodeAt(i++);
            out += String.fromCharCode(((c & 0x0F) << 12) | ((char2 & 0x3F) << 6) | ((char3 & 0x3F) << 0));
            break;
        }
    }
    return out;
}

// Finally, define one function that chains all the decryption functions together
function decrypt(str,key){
    return utf8to16(hs_decrypt(base64decode(str), key))
}
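The wrapper applies three layers: base64decode turns the transport text into a byte string, hs_decrypt decrypts those bytes with the key, and utf8to16 decodes the resulting UTF-8 bytes into text. The outer two layers have standard-library equivalents in Python, which the sketch below illustrates. The decrypt_bytes parameter is a hypothetical stand-in for a port of hs_decrypt, which is not attempted here; an identity function is used purely to exercise the outer layers.

```python
import base64


def decode_layers(b64_text, decrypt_bytes):
    """Apply base64 decoding, a decryption callable, then UTF-8 decoding."""
    raw = base64.b64decode(b64_text)      # counterpart of base64decode
    plain = decrypt_bytes(raw)            # placeholder for hs_decrypt with a key
    return plain.decode('utf-8')          # counterpart of utf8to16


# Round trip with a no-op "cipher" to show the data flow only.
sample = base64.b64encode('红薯'.encode('utf-8')).decode('ascii')
print(decode_layers(sample, lambda data: data))
```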

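1.3.6 Define the decryption method

    def decrypt(self, string, key):
        # Open in read mode; 'wt' would truncate decrypt.js and make read() fail
        with open('decrypt.js', 'r', encoding='utf-8') as f:
            js_code = f.read()
        js_obj = execjs.compile(js_code)
        res = js_obj.call('decrypt', string, key)
        return res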

1.3.7 Render the ::before content, inject JS, and get the chapter text

    def render_content(self,content,js):
        html = '<html><head><script>'+js+'</script></head><body>'+content+'</body></html>'
        r = HTML(html=html)
        r.render(script='''
var span_list = document.getElementsByTagName("span")
for (var i=0;i<span_list.length;i++){
    var content = window.getComputedStyle(
        span_list[i], ':before'
    ).getPropertyValue('content');
    span_list[i].innerText = content.replace('"',"").replace('"',"");
}
        ''',reload=False,)   # render with the browser engine and inject the JS code
        print(r.find('body',first=True).text)   # parse out the novel text

1.3.8 Define the run method

    def run(self):
        key = self.get_key()
        book_info = self.get_book_info()
        r1 = self.decrypt(book_info['content'], key)
        # Decryption yields escaped html such as &lt;p&gt;&lt;span class='context_kw9'&gt;
        r1 = HTML(html=r1).text   # unescape the entities to get real tags such as <span>
        r2 = self.decrypt(book_info['other'], key)
        self.render_content(r1, r2)

1.3.9 Run

if __name__ == '__main__':
    hongshu_spider = Spider()
    hongshu_spider.run()   # the original never called run(), so nothing was executed

 


 


Origin www.cnblogs.com/cherish937426/p/11955396.html