Crawler JS reverse engineering, cracking encrypted parameters: the pitfalls of a third-party Douyin data analysis platform

Crawler JS reverse-engineering series
In this series I catalog the JS reverse-engineering problems I run into while building crawlers and walk through each one in terms of symptoms, approach, and code implementation, for your reference.

About crawlers
Of all the directions a programmer can take, crawling is the one closest to money. It also opens many paths: you can move into big data or artificial intelligence, or transfer to back-end work. Doing crawling well requires a fairly comprehensive technology stack. If you are interested in crawlers, you are welcome to add V: 13809090874 and exchange ideas.

Disclaimer:
This content is for learning and discussion only and must not be used commercially. If it infringes on any rights, contact the author to have it removed.

1. Analyze the request header


The content of the request header is as follows:

POST /api/tiktok/ranking/tiktok_goods_sales_rank?ts=1602072436550&he=wqdvbhcTvwD2hXibw4L5RyPHeCsEZ8f7wrD2w4Sh&sign=5e1501c104122454 HTTP/1.1
Host: api.douchacha.com
Connection: keep-alive
Content-Length: 258
dcc-href: https://www.douchacha.com/cable
d-v: NCxaZGJRd3BDVFBzZmlaa1Z4WkhibXdvUUtVc2ZldzZsZ2hkZkR3b09UdWpWeGVPUk13NzlUd3F2VG54SHJLaFlVTkhibUU4YkVHa3BsdzYlMkZVcDhmYVNzZkN3NjhWajhmMkhpY1ROUThNZGtEd1prclV2MlEyVmlNJTJGQ2RiZXc1WkpSMzNVdVNjVHJIZkt3NVlUbzhiYlprZDVaaENVclZ3TWRrVk13cktVb3NiMFprUE13NXAlMkJ3cmNVdmglM0QlM0Q=
Authorization: eyJhbGciOiJIUzI1NiJ9.eyJ0eXBlIjoiUEMiLCJleHAiOjE2MDI2Njg2NDksInVzZXJJZCI6MTI5OTIyNjYzOTc0OTc1MDc4NCwiY3JlYXRlRGF0ZSI6IjIwMjAtMTAtMDcgMTc6NDQ6MDkifQ.ppW18NruEU9gpgRJgbYIaEUkvD7cTmfcRBXgOQ8vamE
Content-Type: application/json;charset=UTF-8
Accept: application/json, text/plain, */*
s-id: 371
dcc-r: https://www.douchacha.com/cable
d-t: 1602072436550
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36
j-id: sem
Origin: https://www.douchacha.com
Sec-Fetch-Site: same-site
Sec-Fetch-Mode: cors
Sec-Fetch-Dest: empty
Referer: https://www.douchacha.com/
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
  1. The URL carries three generated parameters: ts, he, and sign. ts looks like a millisecond timestamp.
  2. The request headers carry several more generated values: d-v, Authorization, and d-t.
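
As a quick check of the two observations above, the captured request target can be split with Python's standard library. This is only a sketch: the variable names are illustrative, and the URL and d-t value are copied from the capture above.

```python
from urllib.parse import urlsplit, parse_qs

# Request target and d-t header copied from the captured request above.
target = ("/api/tiktok/ranking/tiktok_goods_sales_rank"
          "?ts=1602072436550&he=wqdvbhcTvwD2hXibw4L5RyPHeCsEZ8f7wrD2w4Sh"
          "&sign=5e1501c104122454")
d_t_header = "1602072436550"

parts = urlsplit(target)
# parse_qs returns a list per key; each key appears once here.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.path)                   # the API path without the query string
print(sorted(params))               # names of the generated URL parameters
print(params["ts"] == d_t_header)   # does ts match the d-t header?
```

In this capture the ts query parameter and the d-t header carry the same value, which is consistent with both being filled from a single timestamp.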

Notice anything? This website does not use cookies. Identity is verified through the Authorization header instead, so we must find out where the Authorization value comes from.
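
The Authorization value has the three dot-separated segments of a JSON Web Token, so its first two segments can be read without any key. Below is a minimal sketch using only the standard library. The helper name decode_segment is made up for illustration, and the signature segment is not verified.

```python
import base64
import json

# Authorization header copied from the captured request above.
token = ("eyJhbGciOiJIUzI1NiJ9."
         "eyJ0eXBlIjoiUEMiLCJleHAiOjE2MDI2Njg2NDksInVzZXJJZCI6MTI5OTIyNjYz"
         "OTc0OTc1MDc4NCwiY3JlYXRlRGF0ZSI6IjIwMjAtMTAtMDcgMTc6NDQ6MDkifQ."
         "ppW18NruEU9gpgRJgbYIaEUkvD7cTmfcRBXgOQ8vamE")


def decode_segment(segment):
    """Decode one base64url JWT segment into a Python object."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


header_b64, payload_b64, _signature = token.split(".")
header = decode_segment(header_b64)
claims = decode_segment(payload_b64)
print(header)
print(claims)
```

The payload carries a userId and an exp expiry, which suggests the token is issued by the server at login rather than computed by the page's js. The userId is the same value the Python script in section 3 passes to he.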

2. Find where the encrypted parameters are assembled

There are a couple of ways to locate a site's encrypted parameters:

  1. Search for the keywords directly. We are looking for parameters such as he and d-v, so search the js for them and see what turns up. This is the most common and most direct approach.
  2. When the js is fully obfuscated and a direct search finds nothing, follow the call stack and set breakpoints layer by layer.
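
The first approach can also be run offline against a downloaded js file. The sketch below is a hypothetical helper, not part of the original workflow, and the sample text is just a few lines quoted from the site's code later in this article.

```python
def find_keyword_lines(js_source, keywords):
    """Return (line_number, keyword, stripped_line) for every keyword hit."""
    hits = []
    for number, line in enumerate(js_source.splitlines(), start=1):
        for keyword in keywords:
            if keyword in line:
                hits.append((number, keyword, line.strip()))
    return hits


# Sample input: a filler line plus two lines quoted from the site's js.
sample = '''var a = 1;
t.url = "".concat(t.url, "?ts=").concat(e, "&he=").concat(s, "&sign=").concat(c);
t.headers.common["d-v"] = l;
'''

hits = find_keyword_lines(sample, ['"&he="', '"d-v"'])
for hit in hits:
    print(hit)
```

Searching for the quoted literal, such as "&he=" or "d-v", keeps the noise down. A bare he would also match words like headers.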

The site I want to crawl is fairly friendly. Searching for he and d-v directly led to the relevant code, shown below.


From this code we can see that all the encrypted request parameters are assembled here. The code on line 6489:

t.url = "".concat(t.url, "?ts=").concat(e, "&he=").concat(s, "&sign=").concat(c);

This assembles the URL parameters: ts=e, he=s, sign=c.

return i && (t.headers.common["Authorization"] = i),
            t.headers.common["d-t"] = e,
            u && (t.headers.common["j-id"] = u),
            d && (t.headers.common["s-id"] = d),
            g && (t.headers.common["dcc-href"] = g),
            t.headers.common["dcc-r"] = document.referrer || "",
            t

This assembles the request headers.
Having found this spot, we need to work out how each value is generated.
ts:

e = (new Date).getTime() + 1 * sessionStorage.getItem("diffDate") || 0

ts is the current timestamp plus a time offset (diffDate) kept in sessionStorage. Reading the source shows where this offset comes from: it is set when the page loads, so you can simply generate it yourself.
he:

i = JSON.parse(localStorage.getItem("token"))
s = window.he(i ? "uid" : "dt")

The code checks whether a token exists in localStorage: if it does, it calls window.he("uid"); if not, it calls window.he("dt"). The window object has no native he method, so he() is the site's own function. We will focus on it later.
sign:

e = (new Date).getTime() + 1 * sessionStorage.getItem("diffDate") || 0
n = t.url.split("https://api.douchacha.com")[1];
o = n + e
c = window.sh(o)

So computing sign comes down to finding how sh is defined. Its input o is just the request path followed by the timestamp, which we can reproduce ourselves.
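
The input to sh can be rebuilt in Python as follows. This is a sketch that mirrors the js above. The names API_BASE, current_ts, and sign_input are illustrative, and sh itself is not reproduced here.

```python
import time

API_BASE = "https://api.douchacha.com"


def current_ts(diff_date=0):
    """Millisecond timestamp plus the diffDate offset, as in the site's js."""
    return int(time.time() * 1000) + diff_date


def sign_input(url, ts):
    """Mirror n = t.url.split(API_BASE)[1]; o = n + e."""
    path = url.split(API_BASE)[1]
    return path + str(ts)


# Fixed timestamp (taken from the captured request) so the result is repeatable.
o = sign_input(API_BASE + "/api/tiktok/ranking/user_list_gain", 1602072436550)
print(o)
```

In a real request, the same timestamp has to be reused for ts, d-t, and the sign input.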
d-v:

l = window.btoa(window.v() + "," + window.hi("dt"));
t.headers.common["d-v"] = l

For d-v, the task is to find the definitions of the two functions v() and hi().
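
The outer layer of d-v is plain base64, so it can be composed and inspected without knowing v() or hi(). Below is a sketch under that assumption. build_d_v is a hypothetical name, and the inputs in the second call are placeholders rather than real outputs of the site's functions.

```python
import base64


def build_d_v(v_value, hi_value):
    """Mirror window.btoa(window.v() + "," + window.hi("dt"))."""
    joined = "{},{}".format(v_value, hi_value)
    return base64.b64encode(joined.encode("latin-1")).decode("ascii")


# The first 8 characters of the captured d-v header decode to the
# beginning of the joined string.
prefix = base64.b64decode("NCxaZGJR")
print(prefix)

# Placeholder inputs, not real outputs of v() or hi().
print(build_d_v(4, "abc"))
```

The decoded prefix starts with 4 and a comma, so v() appears to have returned 4 in the captured session. What follows is presumably the output of hi().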
Authorization:

i = localStorage.getItem("token")
return i && (t.headers.common["Authorization"] = i),

The token is the account credential issued at login.

3. Dig deeper and find how the encryption functions are defined

The site's main encryption code lives in app.js and s.js. Searching for the function names quickly turns up their declarations.


We found the functions, but they are obfuscated with hexadecimal-encoded strings, which makes them tedious to read. This js file is manageable: at just over 400 lines, fully deobfuscating it would not be too laborious. A file of tens of thousands of lines would be a different story.
We can sidestep the problem by hooking or by treating the code as a black box, with no need to understand how the encryption works: download the js file and call its functions directly from Python with the execjs package.
Note: execjs needs a JS runtime, so install Node.js first.
Installing the execjs package:

pip install PyExecJS

Python code:

import base64
import time
from urllib.parse import quote

import execjs
import requests


def encode_b64_url(text):
    # Assumed helper (its body is not shown in the original post):
    # base64-encode the string as iso-8859-1 bytes, then percent-encode it.
    raw = base64.b64encode(text.encode('iso-8859-1'))
    return quote(raw.decode(), safe='')


# s_my_press.js: the site's js, downloaded and adapted to run outside a browser
with open('s_my_press.js', encoding='utf-8') as f:
    js = execjs.compile(f.read())

base = 'https://api.douchacha.com'
url = base + '/api/tiktok/ranking/user_list_gain'
diffdate = 0  # stands in for sessionStorage "diffDate"
n = str(int(round(time.time() * 1000)) + diffdate)  # ts, also sent as d-t
# n = '1599562900780'
print('n:' + n)

o = url.split(base)[1] + n  # sign input: request path + timestamp
r = encode_b64_url(js.call('he', '1299226639749750784'))
print('he', r)
s = js.call('sh', o)
url = url + '?ts={}&he={}&sign={}'.format(n, r, s)

hi = encode_b64_url(js.call('hi', n))
print('hi:' + hi)
# d-v = btoa(v() + "," + hi(...)); v() is hard-coded as 4 here.
# Decode to str so the value can be used as a header.
d_v = base64.b64encode((str(4) + ',' + hi).encode()).decode()
print('d_v:' + d_v)

# Content-Length is left out: requests computes it from the body.
headers = {
    'Host': 'api.douchacha.com',
    'Connection': 'keep-alive',
    'dcc-href': 'https://www.douchacha.com/uppoint',
    'd-v': d_v,
    'Authorization': "eyJhbGciOiJIUzI1NiJ9.eyJ0eXBlIjoiUEMiLCJleHAiOjE1OTk4MjA1NjQsInVzZXJJZCI6MTI5OTIyNjYzOTc0OTc1MDc4NCwiY3JlYXRlRGF0ZSI6IjIwMjAtMDktMDQgMTg6MzY6MDQifQ.KCrYYx4hEqzv6CTJw2NlvD8pp-iRMw7IBgud_XwHHRE",
    'Content-Type': 'application/json;charset=UTF-8',
    'Accept': 'application/json, text/plain, */*',
    's-id': '371',
    'dcc-r': 'https://www.douchacha.com/uppoint',
    'd-t': n,
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'j-id': 'sem',
    'Origin': 'https://www.douchacha.com',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.douchacha.com/uppoint',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
data = {"page_no": 1, "page_size": 20, "params_data": {"label_name": "", "period": "DAY", "period_value": "20200907"}}
sess = requests.Session()
r1 = sess.post(url, json=data, headers=headers, verify=False)
print(r1.text)

With this, the data can be fetched successfully.
There are a few small pitfalls along the way, though.

  • References to the window object raise errors under execjs, because its Node.js runtime has no browser objects. Anything taken from window has to be extracted and reimplemented in Python. For example, window.btoa, the browser's base64 encoding function, must be redone with Python's base64 module.
  • window.btoa treats every character of its input as a single byte (iso-8859-1), whereas Python encodes strings as UTF-8 by default. When base64-encoding in Python, specify iso-8859-1 explicitly.
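
The two pitfalls above can be sketched as follows. These helpers are illustrative rather than the author's code. The window stub is one possible workaround and is an assumption on my part (the approach described above is to extract the functions by hand), and a real obfuscated file may touch other browser globals such as document or navigator that need their own stubs.

```python
import base64

WINDOW_STUB = "var window = {};\n"


def add_window_stub(js_source):
    """Prepend a bare window object so window.xxx assignments do not throw."""
    return WINDOW_STUB + js_source


def btoa(text):
    """Python stand-in for window.btoa: one byte per character (iso-8859-1)."""
    return base64.b64encode(text.encode("iso-8859-1")).decode("ascii")


def load_context(path):
    """Compile a downloaded js file with the stub; needs PyExecJS and Node.js."""
    import execjs  # imported lazily so the helpers above work without it
    with open(path, encoding="utf-8") as f:
        return execjs.compile(add_window_stub(f.read()))


latin1_result = btoa("é")                                         # iso-8859-1 bytes
utf8_result = base64.b64encode("é".encode("utf-8")).decode("ascii")  # UTF-8 bytes
print(latin1_result, utf8_result)
```

The two printed values differ, which is why the encoding has to be stated explicitly. With a stub like this, functions that the js assigns to window would likely need to be called as window.he rather than he.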

That completes the cracking work. The site's login is relatively simple and the token is easy to obtain, so I won't cover it here. More updates to come, so remember to follow!

This article is a reprint; copyright belongs to the original author. If there is any infringement, please contact the editor to have it removed.

Original address: https://blog.csdn.net/happiness0617/article/details

Origin blog.csdn.net/weixin_43881394/article/details/108996614