Advanced crawling: how to capture a website's API
Today's target is the Minimalist Wallpaper site (bz.zzzmh.cn), and its home page is what we want to crawl.
The site blocks right-click > Inspect, so press F12 to open the developer tools instead.
Open the Elements tab and pick any picture on the page.
It turns out to be a thumbnail, whereas what we want is the high-definition original image.
Checking the Network tab shows that the returned HTML does not contain the image data we want:
<div class="view-body" :class="{
'view-body-classify':config.page.active == 'classify'}">
<div :id="'box_'+j.i" v-for="(j,index) in json.view" class="img-box">
<img :id="j.i" v-lazy="getUrl(j.i,j.t)" data-type="img-box" :data-index="index" v-if="j.t != 'ad'" :key="getUrl(j.i,j.t)" width="100%" alt="" @click="showFull(index)">
<img :id="j.i" src="img/ad.png" v-if="j.t == 'ad'" style="width:100%;z-index:-1000" alt="" transform="translate(-50%, -50%)" onload='loadAdsense("box",this)'>
</div>
<div v-if="config.page.active == 'like' && json.likes.length == 0" class="nolikemsg center">
<span>
您还没有收藏喜欢的图片<br>
点击图片上的小红心试试 <span class="heart iconfont iconheart"></span>
</span>
</div>
</div>
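A quick way to confirm that a page is rendered in the browser is to look for images whose URL exists only as a Vue binding. The following is a minimal sketch using Python's standard html.parser; the class name TemplateProbe and the trimmed SAMPLE string are illustrative assumptions, with the sample cut down from the template above.

```python
from html.parser import HTMLParser


class TemplateProbe(HTMLParser):
    """Separates <img> tags with a literal src from those bound by Vue."""

    def __init__(self):
        super().__init__()
        self.static_srcs = []  # literal src="..." values present in the HTML
        self.bound_imgs = 0    # <img> tags whose URL is computed in the browser

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        names = dict(attrs)
        if "v-lazy" in names or ":src" in names:
            self.bound_imgs += 1
        elif "src" in names:
            self.static_srcs.append(names["src"])


# Trimmed copy of the template returned by the server.
SAMPLE = '''
<div :id="'box_'+j.i" v-for="(j,index) in json.view" class="img-box">
<img :id="j.i" v-lazy="getUrl(j.i,j.t)" v-if="j.t != 'ad'" width="100%" alt="">
<img :id="j.i" src="img/ad.png" v-if="j.t == 'ad'" alt="">
</div>
'''

probe = TemplateProbe()
probe.feed(SAMPLE)
print(probe.bound_imgs, probe.static_srcs)
```

The only literal image path in the HTML is the ad placeholder; the wallpaper URLs are all computed by getUrl at run time, which is why the raw HTML is of no use to a crawler.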
Clearly the data is loaded afterwards by JavaScript, so the next step is to capture the API request behind it. How?
Open the Network tab, refresh the page, and select the XHR filter (Ajax requests only). You will find a getJson request. Click on it and look at the response: this is exactly the data we want.
Once we know where the data comes from, the rest sounds simple: send a request to the getJson endpoint and the data comes back. The hard part is reproducing the request, in particular its key parameters. So the first step is to see what the request sends. Its headers are:
accept: */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9,en;q=0.8
access: e7809d01583f1f91da7ad087fd736c97e5df2780557bc50f54a4e80ba438cf9c
cache-control: no-cache
content-length: 30
content-type: application/json
location: bz.zzzmh.cn
origin: https://bz.zzzmh.cn
pragma: no-cache
referer: https://bz.zzzmh.cn/
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
sign: ea04368c4c168320af527f08a6501345
timestamp: 1603903026787
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36
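To make the goal concrete, here is a rough sketch of how such a request could be assembled with Python's urllib. It only builds the request and does not send it. API_URL is a made-up placeholder (the real getJson address is whatever the Network tab shows), and the payload argument stands in for the JSON body, which should also be copied from the captured request.

```python
import json
import urllib.request

# Hypothetical placeholder, NOT the site's real endpoint.
API_URL = "https://api.example.invalid/getJson"


def build_request(access, sign, timestamp_ms, payload):
    """Build (but do not send) a POST request with the captured headers."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    headers = {
        "accept": "*/*",
        "content-type": "application/json",
        "location": "bz.zzzmh.cn",
        "origin": "https://bz.zzzmh.cn",
        "referer": "https://bz.zzzmh.cn/",
        "user-agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
        ),
        # The three values that have to be reproduced for every request.
        "access": access,
        "sign": sign,
        "timestamp": str(timestamp_ms),
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


# Dummy values, only to show the call shape.
req = build_request("a" * 64, "b" * 32, 1603903026787, {"page": 1})
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would send it; whether the server accepts it depends entirely on access, sign and timestamp being valid.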
Testing step by step shows that sign changes only when the browser changes, while access and timestamp change on every request. timestamp is obviously the current time (in milliseconds), so the parameter that really matters is access. It looks like the output of a hash function, and my first guess was md5. So how do we reproduce it?
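Before digging further, a quick length check on the two hex strings can hint at which hash to look for. This is only a heuristic sketch; guess_digest and the DIGEST_LENGTHS table are illustrative helpers, and a matching length proves nothing on its own.

```python
import re

# Hex-digest lengths of a few common hash functions.
DIGEST_LENGTHS = {32: "MD5", 40: "SHA-1", 64: "SHA-256", 128: "SHA-512"}


def guess_digest(value):
    """Return the hash whose hex digest has this length, or None."""
    if not re.fullmatch(r"[0-9a-f]+", value):
        return None
    return DIGEST_LENGTHS.get(len(value))


# Values copied from the captured request headers.
sign = "ea04368c4c168320af527f08a6501345"
access = "e7809d01583f1f91da7ad087fd736c97e5df2780557bc50f54a4e80ba438cf9c"

print(guess_digest(sign))    # 32 hex characters
print(guess_digest(access))  # 64 hex characters
```

sign has the length of a single MD5 digest. access is twice as long, which fits SHA-256 but equally fits two MD5 digests joined together, so only the site's JavaScript can settle how it is really built.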
First, copy the access value and do a global search for it in the developer tools. Nothing turns up, so what next?
Since the value does not appear anywhere in the page source, it must be generated by JavaScript at run time, so the next step is to find the JS code that produces it. Where to start?
Looking at the returned front-end code, most of the referenced JS files come from a CDN, and those are libraries that obviously do not hold the site's core logic.
That leaves the site's own script, so open it and take a look.
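Sorting the script tags by host can be automated. The sketch below keeps only scripts served from the same host as the page; the file names in SAMPLE_HTML are made up for illustration, and a site that serves its own code from a separate static subdomain would need a looser comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ScriptCollector(HTMLParser):
    """Collects the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def first_party_scripts(html, page_url):
    """Return absolute URLs of scripts hosted on the page's own host."""
    collector = ScriptCollector()
    collector.feed(html)
    site_host = urlparse(page_url).hostname
    result = []
    for src in collector.srcs:
        absolute = urljoin(page_url, src)  # resolves relative paths
        if urlparse(absolute).hostname == site_host:
            result.append(absolute)
    return result


# Hypothetical markup: one CDN library and one site-local script.
SAMPLE_HTML = (
    '<script src="https://cdn.example.com/vue.min.js"></script>'
    '<script src="js/index.js"></script>'
)

print(first_party_scripts(SAMPLE_HTML, "https://bz.zzzmh.cn/"))
```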
This is the JS code. What is that blob on line 26? It does not read like ordinary JavaScript at all, so my guess is that it has been obfuscated. And why would anyone obfuscate it? The answer is obvious (to my quiet delight): it looks like I am searching in the right place.
Since it is obfuscated, it has to be de-obfuscated. There are plenty of online JS de-obfuscation sites, so I will not demonstrate that step; a quick search on Baidu will find one.
With the de-obfuscated code in hand, search it for the key parameter access, and everything falls into place.
Sure enough, the md5 guess was right. As long as we reproduce the same hashing steps and generate the access parameter ourselves, we can request data from the backend.
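The building blocks for this are short in Python. In the sketch below, md5_hex and now_ms are generic helpers, while derive_access is only a placeholder: its recipe (a secret string followed by the timestamp) is invented for illustration and is not the site's. The real input string and any extra steps have to be ported from the de-obfuscated JS. Note that the captured access value is 64 hex characters long, so the real recipe must do more than a single md5 call.

```python
import hashlib
import time


def md5_hex(text):
    """Hex MD5 digest of a string, matching the output of typical JS md5 helpers."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def now_ms():
    """Current Unix time in milliseconds, the format of the timestamp header."""
    return int(time.time() * 1000)


def derive_access(timestamp_ms, secret):
    # PLACEHOLDER recipe, not the site's actual one: replace the body
    # with the steps found in the de-obfuscated JS.
    return md5_hex(f"{secret}{timestamp_ms}")


ts = now_ms()
print(ts, derive_access(ts, "hypothetical-secret"))
```

Whatever the real recipe is, the timestamp header sent with the request presumably has to be the same value that went into the hash, so compute it once and reuse it.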
After some trial and error I did get the data. But how do we turn it into image links?
In the page template, every picture link is generated by the getUrl function (called as getUrl(j.i,j.t)), so we can search for it in the JS.
Sure enough, it builds two kinds of link: one for the original image and one for the thumbnail. Generate the links the same way and you are done. That is the general idea. I am not posting the full code here. If you found this useful, please give it a like. Your support is my biggest motivation!