Advanced crawling: how to capture a website's API

Today's target is a minimalist wallpaper site, bz.zzzmh.cn. We want to crawl the images on its home page.

The site disables right-click > Inspect, so press F12 to open the developer tools instead.

Select the Elements tab and locate any one of the images.

The image in the page turns out to be a thumbnail, whereas what we want to crawl is the high-definition original.

Switch to the Network tab and inspect the response. The returned HTML does not contain the image data we want:

<div class="view-body" :class="{'view-body-classify':config.page.active == 'classify'}">
                <div :id="'box_'+j.i" v-for="(j,index) in json.view" class="img-box">
                    <img :id="j.i" v-lazy="getUrl(j.i,j.t)" data-type="img-box" :data-index="index" v-if="j.t != 'ad'" :key="getUrl(j.i,j.t)" width="100%" alt="" @click="showFull(index)">
                    <img :id="j.i" src="img/ad.png" v-if="j.t == 'ad'" style="width:100%;z-index:-1000" alt="" transform="translate(-50%, -50%)" onload='loadAdsense("box",this)'>
                </div>
                <div v-if="config.page.active == 'like' && json.likes.length == 0" class="nolikemsg center">
                    <span>
                        您还没有收藏喜欢的图片<br>
                        点击图片上的小红心试试&nbsp;<span class="heart iconfont iconheart"></span>
                    </span>
                </div>
            </div>
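
A quick way to confirm this programmatically is to parse the returned HTML and separate `<img>` tags with a real `src` from those whose URL is only a Vue binding. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are my own, and the sample is a shortened copy of the template above.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def split_static_and_bound(html):
    """Return (static_srcs, bound_exprs) for the <img> tags in html.

    static_srcs: literal src values already present in the HTML.
    bound_exprs: v-lazy expressions that are only evaluated in the browser.
    """
    parser = ImgAttrCollector()
    parser.feed(html)
    static_srcs = [a["src"] for a in parser.images if a.get("src")]
    bound_exprs = [a["v-lazy"] for a in parser.images if a.get("v-lazy")]
    return static_srcs, bound_exprs


# Shortened copy of the template returned by the server.
sample = """
<div class="img-box">
  <img :id="j.i" v-lazy="getUrl(j.i,j.t)" v-if="j.t != 'ad'" width="100%" alt="">
  <img :id="j.i" src="img/ad.png" v-if="j.t == 'ad'" alt="">
</div>
"""

static_srcs, bound_exprs = split_static_and_bound(sample)
print(static_srcs)   # only the ad placeholder has a literal URL
print(bound_exprs)   # the wallpaper URLs are computed by getUrl at runtime
```

If the only literal `src` is a placeholder and everything else is a binding, the real image list has to come from a separate request.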

Obviously the data is loaded afterwards by JavaScript, so the next step is to capture the API request behind it. How?

Open the Network tab, refresh the page, and select the XHR filter (which shows only Ajax requests). You will find a getJson request. Click it and look at the response: this is exactly the data we want.

Once the data is located, the rest is simple in principle: send a request to the getJson interface and the data comes back. The key is reproducing the request, in particular its key parameters. So the first step is to look at the request headers:

accept: */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9,en;q=0.8
access: e7809d01583f1f91da7ad087fd736c97e5df2780557bc50f54a4e80ba438cf9c
cache-control: no-cache
content-length: 30
content-type: application/json
location: bz.zzzmh.cn
origin: https://bz.zzzmh.cn
pragma: no-cache
referer: https://bz.zzzmh.cn/
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
sign: ea04368c4c168320af527f08a6501345
timestamp: 1603903026787
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36
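
To replay the request from Python, the copied headers first need to become a dictionary. Below is a minimal sketch that parses `name: value` lines and builds (but does not send) a POST request with the standard library. The endpoint URL and the JSON payload are placeholders I made up, since the article shows neither the full API address nor the request body.

```python
import json
import urllib.request

# A few of the headers copied from DevTools (values taken from the list above).
RAW_HEADERS = """
accept: */*
content-length: 30
content-type: application/json
origin: https://bz.zzzmh.cn
referer: https://bz.zzzmh.cn/
access: e7809d01583f1f91da7ad087fd736c97e5df2780557bc50f54a4e80ba438cf9c
sign: ea04368c4c168320af527f08a6501345
timestamp: 1603903026787
"""

# Hypothetical values: replace them with what DevTools actually shows.
API_URL = "https://api.example.invalid/getJson"
PAYLOAD = {"pageNum": 1}


def parse_header_block(raw):
    """Turn 'name: value' lines into a dict, splitting on the first colon."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def build_request(url, raw_headers, payload):
    """Build a POST request object without sending it."""
    headers = parse_header_block(raw_headers)
    # content-length is recomputed by the HTTP client when the request is sent.
    headers.pop("content-length", None)
    # Dropped so the server is not asked for a compressed response.
    headers.pop("accept-encoding", None)
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


headers = parse_header_block(RAW_HEADERS)
req = build_request(API_URL, RAW_HEADERS, PAYLOAD)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. With a stale `access` and `timestamp` the server will most likely reject the request, which is why those two values are examined next.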

Testing request by request shows that sign changes only when the browser changes, while access and timestamp change on every request. timestamp is obviously just the time of the request, so the parameter that matters is access. It is presumably a hash of something, and my first guess was MD5 (although an MD5 hex digest is 32 characters and this value is 64, so more than a single MD5 may be involved). How do we crack it?
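
Both observations can be checked mechanically. The sketch below diffs two captured header sets to see which values change, and lists the common hash functions whose hex digest length matches a given value. The helper names are my own, and the second capture is invented for illustration. Length is only a hint: a 64-character value might be a single SHA-256 digest, but it could also be built some other way.

```python
import hashlib


def changed_headers(first, second):
    """Return the sorted header names whose values differ between two captures."""
    names = first.keys() | second.keys()
    return sorted(n for n in names if first.get(n) != second.get(n))


def candidate_hashes(hex_value):
    """Return hashlib algorithm names whose hex digest length matches hex_value."""
    try:
        bytes.fromhex(hex_value)  # reject anything that is not valid hex
    except ValueError:
        return []
    matches = []
    for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
        if hashlib.new(name).digest_size * 2 == len(hex_value):
            matches.append(name)
    return matches


capture_a = {
    "sign": "ea04368c4c168320af527f08a6501345",
    "access": "e7809d01583f1f91da7ad087fd736c97e5df2780557bc50f54a4e80ba438cf9c",
    "timestamp": "1603903026787",
}
# Invented second capture, only to demonstrate the diff.
capture_b = dict(capture_a, access="0" * 64, timestamp="1603903031000")

print(changed_headers(capture_a, capture_b))
print(candidate_hashes(capture_a["sign"]))
print(candidate_hashes(capture_a["access"]))
```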

First, copy the access value and run a global search for it. There are no matches, so what now?

Since the value does not appear anywhere in the front-end source, it must be generated at runtime by JavaScript, so the next step is to find the code that does it. Where to start?

Looking at the returned front-end code, most of the referenced JS files come from CDNs, so they obviously do not contain the site's core logic. That leaves the script served by the site itself.
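
Separating CDN scripts from the site's own scripts can also be automated. Here is a minimal sketch that collects every `<script src>` and keeps only those hosted on the same hostname as the page. The sample HTML and its file names are invented for illustration, and a site that serves its own code from another subdomain would need a looser comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_party_scripts(html, page_url):
    """Return absolute URLs of scripts served from the page's own hostname."""
    parser = ScriptSrcCollector()
    parser.feed(html)
    page_host = urlparse(page_url).hostname
    result = []
    for src in parser.sources:
        absolute = urljoin(page_url, src)  # resolves relative and // URLs
        if urlparse(absolute).hostname == page_host:
            result.append(absolute)
    return result


# Invented sample: two CDN libraries and one script hosted by the site.
sample = """
<script src="https://cdn.example.invalid/vue.min.js"></script>
<script src="//cdn.example.invalid/axios.min.js"></script>
<script src="js/index.js"></script>
"""

own = first_party_scripts(sample, "https://bz.zzzmh.cn/")
print(own)
```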

Open the site's own script and take a look.

This is the JS code. But what is that blob on line 26? It does not read like normal JavaScript, so my guess is that it has been obfuscated. Why obfuscate it? The answer is obvious (I was secretly pleased): it looks like I am searching in the right place.

Since it is obfuscated, it needs to be de-obfuscated. There are plenty of online JS de-obfuscation sites. I will not demonstrate one here; a Baidu search will find them.

Take the de-obfuscated JS, search it for the key parameter access, and everything becomes clear.

Sure enough, the MD5 guess was right. As long as we imitate the same hashing steps to generate access, we should be able to request data from the backend.
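
In Python, imitating such a scheme usually takes only a few lines of `hashlib`. The sketch below is purely illustrative: the secret string and the way it is combined with the timestamp are assumptions of mine, not the site's actual formula, which has to be read from the de-obfuscated code. Note also that a single MD5 yields 32 hex characters, so the real computation for the 64-character access value must differ from this one.

```python
import hashlib
import time

# Hypothetical secret: the real input is whatever the de-obfuscated JS uses.
SECRET = "hypothetical-secret"


def make_signed_headers(secret, timestamp_ms=None):
    """Return timestamp and access headers for one request.

    Assumed formula (NOT the site's): access = md5(secret + timestamp).
    """
    if timestamp_ms is None:
        # 13-digit millisecond timestamp, like the one in the captured headers.
        timestamp_ms = int(time.time() * 1000)
    timestamp = str(timestamp_ms)
    digest = hashlib.md5((secret + timestamp).encode("utf-8")).hexdigest()
    return {"timestamp": timestamp, "access": digest}


signed = make_signed_headers(SECRET, timestamp_ms=1603903026787)
print(signed)
```

Because access depends on timestamp, both headers have to be regenerated together for every request, which matches the observation that they change each time.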

After some trial and error, I did get the data back. But how do we use it?

The image links in the front end are all generated by the getUrl function, so search for it in the JS.

Just as guessed: one link is the original image and the other is the thumbnail. Generate the links the same way and you are done. That is the general approach. I will not post the full code here. If you found this useful, please give it a like. Your support is my biggest motivation!
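
For reference, the link-building step could be ported to Python roughly as follows. This is only a sketch: the two URL templates are placeholders I invented, because the real patterns are whatever getUrl contains. The field names `i` (image id) and `t` (type) come from the page template, which also skips items whose type is `'ad'`.

```python
# Hypothetical URL templates: copy the real patterns from getUrl.
ORIGINAL_TEMPLATE = "https://img.example.invalid/original/{image_id}.jpg"
THUMBNAIL_TEMPLATE = "https://img.example.invalid/thumbs/{image_id}.jpg"


def build_links(item):
    """Return (original_url, thumbnail_url) for one item, or None for ads."""
    if item.get("t") == "ad":
        return None  # the page template renders ads separately
    image_id = item["i"]
    return (
        ORIGINAL_TEMPLATE.format(image_id=image_id),
        THUMBNAIL_TEMPLATE.format(image_id=image_id),
    )


# Invented items shaped like the entries of json.view in the template.
items = [{"i": "abc123", "t": "j"}, {"i": "ad-slot", "t": "ad"}]
links = [pair for pair in map(build_links, items) if pair is not None]
print(links)
```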

Origin blog.csdn.net/m0_48769739/article/details/109349646