Sesame HTTP: How to Find a Crawler Entry Point

Finding the crawler entry point

1. The entry point for this task

A good entry point for this crawler is the search engine we all use every day. There are many search engines, but they all do essentially the same thing: index web pages, process them, and serve search results. In everyday use we simply type keywords and search, but search engines also support a number of operators. For this task, a query like the one below returns exactly the data we want.

site:zybang.com

Now let's try it on Baidu, Google, Sogou, 360, and Bing:

In each case, the reported number of results is in the millions or even tens of millions.
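To turn such search results into an actual seed list, you can fetch a results page and pull out the links programmatically. Below is a minimal Python sketch that sends the site: query to Bing and extracts zybang.com links with a regular expression; search-engine markup changes frequently and results may be rate-limited or captcha-gated, so treat this purely as an illustration.

```python
import re
import requests

# Minimal sketch: harvest zybang.com URLs from one Bing results page.
# Bing's markup may change at any time; this only grabs raw links.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(
    "https://www.bing.com/search",
    params={"q": "site:zybang.com"},
    headers=headers,
    timeout=10,
)

# Pull every zybang.com link out of the raw HTML.
urls = set(re.findall(r'https?://[^"\'<>\s]*zybang\.com[^"\'<>\s]*', resp.text))
for url in sorted(urls):
    print(url)
```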

So these search results are clearly a good entry point for this task. As for coping with each site's anti-crawler measures, that comes down to individual fundamentals.

2. Other entry points

(1) Mobile entry point

Fetching data through a website's mobile portal is often easier and faster.

The easiest way to find the mobile portal is to open Chrome's developer tools, click the device-toolbar (phone) icon, and refresh the page.

This method does not always work. Alternatively, send the URL to your phone and open it in a mobile browser to see whether the page differs from the desktop version; if it does, copy the URL from the mobile browser and send it back to your computer.
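The same comparison can be scripted by sending the request with a mobile User-Agent and diffing it against the desktop response. A minimal sketch, where the target URL is just an example:

```python
import requests

# A common iPhone User-Agent string; any recent mobile UA will do.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
)

url = "https://www.zybang.com/"  # example target
desktop = requests.get(url, timeout=10)
mobile = requests.get(url, headers={"User-Agent": MOBILE_UA}, timeout=10)

# If the site serves a separate mobile build, the final URL (e.g. an m.
# subdomain) or the page size usually differs between the two responses.
print("desktop:", desktop.url, len(desktop.text))
print("mobile: ", mobile.url, len(mobile.text))
```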

     (2) Sitemap

A sitemap is how a webmaster tells search engines which pages on the site are available for crawling, so sitemaps are an efficient and convenient way to collect URLs to use as next-level entry points.
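Sitemaps are plain XML and easy to parse. The sketch below fetches one and lists the URLs it declares; the example.com path is a placeholder, and real sitemap locations are usually advertised in the site's robots.txt via Sitemap: lines.

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder location; check https://<site>/robots.txt for the real path.
resp = requests.get("https://www.example.com/sitemap.xml", timeout=10)
root = ET.fromstring(resp.content)

# Standard sitemap namespace for <url><loc> entries.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)
```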

(3) Modifying values in the URL

As a caveat up front, this technique does not always work.

The idea is to extract as much of the required data as possible from a single request by adjusting the values of certain fields in the URL. This reduces the number of requests, lowers the risk of being banned by the site, and improves the crawler's efficiency. Here's an example:

When crawling all of a given singer's songs on QQ Music, packet capture shows a request URL of the following form:

https://xxxxxxxxx&singermid=xxxx&order=listen&begin={begin}&num={num}&songstatus=1

The response carries the song list together with paging fields (some field values in the capture above are masked as xxx).

Pay attention to the num field. When a singer has many songs, the data is paginated: begin is the index of the first item on each page, and num is the number of items on that page. Normally we fetch the data page by page, and QQ Music's default page size is 30. So, for this singer, do we really need at least 4 requests to get the complete data?

Of course not. At this point we can try changing some of the values in the URL and see whether the returned results change. Here we set num to the singer's total song count and begin to 0. Re-requesting the modified URL returns the following:

As shown above, all 96 items are returned in a single response.

In this way we can get all the data in just 2 requests: the first request reads the total count, then we modify the URL and request again to fetch everything.
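A rough sketch of that two-request pattern is below. The endpoint is the masked capture from above, so you would substitute the real URL from your own capture, and the "total" field name in the response JSON is an assumption to be checked against the actual payload.

```python
import math
import requests

PAGE_SIZE = 30  # QQ Music's default page size, per the capture above

# Masked endpoint from the packet capture; fill in the real host and
# singermid from your own capture before running this.
URL = "https://xxxxxxxxx&singermid=xxxx&order=listen&begin={begin}&num={num}&songstatus=1"

# Request 1: an ordinary first page, only to learn the total song count.
# "total" is an assumed field name; check your actual response JSON.
first = requests.get(URL.format(begin=0, num=PAGE_SIZE), timeout=10).json()
total = first["total"]

# Request 2: begin=0 and num=total pulls everything in one page.
all_songs = requests.get(URL.format(begin=0, num=total), timeout=10).json()
print(f"{total} songs in 2 requests instead of {math.ceil(total / PAGE_SIZE)}")
```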

       A similar field is pagesize.

Summary

The tips above for finding crawler entry points let us do more with less, and sometimes obtain the data at minimal cost.

 
