Ways to improve crawler efficiency

1. About crawlers
A crawler is a program that automatically collects information from the Internet according to a set of rules. In essence, it uses code to fetch the data that is useful to us.

Anti-crawler measures never eliminate crawlers completely; instead, they try to keep the volume and frequency of crawler visits within an acceptable range.

2. Methods to improve crawler efficiency
Coroutines. Running many requests as coroutines lets the crawler work on other pages while one request is waiting on the network, which can greatly improve efficiency.
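As a rough illustration of the coroutine approach, the sketch below uses the standard asyncio module. The fetch function is a hypothetical stand-in that only sleeps instead of making a real request, and the example.com URLs are placeholders.

```python
import asyncio
import time


async def fetch(url: str, delay: float = 0.1) -> str:
    # Hypothetical stand-in for a real network request: asyncio.sleep
    # simulates waiting on I/O without blocking the event loop.
    await asyncio.sleep(delay)
    return f"fetched {url}"


async def crawl(urls: list[str]) -> list[str]:
    # gather() runs all coroutines concurrently and returns the results
    # in the same order as the input URLs.
    return await asyncio.gather(*(fetch(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
start = time.perf_counter()
results = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(len(results), "pages in", round(elapsed, 2), "seconds")
```

Because the five waits overlap, the total time should be close to one delay rather than five. In a real crawler the sleep would be replaced by an asynchronous HTTP client.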

Multiprocessing. Using multiple CPU cores lets several processes run in parallel; with a few cores, throughput can increase severalfold.
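A minimal sketch of the multi-process idea, assuming the CPU-heavy part of the crawler is parsing pages that have already been downloaded. The count_links and count_all functions are illustrative names, not part of any library, and counting href attributes is only a stand-in for real parsing.

```python
import re
from concurrent.futures import ProcessPoolExecutor


def count_links(html: str) -> int:
    # CPU-bound step (a stand-in for real parsing): count href attributes.
    return len(re.findall(r'href="[^"]*"', html))


def count_all(pages: list[str], workers: int = 2) -> list[int]:
    # Each page is handed to one of the worker processes, so the parsing
    # can run on several CPU cores at the same time.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_links, pages))


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block
    # on platforms that start workers by re-importing the script.
    pages = ['<a href="/a"></a><a href="/b"></a>', '<a href="/c"></a>']
    print(count_all(pages))
```

Processes are worth their startup cost mainly for CPU-bound work such as parsing; for waiting on the network, coroutines or threads are usually enough.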

Multithreading. Split the work into multiple tasks and execute them concurrently (the threads take turns running).
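The multithreaded version can be sketched with a thread pool from the standard library. As an assumption for illustration, fetch again only sleeps to imitate a blocking request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    # Hypothetical stand-in for a blocking HTTP request.
    time.sleep(0.1)
    return f"fetched {url}"


urls = [f"https://example.com/item/{i}" for i in range(8)]

# Four worker threads take URLs from the list; while one thread is
# blocked waiting, the others keep running.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))

print(len(results), "pages fetched")
```

pool.map returns the results in input order, so each result can be matched with its URL by position.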

Distributed crawler. Running the same project on multiple machines can also greatly improve efficiency.
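One simple way to divide the work between machines, shown here only as a sketch, is to assign each URL to a machine by hashing it. This assumes every machine knows its own index and the total number of machines; shard_for and urls_for_worker are hypothetical helper names. Real distributed crawlers often share a task queue instead.

```python
import hashlib


def shard_for(url: str, num_workers: int) -> int:
    # A stable hash (unlike the built-in hash(), which varies between
    # runs for strings) so every machine computes the same assignment.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def urls_for_worker(urls: list[str], worker_id: int, num_workers: int) -> list[str]:
    # Each machine keeps only the URLs that hash to its own index.
    return [u for u in urls if shard_for(u, num_workers) == worker_id]


all_urls = [f"https://example.com/page/{i}" for i in range(20)]
for worker_id in range(3):
    print(worker_id, len(urls_for_worker(all_urls, worker_id, 3)))
```

Every URL lands on exactly one machine, so no page is crawled twice and none is skipped.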

Packaging. You can package the Python file into an executable (.exe) file and let it run in the background.

Other. For example, use a fast network connection.

3. Anti-crawler measures
Request header restrictions. The server checks the headers sent with each request. Solution: fill in the User-Agent header to declare our identity; sometimes we also need to fill in Origin and Referer to declare the source of the request.
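Setting these headers might look like the sketch below, which uses the standard urllib.request module. The header values and the example.com URLs are placeholders, and build_request is an illustrative helper name.

```python
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    # Placeholder values: a real crawler would use headers that match
    # the site it is visiting.
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)",
        "Referer": "https://example.com/",
        "Origin": "https://example.com",
    }
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/data")
# Passing req to urllib.request.urlopen() would send it with these headers.
print(req.header_items())
```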

Login restrictions, that is, pages cannot be accessed without logging in. Solution: use cookies and sessions to simulate a login.
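A minimal sketch of keeping a logged-in session with only the standard library is shown below. The login URL and the form field names (username, password) are assumptions; a real site defines its own, and make_session and login are illustrative helper names.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    # The cookie jar stores cookies set by the server; the opener sends
    # them back on every later request, which is what keeps us logged in.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url: str, username: str, password: str) -> int:
    # Hypothetical form fields; supplying data makes this a POST request.
    form = urllib.parse.urlencode({"username": username, "password": password})
    with opener.open(login_url, data=form.encode("utf-8"), timeout=10) as resp:
        return resp.status


opener, jar = make_session()
# login(opener, "https://example.com/login", "user", "secret")  # hypothetical endpoint
# After a successful login, opener.open(...) reuses the session cookies.
```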

Complex interactions, such as a verification code (CAPTCHA) that blocks login. This is harder to get around. Solution 1: use Selenium and enter the verification code manually. Solution 2: use image processing and OCR libraries to recognize the verification code automatically (tesserocr/pytesseract/Pillow).

IP restrictions. If one IP address crawls the website too frequently, the server temporarily blocks requests from that address. Solution: use time.sleep() to limit the crawler's speed, and set up an IP proxy pool or use IPIDEA to avoid IP bans.
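The two ideas, slowing down and rotating proxies, can be sketched as follows. ProxyPool, polite_delay and opener_for are hypothetical names, and the proxy addresses are placeholders from a documentation-only IP range, not working proxies.

```python
import itertools
import random
import time
import urllib.request


class ProxyPool:
    """Hands out proxies in round-robin order (illustrative helper)."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self._cycle = itertools.cycle(proxies)

    def get(self) -> str:
        return next(self._cycle)


def polite_delay(min_s: float = 1.0, max_s: float = 3.0) -> float:
    # Sleep for a random interval so requests are not sent at a fixed,
    # high frequency; returns the delay that was used.
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


def opener_for(proxy: str) -> urllib.request.OpenerDirector:
    # Route both HTTP and HTTPS traffic through the chosen proxy.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


pool = ProxyPool(["http://203.0.113.10:8080", "http://203.0.113.11:8080"])
opener = opener_for(pool.get())
# Before each request: call polite_delay(), then opener.open(url, timeout=10).
```

Randomizing the delay makes the traffic look less mechanical than a fixed time.sleep() value, and rotating proxies spreads the requests over several addresses.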


Origin blog.51cto.com/14957272/2541625