Seven tricks to get you started quickly

A lot of code gets reused when developing crawlers, so here is a summary of reusable snippets that may save some work in the future.

   1. Basic web page crawling

    GET method

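A minimal sketch of a GET request with urllib2 (the URL here is just a placeholder):

import urllib2

# Placeholder URL; replace with the page you want to crawl
url = "http://www.example.com"
response = urllib2.urlopen(url)
print response.read()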

 

    POST method

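A minimal sketch of a POST request; the URL and form fields below are only placeholders:

import urllib
import urllib2

url = "http://www.example.com/login"
# Encode the form fields; passing data to Request makes urllib2 send a POST
form_data = urllib.urlencode({"name": "abc", "password": "1234"})
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()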

 

   2. Use proxy IP

In the process of developing crawlers, your IP often gets blocked, and that is when a proxy IP is needed.

There is a ProxyHandler class in the urllib2 package, through which you can set up a proxy to access web pages, as shown in the following code snippet:

 

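A sketch of how ProxyHandler is typically used; the proxy address is only a placeholder:

import urllib2

# Placeholder proxy address; replace with a working proxy IP and port
proxy = urllib2.ProxyHandler({"http": "127.0.0.1:8087"})
opener = urllib2.build_opener(proxy)
# install_opener applies the proxy to every later urlopen call;
# use opener.open(url) instead if the proxy should apply only here
urllib2.install_opener(opener)
response = urllib2.urlopen("http://www.example.com")
print response.read()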

 

   3. Cookie processing

Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. Python provides the cookielib module to handle cookies; its main job is to provide objects that can store cookies so that they can be used together with the urllib2 module to access Internet resources.

Code snippet:

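A minimal sketch combining cookielib.CookieJar with urllib2 (the URL is a placeholder):

import urllib2
import cookielib

# CookieJar keeps cookies in memory and attaches them to later requests
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)
response = urllib2.urlopen("http://www.example.com")
for cookie in cookie_jar:
    print cookie.name, cookie.value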

 

The key is CookieJar(), which manages HTTP cookie values: it stores the cookies generated by HTTP requests and adds the cookie objects to outgoing HTTP requests. The cookies are kept entirely in memory and are lost once the CookieJar instance is garbage collected; none of this has to be handled by hand.

Manually add cookies:

 

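One common way to do this is to set the Cookie header yourself; the cookie value below is only a placeholder:

import urllib2

request = urllib2.Request("http://www.example.com")
# Hand-written Cookie header; copy the real value from your browser
request.add_header("Cookie", "sessionid=abc123")
response = urllib2.urlopen(request)
print response.read()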

 

   4. Disguise as a browser

Some websites dislike visits from crawlers and refuse all such requests, so accessing them directly with urllib2 often results in HTTP Error 403: Forbidden.

Pay special attention to the following headers, which the server side will check:

1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a real browser.

2. Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content of the HTTP body.

In such cases, you can get around this by modifying the headers of the HTTP request. The code snippet is as follows:

 

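A sketch of setting request headers so the crawler looks like a browser (URL and User-Agent string are placeholders):

import urllib2

headers = {
    # Pretend to be an ordinary browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # Uncomment when posting to a REST interface:
    # "Content-Type": "application/json",
}
request = urllib2.Request("http://www.example.com", headers=headers)
response = urllib2.urlopen(request)
print response.read()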

 

   5. Processing of verification codes

Simple verification codes can be recognized with simple techniques, and we have only ever done some basic recognition of that kind. Some verification codes are downright hostile to humans, such as 12306's; those can be solved manually through a captcha-solving platform, which of course charges a fee.
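For the simple cases, a rough sketch of OCR-based recognition might use pytesseract (a tool not named in the article; the image file name is just a placeholder):

# Requires pytesseract, Pillow and the Tesseract binary to be installed
from PIL import Image
import pytesseract

# Read the text out of a simple CAPTCHA image
print pytesseract.image_to_string(Image.open("captcha.png"))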

 

   6. gzip compression

Have you ever come across web pages that stay garbled no matter how you re-encode them? Haha, that means you don't yet know that many web services can send compressed data, which can cut the amount of data transferred over the network by more than 60%. This is especially useful for XML web services, because XML data compresses very well.

However, a server will generally not send you compressed data unless you tell it that you can handle compressed data.

So you need to modify the code like this:

 

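A minimal sketch of requesting gzip-compressed data (placeholder URL):

import urllib2

request = urllib2.Request("http://www.example.com")
# Tell the server that gzip-compressed responses are acceptable
request.add_header("Accept-Encoding", "gzip")
response = urllib2.urlopen(request)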

 

This is the key step: create a Request object and add an Accept-Encoding header to tell the server that you can accept gzip-compressed data.

Then decompress the data:

 

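A sketch that continues the previous snippet, decompressing the body with the standard gzip and StringIO modules:

import gzip
from StringIO import StringIO

# Unzip the body only if the server actually sent it gzip-compressed
if response.info().get("Content-Encoding") == "gzip":
    buf = StringIO(response.read())
    data = gzip.GzipFile(fileobj=buf).read()
else:
    data = response.read()
print data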

 

   7. Multi-threaded concurrent crawling

If a single thread is too slow, you need multiple threads. Here is a simple thread pool template; the program just prints the numbers 1-10, but you can see that it does so concurrently.

Although Python's multithreading is rather underwhelming (the GIL prevents true CPU parallelism), it can still noticeably improve efficiency for crawlers, which spend most of their time waiting on the network.

 

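A minimal sketch of such a Queue-based thread pool, printing 1-10 from worker threads:

from Queue import Queue
from threading import Thread

NUM_WORKERS = 2   # number of worker threads
NUM_JOBS = 10     # number of tasks; here a task is just a number to print

q = Queue()

def do_something(arg):
    # In a real crawler this would fetch and parse a page
    print arg

def working():
    while True:
        arg = q.get()
        do_something(arg)
        q.task_done()

# Start the worker threads as daemons so the program can exit cleanly
for i in range(NUM_WORKERS):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Queue up the numbers 1-10 and wait until they are all processed
for i in range(NUM_JOBS):
    q.put(i + 1)
q.join()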



Origin blog.csdn.net/aaahtml/article/details/112916384