Crawler development involves a lot of reusable patterns. Here is a summary of techniques worth keeping around for future projects.
1. Basic fetching of web pages
GET method
POST method
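The two basic requests might look like the following sketch. The original post targets Python 2's urllib2; this uses its Python 3 successor, urllib.request, and the URLs/parameters are placeholders:

```python
from urllib import parse, request

def build_get(url, params=None):
    # GET: encode the parameters into the query string.
    if params:
        url = url + "?" + parse.urlencode(params)
    return request.Request(url)

def build_post(url, data):
    # POST: passing a body via `data` makes urllib issue a POST.
    payload = parse.urlencode(data).encode("utf-8")
    return request.Request(url, data=payload)

# Usage (performs a real network request):
# html = request.urlopen(build_get("http://www.example.com", {"q": "python"})).read()
```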
2. Use proxy IP
When developing crawlers, your IP often gets blocked; that is when proxy IPs are needed.
The urllib2 package provides a ProxyHandler class, through which you can set a proxy for accessing web pages, as shown in the following code snippet:
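A Python 3 sketch of this setup (urllib.request.ProxyHandler is the modern name; the proxy address 127.0.0.1:8087 is a placeholder, substitute your own):

```python
import urllib.request

# Toggle between going through a proxy and connecting directly.
enable_proxy = True
proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8087"})
null_proxy_handler = urllib.request.ProxyHandler({})

if enable_proxy:
    opener = urllib.request.build_opener(proxy_handler)
else:
    opener = urllib.request.build_opener(null_proxy_handler)

# install_opener makes every subsequent urllib.request.urlopen() use this opener.
urllib.request.install_opener(opener)
# response = urllib.request.urlopen("http://www.example.com")
```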
3. Cookies processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. Python provides the cookielib module for handling cookies; its main job is to provide objects that can store cookies, to be used together with the urllib2 module when accessing Internet resources.
code segment:
The key is CookieJar(), which manages HTTP cookie values: it stores the cookies generated by HTTP responses and adds them to outgoing HTTP requests. The cookies live entirely in memory, so they are lost once the CookieJar instance is garbage collected; none of this requires any manual handling.
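A sketch of this setup in Python 3, where the cookielib module is renamed http.cookiejar:

```python
import http.cookiejar  # Python 3 name for the cookielib module
import urllib.request

# CookieJar stores cookies from responses and sends them back automatically
# on subsequent requests made through this opener.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

# response = opener.open("http://www.example.com")
# for cookie in cookie_jar:
#     print(cookie.name, cookie.value)
```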
Manually add cookies:
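If you already have a cookie string (for example, copied from your browser's developer tools), you can attach it by hand via the Cookie header. The cookie value below is a placeholder:

```python
import urllib.request

# Send a fixed cookie string with every request made through this opener.
opener = urllib.request.build_opener()
opener.addheaders.append(("Cookie", "sessionid=placeholder-value"))
# response = opener.open("http://www.example.com")
```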
4. Disguise as a browser
Some websites dislike being visited by crawlers, so they reject all such requests. As a result, using urllib2 to access these sites directly often produces HTTP Error 403: Forbidden.
Pay special attention to certain headers, which the server side will check:
1. User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.
2. Content-Type: when using a REST interface, the server checks this value to decide how to parse the content of the HTTP body.
This can be worked around by setting the headers on the HTTP request. The code snippet is as follows:
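A minimal sketch of header spoofing (Python 3's urllib.request; the User-Agent string is just an example, copy a real one from your browser):

```python
import urllib.request

# Pretend to be a browser by supplying browser-like headers.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
}
req = urllib.request.Request("http://www.example.com", headers=headers)
# response = urllib.request.urlopen(req)
```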
5. Processing of verification codes
Simple verification codes can be recognized with simple techniques, and we have only ever done that kind of basic captcha recognition. For anti-human captchas, such as 12306's, you can have them solved manually through a captcha-solving platform. Of course, this costs money.
6. gzip compression
Have you ever run into webpages that stay garbled no matter how you transcode them? That means you may not know that many web servers can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially useful for XML web services, since XML data compresses very well.
But a server will generally not send you compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
This is the key: create a Request object and add an Accept-encoding header to tell the server you can accept gzip-compressed data.
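A sketch of that request, in Python 3's urllib.request (the URL is a placeholder):

```python
import urllib.request

# Advertise gzip support so the server may send compressed data.
req = urllib.request.Request("http://www.example.com")
req.add_header("Accept-encoding", "gzip")
# response = urllib.request.build_opener().open(req)
```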
Then decompress the data:
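A sketch of the decompression step: check the response's Content-Encoding and only gunzip when the server actually sent gzip data.

```python
import gzip
import io

def maybe_decompress(body, content_encoding):
    # Only decompress when the server actually sent gzip data.
    if content_encoding == "gzip":
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body

# In practice:
# body = maybe_decompress(response.read(),
#                         response.headers.get("Content-Encoding"))
```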
7. Multi-threaded concurrent crawling
If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template. The program does nothing more than print 1-10, but you can see that it runs concurrently.
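A sketch of such a Queue-based thread pool in Python 3 (the original targets Python 2's Queue module; the results list is added here just to make the output inspectable):

```python
from queue import Queue
from threading import Thread

results = []  # collected output; list.append is atomic under the GIL

def do_work(item):
    # Stand-in for the real crawl job (e.g., fetching one URL).
    print(item)
    results.append(item)

def worker():
    # Each worker loops forever, pulling jobs off the shared queue.
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for _ in range(4):  # four worker threads
    Thread(target=worker, daemon=True).start()

for i in range(1, 11):  # the "jobs": print 1-10
    q.put(i)

q.join()  # block until every queued job has been processed
```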
Although Python's multithreading is quite limited (the GIL prevents true CPU parallelism), for crawlers that spend most of their time waiting on the network it can still improve efficiency noticeably.