Common precautions and techniques for Python crawlers

Taking the urllib family of libraries as an example, this article introduces the modules the code below relies on: urllib, urllib2, cookielib, httplib, StringIO, gzip, Thread, Queue (Python 2).

1. Basic web page crawling

GET method

import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print(response.read())

POST method

import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print(response.read())

2. Using a proxy IP

When developing crawlers, your IP address often gets blocked, and then a proxy IP is needed. The urllib2 package provides the ProxyHandler class, through which you can set a proxy for accessing web pages, as shown in the following code snippet:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print(response.read())

3. Handling cookies

Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track sessions. Python provides the cookielib module for handling cookies; its main purpose is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.

Code snippet:

import urllib2
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are kept entirely in memory and are lost once the CookieJar instance is garbage-collected; none of this needs to be handled manually.
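If the cookies need to survive between runs, cookielib also provides MozillaCookieJar, which can save cookies to a file and load them back. A minimal sketch (cookies.txt and the URL are placeholders):

import urllib2
import cookielib

# A CookieJar backed by a file in Mozilla cookies.txt format
cookie_jar = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
opener.open('http://XXXX')
# Persist the cookies, including session cookies that would otherwise be discarded
cookie_jar.save(ignore_discard=True, ignore_expires=True)
# Later, load them back before making new requests
cookie_jar.load('cookies.txt', ignore_discard=True, ignore_expires=True)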

Add cookies manually

cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)

4. Disguising as a browser

Some websites dislike crawler visits and refuse all such requests, so accessing them directly with urllib2 often results in HTTP Error 403: Forbidden.

  • Pay special attention to certain headers; the server side checks them:

  • User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser

  • Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content of the HTTP body

This can be achieved by modifying the headers; the code snippet is as follows:

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0'
}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print(urllib2.urlopen(request).read())

5. Page parsing

  • The most powerful tool for page parsing is of course the regular expression, which differs from site to site and user to user
  • As for parsing libraries, two are commonly used: lxml and BeautifulSoup
  • Both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is therefore slower, but its features are practical; for example, you can get the source code of an HTML node from a search result. lxml is implemented in C, is efficient, and supports XPath
  • Personally I prefer XPath for page parsing (see the sketch after this list)
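As a quick illustration of the two libraries, here is a minimal sketch; it assumes lxml and beautifulsoup4 are installed, and the HTML fragment is made up for the example:

from lxml import etree
from bs4 import BeautifulSoup

html = "<html><body><div class='title'><a href='/1.html'>hello</a></div></body></html>"

# lxml: XPath pulls out the link text and the href attribute
tree = etree.HTML(html)
print(tree.xpath("//div[@class='title']/a/text()"))  # ['hello']
print(tree.xpath("//div[@class='title']/a/@href"))   # ['/1.html']

# BeautifulSoup: pure Python, slower, but the API is convenient
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('div', class_='title').find('a')
print(link.get_text())  # hello
print(link['href'])     # /1.html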

6. Handling verification codes

  • Simple verification codes can be recognized programmatically (see the sketch below). However, for verification codes designed to defeat automation, such as 12306's, you can go through a human captcha-entry platform, which of course charges a fee.
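For the simple case, OCR is a common approach. A minimal sketch using the third-party pytesseract and Pillow packages (captcha.png and the threshold value are placeholders; this only works for very clean captchas):

import pytesseract
from PIL import Image

# Load the captcha, convert to grayscale, then binarize to drop background noise
image = Image.open('captcha.png').convert('L')
image = image.point(lambda p: 0 if p < 140 else 255)
print(pytesseract.image_to_string(image).strip())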

7. Gzip compression

Have you ever come across web pages that stay garbled no matter how you transcode them? Haha, that means you don't know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML data compresses very well.

However, a server will generally not send you compressed data unless you tell it that you can handle compressed data.

So you need to modify the code like this:

import urllib2
import httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)

This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.

Then decompress the data:

import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print(gzipper.read())
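Note that if the server ignores the Accept-encoding header, it will return uncompressed data, so a more defensive variant of the block above checks the Content-Encoding response header before decompressing (a sketch reusing f, StringIO and gzip from above):

data = f.read()
# Only decompress if the server actually gzipped the response
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print(data)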

8. Multi-threaded concurrent crawling

  • If a single thread is too slow, you need multiple threads. Below is a simple thread-pool template; the program just prints the numbers 0-9, but you can see that they are processed concurrently
  • Although Python's multithreading is rather weak (because of the GIL), for network-heavy crawlers it can still improve efficiency to some extent.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10


# The actual handler, responsible for processing a single task
def do_something_using(arguments):
    print(arguments)


# The worker thread: keeps taking tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()


# Start NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()
# Put the JOBS into the queue
for i in range(JOBS):
    q.put(i)
# Wait for all JOBS to finish
q.join()

PS: High-intensity crawling puts a lot of pressure on servers. Throttle your request rate and use these techniques responsibly to avoid adverse effects, for example as sketched below.
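A minimal throttling sketch that inserts a fixed delay between requests (the URL list and the delay value are placeholders to tune for the target site):

import time
import urllib2

urls = ['http://www.example.com/page/%d' % i for i in range(1, 4)]
DELAY = 2  # seconds to wait between requests

for url in urls:
    response = urllib2.urlopen(url)
    print(len(response.read()))
    time.sleep(DELAY)  # be polite: do not hammer the server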
