Understanding the principles of a Python distributed crawler

This article introduces the working principles of a distributed crawler in Python.


First, let's look at how a person normally obtains web content:

(1) Open the browser, enter the URL, and open the source webpage.

(2) Select the content we want, including the title, author, abstract, body, and other information.

(3) Save it to the hard disk.

Mapped to the technical level, these three steps correspond to: making a network request, extracting structured data, and storing the data.

Below is a simple Python program that implements this basic crawling workflow.

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 2014-03-16

@author: Kris
'''
import urllib2, re, cookielib

def httpCrawler(url):
  '''
  @summary: Crawl a web page
  '''
  content = httpRequest(url)
  title = parseHtml(content)
  saveData(title)

def httpRequest(url):
  '''
  @summary: Network request
  '''
  ret = None
  SockFile = None
  try:
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)')
    request.add_header('Pragma', 'no-cache')
    opener = urllib2.build_opener()
    SockFile = opener.open(request)
    ret = SockFile.read()
  finally:
    if SockFile:
      SockFile.close()
  return ret

def parseHtml(html):
  '''
  @summary: Extract structured data
  '''
  content = None
  pattern = '<title>(.*?)</title>'  # example pattern: grab the page title; replace with rules for the fields you need
  temp = re.findall(pattern, html)
  if temp:
    content = temp[0]
  return content

def saveData(data):
  '''
  @summary: Data storage
  '''
  f = open('test', 'wb')
  f.write(data)
  f.close()

if __name__ == '__main__':
  url = 'http://www.baidu.com'
  httpCrawler(url)

It looks very simple, and indeed it is a basic starter program for crawlers. Any collection process boils down to the steps above, but to build a powerful collection system you will run into the following problems:

(1) You need to carry cookie information when visiting. For example, most social sites require users to log in before they can see anything valuable. This is actually quite simple: with the cookielib module provided by Python we can make every visit carry the cookies issued by the source website, so as long as we successfully simulate a login and the crawler stays in the logged-in state, we can collect everything a logged-in user can see. Below is the httpRequest() method modified to use cookies:

ckjar = cookielib.MozillaCookieJar()
cookies = urllib2.HTTPCookieProcessor(ckjar)     # define the cookie processor

def httpRequest(url):
  '''
  @summary: Network request
  '''
  ret = None
  SockFile = None
  try:
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)')
    request.add_header('Pragma', 'no-cache')
    opener = urllib2.build_opener(cookies)    # pass the cookie processor to the opener
    SockFile = opener.open(request)
    ret = SockFile.read()
  finally:
    if SockFile:
      SockFile.close()
  return ret

(2) Encoding problems. Websites today mostly use one of two encodings, utf-8 or gbk. When the encoding of the source website differs from the encoding we store in our database, for example 163.com uses gbk while we need to store utf-8 encoded data, we can use the encode() and decode() methods provided by Python to convert, for example:

content = content.decode('gbk', 'ignore')   # decode gbk bytes into unicode
content = content.encode('utf-8', 'ignore')  # encode unicode into utf-8

Unicode appears as the intermediate step: we first convert to unicode and only then convert to gbk or utf-8.

(3) The tags in the webpage are incomplete. For example, some source pages have a start tag but no end tag, and broken HTML will interfere with extracting structured data. We can use Python's BeautifulSoup module to clean up the source code first, and then parse out the content.
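
As a rough sketch of that idea (not part of the original program; it assumes the bs4 package is installed, and the title field is only an illustrative target), BeautifulSoup repairs unclosed tags while parsing:

from bs4 import BeautifulSoup  # assumes the bs4 package is installed

def parseHtml(html):
  '''
  @summary: Extract structured data after repairing broken tags
  '''
  soup = BeautifulSoup(html, 'html.parser')  # the parser auto-closes unterminated tags
  title_tag = soup.find('title')             # 'title' is just an illustrative target field
  return title_tag.get_text() if title_tag else None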

(4) Some websites use JS to generate web content. When we look at the raw source code we see only a headache-inducing pile of JS. Toolkits that embed a browser engine, such as mozilla or webkit, can execute the js and ajax for us and then hand back the rendered page, although this is somewhat slower.
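
One common way to do this (again not part of the original script; it assumes Selenium and a Firefox driver are installed) is to let a real browser engine render the page and then read the resulting HTML:

from selenium import webdriver  # assumes selenium plus a Firefox/geckodriver install

def renderedRequest(url):
  '''
  @summary: Fetch page content after JS/ajax has executed (hypothetical helper)
  '''
  driver = webdriver.Firefox()
  try:
    driver.get(url)
    return driver.page_source  # HTML after the browser has run the scripts
  finally:
    driver.quit()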

(5) Content delivered as images or flash. When the content of an image is text or digits, it is relatively easy to handle: OCR technology can recognize it automatically. If it is a flash link, we simply store the whole URL.
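
A minimal OCR sketch, assuming the pytesseract and Pillow packages (and a local Tesseract install) are available; none of this is in the original program:

from PIL import Image        # assumes Pillow is installed
import pytesseract           # assumes pytesseract plus a Tesseract install

def ocrImage(path):
  '''
  @summary: Recognize text or digits inside a downloaded image (hypothetical helper)
  '''
  return pytesseract.image_to_string(Image.open(path))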

(6) A website may have several different page structures, so a single set of crawling rules will not be enough; we need to configure multiple sets of rules (templates) to assist crawling.
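
One simple way to organize this (a sketch only; the patterns below are made up for illustration) is to keep several rule sets and try them in turn until one matches the page:

import re

# Hypothetical rule sets: each page template gets its own pattern
RULES = [
  r'<h1 class="post-title">(.*?)</h1>',
  r'<div id="title">(.*?)</div>',
  r'<title>(.*?)</title>',
]

def parseHtml(html):
  '''
  @summary: Try each rule set until one matches the page structure
  '''
  for pattern in RULES:
    temp = re.findall(pattern, html, re.S)
    if temp:
      return temp[0]
  return None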

(7) Dealing with monitoring by the source website. Crawling other people's content is, after all, not always welcome, so most websites impose restrictions on crawlers.
A good collection system should be able to collect any data the target user can see, no matter where it lives: what-you-see-is-what-you-get, non-blocking collection, whether or not a login is required. Most valuable information, such as that on social networking sites, generally requires login, so to cope with such sites the crawler must simulate user login in order to obtain data normally. However, social websites want to form a closed loop and are unwilling to expose data outside the site, so they are not as open as news sites. Most of them impose restrictions to stop robot crawlers, and an account is usually detected and banned after crawling for a while. Does that mean we cannot crawl these sites? Certainly not: as long as a social site does not close off web access, we can reach any data a normal person can reach. In the end it comes down to simulating normal human behavior, which is professionally called "anti-monitoring".

The source website generally has the following restrictions:

1. The number of visits from a single IP within a certain period. A normal user browsing a website will not, unless randomly clicking around, visit it too quickly, and will not keep it up for very long. This is easy to handle: we can build a pool of many unrelated proxy IPs, randomly pick a proxy from the pool for each request, and simulate access that way. Proxy IPs come in two kinds, transparent proxies and anonymous proxies. (See the sketch after this list.)

2. The number of visits from a single account within a certain period. If one "person" hits a data interface 24 hours a day at high speed, it is probably a robot. We can use a large number of accounts that behave normally, that is, the way an ordinary person operates on a social site: keep the number of URLs visited per unit time low and leave an interval between visits. The interval can be a random value: after each URL, sleep for a random period before visiting the next one. (See the sketch after this list.)
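
A minimal sketch combining both ideas, random proxy rotation plus a random pause between requests (the proxy addresses are placeholders, not real servers):

import random, time, urllib2

# Hypothetical proxy pool; replace with real transparent/anonymous proxies
PROXY_POOL = ['1.2.3.4:8080', '5.6.7.8:3128', '9.10.11.12:8888']

def httpRequestViaProxy(url):
  '''
  @summary: Fetch a URL through a randomly chosen proxy, then sleep a random interval
  '''
  proxy = random.choice(PROXY_POOL)
  opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
  try:
    ret = opener.open(url).read()
  finally:
    time.sleep(random.uniform(3, 10))  # random pause so the access pattern looks human
  return ret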

If you can control the account and IP access policies, there will basically be no problem. Of course, the other side's operations staff will also adjust their strategies. In this contest, the crawler must be able to perceive when the opponent's anti-crawling measures are affecting us and notify the administrator in time. Ideally, the confrontation would be handled intelligently through machine learning, achieving uninterrupted capture.

The following is the distributed crawler architecture I have been designing recently, shown in Figure 1:
[Figure 1: distributed crawler architecture diagram]

It is still a rough piece of work. The preliminary idea is being implemented, and the communication between the server and the clients is being built, mainly using Python's socket module. If you are interested, feel free to contact me to discuss it and work out a better design together.
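
As a bare-bones sketch of that server/client communication (the port and the message format are my own assumptions, not the author's actual protocol), the master could push URL tasks to worker nodes over a socket:

import socket

HOST, PORT = '127.0.0.1', 9999   # assumed address for illustration

def server():
  '''
  @summary: Master node hands a URL task to whichever worker connects
  '''
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind((HOST, PORT))
  s.listen(5)
  conn, addr = s.accept()
  conn.send('http://www.baidu.com')   # push one crawl task to the worker
  conn.close()
  s.close()

def client():
  '''
  @summary: Worker node receives a URL and crawls it
  '''
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((HOST, PORT))
  url = s.recv(1024)
  s.close()
  httpCrawler(url)   # reuse the crawler defined earlier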
