Web Crawler

1. Crawler flow chart
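The chart itself is not reproduced here, but the flow it describes is the standard crawl loop: take a URL from a queue, fetch the page, extract its links, and enqueue any unseen ones. A minimal sketch of that loop (the fetch and extract_links callables are hypothetical stand-ins for a real downloader and parser):

```python
from collections import deque

def crawl(seed, fetch, extract_links, limit=10):
    """Canonical crawl loop: dequeue a URL, fetch it,
    extract new links, enqueue the ones not seen before."""
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)              # download the page
        pages[url] = html
        for link in extract_links(html):
            if link not in seen:       # skip already-visited URLs
                seen.add(link)
                queue.append(link)
    return pages

# usage with stub fetch/extract functions (no network involved):
site = {"a": ["b", "c"], "b": ["c"], "c": []}
pages = crawl("a", fetch=lambda u: u, extract_links=lambda h: site[h])
print(sorted(pages))   # ['a', 'b', 'c']
```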

2. A simple crawler that fetches the content of an entire web page

# python2

import urllib2

response = urllib2.urlopen("http://www.baidu.com")  # open the URL
html = response.read()                              # read the whole page (raw bytes)
print(html)
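Note that urllib2 exists only in Python 2. In Python 3 it was split into urllib.request and urllib.error, so the same fetch is spelled slightly differently. A small sketch (the download line is left commented so the example does not require network access):

```python
import urllib.request

req = urllib.request.Request("http://www.baidu.com")
print(req.full_url)   # http://www.baidu.com

# response = urllib.request.urlopen(req)   # performs the actual download
# html = response.read()                   # raw bytes of the page
```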

 

3. Handling garbled Chinese text (encoding issues)

 

# coding:utf-8
# python2
import sys
import urllib2

# Set the default encoding (a Python 2 workaround)
reload(sys)
sys.setdefaultencoding('utf-8')

# Get the system encoding format
sys_encoding = sys.getfilesystemencoding()

req = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(req)
# decode the utf-8 page, then re-encode it for the local console
html = response.read().decode('utf-8').encode(sys_encoding)
print(html)
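In Python 3 the reload(sys) / setdefaultencoding workaround is unnecessary (and no longer works): str is already unicode, so decoding the response bytes once is enough. A small sketch using a stand-in byte string instead of a live response:

```python
# Python 3: decode the raw bytes once; no setdefaultencoding hack needed.
raw = "百度一下".encode("utf-8")   # stand-in for response.read()
html = raw.decode("utf-8")         # bytes -> str (unicode)
print(html)                        # 百度一下
```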

 

4. Disguise the request as a browser (the User-Agent header)

 

# coding:utf-8
# python2
import sys
import urllib2

# Set the default encoding (a Python 2 workaround)
reload(sys)
sys.setdefaultencoding('utf-8')

# Get the system encoding format
sys_encoding = sys.getfilesystemencoding()

url = "http://www.baidu.com"
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

headers = {
    'User-Agent': user_agent
}
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
html = response.read().decode("utf-8").encode(sys_encoding)
print(html)
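For reference, the same User-Agent trick in Python 3, where Request lives in urllib.request. No download is performed here; urlopen(req) would fetch the page:

```python
import urllib.request

url = "http://www.baidu.com"
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}
req = urllib.request.Request(url, headers=headers)

# urllib stores header names in capitalized form
print(req.get_header("User-agent"))
# response = urllib.request.urlopen(req)   # would perform the request
```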

 

5. Parse web content

  5.1 Regular expressions (import re)

Create a regular expression object: pattern = re.compile(r'\d+\.\d+', re.S)

By default, '.' does not match newline characters.

With the re.S flag, '.' matches any character including newlines, so a pattern can span the entire document.

import re

# raw string: backslashes are kept literally, so \d reaches the regex engine intact
pattern = re.compile(r"\d+\.\d+")

s1 = "1.234 dsa frwr 4235.324 432423"
rs = pattern.findall(s1)   # all non-overlapping matches
print(rs)                  # ['1.234', '4235.324']

 

A raw string such as r"dsa\dsf\sd" treats the backslashes as ordinary characters rather than escape sequences, which is why regular expression patterns are normally written as raw strings.
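The effect of re.S can be seen directly: without it, '.' stops at a newline; with it, a pattern can cross line boundaries:

```python
import re

text = "price: 1.5\nnext: 2.75"

no_flag = re.findall(r"price:.*next:", text)          # '.' cannot cross '\n'
with_flag = re.findall(r"price:.*next:", text, re.S)  # '.' matches '\n' too

print(no_flag)    # []
print(with_flag)  # ['price: 1.5\nnext:']
```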

 

  5.2 DOM parsing
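This section is empty in the source and names no parsing library, so what follows is only a sketch: the standard library's html.parser walks tags event by event (libraries such as BeautifulSoup or lxml offer a fuller DOM-style API). Here it collects the href of every <a> tag:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkParser()
parser.feed('<p><a href="http://www.baidu.com">baidu</a></p>')
print(parser.links)   # ['http://www.baidu.com']
```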

 
