1. Crawler flow chart
2. A simple crawler that fetches an entire web page (Python 2)
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
html = response.read()
print(html)
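Note that urllib2 exists only in Python 2; in Python 3 the same module lives under urllib.request. A minimal sketch of the equivalent setup (only the request construction is shown, since the actual fetch needs network access):

```python
# Python 3 equivalent: urllib2 was merged into urllib.request
from urllib import request

req = request.Request("http://www.baidu.com")
# request.urlopen(req).read() would perform the actual fetch;
# here we only show that the request object carries the target URL
print(req.full_url)
```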
3. Handling garbled Chinese output
# coding:utf-8
import sys
import urllib2

# Python 2: reset the default encoding to UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')

# Get the system's file-system encoding
type = sys.getfilesystemencoding()

req = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(req)
# Decode the page as UTF-8, then re-encode to the local encoding
html = response.read().decode('utf-8').encode(type)
print(html)
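The garbling happens because the page arrives as UTF-8 bytes but is printed under a different console encoding; decoding the bytes with the right charset recovers the text. A small self-contained illustration (the byte string below is the UTF-8 encoding of the Chinese greeting "你好", used here as a stand-in for a fetched page):

```python
# UTF-8 bytes for the Chinese greeting "你好"
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Printing raw bytes directly is what produces garbled output;
# decoding with the correct charset recovers the text
text = raw.decode('utf-8')
print(text)
```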
4. Disguising the request as a browser with a User-Agent header
# coding:utf-8
import sys
import urllib2

# Python 2: reset the default encoding to UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')

# Get the system's file-system encoding
type = sys.getfilesystemencoding()

url = "http://www.baidu.com"
# A browser-like User-Agent string
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
headers = {'User-Agent': user_agent}

req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
html = response.read().decode("utf-8").encode(type)
print(html)
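The same disguise works in Python 3 via urllib.request. A sketch that builds the request and checks the header is attached, without performing the network fetch (note that urllib stores header names with only the first letter capitalized):

```python
# Python 3 sketch: attaching a browser-like User-Agent header
from urllib import request

url = "http://www.baidu.com"
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
headers = {'User-Agent': user_agent}

req = request.Request(url, headers=headers)
# The header is stored on the request before it is sent;
# request.urlopen(req) would send it
print(req.get_header('User-agent'))
```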
5. Parsing web page content
5.1 Regular expressions (import re)
Create a regular expression object: pattern = re.compile(r'\d+\.\d+', re.S)
By default, the dot (.) does not match newline characters; with the re.S flag it matches any character, so a pattern can span the entire document.
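The effect of re.S can be seen by matching a dot-containing pattern against text that spans two lines:

```python
import re

text = "abc\ndef"

# Without re.S, '.' stops at the newline, so no match is found
print(re.findall("a.+f", text))
# With re.S, '.' also matches '\n', so the match spans both lines
print(re.findall("a.+f", text, re.S))
```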
import re

pattern = re.compile(r"\d+\.\d+")
s1 = "1.234 dsa frwr 4235.324 432423"
rs = pattern.findall(s1)
print(rs)  # ['1.234', '4235.324']
A raw string such as r"dsa\dsf\sd" treats backslash escape sequences as ordinary characters, which is why regex patterns are usually written as raw strings.
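To see the difference, compare a plain string with a raw string holding the same characters:

```python
# In a plain string, "\n" is an escape that becomes a newline;
# in a raw string, every backslash stays literal
plain = "a\nb"   # three characters: a, newline, b
raw = r"a\nb"    # four characters: a, backslash, n, b

print(len(plain))
print(len(raw))
```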
5.2 DOM parsing