First, reference materials
1. "Web Scraping with Python" (Turing Press)
2. "Mastering the Python Crawler Framework Scrapy" (People's Posts and Telecommunications Press)
3. [Scrapy official tutorial](http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html)
4. [Python3 crawler](http://blog.csdn.net/c406495762/article/details/72858983)
Second, prerequisite knowledge
URLs, the HTTP protocol, web front-end technologies (HTML / CSS / JS), Ajax, re, XPath, XML
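To show how two of these prerequisites (HTML and `re`) meet in practice, here is a minimal sketch, with a made-up HTML fragment, that pulls the `href` attributes out of a page the way a simple crawler might:

```python
import re

# a small HTML fragment, as a crawler might receive it after downloading a page
html = '<a href="https://example.com/page1">One</a> <a href="https://example.com/page2">Two</a>'

# a deliberately naive pattern for href attributes; real pages are messier
# and are better handled with an HTML parser or XPath
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['https://example.com/page1', 'https://example.com/page2']
```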
Third, the basics
1. Introduction to crawlers
Crawler definition: a web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the Web according to certain rules. Some less frequently used names include ant, automatic indexer, or worm.
2. Two features
(1) it can download the data or content the author needs
(2) it can move through the network automatically
3. Three steps
(1) download a page;
(2) extract the desired information from the page;
(3) following certain rules, automatically jump to another page and repeat the first two steps on its content
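The three steps above can be sketched as a loop. This is only an illustration of the control flow (the function and variable names are my own, and an in-memory dictionary stands in for real HTTP downloads):

```python
from collections import deque

def crawl(start_url, download, extract_info, extract_links, max_pages=10):
    """Sketch of the three-step loop: download, extract, follow links."""
    seen = {start_url}
    queue = deque([start_url])
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        page = download(url)                # step 1: download the page
        results.append(extract_info(page))  # step 2: extract the information
        for link in extract_links(page):    # step 3: jump to further pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results

# demo with a tiny in-memory "web" instead of real HTTP requests
fake_web = {
    "a": {"title": "Page A", "links": ["b", "c"]},
    "b": {"title": "Page B", "links": ["c"]},
    "c": {"title": "Page C", "links": []},
}
titles = crawl("a",
               download=lambda u: fake_web[u],
               extract_info=lambda p: p["title"],
               extract_links=lambda p: p["links"])
print(titles)  # ['Page A', 'Page B', 'Page C']
```

A real crawler would plug `urllib.request.urlopen` (covered below) into the `download` slot and an HTML parser into the two `extract` slots.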
4. Crawler classification
(1) general-purpose crawlers
(2) focused (site- or topic-specific) crawlers
5. Overview of Python network packages
Python 2: urllib, urllib2, urllib3, httplib, httplib2, requests
Python 3.x: urllib, urllib3, httplib2, requests
In Python 2, urllib was used in combination with urllib2, or requests was used instead;
in Python 3, use urllib.request.
6. urllib
It contains the following modules:
urllib.request: open and read URLs
urllib.error: the common errors raised by urllib.request; handle them with try/except
urllib.parse: methods for handling URLs
urllib.robotparser: parses robots.txt files
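A short sketch of the two modules not demonstrated below, `urllib.parse` and `urllib.robotparser` (the example URL and robots.txt lines are made up for illustration):

```python
from urllib import parse, robotparser

# urllib.parse: split a URL into its components
url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN"
parts = parse.urlparse(url)
print(parts.netloc)                 # 'mp.weixin.qq.com'
print(parts.path)                   # '/cgi-bin/home'
print(parse.parse_qs(parts.query))  # {'t': ['home/index'], 'lang': ['zh_CN']}

# build a percent-encoded query string from a dict
print(parse.urlencode({"lang": "zh_CN", "q": "python crawler"}))

# urllib.robotparser: parse robots.txt rules (fed in directly, no network)
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```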
```python
from urllib import request

"""
Use urllib.request to fetch a web page's content and print it out
"""

if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=984602018"
    # open the url and get the page back as a response object
    rsp = request.urlopen(url)
    # read out the returned result
    html = rsp.read()
    print(type(html))  # bytes
    html = html.decode()
    print(html)
```
7. Detecting page encodings with the chardet package
```python
from urllib import request
import chardet

"""
Use urllib.request to fetch a web page's content and print it out
"""

if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=984602018"
    # open the url and get the page back as a response object
    rsp = request.urlopen(url)
    # read out the returned result
    html = rsp.read()
    print(type(html))  # bytes
    print("=========================")
    # use chardet to detect which encoding the page uses
    cs = chardet.detect(html)
    print(cs)
    print(type(cs))
    # take the "encoding" key from the cs dictionary with get() so that a
    # missing key falls back to utf-8 instead of crashing the program
    html = html.decode(cs.get("encoding", "utf-8"))
```
Fourth, source code
1. Reptile1_SimpleAnalysis.py
https://github.com/ruigege66/PythonReptile/blob/master/Reptile1_SimpleAnalysis.py
2. CSDN: https://blog.csdn.net/weixin_44630050
3. Blog Garden (cnblogs): https://www.cnblogs.com/ruigege0000/
4. You are welcome to follow the WeChat public account "Fourier transform", a personal account for learning exchanges only; reply "gifts" in the background to get big-data learning materials