Python Crawler Series 1: the urllib.request Package and Using chardet

I. References

1. *Web Scraping with Python*, Turing Press

2. *Mastering the Python Crawler Framework Scrapy*, People's Posts and Telecommunications Press

3. [Scrapy official tutorial](http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html)

4. [Python3 crawler](http://blog.csdn.net/c406495762/article/details/72858983)

II. Prerequisites

URLs, the HTTP protocol, web front-end basics (HTML/CSS/JS), Ajax, re, XPath, XML

III. Basics

1. Introduction to crawlers

Crawler definition: a web crawler (also known as a web spider or web robot; in the FOAF community it is more often called a web chaser) is a program or script that automatically fetches information from the Web according to certain rules. Other, less common names include ant, automatic indexer, and worm-like simulation program.

2. Two characteristics

(1) It can download the data or content the author needs.

(2) It can move around the network automatically, following links from page to page.

3. Three steps

(1) Download the page;

(2) Extract the desired information from it;

(3) Following certain rules, automatically jump to another page and repeat the previous two steps.
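The three steps above can be sketched as a minimal loop. This is a toy illustration, not the author's code: the regex-based link extraction and the `max_pages` cutoff are my own assumptions.

```python
import re
from urllib import request


def extract_links(html):
    """Step (2): pull absolute links out of a downloaded page (naive regex)."""
    return re.findall(r'href="(https?://[^"]+)"', html)


def crawl(start_url, max_pages=3):
    """Steps (1)-(3): download a page, extract links, then move on to the next."""
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        # step (1): download the page
        html = request.urlopen(url).read().decode("utf-8", errors="replace")
        # steps (2)-(3): extract links and queue them for the next iteration
        queue.extend(extract_links(html))
    return seen
```

A real crawler would also respect robots.txt, throttle requests, and handle errors; this skeleton only shows the download/extract/jump cycle.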

4. Crawler classification

(1) General-purpose crawlers

(2) Focused (site-specific) crawlers

5. Overview of Python networking packages

Python 2: urllib, urllib2, urllib3, httplib, httplib2, requests

Python 3.x: urllib, urllib3, httplib2, requests

In Python 2, urllib is used together with urllib2, or requests is used instead;

in Python 3, urllib.request is used.

6. urllib

It contains the following modules:

urllib.request: opens and reads URLs

urllib.error: contains the common exceptions raised by urllib.request; handle them with try/except

urllib.parse: contains methods for parsing URLs

urllib.robotparser: parses robots.txt files
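As a quick example of urllib.parse, it can split a URL into its components and decode the query string (a minimal sketch; the URL is the WeChat one used in this post's examples):

```python
from urllib import parse

url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=984602018"
parts = parse.urlparse(url)          # split into scheme, netloc, path, query, ...
print(parts.scheme)                  # https
print(parts.netloc)                  # mp.weixin.qq.com
print(parts.path)                    # /cgi-bin/home
query = parse.parse_qs(parts.query)  # decode the query string into a dict of lists
print(query["lang"])                 # ['zh_CN']
```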

 

 

```python
from urllib import request

"""
Use urllib.request to request a web page and print out its contents
"""

if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=984602018"
    # open the url and get the corresponding page back as the response
    rsp = request.urlopen(url)
    # read out the returned result
    html = rsp.read()
    print(type(html))  # bytes type
    html = html.decode()
    print(html)
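urllib.error, mentioned above, supplies the exceptions that urllib.request raises. A sketch of the try/except pattern (the hostname `example.invalid` is my own choice: `.invalid` is a reserved TLD that can never resolve, used here only to trigger the error path):

```python
from urllib import request, error

if __name__ == "__main__":
    try:
        # "example.invalid" can never resolve, so this request must fail
        rsp = request.urlopen("https://example.invalid/", timeout=5)
        print(rsp.read())
    except error.HTTPError as e:  # the server answered with an error status
        print("HTTP error:", e.code)
    except error.URLError as e:   # network-level failure (DNS, refused connection, ...)
        print("URL error:", e.reason)
```

Note that HTTPError is a subclass of URLError, so it must be caught first.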

7. Using the chardet package to detect a page's encoding

 

```python
from urllib import request
import chardet

"""
Use urllib.request to request a web page, detect its encoding with chardet,
and print out its contents
"""

if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=984602018"
    # open the url and get the corresponding page back as the response
    rsp = request.urlopen(url)
    # read out the returned result
    html = rsp.read()
    print(type(html))  # bytes type
    print("=========================")
    cs = chardet.detect(html)  # use chardet to detect which encoding the page uses
    print(cs)
    print(type(cs))
    # use get() so the program does not crash if the key is missing
    html = html.decode(cs.get("encoding", "utf-8"))  # take the encoding entry of the cs dict; fall back to utf-8 if absent
```
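chardet.detect works on any bytes object, not only on downloaded pages, so it can be tried offline. A small sketch (the sample sentence is my own):

```python
import chardet

# encode a Chinese sentence so we have bytes of a known encoding to detect
sample = "网络爬虫是按照一定的规则自动抓取网络信息的程序或脚本".encode("utf-8")
result = chardet.detect(sample)
print(result)  # a dict with 'encoding' and 'confidence' keys
# fall back to utf-8 when detection fails outright (encoding can be None)
text = sample.decode(result.get("encoding") or "utf-8")
print(text)
```

Note that `cs.get("encoding", "utf-8")` in the listing above only protects against a *missing* key; when detection fails, chardet returns the key with the value None, so `result.get("encoding") or "utf-8"` is the slightly safer fallback.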

 

IV. Source code

1. GitHub: Reptile1_SimpleAnalysis.py

https://github.com/ruigege66/PythonReptile/blob/master/Reptile1_SimpleAnalysis.py

2. CSDN: https://blog.csdn.net/weixin_44630050

3. cnblogs: https://www.cnblogs.com/ruigege0000/

4. Welcome to follow my WeChat public account "Fourier transform". It is a personal account, for learning and exchange only; reply "gift pack" in the background to get big-data learning materials.

 


Origin: www.cnblogs.com/ruigege0000/p/12169312.html