Wu Yuxiong -- Python study notes: web crawler

import chardet
import urllib.request

page = urllib.request.urlopen('http://photo.sina.com.cn/')  # open the web page
htmlCode = page.read()  # get the page source (as bytes)

print(chardet.detect(htmlCode))  # print the detected encoding of the page
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
data = htmlCode.decode('utf-8')
print(data)  # print the page source
pageFile = open('E:\\pageCode.txt', 'wb')  # open pageCode.txt for writing in binary mode
pageFile.write(htmlCode)  # write the raw bytes
pageFile.close()  # remember to close what you opened
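
The code above hard-codes 'utf-8' after reading the chardet result by hand. As a minimal sketch using only the standard library, the decode step could instead try a short list of candidate encodings in order. The helper name decode_page and the default candidate list are assumptions made for illustration; they are not part of the original notes.

```python
def decode_page(raw, candidates=("utf-8", "gbk")):
    """Decode raw page bytes with the first candidate encoding that works.

    decode_page and its default candidates are hypothetical; adjust the
    list to the sites you actually crawl.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing the bytes that do not fit
    return raw.decode(candidates[0], errors="replace"), candidates[0]


sample = "中文".encode("gbk")  # stand-in for page.read()
text, used = decode_page(sample)
print(used, text)
```

Trying UTF-8 first is a reasonable default because GBK-encoded Chinese text often fails strict UTF-8 decoding, while the reverse is much less reliable.
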
Extracting other information

Open the pageCode.txt file (or inspect the original page directly with the browser's F12 developer tools) to find the tags that hold the data you want.

For example, suppose I want to grab the pictures.

Write a regular expression for the pictures: reg = r'src="(.+?\.jpg)"'

To explain: it matches a string that begins with src=", continues with one or more of any character (non-greedy), and ends with .jpg". For example, in the screenshot, the string inside the double quotes after src (marked with a red frame) is the link that gets matched.
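
To see why the non-greedy form matters, here is a small sketch that runs the pattern and a greedy variant against a made-up HTML fragment. The example.com URLs and the fragment itself are invented for illustration.

```python
import re

# Hypothetical HTML fragment, made up for illustration
html = '<img src="http://example.com/a.jpg" alt="x"><img src="http://example.com/b.jpg">'

lazy = re.search(r'src="(.+?\.jpg)"', html)   # non-greedy: stops at the first .jpg"
greedy = re.search(r'src="(.+\.jpg)"', html)  # greedy: runs on to the last .jpg"

print(lazy.group(1))    # only the first link
print(greedy.group(1))  # everything from the first link to the end of the second
```

The greedy version swallows the markup between the two tags, which is why the pattern in these notes uses .+? rather than .+.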

What remains is to use the regular expression to pull the matching strings out of the long string returned by the get_html method.

This is done with re.findall from Python's re library, which returns a list of all the strings that match:
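
As a quick illustration of what re.findall returns, the sketch below runs it on an invented fragment: with one capturing group the result is a list of strings, and with two groups it is a list of tuples. The fragment and the variable names hrefs and pairs are assumptions for this example only.

```python
import re

# Made-up fragment for illustration only
html = ('<li><a href="/c/1.html" title="t1">Chapter 1</a></li>'
        '<li><a href="/c/2.html" title="t2">Chapter 2</a></li>')

# One capturing group: a list of strings
hrefs = re.findall(r'href="(.*?)"', html)

# Two capturing groups: a list of (href, text) tuples
pairs = re.findall(r'<a href="(.*?)" title=".*?">(.*?)</a>', html)

print(hrefs)
print(pairs)
```

The tuple form is what makes indexing such as url[0] and url[1] possible when a pattern captures both a link and its title.
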
import re
import chardet
import urllib.request

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the web page
htmlCode = page.read()  # get the page source

# print(chardet.detect(htmlCode))  # check the encoding
data = htmlCode.decode('utf-8')
# print(data)  # print the page source

# pageFile = open('pageCode.txt', 'wb')  # open pageCode.txt for writing in binary mode
# pageFile.write(htmlCode)  # write
# pageFile.close()  # remember to close what you opened

reg = r'src="(.+?\.jpg)"'  # regular expression
reg_img = re.compile(reg)  # compile it so that it runs faster
imglist = reg_img.findall(data)  # find all matches
for img in imglist:
    print(img)
http://ppic.meituba.com:83/uploads3/181201/3-1Q20111553V11.jpg
http://ppic.meituba.com:83/uploads2/180622/3-1P62215532D61.jpg
http://ppic.meituba.com:83/uploads2/180605/3-1P6051000144I.jpg
http://ppic.meituba.com:83/uploads2/170511/8-1F5110URc35.jpg
http://ppic.meituba.com:83/uploads/160322/8-1603220U50O23.jpg
http://ppic.meituba.com:83/uploads2/180317/3-1P31F91U1X9.jpg
http://ppic.meituba.com:83/uploads/160718/7-160GQ51G0b4.jpg
http://ppic.meituba.com:83/uploads2/170517/8-1F51G50301Q3.jpg
http://ppic.meituba.com:83/uploads/161010/7-1610101A202B0.jpg
http://ppic.meituba.com:83/uploads2/171102/7-1G102093511F7.jpg
http://ppic.meituba.com:83/uploads2/170901/7-1FZ1100545438.jpg
http://ppic.meituba.com:83/uploads/160625/8-160625093044631.jpg
http://ppic.meituba.com:83/uploads/160419/7-160419161553153.jpg
http://ppic.meituba.com:83/uploads2/170323/7-1F323103404A2.jpg
http://ppic.meituba.com:83/uploads2/170322/7-1F322105R1255.jpg
http://ppic.meituba.com:83/uploads2/170211/7-1F21110040Y63.jpg
http://ppic.meituba.com:83/uploads2/170110/7-1F110102005930.jpg
http://ppic.meituba.com:83/uploads/160618/8-16061Q04450391.jpg
http://ppic.meituba.com:83/uploads2/170330/3-1F3301HI6138.jpg
http://ppic.meituba.com:83/uploads2/161230/4-161230100U5V8.jpg
Next, download the pictures to the local machine

The urllib library has a urllib.request.urlretrieve(link, name) method, which downloads the content at the link and saves it under the file name given as the second parameter. Let's try it:
x = 0
for img in imglist:
    print(img)
    urllib.request.urlretrieve(img, '%s.jpg' % x)  # save each picture as 0.jpg, 1.jpg, ...
    x += 1
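
The loop above names the files 0.jpg, 1.jpg, and so on, and stops at the first link that fails. A possible variation, sketched under the assumption that keeping the original file name and skipping broken links is acceptable, is shown below. The helpers local_name and download_all are hypothetical names introduced for this example.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlsplit


def local_name(url, index):
    """Build a name like '0_picture.jpg' from a URL (hypothetical naming scheme)."""
    base = posixpath.basename(urlsplit(url).path) or "image.jpg"
    return "%d_%s" % (index, base)


def download_all(urls, dest_dir="."):
    """Download each URL into dest_dir, skipping links that fail."""
    saved = []
    for index, url in enumerate(urls):
        target = os.path.join(dest_dir, local_name(url, index))
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            print("skipped %s: %s" % (url, exc))
            continue
        saved.append(target)
    return saved


print(local_name("http://ppic.meituba.com:83/uploads/160322/8-1603220U50O23.jpg", 0))
```

Calling download_all(imglist) would then replace the manual counter, and one dead link no longer aborts the whole run.
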
import re
import urllib.request

def getGtmlCode():
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/44/44683").read()  # get the page source
    html = html.decode("gbk")  # decode using the site's encoding
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'  # pattern based on the site's markup: .*? matches anything (non-greedy); wrap the parts we need in parentheses
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    for url in urls:
        # print(url)
        chapter_url = url[0]  # chapter URL
        chapter_title = url[1]  # chapter title
        chapter_html = urllib.request.urlopen(chapter_url).read()  # get the full source of the chapter page
        chapter_html = chapter_html.decode("gbk")
        chapter_reg = r'</script>&nbsp;&nbsp;&nbsp;&nbsp;.*?<br />(.*?)<script type="text/javascript">'  # match the chapter text
        chapter_reg = re.compile(chapter_reg, re.S)  # re.S lets . match newlines too
        chapter_content = re.findall(chapter_reg, chapter_html)
        for content in chapter_content:
            content = content.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "")  # strip the &nbsp; indentation
            content = content.replace("<br />", "")  # strip the <br /> tags
            print(content)
            f = open('E:\\aa\\{}.txt'.format(chapter_title), 'w')  # save locally
            f.write(content)
            f.close()  # close the file so the text is flushed to disk

getGtmlCode()
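
The function above removes the &nbsp; and <br /> markers outright and uses the chapter title directly as a file name. As a hedged alternative, the sketch below turns line breaks into real newlines and replaces characters that Windows does not allow in file names. The helpers clean_chapter and safe_filename, and the sample strings, are assumptions introduced for illustration; they are not the original author's code.

```python
import html
import re


def clean_chapter(fragment):
    """Turn a captured HTML fragment into plain text (hypothetical helper)."""
    text = re.sub(r"<br\s*/?>", "\n", fragment)  # line-break tags become newlines
    text = html.unescape(text)                   # &nbsp; becomes U+00A0, &amp; becomes &
    text = text.replace("\xa0", " ")             # normalise non-breaking spaces
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def safe_filename(title):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


sample = "&nbsp;&nbsp;&nbsp;&nbsp;First line.<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;Second line.<br />"
print(clean_chapter(sample))
print(safe_filename("Chapter 1: What?"))
```

Keeping the newlines preserves the paragraph structure of each chapter, and sanitising the title avoids an error from open() when a chapter name contains a character such as ? or :.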

 

Origin www.cnblogs.com/tszr/p/11954977.html