import chardet
import urllib.request

page = urllib.request.urlopen('http://photo.sina.com.cn/')  # open the web page
htmlCode = page.read()  # get the page source (as bytes)
print(chardet.detect(htmlCode))  # print the detected encoding of the page
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
data = htmlCode.decode('utf-8')
print(data)  # print the page source
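Rather than hard-coding 'utf-8', the encoding that chardet reports can be passed to decode. The following is a minimal sketch under stated assumptions: the `detected` dictionary is a stand-in copied from the output shown above, and `html_bytes` is made-up data in place of the real page source. chardet may report `None` as the encoding when it cannot decide, so the sketch falls back to UTF-8 in that case.

```python
# Stand-in for chardet.detect(htmlCode); values copied from the output above.
detected = {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Made-up bytes standing in for page.read(); the title is two Chinese characters.
html_bytes = '<title>\u56fe\u7247</title>'.encode('utf-8')

# Fall back to UTF-8 if no encoding was detected.
encoding = detected['encoding'] or 'utf-8'

# errors='replace' keeps going on undecodable bytes instead of raising.
data = html_bytes.decode(encoding, errors='replace')
print(encoding, ascii(data))
```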
pageFile = open('E:\\pageCode.txt', 'wb')  # open pageCode.txt for writing (binary mode)
pageFile.write(htmlCode)  # write the page source
pageFile.close()  # remember to close whatever you open
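Because it is easy to forget the close() call, the same write can be done with a `with` statement, which closes the file automatically. This is only a sketch: `html_bytes` is made-up data standing in for the page source, and `out_path` is a hypothetical location in the system temp directory rather than the E: drive used above.

```python
import os
import tempfile

# Made-up bytes standing in for the page source returned by page.read().
html_bytes = b'<html><body>hello</body></html>'

# Hypothetical output path, chosen so the sketch runs on any system.
out_path = os.path.join(tempfile.gettempdir(), 'pageCode.txt')

# The file is closed automatically when the with block ends.
with open(out_path, 'wb') as page_file:
    page_file.write(html_bytes)

# Read the file back to confirm that the bytes were written unchanged.
with open(out_path, 'rb') as page_file:
    round_trip = page_file.read()

print(round_trip == html_bytes)
```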
Extracting other information

Open the pageCode.txt file (or inspect the original page directly with the F12 developer tools) to find the tags that hold the data you need. For example, to pick out the pictures I write this regular expression: reg = r'src="(.+?\.jpg)"'

To explain: it matches a string that begins with src=", continues with one or more of any character (non-greedy), and ends with .jpg". In the figure, for example, the string inside the double quotes after src, marked with the red frame, is the link that gets matched.

What remains is to pull the strings that match the regular expression out of the long string that the get_html method returns. For this, Python's re library provides re.findall(pattern, string), which returns a list of the matching strings.
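To see how this pattern behaves, it can be tried on a small string first. The sketch below uses a made-up HTML fragment (the example.com URLs are not from the real page). It also shows one caveat: when a non-.jpg image sits between two .jpg images, the non-greedy `.+?` can still run across the closing quote, so a stricter variant that forbids quotes inside the link is shown for comparison. The stricter pattern is a suggestion, not part of the original script.

```python
import re

# Made-up HTML fragment for illustration only.
sample_html = (
    '<img src="http://example.com/a/1.jpg" alt="first">'
    '<img src="http://example.com/b/2.png" alt="second">'
    '<img src="http://example.com/c/3.jpg" alt="third">'
)

# Pattern from the article: any characters, non-greedy, up to .jpg"
loose = re.findall(r'src="(.+?\.jpg)"', sample_html)

# Stricter variant (an assumption): the link may not contain a double quote.
strict = re.findall(r'src="([^"]+?\.jpg)"', sample_html)

print(len(loose), loose[0])   # the second loose match spans two tags
print(strict)
```

On pages where every src attribute points at a .jpg file, both patterns return the same list.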
import re
import chardet
import urllib.request

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the web page
htmlCode = page.read()  # get the page source
# print(chardet.detect(htmlCode))  # check the encoding
data = htmlCode.decode('utf-8')
# print(data)  # print the page source

# pageFile = open('pageCode.txt', 'wb')  # open pageCode.txt for writing
# pageFile.write(htmlCode)  # write the page source
# pageFile.close()  # remember to close whatever you open

reg = r'src="(.+?\.jpg)"'  # regular expression
reg_img = re.compile(reg)  # compile it so that it runs faster
imglist = reg_img.findall(data)  # find all matches
for img in imglist:
    print(img)
http://ppic.meituba.com:83/uploads3/181201/3-1Q20111553V11.jpg
http://ppic.meituba.com:83/uploads2/180622/3-1P62215532D61.jpg
http://ppic.meituba.com:83/uploads2/180605/3-1P6051000144I.jpg
http://ppic.meituba.com:83/uploads2/170511/8-1F5110URc35.jpg
http://ppic.meituba.com:83/uploads/160322/8-1603220U50O23.jpg
http://ppic.meituba.com:83/uploads2/180317/3-1P31F91U1X9.jpg
http://ppic.meituba.com:83/uploads/160718/7-160GQ51G0b4.jpg
http://ppic.meituba.com:83/uploads2/170517/8-1F51G50301Q3.jpg
http://ppic.meituba.com:83/uploads/161010/7-1610101A202B0.jpg
http://ppic.meituba.com:83/uploads2/171102/7-1G102093511F7.jpg
http://ppic.meituba.com:83/uploads2/170901/7-1FZ1100545438.jpg
http://ppic.meituba.com:83/uploads/160625/8-160625093044631.jpg
http://ppic.meituba.com:83/uploads/160419/7-160419161553153.jpg
http://ppic.meituba.com:83/uploads2/170323/7-1F323103404A2.jpg
http://ppic.meituba.com:83/uploads2/170322/7-1F322105R1255.jpg
http://ppic.meituba.com:83/uploads2/170211/7-1F21110040Y63.jpg
http://ppic.meituba.com:83/uploads2/170110/7-1F110102005930.jpg
http://ppic.meituba.com:83/uploads/160618/8-16061Q04450391.jpg
http://ppic.meituba.com:83/uploads2/170330/3-1F3301HI6138.jpg
http://ppic.meituba.com:83/uploads2/161230/4-161230100U5V8.jpg
Next, download the pictures to the local disk.
The urllib library has a urllib.request.urlretrieve(link, name) method. It downloads the content found at the link and saves it under the file name given as the second argument. Let's try it:
x = 0
for img in imglist:
    print(img)
    urllib.request.urlretrieve(img, '%s.jpg' % x)  # download each matched link and save it as 0.jpg, 1.jpg, ...
    x += 1
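Numbered names such as 0.jpg lose the original file names. One alternative, sketched here, is to take the name from the last path segment of each link. `local_name` is a hypothetical helper, not part of urllib, and the two links are copied from the output printed earlier. The sketch only builds the names and does not download anything.

```python
import posixpath
from urllib.parse import urlsplit


def local_name(url, index):
    """Hypothetical helper: derive a local file name from an image link."""
    # Last segment of the URL path, e.g. '3-1Q20111553V11.jpg'.
    base = posixpath.basename(urlsplit(url).path)
    # Fall back to a numbered name when the path has no file part.
    return base or '%s.jpg' % index


imglist = [
    'http://ppic.meituba.com:83/uploads3/181201/3-1Q20111553V11.jpg',
    'http://ppic.meituba.com:83/uploads2/180622/3-1P62215532D61.jpg',
]

names = [local_name(url, i) for i, url in enumerate(imglist)]
print(names)
```

Each name could then be passed as the second argument of urllib.request.urlretrieve in place of '%s.jpg' % x.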
import re
import urllib.request

def getGtmlCode():
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/44/44683").read()  # get the page source
    html = html.decode("gbk")  # decode using the site's encoding
    # Regular expression written to fit the site's markup: (.*?) matches anything,
    # and the parts we need are wrapped in parentheses.
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    for url in urls:
        # print(url)
        chapter_url = url[0]  # chapter path
        chapter_title = url[1]  # chapter title
        chapter_html = urllib.request.urlopen(chapter_url).read()  # get the source of the chapter page
        chapter_html = chapter_html.decode("gbk")
        # Match the text of the chapter.
        chapter_reg = r'</script>&nbsp;&nbsp;&nbsp;&nbsp;.*?<br />(.*?)<script type="text/javascript">'
        chapter_reg = re.compile(chapter_reg, re.S)  # re.S lets . match newlines too
        chapter_content = re.findall(chapter_reg, chapter_html)
        for content in chapter_content:
            content = content.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "")  # remove the &nbsp; indentation
            content = content.replace("<br />", "")  # remove the <br /> tags
            print(content)
            f = open('E:\\aa\\{}.txt'.format(chapter_title), 'w')  # save locally
            f.write(content)
            f.close()  # remember to close whatever you open

getGtmlCode()
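The chapter pattern is easier to follow on a small input. The sketch below runs it on a made-up chapter fragment (not taken from the real site) and shows two things: the text before the first `<br />` is skipped by the leading `.*?<br />`, and without re.S nothing matches because the chapter text spans several lines.

```python
import re

# Made-up chapter page fragment for illustration only.
chapter_html = (
    '<script>x</script>&nbsp;&nbsp;&nbsp;&nbsp;Chapter 1<br />\n'
    '<br />&nbsp;&nbsp;&nbsp;&nbsp;Second line.<br />\n'
    '<script type="text/javascript">y</script>'
)

pattern = (r'</script>&nbsp;&nbsp;&nbsp;&nbsp;.*?<br />(.*?)'
           r'<script type="text/javascript">')

# Without re.S the dot cannot cross a newline, so there is no match.
no_flag = re.findall(pattern, chapter_html)

# With re.S the dot also matches newlines and the body is captured.
with_flag = re.findall(pattern, chapter_html, re.S)

# Same clean-up as in the script: drop the indentation and the <br /> tags.
cleaned = with_flag[0].replace('&nbsp;&nbsp;&nbsp;&nbsp;', '').replace('<br />', '')

print(no_flag)
print(repr(cleaned))
```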