First, a disclaimer: this script does not grab the VIP chapters, because that would be a violation (and, to be honest, I don't know how to anyway).
Today a friend told me that the novels he reads are so riddled with typos that he often can't stand to keep reading, and he asked me whether there was a fix.
I said: just read them on the legitimate website and you'll be fine...
He said: you know how to write crawlers, don't you? Crawl it down for me...
I said: at my rookie level, I can only get the ordinary (non-VIP) chapters...
Here is the code. If you spot any shortcomings, experts, please point them out!
Many thanks!
```python
import re
import time
import urllib2  # Python 2 only


def spiders_Qidian(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
    request = urllib2.Request(url, headers=header)
    page_code = urllib2.urlopen(request).read().decode('utf-8')
    # Each chapter entry on the contents page yields (link, chapter title)
    find = '<li data-rid=".*?"><a href="(.*?)" target=".*?" data-eid=".*?" data-cid=".*?" title=".*?">(.*?)</a>.*?</li>'
    chapters = re.findall(find, page_code)
    # Get the novel's title, used as the output file name
    find2 = '<h1><em>(.*?)</em><span><a class=".*?" href=".*?" target=".*?" data-eid=".*?">.*?</a>.*?</span></h1>'
    context2 = re.findall(find2, page_code)
    for x in chapters:
        # Skip all VIP chapters. Don't ask me how to get them, I don't know how... (sweat)
        if 'vipreader' not in x[0]:
            chapter_url = 'http:' + x[0]
            # Strip the occasional &nbsp from the title (honestly, it makes little difference)
            title = x[1].replace('&nbsp', '').encode('utf-8')
            try:
                chapter_code = urllib2.urlopen(chapter_url, timeout=5).read().decode('utf-8')
                find_code = '<div class="read-content j_readContent">(.*?)</div>'
                context = re.findall(find_code, chapter_code, re.S)
            except Exception:
                # Individual chapters occasionally come back blank or fail to load; skip them
                print title + ' [failed to fetch]'
                continue
            with open(context2[0] + '.txt', 'a+') as f:
                f.write('\n')
                f.write(title)
                f.write('\n')
                try:
                    # A blank chapter leaves the list empty, so context[0] raises; skip it
                    f.write(context[0].replace(' ', '').replace('<p>', '\n').encode('utf-8'))
                except Exception:
                    print title + ' [failed to write]'
                    f.write('\n')
                    continue
                print title
            # Let it take a break: 0.5 seconds
            time.sleep(0.5)


if __name__ == '__main__':
    # Pass in the contents page of any novel on Qidian. Yes, you read that right:
    # it must be the contents page. Paste its URL at the prompt.
    # If pressing Enter after pasting makes your terminal open the page instead,
    # add a space after the URL and then press Enter... (don't hit me)
    url = raw_input('Enter the URL of the novel\'s contents page (it must be the contents page): ')
    spiders_Qidian(url)
```
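The link-extraction and VIP-skipping steps of the script can be tried offline, without touching the site. The following is only a minimal sketch: it uses nothing but `re` (so it should run on Python 3 as well), and the sample markup, the `example.com` hostnames, and the helper names `free_chapters` and `clean_chapter` are all made up for illustration. The real page markup may differ or may have changed since the post was written.

```python
import re

# Hypothetical contents-page markup, shaped only to match the regex used in the script.
SAMPLE_TOC = (
    '<li data-rid="1"><a href="//read.example.com/chapter/1" target="_blank" '
    'data-eid="e1" data-cid="c1" title="t1">Chapter 1</a></li>'
    '<li data-rid="2"><a href="//vipreader.example.com/chapter/2" target="_blank" '
    'data-eid="e2" data-cid="c2" title="t2">Chapter 2</a></li>'
)

# Same pattern as the script: group 1 is the link, group 2 is the chapter title.
CHAPTER_LINK = re.compile(
    r'<li data-rid=".*?"><a href="(.*?)" target=".*?" data-eid=".*?" '
    r'data-cid=".*?" title=".*?">(.*?)</a>.*?</li>'
)


def free_chapters(toc_html):
    """Return (absolute_url, title) pairs, skipping links that contain 'vipreader'."""
    return [
        ('http:' + href, title.replace('&nbsp', ''))
        for href, title in CHAPTER_LINK.findall(toc_html)
        if 'vipreader' not in href
    ]


def clean_chapter(fragment):
    """Turn <p> markers into newlines (as the script does), then trim the ends."""
    return fragment.replace('<p>', '\n').strip()


if __name__ == '__main__':
    print(free_chapters(SAMPLE_TOC))
    print(clean_chapter('<p>Hello<p>World'))
```

Keeping the parsing in small pure functions like these makes it possible to check the regex against a saved copy of the page before running the crawler for real.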
What on earth is a contents page, you ask?
Open a novel's page and click on its table of contents... (my fault, I didn't explain that clearly)
like this:
Reproduced from: https://www.cnblogs.com/HapyyHao1314/p/7429012.html