Scraping a novel from Qidian Chinese Network

Original link: http://www.cnblogs.com/HapyyHao1314/p/7429012.html

A note up front: this definitely can't grab the VIP chapters, since that would be infringing (and honestly, I wouldn't know how anyway).

Today a friend told me that when he reads novels online, the pirated copies are so riddled with typos, page after page, that he often can't stand to keep reading, and he asked me if there was any solution.

I said, just go read it on a legitimate site and you won't have that problem......

He said, don't you do web crawlers? Scrape it down for me......

I said, with my rookie* level of skill, the regular chapters are the best I can do......

Here is the code. For any shortcomings, I ask the great gods to point them out and correct me!

Many thanks!

import re
import urllib2
import time

def spiders_Qidian(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
    request = urllib2.Request(url, headers=header)
    page_code = urllib2.urlopen(request).read().decode('utf-8')
    # chapter list: capture (href, chapter title) from each <li> entry
    find = '<li data-rid=".*?"><a href="(.*?)" target=".*?" data-eid=".*?" data-cid=".*?" title=".*?">(.*?)</a>.*?</li>'
    context = re.findall(find, page_code)
    # also grab the book title, used as the output file name
    find2 = '<h1><em>(.*?)</em><span><a class=".*?" href=".*?" target=".*?" data-eid=".*?">.*?</a></span></h1>'
    context2 = re.findall(find2, page_code)
    for x in context[0:]:
        # skip all the VIP chapters -- don't ask me how to get those, I can't..... *sweats*
        if 'vipreader' not in x[0]:
            url = 'http:' + x[0]
            try:
                page_code = urllib2.urlopen(url, timeout=5).read().decode('utf-8')
                find_code = '<div class="read-content j_readContent">(.*?)</div>'
                context = re.findall(find_code, page_code, re.S)
            except Exception, e:
                # the occasional chapter has problems, so just skip it
                print str(x[1].replace('&nbsp;', '').encode('utf-8')) + ' [fetch failed]'
                continue
            with open(context2[0] + '.txt', 'a+') as f:
                f.write('\n')
                # strip the occasional stray &nbsp; (does almost nothing, really)
                f.write(x[1].replace('&nbsp;', '').encode('utf-8'))
                f.write('\n')
                try:
                    # a blank chapter leaves the content list empty, so indexing
                    # fails -- ugh, skip it
                    f.write(context[0].replace('&nbsp;', '').replace('<p>', '\n').encode('utf-8'))
                except Exception, e:
                    print str(x[1].replace('&nbsp;', '').encode('utf-8')) + ' [write failed]'
                    f.write('\n')
                    continue
                print x[1].replace('&nbsp;', '').encode('utf-8')
            # give the server a break: pause 0.5 seconds between chapters, ugh
            time.sleep(0.5)


if __name__ == '__main__':
    # paste in the catalog page URL of any Qidian novel -- yes, it must be the
    # catalog page, don't get that wrong
    # if pasting the URL makes your terminal jump to the page, uh, just add a
    # space after pasting it..... (don't hit me)
    url = raw_input('Enter the catalog page URL of the novel you want (it must be the catalog page): ')
    spiders_Qidian(url)
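Regexes like the ones above are fragile against whitespace or attribute changes in the page markup. To show how the chapter-list pattern behaves, here is a minimal sketch (written in Python 3 for convenience; the sample HTML and URL are made up by me, simplified to the shape the scraper's first regex expects, not real Qidian markup):

```python
import re

# Hypothetical, simplified chapter-list markup in the same shape the
# scraper's first regex expects (the real page is more complex).
sample = ('<li data-rid="1"><a href="//read.example.com/chapter/1" '
          'target="_blank" data-eid="x" data-cid="y" title="Chapter One">'
          'Chapter One</a></li>')

# Same capture groups as the scraper: (chapter href, chapter title).
find = (r'<li data-rid=".*?"><a href="(.*?)" target=".*?" '
        r'data-eid=".*?" data-cid=".*?" title=".*?">(.*?)</a>.*?</li>')
chapters = re.findall(find, sample)
print(chapters)  # [('//read.example.com/chapter/1', 'Chapter One')]
```

Any change to the attribute order or spacing in the real page breaks the match silently (`findall` just returns an empty list), which is why the script treats an empty result as a chapter to skip.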

You ask me what on earth a catalog page is?

It's what you get when you open a novel's page and click its catalog tab...... (my fault, I didn't explain that clearly)

Like this:

http://book.qidian.com/info/1010136878#Catalog
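To catch a wrong paste early, a small sanity check could be added before calling spiders_Qidian. This is my own heuristic, not part of the original code (it only checks the URL shape shown above, a book page of the form /info/&lt;book id&gt; with an optional #Catalog fragment), written Python 3-compatible:

```python
import re

def looks_like_catalog_url(url):
    # Heuristic: accept http(s)://book.qidian.com/info/<digits>,
    # optionally followed by the #Catalog fragment.
    pattern = r'https?://book\.qidian\.com/info/\d+(#Catalog)?$'
    return re.match(pattern, url) is not None

print(looks_like_catalog_url('http://book.qidian.com/info/1010136878#Catalog'))  # True
print(looks_like_catalog_url('not a url'))  # False
```

If the check fails, the script could re-prompt instead of crashing later on an empty chapter list.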


