ComicReaper: a Python implementation of an automatic comic-crawling script

Honestly, reading comics on a phone always means stray taps that land on in-page ads...

Wouldn't it be great if I only had to supply a comic's home-page URL and get the whole comic back?

So let's implement one in Python and call it... ComicReaper (the comic reaper)!

 

First, let's analyze the workflow of what we are about to build.

We want to collect the title and URL of every chapter of the current comic (the title will later be used to name the folder a chapter is saved in; the URL is used to jump to the chapter's first page) and store them in a list.

We will use two Python libraries: re and urllib.

import re                # import the regular-expression module
import urllib.request    # import urllib's request module
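Before diving into the crawler itself, here is a minimal sketch of how these two pieces fit together to fetch a page with a custom header. The URL and the User-Agent value are placeholders for illustration, not from the article:

```python
import urllib.request

# Build a Request carrying a browser-like User-Agent header
# (many comic sites reject requests without one)
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url='https://example.com/comic', headers=header)

print(req.full_url)                   # the URL the request targets
print(req.get_header('User-agent'))   # the header we attached
```

Calling urllib.request.urlopen(req) would then perform the actual HTTP request, as the full function below does.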

 

Press [F12] in the browser to open the developer tools and analyze the key part of the comic's chapter page.

We can see that the page contains many chapter-jump links. Each of these <a> tags carries exactly what we need: its title attribute and its href attribute. We will store each pair in a dictionary.

But don't rush ahead. The full HTML contains a great many links, which means a great many <a> tags. If we simply filtered every <a> tag out of the HTML, we would capture lots of tags we don't need, which is not wise; we only want the chapter-jump <a> tags. On closer inspection, the chapter-jump <a> tags share a distinguishing feature: they all have a class attribute whose value is "fixed-a-es". This gives us a reliable basis for locating the chapter <a> tags, so we bake it into our regular-expression matching rule.

Now we can define the regular-expression string:

pat = r'<a class="fixed-a-es" href="(.*?)" title="(.*?)"'

Why it is written this way:

  1. In Python, prefixing a string literal with 'r' makes it a raw string: '\' is not treated as an escape character and keeps its original meaning as a literal backslash.
  2. In a regular expression, '.' matches any character except '\n' (when the 're.S' flag is passed to the match, it matches '\n' as well).
  3. In a regular expression, '*' describes how many times the character to its left may occur: zero or more times.
  4. In a regular expression, '(.*?)' is a non-greedy (lazy) capturing group (its match will be captured); as for what greedy matching is, see the blogger's separate article on the subject.
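The greedy/non-greedy distinction in point 4 is worth seeing in action. A quick sketch (the HTML snippet is made up for illustration):

```python
import re

# Two chapter links on one line, as they might appear in minified HTML
html = ('<a class="fixed-a-es" href="/ch1" title="Chapter 1">'
        '<a class="fixed-a-es" href="/ch2" title="Chapter 2">')

# Greedy: ".*" consumes as much as possible, so the match runs all the
# way to the LAST double quote and swallows both links into one result
greedy = re.findall(r'href="(.*)"', html)

# Non-greedy: ".*?" stops at the FIRST closing double quote,
# yielding one clean match per link
lazy = re.findall(r'href="(.*?)"', html)

print(greedy)  # one long, mangled capture
print(lazy)    # ['/ch1', '/ch2']
```

This is exactly why the pattern uses '(.*?)' rather than '(.*)' for the attribute values.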

Using this regular expression, we can match the href attribute value and the title attribute value inside their double quotes.
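To see the pattern at work, here is a small self-contained demonstration on a hand-written HTML snippet (the chapter names and paths are invented; only the class value "fixed-a-es" comes from the article):

```python
import re

# An imitation of the chapter list: two chapter links plus one
# unrelated link that should NOT be matched
html = '''
<a class="fixed-a-es" href="/comic/123/ch-1" title="Chapter 1">1</a>
<a class="fixed-a-es" href="/comic/123/ch-2" title="Chapter 2">2</a>
<a class="nav-link" href="/home" title="Home">home</a>
'''

pat = r'<a class="fixed-a-es" href="(.*?)" title="(.*?)"'
res = re.findall(pat, html, re.S)
print(res)  # each element is an (href, title) tuple
```

The class requirement in the pattern filters out the navigation link automatically, which is the whole point of anchoring on "fixed-a-es".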


The concrete implementation is the chapterIndexReaper function, which "harvests" all chapters of the current comic and stores them as a list of dictionaries.

The code is as follows:

# Harvest all chapters from the comic's table of contents
def chapterIndexReaper(url_host, header):
    # Temporary dictionary holding one chapter's title and URL
    dic_temp = {
        'Title': '',
        'Url': ''
        }
    # List of chapter dictionaries for the whole comic
    set_dic = []
    # Build the Request object
    req = urllib.request.Request(url=url_host, headers=header)
    # Read the response to req, decode it as UTF-8, and assign the
    # resulting string to html
    html = urllib.request.urlopen(req).read().decode('utf-8')
    # Regular expression for crawling the chapter titles and URLs
    pat = r'<a class="fixed-a-es" href="(.*?)" title="(.*?)"'
    # Match pat against html (the re.S flag makes '.' match '\n' in
    # addition to the characters it normally matches); findall returns
    # the list of results as res
    res = re.findall(pat, html, re.S)
    for i in res:
        dic_temp['Title'] = i[1]
        dic_temp['Url'] = url_host + i[0]
        # Append the new chapter dictionary to the end of the list; note
        # the shallow copy here (dic_temp is a reused temporary, so a copy
        # must be created and appended to set_dic, otherwise every element
        # of set_dic would change whenever dic_temp is updated)
        set_dic.append(dic_temp.copy())
    return set_dic
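The shallow-copy remark in the loop above is the one subtle point, and it is easy to verify in isolation. A minimal sketch of the pitfall (chapter names are placeholders):

```python
# dic_temp is reused on every iteration, so appending the dict itself
# would make every list element point at the same object
dic_temp = {'Title': '', 'Url': ''}
without_copy, with_copy = [], []

for title in ['Chapter 1', 'Chapter 2']:
    dic_temp['Title'] = title
    without_copy.append(dic_temp)         # same object appended twice
    with_copy.append(dic_temp.copy())     # independent snapshot each time

print([d['Title'] for d in without_copy])  # ['Chapter 2', 'Chapter 2']
print([d['Title'] for d in with_copy])     # ['Chapter 1', 'Chapter 2']
```

Without the .copy(), the final update to dic_temp retroactively "changes" every earlier entry, which is why chapterIndexReaper appends dic_temp.copy() rather than dic_temp.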

Origin www.cnblogs.com/Laplacedoge/p/11828622.html