Python Crawlers (9): A Case Study Using Regular Expressions

Now that we have the magic weapon of regular expressions, we can filter the full page source code that we crawl.

Let's try crawling the jokes (duanzi) from the Neihan8 website:
http://www.neihan8.com/article/list_5_1.html

After opening it, the jokes are easy to spot. As you turn the pages, pay attention to how the URL changes:

    • Page 1 url: http://www.neihan8.com/article/list_5_1.html
    • Page 2 url: http://www.neihan8.com/article/list_5_2.html
    • Page 3 url: http://www.neihan8.com/article/list_5_3.html
    • Page 4 url: http://www.neihan8.com/article/list_5_4.html

So we've found the URL pattern: to crawl every page of jokes, we only need to change one number in the URL, and then we can pull out all the jokes step by step. The quick sketch below makes the pattern concrete.
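For instance, this tiny loop builds the first four list-page URLs from a single page number (a quick sketch, nothing site-specific beyond the URL itself):

# -*- coding: utf-8 -*-

# only the trailing page number changes between list pages
for page in range(1, 5):
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    print url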

Step 1: Get the data

1. As in previous lessons, we first need a way to load the page.

Here we define a class, with the URL request handled by a member method.
Create a file called duanzi_spider.py,
then define a Spider class and add a member method that loads a page.

import urllib2

class Spider:
    """
        Neihan duanzi (joke) spider
    """
    def loadPage(self, page):
        """
            @brief define a method that requests a page's url
            @param page the page number to request
            @returns the html of the requested page
        """
        url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
        # User-Agent header
        user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
        headers = {"User-Agent": user_agent}
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        html = response.read()
        print html

The idea behind loadPage should already be familiar. Note that defining a member method of a Python class requires the extra parameter self, as the small illustration below shows.
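As a quick reminder of what self does, Python passes the instance into the method automatically; a minimal illustration (the Greeter class is a made-up example, not part of our spider):

class Greeter:
    def greet(self, name):    # self is filled in automatically on the call
        print "Hello, " + name

g = Greeter()
g.greet("world")              # equivalent to Greeter.greet(g, "world")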

2. Write a main function to test loadPage

if __name__ == "__main__":
    """
        =====================
            Neihan duanzi spider
        =====================
    """
    print("Press Enter to start")
    raw_input()

    # create a Spider object
    mySpider = Spider()
    mySpider.loadPage(1)
  • If it runs normally, we will print all the HTML of the first page of jokes. However, we find that the Chinese text in the HTML may be garbled.
    So we need to do some simple processing of the page source we get:
def loadPage(self, page):
    """
        @brief define a method that requests a page's url
        @param page the page number to request
        @returns the html of the requested page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
    headers = {"User-Agent": user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    gbk_html = html.decode("gbk").encode("utf-8")

    return gbk_html

Note: each site uses its own encoding for Chinese text, so the html.decode("gbk") line is not universal; adjust it to the encoding the specific site actually uses. A sketch of one way to detect the encoding follows.
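If you are unsure what encoding a site uses, one option is to read the charset from the HTTP response headers before decoding. A minimal sketch (Python 2; falling back to gbk when no charset is advertised is our own assumption, not something the server guarantees):

import urllib2

response = urllib2.urlopen("http://www.neihan8.com/article/list_5_1.html")
html = response.read()
# response.info() is a mimetools.Message in Python 2; getparam() reads the
# charset parameter of the Content-Type header, if the server sent one
charset = response.info().getparam("charset") or "gbk"  # assumed fallback
utf8_html = html.decode(charset).encode("utf-8")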

Step 2: Filter the data

Now we have the data for the entire page. However, much of it is content we don't care about, so the next step is to filter it, using the regular expressions we covered earlier.

  • First of all, import the re module:
import re
  • Then, filter the gbk_html we obtained by matching.

We need a matching rule.

Open the jokes page, right-click and choose "View Source". You will find that every piece of content we need sits inside a <div> tag, and that each of these div tags has the attribute class="f18 mb20".

From our regular expression knowledge, we can work out the pattern:

<div.*?class="f18 mb20">(.*?)</div>
  • This expression matches the contents of every div whose class is "f18 mb20" (see the detailed description in the earlier section).
  • Then, applying this regex to our code, we get the following:
def loadPage(self, page):
    """
        @brief define a method that requests a page's url
        @param page the page number to request
        @returns a list of the jokes on the page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
    headers = {"User-Agent": user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)

    html = response.read()

    gbk_html = html.decode("gbk").encode("utf-8")

    # find all joke content: <div class="f18 mb20"> ... </div>
    # re.S: without re.S, matching is attempted one line at a time, and a
    # line that does not fit the rule is skipped before matching restarts
    # on the next line; with re.S, the whole string is matched as one block
    pattern = re.compile(r'<div.*?class="f18 mb20">(.*?)</div>', re.S)
    item_list = pattern.findall(gbk_html)

    return item_list

def printOnePage(self, item_list, page):
    """
        @brief process the list of jokes we obtained
        @param item_list the list of jokes
        @param page which page is being processed
    """
    print("********* Page %d, crawl complete ****** ..." % page)

    for item in item_list:
        print("===============")
        print item
  • Note the re.S argument used in the regular expression match; a small demo of its effect follows this list.
  • Without re.S, matching is attempted one line at a time: a line that does not fit the rule is skipped, and matching starts over on the next line.
  • With re.S, the whole string is matched as a single block, and findall wraps every matched result into a list.
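To make the effect of re.S concrete, here is a small standalone demo that runs the same pattern against a made-up HTML fragment (the joke text is invented for illustration):

# -*- coding: utf-8 -*-
import re

sample_html = '''
<div class="f18 mb20">
    <p>first joke...</p>
</div>
<div class="f18 mb20">
    <p>second joke...</p>
</div>
'''

# with re.S, "." also matches newlines, so each whole div body is captured;
# without it, the pattern cannot span the lines inside a div and finds nothing
pattern = re.compile(r'<div.*?class="f18 mb20">(.*?)</div>', re.S)
print pattern.findall(sample_html)   # a list with two captured joke bodies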
  • We then write printOnePage(), a method that traverses item_list. With the program written, let's run it:
python duanzi_spider.py

All the jokes on the first page, and nothing else, are printed out.

  • You will notice many unpleasant <p> and </p> strings inside the jokes; these are in fact HTML paragraph tags.
  • They don't show up in a browser, but they do appear when the text is printed, so we just need to strip them from our content.
  • We can make a simple change to printOnePage():
def printOnePage(self, item_list, page):
    """
        @brief process the list of jokes we obtained
        @param item_list the list of jokes
        @param page which page is being processed
    """
    print("****** Page %d, crawl complete *****" % page)
    for item in item_list:
        print("============")
        item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
        print item

 

Step 3: Save the data

  • We can store all the jokes in a file. For example, instead of printing each item we get, we can write it to a file called duanzi.txt.
def writeToFile(self, text):
    """
        @brief append the data to a file
        @param text the content to write to the file
    """
    myFile = open("./duanzi.txt", "a")  # "a" opens the file in append mode
    myFile.write(text)
    myFile.write("-------------------------")
    myFile.close()
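As a side note, a with block is a slightly safer way to write the same method, since the file is closed even if write() raises an exception. This is an equivalent sketch, not a change the original code makes:

def writeToFile(self, text):
    """
        @brief append the data to a file
        @param text the content to write to the file
    """
    with open("./duanzi.txt", "a") as myFile:  # closed automatically on exit
        myFile.write(text)
        myFile.write("-------------------------")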
  • Then we rewrite all the print statements as calls to writeToFile(), and every joke on the current page is saved to the local duanzi.txt file.
def printOnePage(self, item_list, page):
    """
        @brief process the list of jokes we obtained
        @param item_list the list of jokes
        @param page which page is being processed
    """
    print("*** Page %d, crawl complete ****" % page)
    for item in item_list:
        item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
        self.writeToFile(item)

Step 4: Display the data

  • Next, by incrementing the page parameter, we traverse all the joke content on the site.

  • We only need to add some control logic around the outside.

def doWork(self):
    """
        let the spider do its work
    """
    while self.enable:
        try:
            item_list = self.loadPage(self.page)
        except urllib2.URLError, e:
            print e.reason
            continue

        # process the list of jokes we obtained
        self.printOnePage(item_list, self.page)
        self.page += 1
        print "Press Enter to continue..."
        print "Type quit to exit"

        command = raw_input()
        if command == "quit":
            self.enable = False
            break
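Note that doWork assumes the Spider object already has self.page and self.enable attributes. This section does not show where they are initialized, so here is a minimal sketch of an __init__ plus an updated main that would wire everything together (the initial values are assumptions):

class Spider:
    def __init__(self):
        self.page = 1        # assumed: start from the first page
        self.enable = True   # assumed: controls the doWork loop

if __name__ == "__main__":
    print("Press Enter to start")
    raw_input()

    mySpider = Spider()
    mySpider.doWork()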

 
