Crawler data parsing: objects and parsing methods

# 1. etree 
# install: pip install lxml 
# import: from lxml import etree

# Convert an HTML or XML document into an etree object, then call its methods to find the specified nodes 
  # 1.1 local file: tree = etree.parse(filename) 
tree.xpath("xpath expression")
  # 1.2 network data: tree = etree.HTML(page-source string) 
tree.xpath("xpath expression")
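A minimal sketch of the etree workflow above, run against a small hypothetical HTML string in place of real network data (the tag and class names here are invented for illustration):

```python
from lxml import etree

html = '<html><body><div class="song"><p>line one</p></div></body></html>'

# etree.HTML parses a page-source string (the "network data" case);
# etree.parse(filename) would be used for a local file instead.
tree = etree.HTML(html)

# xpath() returns a list of matching results
texts = tree.xpath('//div[@class="song"]/p/text()')
print(texts)  # ['line one']
```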

# 2. Using Selector
# from scrapy import Selector
html_selector = Selector(text=response)
text = html_selector.xpath("/html/body/div[2]/div[1]/div[2]/div[1]/p[1]/a/b/text()").extract_first()
print(text)   # prints the extracted text (a line of verse from the scraped page)

# 3. xpath 
1. Attribute positioning:
     # find the div tag whose class attribute value is "song" 
    //div[@class="song"]
    
2. Hierarchy & index positioning:
     # find the a tag in the second li, an immediate child of the ul that is an immediate child of the div whose class attribute value is "tang" 
    //div[@class="tang"]/ul/li[2]/a
    
3. Logical operators:
     # find the a tag whose href attribute value is empty and whose class attribute value is "du" 
    //a[@href="" and @class="du"]
    
4. Fuzzy matching:
     # class contains "ng": matches e.g. class="song" and class="tang" 
    //div[contains(@class, "ng")]    # contains: attribute value contains the substring 
    //div[starts-with(@class, "ta")] # starts-with: attribute value begins with the substring
    
5. Getting text:
     # / gets the immediate text content of a tag 
    # // gets the text content of a tag and of all its descendant tags 
    //div[@class="song"]/p[1]/text()   # immediate text only 
    //div[@class="tang"]//text()       # all descendant text 
    # /text() returns a list with a single element 
    # //text() returns a list with multiple elements
    
6. Getting attributes:
     //div[@class="tang"]//li[2]/a/@href
     //a[text()='Next']/@href
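The XPath rules above can be exercised with lxml on a small snippet; the tag and class names below follow the examples in these notes but the snippet itself is a hypothetical stand-in for a real page:

```python
from lxml import etree

html = '''
<div class="tang">
  <ul>
    <li><a href="/p1">first</a></li>
    <li><a href="/p2">second</a></li>
  </ul>
</div>
<div class="song"><p>immediate</p><span>nested <b>deep</b></span></div>
'''
tree = etree.HTML(html)

# hierarchy + index: the a tag in the second li
print(tree.xpath('//div[@class="tang"]/ul/li[2]/a/text()'))  # ['second']

# fuzzy match: class contains "ng" matches both "tang" and "song"
print(len(tree.xpath('//div[contains(@class, "ng")]')))      # 2

# fuzzy match: class starts with "ta" matches only "tang"
print(len(tree.xpath('//div[starts-with(@class, "ta")]')))   # 1

# attribute extraction: href of the a tag in the second li
print(tree.xpath('//div[@class="tang"]//li[2]/a/@href'))     # ['/p2']

# // text(): all descendant text of the "song" div
print(tree.xpath('//div[@class="song"]//text()'))  # ['immediate', 'nested ', 'deep']
```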

# 4. BeautifulSoup 
# Workflow:        
    - import: from bs4 import BeautifulSoup
     - usage: convert an HTML document into a BeautifulSoup object, then find the specified nodes through the object's methods or properties
        ( 1 ) converting a local file:
              - soup = BeautifulSoup(open('local file'), 'lxml')
        ( 2 ) converting network data:
              - soup = BeautifulSoup('string or bytes', 'lxml')
        ( 3 ) printing the soup object displays the content of the HTML file
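The conversion step can be sketched as follows, with a hypothetical HTML string standing in for a downloaded page or local file:

```python
from bs4 import BeautifulSoup

html = '<html><body><a href="/home" id="feng">home</a></body></html>'

# network-data case: pass a string (or bytes) plus a parser name
soup = BeautifulSoup(html, 'lxml')

# printing a node of the soup object shows its parsed HTML
print(soup.a)  # <a href="/home" id="feng">home</a>
```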

# Common usage: 
    (1) find by tag name
         - soup.a    # finds only the first matching tag 
    (2) get attributes
         - soup.a.attrs           # gets all attributes and values of the a tag, returns a dict 
        - soup.a.attrs['href']    # gets the href attribute 
        - soup.a['href']          # shorthand for the same thing 
    (3) get content
         - soup.a.string      # returns only the tag's immediate text; e.g. for a ul, 
                              # the text inside its li children is not immediate 
        - soup.a.text         # returns all text, including non-immediate descendant text 
        - soup.a.get_text()   # same as .text: returns all descendant text 
       [Note] if the tag contains child tags, .string returns None, while the other two still return the text content
    (4) find: finds the first tag that matches
         - soup.find('a')                # finds the first match; same effect as soup.a 
        - soup.find('a', title="xxx")    # by attribute value title="xxx" 
        - soup.find('a', alt="xxx")
         - soup.find('a', class_="xxx")  # note the underscore 
        - soup.find('a', id="xxx")
    (5) find_all: # finds all matching tags and returns them as a list 
        - soup.find_all('a')
         - soup.find_all(['a', 'b'])    # find all a and b tags 
        - soup.find_all('a', limit=2)   # only the first two matches 
    (6) select by CSS selector
               select: soup.select('#feng')
         - common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors 
            - hierarchy selectors:
                div .dudu #lala .meme .xixi   # a space selects descendants at any depth 
                div > p > a > .lala           # > selects immediate children only 
        [Note] select always returns a list; index into it to get the specific object
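The lookup methods above can be exercised on a small document; the ids and class names below echo the examples in these notes but the markup is a hypothetical stand-in:

```python
from bs4 import BeautifulSoup

html = '''
<div>
  <a href="/a1" class="du" title="xxx">one</a>
  <a href="/a2" id="feng">two</a>
  <ul><li>item <a href="/a3">three</a></li></ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.a['href'])                          # /a1  (first matching tag only)
print(soup.find('a', class_="du").text)        # one  (note the underscore)
print(len(soup.find_all('a')))                 # 3
print(soup.find_all('a', limit=2)[1]['href'])  # /a2
print(soup.select('#feng')[0].text)            # two  (select returns a list)

# .string vs .text on a tag that contains a child tag
li = soup.find('li')
print(li.string)   # None  (li has a child a tag, so .string is None)
print(li.text)     # item three
```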


Source: www.cnblogs.com/yzg-14/p/12122279.html