# 1. etree
# Install: pip install lxml
# Import: from lxml import etree
# Convert an HTML or XML document into an etree object, then call its methods to find the specified nodes.
# 2.1 Local file:
tree = etree.parse(filename)
tree.xpath("xpath expression")
# 2.2 Network data:
tree = etree.HTML(page_source)  # page_source: the web page content as a string
tree.xpath("xpath expression")

# 2. Using Selector
from scrapy import Selector
html_selector = Selector(text=response)  # response must be the HTML as a string
text = html_selector.xpath("/html/body/div[2]/div[1]/div[2]/div[1]/p[1]/a/b/text()").extract_first()
print(text)  # outside plum blossoms fall wind-sun

# 3. xpath
1. Attribute positioning:
    # find the div tag whose class attribute value is "song"
    //div[@class="song"]
2. Hierarchy & index positioning:
    # find the div whose class is "tang", then its direct child ul, then the second li under it, then that li's direct child a
    //div[@class="tang"]/ul/li[2]/a
3. Logic operations:
    # find the a tag whose href attribute is empty and whose class attribute is "du"
    //a[@href="" and @class="du"]
4. Fuzzy matching:
    //div[contains(@class, "ng")]      # contains: the class value contains "ng"
    //div[starts-with(@class, "ta")]   # starts-with: the class value starts with "ta"
5. Getting text:
    # /text() gets the direct text content of a tag
    # //text() gets the text content of a tag and of all its descendant tags
    //div[@class="song"]/p[1]/text()   # direct text only
    //div[@class="tang"]//text()       # all text
    # both return a list; //text() usually returns a list with several items
6. Getting attributes:
    //div[@class="tang"]//li[2]/a/@href
    //a[text()='Next']/@href
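The XPath patterns from section 3 can be tried out with lxml roughly as follows. This is a minimal sketch: the sample page in `HTML` is hypothetical and only reuses the class names from the notes ("song", "tang", "du"), and the variable names are chosen for illustration.

```python
from lxml import etree

# Hypothetical sample page; only the class names come from the notes.
HTML = """
<html><body>
  <div class="song">
    <p>first <b>bold</b> paragraph</p>
    <p>second paragraph</p>
  </div>
  <div class="tang">
    <ul>
      <li><a href="/one">one</a></li>
      <li><a href="/two">two</a></li>
    </ul>
  </div>
  <a href="" class="du">empty link</a>
  <a href="/page/2">Next</a>
</body></html>
"""

# Network-style input: parse the page content from a string.
tree = etree.HTML(HTML)

# 1. Attribute positioning
song_divs = tree.xpath('//div[@class="song"]')

# 2. Hierarchy & index positioning (XPath indexes start at 1)
second_link = tree.xpath('//div[@class="tang"]/ul/li[2]/a')

# 3. Logic operations
empty_du = tree.xpath('//a[@href="" and @class="du"]')

# 4. Fuzzy matching
contains_ng = tree.xpath('//div[contains(@class, "ng")]')
starts_ta = tree.xpath('//div[starts-with(@class, "ta")]')

# 5. Getting text: /text() is direct text only, //text() includes descendants
direct_text = tree.xpath('//div[@class="song"]/p[1]/text()')
all_text = [t.strip() for t in tree.xpath('//div[@class="tang"]//text()') if t.strip()]

# 6. Getting attributes
second_href = tree.xpath('//div[@class="tang"]//li[2]/a/@href')
next_href = tree.xpath('//a[text()="Next"]/@href')

print(direct_text, all_text, second_href, next_href)
```

Every `xpath()` call returns a list, even when only one node matches, so take an element by index (or check for an empty list) before using it.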
# 4. BeautifulSoup
# Workflow:
- Import: from bs4 import BeautifulSoup
- Usage: convert an HTML document into a BeautifulSoup object, then use the object's methods or attributes to find the content of the specified node
    (1) Converting a local file:
        - soup = BeautifulSoup(open('local file'), 'lxml')
    (2) Converting network data:
        - soup = BeautifulSoup('string or bytes', 'lxml')
    (3) Printing the soup object displays the content of the HTML file
# Basics:
(1) Find by tag name
    - soup.a   # finds only the first tag that meets the requirement
(2) Get attributes
    - soup.a.attrs            # all attributes and their values, returned as a dictionary
    - soup.a.attrs['href']    # get the href attribute
    - soup.a['href']          # shorthand for the line above
(3) Get content
    - soup.a.string       # returns the tag's direct text only
                          # "direct" means not crossing levels: an li inside a ul is direct, but an a inside that li is not direct to the ul
    - soup.a.text         # returns all text as one string, including non-direct (descendant) text
    - soup.a.get_text()   # returns all text as one string, including non-direct (descendant) text
    [Note] If the tag has more than one child node (for example, text plus a nested tag), .string returns None, while the other two still return the text content
(4) find: returns the first tag that meets the requirement
    - soup.find('a')                 # same effect as soup.a; returns a single tag
    - soup.find('a', title="xxx")    # the tag whose title attribute is "xxx"
    - soup.find('a', alt="xxx")
    - soup.find('a', class_="xxx")   # note the trailing underscore
    - soup.find('a', id="xxx")
(5) find_all: returns a list of all tags that meet the requirement
    - soup.find_all('a')
    - soup.find_all(['a', 'b'])      # find all a tags and all b tags
    - soup.find_all('a', limit=2)    # limit to the first two
(6) select: choose content with a CSS selector
    - soup.select('#feng')
    - Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
    - Hierarchy selectors:
        .dudu div #lala .meme .xixi   # space: descendants, possibly many levels down
        .dudu div > p > a > .lala     # >: direct children only, one level down
    [Note] select always returns a list; extract the specific object by index
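The BeautifulSoup calls above can be sketched on a small page as follows. The sample markup in `HTML` is hypothetical and only borrows the names "dudu", "lala", and "feng" from the notes. The notes use the 'lxml' parser; the built-in 'html.parser' is swapped in here as an assumption so the sketch runs without lxml installed, and for markup this simple the results should be the same.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; only the names dudu, lala, and feng come from the notes.
HTML = """
<div class="dudu">
  <ul>
    <li><a href="/one" title="first" id="feng">one</a></li>
    <li><a href="/two" class="lala">two <b>bold</b></a></li>
  </ul>
</div>
"""

# 'html.parser' is a stand-in for the 'lxml' parser used in the notes.
soup = BeautifulSoup(HTML, "html.parser")

# (1) and (2): first matching tag and its attributes
first_a = soup.a
first_attrs = first_a.attrs
first_href = first_a["href"]

# (3): .string versus .text / get_text()
first_string = first_a.string          # the tag holds a single text node
second_a = soup.find("a", class_="lala")
second_string = second_a.string        # None: text plus a nested <b> tag
second_text = second_a.get_text()      # one string covering all descendant text

# (4) and (5): find returns one tag, find_all returns a list
by_title = soup.find("a", title="first")
a_and_b = soup.find_all(["a", "b"])
limited = soup.find_all("a", limit=1)

# (6): select always returns a list, so index into it
by_id = soup.select("#feng")[0]
descendants = soup.select(".dudu a")                 # space: any depth
direct_chain = soup.select(".dudu > ul > li > .lala")  # >: one level at a time

print(first_href, second_text, len(a_and_b), len(descendants))
```

When a lookup may not match anything, `find` returns None and `select` returns an empty list, so it is worth checking the result before indexing or reading attributes.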
Parsing objects and parsing methods for web crawlers
Origin www.cnblogs.com/yzg-14/p/12122279.html