Web Crawler Basics: The BeautifulSoup Library (Part 1)

First, here is the official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id12

The requests library fetches the response for a URL, which gives us the page source. To actually extract what we want from that source, we need the BeautifulSoup library.

Download and installation are covered in the official documentation, so here I will only say a few words about parsers. Besides the HTML parser in Python's standard library, BeautifulSoup also supports third-party parsers such as lxml and html5lib.

The official documentation has a table comparing these parsers; which one to use is a matter of personal choice.
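As a rough sketch of how the parser choice shows up in code: the parser is just the second argument to BeautifulSoup. The helper name make_soup, the preference order, and the sample HTML below are my own for illustration; the sketch falls back to the standard-library parser when lxml or html5lib is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Sample markup made up for this example
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup(html)
print(parser_used, soup.title.string)
```

'html.parser' ships with Python, so the last fallback should always be available.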

 

Now to the main topic. First we construct an object: soup = BeautifulSoup(html, 'lxml'). The html here can be fetched in advance with the requests library or written by hand. You can also open a local HTML file with soup = BeautifulSoup(open("index.html")).
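Both ways of building the object can be sketched like this. The HTML content and the temporary index.html file are made up for the example, and 'html.parser' is used only so the sketch runs without extra installs; 'lxml' works the same way if it is installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hand-written HTML standing in for a page fetched with requests
html = "<html><head><title>Index</title></head><body><a href='/x'>link</a></body></html>"

# 1) Build the object from a string
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Build the object from a local file (written to a temp directory here)
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)
with open(path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.a["href"])
```

Opening the file in a with block closes it once parsing is done, which the one-line open("index.html") form does not do.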

Next, look at the HTML itself. When it contains a tags, soup.a outputs the first a tag it encounters; likewise, soup.title outputs the HTML title tag.
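A small example of accessing tags by name, using a made-up page with two links:

```python
from bs4 import BeautifulSoup

# Made-up page with two a tags
html = """<html><head><title>Demo page</title></head>
<body><a href="/first">First</a><a href="/second">Second</a></body></html>"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.a      # only the first a tag in the document
title_tag = soup.title   # the title tag

print(first_link)
print(title_tag)
```

Note that soup.a never reaches the second link; it always stops at the first match.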

The first tag alone is not enough. When we need the data from every tag, we use the findAll method (named find_all in current versions of the library; findAll is the older spelling). all_a = soup.findAll('a') gets all the a tags. At this point the output still includes the tags themselves. To get only the content, use the .string attribute on each tag in the result, for example tag.string for each tag in all_a; the result list itself has no .string.
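Here is a minimal sketch of collecting every a tag and then reading each one's content, again on made-up HTML. It uses the find_all spelling.

```python
from bs4 import BeautifulSoup

html = '<body><a href="/first">First</a><a href="/second">Second</a></body>'
soup = BeautifulSoup(html, "html.parser")

# All a tags, returned as a list-like result
all_a = soup.find_all("a")

# .string belongs to a single tag, so read it tag by tag
texts = [a.string for a in all_a]
print(texts)

# Calling .string on the whole result is an error
try:
    all_a.string
    string_on_result_failed = False
except AttributeError:
    string_on_result_failed = True
```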

Enough talk. Let's first try the h2 tags on the Xiaomi official website, that is, crawl its subheadings:

from bs4 import BeautifulSoup
import lxml  # not used directly; the 'lxml' parser below needs it installed
import requests

url = 'https://www.mi.com/'
try:
    # Imitate a browser
    kv = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    # Check the status code; raises an exception on an error status
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    soup = BeautifulSoup(r.text, 'lxml')
    for tag in soup.findAll('h2'):
        print(tag.string)
except Exception:
    print("crawling failed")

Next, let's talk about the .string attribute, which is explained in the official documentation.

Simply put, when the tag you get contains no other tags, .string returns the tag's text. If the tag contains child tags as well as text, .string returns None. For example, crawling the a tags on the Xiaomi site again, the value returned for such a tag is None.
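The behaviour can be reproduced offline. The HTML below is made up to resemble a link with an icon inside it; it is not the actual Xiaomi markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment: an a tag holding an i tag plus text, and a plain h2
html = '<a href="/phone"><i class="icon"></i>Phones</a><h2>Title</h2>'
soup = BeautifulSoup(html, "html.parser")

plain = soup.h2.string        # h2 holds only text
mixed = soup.a.string         # a holds a child tag and text
all_text = soup.a.get_text()  # get_text() still collects the text

print(plain, mixed, all_text)
```

When .string comes back as None, get_text() is one way to still get the text out of the tag.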

Crawled data sometimes includes blank strings. When you do not want them, use .stripped_strings to strip the whitespace.
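A quick comparison of .strings and .stripped_strings on a made-up fragment with extra line breaks and spaces:

```python
from bs4 import BeautifulSoup

# Made-up fragment with plenty of whitespace between and inside tags
html = """<div>
  <p>  Phones  </p>

  <p>
    TVs
  </p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.div.strings)             # includes whitespace-only strings
clean = list(soup.div.stripped_strings)  # trimmed, blanks skipped

print(raw)
print(clean)
```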

Finally, navigating from one tag to another. Take an a tag that contains an i tag: we can first find the i tag and then use .parent to output its parent, the a tag. Likewise, .next_siblings and .previous_siblings find the sibling nodes of the current node.
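These navigation attributes can be tried on a small made-up list. The markup is written without whitespace between tags so that every sibling is a tag.

```python
from bs4 import BeautifulSoup

html = ('<ul><li><a href="/a"><i class="icon"></i>A</a></li>'
        '<li>B</li><li>C</li></ul>')
soup = BeautifulSoup(html, "html.parser")

# Find the i tag first, then step up to its parent a tag
i_tag = soup.find("i")
parent_a = i_tag.parent

# Siblings of the middle li
middle = soup.find_all("li")[1]
after = [s.get_text() for s in middle.next_siblings]
before = [s.get_text() for s in middle.previous_siblings]

print(parent_a["href"], before, after)
```

On real pages the siblings usually include whitespace strings between the tags, so expect to filter those out.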

 


Origin www.cnblogs.com/afei123/p/11223215.html