1. Basic Usage
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>Hello</p>', 'lxml')
>>> soup.p.string
'Hello'
2. Selecting nodes
Selecting elements
>>> html = """
... <ul class="topnav-noauth clearfix">
... <li>
... <a href="javascript:;" class="js-signup-noauth"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
... </li>
... <li>
... <a href="javascript:;" class="js-signin-noauth">登录</a>
... </li>
... </ul>
... """
>>> h = BeautifulSoup(html, 'lxml')
>>> h
<html><body><ul class="topnav-noauth clearfix">
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
<li>
<a class="js-signin-noauth" href="javascript:;">登录</a>
</li>
</ul>
</body></html>
>>> h.ul
<ul class="topnav-noauth clearfix">
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
<li>
<a class="js-signin-noauth" href="javascript:;">登录</a>
</li>
</ul>
>>> h.a.string
// When assigning html, wrap the pasted content in triple quotes (""") so that it can span multiple lines.
Getting attributes
>>> h.a['href']
'javascript:;'
>>> h.i['class']
['zg-icon', 'zg-icon-dd-home']
>>> h.a['class']
['js-signup-noauth']
Use [] directly on a tag to get an attribute. Note that h.a only refers to the first matching tag in the document, and that a multi-valued attribute such as class is returned as a list.
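As a small illustration of the attribute access described above, here is a sketch on a made-up one-line snippet (not the Zhihu markup). It assumes the built-in html.parser so that it runs without lxml, and it also shows .get() and .attrs, which the session above does not use.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = '<a href="javascript:;" class="btn js-signin-noauth" id="login">登录</a>'
soup = BeautifulSoup(html, "html.parser")

a = soup.a                    # first <a> tag in the document
href = a["href"]              # single-valued attribute: a plain string
classes = a["class"]          # multi-valued attribute: a list of strings
missing = a.get("title")      # .get() returns None instead of raising KeyError
attrs = a.attrs               # all attributes as a dict

print(href, classes, missing)
```

Indexing with [] raises KeyError for an attribute that is absent, so .get() is the safer choice when an attribute may be missing.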
Getting content
>>> h.li.string
>>> h.i.string
>>> h.a.string
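// Oddly, none of these return the text 注册知乎. The reason is that .string returns None whenever a tag has no children or more than one child: h.li contains newline strings around the <a> tag, h.i is empty, and h.a contains both an <i> tag and a text node.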
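The difference can be seen side by side in a short sketch. The markup is a simplified, hypothetical version of the list above, parsed with the built-in html.parser, and get_text() is shown as one way to collect the text when .string gives None.

```python
from bs4 import BeautifulSoup

# Simplified markup, for illustration only.
html = (
    '<li><a href="javascript:;"><i class="zg-icon"></i>注册知乎</a></li>'
    '<li><a href="javascript:;">登录</a></li>'
)
soup = BeautifulSoup(html, "html.parser")
first_a, second_a = soup.find_all("a")

two_children = first_a.string     # None: <a> holds an <i> tag and a text node
one_child = second_a.string       # '登录': <a> holds exactly one text node
all_text = first_a.get_text()     # '注册知乎': concatenates all nested text

print(two_children, one_child, all_text)
```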
Nested selection
>>> h.ul.li
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
3. Selecting related nodes
Child nodes and descendants
- To get the direct child nodes, use contents, which returns a list.
>>> h.li.contents
['\n', <a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>, '\n']
- children yields the same nodes, but returns an iterator rather than a list; enumerate is used here to traverse it.
>>> ch = h.li.children
>>> for i, c in enumerate(ch):
...     print(i, c)
...
0 

1 <a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
2 

- To walk every level and get all nested nodes, use descendants:
>>> de = h.li.descendants
>>> for i, d in enumerate(de):
...     print(i, d)
...
0 

1 <a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
2 <i class="zg-icon zg-icon-dd-home"></i>
3 注册知乎
4 

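The outputs above include the whitespace strings between tags. A possible way to keep only the tags is sketched below. It assumes a simplified copy of the first <li>, the built-in html.parser, and an isinstance check against bs4.element.Tag, which is one option among several.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified copy of the first <li>, for illustration only.
html = """<li>
<a href="javascript:;"><i class="zg-icon"></i>注册知乎</a>
</li>"""
soup = BeautifulSoup(html, "html.parser")
li = soup.li

direct = li.contents                      # list of direct children, strings included
child_tags = [c.name for c in li.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in li.descendants if isinstance(d, Tag)]
descendant_count = len(list(li.descendants))

print(len(direct), child_tags, descendant_tags, descendant_count)
```

Here contents holds three nodes (newline, <a>, newline), while descendants also reaches the <i> tag and the text inside <a>.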
Parent and ancestor nodes
>>> h.i.parent  # parent node
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
>>> h.li.next_sibling  # sibling node
'\n'
>>> list(enumerate(h.li.next_sibling))
[(0, '\n')]
>>> list(enumerate(h.ul.next_sibling))
[(0, '\n')]
>>> list(enumerate(h.a.next_sibling))
[(0, '\n')]
>>> list(enumerate(h.i.next_sibling))
[(0, '注'), (1, '册'), (2, '知'), (3, '乎')]
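Because next_sibling returns a single node, enumerating it walks the characters of one string rather than a series of siblings. To iterate over the siblings themselves, next_siblings and find_next_sibling() can be used. The following is a sketch on a made-up three-item list, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical list, used only for illustration.
html = "<ul><li>one</li>\n<li>two</li>\n<li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
first = soup.li

immediate = first.next_sibling            # the newline string right after </li>
# next_siblings yields every following sibling; keep only the tags.
following = [s for s in first.next_siblings if isinstance(s, Tag)]
texts = [t.get_text() for t in following]
next_li = first.find_next_sibling("li")   # skips the whitespace string

print(repr(immediate), texts, next_li)
```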
4. Extracting information
>>> list(h.a.parents)[0]
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
>>> list(h.a.parents)[1].attrs['class']
['topnav-noauth', 'clearfix']
parents is a generator, so it is converted to a list first in order to index into it.
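When only one particular ancestor is needed, the list conversion can be avoided. The sketch below assumes a simplified, hypothetical version of the markup and the built-in html.parser, and uses next() and find_parent() as alternatives to indexing.

```python
from bs4 import BeautifulSoup

# Simplified markup, for illustration only.
html = '<ul class="topnav-noauth clearfix"><li><a href="javascript:;">登录</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

nearest = next(a.parents)                  # first ancestor, same as a.parent
names = [p.name for p in a.parents]        # ancestor tag names, innermost first
ul = a.find_parent("ul")                   # nearest ancestor named <ul>
ul_classes = ul["class"]

print(nearest.name, names[:2], ul_classes)
```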
Method selectors
find_all()
Finding by tag name
>>> h.find_all(name='li')
[<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>, <li>
<a class="js-signin-noauth" href="javascript:;">登录</a>
</li>]
>>> h.find_all(name='a')
[<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>,
<a class="js-signin-noauth" href="javascript:;">登录</a>]
>>> h.find_all(name='a')[0]
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
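A few variations on searching by tag name are sketched below. The markup and URLs are made up, the built-in html.parser is assumed, and limit, a list of names, and find() are shown as related options that the session above does not cover.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<ul>
<li><a href="/signup"><i class="icon"></i>Sign up</a></li>
<li><a href="/signin">Sign in</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                 # positional form of name='a'
first_only = soup.find_all("a", limit=1)   # stop after the first match
either = soup.find_all(["a", "i"])         # a list matches any of the names
first = soup.find("a")                     # first match, or None if absent
absent = soup.find("table")

hrefs = [a["href"] for a in links]
print(hrefs, [t.name for t in either])
```

find_all() always returns a list (possibly empty), while find() returns a single tag or None, so the result of find() should be checked before it is used.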
Finding by attribute
>>> h.find_all(attrs={'href':'javascript:;'})
[<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>,
<a class="js-signin-noauth" href="javascript:;">登录</a>]
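Attribute filters can also be passed as keyword arguments. The sketch below compares the attrs dictionary with the keyword forms on made-up markup, again assuming the built-in html.parser. Since class is a reserved word in Python, Beautiful Soup accepts class_ instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<ul id="nav">
<li><a href="javascript:;" class="btn js-signup-noauth">Sign up</a></li>
<li><a href="javascript:;" class="btn js-signin-noauth">Sign in</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all(attrs={"href": "javascript:;"})  # dictionary form
by_keyword = soup.find_all(href="javascript:;")          # keyword form
by_class = soup.find_all(class_="js-signin-noauth")      # matches one of several classes
by_id = soup.find_all(id="nav")
combined = soup.find_all("a", class_="btn")              # tag name plus attribute

print(len(by_dict), len(by_keyword), [a.get_text() for a in by_class])
```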