Learning the Python parsing library BeautifulSoup

1. Basic Usage

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>Hello</p>', 'lxml')
>>> soup.p.string
'Hello'
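If lxml is not installed, BeautifulSoup can fall back on Python's built-in parser. A minimal sketch using the stdlib-backed 'html.parser' (the class name here is made up for illustration):

```python
from bs4 import BeautifulSoup

# 'html.parser' needs no external dependency, at the cost of some speed.
soup = BeautifulSoup('<p class="greet">Hello</p>', 'html.parser')
print(soup.p.name)      # tag name: 'p'
print(soup.p.string)    # text content: 'Hello'
print(soup.p['class'])  # multi-valued attribute, returned as a list
```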

2. Selecting nodes

Selecting elements

>>> html="""
<ul class="topnav-noauth clearfix">
<li>
<a href="javascript:;" class="js-signup-noauth"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
<li>
<a href="javascript:;" class="js-signin-noauth">登录</a>
</li>
</ul>"""
>>> h=BeautifulSoup(html,'lxml')
>>> h
<html><body><ul class="topnav-noauth clearfix">
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
<li>
<a class="js-signin-noauth" href="javascript:;">登录</a>
</li>
</ul>
</body></html>
>>> h.ul
<ul class="topnav-noauth clearfix">
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
<li>
<a class="js-signin-noauth" href="javascript:;">登录</a>
</li>
</ul>
>>> h.a.string
>>> 

 

// When assigning html, wrap the markup in triple quotes (""") so the pasted content can span multiple lines.

Getting attributes

>>> h.a['href']
'javascript:;'
>>> h.i['class']
['zg-icon', 'zg-icon-dd-home']
>>> h.a['class']
['js-signup-noauth']

 

Attributes can be read directly with []. Note that tag access such as h.a matches only the first occurrence of that tag in the document, and a multi-valued attribute such as class is returned as a list.
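Beyond [] indexing, a tag's full attribute dict is available as .attrs, and .get() avoids a KeyError when an attribute is missing. A small sketch (the markup is illustrative, parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup

html = '<a href="/x" class="btn primary">Link</a>'
h = BeautifulSoup(html, 'html.parser')

print(h.a.attrs)        # every attribute as a dict
print(h.a.get('href'))  # '/x'
print(h.a.get('id'))    # None -- .get() returns None instead of raising KeyError
```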

Getting text content

>>> h.li.string
>>> h.i.string
>>> h.a.string

 

// It may look strange that nothing is printed here. .string returns None whenever a tag has more than one child; the <a> above contains both an <i> tag and the text 注册知乎, so it yields None.
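To pull the text out of a tag with mixed children, get_text() concatenates all text descendants where .string gives up. A sketch reproducing the situation (parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup

html = '<a href="#"><i class="icon"></i>注册知乎</a>'
h = BeautifulSoup(html, 'html.parser')

print(h.a.string)      # None: the <a> has two children (<i> and a text node)
print(h.a.get_text())  # '注册知乎': joins the text of all descendants
```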

Nested selection

>>> h.ul.li
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>

 

3. Selecting associated nodes

Child and descendant nodes

- To get the direct children, use .contents, which returns a list.

>>> h.li.contents
['\n', <a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>, '\n']

 

- .children is similar, but returns a generator; traverse it with enumerate().

>>> ch = h.li.children
>>> for i, c in enumerate(ch):
    print(i, c)

0 

1 <a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
2 
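Since .children also yields the whitespace text nodes between tags (indices 0 and 2 above), it is common to filter for Tag objects only. A sketch under the assumption of a simple list (parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup, Tag

html = '<ul>\n<li>one</li>\n<li>two</li>\n</ul>'
h = BeautifulSoup(html, 'html.parser')

# Keep only real tags, dropping the '\n' text nodes between them.
tags = [c for c in h.ul.children if isinstance(c, Tag)]
print([t.get_text() for t in tags])  # ['one', 'two']
```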

 

.descendants traverses the hierarchy recursively, yielding every nested tag and string:

>>> de=h.li.descendants
>>> for i,d in enumerate(de):
    print(i,d)

0 

1 <a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
2 <i class="zg-icon zg-icon-dd-home"></i>
3 注册知乎
4 


 

Parent, ancestor, and sibling nodes

>>> h.i.parent # parent node
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
>>> h.li.next_sibling # sibling node
'\n'
>>> list(enumerate(h.li.next_sibling))
[(0, '\n')]
>>> list(enumerate(h.ul.next_sibling))
[(0, '\n')]
>>> list(enumerate(h.a.next_sibling))
[(0, '\n')]
>>> list(enumerate(h.i.next_sibling))
[(0, '注'), (1, '册'), (2, '知'), (3, '乎')]
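Besides next_sibling, the plural .next_siblings (and .previous_siblings) are generators over all remaining siblings. A sketch using a compact list so no whitespace nodes get in the way (parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup

html = '<ul><li>a</li><li>b</li><li>c</li></ul>'
h = BeautifulSoup(html, 'html.parser')

first = h.li
# .next_siblings yields every sibling after the current node.
print([s.get_text() for s in first.next_siblings])  # ['b', 'c']
```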

 

4. Information Extraction

>>> list(h.a.parents)[0]
<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>
>>> list(h.a.parents)[1].attrs['class']
['topnav-noauth', 'clearfix']

 

.parents is a generator, so convert it to a list before indexing.
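Instead of materializing the whole .parents generator, find_parent() jumps straight to the nearest ancestor with a given name. A sketch mirroring the snippet's structure (class name simplified, parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup

html = '<ul class="topnav"><li><a href="#"><i></i>注册知乎</a></li></ul>'
h = BeautifulSoup(html, 'html.parser')

print(h.a.find_parent('li').name)      # nearest <li> ancestor
print(h.a.find_parent('ul')['class'])  # ['topnav']
```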

Method selectors

find_all()

Finding by tag name

>>> h.find_all(name='li')
[<li>
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
</li>, <li>
<a class="js-signin-noauth" href="javascript:;">登录</a>
</li>]
>>> h.find_all(name='a')
[<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>, 
<a class="js-signin-noauth" href="javascript:;">登录</a>]

>>> h.find_all(name='a')[0]
<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>
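find_all() returns a plain list, so extracting attributes and text is a single comprehension. A sketch with markup similar to the example above (parsed with 'html.parser'; the hrefs are made up):

```python
from bs4 import BeautifulSoup

html = ('<ul><li><a href="/signup">注册知乎</a></li>'
        '<li><a href="/signin">登录</a></li></ul>')
h = BeautifulSoup(html, 'html.parser')

# Pull (href, text) pairs from every link in one pass.
links = [(a['href'], a.get_text()) for a in h.find_all('a')]
print(links)  # [('/signup', '注册知乎'), ('/signin', '登录')]
```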

 

Finding by attribute

>>> h.find_all(attrs={'href':'javascript:;'})
[<a class="js-signup-noauth" href="javascript:;"><i class="zg-icon zg-icon-dd-home"></i>注册知乎</a>, 
<a class="js-signin-noauth" href="javascript:;">登录</a>]
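Because class is a Python keyword, find_all() accepts it via the class_ keyword argument; find() works the same way but returns only the first match (or None). A sketch (parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup

html = ('<a class="js-signup-noauth" href="#">注册知乎</a>'
        '<a class="js-signin-noauth" href="#">登录</a>')
h = BeautifulSoup(html, 'html.parser')

# class_ avoids the clash with the Python keyword 'class'.
print(h.find_all(class_='js-signin-noauth')[0].get_text())  # '登录'
# find() returns the first match only, not a list.
print(h.find('a', class_='js-signup-noauth').get_text())    # '注册知乎'
```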

 


Origin www.cnblogs.com/BlueBlueSea/p/11037207.html