First, what is PyQuery?
PyQuery is also a powerful and flexible HTML-parsing library for Python. It offers a jQuery-like API, so elements are selected with CSS selectors.
Official documentation: http://pyquery.readthedocs.io/en/latest/
Second, basic use of the PyQuery library
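# Sample document used by the examples below.
# Assumption: the class names "wrap" and "list" are added here because the later selectors ('.wrap', '.list') rely on them.
html = '''
<div class="wrap">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''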
1. Initialization
# Initialize from a string (html is the sample string defined above)
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

# Initialize from a URL
from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('title'))

# Initialize from a file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))
2. CSS selectors - getting elements
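from pyquery import PyQuery as pq
doc = pq(html)

# Child elements
items = doc('.list')
lis = items.find('li')            # all descendant <li> elements
lis = items.children()            # direct children only
lis = items.children('.active')   # direct children that match a selector
print(lis)

# Parent elements
items = doc('.list')
container = items.parents()       # all ancestors
print(container)
parent = items.parents('.wrap')   # ancestors that match a selector
print(parent)

# Sibling elements
# Note the space: '.list .item-0.active' selects a descendant of .list
li = doc('.list .item-0.active')
print(li.siblings())
print(li.siblings('.active'))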
3. CSS selectors - getting attributes
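from pyquery import PyQuery as pq
doc = pq(html)

# Select the <a> inside the <li> that has both classes item-0 and active
a = doc('.item-0.active a')
print(a)
# Two equivalent ways to read the href attribute
print(a.attr.href)
print(a.attr('href'))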
4. Get Content
from pyquery import PyQuery as pq
doc = pq(html)

a = doc('.item-0.active a')
print(a)
# text() returns the text content with the tags stripped
print(a.text())
5. Get HTML
from pyquery import PyQuery as pq
doc = pq(html)

li = doc('.item-0.active')
print(li)
# html() returns the inner HTML of the first matched element
print(li.html())