When scraping web pages with Python, we usually send the request with urllib or requests. If the response is JSON, we parse it with the json module; for pages in other formats we can parse the data with XPath (lxml), Beautiful Soup, pyquery, and so on. pyquery is a powerful page-parsing tool: it offers a jQuery-like syntax for parsing HTML documents and supports CSS selectors, which makes it easy to use.
Table of contents
1. Preparation and initialization
2. Basic use of CSS selectors in pyquery
3. Pseudo-class selectors
4. Traversal and getting information (attributes, text)
5. Finding child (descendant), parent (ancestor), and sibling nodes
6. Node manipulation
1. Preparation and initialization
# pip install pyquery  # install
from pyquery import PyQuery as pq  # import
import requests
html = '''
<div>
<p class="pidg" id="name">nba</p>
<td class="nobr player desktop">
<a href="bucks" class="ng-binding" target="_parent"
href1="/teams/#!/bucks"><!-- ngIf: row.clinched -->密尔沃基 雄鹿<b>nba</b></a>
</td>
<tr data-ng-repeat="(i, row) in page" index="0" class="ng-scope">
<td class="nobr center bold ng-binding_0" href="href01">6</td>
<td class="nobr center bold desktop ng-binding">18 - 4</td>
<td class="nobr center bold desktop ng-binding">胜 6</td>
<td class="nobr center bold desktop ng-binding">119.5</td>
</tr>
</div>
'''
'''************* 1. Initialization ***********************'''
doc = pq(html)  # Initialization: pyquery accepts several kinds of input, e.g. an HTML string, a URL, or a file name.
# doc = pq(requests.get('https://blog.csdn.net/weixin_41685388/category_9426224.html').text)
# doc = pq(filename='demo.html')
2. Basic use of CSS selectors in pyquery
Selector | Example | Description [doc = pq(html); html comes from section 1, Preparation and initialization] |
---|---|---|
* | * | Selects all elements: doc('*') |
element | p | Selects all <p> elements: doc('p') |
.class | .pidg | Selects all elements with class="pidg": doc('.pidg') |
element.class | p.pidg | Selects all <p> elements with class="pidg": doc('p.pidg') |
#id | #name | Selects the element with id="name": doc('#name') |
element#id | p#name | Selects the <p> element with id="name": doc('p#name') |
element,element | p,a | Selects all <p> and all <a> elements: doc('p,a') |
element element | div p | Selects all <p> elements inside <div> elements: doc('div p')  # note the space in between |
element>element | div>td | Selects all <td> elements whose parent is a <div>: doc('div>td') |
element+element | p+td | Selects every <td> placed immediately after a <p> at the same level: doc('p+td') |
element~element | td~td | Selects every <td> that is preceded by a sibling <td>: doc('td~td').text()  # 18 - 4 胜 6 119.5. Note that the first <td> of the row, whose text() is 6, has no <td> before it and is therefore not in the output. |
[attribute] | [href] | Selects all elements that have an href attribute: doc('[href]') |
[attribute=value] | [href=bucks] | Selects all elements with href="bucks": doc('[href=bucks]') or doc('[href="bucks"]') |
[attribute=value] | a[href="bucks"] | Selects <a> elements with href="bucks": doc('a[href="bucks"]') |
[attribute=value] | [class="nobr player desktop"] | Selects all elements whose class attribute is exactly "nobr player desktop": doc('[class="nobr player desktop"]') |
[attribute~=value] | [class~=desktop] | Selects all elements whose class attribute contains the word desktop (as one of its space-separated values): doc('[class~=desktop]') |
[attribute\|=value] | [href\|=bucks] | Selects all elements whose href value is exactly "bucks" or starts with "bucks-": doc('[href\|=bucks]')  # behaved unreliably when the author tested it |
[attribute^=value] | a[href^=bu] | Selects every <a> element whose href value starts with "bu": doc('a[href^=bu]') |
[attribute$=value] | a[href$=cks] | Selects every <a> element whose href value ends with "cks": doc('a[href$=cks]') |
[attribute*=value] | [class*=desktop] | Selects all elements whose class attribute contains the substring desktop: doc('[class*=desktop]') |
3. Pseudo-class selectors
CSS selectors support a number of pseudo-class selectors, for example to select the first node, the last node, the n-th node, or nodes that contain a given text.
Selector | Description | Example [doc = pq(html); html comes from section 1, Preparation and initialization] |
---|---|---|
:first-child | Gets the first node | doc("tr>td:first-child") |
:last-child | Gets the last node | doc("tr>td:last-child") |
:nth-child(N) | Gets the N-th node, N = 1, 2, ... | doc("tr>td:nth-child(2)") |
:nth-child(2n) | Gets all nodes at even positions | doc("tr>td:nth-child(2n)") |
:nth-child(2n-1) | Gets all nodes at odd positions | doc("tr>td:nth-child(2n-1)") |
:gt(N) | Gets the nodes whose index is greater than N, N = 0, 1, ... | doc("tr>td:gt(1)") |
:contains('雄鹿') | Gets the nodes whose text contains "雄鹿" | doc("td:contains('雄鹿')") |
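4. Traversal and getting information (attributes, text)
Method | Description | Example [doc = pq(html); html comes from section 1, Preparation and initialization] |
---|---|---|
.items() | Iterates over multiple nodes | for td in doc('tr>td').items(): print(td) |
.attr() | Gets an attribute | doc("a").attr("href") |
.attr.name | Gets an attribute (attribute-style access) | doc("a").attr.href |
.text() | Gets the text | doc("a").text()  # 密尔沃基 雄鹿nba |
.html() | Gets the HTML inside the node | doc("a").html()  # <!-- ngIf: row.clinched -->密尔沃基 雄鹿<b>nba</b> |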
5. Finding child (descendant), parent (ancestor), and sibling nodes
Method | Description | Example [doc = pq(html); html comes from section 1, Preparation and initialization] |
---|---|---|
.find() | Finds all matching descendant nodes | doc('div').find('td') |
.children() | Finds the direct child nodes | doc('td[class="nobr player desktop"]').children() |
.children() | Finds the direct child nodes that match a selector | doc('td[class="nobr player desktop"]').children('a[href="bucks"]') |
.parent() | Finds the direct parent node | doc('a[href="bucks"]').parent() |
.parent() | Finds the parent node if it matches a selector | doc('a[href="bucks"]').parent('td[class="nobr player desktop"]') |
.parents() | Finds the ancestor nodes | doc('a[href="bucks"]').parents() |
.parents() | Finds the ancestor nodes that match a selector | doc('a[href="bucks"]').parents('td[class="nobr player desktop"]') |
.siblings() | Finds all sibling nodes | doc('td[href="href01"]').siblings() |
.siblings() | Finds the sibling nodes that match a selector | doc('td[href="href01"]').siblings('td[class *= "nobr"]') |
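6. Node manipulation
To make extraction easier, we can modify the nodes of the HTML we have already fetched, for example by adding a class at a given position or removing one or more nodes we do not need.
Method | Description | Example [doc = pq(html); html comes from section 1, Preparation and initialization] |
---|---|---|
removeClass() | Removes a class | r=doc("tr>td").removeClass("center") or r=doc("tr>td").remove_class("center") |
addClass() | Adds a class | r=doc("tr>td").addClass("nba") or r=doc("tr>td").add_class("nba") |
attr() | Adds an attribute a with the value nba | r=doc("tr>td").attr("a","nba") |
text() | Sets the node's text to nba | r=doc("td>a").text("nba") |
html() | Sets the node's inner HTML to nba | r=doc("td>a").html("nba") |
remove() | Removes the selected nodes | doc("tr").remove(); print(doc) |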