Python 3 crawler (7): parsing data with pyquery and CSS selectors

When scraping web pages with Python, we usually send the request with urllib or requests. If the returned data is in JSON format, we parse it with the json module; for pages in other formats, we can parse the data with XPath (lxml), Beautiful Soup, pyquery, and similar tools. Among these, pyquery is a powerful web-page parsing tool: it offers a jQuery-like syntax for parsing HTML documents and supports CSS selectors, which makes it easy to use.

Table of contents

1. Preparation and initialization

2. Basic usage of pyquery CSS selectors

3. Pseudo-class selectors

4. Traversal and getting information (attributes, text)

5. Finding child (descendant) nodes, parent (ancestor) nodes, and sibling nodes

6. Node operations


1. Preparation and initialization

# pip install pyquery  # install
from pyquery import PyQuery as pq  # import
import requests
html = '''
    <div>
        <p class="pidg" id="name">nba</p>
        <td class="nobr player desktop">
            <a href="bucks" class="ng-binding" target="_parent" 
            href1="/teams/#!/bucks"><!-- ngIf: row.clinched -->密尔沃基&nbsp;雄鹿<b>nba</b></a>
        </td>
        <tr data-ng-repeat="(i, row) in page" index="0" class="ng-scope">
            <td class="nobr center bold ng-binding_0" href="href01">6</td>
            <td class="nobr center bold desktop ng-binding">18&nbsp;-&nbsp;4</td>
            <td class="nobr center bold desktop ng-binding">胜 6</td>
            <td class="nobr center bold desktop ng-binding">119.5</td>
        </tr>
    </div>
    '''
'''*************1. Initialization***********************'''
doc = pq(html)  # Initialization: there are several ways to do it, e.g. passing a string, a URL, a file name, and so on.
# doc = pq(requests.get('https://blog.csdn.net/weixin_41685388/category_9426224.html').text)
# doc = pq(filename='demo.html')

2. Basic usage of pyquery CSS selectors

CSS selector | Example | Description [doc = pq(html); html comes from section 1: Preparation and initialization]
* | * | Select all elements: doc('*')
element | p | Select all <p> elements: doc('p')
.class | .pidg | Select all elements with class="pidg": doc('.pidg')
element.class | p.pidg | Select <p> elements with class="pidg": doc('p.pidg')
#id | #name | Select all elements with id="name": doc('#name')
element#id | p#name | Select the <p> element with id="name": doc('p#name')
element,element | p,a | Select all <p> and <a> elements: doc('p,a')
element element | div p | Select all <p> elements inside <div> elements: doc('div p')  # separated by a space
element>element | div>td | Select all <td> elements whose parent is a <div> element: doc('div>td')
element+element | p+td | Select every <td> element placed immediately after a <p> element: doc('p+td')  # same level
element~element | td~td | Select every <td> element that is preceded by a sibling <td> element: doc('td~td').text()  # 18 - 4 胜 6 119.5

Note: the first <td> in the row, whose text() is 6, has no <td> before it, so it is not in the output.

[attribute] | [href] | Select all elements with an href attribute: doc('[href]')
[attribute=value] | [href=bucks] | Select all elements with href="bucks": doc('[href=bucks]') or doc('[href="bucks"]')
[attribute=value] | a[href="bucks"] | Select <a> elements with href="bucks": doc('a[href="bucks"]')
[attribute=value] | [class="nobr player desktop"] | Select all elements with class="nobr player desktop": doc('[class*="nobr player desktop"]')
[attribute~=value] | [class~=desktop] | Select all elements whose class attribute contains the word desktop: doc('[class~=desktop]')
[attribute|=value] | [href|=bucks] | Select all elements whose href value is exactly "bucks" or begins with "bucks-": doc('[href|=bucks]')  # unstable when debugging
[attribute^=value] | a[href^=bu] | Select every <a> element whose href value begins with "bu": doc('a[href^=bu]')
[attribute$=value] | a[href$=cks] | Select every <a> element whose href value ends with "cks": doc('a[href$=cks]')
[attribute*=value] | [class*=desktop] | Select all elements whose class attribute contains the substring desktop: doc('[class*=desktop]')

3. Pseudo-class selectors

CSS selectors support a number of pseudo-class selectors, for example to select the first node, the last node, the n-th node, or a node containing the specified text.

Selector | Description | Example [doc = pq(html); html comes from section 1: Preparation and initialization]
:first-child | Get the first node | doc("tr>td:first-child")
:last-child | Get the last node | doc("tr>td:last-child")
:nth-child(N) | Get the N-th node, N = 1, 2, ... | doc("tr>td:nth-child(2)")
:nth-child(2n) | Get all nodes at even positions | doc("tr>td:nth-child(2n)")
:nth-child(2n-1) | Get all nodes at odd positions | doc("tr>td:nth-child(2n-1)")
:gt(N) | Get the nodes whose index is greater than N, N = 0, 1, ... | doc("tr>td:gt(1)")
:contains('雄鹿') | Get the nodes whose text contains "雄鹿" | doc("td:contains('雄鹿')")

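4. Traversal and getting information (attributes, text)

Method | Description | Example [doc = pq(html); html comes from section 1: Preparation and initialization]
.items() | Iterate over multiple nodes | for td in doc('tr>td').items(): print(td)
.attr() | Get an attribute | doc("a").attr("href")
.attr. | Get an attribute | doc("a").attr.href
.text() | Get the text | doc("a").text()  # 密尔沃基 雄鹿nba
.html() | Get the HTML inside the node | doc("a").html()  # <!-- ngIf: row.clinched -->密尔沃基 雄鹿<b>nba</b>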

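5. Finding child (descendant) nodes, parent (ancestor) nodes, and sibling nodes

Method | Description | Example [doc = pq(html); html comes from section 1: Preparation and initialization]
.find() | Find all descendant nodes that match | doc('div').find('td')
.children() | Find the direct child nodes | doc('td[class="nobr player desktop"]').children()
.children() | Find the direct child nodes that match | doc('td[class="nobr player desktop"]').children('a[href="bucks"]')
.parent() | Find the direct parent node | doc('a[href="bucks"]').parent()
.parent() | Find the parent node if it matches | doc('a[href="bucks"]').parent('td[class="nobr player desktop"]')
.parents() | Find the ancestor nodes | doc('a[href="bucks"]').parents()
.parents() | Find the ancestor nodes that match | doc('a[href="bucks"]').parents('td[class="nobr player desktop"]')
.siblings() | Find all sibling nodes | doc('td[href="href01"]').siblings()
.siblings() | Find the sibling nodes that match | doc('td[href="href01"]').siblings('td[class *= "nobr"]')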

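6. Node operations

To make extraction easier, we can modify the nodes of the HTML we have already fetched, for example by adding a class at a given position or removing one or more nodes we do not need.

Method | Description | Example [doc = pq(html); html comes from section 1: Preparation and initialization]
removeClass() | Remove a class from the class attribute | r = doc("tr>td").removeClass("center") or r = doc("tr>td").remove_class("center")
addClass() | Add a class to the class attribute | r = doc("tr>td").addClass("nba") or r = doc("tr>td").add_class("nba")
attr() | Add attribute a with the value nba | r = doc("tr>td").attr("a", "nba")
text() | Change the text inside the node to nba | r = doc("td>a").text("nba")
html() | Change the HTML inside the node to nba | r = doc("td>a").html("nba")
remove() | Remove the specified nodes | doc("tr").remove(); print(doc)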


Origin blog.csdn.net/weixin_41685388/article/details/104076625