Use of Python XPath


0x00 Foreword

  I learned XPath a few months ago, but after a long time without using it I found I had forgotten most of it. Today I am reviewing the basics of XPath with reference to the book "Python3 Web Crawler Development in Practice" and writing these notes on the blog for future review.

0x01 XPath Introduction

  XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents, so when writing a crawler we can use XPath to extract the information we need.

0x02 Preparation

  Before using XPath, we need to install the lxml library:

pip install lxml

0x03 Introductory example

from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

  We first import the etree module from the lxml library, then declare a piece of HTML text (taken from the Maoyan movie site) and pass it to etree.HTML() to initialize it, which gives us an object that can be queried with XPath. The etree module also corrects malformed HTML automatically. Here we call tostring() to output the corrected HTML. The result is of type bytes, so we use decode() to convert it to str. The result is as follows:

<html><body><div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="&#35823;&#26432;" data-act="boarditem-click" data-val="{movieId:1218273}">&#35823;&#26432;</a></p> 
        <p class="star">
                &#20027;&#28436;&#65306;&#32918;&#22830;,&#35885;&#21331;,&#38472;&#20914;
</p><p class="releasetime">&#19978;&#26144;&#26102;&#38388;&#65306;2019-12-13    </p></div>
</body></html>

  As we can see, the missing closing tags on the p nodes have been filled in, and the html and body nodes have been added automatically.
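
  As a quick check of the repair behaviour described above, the sketch below parses a similarly truncated fragment and inspects the resulting tree. The fragment and the variable names are made up for illustration; they are not from the original page.

```python
from lxml import etree

# Illustrative fragment: both <p> tags are left unclosed on purpose.
broken = '<div class="movie-item-info"><p class="star">A<p class="releasetime">B</div>'

# etree.HTML() returns the root element of the repaired tree.
root = etree.HTML(broken)

# Element names in document order, including the added html and body nodes.
tags = [el.tag for el in root.iter()]
print(root.tag)
print(tags)

# The two unclosed <p> tags end up as sibling children of the div.
p_children = root.xpath('/html/body/div/p')
print(len(p_children))
```

  Checking the tree this way is often easier than reading the serialized output, especially when the text contains non-ASCII characters.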
  lxml can also read and parse a text file directly, for example:

from lxml import etree

html = etree.parse('F:\\code\\myProject\\博客\\test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

The contents of the file being read are:

<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>

  Here the Chinese text comes out garbled in the parsed HTML. For now we only need to understand the basics of XPath parsing, so don't get hung up on the Chinese encoding; we will deal with that problem later. The results are as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div class="movie-item-info">&#13;
        <p class="name"><a href="/films/1218273" title="&#232;&#175;&#175;&#230;&#157;&#128;" data-act="boarditem-click" data-val="{movieId:1218273}">&#232;&#175;&#175;&#230;&#157;&#128;</a></p>&#13;
        <p class="star">&#13;
                &#228;&#184;&#187;&#230;&#188;&#148;&#239;&#188;&#154;&#232;&#130;&#150;&#229;&#164;&#174;,&#232;&#176;&#173;&#229;&#141;&#147;,&#233;&#153;&#136;&#229;&#134;&#178;&#13;
</p><p class="releasetime">&#228;&#184;&#138;&#230;&#152;&#160;&#230;&#151;&#182;&#233;&#151;&#180;&#239;&#188;&#154;2019-12-13    </p></div>&#13;
<div class="movie-item-info">&#13;
        <p class="name"><a href="/films/1190122" title="&#229;&#143;&#182;&#233;&#151;&#174;4&#239;&#188;&#154;&#229;&#174;&#140;&#231;&#187;&#147;&#231;&#175;&#135;" data-act="boarditem-click" data-val="{movieId:1190122}">&#229;&#143;&#182;&#233;&#151;&#174;4&#239;&#188;&#154;&#229;&#174;&#140;&#231;&#187;&#147;&#231;&#175;&#135;</a></p>&#13;
        <p class="star">&#13;
                &#228;&#184;&#187;&#230;&#188;&#148;&#239;&#188;&#154;&#231;&#148;&#132;&#229;&#173;&#144;&#228;&#184;&#185;,&#229;&#144;&#180;&#230;&#168;&#190;,&#229;&#144;&#180;&#229;&#187;&#186;&#232;&#177;&#170;&#13;
        </p>&#13;
<p class="releasetime">&#228;&#184;&#138;&#230;&#152;&#160;&#230;&#151;&#182;&#233;&#151;&#180;&#239;&#188;&#154;2019-12-20</p>    </div></body></html>    

0x04 Selecting nodes

  • Selecting all nodes
      We usually use an XPath rule beginning with // to select all nodes that meet the requirements. For example:
from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//*')  # select all nodes
print(result)

  Here * matches every node, so all nodes in the entire HTML text are selected. The return value is a list in which each element is of type Element, shown together with the node name, such as html, body, div, p, and a; every node is included in the list.
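
  Since no output is shown for this example, here is a small sketch that prints the node names rather than the Element objects themselves. It uses a shortened, well-formed version of the movie snippet; the tag attribute of an lxml Element holds the node name.

```python
from lxml import etree

# Shortened version of the movie snippet, for illustration only.
text = '<div class="movie-item-info"><p class="name"><a href="/films/1218273">误杀</a></p></div>'
html = etree.HTML(text)

# //* matches every element; .tag gives each element's node name.
nodes = html.xpath('//*')
tags = [node.tag for node in nodes]
print(tags)
```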

  • Selecting specific nodes
      To select all p nodes, for example:
from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//p')      # select all p nodes
print(result)
print(result[0])

  To select all p nodes, put the node name directly after // and pass the expression to the xpath() method. Results are as follows:

[<Element p at 0x46ac428>, <Element p at 0x46ac4e8>, <Element p at 0x46ac528>, <Element p at 0x46ac548>, <Element p at 0x46ac568>, <Element p at 0x46ac5a8>]
<Element p at 0x46ac428>

  The extracted result is again a list in which each element is an Element object. To retrieve one of the objects, index the list with square brackets, for example [0].

  • Selecting child nodes
      We can use / or // to find the child nodes or descendant nodes of an element. To select all p nodes that are direct children of a div node, for example:
from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//div/p')  # select p nodes that are direct children of div
print(result)
print(result[0])

Results are as follows:

[<Element p at 0x49ad508>, <Element p at 0x49ad5c8>, <Element p at 0x49ad608>, <Element p at 0x49ad628>, <Element p at 0x49ad648>, <Element p at 0x49ad688>]
<Element p at 0x49ad508>


  • Selecting descendant nodes
      Above we used / to select child nodes; to select descendant nodes we can use //. For example, to select all a nodes that are descendants of a div node:
from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//div//a')     # select a nodes that are descendants of div
print(result)
print(result[0])

Results are as follows:

[<Element a at 0x44dc468>, <Element a at 0x44dc4a8>]
<Element a at 0x44dc468>

  Note the difference between / and //: / selects only direct child nodes, while // selects descendant nodes at any depth.
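
  To make the difference concrete, the following sketch runs both forms against a shortened version of the snippet, in which the a node is a grandchild of the div rather than a direct child.

```python
from lxml import etree

# Shortened version of the movie snippet: div > p > a.
text = '<div class="movie-item-info"><p class="name"><a href="/films/1218273">误杀</a></p></div>'
html = etree.HTML(text)

# / only looks at direct children, so no a node is found under div.
direct = html.xpath('//div/a')

# // looks at all descendants, so the a node inside p is found.
nested = html.xpath('//div//a')

print(len(direct), len(nested))
```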

  • Selecting the parent node
      Now that we know how to select child and descendant nodes, the parent node can be selected with .. (two dots). For example, to get the class attribute of the parent of the a node whose href attribute is /films/1218273, that is, the class attribute of the p node:
from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//a[@href="/films/1218273"]/../@class')
print(result)

Results are as follows:

['name']
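
  For comparison, .. is the abbreviated form of the parent axis in XPath, so the same query can be written with parent::*. The sketch below, on a shortened snippet, shows both forms plus lxml's getparent() method on an element, which reaches the parent from Python instead of from the XPath expression.

```python
from lxml import etree

text = '<div class="movie-item-info"><p class="name"><a href="/films/1218273">误杀</a></p></div>'
html = etree.HTML(text)

# Abbreviated form and explicit parent axis select the same node.
short_form = html.xpath('//a[@href="/films/1218273"]/../@class')
axis_form = html.xpath('//a[@href="/films/1218273"]/parent::*/@class')

# Alternative: select the a element, then walk up with getparent().
a = html.xpath('//a[@href="/films/1218273"]')[0]
parent_class = a.getparent().get('class')

print(short_form, axis_form, parent_class)
```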

0x05 Attribute matching

  When selecting the parent node's attribute above, we already made use of attributes. In XPath we can use the @ symbol to filter by attribute. For example, to select the p nodes whose class is star:

from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//p[@class="star"]')   # attribute matching
print(result)

  Here the predicate [@class="star"] restricts the match to nodes whose class attribute is star. Two p nodes in the HTML text meet this condition, so two matching elements should be returned. Results are as follows:

[<Element p at 0x4c0c3a8>, <Element p at 0x4c0c468>]


0x06 Text extraction

  We can use text() in XPath to get the text content of a node. Let's try to get the text of the p nodes whose class attribute is name, which is the movie title:

from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//p[@class="name"]/text()')
print(result)

Results are as follows:

[]

  As we can see, we did not get any text. The book's explanation is that text() is preceded by /, and / selects direct children only. The direct child of the p node here is the a node, and the text sits inside the a node, so using / at this point matches nothing.
  If we want the text inside the p node, there are two ways: select the a node first and then get its text, or use //. Let's look at the difference between the two.
  First, select the a node and then get its text:

from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//p[@class="name"]/a/text()')
print(result)

Results are as follows:

['误杀', '叶问4:完结篇']

  Two values are returned, and they are the text of the p nodes whose class attribute is name. Here we selected layer by layer: first the p node, then its direct child a with /, and then the text. These are exactly the two results we expected.
  Now let's look at the other approach, using //:

from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//p[@class="name"]//text()')   # use // to get the text under the p node
print(result)

Results are as follows:

['误杀', '叶问4:完结篇']

  We get the expected results. But if we want the text inside the p nodes whose class attribute is star, the code is as follows:

from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//p[@class="star"]//text()')
print(result)

Results are as follows:

['\n                主演:肖央,谭卓,陈冲\n', '\n                主演:甄子丹,吴樾,吴建豪\n        ']

  We got the content we wanted, but some line breaks came with it. So, to get all the text inside a node's descendants, use // followed by text() directly; this gives the most complete text, but it may be mixed with line breaks and other whitespace characters. To get the text of one particular descendant node, select that node first and then use text() on it, which keeps the result clean.
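
  When the stray line breaks are unwanted, they can be removed after extraction. The sketch below shows two possible approaches on a shortened snippet: stripping each string in Python, and the XPath function normalize-space(), which trims leading and trailing whitespace and collapses internal runs of whitespace. Note that normalize-space() applied to a node set only uses the first matching node.

```python
from lxml import etree

# Shortened snippet with the same kind of surrounding whitespace.
text = '''
<div>
    <p class="star">
        主演:肖央,谭卓,陈冲
    </p>
</div>
'''
html = etree.HTML(text)

# Raw result still carries the newlines and indentation.
raw = html.xpath('//p[@class="star"]//text()')

# Option 1: strip each string in Python and drop empty ones.
cleaned = [t.strip() for t in raw if t.strip()]

# Option 2: let XPath normalize the whitespace (first matching node only).
normalized = html.xpath('normalize-space(//p[@class="star"])')

print(cleaned)
print(normalized)
```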

0x07 Attribute extraction

  We know that text() gets the text inside a node, but how do we get a node's attributes? We can again use the @ symbol. For example, to get the href attribute of every a node under the p nodes:

from lxml import etree

text = '''
<div class="movie-item-info">
        <p class="name"><a href="/films/1218273" title="误杀" data-act="boarditem-click" data-val="{movieId:1218273}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,陈冲
<p class="releasetime">上映时间:2019-12-13    </div>
<div class="movie-item-info">
        <p class="name"><a href="/films/1190122" title="叶问4:完结篇" data-act="boarditem-click" data-val="{movieId:1190122}">叶问4:完结篇</a></p>
        <p class="star">
                主演:甄子丹,吴樾,吴建豪
        </p>
<p class="releasetime">上映时间:2019-12-20</p>    </div>
'''
html = etree.HTML(text)
result = html.xpath('//p/a/@href')
print(result)

  Here @href gets the href attribute of the node. Note that this differs from attribute matching: attribute matching uses square brackets with both an attribute name and a value to restrict the match, as in [@href="/films/1218273"], whereas @href here extracts an attribute of the a node. The two need to be distinguished.
Results are as follows:

['/films/1218273', '/films/1190122']

  We successfully obtained the href attributes of the a nodes under all p nodes, returned as a list.
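
  In a crawler it is common to need several attributes of the same node together. One way to do that, sketched here on a shortened snippet, is to select the a elements first and then read each attribute from the element, either with a relative XPath (./@href) or with the element's get() method. The movies list is just an illustrative name.

```python
from lxml import etree

text = '''
<div><p class="name"><a href="/films/1218273" title="误杀">误杀</a></p></div>
<div><p class="name"><a href="/films/1190122" title="叶问4:完结篇">叶问4:完结篇</a></p></div>
'''
html = etree.HTML(text)

movies = []
for a in html.xpath('//p[@class="name"]/a'):
    href = a.xpath('./@href')[0]   # relative XPath, evaluated from this a element
    title = a.get('title')         # same idea using the Element API
    movies.append((href, title))

print(movies)
```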

0x08 Multi-valued attribute matching

  Sometimes an attribute of a node has multiple values, for example:

from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)

  Here the class attribute of the li node has two values, li and li-first. If we match with the exact attribute comparison used earlier, nothing is matched, and the results are as follows:

[]

  In this case we need the contains() function, and the code can be rewritten as follows:

from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

  contains() takes the attribute as its first argument and the value to look for as its second. As long as the attribute's value contains the given string, the node matches.
In this case the results are as follows:

['first item']
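
  One caveat worth keeping in mind: contains() is a plain substring test, so contains(@class, "li") also matches a class such as link. A widely used workaround, shown below as a sketch on made-up markup, pads the attribute with spaces and searches for the padded class name so that only whole class tokens match.

```python
from lxml import etree

# Made-up markup: the second li has a class that merely contains "li".
text = '''
<ul>
<li class="li li-first"><a href="link1.html">first item</a></li>
<li class="link"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring test: matches both li nodes.
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Whole-token test: only matches a class list that contains the token "li".
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```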


0x09 Multi-attribute matching

  We may also need to identify a node by more than one attribute, in which case several attributes must be matched at once. The and operator can be used to join the conditions, for example:

from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

  Here the li node has an additional name attribute. To pin down this node we select on both class and name: one condition is that the class attribute contains the string li, the other is that the name attribute equals the string item. Both must hold at the same time, so they are joined with the and operator and placed inside the square brackets as a filter. Results are as follows:

['first item']
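
  Besides and, XPath predicates can also be combined with or and negated with the not() function. The sketch below uses made-up li nodes to show all three side by side.

```python
from lxml import etree

# Made-up markup with three different name attributes.
text = '''
<ul>
<li class="li li-first" name="item"><a href="link1.html">first item</a></li>
<li class="li" name="other"><a href="link2.html">second item</a></li>
<li class="li" name="misc"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# Both conditions must hold.
both = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')

# Either condition may hold.
either = html.xpath('//li[@name="item" or @name="other"]/a/text()')

# The condition must not hold.
negated = html.xpath('//li[not(@name="item")]/a/text()')

print(both, either, negated)
```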


0x10 Selecting by position

  Sometimes a selection matches several nodes at once but we only want one of them. In that case we can pass an index in square brackets to pick the node at a particular position, for example:

from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link2.html">second item</a></li>
<li class="item-1"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
result_1 = html.xpath('//li[1]/a/text()')
print(result_1)
result_2 = html.xpath('//li[last()]/a/text()')
print(result_2)
result_3 = html.xpath('//li[position()<3]/a/text()')
print(result_3)
result_4 = html.xpath('//li[last()-2]/a/text()')
print(result_4)

  The first selection picks the first li node by passing the number 1 in square brackets; note that XPath positions start at 1, not 0. The second picks the last li node by passing last(). The third picks the li nodes whose position is less than 3, that is, the first and second. The fourth picks the third-from-last li node with last()-2.
Results are as follows:

['first item']
['fifth item']
['first item', 'second item']
['third item']
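
  One subtlety with positional predicates: in //li[1] the position is counted among the li children of each parent, not across the whole document. When there are several lists, wrapping the path in parentheses, as in (//li)[1], applies the index to the combined result instead. The markup below is made up to show the difference.

```python
from lxml import etree

# Made-up markup with two separate lists.
text = '''
<div>
<ul>
<li><a href="link1.html">first item</a></li>
<li><a href="link2.html">second item</a></li>
</ul>
<ul>
<li><a href="link3.html">third item</a></li>
<li><a href="link4.html">fourth item</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# First li within each ul: one result per list.
per_parent = html.xpath('//li[1]/a/text()')

# First li in the whole document: a single result.
overall = html.xpath('(//li)[1]/a/text()')

print(per_parent)
print(overall)
```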

Epilogue

  XPath selectors are powerful and easy to use, and they can greatly improve the efficiency of a crawler.

Origin www.cnblogs.com/g0udan/p/12231960.html