XPath: selecting elements by id and class

Essential background

  • In HTML, an id should be unique within a page
  • In HTML, the same class can be used on many elements
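The difference between the two shows up as soon as you count matches. The following is a minimal sketch, assuming lxml is installed (it is listed under the tools below); the HTML fragment and the variable names are made up for illustration.

```python
from lxml import etree

# A made-up fragment: each id appears once, the class "three" repeats.
html_str = """
<body>
<div id="second">
    <div class="three">a</div>
    <div class="three">b</div>
</div>
<div id="third">
    <div class="three">c</div>
</div>
</body>
"""

html = etree.HTML(html_str)

# Selecting by id is expected to give a single element.
by_id = html.xpath('.//div[@id="second"]')
# Selecting by class can give many elements.
by_class = html.xpath('.//div[@class="three"]')

print(len(by_id))     # 1
print(len(by_class))  # 3
```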

Tools

  • Python 3
  • lxml library [its advantage is fast parsing]
  • An HTML code block [take one from the Internet or write your own]
  • requests [recommended; practicing on source fetched from a real web page is even better]

Learning XPath

First, define the HTML code block [only the body is used this time]

html_str = """
<body>
<div class="container">
    <div id="first">
        <div class="one">都市</div>
        <div class="two">德玛西亚</div>
        <div class="two">王牌对王牌</div>
        <a>
            <div class="spe">特殊位置</div>
        </a>
    </div>
    <div id="second">
        <div class="three">水电费</div>
        <div class="three">说的话房间不开封</div>
        <div class="four">三顿饭黑客技术</div>
    </div>
    <div id="third">
        <div class="three">水电费</div>
        <div class="three">说的话房间开封</div>
    </div>
</div>
</body>
"""

Next, prepare the Python code:

from lxml import etree

html = etree.HTML(html_str)
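To see what etree.HTML hands back, the sketch below parses a tiny fragment (made up for illustration, not the full block) and inspects the result. The parser returns the root element of the tree and wraps the supplied markup in an html element, which is why the queries in the tasks can start from html with a leading dot.

```python
from lxml import etree

# A made-up fragment for illustration; only the body is supplied.
snippet = '<body><div class="one">都市</div></body>'

root = etree.HTML(snippet)

# The parser returns the root element of the parsed tree.
root_tag = root.tag
# Direct children of the root, in document order.
child_tags = [child.tag for child in root]
# An absolute path from the root reaches the same div as the relative './/div'.
absolute = root.xpath('/html/body/div/text()')
relative = root.xpath('.//div/text()')

print(root_tag)
print(child_tags)
print(absolute == relative)
```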

Task 1: Get the text of the div whose class is one

This one has a very simple XPath: match the class attribute in the HTML directly, then take the text value.

The code is as follows:

print(html.xpath('.//div[@class="one"]/text()'))

result:['都市']

A few things need explaining here:

  • The role of @: it denotes an attribute. div is a tag, and it has attributes of its own, such as class and id.
  • The role of the dot .: it denotes the current node; the corresponding double dot .. denotes the parent node, one level up.
  • The role of the double slash //: it searches all descendants of the current node, at any depth; the corresponding single slash / searches only the next level down, that is, the direct children. [The next two tasks practice this point.]
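These symbols are easier to see when a query starts from an element other than the root. The following is a minimal sketch on a cut-down version of the HTML block; the variable names are chosen for illustration. One thing it shows is that in lxml a path starting with // searches from the document root even when called on an inner element, so the leading dot matters.

```python
from lxml import etree

html_str = """
<body>
<div id="first">
    <div class="one">都市</div>
    <a>
        <div class="spe">特殊位置</div>
    </a>
</div>
<div id="second">
    <div class="three">水电费</div>
</div>
</body>
"""

html = etree.HTML(html_str)
first = html.xpath('.//div[@id="first"]')[0]

# @ inside [...] filters on an attribute; @ at the end of a path returns its value.
first_id = first.xpath('./@id')

# A leading dot keeps the search under `first`.
relative = first.xpath('.//div/@class')
# Without the dot, // starts from the document root, even when called on `first`.
absolute = first.xpath('//div/@class')

# .. steps up one level from the matched node.
spe = first.xpath('.//div[@class="spe"]')[0]
parent_tag = spe.xpath('..')[0].tag
grandparent_id = spe.xpath('../../@id')

print(first_id)        # ['first']
print(relative)        # ['one', 'spe']
print(absolute)        # ['one', 'spe', 'three']
print(parent_tag)      # a
print(grandparent_id)  # ['first']
```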

Task 2: Get the text of the first-level child divs under the div whose id is first

Only the first level is needed, so a single slash is enough. The XPath is as follows:

print(html.xpath('.//div[@id="first"]/div/text()'))

result:['都市', '德玛西亚', '王牌对王牌']

Task 3: Get the text of the divs at every level under the div whose id is first

This task contrasts with the previous one: that one used a single slash, this one uses a double slash. The XPath code is as follows:

print(html.xpath('.//div[@id="first"]//div/text()'))

result:['都市', '德玛西亚', '王牌对王牌', '特殊位置']
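To make the contrast between the single and double slash concrete, the sketch below runs both queries on the first block alone and lists what only the double slash finds. The variable names are illustrative.

```python
from lxml import etree

html_str = """
<body>
<div id="first">
    <div class="one">都市</div>
    <div class="two">德玛西亚</div>
    <div class="two">王牌对王牌</div>
    <a>
        <div class="spe">特殊位置</div>
    </a>
</div>
</body>
"""

html = etree.HTML(html_str)

# Single slash: only divs that are direct children of #first.
direct = html.xpath('.//div[@id="first"]/div/text()')
# Double slash: divs at any depth below #first.
all_levels = html.xpath('.//div[@id="first"]//div/text()')
# What the double slash adds: the div nested inside <a>.
nested_only = [t for t in all_levels if t not in direct]

print(nested_only)  # ['特殊位置']
```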

Task 4: Get the text of every div with class three under the div whose id is second

Select the div whose id is second, require the child div's class to be three, then take the text. The XPath is as follows:

print(html.xpath('.//div[@id="second"]/div[@class="three"]/text()'))

result:['水电费', '说的话房间不开封']
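If the same query is needed for several id and class combinations, one option is to pass the values as XPath variables, which lxml accepts as keyword arguments to xpath(). This is a sketch only: the helper child_texts and its parameter names are hypothetical, not part of lxml.

```python
from lxml import etree

html_str = """
<body>
<div id="second">
    <div class="three">水电费</div>
    <div class="three">说的话房间不开封</div>
    <div class="four">三顿饭黑客技术</div>
</div>
<div id="third">
    <div class="three">水电费</div>
    <div class="three">说的话房间开封</div>
</div>
</body>
"""

html = etree.HTML(html_str)


def child_texts(root, container_id, class_name):
    """Hypothetical helper: text of child divs with a given class under a given id."""
    # $cid and $cls are XPath variables filled in from the keyword arguments.
    return root.xpath(
        './/div[@id=$cid]/div[@class=$cls]/text()',
        cid=container_id,
        cls=class_name,
    )


second_three = child_texts(html, 'second', 'three')
third_three = child_texts(html, 'third', 'three')

print(second_three)  # ['水电费', '说的话房间不开封']
print(third_three)   # ['水电费', '说的话房间开封']
```

Using variables avoids building the path with string formatting, so quotes inside a value cannot break the expression.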

Task 5: Get the text of all divs whose class is three

Looking at the HTML code block, divs with class three appear in several places, so the best approach is to search the whole document directly. The simple, blunt XPath is as follows:

print(html.xpath('.//div[@class="three"]/text()'))

result:['水电费', '说的话房间不开封', '水电费', '说的话房间开封']
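One caveat: @class="three" compares the whole attribute string, so it misses elements that carry more than one class. The sketch below uses a made-up fragment (not the block from this article) to compare an exact match, a plain contains(), and a common token-matching idiom built from the standard XPath 1.0 functions concat and normalize-space.

```python
from lxml import etree

# Made-up fragment for illustration: one class, two classes, a look-alike class.
html_str = """
<body>
<div class="three">plain</div>
<div class="three wide">two classes</div>
<div class="threefold">different class</div>
</body>
"""

html = etree.HTML(html_str)

# Exact string comparison: misses "three wide".
exact = html.xpath('.//div[@class="three"]/text()')
# Substring test: also catches "threefold", which is a different class.
loose = html.xpath('.//div[contains(@class, "three")]/text()')
# Token test: pad with spaces so only the whole word "three" matches.
token = html.xpath(
    './/div[contains(concat(" ", normalize-space(@class), " "), " three ")]/text()'
)

print(exact)  # ['plain']
print(loose)  # ['plain', 'two classes', 'different class']
print(token)  # ['plain', 'two classes']
```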

Task 6: Get the divs whose text equals 水电费 and extract their class

To get the class from the text value, just reverse the previous task. The XPath is as follows:

print(html.xpath('.//div[text()="水电费"]/@class'))

result:['three', 'three']
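The same text match can be combined with the double dot to find out which block each hit lives in. The sketch below is an illustration on a cut-down fragment in which the second occurrence is deliberately surrounded by whitespace (an assumption, not the article's data), to show that text()="..." is an exact comparison and that normalize-space() is one way to loosen it.

```python
from lxml import etree

# Cut-down fragment; the text in #third is padded with whitespace on purpose.
html_str = """
<body>
<div id="second"><div class="three">水电费</div></div>
<div id="third"><div class="three">
    水电费
</div></div>
</body>
"""

html = etree.HTML(html_str)

# Exact comparison: the padded text in #third does not match.
exact = html.xpath('.//div[text()="水电费"]/../@id')
# normalize-space() trims and collapses whitespace before comparing.
tolerant = html.xpath('.//div[normalize-space(text())="水电费"]/../@id')

print(exact)     # ['second']
print(tolerant)  # ['second', 'third']
```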

Final code

html_str = """
<body>
<div class="container">
    <div id="first">
        <div class="one">都市</div>
        <div class="two">德玛西亚</div>
        <div class="two">王牌对王牌</div>
        <a>
            <div class="spe">特殊位置</div>
        </a>
    </div>
    <div id="second">
        <div class="three">水电费</div>
        <div class="three">说的话房间不开封</div>
        <div class="four">三顿饭黑客技术</div>
    </div>
    <div id="third">
        <div class="three">水电费</div>
        <div class="three">说的话房间开封</div>
    </div>
</div>
</body>
"""

from lxml import etree

html = etree.HTML(html_str)
print(html.xpath('.//div[@class="one"]/text()'))
print(html.xpath('.//div[@id="first"]/div/text()'))
print(html.xpath('.//div[@id="first"]//div/text()'))
print(html.xpath('.//div[@id="second"]/div[@class="three"]/text()'))
print(html.xpath('.//div[@class="three"]/text()'))
print(html.xpath('.//div[text()="水电费"]/@class'))

Copyright statement: reprint is allowed, please indicate the source -  "xpath tutorial": search by ID and class

Origin http://43.154.161.224:23101/article/api/json?id=324929543&siteId=291194637