Essential knowledge points
- In HTML, an id is unique within a page
- In HTML, a class can be used on multiple elements
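A small sketch of what these two points mean in practice, assuming the lxml library listed under the tools below. The fragment and the names `snippet`, `by_id`, and `by_class` are made up for illustration: an id lookup yields one element, while a class lookup may yield several.

```python
from lxml import etree

# A made-up fragment for illustration only: one id, one class used twice.
snippet = """
<div id="menu">
  <span class="item">a</span>
  <span class="item">b</span>
</div>
"""

tree = etree.HTML(snippet)

by_id = tree.xpath('//*[@id="menu"]')        # an id should match one element
by_class = tree.xpath('//*[@class="item"]')  # a class may match many elements

print(len(by_id), len(by_class))
print([el.text for el in by_class])
```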
Tools
- Python 3
- lxml library [its advantage is fast parsing]
- An HTML code block [take one from the Internet or write your own]
- requests [recommended; fetching real page source from the web to practice on works best]
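Fetching real page source to practice on could look roughly like the sketch below. The helper name `fetch_tree` and the timeout value are illustrative assumptions, not part of the original article; only `requests.get`, `raise_for_status`, and `etree.HTML` are real library calls.

```python
import requests
from lxml import etree


def fetch_tree(url, timeout=10):
    """Download a page and parse it into an lxml element tree.

    fetch_tree is a hypothetical helper name used only for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    # Pass the raw bytes so the parser can look at the encoding
    # declared in the page itself.
    return etree.HTML(response.content)
```

Calling `fetch_tree` with the URL of a page you want to practice on returns a tree that supports the same `.xpath()` calls used in the rest of this article.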
Learning XPath
First, define the HTML code block [this time starting from the body only]
html_str = """
<body>
<div class="container">
<div id="first">
<div class="one">都市</div>
<div class="two">德玛西亚</div>
<div class="two">王牌对王牌</div>
<a>
<div class="spe">特殊位置</div>
</a>
</div>
<div id="second">
<div class="three">水电费</div>
<div class="three">说的话房间不开封</div>
<div class="four">三顿饭黑客技术</div>
</div>
<div id="third">
<div class="three">水电费</div>
<div class="three">说的话房间开封</div>
</div>
</div>
</body>
"""
Next, prepare the Python code block
from lxml import etree
html = etree.HTML(html_str)
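It can help to check what `etree.HTML` actually returns before writing any XPath. The sketch below uses a shortened, one-line version of the HTML (an assumption made for brevity) and shows that the parser hands back the root element and supplies the enclosing html element itself, even though the input starts at body.

```python
from lxml import etree

# A shortened version of the article's HTML, kept on one line for brevity.
html_str = '<body><div id="first"><div class="one">都市</div></div></body>'

html = etree.HTML(html_str)

# The input starts at <body>, but the returned root is the <html> element.
print(html.tag)  # html

# The original <body> is still there, one level below the root.
body = html.find("body")
print(body is not None)

# Serialize the tree back to text to inspect what was parsed.
serialized = etree.tostring(html, encoding="unicode")
print(serialized)
```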
Task 1: Get the text value of the tag whose class is one
There is a very simple XPath for this: match the class attribute in the HTML directly, then get the text value.
The code is as follows:
print(html.xpath('.//div[@class="one"]/text()'))
result:['都市']
A few things need explaining here:
- The role of @: it denotes an attribute. div is a tag, and it has its own attributes, such as class, id, etc.
- The role of the dot .: it denotes the current position; the corresponding double dot .. denotes the level above (the parent).
- The role of the double slash //: it searches all descendants under the current tag; the corresponding single slash / searches only the next level down. [The next two tasks are exercises on this point]
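The four symbols can be tried side by side. The sketch below assumes a compact, shortened copy of the "first" block from the HTML above, and the variable names (`first`, `direct`, `nested`, `parent_class`) are illustrative.

```python
from lxml import etree

# The "first" block from the article's HTML, shortened and written compactly.
html_str = (
    '<body><div class="container"><div id="first">'
    '<div class="one">都市</div>'
    '<div class="two">德玛西亚</div>'
    '<a><div class="spe">特殊位置</div></a>'
    '</div></div></body>'
)
html = etree.HTML(html_str)

# "@" selects by attribute: here, the div whose id is "first".
first = html.xpath('//div[@id="first"]')[0]

# "." starts from the current node (the div with id="first").
direct = first.xpath('./div/text()')   # "/"  : direct children only
nested = first.xpath('.//div/text()')  # "//" : descendants at any depth

# ".." steps up one level to the parent node.
parent_class = first.xpath('../@class')

print(direct)        # ['都市', '德玛西亚']
print(nested)        # ['都市', '德玛西亚', '特殊位置']
print(parent_class)  # ['container']
```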
Task 2: Get the text values of the first-level child div tags under the tag whose id is first
Only the first level is needed, so a single slash is enough. The XPath is as follows:
print(html.xpath('.//div[@id="first"]/div/text()'))
result:['都市', '德玛西亚', '王牌对王牌']
Task 3: Get the text values of the div tags at all levels under the tag whose id is first
This task contrasts with the previous one: one uses a single slash, the other a double slash. The XPath is as follows:
print(html.xpath('.//div[@id="first"]//div/text()'))
result:['都市', '德玛西亚', '王牌对王牌', '特殊位置']
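To see why the extra value only shows up with the double slash, one option is to compare the two result sets as elements and print where the extra element sits in the tree. This is a sketch on a compact copy of the "first" block; `getroottree()` and `getpath()` are lxml calls, while the variable names are illustrative.

```python
from lxml import etree

# The "first" block from the article's HTML, written compactly.
html_str = (
    '<body><div class="container"><div id="first">'
    '<div class="one">都市</div>'
    '<div class="two">德玛西亚</div>'
    '<div class="two">王牌对王牌</div>'
    '<a><div class="spe">特殊位置</div></a>'
    '</div></div></body>'
)
html = etree.HTML(html_str)
tree = html.getroottree()

shallow = html.xpath('.//div[@id="first"]/div')   # single slash: children
deep = html.xpath('.//div[@id="first"]//div')     # double slash: descendants

# Elements that only the double-slash query finds.
only_deep = [el for el in deep if el not in shallow]

for el in only_deep:
    # The path passes through the <a> tag, so this div is not a direct child.
    print(el.text, tree.getpath(el))
```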
Task 4: Get the text values of all div tags with class three under the tag whose id is second
Specify the id as second and the class of the child div as three, then get the text. The XPath is as follows:
print(html.xpath('.//div[@id="second"]/div[@class="three"]/text()'))
result:['水电费', '说的话房间不开封']
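The same id-plus-class pattern can be reused for other ids and classes. A minimal sketch, assuming lxml's support for XPath variables passed as keyword arguments to `xpath()`; the helper name `texts_under` and its parameter names are hypothetical.

```python
from lxml import etree

# The "second" and "third" blocks from the article's HTML, written compactly.
html_str = (
    '<body><div class="container">'
    '<div id="second">'
    '<div class="three">水电费</div>'
    '<div class="three">说的话房间不开封</div>'
    '<div class="four">三顿饭黑客技术</div>'
    '</div>'
    '<div id="third">'
    '<div class="three">水电费</div>'
    '<div class="three">说的话房间开封</div>'
    '</div>'
    '</div></body>'
)
html = etree.HTML(html_str)


def texts_under(tree, div_id, cls):
    """Hypothetical helper: text of child divs with class `cls` under div `div_id`."""
    # $div_id and $cls are XPath variables filled in from the keyword arguments.
    query = './/div[@id=$div_id]/div[@class=$cls]/text()'
    return tree.xpath(query, div_id=div_id, cls=cls)


print(texts_under(html, "second", "three"))
print(texts_under(html, "third", "three"))
print(texts_under(html, "second", "four"))
```

Passing values as variables avoids building the XPath with string formatting, which matters once the values contain quotes.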
Task 5: Get the text values of all div tags whose class is three
Looking at the HTML code block, you will find div tags with class three in several places, so the best approach here is to search the whole document directly. The simple, blunt XPath is as follows:
print(html.xpath('.//div[@class="three"]/text()'))
result:['水电费', '说的话房间不开封', '水电费', '说的话房间开封']
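Because a global search mixes results from different blocks, it is sometimes useful to record which block each match came from. The sketch below selects the elements instead of their text and reads the parent's id with `getparent()`; the variable names are illustrative.

```python
from lxml import etree

# The "second" and "third" blocks from the article's HTML, written compactly.
html_str = (
    '<body><div class="container">'
    '<div id="second">'
    '<div class="three">水电费</div>'
    '<div class="three">说的话房间不开封</div>'
    '<div class="four">三顿饭黑客技术</div>'
    '</div>'
    '<div id="third">'
    '<div class="three">水电费</div>'
    '<div class="three">说的话房间开封</div>'
    '</div>'
    '</div></body>'
)
html = etree.HTML(html_str)

# Select the elements themselves (no /text()) so the parent can be inspected.
matches = html.xpath('.//div[@class="three"]')

# Pair each match with the id of the block that contains it.
pairs = [(el.getparent().get("id"), el.text) for el in matches]

for parent_id, text in pairs:
    print(parent_id, text)
```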
Task 6: Get the tags whose text equals 水电费, and take out their class
To get the class name from the text value, just reverse the previous task. The XPath is as follows:
print(html.xpath('.//div[text()="水电费"]/@class'))
result:['three', 'three']
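Matching on text does not have to be an exact comparison. As a small extension that is not in the original tasks, the standard XPath function `contains()` matches on part of the text; the search string 开封 is chosen here only as an example.

```python
from lxml import etree

# The "second" and "third" blocks from the article's HTML, written compactly.
html_str = (
    '<body><div class="container">'
    '<div id="second">'
    '<div class="three">水电费</div>'
    '<div class="three">说的话房间不开封</div>'
    '<div class="four">三顿饭黑客技术</div>'
    '</div>'
    '<div id="third">'
    '<div class="three">水电费</div>'
    '<div class="three">说的话房间开封</div>'
    '</div>'
    '</div></body>'
)
html = etree.HTML(html_str)

# Exact match, as in Task 6.
exact = html.xpath('.//div[text()="水电费"]/@class')

# Partial match: divs whose text contains the given substring.
partial = html.xpath('.//div[contains(text(), "开封")]/text()')

print(exact)    # ['three', 'three']
print(partial)  # ['说的话房间不开封', '说的话房间开封']
```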
Final code
html_str = """
<body>
<div class="container">
<div id="first">
<div class="one">都市</div>
<div class="two">德玛西亚</div>
<div class="two">王牌对王牌</div>
<a>
<div class="spe">特殊位置</div>
</a>
</div>
<div id="second">
<div class="three">水电费</div>
<div class="three">说的话房间不开封</div>
<div class="four">三顿饭黑客技术</div>
</div>
<div id="third">
<div class="three">水电费</div>
<div class="three">说的话房间开封</div>
</div>
</div>
</body>
"""
from lxml import etree
html = etree.HTML(html_str)
print(html.xpath('.//div[@class="one"]/text()'))
print(html.xpath('.//div[@id="first"]/div/text()'))
print(html.xpath('.//div[@id="first"]//div/text()'))
print(html.xpath('.//div[@id="second"]/div[@class="three"]/text()'))
print(html.xpath('.//div[@class="three"]/text()'))
print(html.xpath('.//div[text()="水电费"]/@class'))