1, Introduction to lxml and XPath
A crawler's main job is parsing HTML documents. Doing that with regular expressions is slow to develop and slow to run, so we turn to a dedicated parsing library instead.
XPath (XML Path Language) is a language for locating information in XML documents; it can also be used to search HTML. lxml is a high-performance Python library for parsing HTML/XML, and it supports XPath syntax.
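To make the idea concrete, here is a minimal sketch of parsing a string with lxml and querying it with XPath. The `snippet` markup is made up for illustration and is not from the pages discussed later.

```python
from lxml import etree

# A made-up fragment, used only for illustration.
snippet = '<div><p class="intro">Hello</p><p>World</p></div>'

# etree.HTML parses the string with lxml's HTML parser and returns the root <html> element.
root = etree.HTML(snippet)

# xpath() evaluates an XPath expression; a text() query returns a list of strings.
paragraphs = root.xpath('//p/text()')
intro = root.xpath('//p[@class="intro"]/text()')

print(root.tag)     # html
print(paragraphs)   # ['Hello', 'World']
print(intro)        # ['Hello']
```

No regular expression is needed: the structure of the document does the selecting.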
2, XPath rules
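As a rough sketch of the most commonly used rules: `nodename` selects elements with that name, `/` steps to direct children, `//` matches descendants at any depth, `..` selects the parent, `@` selects an attribute, and `[...]` filters with a predicate. The markup below is hypothetical and only serves to exercise these rules.

```python
from lxml import etree

# Hypothetical markup, used only to exercise the rules.
doc = etree.HTML("""
<html><body>
  <ul id="teams">
    <li class="east"><a href="/celtics">Celtics</a></li>
    <li class="west"><a href="/lakers">Lakers</a></li>
  </ul>
</body></html>
""")

all_li = doc.xpath('//li')                        # // : descendants at any depth
ul_abs = doc.xpath('/html/body/ul')               # /  : step through direct children from the root
hrefs = doc.xpath('//a/@href')                    # @  : attribute values
west = doc.xpath('//li[@class="west"]/a/text()')  # [] : predicate filtering on an attribute
parents = doc.xpath('//a/..')                     # .. : parent of each matched node
first = doc.xpath('//li[1]/a/text()')             # [1]: first <li> among its siblings (1-based)

print(len(all_li), ul_abs[0].get('id'), hrefs, west, [p.tag for p in parents], first)
```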
3, Using the lxml library
3.1, Getting the HTML
from lxml import etree

def getHtml():
    text = """
    <tr>
    <td width="46">48</td>
    <td width="142" class="left"><a href="https://nba.hupu.com/players/terryrozier-150005.html">特里-罗齐尔</a></td>
    <td width="50"><a href="https://nba.hupu.com/teams/hornets">黄蜂</a></td>
    <td class="bg_b">18.00</td>
    <td>6.30-14.90</td>
    <td>42.3%</td>
    <td>2.70-6.70</td>
    <td>40.7%</td>
    <td>2.60-3.00</td>
    <td>87.4%</td>
    <td width="50">63</td>
    <td width="70">34.30</td>
    </tr>
    """
    # Parse the text into an lxml.etree._Element object
    html = etree.HTML(text=text)
    print(type(html))
    # Serialize the _Element to UTF-8 bytes, then decode the bytes back into a str
    html = etree.tostring(html, encoding='utf-8').decode('utf-8')
    return html
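Calling getHtml() prints <class 'lxml.etree._Element'> and returns the parsed document serialized back to a string, including the <html> and <body> wrapper tags that the HTML parser adds automatically.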
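To see what the serialized result looks like on a smaller input, here is a minimal sketch. The `fragment` string is an assumption made for illustration: an unclosed `<p>` with no `<html>` or `<body>`. It shows the parser repairing the markup, and why `encoding='utf-8'` is passed to `etree.tostring` when the text contains Chinese characters.

```python
from lxml import etree

# Illustrative fragment: unclosed <p>, no <html> or <body>.
fragment = '<p>黄蜂'
root = etree.HTML(fragment)

# Default serialization is ASCII bytes, so non-ASCII text becomes numeric character references.
ascii_bytes = etree.tostring(root)

# With encoding='utf-8' the characters are kept; decode() turns the bytes into a str.
utf8_text = etree.tostring(root, encoding='utf-8').decode('utf-8')

print(ascii_bytes)
print(utf8_text)
```

The second print shows the fragment wrapped in `<html><body>` with the `<p>` tag closed and the Chinese text readable.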
3.2, Getting all nodes
from lxml import etree

def getHtml():
    text = """
    <tr>
    <td width="46">48</td>
    <td width="142" class="left"><a href="https://nba.hupu.com/players/terryrozier-150005.html">特里-罗齐尔</a></td>
    <td width="50"><a href="https://nba.hupu.com/teams/hornets">黄蜂</a></td>
    <td class="bg_b">18.00</td>
    <td>6.30-14.90</td>
    <td>42.3%</td>
    <td>2.70-6.70</td>
    <td>40.7%</td>
    <td>2.60-3.00</td>
    <td>87.4%</td>
    <td width="50">63</td>
    <td width="70">34.30</td>
    </tr>
    """
    # Parse the text into an lxml.etree._Element object
    html = etree.HTML(text=text)
    print(type(html))
    return html

if __name__ == '__main__':
    html = getHtml()
    # '//*' matches every element in the document
    elements = html.xpath('//*')
    for e in elements:
        print(e)
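Printing an element directly only shows something like `<Element td at 0x...>`, so it can help to look at each element's `tag` instead. A minimal sketch, using a small made-up document rather than the table row above:

```python
from lxml import etree

# Made-up document for illustration.
root = etree.HTML('<div><p>one</p><p>two <b>bold</b></p></div>')

# '//*' returns every element in document order, including the
# <html> and <body> wrappers added by the parser.
all_tags = [e.tag for e in root.xpath('//*')]

# Replacing * with a tag name restricts the match to that element type.
only_p = root.xpath('//p')

print(all_tags)      # ['html', 'body', 'div', 'p', 'p', 'b']
print(len(only_p))   # 2
```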
3.3, Selecting nodes by attribute
html = getHtml()  # getHtml() as defined in 3.2
# Select every <td> whose class attribute is exactly "left"
elements = html.xpath('//td[@class="left"]')
for e in elements:
    print(e)
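A few variations on attribute matching, sketched under the assumption of a small made-up list (the class names and links are hypothetical). Note that `[@class="..."]` is an exact string comparison, so an element with several classes needs `contains()`.

```python
from lxml import etree

# Hypothetical markup for illustration.
root = etree.HTML('''
<ul>
  <li class="player left"><a href="/players/1">A</a></li>
  <li class="team"><a href="/teams/1">B</a></li>
  <li><a>C</a></li>
</ul>
''')

exact = root.xpath('//li[@class="team"]')               # attribute equals the whole string
no_match = root.xpath('//li[@class="left"]')            # empty: the value is "player left"
partial = root.xpath('//li[contains(@class, "left")]')  # substring match on the attribute
has_href = root.xpath('//a[@href]')                     # attribute merely present
hrefs = root.xpath('//a/@href')                         # the attribute values themselves

print(len(exact), len(no_match), len(partial), len(has_href), hrefs)
```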
3.4, Getting text
from lxml import etree

def getHtml():
    text = """
    <tr>
    <td width="46">48</td>
    <td width="142" class="left"><a href="https://nba.hupu.com/players/terryrozier-150005.html">特里-罗齐尔</a></td>
    <td width="50"><a href="https://nba.hupu.com/teams/hornets">黄蜂</a></td>
    <td class="bg_b">18.00</td>
    <td>6.30-14.90</td>
    <td>42.3%</td>
    <td>2.70-6.70</td>
    <td>40.7%</td>
    <td>2.60-3.00</td>
    <td>87.4%</td>
    <td width="50">63</td>
    <td width="70">34.30</td>
    </tr>
    """
    # Parse the text into an lxml.etree._Element object
    html = etree.HTML(text=text)
    return html

if __name__ == '__main__':
    html = getHtml()
    # text() returns the text directly inside the matched <a> element
    elements = html.xpath('//td[@class="left"]/a/text()')
    for e in elements:
        print(e)
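One detail worth keeping in mind is that `/text()` only returns text that sits directly inside the matched element. A minimal sketch of the difference, using a made-up paragraph with a nested tag:

```python
from lxml import etree

# Made-up paragraph with a nested <b> element.
root = etree.HTML('<p>Hello <b>big</b> world</p>')

direct = root.xpath('//p/text()')    # text nodes that are direct children of <p>
nested = root.xpath('//p//text()')   # text nodes at any depth below <p>
joined = root.xpath('string(//p)')   # all text of the first <p>, concatenated into one string

print(direct)   # ['Hello ', ' world']
print(nested)   # ['Hello ', 'big', ' world']
print(joined)   # Hello big world
```

In the table row above, the player name sits inside an `<a>` within the `<td>`, which is why the query steps into `a` before calling `text()`.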
3.5, Other methods
Ways of selecting other kinds of nodes can be found in the documentation; they all work in much the same way.
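As one example of such other methods, XPath axes and positional functions select nodes relative to a match. This is a minimal sketch on a simplified, made-up table row, not the full row used above.

```python
from lxml import etree

# Simplified, made-up row for illustration.
root = etree.HTML(
    '<table><tr><td>48</td><td class="left">Name</td><td>Team</td></tr></table>'
)

# following-sibling:: / preceding-sibling:: select siblings after / before the match.
next_cells = root.xpath('//td[@class="left"]/following-sibling::td/text()')
prev_cells = root.xpath('//td[@class="left"]/preceding-sibling::td/text()')

# parent::* selects the parent element, whatever its tag.
parent_tags = [e.tag for e in root.xpath('//td[@class="left"]/parent::*')]

# last() and position() select by position among siblings.
last_cell = root.xpath('//td[last()]/text()')
first_two = root.xpath('//td[position() < 3]/text()')

print(next_cells, prev_cells, parent_tags, last_cell, first_two)
```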
Please point out any mistakes! If you found this useful, a like would be much appreciated. Feel free to discuss in the comments or by private message.