Python crawlers: XPath and the lxml library

1. Introduction to lxml and XPath

  A crawler's main job is to process HTML documents. Parsing them with regular expressions is slow to develop and also slow to run, so a dedicated parsing library is the better choice.
  XPath (XML Path Language) is a language for locating information in XML documents, and it can also be used to search HTML. lxml is a high-performance Python library for parsing HTML and XML, and it supports XPath syntax.

2. XPath rules

[Figure: table of common XPath rules; image not available]
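
Since the rules table is not available here, the following is a minimal sketch of the XPath constructs used most often with lxml. The markup is a hypothetical, shortened sample (wrapped in a table element so the fragment is valid HTML), and the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical sample markup, not the article's full table row.
doc = etree.HTML("""
<table>
  <tr>
    <td class="left"><a href="https://nba.hupu.com/teams/hornets">Hornets</a></td>
    <td class="bg_b">18.00</td>
  </tr>
</table>
""")

# //name : select matching nodes anywhere in the document
cells = doc.xpath('//td')
# /      : select direct children of the node on the left
links = doc.xpath('//td/a')
# @attr  : select an attribute
hrefs = doc.xpath('//td/a/@href')
# [...]  : filter with a predicate
scores = doc.xpath('//td[@class="bg_b"]/text()')
# ..     : step up to the parent node
parents = doc.xpath('//a/..')
# .      : the current node, useful for queries relative to an element
texts = links[0].xpath('./text()')

print(len(cells), hrefs, scores, parents[0].tag, texts)
```

Every call to xpath() here returns a list, even when only one node matches.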

3. Using the lxml library

3.1 Get HTML

from lxml import etree

def getHtml():
    text = """
        <tr>
            <td width="46">48</td>
            <td width="142" class="left"><a href="https://nba.hupu.com/players/terryrozier-150005.html">特里-罗齐尔</a></td>
            <td width="50"><a href="https://nba.hupu.com/teams/hornets">黄蜂</a></td>
            <td class="bg_b">18.00</td>
            <td>6.30-14.90</td>
            <td>42.3%</td>
            <td>2.70-6.70</td>
            <td>40.7%</td>
            <td>2.60-3.00</td>
            <td>87.4%</td>
            <td width="50">63</td>
            <td width="70">34.30</td>
        </tr>
        """
    html = etree.HTML(text=text)
    print(type(html))   # etree.HTML parses the text into an lxml.etree._Element object
    html = etree.tostring(html, encoding='utf-8').decode('utf-8')     # serialize the _Element to bytes, then decode them as UTF-8
    return html

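Calling getHtml() prints the type of the parsed object, <class 'lxml.etree._Element'>, and returns the serialized HTML as a string.
[Figure: screenshot of the output; image not available]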

3.2 Getting all nodes

from lxml import etree

def getHtml():
    text = """
        <tr>
            <td width="46">48</td>
            <td width="142" class="left"><a href="https://nba.hupu.com/players/terryrozier-150005.html">特里-罗齐尔</a></td>
            <td width="50"><a href="https://nba.hupu.com/teams/hornets">黄蜂</a></td>
            <td class="bg_b">18.00</td>
            <td>6.30-14.90</td>
            <td>42.3%</td>
            <td>2.70-6.70</td>
            <td>40.7%</td>
            <td>2.60-3.00</td>
            <td>87.4%</td>
            <td width="50">63</td>
            <td width="70">34.30</td>
        </tr>
        """
    html = etree.HTML(text=text)
    print(type(html))   # etree.HTML parses the text into an lxml.etree._Element object
    return html

if __name__ == '__main__':
    html = getHtml()
    elements = html.xpath('//*')
    for e in elements:
        print(e)
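
Printing an element directly shows only its object representation, so listing the tag names can make the result of '//*' easier to read. This is a minimal sketch using a hypothetical, shortened row wrapped in a table element; the names row, tags and td_count are illustrative.

```python
from lxml import etree

# Hypothetical shortened markup, not the full row used above.
row = etree.HTML(
    '<table><tr>'
    '<td width="46">48</td>'
    '<td class="left"><a href="#">Terry Rozier</a></td>'
    '</tr></table>'
)

# '//*' matches every element, including the html and body wrappers
# that the HTML parser adds around a fragment.
tags = [e.tag for e in row.xpath('//*')]
print(tags)

# Naming the element instead of using '*' skips the wrappers.
td_count = len(row.xpath('//td'))
print(td_count)
```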

3.3 Selecting nodes by attribute

# Reuses getHtml() from section 3.2
if __name__ == '__main__':
    html = getHtml()
    elements = html.xpath('//td[@class="left"]')
    for e in elements:
        print(e)
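
Attribute predicates can also test only for the presence of an attribute, and an attribute's value can be returned instead of the element. The sketch below assumes a hypothetical, shortened row wrapped in a table element; the variable names are illustrative.

```python
from lxml import etree

# Hypothetical shortened markup based on the row above.
row = etree.HTML(
    '<table><tr>'
    '<td width="142" class="left">'
    '<a href="https://nba.hupu.com/teams/hornets">Hornets</a></td>'
    '<td class="bg_b">18.00</td>'
    '<td width="50">63</td>'
    '</tr></table>'
)

# Match on an exact attribute value.
left_cells = row.xpath('//td[@class="left"]')
# Match any td that has a width attribute, whatever its value.
with_width = row.xpath('//td[@width]')
# Return the attribute values themselves rather than the elements.
widths = row.xpath('//td/@width')
# Element.get() reads one attribute from an element already in hand.
href = left_cells[0].xpath('./a')[0].get('href')

print(len(left_cells), len(with_width), widths, href)
```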

3.4 Getting text

from lxml import etree

def getHtml():
    text = """
        <tr>
            <td width="46">48</td>
            <td width="142" class="left"><a href="https://nba.hupu.com/players/terryrozier-150005.html">特里-罗齐尔</a></td>
            <td width="50"><a href="https://nba.hupu.com/teams/hornets">黄蜂</a></td>
            <td class="bg_b">18.00</td>
            <td>6.30-14.90</td>
            <td>42.3%</td>
            <td>2.70-6.70</td>
            <td>40.7%</td>
            <td>2.60-3.00</td>
            <td>87.4%</td>
            <td width="50">63</td>
            <td width="70">34.30</td>
        </tr>
        """
    html = etree.HTML(text=text)
    # print(type(html))   # etree.HTML parses the text into an lxml.etree._Element object
    return html

if __name__ == '__main__':
    html = getHtml()
    elements = html.xpath('//td[@class="left"]/a/text()')
    for e in elements:
        print(e)
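
The query above goes through the a element because text() only returns text that sits directly inside the matched node. The following sketch, on a hypothetical shortened row wrapped in a table element, contrasts /text(), //text() and string(); the variable names are illustrative.

```python
from lxml import etree

# Hypothetical shortened markup based on the row above.
row = etree.HTML(
    '<table><tr>'
    '<td class="left"><a href="#">Terry Rozier</a></td>'
    '<td class="bg_b">18.00</td>'
    '</tr></table>'
)

# /text() returns only text directly inside the matched element.
# This td holds an <a> element and no text of its own, so the list is empty.
direct = row.xpath('//td[@class="left"]/text()')
# //text() also descends into child elements.
nested = row.xpath('//td[@class="left"]//text()')
# string() concatenates all descendant text into one string.
joined = row.xpath('string(//tr)')

print(direct, nested, joined)
```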

3.5 Other methods

  Methods for selecting other nodes can be found in the documentation; they work in much the same way.
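
As an illustration of a few such methods, the sketch below uses positional predicates, contains() and two XPath axes. The markup is a hypothetical shortened row wrapped in a table element, and the choice of examples is an assumption about what is useful here rather than a list taken from the original article.

```python
from lxml import etree

# Hypothetical shortened markup based on the row above.
row = etree.HTML(
    '<table><tr>'
    '<td>48</td>'
    '<td class="left name"><a href="#">Terry Rozier</a></td>'
    '<td>18.00</td>'
    '</tr></table>'
)

# Positions are 1-based: the first and last td in the row.
first = row.xpath('//tr/td[1]/text()')
last = row.xpath('//tr/td[last()]/text()')
# contains() matches part of an attribute value, which helps when
# a class attribute lists several names.
named = row.xpath('//td[contains(@class, "name")]/a/text()')
# following-sibling:: selects the cells after the name cell.
after = row.xpath('//td[contains(@class, "name")]/following-sibling::td/text()')
# ancestor:: walks upward from the link to the document root.
ancestors = [e.tag for e in row.xpath('//a/ancestor::*')]

print(first, last, named, after, ancestors)
```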

Please point out any mistakes. If this article helped, a like is appreciated. Comments and private messages are welcome.


Origin blog.csdn.net/Orange_minger/article/details/104829484