Web crawler data analysis (XPath)

Introduction

Some people say: "I'm not good with regular expressions, and using them to process HTML documents is exhausting. Is there another way?"

There is: XPath. We can first parse the String data obtained from the network into an HTML/XML document, and then use XPath to find HTML/XML nodes or elements.

What is XML

Everyone knows HTML, so what is XML?

  • XML stands for eXtensible Markup Language
  • XML is a markup language, very similar to HTML
  • XML is designed to transmit data, not to display data
  • XML tags are not predefined; you define your own
  • XML is designed to be self-descriptive
  • XML is a W3C recommended standard

Illustration:

[Image: XML diagram 1 (original link broken)]

[Image: XML diagram 2 (original link broken)]

XML node relationship

Parent, child, sibling, ancestor, descendant
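These relationships map directly onto lxml's navigation API (the library introduced later in this article); a minimal sketch:

```python
from lxml import etree

# Navigating parent/child/sibling/ancestor relationships with lxml
root = etree.XML('<bookstore><book><title>A</title><price>10</price></book></bookstore>')
book = root[0]           # first child of the root
title = book[0]          # first child of book

print(title.getparent().tag)                    # parent of title -> book
print([child.tag for child in book])            # children of book -> title, price
print(next(title.itersiblings()).tag)           # following sibling -> price
print([a.tag for a in title.iterancestors()])   # ancestors -> book, bookstore
```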

XPath definition

XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in XML documents.

XPath expression

  • The most commonly used path expressions:

    | Expression | Description |
    |---|---|
    | `/` | Select from the root node. |
    | `//` | Select matching nodes anywhere in the document, regardless of their position. |
    | `.` | Select the current node. |
    | `..` | Select the parent of the current node. |
    | `@` | Select attributes. |
  • Common path expressions and their results:

    | Expression | Result |
    |---|---|
    | `/bookstore` | Selects the root element bookstore. Note: if a path starts with a forward slash (/), it always represents an absolute path to an element. |
    | `bookstore/book` | Selects all book elements that are children of bookstore. |
    | `//book` | Selects all book elements, regardless of their position in the document. |
    | `bookstore//book` | Selects all book elements that are descendants of bookstore, wherever they appear under it. |
    | `//@lang` | Selects all attributes named lang. |
  • Predicates find a specific node, or a node containing a specified value, and are written in square brackets:

    | Path expression | Result |
    |---|---|
    | `/bookstore/book[1]` | Selects the first book child of bookstore. |
    | `/bookstore/book[last()]` | Selects the last book child of bookstore. |
    | `/bookstore/book[last()-1]` | Selects the next-to-last book child of bookstore. |
    | `/bookstore/book[position()<3]` | Selects the first two book children of bookstore. |
    | `//title[@lang]` | Selects all title elements that have an attribute named lang. |
    | `//title[@lang='eng']` | Selects all title elements whose lang attribute has the value eng. |
    | `/bookstore/book[price>35.00]` | Selects all book children of bookstore whose price element value is greater than 35.00. |
    | `/bookstore/book[price>35.00]/title` | Selects the title elements of those book children of bookstore whose price value is greater than 35.00. |
  • Selecting unknown nodes:

    | Wildcard | Description |
    |---|---|
    | `*` | Matches any element node. |
    | `@*` | Matches any attribute node. |

    | Path expression | Result |
    |---|---|
    | `/bookstore/*` | Selects all child elements of bookstore. |
    | `//*` | Selects all elements in the document. |
    | `//title[@*]` | Selects all title elements that have at least one attribute. |
  • Selecting several paths: with the `|` operator, several paths can be combined in one expression:

    | Path expression | Result |
    |---|---|
    | `//book/title\|//book/price` | Selects all title and price elements of all book elements. |
    | `//title\|//price` | Selects all title and price elements in the document. |
    | `/bookstore/book/title\|//price` | Selects all title elements of book elements under bookstore, plus all price elements in the document. |
  • XPath operators:

    | Operator | Description | Example | Return value |
    |---|---|---|---|
    | `\|` | Union of two node-sets | `//book\|//cd` | A node-set containing all book and cd elements |
    | `+` | Addition | `6 + 4` | 10 |
    | `-` | Subtraction | `6 - 4` | 2 |
    | `*` | Multiplication | `6 * 4` | 24 |
    | `div` | Division | `8 div 4` | 2 |
    | `=` | Equal | `price=9.80` | true if price is 9.80; false if price is 9.90 |
    | `!=` | Not equal | `price!=9.80` | true if price is 9.90; false if price is 9.80 |
    | `<` | Less than | `price<9.80` | true if price is 9.00; false if price is 9.90 |
    | `<=` | Less than or equal | `price<=9.80` | true if price is 9.00; false if price is 9.90 |
    | `>` | Greater than | `price>9.80` | true if price is 9.90; false if price is 9.80 |
    | `>=` | Greater than or equal | `price>=9.80` | true if price is 9.90; false if price is 9.70 |
    | `or` | Logical or | `price=9.80 or price=9.70` | true if price is 9.80 or 9.70 |
    | `and` | Logical and | `price>9.00 and price<9.90` | true if price is 9.80; false if price is 8.50 |
    | `mod` | Modulus (division remainder) | `5 mod 2` | 1 |
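The expressions in the tables above can be tried directly with lxml (introduced in the next section); a small sketch against a bookstore document:

```python
from lxml import etree

# A small bookstore document matching the examples in the tables above
doc = etree.XML('''<bookstore>
    <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
    <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>''')

print(doc.xpath('//@lang'))                                    # all lang attributes
print(doc.xpath('/bookstore/book[1]/title/text()'))            # first book's title
print(doc.xpath('/bookstore/book[price>35.00]/title/text()'))  # value predicate
print(len(doc.xpath('//book/title|//book/price')))             # union of two paths -> 4
```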

lxml library

Definition
  • lxml is an HTML/XML parser; its main purpose is parsing and extracting HTML/XML data.

  • Like the re module, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and we can use XPath syntax to quickly locate specific elements and node information.

lxml data conversion
  • Source code

    from lxml import etree
    
    text = '''<div>
        <ul>
             <li class="item-0"><a href="link1.html">first item</a></li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-inactive"><a href="link3.html">third item</a></li>
             <li class="item-1"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a>
         </ul>
     </div>
    '''
    # Use etree.HTML to parse the string into an HTML document
    html = etree.HTML(text)
    result = etree.tostring(html)
    print(result.decode('utf-8'))
    
  • Data obtained (note that etree.HTML wraps the fragment in the missing html/body elements and auto-closes the unclosed `<li>` tag)

    <html><body><div>
        <ul>
             <li class="item-0"><a href="link1.html">first item</a></li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-inactive"><a href="link3.html">third item</a></li>
             <li class="item-1"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a>
         </li></ul>
     </div>
    </body></html>
lxml read file
  • Source code

    from lxml import etree
    
    # Read the external file hello.html
    html = etree.parse('./hello.html')
    print(html)  # prints an lxml.etree._ElementTree object
    
  • File data

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>测试页面</title>
</head>
<body>
    <ol>
        <li class="haha">醉卧沙场君莫笑,古来征战几人回</li>
        <li class="heihei">两岸猿声啼不住,轻舟已过万重山</li>
        <li id="hehe" class="nene">一骑红尘妃子笑,无人知是荔枝来</li>
        <li class="xixi">停车坐爱枫林晚,霜叶红于二月花</li>
        <li class="lala">商女不知亡国恨,隔江犹唱后庭花</li>
    </ol>
    <div id="pp">
        <div>
            <a href="http://www.baidu.com">李白</a>
        </div>
        <ol>
            <li class="huanghe">君不见黄河之水天上来,奔流到海不复回</li>
            <li id="tata" class="hehe">李白乘舟将欲行,忽闻岸上踏歌声</li>
            <li class="tanshui">桃花潭水深千尺,不及汪伦送我情</li>
        </ol>
        <div class="hh">
            <a href="http://mi.com">雷军</a>

        </div>
        <div class="jj">
            <b href="http://mi.com"><c>3</c></b>
            <b href="http://mi.com"><c>5</c></b>
            <b href="http://mi.com"><c>6</c></b>
            <b href="http://mi.com"><c>8</c></b>
            <b href="http://mi.com"><c>9</c></b>
            <b href="http://mi.com"><c>3</c></b>


        </div>
        <ol>
            <li class="dudu">are you ok</li>
            <li class="meme">会飞的猪</li>
        </ol>
    </div>
</body>
</html>
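The hello.html above happens to be well-formed, so etree.parse accepts it with its default XML parser; for messier real-world HTML, passing an etree.HTMLParser() is safer. A self-contained sketch (it writes its own small broken file rather than assuming hello.html exists):

```python
from lxml import etree

# Real-world HTML is often not well-formed XML; etree.HTMLParser() tolerates
# missing close tags, while the default XML parser would raise a syntax error.
with open('broken.html', 'w', encoding='utf-8') as f:
    f.write('<html lang="en"><body><li class="haha">item')  # unclosed tags

tree = etree.parse('broken.html', etree.HTMLParser())
print(tree.getroot().tag)          # html
print(tree.xpath('//li/@class'))   # ['haha']
```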
XPath usage in detail
  • The page being parsed is the hello.html file shown above.

  • Get all `<li>` tags

    from lxml import etree
    
    html = etree.parse('hello.html')
    li_list = html.xpath('//li')
    
    print(li_list)  # print the list of <li> elements
    print(len(li_list))
    
  • Next, get all class attributes of the `<li>` tags

    from lxml import etree
    
    html = etree.parse('hello.html')
    result = html.xpath('//li/@class')
    print(result)
    
  • Next, get the `<a>` tags under `<li>` whose href is link1.html

    from lxml import etree
    html = etree.parse('./hello.html')
    result = html.xpath('//li/a[@href="link1.html"]')
    print(result)
    
  • Get all `<span>` tags under the `<li>` tags

    from lxml import etree
    data = '''
    <div>
        <ul>
             <li class="item-0">你好,老段<a href="link1.html">first item</a></li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>'''
    html = etree.HTML(data)
    result = html.xpath('//li//span')
    print(result[0].text)
    
  • Get all class values of the `<a>` tags under the `<li>` tags

    # Get all class attributes of <a> tags under <li> tags
    from lxml import etree
    html = etree.parse('hello.html')
    result = html.xpath('//li/a//@class')
    
    print(result)
    
  • Get the href of the `<a>` in the last `<li>`

    from lxml import etree
    
    xml = etree.parse('./hello.html')
    
    result = xml.xpath('//li[last()]/a/@href')
    
    print(result)
    
  • Get the content of the next-to-last `<li>` element

    from lxml import etree
    
    html = etree.parse('hello.html')
    result = html.xpath('//li[last()-1]/a')
    print(result[0].text)
    print(result)
    
  • Get the tag name of the element whose class value is bold

    # Get the tag name of the element with class "bold"
    from lxml import etree
    html = etree.parse('hello.html')
    result = html.xpath('//*[@class="bold"]')
    # The tag attribute gives the tag name
    print(result[0].tag)
    print(result[0].text)
    
  • Using conditions

    • Get text content: //li[@id="hehe"]/text()
    • Attribute contains a value: //li[contains(@class,"h")]
    • Attribute equals a value: //div[@id="pp"]/ol[last()]/li/@*
    • Combining conditions: //li[@id="hehe"][@class="nene"]/text()
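The condition forms above, tried on a small inline document (a trimmed stand-in for hello.html):

```python
from lxml import etree

# A trimmed stand-in for the hello.html document used above
html = etree.HTML('''<div id="pp"><ol>
    <li id="hehe" class="nene">line one</li>
    <li class="haha">line two</li>
</ol></div>''')

print(html.xpath('//li[@id="hehe"]/text()'))                 # text by id
print(html.xpath('//li[contains(@class,"h")]/text()'))       # class contains "h"
print(html.xpath('//div[@id="pp"]/ol[last()]/li/@*'))        # all attributes of matching <li>
print(html.xpath('//li[@id="hehe"][@class="nene"]/text()'))  # combined conditions
```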

XPath example

import requests
from lxml import etree

url1 = 'https://www.neihanba.com/dz/'
url = 'https://www.neihanba.com/dz/list_%d.html'

if __name__ == '__main__':
    fp = open('./duanzi.csv', mode='a', encoding='utf-8')
    for i in range(1, 101):
        # The first page has no page number in its URL
        if i == 1:
            url_duanzi = url1
        else:
            url_duanzi = url % i
        response = requests.get(url_duanzi)
        response.encoding = 'gbk'
        content = response.text
        html = etree.HTML(content)
        result = html.xpath('//ul[@class="piclist longList"]/li')
        for li in result:
            try:
                title = li.xpath('.//h4/a/b/text()')[0]
                content = li.xpath('.//div[@class="f18 mb20"]/text()')[0].strip()
                info = ''.join(li.xpath('.//div[@class="ft"]/span//text()')[1:])
                fp.write('%s\t%s\t%s\n' % (title, content, info))
            except Exception:
                # On failure, skip this item; log it and re-crawl separately later
                pass
        print('Page %d saved successfully!' % i)
    fp.close()
    # !!! The requests.get call above still lacks exception handling
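As the closing comment notes, the requests.get call itself is unprotected: one timeout or connection error aborts the whole 100-page run. A hedged sketch of the missing wrapper (fetch_page is an illustrative helper, not part of the original code):

```python
import requests

def fetch_page(url, encoding='gbk', timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # turn HTTP error status codes into exceptions
        response.encoding = encoding
        return response.text
    except requests.RequestException:
        # Log the failed URL here so it can be re-crawled separately later
        return None
```

In the loop above, one would then call `content = fetch_page(url_duanzi)` and skip the page when it returns None.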

Origin: blog.csdn.net/qq_42546127/article/details/106398987