Web crawlers: XPath (data extraction)

XPath is one of the most commonly used methods for extracting data.

XPath is a language for finding information in an XML document. It navigates the document through its elements and attributes.

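XPath has seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. An XML document is treated as a tree of nodes. The root of the tree is called the document node or root node.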
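As a rough illustration, the sketch below selects five of these node types with lxml (the library used later in this post). The XML snippet and variable names are made up for the example.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = b'<root><!-- a comment --><?target data?><item id="1">hello</item></root>'
root = etree.fromstring(xml)

elements = root.xpath("//item")                          # element nodes
attributes = root.xpath("//item/@id")                    # attribute values, as strings
texts = root.xpath("//item/text()")                      # text nodes, as strings
comments = root.xpath("//comment()")                     # comment nodes
instructions = root.xpath("//processing-instruction()")  # processing instructions

print(elements[0].tag, attributes, texts)
print(comments[0].text, instructions[0].target)
```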

Selecting nodes

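XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, which is made up of steps.

The most useful path expressions are listed below: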

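Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes anywhere in the document (or, inside a path, anywhere below the preceding step), regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.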
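The expressions in the table can be tried out with a small sketch like the following. The bookstore document is a hypothetical sample, and the variable names are chosen only for this example.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = b"""<bookstore>
  <book lang="en"><title>Alpha</title></book>
  <book lang="zh"><title>Beta</title></book>
</bookstore>"""

root = etree.fromstring(xml)  # root is the <bookstore> element

# "/" starts at the document root.
titles_abs = [t.text for t in root.xpath("/bookstore/book/title")]

# "//" matches at any depth.
titles_any = [t.text for t in root.xpath("//title")]

# A bare node name is relative to the current node (here <bookstore>).
books = root.xpath("book")

# "." is the current node and ".." is its parent.
first_title = root.xpath("//title")[0]
current = first_title.xpath(".")[0]
parent = first_title.xpath("..")[0]

# "@" selects attributes.
langs = root.xpath("//book/@lang")

print(titles_abs, titles_any, len(books), current.tag, parent.tag, langs)
```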

Steps:

1. Import

from lxml import etree
from lxml.etree import HTMLParser  # used by the parser= argument below

2. Build the document tree

html_obj = etree.HTML(html, parser=HTMLParser(encoding='utf-8'))

def HTML(text, parser=None, base_url=None): # real signature unknown; restored from __doc__
    """
    HTML(text, parser=None, base_url=None)
    
        Parses an HTML document from a string constant.  Returns the root
        node (or the result returned by a parser target).  This function
        can be used to embed "HTML literals" in Python code.
    
        To override the parser with a different ``HTMLParser`` you can pass it to
        the ``parser`` keyword argument.
    
        The ``base_url`` keyword argument allows to set the original base URL of
        the document to support relative Paths when looking up external entities
        (DTD, XInclude, ...).
    """
    pass

It can also be written like this:

html_obj = etree.fromstring(html, parser=HTMLParser(encoding='utf-8'))

def fromstring(text, parser=None, base_url=None): # real signature unknown; restored from __doc__
    """
    fromstring(text, parser=None, base_url=None)
    
        Parses an XML document or fragment from a string.  Returns the
        root node (or the result returned by a parser target).
    
        To override the default parser with a different parser you can pass it to
        the ``parser`` keyword argument.
    
        The ``base_url`` keyword argument allows to set the original base URL of
        the document to support relative Paths when looking up external entities
        (DTD, XInclude, ...).
    """
    pass

parser=HTMLParser(encoding='utf-8') passes in a custom parser. If the argument is omitted, a default parser is used.
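A minimal sketch of both ways to build the tree is shown here. It assumes the page is available as UTF-8 bytes; the sample markup stands in for a real response body, and the variable names are illustrative.

```python
from lxml import etree

# Hypothetical sample markup; in a crawler this would be the response body.
html = "<html><body><div class='l_post'>你好</div></body></html>".encode("utf-8")

# A custom parser that tells lxml how the bytes are encoded.
parser = etree.HTMLParser(encoding="utf-8")

# Both calls return the root <html> element when given an HTML parser.
root_a = etree.HTML(html, parser=parser)
root_b = etree.fromstring(html, parser=parser)

# With an already decoded str, the default parser can be used.
root_c = etree.HTML(html.decode("utf-8"))

for root in (root_a, root_b, root_c):
    print(root.tag, root.xpath("//div/text()"))
```

Passing the encoding explicitly is mainly useful when the input is bytes and the page does not declare its own charset.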

3. Extract data with XPath

For example:

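div_obj = html_obj.xpath('//div[@class="l_post"]')  # [@attr="value"] is a predicate that filters on an attribute

divs = html_obj.xpath('//div[contains(@class, "l_post")]')  # use contains() when the attribute holds several values

divs = html_obj.xpath('//div[@class="l_post"]/text()')  # text() returns the text directly inside the tag

divs = html_obj.xpath('//div[@class="l_post"]/a/@href')  # @name returns the value of an attribute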
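To see how these four queries differ, here is a small sketch run against a made-up page. The markup, the second class name `j_l_post`, and the link targets are assumptions for illustration only.

```python
from lxml import etree

# Hypothetical sample page with two posts; the second has two class values.
html = b"""
<html><body>
  <div class="l_post">first<a href="/p/1">one</a></div>
  <div class="l_post j_l_post">second<a href="/p/2">two</a></div>
</body></html>
"""
html_obj = etree.HTML(html)

# Exact match: the class attribute must equal "l_post" as a whole string.
exact = html_obj.xpath('//div[@class="l_post"]')

# Substring match: any class attribute containing "l_post".
partial = html_obj.xpath('//div[contains(@class, "l_post")]')

# Text directly inside the matched div (not the text of the nested <a>).
texts = html_obj.xpath('//div[@class="l_post"]/text()')

# Attribute values of the nested links.
hrefs = html_obj.xpath('//div[contains(@class, "l_post")]/a/@href')

print(len(exact), len(partial), texts, hrefs)
```

Note that contains() is a plain substring test, so it would also match a class such as "xl_post"; whether that matters depends on the page being scraped.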

Extras:

Converting the document tree back to a string:

html = etree.tostring(html_obj, encoding='utf-8').decode()

With encoding='utf-8', etree.tostring returns bytes, so the result has to be decoded to get a str.
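The difference between the bytes result and the decoded string can be checked with a short sketch. The sample markup is hypothetical; `encoding="unicode"` is an alternative that lxml accepts for returning a str directly.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_obj = etree.HTML(b"<html><body><p>hi</p></body></html>")

as_bytes = etree.tostring(html_obj, encoding="utf-8")  # bytes
as_text = as_bytes.decode("utf-8")                     # str, after decoding

# Alternative: ask lxml for a str directly, with no decode step.
as_str = etree.tostring(html_obj, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__, type(as_str).__name__)
print(as_text)
```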

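When inspecting a page's HTML in the browser, right-click the node you want to extract; the Copy submenu offers an XPath option, and the copied path expression can be pasted straight into your code. The copied path may not be entirely accurate.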
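One common reason a copied path fails is that the browser shows its own repaired DOM rather than the raw HTML. For example, browsers add a `<tbody>` element to tables, while lxml parses the source as written. The sketch below uses a made-up table, and the "copied" path is a hypothetical example of what a browser might produce.

```python
from lxml import etree

# Hypothetical raw HTML as a crawler would receive it: no <tbody> in the source.
html = b"<html><body><table><tr><td>cell</td></tr></table></body></html>"
html_obj = etree.HTML(html)

# A path as a browser might copy it, including the <tbody> it inserted.
copied = html_obj.xpath("/html/body/table/tbody/tr/td/text()")

# A shorter relative path that does not depend on the exact nesting.
robust = html_obj.xpath("//table//td/text()")

print(copied)  # empty list: there is no <tbody> in the parsed tree
print(robust)
```

Shorter paths anchored on a stable attribute or tag tend to survive layout changes better than long absolute paths.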
