Python crawler basics: the lxml data-parsing module (with an introduction to XPath)

Introduction:

I have recently been studying web crawlers in Python; these are my study notes on lxml, the data-parsing module.


An introduction to lxml and XPath:

lxml is a Python parsing library that supports both HTML and XML documents, can query them with XPath expressions, and parses very efficiently. XPath (XML Path Language) is a language for locating information in XML documents; although it was originally designed for querying XML, it works equally well on HTML documents.


Node relationships in an XML/HTML document:

Parent

Children

Siblings

Ancestors

Descendants


XPath syntax:

nodename    selects all child nodes of the named node

//          selects matching nodes anywhere in the document, regardless of position

/           selects from the root node

.           selects the current node

..          selects the parent of the current node

@           selects attributes
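A minimal sketch of these expressions in action (the tiny XML snippet below is made up purely for illustration):

```python
from lxml import etree

doc = etree.XML('<shop><book lang="en"><title>A</title></book>'
                '<book lang="de"><title>B</title></book></shop>')

print(doc.xpath('/shop/book/title/text()'))  # from the root node: ['A', 'B']
print(doc.xpath('//title/text()'))           # <title> anywhere in the document: ['A', 'B']
print(doc.xpath('//book/@lang'))             # attribute values via @: ['en', 'de']
print(doc.xpath('//title/../@lang'))         # .. climbs to the parent <book>: ['en', 'de']
```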


Parser comparison:

Parser          Speed      Difficulty

re              fastest    hard

BeautifulSoup   slow       very easy

lxml            fast       easy
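To make the contrast concrete, here is the same link extraction done with re and with lxml (a quick sketch, not a benchmark; the sample markup is made up):

```python
import re
from lxml import etree

html = '<p class="story"><a href="http://example.com/elsie">Elsie</a></p>'

# re: fastest, but the pattern must anticipate the exact shape of the markup
print(re.findall(r'href="([^"]+)"', html))      # ['http://example.com/elsie']

# lxml: almost as fast, and the query is declarative and readable
print(etree.HTML(html).xpath('//a/@href'))      # ['http://example.com/elsie']
```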


Study notes:


# -*- coding: utf-8 -*-


from lxml import etree


html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" id="link2">Lacie</a> and

<a href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>


<p class="story">...</p>

"""


selector = etree.HTML(html_doc)    # build an etree object from the HTML


links = selector.xpath('//p[@class="story"]/a/@href')  # extract all the links in the page

for link in links:

    print(link)
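Building on the same page, the <a> elements themselves can also be iterated to read text and attributes together; a self-contained sketch (the markup just repeats the example document above):

```python
from lxml import etree

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

selector = etree.HTML(html_doc)

# iterate over the <a> elements and read the text and an attribute per node
for a in selector.xpath('//p[@class="story"]/a'):
    print(a.text, a.get('href'))
```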


xml_test = """

<?xml version='1.0'?>

<?xml-stylesheet type="text/css" href="first.css"?>

<notebook>

    <user id="1" category='cb' class="dba python linux">

        <name>lizibin</name>

        <sex>m</sex>

        <address>sjz</address>

        <age>28</age>

        <concat>

            <email>[email protected]</email>

            <phone>135......</phone>

        </concat>

    </user>

    <user id="2" category='za'>

        <name>wsq</name>

        <sex>f</sex>

        <address>shanghai</address>

        <age>25</age>

        <concat>

            <email>[email protected]</email>

            <phone>135......</phone>

        </concat>

    </user>

    <user id="3" category='za'>

        <name>liqian</name>

        <sex>f</sex>

        <address>SH</address>

        <age>28</age>

        <concat>

            <email>[email protected]</email>

            <phone>135......</phone>

        </concat>

    </user>

    <user id="4" category='cb'>

        <name>qiangli</name>

        <sex>f</sex>

        <address>SH</address>

        <age>29</age>

        <concat>

            <email>[email protected]</email>

            <phone>135......</phone>

        </concat>

    </user>

    <user id="5" class="dba linux c java python test teacher">

        <name>buzhidao</name>

        <sex>f</sex>

        <address>SH</address>

        <age>999</age>

        <concat>

            <email>[email protected]</email>

            <phone>135......</phone>

        </concat>

    </user>

</notebook>

"""


#r = requests.get('http://xxx.com/abc.xml')   # an XML file on a remote server can be fetched as well

#etree.HTML(r.text.encode('utf-8'))

xml_code = etree.HTML(xml_test)     # build an etree object


# select all <name> child nodes (returns the Element objects themselves)

print(xml_code.xpath('//name'))


# select the text of all <name> child nodes

print(xml_code.xpath('//name/text()'))

print('')


# select everything with <notebook> as the root node

notebook = xml_code.xpath('//notebook')


# take the text of the first <name> node

print(notebook[0].xpath('.//name/text()')[0])


name_node = notebook[0].xpath('.//name')[0]

# take the <address> value at the same level as the first <name>
print(name_node.xpath('../address/text()'))


# select an attribute value (<address> has no lang attribute here, so this returns an empty list)
print(name_node.xpath('../address/@lang'))


# select the name value of the first user under notebook

print(xml_code.xpath('//notebook/user[1]/name/text()'))


# select the name value of the last user under notebook

print(xml_code.xpath('//notebook/user[last()]/name/text()'))


# select the name value of the second-to-last user under notebook

print(xml_code.xpath('//notebook/user[last()-1]/name/text()'))


# select the address values of the first two users under notebook

print(xml_code.xpath('//notebook/user[position()<3]/address/text()'))


# select the name of every user in category "cb"

print(xml_code.xpath('//notebook/user[@category="cb"]/name/text()'))


# select everyone younger than 30

print(xml_code.xpath('//notebook/user[age<30]/name/text()'))


# select the class attribute of every user whose class attribute contains "dba"

print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/@class'))

print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/name/text()'))
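Predicates can also be combined, and the | operator unions two node sets; a short self-contained sketch (the miniature document below just mirrors the structure of xml_test):

```python
from lxml import etree

mini = etree.XML("""
<notebook>
    <user id="1" category="cb"><name>lizibin</name><age>28</age></user>
    <user id="2" category="za"><name>wsq</name><age>25</age></user>
    <user id="3" category="za"><name>liqian</name><age>28</age></user>
</notebook>""")

# 'and' joins two conditions inside one predicate
print(mini.xpath('//user[@category="za" and age<28]/name/text()'))       # ['wsq']

# | unions two node sets; results come back in document order
print(mini.xpath('//user[1]/name/text() | //user[last()]/name/text()'))  # ['lizibin', 'liqian']
```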





Source: blog.51cto.com/20131104/2436258