Today I learned about the spiders part of Scrapy: the spider's name, the start_urls list that sets the crawl's starting point, and the syntax of XPath:
nodename | Selects all child elements of the current node that are named nodename.
/ | Selects from the root node.
// | Selects matching nodes below the current node, regardless of their position in the document.
. | Selects the current node.
.. | Selects the parent of the current node.
@ | Selects attributes.

bookstore | Selects all child elements of the current node that are named bookstore.
/bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash (/) always represents an absolute path to an element.
bookstore/book | Selects all book elements that are children of bookstore.
//book | Selects all book elements, regardless of their position in the document.
bookstore//book | Selects all book elements that are descendants of the bookstore element, at any depth below it.
//@lang | Selects all attributes named lang.
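
To try the path expressions above without running a crawler, here is a minimal sketch using Python's standard-library ElementTree. The sample XML is made up for illustration, and ElementTree only implements a subset of XPath: paths on an element must be relative (hence the leading "."), and attributes cannot be selected directly, so they are read from the matched elements instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="fr">Le Petit Prince</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# bookstore/book, seen from the root: direct children named book
child_books = root.findall("book")

# //book: on an Element the path must be relative, hence the leading "."
all_books = root.findall(".//book")

# //title[@lang]: title elements that have a lang attribute
titles_with_lang = root.findall(".//title[@lang]")

# //@lang cannot be selected directly here, so read it from each element
langs = [title.get("lang") for title in titles_with_lang]

# "..": step from each title up to its parent book
parents = root.findall(".//title/..")

print(len(child_books), len(all_books))
print(langs)
print([p.tag for p in parents])
```

In a Scrapy spider the same expressions would normally be passed to response.xpath(...), which is not limited to this subset.
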
/bookstore/book[1] | Selects the first book element that is a child of the bookstore element.
/bookstore/book[last()] | Selects the last book element that is a child of the bookstore element.
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element.
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element.
//title[@lang] | Selects all title elements that have an attribute named lang.
//title[@lang='eng'] | Selects all title elements that have a lang attribute with the value eng.
/bookstore/book[price>35.00] | Selects all book children of the bookstore element that have a price element with a value greater than 35.00.
/bookstore/book[price>35.00]/title | Selects all title elements of the book elements of the bookstore element whose price element has a value greater than 35.00.
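
The predicates above can be sketched with the standard-library ElementTree, again on a made-up sample document. ElementTree supports [1], [last()], [last()-1] and [@lang='eng'], but not position()<3 or numeric comparisons such as price>35.00, so those two are approximated in plain Python here; this is a stand-in, not how a full XPath engine evaluates them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration.
XML = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>45.00</price></book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element


def title_of(book):
    """Return the text of a book's title child."""
    return book.findtext("title")


# /bookstore/book[1], [last()] and [last()-1], relative to the root
first = root.findall("book[1]")
last = root.findall("book[last()]")
second_to_last = root.findall("book[last()-1]")

# //title[@lang='eng']
english_titles = root.findall(".//title[@lang='eng']")

# position()<3 is not supported by ElementTree: slice the list instead
first_two = root.findall("book")[:2]

# price>35.00 is not supported either: compare the parsed price in Python
expensive = [b for b in root.findall("book")
             if float(b.findtext("price")) > 35.00]
expensive_titles = [title_of(b) for b in expensive]

print([title_of(b) for b in first + last + second_to_last])
print([t.text for t in english_titles])
print(expensive_titles)
```
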
* | Matches any element node.
@* | Matches any attribute node.
node() | Matches any node of any kind.

/bookstore/* | Selects all child elements of the bookstore element.
//* | Selects all elements in the document.
//title[@*] | Selects all title elements that have at least one attribute.
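
A small sketch of the wildcard examples, assuming the standard-library ElementTree and a made-up sample document. ElementTree understands * but not @*, so the attribute wildcard is approximated with each element's attrib dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# /bookstore/*: every child element of the root, whatever its name
children = root.findall("*")

# //*: every element in the document; iter() includes the root itself
all_elements = list(root.iter())

# @*: all attributes of one element, available as a dict
first_title_attrs = root.find(".//title").attrib

# //title[@*]: titles with at least one attribute, filtered in Python
titles_with_attrs = [t for t in root.iter("title") if t.attrib]

print([c.tag for c in children])
print(len(all_elements))
print([t.text for t in titles_with_attrs])
```
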
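The | operator combines several paths into one selection (in the rows below, the last | separates the expression from its description):
//book/title | //book/price | Selects all title and price elements of all book elements.
//title | //price | Selects all title and price elements in the document.
/bookstore/book/title | //price | Selects all title elements of the book elements of the bookstore element, and all price elements in the document.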
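
The standard-library ElementTree does not implement the | operator, so the following is only a rough stand-in: it walks the tree in document order and keeps the tags of interest, which gives the same elements as //title | //price on this made-up sample. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)
wanted = {"title", "price"}

# Stand-in for //title | //price: iter() visits elements in document order
titles_and_prices = [e for e in root.iter() if e.tag in wanted]

# Stand-in for //book/title | //book/price: only direct children of a book
book_fields = [child for book in root.iter("book")
               for child in book if child.tag in wanted]

print([(e.tag, e.text) for e in titles_and_prices])
```
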
XPath Axes
An axis defines a node set relative to the current node.
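Axis name | Result
ancestor | Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self | Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.
attribute | Selects all attributes of the current node.
child | Selects all children of the current node.
descendant | Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self | Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.
following | Selects all nodes in the document after the closing tag of the current node.
following-sibling | Selects all siblings after the current node.
namespace | Selects all namespace nodes of the current node.
parent | Selects the parent of the current node.
preceding | Selects all nodes in the document that appear before the current node, except ancestors, attribute nodes, and namespace nodes.
preceding-sibling | Selects all siblings before the current node.
self | Selects the current node.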
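
The standard-library ElementTree has no axis syntax (for example ancestor::book), so the sketch below only emulates a few axes in plain Python with a parent map, to show what each one reaches. The helper names are hypothetical, the sample XML is made up, and only element nodes are covered, whereas real XPath axes also include text and other node types.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)

# ElementTree elements do not know their parent, so build a lookup table
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(elem):
    """Emulates ancestor::*, listing the nearest ancestor first."""
    found = []
    while elem in parent_of:
        elem = parent_of[elem]
        found.append(elem)
    return found


def following_siblings(elem):
    """Emulates following-sibling::*."""
    parent = parent_of.get(elem)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(elem) + 1:]


def preceding_siblings(elem):
    """Emulates preceding-sibling::*, in document order."""
    parent = parent_of.get(elem)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[:siblings.index(elem)]


title = root.find(".//title")
price = root.find(".//price")
book = root.find("book")

print([e.tag for e in ancestors(title)])           # ancestor
print([e.tag for e in following_siblings(title)])  # following-sibling
print([e.tag for e in preceding_siblings(price)])  # preceding-sibling
print([e.tag for e in list(book.iter())[1:]])      # descendant
```
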