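I have been writing a web crawler recently and picked up the basics of XPath along the way, so here is a small example as practice.
First, the XPath path syntax.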
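As a quick illustration of the most common path expressions, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath. The `library`/`shelf`/`book` document and all variable names are hypothetical and exist only for this example; the scrapy `Selector` used later in this post accepts full XPath 1.0 and is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this sketch.
SAMPLE = """
<library>
  <shelf>
    <book lang="en"><title>Dune</title></book>
    <book lang="fr"><title>Candide</title></book>
  </shelf>
  <book lang="en"><title>Emma</title></book>
</library>
"""

root = ET.fromstring(SAMPLE.strip())

# "book": only the direct children of the current node named book
direct = [b.find("title").text for b in root.findall("book")]

# ".//book": book elements at any depth below the current node
anywhere = [b.find("title").text for b in root.findall(".//book")]

# "shelf/book/title": walk down one level per "/" step
nested = [t.text for t in root.findall("shelf/book/title")]

# ".//book[@lang='en']": keep only elements whose lang attribute is "en"
english = [b.find("title").text for b in root.findall(".//book[@lang='en']")]

print(direct)    # ['Emma']
print(anywhere)  # ['Dune', 'Candide', 'Emma']
print(nested)    # ['Dune', 'Candide']
print(english)   # ['Dune', 'Emma']
```

The difference between `book` and `.//book` is the one that most often trips people up: a bare name matches children only, while `//` searches the whole subtree.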
Write an XML file to use as the document to extract from:
<superhero>
<class>
	<name lang="en">Tony stark</name>
	<alias>Iron man</alias>
	<sex>male</sex>
	<birthday>1969</birthday>
	<age>47</age>
</class>
<class>
	<name lang="en">Peter Benjamin Parker</name>
	<alias>Spider Man</alias>
	<sex>male</sex>
	<birthday>unknown</birthday>
	<age>unknown</age>
</class>
<class>
	<name lang="en">Steven Rogers</name>
	<alias>Captain America</alias>
	<sex>male</sex>
	<birthday>19200704</birthday>
	<age>96</age>
</class>
</superhero>
Now try a few fairly crude XPath queries against it:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector

with open('./superHero.xml', 'r') as fp:
    body = fp.read()

# All child elements of the root node
content = Selector(text=body).xpath('./*').extract()
print(content)
print("#######################")

# Content of the first class element
content = Selector(text=body).xpath("//class[1]").extract()
print(content)
print("#######################")

# Content of the last class element
content = Selector(text=body).xpath("//class[last()]").extract()
print(content)
print("#######################")

# name elements whose lang attribute is 'en'
content = Selector(text=body).xpath("//name[@lang='en']").extract()
print(content)
print("#######################")

# Text of the name node in the second-to-last class
# (with three class elements, that is the second one)
content = Selector(text=body).xpath("//class[last()-1]/name/text()").extract()
print(content)
print("#######################")

Output:
['<body><superhero>\n<class>\n\t<name lang="en">Tony stark</name>\n\t<alias>Iron man</alias>\n\t<sex>male</sex>\n\t<birthday>1969</birthday>\n\t<age>47</age>\n</class>\n<class>\n\t<name lang="en">Peter Benjamin Parker</name>\n\t<alias>Spider Man</alias>\n\t<sex>male</sex>\n\t<birthday>unknown</birthday>\n\t<age>unknown</age>\n</class>\n<class>\n\t<name lang="en">Steven Rogers</name>\n\t<alias>Captain America</alias>\n\t<sex>male</sex>\n\t<birthday>19200704</birthday>\n\t<age>96</age>\n</class>\n</superhero></body>']
#######################
['<class>\n\t<name lang="en">Tony stark</name>\n\t<alias>Iron man</alias>\n\t<sex>male</sex>\n\t<birthday>1969</birthday>\n\t<age>47</age>\n</class>']
#######################
['<class>\n\t<name lang="en">Steven Rogers</name>\n\t<alias>Captain America</alias>\n\t<sex>male</sex>\n\t<birthday>19200704</birthday>\n\t<age>96</age>\n</class>']
#######################
['<name lang="en">Tony stark</name>', '<name lang="en">Peter Benjamin Parker</name>', '<name lang="en">Steven Rogers</name>']
#######################
['Peter Benjamin Parker']
#######################
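The `<body>` wrapper in the first result is there because `Selector(text=...)` parses its input as HTML by default. To double-check the positional predicates without scrapy, the same queries can be roughly reproduced with the standard-library `xml.etree.ElementTree`. This is only a sketch under some assumptions: the XML is a shortened inline copy of the document above (name and alias only), the helper `names` is hypothetical, and because ElementTree's XPath subset has no `text()` step, the text is read from the `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the superhero document (name and alias only).
XML = """<superhero>
<class><name lang="en">Tony stark</name><alias>Iron man</alias></class>
<class><name lang="en">Peter Benjamin Parker</name><alias>Spider Man</alias></class>
<class><name lang="en">Steven Rogers</name><alias>Captain America</alias></class>
</superhero>"""

root = ET.fromstring(XML)


def names(path):
    """Hypothetical helper: name text of every class element matched by path."""
    return [c.find("name").text for c in root.findall(path)]


# Positions are 1-based; last() is the final sibling, last()-1 the one before it.
first = names("class[1]")
last = names("class[last()]")
second_to_last = names("class[last()-1]")

# Attribute filter, reading .text instead of using a text() step.
en_names = [n.text for n in root.findall(".//name[@lang='en']")]

print(first)           # ['Tony stark']
print(last)            # ['Steven Rogers']
print(second_to_last)  # ['Peter Benjamin Parker']
print(en_names)        # ['Tony stark', 'Peter Benjamin Parker', 'Steven Rogers']
```

Note that `class[last()-1]` only coincides with "the second class" because there are exactly three of them; with more entries, `class[2]` would be the safer way to ask for the second one.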