Python: Common XPath Locating Techniques in Scrapy

To quickly test an XPath expression from the command line, run this command in a virtual environment where Scrapy is installed:

scrapy shell http://xxx.xxx.com

In the shell that opens, you can run commands to test what gets extracted:

>>> tite = response.xpath('//div[@class="entry-header"]/h1/text()').extract()
>>> tite
['5 款 Linux 街机游戏']

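The result comes back as a list, so individual members are accessed by index. Because extract() has already been called, index the list directly:

>>> tite[0]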
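
Indexing with [0] raises IndexError when the XPath matches nothing. Scrapy selector lists offer extract_first() for this case. The sketch below shows the same guard on a plain list, assuming a hypothetical helper named first_or_default (it is not part of Scrapy) and sample data standing in for the result of extract():

```python
def first_or_default(values, default=None):
    """Return the first extracted string, or `default` if the list is empty."""
    return values[0] if values else default


# Sample data standing in for response.xpath(...).extract()
tite = ['5 款 Linux 街机游戏']

print(first_or_default(tite))        # first (and only) title
print(first_or_default([], "N/A"))   # nothing matched, fall back to "N/A"
```

Returning a default keeps a spider running when a page lacks the expected element, at the cost of hiding the miss, so pick a default that is easy to spot later.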

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Extracting the text of the h1 tag (call .extract() on the result to get the strings):

tite = response.xpath('//div[@class="entry-header"]/h1/text()')

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

When the date, section, and category appear together in the form "date · section · category" (as on http://blog.jobbole.com/114636/):

<p class="entry-meta-hide-on-mobile">

            2019/01/11 ·  <a href="http://blog.jobbole.com/category/it-tech/" rel="category tag">IT技术</a>
            
            

            
             ·  <a href="http://blog.jobbole.com/tag/linux/">Linux</a>
            
</p>

Use the command:

response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·","")

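==Command explanation==

extract()[0]           = take the first member of the extracted list

strip()                   = remove leading and trailing whitespace

.replace("·","")    = replace "·" with an empty string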
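
Note that calling strip() before replace() leaves the space that sat in front of the "·". A minimal sketch of a cleanup that strips once more afterwards and then parses the date. The helper names clean_meta_text and parse_post_date are hypothetical, and the format %Y/%m/%d is assumed from the single sample above:

```python
from datetime import datetime


def clean_meta_text(raw):
    """Strip whitespace, drop the "·" separator, then strip what is left."""
    return raw.strip().replace("·", "").strip()


def parse_post_date(raw, fmt="%Y/%m/%d"):
    """Parse the cleaned text as a date; `fmt` is assumed from the sample page."""
    return datetime.strptime(clean_meta_text(raw), fmt).date()


# Sample text node shaped like the first text() result of the <p> element
raw = "\n\n            2019/01/11 ·  "

print(repr(clean_meta_text(raw)))
print(parse_post_date(raw))
```

Storing a real date object rather than the raw string makes later sorting and filtering simpler, but strptime raises ValueError if a page uses a different format.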

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Using the contains function to find elements whose attribute contains a given value

<span data-post-id="114636" class=" btn-bluet-bigger href-style vote-post-up   register-user-only "><i class="fa  fa-thumbs-o-up"></i> <h10 id="114636votetotal">1</h10> 赞</span>

Command:

response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()")

Converting the value directly to int:

int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0])

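==Command explanation==

//span[contains(@class,'vote-post-up')]/h10/text() = among all span tags, find those whose class attribute contains 'vote-post-up', then take the text of their h10 child

To get the value as an int, wrap the whole expression in int(.....)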
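
The bare int(...) call fails with IndexError if nothing matched and with ValueError if the text is not a number. A minimal sketch of a more forgiving conversion, assuming a hypothetical helper named to_int and a default of 0 (both are illustrative choices, not part of Scrapy):

```python
def to_int(values, default=0):
    """Convert the first extracted string to int, or return `default`."""
    if not values:
        # The XPath matched nothing
        return default
    try:
        return int(values[0].strip())
    except ValueError:
        # The text was empty or not a plain integer
        return default


# Sample lists standing in for response.xpath(...).extract()
print(to_int(['1']))
print(to_int([]))
print(to_int([' ']))
```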

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

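For values such as "156 收藏" (a number followed by words or letters), use a regular expression from the re module to pull out the number (tested here in ipython, started from the command line):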

In [5]: import re

In [6]: m = re.match(r".*?(\d+).*", "156 收藏")

In [7]: if m:
   ...:     print(m.group(1))
   ...:
156

==Command explanation==

m = re.match("regular expression", "content")

if m:   (check whether there was a match)

  print(m.group(1)) (print the first capture group)
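
The same idea can be wrapped in a reusable function. This sketch swaps re.match with ".*?(\d+).*" for re.search with the simpler pattern \d+, which finds the first run of digits anywhere in the string. The name extract_number and the default of 0 are assumptions made for illustration:

```python
import re

# First run of digits anywhere in the text
NUMBER_RE = re.compile(r"\d+")


def extract_number(text, default=0):
    """Return the first integer found in `text`, or `default` if there is none."""
    match = NUMBER_RE.search(text)
    return int(match.group()) if match else default


print(extract_number("156 收藏"))
print(extract_number(" 收藏"))
```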

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Reposted from blog.csdn.net/qq_40134903/article/details/86365648