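Parsing Response Data with Scrapy

Parsing response data

How do we handle the data the server returns? We need to extract the pieces we care about from the response, and there are several ways to parse it: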

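Plain text operations
Regular expressions: re
DOM tree traversal: BeautifulSoup (comparatively slow to parse)
XPath selectors: Scrapy's Selector (built on the lxml library with a simplified API; fast to parse)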
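
To make the first two approaches concrete, here is a minimal sketch using only the Python standard library. The `html` string is a shortened stand-in for a real response body, and the variable names are illustrative. Both techniques work on simple, predictable markup but break easily when the page layout changes.

```python
import re

# A shortened stand-in for the body of a real response
html = '<h2>这是标题</h2><p class="xie" name="p标签">你好，世界</p><img src="1.jpg">'

# Plain text operation: slice out what sits between two known markers
start = html.index("<h2>") + len("<h2>")
end = html.index("</h2>")
heading = html[start:end]

# Regular expression: capture the src attribute of every img tag
img_sources = re.findall(r'<img\s+src="([^"]+)"', html)

print(heading)
print(img_sources)
```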

Suppose we have the following web page, whose HTML source is:

 <html>
 <head>
    <title>谢公子的小黑屋</title>
 </head>
 <body>
   <h2>这是标题</h2>
   <p class="xie" name="p标签">你好，世界</p>
   <img src="1.jpg">
   <div class="one">
     <div class="two">
         <div class="three">
             <div class="title">这是第一个标题</div>
             <div class="time">2019-1-1 01:11:11</div>
         </div>
     </div>
     <div class="two">
         <div class="three">
             <div class="title">这是第二个标题</div>
             <div class="time">2019-2-2 02:22:22</div>
         </div>
     </div>
     <div class="two">
         <div class="three">
             <div class="title">这是第三个标题</div>
             <div class="time">2019-3-3 03:33:33</div>
         </div>
     </div>
   </div>
 </body>
 </html>

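The returned response is an instance of a response class.

Its text attribute gives the body as a str object.

Parsing with XPath selectors

Wrap the string in a Selector object, and XPath selectors become available.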

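import requests
from parsel import Selector

# "xx" is a placeholder for the target URL
response = requests.get(url="xx")
# Build a Selector from the response body (a str)
select = Selector(text=response.text)
# Pass an XPath expression; the empty string is only a placeholder
select.xpath("")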

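// : selects descendant nodes of the current node; with no path in front of it, it searches the whole document
/ : selects direct children of the current node
. : selects the current node
.. : selects the parent of the current node
@ : selects an attribute
//* : every element node in the HTML document
extract(): returns the matched nodes as a list of strings
text(): selects the text nodes directly inside an element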

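Examples:
response.xpath('/html/body/div') # all div elements that are direct children of body
response.xpath('//a') # every a in the document
response.xpath('/html/body//div') # every div under body, at any depth
response.xpath('//a/text()') # the text of every a
response.xpath('/html/body/div/*') # all element children of the div elements directly under body
response.xpath('//div/*/img') # every img that is a grandchild of a div
response.xpath('//img/@src') # the src attribute of every img
response.xpath('//a[1]/img/@*') # all attributes of img under each a that is the first a child of its parent
response.xpath('//a[2]') # each a that is the second a child of its parent; use '(//a)[2]' for the second a in the whole document
response.xpath('//a[last()]') # each a that is the last a child of its parent
response.xpath('//a[last()-1]') # the second-to-last a
response.xpath('//a[position()<=3]') # the first three a, using the position() function
response.xpath('//div[@id]') # every div that has an id attribute
response.xpath('//div[@id="song"]') # every div whose id attribute equals song
response.xpath("//p[contains(@class,'song')]") # every p whose class attribute contains 'song'
response.xpath('//div/a | //div/p') # union: matches a or p under a div, for pages that may use either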

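Example: selecting the title

response.xpath("//title") # SelectorList of the matching title nodes
response.xpath("//title").extract() # list of strings, each a serialized title element
response.xpath("//title").extract()[0] # the first string in that list
response.xpath("//title/text()").extract()[0] # only the text inside title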

Selecting the h2 tag

Selecting the title and time of each div
