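[Preface] While crawling website text with Scrapy, I found that footer text such as copyright notices degrades the quality of the word segmentation done afterwards, so I decided to discard everything in the page's HTML that belongs to a footer. Below is all of the text extracted from a page when the footer is not excluded: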
response.selector.xpath('//*[not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()
['400-004-3535',
'一键匹配贷款',
'(为您获取精准贷款方案)',
'贷款金额',
'万元',
'搜索',
'信用贷',
'经营贷',
'房贷',
'车贷',
'贷款攻略',
'客服热线',
'快速申请',
'贷款计算器',
'热门贷款产品',
'红本抵押贷款',
'总利息:',
'0.19',
'万元 \xa0月供:',
'4325',
'元',
'查看',
'\r\n\t\ufeff',
'电脑版',
'\xa0|\xa0',
'关于我们',
'版权所有©贷上我 m.dai35.com ',
'深圳贷上我金融服务有限公司',
'电话咨询',
'400-004-3535',
'贷款产品多?太难选',
'一键委托',
'专业为您推荐']
Explore HTML Contents of Various Pages
Generally, a footer shows up in one of the following forms:
- <div class="footer">
<div class="footer">
<div class="topBtn"><a id="btn" href="#"></a></div>
<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a> | <a href="about.php">关于我们</a></div>
<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com </div>
<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>
</div>
- <footer>
<footer>
<div class="down" onclick="toIndex()"><a href="javascript:;"><span><b class="zrLogoSmall"></b>下载自如APP,立即签约好房源</span></a></div>
<ul class="ub">
<li class="ub-f1"><a href="//www.ziroom.com?is_m=1" target="_blank">电脑版</a></li>
<li class="ub-f1 borderLeft"><a href="/">触屏版</a></li>
<li class="ub-f1 borderLeft"><a href="https://lnk0.com/easylink/ELxdgoYd">客户端</a></li>
</ul>
<ul class="ub">
<li class="ub-f1"><a href="/">首页</a></li>
<li class="ub-f1 borderLeft"><a href="/list">自如找房</a></li>
</ul>
<p class="version">Copyright©2017 ziroom.com</p>
</footer>
- <div id="footer">
<div id="footer">
<div class="area">
<div class="clearfix">
<div class="glbLeft">
<dl class="fList">
<dt>关于我们</dt>
<dd>
<a href="http://www.ziroom.com/zhaopin/index.php?r=site/about">关于自如</a>
<a href="http://www.ziroom.com/about/lianxi.html">联系自如</a>
<a href="http://www.ziroom.com/zhaopin/">加入自如</a>
</dd>
</dl>
<dl class="fList">
<dt>自如业务</dt>
<dd>
<a href="http://www.ziroom.com/about/fuwu.html">业务体系</a>
<a href="http://www.ziroom.com/about/fuwu.html">自如产品</a>
<a href="http://www.ziroom.com/servicecentre/">自如服务</a>
<a href="http://www.ziroom.com/purchase/">自如采购</a>
</dd>
</dl>
<dl class="fList">
<dt>关注自如</dt>
<dd>
<a>自如客微信</a>
<a>下载app</a>
</dd>
</dl>
</div>
<div class="glbRight">
<div class="img">
<img src="//static8.ziroom.com/phoenix/pc/images/zrk_ewm.png?v=20180102">
<p>关注自如客微信</p>
</div>
<div class="img">
<img src="http://www.ziroom.com/static/2015/images/common/app-min-qrcode.png?v=20180102">
<p>下载自如app</p>
</div><!--/img-->
</div><!--/glbRight-->
</div><!--/clearfix-->
<div class="linksFooter"></div>
<div class="footerBottom pr">
<p>北京自如信息科技有限公司 Copyright@2018 ziroom.com 版权所有 京ICP备16015349号-1</p>
<p>本网站所有页面的数据统计均来源于自如数据库 联系客服:自如客微信 周一至周日09:00-22:00</p>
<a key ="553dfddf58725379d18ae6b4" style="position: absolute; right: 0; top: 0;" logo_size="124x47" logo_type="business" href="http://www.anquan.org" ><script src="http://static.anquan.org/static/outer/js/aq_auth.js"></script></a>
</div>
</div><!--/area-->
</div><!--/footer-->
How to Extract Footers Using XPath
Open the Scrapy shell and fetch a page:
scrapy shell "http://m.dai35.com/"
response.selector.xpath('//*').extract()
......
'<a href="#">热门贷款产品</a>',
'<div class="prolist">\r\n <a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">红本抵押贷款</h3>\r\n <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>\r\n </div>',
'<a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">红本抵押贷款</h3>\r\n <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>',
'<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
'<h3 class="prolist_name">红本抵押贷款</h3>',
'<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
'<font color="#e10014">0.19</font>',
'<font color="#003f97">4325</font>',
'<p class="prolist_infop2"></p>',
'<span class="prolist_jiantou">查看</span>',
'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>',
'<div class="topBtn"><a id="btn" href="#"></a></div>',
'<a id="btn" href="#"></a>',
'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
'<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
'<a href="about.php">关于我们</a>',
'<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com </div>',
'<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
'<br>',
'<br>',
'<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n var hm = document.createElement("script");\r\n hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n var s = document.getElementsByTagName("script")[0];\r\n s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
response.selector.xpath('//*[self::footer or contains(@id,"footer") or contains(@class,"footer")]').extract()
['<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>']
The natural next thought is that, to leave this part out, we can simply wrap the same condition in not():
response.selector.xpath('//*[not(self::footer or contains(@id,"footer") or contains(@class,"footer"))]').extract()
......
'<a href="#">热门贷款产品</a>',
'<div class="prolist">\r\n <a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">红本抵押贷款</h3>\r\n <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>\r\n </div>',
'<a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">红本抵押贷款</h3>\r\n <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>',
'<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
'<h3 class="prolist_name">红本抵押贷款</h3>',
'<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
'<font color="#e10014">0.19</font>',
'<font color="#003f97">4325</font>',
'<p class="prolist_infop2"></p>',
'<span class="prolist_jiantou">查看</span>',
'<div class="topBtn"><a id="btn" href="#"></a></div>',
'<a id="btn" href="#"></a>',
'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
'<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
'<a href="about.php">关于我们</a>',
'<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com </div>',
'<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
'<br>',
'<br>',
'<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n var hm = document.createElement("script");\r\n hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n var s = document.getElementsByTagName("script")[0];\r\n s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
However, the only difference between this result and the full `//*` listing above is that one entry is missing:
'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>'
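The entries below are still present, so the text we finally extract would still contain footer content. The reason is that `//*` also visits every descendant of the footer, and none of those descendants is itself a footer element, so each of them passes the not(...) test. How, then, should the expression be written to really remove the footer content from the result?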
'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
'<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
'<a href="about.php">关于我们</a>',
'<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com </div>',
'<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
Exclude Footers of Any Kind in Results
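What we actually need is to exclude every element that is a footer itself or has a footer among its ancestors, which is what the ancestor-or-self axis expresses. The footer tag has to be tested with self::footer, because a bare footer inside the predicate would instead match any element that has a footer child. With this expression, everything related to the footer is gone from the result:
response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")])]').extract()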
......
'<a href="#">热门贷款产品</a>',
'<div class="prolist">\r\n <a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">红本抵押贷款</h3>\r\n <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>\r\n </div>',
'<a class="prolistLink relative wid01" href="loanshow.php?cid=12&tid=0&id=46&m=5&t=12">\r\n <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n <h3 class="prolist_name">红本抵押贷款</h3>\r\n <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n <p class="prolist_infop2"></p>\r\n <span class="prolist_jiantou">查看</span>\r\n </a>',
'<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
'<h3 class="prolist_name">红本抵押贷款</h3>',
'<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
'<font color="#e10014">0.19</font>',
'<font color="#003f97">4325</font>',
'<p class="prolist_infop2"></p>',
'<span class="prolist_jiantou">查看</span>',
'<br>',
'<br>',
'<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n var hm = document.createElement("script");\r\n hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n var s = document.getElementsByTagName("script")[0];\r\n s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
Function not(boolean) in XPath
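The key piece here is the not() function named in the heading. To exclude both the elements that are a footer or sit inside one and the elements that are themselves script, style, or title, combine two not() tests with and:
response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")]) and not(self::script or self::style or self::title)]').extract()
Finally, we select every non-blank text node that remains after these exclusions (see below). Compared with the output at the beginning of the article, the five footer entries ('电脑版', '\xa0|\xa0', '关于我们', the copyright line, and the company name) are gone:
response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")]) and not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()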
['400-004-3535',
'一键匹配贷款',
'(为您获取精准贷款方案)',
'贷款金额',
'万元',
'搜索',
'信用贷',
'经营贷',
'房贷',
'车贷',
'贷款攻略',
'客服热线',
'快速申请',
'贷款计算器',
'热门贷款产品',
'红本抵押贷款',
'总利息:',
'0.19',
'万元 \xa0月供:',
'4325',
'元',
'查看',
'\r\n\t\ufeff',
'电话咨询',
'400-004-3535',
'贷款产品多?太难选',
'一键委托',
'专业为您推荐']
[Reference] https://stackoverflow.com/questions/49221014/scrapy-linkextractor-restrict-paths-exclude-tags