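[Preface] While crawling website text with Scrapy, I found that footer text such as copyright notices degrades the subsequent word segmentation, so I decided to discard everything footer-related from the page's HTML. Without excluding the footer, this is all the text extracted from the page: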

response.selector.xpath('//*[not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()

['400-004-3535',
 '一键匹配贷款',
 '(为您获取精准贷款方案)',
 '贷款金额',
 '万元',
 '搜索',
 '信用贷',
 '经营贷',
 '房贷',
 '车贷',
 '贷款攻略',
 '客服热线',
 '快速申请',
 '贷款计算器',
 '热门贷款产品',
 '红本抵押贷款',
 '总利息:',
 '0.19',
 '万元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '电脑版',
 '\xa0|\xa0',
 '关于我们',
 '版权所有©贷上我 m.dai35.com  ',
 '深圳贷上我金融服务有限公司',
 '电话咨询',
 '400-004-3535',
 '贷款产品多？太难选',
 '一键委托',
 '专业为您推荐']

Explore HTML Contents of Various Pages

Generally, a footer appears in one of these forms:

<div class="footer">

	<div class="footer">
	<div class="topBtn"><a id="btn" href="#"></a></div>
	<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>&nbsp;|&nbsp;<a href="about.php">关于我们</a></div>
	<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有&copy;贷上我 m.dai35.com  </div>
	<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>
	</div>

<footer>

            <footer>
                <div class="down" onclick="toIndex()"><a href="javascript:;"><span><b class="zrLogoSmall"></b>下载自如APP,立即签约好房源</span></a></div>
                <ul class="ub">
                    <li class="ub-f1"><a href="//www.ziroom.com?is_m=1" target="_blank">电脑版</a></li>
                    <li class="ub-f1 borderLeft"><a href="/">触屏版</a></li>
                    <li class="ub-f1 borderLeft"><a href="https://lnk0.com/easylink/ELxdgoYd">客户端</a></li>
                </ul>
                <ul class="ub">
                    <li class="ub-f1"><a href="/">首页</a></li>
                    <li class="ub-f1 borderLeft"><a href="/list">自如找房</a></li>
                </ul>
                <p class="version">Copyright©2017 ziroom.com</p>
            </footer>

id="footer"

<div id="footer">
    <div class="area">
        <div class="clearfix">
            <div class="glbLeft">
                <dl class="fList">
                    <dt>关于我们</dt>
                    <dd>
                        <a href="http://www.ziroom.com/zhaopin/index.php?r=site/about">关于自如</a>
                        <a href="http://www.ziroom.com/about/lianxi.html">联系自如</a>
                        <a href="http://www.ziroom.com/zhaopin/">加入自如</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>自如业务</dt>
                    <dd>
                        <a href="http://www.ziroom.com/about/fuwu.html">业务体系</a>
                        <a href="http://www.ziroom.com/about/fuwu.html">自如产品</a>
                        <a href="http://www.ziroom.com/servicecentre/">自如服务</a>
                        <a href="http://www.ziroom.com/purchase/">自如采购</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>关注自如</dt>
                    <dd>
                        <a>自如客微信</a>
                        <a>下载app</a>
                    </dd>
                </dl>
            </div>

            <div class="glbRight">
                <div class="img">
                    <img src="//static8.ziroom.com/phoenix/pc/images/zrk_ewm.png?v=20180102">
                    <p>关注自如客微信</p>
                </div>
                <div class="img">
                    <img src="http://www.ziroom.com/static/2015/images/common/app-min-qrcode.png?v=20180102">
                    <p>下载自如app</p>
                </div><!--/img-->
            </div><!--/glbRight-->
        </div><!--/clearfix-->
		
        <div class="linksFooter"></div>

        <div class="footerBottom pr">
            <p>北京自如信息科技有限公司 Copyright@2018 ziroom.com 版权所有 京ICP备16015349号-1</p>
            <p>本网站所有页面的数据统计均来源于自如数据库 &nbsp;&nbsp;联系客服：自如客微信  周一至周日09:00-22:00</p>
            <a key ="553dfddf58725379d18ae6b4" style="position: absolute; right: 0; top: 0;"  logo_size="124x47"  logo_type="business"  href="http://www.anquan.org" ><script src="http://static.anquan.org/static/outer/js/aq_auth.js"></script></a>
        </div>
    </div><!--/area-->
</div><!--/footer-->
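
The three forms above share one test: the element is a `<footer>`, or its `id` or `class` contains the substring `footer`. Below is a minimal Python sketch of that test, handy for checking which elements such a pattern would catch. The function name `is_footer_like` is made up for illustration, and the matching is case-sensitive, just like XPath `contains()`.

```python
def is_footer_like(tag, attrs):
    """Return True if an element looks like a footer.

    Mirrors the test: self::footer or contains(@id, "footer")
    or contains(@class, "footer"). Matching is case-sensitive.
    """
    if tag == "footer":
        return True
    return "footer" in attrs.get("id", "") or "footer" in attrs.get("class", "")


samples = [
    ("div", {"class": "footer"}),
    ("footer", {}),
    ("div", {"id": "footer"}),
    ("div", {"class": "footerBottom pr"}),  # substring match: caught
    ("div", {"class": "linksFooter"}),      # capital F: not caught
    ("div", {"class": "copyRight"}),
]
results = [is_footer_like(tag, attrs) for tag, attrs in samples]
print(results)
```

Because the match is a plain substring test, a class such as `footerBottom` is caught while `linksFooter` is not; both sit inside `<div id="footer">` in the third example, so they are covered anyway once whole subtrees are excluded.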

How to Extract Footers Using XPath

Open the Scrapy shell and fetch a page:

scrapy shell "http://m.dai35.com/"
response.selector.xpath('//*').extract()
......
 '<a href="#">热门贷款产品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">红本抵押贷款</h3>',
 '<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
 '<a href="about.php">关于我们</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......

To select only the footer elements, test each element for the three forms seen earlier:


response.selector.xpath('//*[self::footer or contains(@id,"footer") or contains(@class,"footer")]').extract()
['<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>']

So it is natural to think that, to leave this part out, we can simply negate the condition:

response.selector.xpath('//*[not(self::footer or contains(@id,"footer") or contains(@class,"footer"))]').extract()

......
 '<a href="#">热门贷款产品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">红本抵押贷款</h3>',
 '<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
 '<a href="about.php">关于我们</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......

However, compared with the `//*` result, the only difference is that one entry is missing:

'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>'

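The entries below, however, are still present, so the text we finally extract would still contain footer content. The predicate is evaluated for each element separately, which means only the footer element itself is rejected, while its descendants, which carry no footer marker of their own, still match. So how should the expression be written to really remove the footer content from the result?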

'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
 '<a href="about.php">关于我们</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',

Exclude Footers of Any Kind in Results

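What we actually need to exclude is the footer node itself together with all of its descendants, that is, every element that is a footer or has a footer among its ancestors. The element-name test inside the predicate is written as self::footer, because a bare footer there would test for a child element named footer and would therefore also match the parent of any <footer>, for example <body>. With this expression, all footer-related content is gone:

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer])]').extract()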

......
 '<a href="#">热门贷款产品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">红本抵押贷款</h3>',
 '<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
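
The difference between testing only the element itself and testing `ancestor-or-self` can be emulated in plain Python. The sketch below is an illustration, not XPath evaluation: it walks a made-up, simplified document with the standard-library `xml.etree.ElementTree`, and all helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A simplified, made-up page: one content block and one footer block.
DOC = """
<body>
  <div class="prolist"><span>content</span></div>
  <div class="footer">
    <div class="about"><a>about us</a></div>
    <div class="copyRight">copyright</div>
  </div>
</body>
""".strip()

root = ET.fromstring(DOC)
# ElementTree has no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def is_footer(el):
    """True if the element itself is footer-like."""
    return (el.tag == "footer"
            or "footer" in el.get("class", "")
            or "footer" in el.get("id", ""))


def has_footer_ancestor_or_self(el):
    """True if the element or any of its ancestors is footer-like."""
    while el is not None:
        if is_footer(el):
            return True
        el = parent_of.get(el)
    return False


def label(el):
    return el.get("class") or el.tag


# Testing only the element itself drops just the footer <div>.
self_only = [label(el) for el in root.iter() if not is_footer(el)]
# Testing ancestor-or-self drops the whole footer subtree.
whole_subtree = [label(el) for el in root.iter()
                 if not has_footer_ancestor_or_self(el)]

print(self_only)
print(whole_subtree)
```

In the first list the `about`, `a`, and `copyRight` elements survive even though their parent was rejected; in the second they are gone.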

Function not(boolean) in XPath

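What does the key work here is the not() function named in the heading. To exclude both the elements that are a footer or have a footer ancestor and the elements that are themselves script, title, or style, combine two not() conditions with and:

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer]) and not(self::script or self::style or self::title)]').extract()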
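
Since the expression gets long, it can help to assemble it from its parts. The following is a minimal sketch; `build_xpath` and its parameters are hypothetical names, not part of Scrapy. It writes the element-name test as `self::footer`, since a bare `footer` inside a predicate would test for a child element of that name instead.

```python
def build_xpath(markers=("footer",),
                skipped_tags=("script", "style", "title"),
                text_nodes=False):
    """Compose an XPath that skips marked subtrees and skipped tags.

    markers: names treated as "footer-like" (tag name, or substring
             of the id/class attribute).
    skipped_tags: elements that are excluded themselves.
    text_nodes: if True, select the non-blank text nodes instead of
                the elements.
    """
    tests = []
    for marker in markers:
        tests += [
            f"self::{marker}",
            f'contains(@id,"{marker}")',
            f'contains(@class,"{marker}")',
        ]
    marked = " or ".join(tests)
    skipped = " or ".join(f"self::{tag}" for tag in skipped_tags)
    xpath = f"//*[not(ancestor-or-self::*[{marked}]) and not({skipped})]"
    if text_nodes:
        xpath += "/text()[normalize-space(.)]"
    return xpath


element_xpath = build_xpath()
text_xpath = build_xpath(text_nodes=True)
print(element_xpath)
print(text_xpath)
```

In the Scrapy shell the result can be passed straight to the selector, for example `response.selector.xpath(build_xpath(text_nodes=True)).extract()`.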

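Finally, we select all the text nodes that remain after these exclusions (see below). Compared with the text obtained at the beginning of the article, the five footer entries (from '电脑版' through '深圳贷上我金融服务有限公司') are gone:

response.selector.xpath('//*[not(ancestor-or-self::*[contains(@id,"footer") or contains(@class,"footer") or self::footer]) and not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()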

['400-004-3535',
 '一键匹配贷款',
 '(为您获取精准贷款方案)',
 '贷款金额',
 '万元',
 '搜索',
 '信用贷',
 '经营贷',
 '房贷',
 '车贷',
 '贷款攻略',
 '客服热线',
 '快速申请',
 '贷款计算器',
 '热门贷款产品',
 '红本抵押贷款',
 '总利息:',
 '0.19',
 '万元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '电话咨询',
 '400-004-3535',
 '贷款产品多？太难选',
 '一键委托',
 '专业为您推荐']
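
The list still contains entries such as `'\r\n\t\ufeff'` and an embedded `'\xa0'`. XPath 1.0 `normalize-space()` only treats space, tab, carriage return, and line feed as whitespace, so a byte order mark or a non-breaking space counts as content and passes the filter. A minimal post-processing sketch before word segmentation might look like this; the function name `clean_texts` is an assumption for illustration.

```python
def clean_texts(texts):
    """Drop BOMs, turn non-breaking spaces into plain spaces,
    collapse whitespace runs, and discard strings left empty."""
    cleaned = []
    for text in texts:
        text = text.replace("\ufeff", "").replace("\xa0", " ")
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:
            cleaned.append(text)
    return cleaned


# A few entries copied from the extracted list.
raw = ["总利息:", "0.19", "万元 \xa0月供:", "4325", "元", "\r\n\t\ufeff", "电话咨询"]
cleaned = clean_texts(raw)
print(cleaned)
```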

[Reference] https://stackoverflow.com/questions/49221014/scrapy-linkextractor-restrict-paths-exclude-tags

[Crawler] Discarding all footer-related content with XPath in Scrapy
