[Crawler] Dropping all footer-related content with XPath in Scrapy

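[Preface] While scraping website text with Scrapy, I found that footer text such as copyright notices hurts the downstream word segmentation, so I decided to drop everything footer-related from the page HTML. For reference, this is all the text on the page when the footer is not excluded: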

response.selector.xpath('//*[not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()

['400-004-3535',
 '一键匹配贷款',
 '(为您获取精准贷款方案)',
 '贷款金额',
 '万元',
 '搜索',
 '信用贷',
 '经营贷',
 '房贷',
 '车贷',
 '贷款攻略',
 '客服热线',
 '快速申请',
 '贷款计算器',
 '热门贷款产品',
 '红本抵押贷款',
 '总利息:',
 '0.19',
 '万元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '电脑版',
 '\xa0|\xa0',
 '关于我们',
 '版权所有©贷上我 m.dai35.com  ',
 '深圳贷上我金融服务有限公司',
 '电话咨询',
 '400-004-3535',
 '贷款产品多?太难选',
 '一键委托',
 '专业为您推荐']
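
A minimal sketch of what this expression does, using lxml directly on a small made-up page. Scrapy's selectors are built on lxml, so the same XPath 1.0 expression should behave the same way in response.xpath. The [normalize-space(.)] predicate only filters out whitespace-only text nodes. The strings that survive are returned untrimmed, which is why an entry such as '版权所有©贷上我 m.dai35.com  ' above keeps its trailing spaces.

```python
from lxml import html

# Hypothetical page, made up for illustration only.
PAGE = """<html><head><title>t</title><style>p{}</style></head><body>
<p>  hello  </p>
<div>   </div>
<script>var x = 1;</script>
</body></html>"""

# Same expression as in the article: skip text inside script/style/title,
# and keep only text nodes that are not pure whitespace.
TEXT_XPATH = (
    '//*[not(self::script or self::style or self::title)]'
    '/text()[normalize-space(.)]'
)

texts = html.fromstring(PAGE).xpath(TEXT_XPATH)

# The predicate filters but does not trim, so strip in Python if needed.
cleaned = [t.strip() for t in texts]
print(cleaned)
```

Stripping in Python afterwards is one simple way to normalize the strings before word segmentation.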

Explore HTML Contents of Various Pages

Footers generally show up in one of the following forms:

  • class="footer"
	<div class="footer">
	<div class="topBtn"><a id="btn" href="#"></a></div>
	<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>&nbsp;|&nbsp;<a href="about.php">关于我们</a></div>
	<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有&copy;贷上我 m.dai35.com  </div>
	<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>
	</div>
  • <footer>
            <footer>
                <div class="down" onclick="toIndex()"><a href="javascript:;"><span><b class="zrLogoSmall"></b>下载自如APP,立即签约好房源</span></a></div>
                <ul class="ub">
                    <li class="ub-f1"><a href="//www.ziroom.com?is_m=1" target="_blank">电脑版</a></li>
                    <li class="ub-f1 borderLeft"><a href="/">触屏版</a></li>
                    <li class="ub-f1 borderLeft"><a href="https://lnk0.com/easylink/ELxdgoYd">客户端</a></li>
                </ul>
                <ul class="ub">
                    <li class="ub-f1"><a href="/">首页</a></li>
                    <li class="ub-f1 borderLeft"><a href="/list">自如找房</a></li>
                </ul>
                <p class="version">Copyright©2017 ziroom.com</p>
            </footer>
  • id="footer"
<div id="footer">
    <div class="area">
        <div class="clearfix">
            <div class="glbLeft">
                <dl class="fList">
                    <dt>关于我们</dt>
                    <dd>
                        <a href="http://www.ziroom.com/zhaopin/index.php?r=site/about">关于自如</a>
                        <a href="http://www.ziroom.com/about/lianxi.html">联系自如</a>
                        <a href="http://www.ziroom.com/zhaopin/">加入自如</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>自如业务</dt>
                    <dd>
                        <a href="http://www.ziroom.com/about/fuwu.html">业务体系</a>
                        <a href="http://www.ziroom.com/about/fuwu.html">自如产品</a>
                        <a href="http://www.ziroom.com/servicecentre/">自如服务</a>
                        <a href="http://www.ziroom.com/purchase/">自如采购</a>
                    </dd>
                </dl>
                <dl class="fList">
                    <dt>关注自如</dt>
                    <dd>
                        <a>自如客微信</a>
                        <a>下载app</a>
                    </dd>
                </dl>
            </div>

            <div class="glbRight">
                <div class="img">
                    <img src="//static8.ziroom.com/phoenix/pc/images/zrk_ewm.png?v=20180102">
                    <p>关注自如客微信</p>
                </div>
                <div class="img">
                    <img src="http://www.ziroom.com/static/2015/images/common/app-min-qrcode.png?v=20180102">
                    <p>下载自如app</p>
                </div><!--/img-->
            </div><!--/glbRight-->
        </div><!--/clearfix-->
		
        <div class="linksFooter"></div>

        <div class="footerBottom pr">
            <p>北京自如信息科技有限公司 Copyright@2018 ziroom.com 版权所有 京ICP备16015349号-1</p>
            <p>本网站所有页面的数据统计均来源于自如数据库 &nbsp;&nbsp;联系客服:自如客微信  周一至周日09:00-22:00</p>
            <a key ="553dfddf58725379d18ae6b4" style="position: absolute; right: 0; top: 0;"  logo_size="124x47"  logo_type="business"  href="http://www.anquan.org" ><script src="http://static.anquan.org/static/outer/js/aq_auth.js"></script></a>
        </div>
    </div><!--/area-->
</div><!--/footer-->
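
All three forms can be matched with a single predicate. The sketch below uses lxml on a made-up page and assumes that substring matching on id and class is acceptable. Note that contains() is case-sensitive and matches substrings, so class="footerBottom" matches while class="linksFooter" does not.

```python
from lxml import html

# Hypothetical miniature page containing all three footer styles.
PAGE = """<html><body>
  <div class="content"><p>main text</p></div>
  <div class="footer"><p>class form</p></div>
  <footer><p>element form</p></footer>
  <div id="footer"><div class="footerBottom">id form</div></div>
  <div class="linksFooter"></div>
</body></html>"""

# An element is a footer root if it is a <footer> tag,
# or its id or class contains the substring "footer".
FOOTER_ROOTS = (
    '//*[self::footer or contains(@id,"footer") or contains(@class,"footer")]'
)

tree = html.fromstring(PAGE)
roots = tree.xpath(FOOTER_ROOTS)

# Describe each match by tag plus its id (or class when there is no id).
described = [(el.tag, el.get("id") or el.get("class")) for el in roots]
print(described)
```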

How to Extract Footers Using XPath

Open the Scrapy shell on a page and list every element:

scrapy shell "http://m.dai35.com/"
response.selector.xpath('//*').extract()
......
 '<a href="#">热门贷款产品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">红本抵押贷款</h3>',
 '<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
 '<a href="about.php">关于我们</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......


To select just the footer elements, test for all three forms in one predicate:

response.selector.xpath('//*[self::footer or contains(@id,"footer") or contains(@class,"footer")]').extract()
['<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>']

The natural next thought is that, to leave this part out, we can simply negate the predicate:

response.selector.xpath('//*[not(self::footer or contains(@id,"footer") or contains(@class,"footer"))]').extract()

......
 '<a href="#">热门贷款产品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">红本抵押贷款</h3>',
 '<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<div class="topBtn"><a id="btn" href="#"></a></div>',
 '<a id="btn" href="#"></a>',
 '<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
 '<a href="about.php">关于我们</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......

However, the only difference from the previous result is that one entry is now missing:

'<div class="footer">\r\n\t<div class="topBtn"><a id="btn" href="#"></a></div>\r\n\t<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>\r\n\t<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>\r\n\t<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>\r\n</div>'

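The parts below are still in the result, so the text we finally extract would still contain footer content. How should the expression be written so that the footer content is really removed from the result?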

'<div class="about" style="margin:0px;font-size:16px;"><a href="http://www.dai35.com/" target="_blank">电脑版</a>\xa0|\xa0<a href="about.php">关于我们</a></div>',
 '<a href="http://www.dai35.com/" target="_blank">电脑版</a>',
 '<a href="about.php">关于我们</a>',
 '<div class="copyRight" style="font-size:16px;line-height:2em;">版权所有©贷上我 m.dai35.com  </div>',
 '<div class="copyRight" style="color:#818181;font-size:16px; ">深圳贷上我金融服务有限公司</div>',
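
The reason is that the predicate is evaluated once per element, and //* visits every element in the document, descendants included. not(...) drops only the elements that themselves look like a footer. A child such as <div class="copyRight"> has neither a footer tag nor a footer id or class, so it passes the test on its own. A minimal sketch with lxml on a made-up page:

```python
from lxml import html

# Hypothetical page: one content block and one class="footer" block.
PAGE = """<html><body>
<div class="content"><p>main text</p></div>
<div class="footer"><div class="copyRight">copyright line</div></div>
</body></html>"""

# Naive negation: only tests each element against itself.
NAIVE = (
    '//*[not(self::footer or contains(@id,"footer") '
    'or contains(@class,"footer"))]'
)

tree = html.fromstring(PAGE)
kept = tree.xpath(NAIVE)

# Collect the class attribute of every kept element that has one.
kept_classes = [el.get("class") for el in kept if el.get("class")]
print(kept_classes)
```

The footer root is gone from kept_classes, but its copyRight child is still there.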

Exclude Footers of Any Kind in Results

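In fact, we only need to exclude the footer node itself together with all of its descendants, that is, every element that is a footer or has a footer among its ancestors. With this approach, everything footer-related is gone from the result: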

response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")])]').extract()

......
 '<a href="#">热门贷款产品</a>',
 '<div class="prolist">\r\n        <a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>\r\n        </div>',
 '<a class="prolistLink relative wid01" href="loanshow.php?cid=12&amp;tid=0&amp;id=46&amp;m=5&amp;t=12">\r\n        <img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">\r\n        <h3 class="prolist_name">红本抵押贷款</h3>\r\n         <p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>\r\n        <p class="prolist_infop2"></p>\r\n        <span class="prolist_jiantou">查看</span>\r\n        </a>',
 '<img class="prolist_img" src="http://www.dai35.com/search/uploads/image/20170620/1497944040.jpg">',
 '<h3 class="prolist_name">红本抵押贷款</h3>',
 '<p class="prolist_infop1">总利息:<font color="#e10014">0.19</font>万元 \xa0月供:<font color="#003f97">4325</font>元</p>',
 '<font color="#e10014">0.19</font>',
 '<font color="#003f97">4325</font>',
 '<p class="prolist_infop2"></p>',
 '<span class="prolist_jiantou">查看</span>',
 '<br>',
 '<br>',
 '<script>\r\nvar _hmt = _hmt || [];\r\n(function() {\r\n  var hm = document.createElement("script");\r\n  hm.src = "//hm.baidu.com/hm.js?019c6f23eb312175c188d45037833554";\r\n  var s = document.getElementsByTagName("script")[0];\r\n  s.parentNode.insertBefore(hm, s);\r\n})();\r\n</script>',
......
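
A minimal sketch of the ancestor-or-self approach with lxml on a made-up page. It also shows one pitfall worth checking: inside the inner predicate, a bare footer is a child-axis test ("has a child element named footer"), not a test on the element itself. On a page where <footer> sits directly under <body>, that variant would match <body> and throw away the whole page, so the sketch uses self::footer for the tag test. The page in this article has no <footer> tag, so the difference does not show up there.

```python
from lxml import html

# Hypothetical page with a class-based footer and a <footer> element.
PAGE = """<html><body>
<div class="content"><p>main text</p></div>
<div class="footer"><div class="copyRight">copyright line</div></div>
<footer><p>element footer</p></footer>
</body></html>"""

IS_FOOTER_ROOT = (
    'self::footer or contains(@id,"footer") or contains(@class,"footer")'
)

# Keep an element only if neither it nor any ancestor is a footer root.
KEEP = f'//*[not(ancestor-or-self::*[{IS_FOOTER_ROOT}])]'

# Variant with a bare "footer": true for any element with a <footer> child.
BARE = (
    '//*[not(ancestor-or-self::*[contains(@id,"footer") '
    'or contains(@class,"footer") or footer])]'
)

tree = html.fromstring(PAGE)
kept = [el.tag for el in tree.xpath(KEEP)]
bare = [el.tag for el in tree.xpath(BARE)]
print(kept)
print(bare)
```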

Function not(boolean) in XPath

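The key piece here is the not() function named in the heading. To exclude both the elements that are a footer or sit inside one and the elements that are themselves script, title, or style, combine two not() calls with and: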

response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")]) and not(self::script or self::style or self::title)]').extract()

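Finally, we select all the text that remains after these exclusions (see below). Compared with the text at the top of the article, the five footer strings, from '电脑版' through '深圳贷上我金融服务有限公司', are gone: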

response.selector.xpath('//*[not(ancestor-or-self::*[self::footer or contains(@id,"footer") or contains(@class,"footer")]) and not(self::script or self::style or self::title)]/text()[normalize-space(.)]').extract()

['400-004-3535',
 '一键匹配贷款',
 '(为您获取精准贷款方案)',
 '贷款金额',
 '万元',
 '搜索',
 '信用贷',
 '经营贷',
 '房贷',
 '车贷',
 '贷款攻略',
 '客服热线',
 '快速申请',
 '贷款计算器',
 '热门贷款产品',
 '红本抵押贷款',
 '总利息:',
 '0.19',
 '万元 \xa0月供:',
 '4325',
 '元',
 '查看',
 '\r\n\t\ufeff',
 '电话咨询',
 '400-004-3535',
 '贷款产品多?太难选',
 '一键委托',
 '专业为您推荐']
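
Since not() takes a boolean, the two exclusions can also be folded into a single call: not(A) and not(B) is equivalent to not(A or B). A minimal sketch with lxml on a made-up page that builds both spellings and checks that they select the same text (the constant names are only for illustration):

```python
from lxml import html

# Hypothetical page with a title, style, script and a class-based footer.
PAGE = """<html><head><title>Loans</title><style>p{color:red}</style></head><body>
<p>main text</p>
<script>var tracked = true;</script>
<div class="footer"><div class="copyRight">Copyright</div></div>
</body></html>"""

# Condition A: the element is a footer root or lies inside one.
IN_FOOTER = (
    'ancestor-or-self::*[self::footer or contains(@id,"footer") '
    'or contains(@class,"footer")]'
)
# Condition B: the element is one of the non-content tags.
NOISE_TAG = 'self::script or self::style or self::title'

TEXT = '/text()[normalize-space(.)]'
TWO_NOTS = f'//*[not({IN_FOOTER}) and not({NOISE_TAG})]{TEXT}'
ONE_NOT = f'//*[not({IN_FOOTER} or {NOISE_TAG})]{TEXT}'

tree = html.fromstring(PAGE)
two = tree.xpath(TWO_NOTS)
one = tree.xpath(ONE_NOT)
print(two)
```

Which spelling to use is a matter of taste. Two separate not() calls keep the footer rule and the tag rule easy to read and change independently.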

[Reference] https://stackoverflow.com/questions/49221014/scrapy-linkextractor-restrict-paths-exclude-tags

Reposted from blog.csdn.net/sinat_40431164/article/details/81388131