[Scrapy Spider] Customize a frequently used website yourself: strip the ads, save time

Introduction

I used Scrapy to crawl a certain US TV show site. I hadn't really wanted to, but the site carries far too many ads, and recently it split one page into six. Every visit meant opening six pages and wading through ads, my beat-up computer kept freezing, and it was driving me crazy. So I wrote my own spider: after each crawl it generates ad-free pages, and my mood instantly improved ^_^.

Before

Look: it is all ads, and the listings are split by day across six pages.

So I took matters into my own hands and customized the site myself. The screenshot below shows the result.

After

As you can see, after customization the page is much cleaner.

Demo

Demo download link:
http://download.csdn.net/detail/juwikuang/9855793

Dependencies: Python, Scrapy
To run it, just double-click run.bat.
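
The post does not show the contents of run.bat; presumably it just launches the spider through Scrapy (for example with scrapy runspider or scrapy crawl). Note that the code below is written in Python 2 style (it sorts with a cmp function and encodes strings to gbk), so it needs a Python 2 Scrapy installation. For reference, a minimal stand-alone launcher might look like the sketch below; the file names run_latest.py and latest_spider.py are my assumptions, not part of the original demo.

# run_latest.py -- hypothetical launcher; the original demo ships run.bat instead
from scrapy.crawler import CrawlerProcess
from latest_spider import LatestSpider  # assumes the spider code below is saved as latest_spider.py

process = CrawlerProcess()
process.crawl(LatestSpider)
process.start()  # blocks until the crawl ends, which fires spider_closed and writes latest.html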

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Spider against TTMEIJU.COM
Previously on ttmeiju.com, all the latest TV shows and movies
were presented on one single page. It was very convenient for users.
However, since maybe last year, ttmeiju has split that single page into
six pages, which is very annoying to me.

I miss the good old days when there was only one page......

Do you? If you do, this script is for you.

Created on Sun May 28 12:09:05 2017 

@author: Eric Chow 
""" 
import scrapy
from scrapy import signals 

class LatestSpider(scrapy.Spider):
    name = "latest" 
    start_urls = [
        "http://www.ttmeiju.com/latest-0.html",
        "http://www.ttmeiju.com/latest-1.html",
        "http://www.ttmeiju.com/latest-2.html",
        "http://www.ttmeiju.com/latest-3.html",
        "http://www.ttmeiju.com/latest-4.html",
        "http://www.ttmeiju.com/latest-5.html",
        "http://www.ttmeiju.com/latest-6.html"
    ]

    #blacklist of the tv shows
    blacklist =[]
    #html table rows
    #an item in rows is like (page number, row number, html object of the row)
    rows = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(LatestSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self, spider):
        #uncomment the next line to enable the blacklist filter (blacklist.txt)
        #self.initBlacklist()
        pass

    def spider_closed(self, spider, reason):
        #write the merged, ad-free page once the whole crawl has finished
        html = open("latest.html", "w")

        html.write("<html lang=\"en\">")
        html.write("<head>")
        html.write("<title>Eric</title>")
        #html.write("<link rel=\"stylesheet\" href=\"common.css\">")
        html.write("</head>")
        html.write("<body>")
        html.write("<table>")
        #order rows by page number first, then by row number within the page
        self.rows.sort(cmp=self.compareRow)
        for page_no, row_no, tr in self.rows:
            html.write(tr)
        html.write("</table>")
        html.write("</body>")
        html.write("</html>")
        html.close()

    def parse(self, response):
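        #each latest-N page contributes one date header row plus its
        #filtered show rows; they are merged and sorted in spider_closed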
        url = response.url
        page_no = url.replace("http://www.ttmeiju.com/latest-","").replace(".html","")
        page_no = int(page_no)
        #date
        dateString = response.css(".active::text")[1].extract().encode("gbk")
        header_tr = "<tr><th colspan=6>"+dateString+"</th></tr>"
        self.rows.append((page_no,-1,header_tr))
        rows = response.css(".latesttable tr")
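        #skip row 0, which holds the table's column headers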
        for row_no in range(1,len(rows)):
            title_u = rows[row_no].css("td")[1].css("a::attr(title)").extract_first()
            title = title_u.encode("gbk")

            if self.inBlacklist(title):
                continue


            tr = rows[row_no].extract().encode("gbk")
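            #make image paths local file names and turn relative links into absolute ones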
            tr = tr.replace("/Application/Home/View/Public/static/images/","")
            tr = tr.replace("href=\"/", "href=\"http://www.ttmeiju.com/")
            #added 2017-7-31
            tr = tr.replace("<span class=\"loadspan\"><img width=\"20px;\" src=\"loading.gif\"></span>","")
            tr = tr.replace("style=\"display:none;\"","")
            #end added 2017-7-31

            #if you want to filter out tv shows without subtitles,
            #uncomment this.
            #u'\u65e0\u5b57\u5e55' = "wu zi mu" = no subtitles
            if u'\u65e0\u5b57\u5e55'.encode("gbk") in tr:
                continue

            #if you want to filter out tv shows with subtitles,
            #uncomment this.
            #u'\u5185\u5d4c\u53cc\u8bed\u5b57\u5e55' = "nei qian shuang yu zimu"
#            if u'\u5185\u5d4c\u53cc\u8bed\u5b57\u5e55'.encode("gbk") in tr:
#                continue

            #if you want to filter out tv shows with solution lower than 720p,
            #uncomment this
            #u'\u666e\u6e05' = u"pu qing"
            if u'\u666e\u6e05'.encode("gbk") in tr:
                continue

            self.rows.append((page_no,row_no,tr))


    def initBlacklist(self):
        fh = open('blacklist.txt')
        self.blacklist = fh.readlines() 
        fh.close()
        for i in range(0,len(self.blacklist)):
            self.blacklist[i] = self.blacklist[i].replace("\n","")

    def inBlacklist(self,title):
        for b in self.blacklist:
            if b in title:
                return True
        return False

    def compareRow(self,a,b):
        #sort by page number first, then by row number within the same page
        a_p, a_r, a_row = a
        b_p, b_r, b_row = b
        return (a_p * 1000 + a_r) - (b_p * 1000 + b_r)

Please adapt the code to your own needs.
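
For example, the blacklist filter reads a plain-text file named blacklist.txt: initBlacklist loads one show name per line (stripping the newline), and inBlacklist drops any row whose title contains one of those names as a substring. To turn it on, uncomment the self.initBlacklist() call in spider_opened and place a file like the following next to the spider (the show names here are only placeholders):

The Bachelor
Big Brother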

Update: July 31, 2017

Updated on July 31, 2017: the target site changed its page code, so the spider was adjusted accordingly.
The silly part of their code is that the page loads the download links first and only afterwards checks whether you are logged in. That keeps out human visitors but not crawlers. Is that a deliberate green light for spiders?

At first I thought a login was required and spent quite a while investigating. It turned out that simply running
scrapy shell http://www.ttmeiju.com/latest.html
shows the download links directly; no login is needed at all.
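
In that scrapy shell session you can also try the selectors the spider relies on, roughly like this (assuming latest.html shares the table markup of the latest-N pages):

# inside the interactive prompt opened by scrapy shell
rows = response.css(".latesttable tr")                      #listing table rows
rows[1].css("td")[1].css("a::attr(title)").extract_first()  #title of the first show
response.css(".active::text")[1].extract()                  #date string used for the header row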

So silly, so naive.

Reposted from blog.csdn.net/juwikuang/article/details/72809243