Python Crawlers: Checking Proxies

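Most of us are familiar with writing crawlers in Python, and most of us also know that many websites have ways of detecting IP addresses. A crawler that requests a site too often generates a lot of traffic for it, so sites generally dislike robot visitors (that is, the crawlers we write).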

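That doesn't mean we stop crawling; crawler enthusiasts always find a way. Since a site can ban our IP, we can send requests through proxy IPs instead, which in principle solves the problem.
But when I crawled a list of proxy IPs, I ran into a serious issue: not all of them actually work. In fact, only a small fraction do. That is where the method below comes in.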

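Checking whether proxy IPs are usable
The image below shows the many proxy IPs I crawled.
[Image 1: the crawled proxy list]
After processing, we get a new file that contains only the IPs that work.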

Doesn't the cleaned-up list above look much nicer? With working proxies on hand, a blocked IP is far less of a worry.

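It's actually quite simple, and the code below does it.
First, put the crawled IPs in the file proxy.txt, one proxy per line, with the columns in the order shown in Image 1 (the script splits each line on tabs). Then run the code below; it writes the IPs that pass the check to a newly created file, alive.txt.
Note: I wrote this checker in Python 2.7; it does not run on Python 3.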

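import re
import threading
import urllib2


class ProXY(object):
    def __init__(self):
        self.sFile = r'proxy.txt'   # input: one tab-separated proxy per line
        self.dFile = r'alive.txt'   # output: the proxies that passed the check
        self.URl = r'http://www.baidu.com/'
        self.threads = 30           # number of proxies checked at the same time
        self.timeout = 3            # seconds to wait for each proxy
        self.regex = re.compile(r'baidu\.com')
        self.aliveList = []

        self.run()

    def run(self):
        with open(self.sFile, 'r') as f:
            # Drop blank lines and trailing newlines.
            lines = [line.rstrip('\r\n') for line in f if line.strip()]
        # Check the proxies in batches of self.threads, and wait for each batch
        # to finish so that aliveList is complete before it is written out.
        for start in xrange(0, len(lines), self.threads):
            batch = [threading.Thread(target=self.linkWithProxy, args=(line,))
                     for line in lines[start:start + self.threads]]
            for t in batch:
                t.start()
            for t in batch:
                t.join()
        with open(self.dFile, 'w') as f:
            for line in self.aliveList:
                f.write(line + '\n')

    def linkWithProxy(self, line):
        lineList = line.split('\t')
        if len(lineList) < 5:
            return  # skip malformed lines
        protocol = lineList[4].lower()
        server = protocol + r'://' + lineList[0] + ':' + lineList[1]
        # The test URL is plain HTTP, so the proxy must be registered under the
        # 'http' key; under any other key urllib2 would connect directly and
        # report a dead proxy as alive.
        proxy = 'http://' + lineList[0] + ':' + lineList[1]
        # Build one opener per thread. urllib2.install_opener() sets a single
        # process-wide opener, which concurrent threads would overwrite.
        opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
        try:
            response = opener.open(self.URl, timeout=self.timeout)
            body = response.read()
        except Exception:
            print('%s connect failed\n' % server)
            return
        # Only count the proxy as alive if the page really came from the target.
        if self.regex.search(body):
            print('%s connect success ..............\n' % server)
            self.aliveList.append(line)


if __name__ == '__main__':
    TP = ProXY()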
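
To make the expected proxy.txt format more concrete, here is a small sketch of how a single line might be parsed. It is written in Python 3 rather than the Python 2.7 of the script above, and it assumes tab-separated columns with the IP first, the port second, and the protocol fifth, which is how the script indexes them. The names `ProxyEntry`, `parse_proxy_line` and `proxy_url`, and the port range check, are hypothetical additions for illustration only.

```python
from typing import NamedTuple, Optional


class ProxyEntry(NamedTuple):
    host: str
    port: int
    protocol: str


def parse_proxy_line(line: str) -> Optional[ProxyEntry]:
    """Parse one tab-separated line of proxy.txt; return None if malformed."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        return None  # the assumed layout needs at least five columns
    host = fields[0].strip()
    protocol = fields[4].strip().lower()
    try:
        port = int(fields[1])
    except ValueError:
        return None  # port column is not a number
    if not host or not 0 < port < 65536:
        return None
    return ProxyEntry(host, port, protocol)


def proxy_url(entry: ProxyEntry) -> str:
    """Build the URL used to talk to the proxy itself (plain HTTP)."""
    return "http://%s:%d" % (entry.host, entry.port)
```

Calling `parse_proxy_line` on a line such as `"203.0.113.5\t8080\tCN\tanonymous\tHTTP"` (a made-up entry using a documentation-only address; the third and fourth columns are placeholders) gives an entry whose `proxy_url` can be passed to a proxy handler, while malformed lines come back as `None` instead of raising inside a worker thread.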

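This was written by a beginner, so it is rough and leaves plenty of room for optimization. Experts, please go easy on me. Thanks!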
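
As one possible direction for that optimization, the same check can be sketched in Python 3 with a thread pool from the standard library. This is a minimal sketch, not a drop-in replacement for the script above: the function names `is_alive` and `filter_alive` are hypothetical, each proxy is assumed to be given as a URL such as `http://host:port`, and the test URL, timeout, worker count and `baidu.com` marker are simply carried over from the script.

```python
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://www.baidu.com/"    # same test page as the script above
MARKER = re.compile(rb"baidu\.com")   # the response body is bytes in Python 3


def is_alive(proxy, url=TEST_URL, timeout=3):
    """Return True if `url` can be fetched through `proxy` and looks genuine."""
    # One opener per call, so concurrent checks never share proxy settings.
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return MARKER.search(response.read()) is not None
    except Exception:
        # Timeouts, refused connections, bad responses: all count as dead.
        return False


def filter_alive(proxies, check=is_alive, workers=30):
    """Check proxies concurrently; return the working ones in input order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order and waits for every check to finish.
        results = list(pool.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

The pool keeps at most `workers` checks running at once and only returns when all of them are done, so the result can be written to alive.txt directly. Passing `check` as a parameter also makes it possible to try the filtering logic with a fake checker, without touching the network.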
