Building a dynamic IP proxy pool in 250 lines of code

Prerequisites: requests, BeautifulSoup, re, the redis database, flask (only a little; you can copy it the way I did), and a working understanding of Python classes.

We know that when crawling web pages, especially in large volume, some sites have anti-crawler measures, and blocking your IP is one of them. What do you do when your IP is blocked? Very simple: switch to another IP and keep crawling. But where do you find these IPs? You can buy them from a proxy site (a bit expensive), or you can collect free ones from the Internet; most proxy platforms offer some free proxies you can use. Obviously the quality of these free proxies is not high, and it is fair to say that barely one in ten works.
As a student I have no money to buy IPs, so I can only use the free ones, but I can test them one by one. That is how I came up with the idea of building an IP pool: crawl proxies from free-proxy pages, test them, keep the ones that work and discard the rest.
Steps and ideas
1. First, crawl the free-proxy websites and extract the free IPs (BeautifulSoup).
2. Most of the crawled IPs will be useless, so the next step is to test each one (requests).
3. Store the working IPs in a database so we can use them at any time (redis).
4. IPs in the database will go stale, so we need to re-test them regularly and discard the ones that no longer work.
5. Provide an interface so other programs can easily fetch the stored IPs (flask).
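The imports later in the post imply the following file layout; everything lives in a package called ProxyFile (the config.py entry is inferred from the imports, since the post never shows that file, and a sketch of it appears further down):

```
ProxyFile/
├── config.py        # creates the shared redis connection R (inferred; sketch below)
├── IP_store.py      # store / fetch IPs in redis
├── page_parser.py   # crawl the free-proxy pages and test the IPs
├── list_IP_test.py  # re-check IPs already in the pool
├── api.py           # flask interface for other programs
└── scheduler.py     # drives the crawling and re-checking
```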

Now let's go through the code step by step.
First comes the code for storing IPs, because everything else depends on it. An IP_store.py file handles storing and retrieving IPs, using redis's list data structure:

```python
# coding:utf-8

# This part handles proxy storage: the crawled proxies are stored in the database

from ProxyFile.config import *


class Redis_Operation:
    def put_head(self,ip):
        # push a working IP onto the head of the redis list
        R.lpush('IP_list',ip)

    def get_head(self):
        # pop an IP from the head of the list
        return R.lpop('IP_list')

    def get_tail(self):
        # pop an IP from the tail of the list, used for re-checking
        return R.rpop('IP_list')

    def list_len(self):
        # return the length of the list
        return R.llen('IP_list')

RO=Redis_Operation() # create one instance; the other files import and use it
```
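Note that config.py is never shown in this post, but the star import above has to supply the redis connection object R that Redis_Operation uses. A minimal sketch of what it might contain, assuming redis runs locally on the default port (host, port and db are my assumptions):

```python
# coding:utf-8
# config.py (not shown in the original post): it only needs to expose the
# shared redis connection R that IP_store.py pulls in through its star import
import redis

# assumed defaults; point these at your own redis server
R = redis.StrictRedis(host='localhost', port=6379, db=0)
```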

The second step is to crawl the proxy pages and test whether each captured IP actually works; the working ones are stored in the database, and the pool is capped at about 30 IPs. I put this in a page_parser.py file.

```python
# coding:utf-8
import requests,re # used to parse the pages
from bs4 import BeautifulSoup as BF
import threading # threading for concurrent crawling
from ProxyFile.IP_store import * # another file of mine, used to store IPs into redis
# parse the free-proxy pages and return the free proxies from each site


class IP_page_parser:
    def __init__(self):
        pass

    def page_manong(self):
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        html=requests.get('https://proxy.coderbusy.com/classical/https-ready.aspx',verify=False,headers=headers)
        # verify=False skips the SSL certificate check; headers makes the request look like it comes from a browser
        if html.status_code == 200:
        # make sure the page actually came back
            Soup=BF(html.text,'lxml')
            tbody=Soup.find('tbody')
            tr_list=tbody.find_all('tr')
            for tr in tr_list:
                try:
                    IP_adress=tr.find('td').get_text().strip()
                    IP_port=tr.find('td',class_="port-box").get_text()
                    IP="http://"+IP_adress+":"+IP_port
                    # build the proxy URL by string concatenation
                    proxies={'http':IP}
                    try:
                        html=requests.get('http://www.baidu.com',proxies=proxies)
                        RO.put_head(IP)
                        if RO.list_len() > 30:
                        # if more than 30 IPs are stored, leave this function
                            return
                        print('valid IP')
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
        else:
            print('coderbusy proxy page error')

    def page_kuai(self):
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        html=requests.get('https://www.kuaidaili.com/free/',headers=headers,verify=False)
        if html.status_code == 200:
            Soup=BF(html.text,'lxml')
            tbody=Soup.find('tbody')
            tr_list=tbody.find_all('tr')
            for tr in tr_list:
                try:
                    IP_adress=tr.find('td').get_text()
                    IP_port=tr.find('td',attrs={'data-title':"PORT"}).get_text()
                    IP="http://"+IP_adress+":"+IP_port
                    proxies={'http':IP}
                    try:
                        html = requests.get('http://www.baidu.com', proxies=proxies)
                        RO.put_head(IP)
                        if RO.list_len() > 30:
                            return
                        print('valid IP')
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
        else:
            print('kuaidaili page error')
    def page_xici(self):

        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        html=requests.get("http://www.xicidaili.com/",headers=headers,verify=False)

        if html.status_code == 200:
            htmltext=html.text
            pattern=re.compile(r'td.*?img.*?</td>\s*?<td>(.*?)</td>\s*?<td>(\d+)</td>',re.S)
            IP_zu=pattern.findall(htmltext)
            for tr in IP_zu:
                try:
                    IP='http://'+tr[0]+':'+tr[1]
                    try:
                        proxies = {'http': IP}
                        html = requests.get('http://www.baidu.com', proxies=proxies)
                        RO.put_head(IP)
                        if RO.list_len() > 30:
                            return
                        print('valid IP')
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
        else:
            print('xicidaili page error')

    def page_data5u(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}

        html = requests.get("http://www.data5u.com/free/gnpt/index.shtml", headers=headers, verify=False)
        if html.status_code == 200:
            Soup=BF(html.text,'lxml')
            li=Soup.find('li',style="text-align:center;")
            ul=li.find_all('ul',class_="l2")
            for tr in ul:
                try:
                    IP_adress=tr.find('span').get_text()
                    IP_port=tr.find('span',style="width: 100px;").get_text()
                    IP="http://"+IP_adress+":"+IP_port
                    try:
                        proxies = {'http': IP}
                        html = requests.get('http://www.baidu.com', proxies=proxies)
                        RO.put_head(IP)
                        if RO.list_len() > 30:
                            return
                        print('valid IP')
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
class run_parser:
# this class lets other files call the functions and methods in this file
    # used to run the parsers above
    def Run_Parser(self):
        x = IP_page_parser()
        process_list = []
        # start several threads so multiple pages are crawled and tested at the same time
        t1 = threading.Thread(target=x.page_manong, args=())
        process_list.append(t1)
        t2 = threading.Thread(target=x.page_kuai, args=())
        process_list.append(t2)
        t3 = threading.Thread(target=x.page_xici, args=())
        process_list.append(t3)
        t4 = threading.Thread(target=x.page_data5u, args=())
        process_list.append(t4)

        for i in process_list:
            i.start()
        for i in process_list:
            i.join()

RP=run_parser() # an instance of the class above, for other files to import

if __name__=='__main__':
    x=IP_page_parser()
    process_list=[]
    t1=threading.Thread(target=x.page_manong,args=())
    process_list.append(t1)
    t2=threading.Thread(target=x.page_kuai,args=())
    process_list.append(t2)
    t3=threading.Thread(target=x.page_xici,args=())
    process_list.append(t3)
    t4=threading.Thread(target=x.page_data5u,args=())
    process_list.append(t4)

    for i in process_list:
        i.start()
    for i in process_list:
        i.join()
```
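As a side note, the four page_* methods above repeat the same test-and-store logic. A small helper like the hypothetical one below (the name test_and_store and the 5-second timeout are my own choices, not part of the original code) could factor it out and also keep a dead proxy from hanging a thread:

```python
def test_and_store(IP):
    # hypothetical helper, relying on the module's existing imports (requests, RO);
    # tests one proxy against baidu and stores it if the request goes through
    proxies = {'http': IP}
    try:
        requests.get('http://www.baidu.com', proxies=proxies, timeout=5)
        RO.put_head(IP)
        print('valid IP')
    except Exception:
        print('invalid IP')
    # tell the caller whether to keep going (pool not yet over 30 IPs)
    return RO.list_len() <= 30
```

Each page_* method would then only build the IP string and call `if not test_and_store(IP): return`.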

The code above is the longest part, about one hundred and fifty lines, and most of it is repeated boilerplate (which is what the helper sketch just above would remove), so it should read easily. That step does most of the work. Next we need to check whether the IPs already in the database are still usable; note that at any moment exactly one of the two steps runs, either crawling new IPs or re-checking stored ones (the scheduler further down takes care of that). Here is the code, in list_IP_test.py:

```python
import requests
from ProxyFile.IP_store import Redis_Operation as R_O
# note the reference to the IP_store.py file
from ProxyFile.IP_store import *


class List_Ip_test:

    def get_and_test(self):
        # take one IP from the tail of the list
        ip=str(RO.get_tail(),encoding='utf-8')
        # everything redis returns is bytes, so we have to convert it to str and
        # pass the encoding argument; see the advanced topics in "Learning Python"
        proxies = {'http': ip}
        # test whether the IP is still usable; a dead proxy makes requests raise
        # an exception, so catch it instead of letting it crash the scheduler
        try:
            html = requests.get('http://www.baidu.com', proxies=proxies)
            if html.status_code == 200:
                RO.put_head(ip)
                print('valid IP')
            else:
                print('discard the useless IP')
        except Exception:
            print('discard the useless IP')

LIT=List_Ip_test() # create an instance for other files to import
```
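As the comment above says, redis-py hands back bytes by default, which is why get_tail() is wrapped in str(..., encoding='utf-8'). An alternative, purely a suggestion and not what this post does, is to create the connection in config.py with decode_responses=True so every value already comes back as str:

```python
import redis

# with decode_responses=True, lpop/rpop return str instead of bytes,
# so the str(..., encoding='utf-8') conversion is no longer needed
R = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)
```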

Well, now most of the work really is done; what remains are the calling files, for which I will just paste the code.
First, api.py, the interface file:

```python
# coding:utf-8

# the interface: lets other programs fetch the usable IPs this program has collected


from flask import Flask
from ProxyFile.IP_store import *


__all__ = ['app']

app = Flask(__name__)

@app.route('/')
def get_proxy():
    return  RO.get_head()

app.run() # run this file, then open localhost:5000 in a browser and an IP will appear
```
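Keep in mind that get_head() uses lpop, so every request to this interface permanently removes one IP from the pool. If you would rather hand out an IP without consuming it, a variant like the following could read the head of the list instead (my own tweak, not part of the original; R is the redis connection pulled in through the star imports):

```python
@app.route('/peek')
def peek_proxy():
    # lindex reads the first element without removing it from the list
    ip = R.lindex('IP_list', 0)
    return ip if ip is not None else 'pool is empty'
```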

Next is the scheduler.py file, which drives the whole program:

```python
# coding:utf-8

# some calls against the redis database: checking IPs and adding IPs
from ProxyFile.page_parser import *
from ProxyFile.IP_store import Redis_Operation as R_O
from ProxyFile.IP_store import *
from ProxyFile.list_IP_test import *
import time

class Add_and_Check:
    def add_and_check(self):
        # if the pool holds fewer than 30 IPs, crawl the pages; otherwise keep testing whether the stored IPs still work
        while True:
        # the program runs forever, calling either Run_Parser() or get_and_test()
            if RO.list_len() < 30:
                RP.Run_Parser()
            else:
                LIT.get_and_test()
            time.sleep(30) # once the database holds thirty IPs, rest for a while before running again


AC=Add_and_Check()
AC.add_and_check()
```
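Note that scheduler.py loops forever and api.py blocks in app.run(), so they are two separate long-running scripts. Assuming you start them from the directory that contains the ProxyFile package, something like this should work:

```
python -m ProxyFile.scheduler   # terminal 1: keeps the pool filled and re-checked
python -m ProxyFile.api         # terminal 2: serves IPs on http://127.0.0.1:5000
```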

**That is all the code for the whole program. To fetch an IP from another program, you can use this snippet:**

```python
import requests

def get_proxy():
    r = requests.get('http://127.0.0.1:5000')
    return r.text  # this is the usable IP we want
```
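Once the other program has fetched a proxy this way, it plugs it into requests exactly like the test calls earlier in this post, for example (the target URL is just a placeholder):

```python
import requests

def crawl_with_proxy(url='http://www.baidu.com'):
    proxy = get_proxy()          # e.g. 'http://1.2.3.4:8080' from the flask interface above
    proxies = {'http': proxy}    # same form as used in page_parser.py
    return requests.get(url, proxies=proxies, timeout=5).text
```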


**Although it is finished, I always feel the program's robustness is not great, but I can't quite put my finger on why. If you can work it out, please leave a comment and tell me, thanks.**
Finally, a screenshot of the IPs in the database:
![Viewed with a Redis GUI tool](https://img-blog.csdn.net/20180423154621759?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2tpbGxlcmk=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
