Python singleton mode to get IP proxy

tags: python, singleton pattern, getting IP proxies


Introduction: I've been learning Python recently. First, a word on why. One reason is that it is easy to use: to accomplish the same task, you need far less code than in other languages, and there are so many rich libraries that in the early stages you basically never have to build your own wheels. The other reason is that it is very popular right now, so all kinds of material can be found online, and the quality is acceptable. Now, on to the topic.

Why do you need a proxy

Although Python can do many things, the first thing most of us think of is web crawlers. A crawler fetches web pages and analyzes their content to extract information. Languages such as PHP can also crawl pages using curl, but in the number and ease of use of crawler libraries they are no match for Python.

Anyone with some networking knowledge will know that many websites have anti-crawling policies, or simply refuse service to an IP address that sends requests too frequently. When I first started writing crawlers, I was often banned for visiting too often. Crawling with only your own IP address therefore has real limits, and a proxy solves this problem.
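To make the idea concrete, here is a minimal sketch of how a proxy is passed to the `requests` library (the proxy address below is a placeholder, not a real proxy):

```python
import requests

def build_proxies(proxy_url):
    # requests expects a mapping from URL scheme to proxy address
    return {'http': proxy_url, 'https': proxy_url}

proxies = build_proxies('http://1.2.3.4:8080')  # placeholder address
# With a live proxy, the request below would appear to come from the proxy's IP:
# requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
```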

what is a proxy

As a programmer, I think it is necessary to understand some basic networking knowledge, such as network proxies.
I don't want to copy and paste an introduction from somewhere else, because I think that is lazy, so I will describe proxies as I understand them.
If you don't know about network proxies, you probably know about purchasing agents: if you want to buy something but don't want to buy it yourself, you can ask an agent to buy it for you. Likewise, a network proxy is an intermediary between you and the destination network, as in the diagram below.


Alice->proxy: I want to get sth from Bob
proxy->Bob: give me sth
Note right of Bob: Bob thinks
Bob-->proxy: there is sth!
proxy-->Alice: Bob gives you sth

One problem here is that ordinary proxies are easy to detect, and some websites refuse proxy access entirely. In that case a high-anonymity (elite) proxy solves the problem. There is not much more to say about proxies; if you are interested, look into them yourself.

where to get a proxy

This one is simple: a Baidu search for network proxies turns up plenty of free ones. Free proxies are generally unstable, but they should be fine for everyday needs. If you need a stable proxy, it is better to honestly pay for one rather than be penny-wise and pound-foolish.

For example, two proxies that are often recommended online:
Xici proxy: http://www.xicidaili.com/nn/
The Kuaidaili proxy used in this article: https://www.kuaidaili.com/
I used Xici at first, but was banned once for visiting too frequently; the ban was lifted after a few days. In the meantime I switched to Kuaidaili and rewrote my rules to fetch proxies only once an hour.

code

The Python version used in this article is 3.6.5; for version 2.7 the code needs minor adjustments.

User-Agent

These strings are used to simulate different browsers; just copy the list directly. My file is named user_agents.py.

#!/usr/bin/python
# -*- coding:utf-8 -*-
'''
Created on 2018-04-27

@author: Vinter_he
'''

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',

    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
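With the list in place, a random User-Agent header can be drawn for each request. A minimal sketch (the two strings below stand in for the full list above):

```python
import random

user_agents = [
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
]

def random_headers(agents):
    # pick one string so consecutive requests present different browsers
    return {'User-Agent': random.choice(agents)}

headers = random_headers(user_agents)
```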

get proxy code

#!/usr/bin/python
# -*- coding:utf-8 -*-
'''
Fetch Kuaidaili proxy IPs; results are strings in the form https://ip:port
'''
from lxml import etree
import random
import time
import requests
import user_agents

class geKuaidailiIp:
    __instance = None
    # singleton: only ever create one instance
    def __new__(cls):
        if cls.__instance is None:
            cls.__instance = object.__new__(cls)
            cls.init(cls)
        return cls.__instance

    def init(self):
        print('initializing')
        self.proxieList = []
        # force a refresh on the first call to getProxies()
        self.lastTime = time.time() - 3601
        self.agencyUrl = 'https://www.kuaidaili.com/free/'
        self.userAgents = user_agents.user_agents

    # pick a random User-Agent header
    def getUserAgent(self):
        userAgent = random.choice(self.userAgents)
        return {
            'User-Agent': userAgent
        }

    def getHtml(self, url):
        response = requests.get(url=url, headers=self.getUserAgent(), timeout=10).text
        html = etree.HTML(response)
        return html

    # parse one page of proxy IPs
    def parseHtmlToGetIpList(self, url):
        html = self.getHtml(url)
        ip = html.xpath('//tr/td[@data-title = "IP"]')
        port = html.xpath('//tr/td[@data-title = "PORT"]')
        type = html.xpath('//tr/td[@data-title = "类型"]')
        return type, ip, port

    # fetch five pages and splice them into one list
    def getProxies(self):
        # refresh at most once an hour, otherwise the site bans you
        if time.time() - self.lastTime > 60 * 60:
            self.proxieList = []
            self.lastTime = time.time()
            # only the first five pages: later pages contain more dead proxies
            for page in range(5):
                url = self.agencyUrl + 'inha/' + str(page + 1) + "/"
                type, ip, port = self.parseHtmlToGetIpList(url)
                count = len(port)
                for i in range(count):
                    self.proxieList.append(type[i].text + "://" + ip[i].text + ":" + port[i].text)
                time.sleep(1)
            print('proxies refreshed')
        return self.proxieList

    def getRandomAgencyIp(self):
        self.getProxies()
        ip = random.choice(self.proxieList)
        return ip


# test driver: create the singleton and print a random proxy
# agency = geKuaidailiIp()
# while True:
#     print(agency.getRandomAgencyIp())
#     time.sleep(random.randint(4, 10))
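A proxy string returned by `getRandomAgencyIp()` looks like `https://1.2.3.4:8080`, and `requests` wants it keyed by scheme. A hypothetical helper (not part of the original class) could adapt it:

```python
def proxy_to_mapping(proxy):
    # proxy is e.g. 'https://1.2.3.4:8080'; key the mapping by its scheme
    scheme = proxy.split('://', 1)[0]
    return {scheme: proxy}

mapping = proxy_to_mapping('https://1.2.3.4:8080')
# With a live proxy the mapping plugs straight into requests:
# requests.get('https://example.com', proxies=mapping, timeout=10)
```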

Why use the singleton pattern

If you can guarantee that you only ever create one proxy object, you don't have to use the singleton pattern. But many readers may put the object-creation code inside a loop, fetching proxies so frequently that their IP gets banned by the proxy site. The singleton pattern guarantees there is only one object for the lifetime of the script: if the object has already been created, the constructor simply returns it. This keeps the Kuaidaili pages from being hit too often; in the code above, the proxy list is refreshed at most once an hour.
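The mechanism can be isolated in a few lines. A minimal sketch of the same `__new__`-based singleton, stripped of the proxy logic:

```python
class Singleton:
    __instance = None

    def __new__(cls):
        # create the instance only on the first call; afterwards return it as-is
        if cls.__instance is None:
            cls.__instance = object.__new__(cls)
        return cls.__instance

a = Singleton()
b = Singleton()
print(a is b)  # True: both names refer to the same object
```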

Off topic

In fact, I was first exposed to Python seven or eight years ago in school, when it was nowhere near as popular as it is now. While looking up how to become a hacker, I found it listed as a required skill. It was fun, but learning materials were scarce, so I gave up before long. Now, driven by artificial intelligence and big data, Python has, according to a statistic I saw a few days ago, become the number one scripting language (PHP is the best language in the world, brothers, don't flame me), so a few months ago I started spending a little spare time each day learning it. Fortunately I already knew a few languages, so it has been fairly easy to pick up. If you are also a programmer and have the energy, I hope you too can spend some spare time learning something, improving yourself, and sharing it with everyone.
