A record of crawling proxy IPs with Python and putting them to use

Foreword

  • First, a word on how proxy IP transit works (code will follow): a proxy IP can hide your real IP. You visit the site through a proxy server, which relays the request, so the target server only sees the proxy server's IP address. In effect, your own IP address becomes invisible.
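As a minimal sketch of that transit idea (the proxy address below is a made-up placeholder, not a working value; httpbin.org/ip simply echoes back the IP it sees):

import requests

# Hypothetical proxy address for illustration only; substitute a real one
proxies = {'http': 'http://123.45.67.89:8080'}
# The echoed IP will be the proxy's address, not ours
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
print(response.text)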

Preparation

  • I used this platform: https://www.kuaidaili.com/. Before crawling, check the platform's robots protocol at https://www.kuaidaili.com/robots.txt to see whether the pages we want are off-limits to crawlers; if they are not, we can crawl with peace of mind. (A programmatic version of this check appears after this list.)
  • Press F12 to analyze the first page: click the arrow icon in the top-left corner of the developer tools, select an IP on the page, then right-click the highlighted element and copy its XPath directly.
  • Testing showed that the IPs are not delivered through an API; they sit in the static page itself, which saves a lot of work.
  • Also, clicking Next barely changes the URL: only the page number at the end moves.
  • The URL pattern is so simple that there is little to analyze, so straight to the code.
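For a programmatic version of the robots.txt check mentioned above, Python's standard urllib.robotparser can read the same file (a minimal sketch; the free-listing path matches the URL used in the code below):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.kuaidaili.com/robots.txt')
rp.read()
# True means the default crawler ('*') is allowed to fetch this page
print(rp.can_fetch('*', 'https://www.kuaidaili.com/free/inha/1/'))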

The Code

  • First, crawl the first five pages. (Note that we must add headers here to simulate browser access.)
# Crawl the data
import requests
import parsel
import time

def get_ip():
    proxies_list = []    # collected proxies (the original relied on a global list)
    for page in range(1, 6):    # pages 1-5
        print("=============================Scraping data from page {}==============".format(page))
        base_url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page)
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}

        response = requests.get(base_url, headers=headers)
        data = response.text
        #print(data)
        html_data = parsel.Selector(data)
        # Parse the rows of the proxy table
        parsel_list = html_data.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')
        for tr in parsel_list:
            proxies_dict = {}
            http_type = tr.xpath('./td[4]/text()').extract_first()  # locate the protocol type with XPath
            ip_num = tr.xpath('./td[1]/text()').extract_first()
            ip_port = tr.xpath('./td[2]/text()').extract_first()
            # requests expects lowercase scheme keys such as 'http'
            proxies_dict[http_type.lower()] = ip_num + ':' + ip_port    # join IP address and port with ':'
            proxies_list.append(proxies_dict)
            print(proxies_dict)
            time.sleep(0.5)    # be gentle with the server
        print(proxies_list)
        print("Number of proxy IPs collected:", len(proxies_list))
    return proxies_list
  • Next, given that some IPs work and some do not, the list needs cleaning: drop the proxies that are unusable or too slow. Here we can try fetching the Baidu home page through each proxy IP and use the returned status code to decide whether the IP is usable.
def check_ip(proxies_list):
    """Check the quality of the proxy IPs."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}
    can_use = []

    for proxy in proxies_list:
        try:
            # This is how a proxy is passed to requests; a lower timeout keeps only the fastest IPs
            # (0.08 s is very strict; raise it if you want to keep more proxies)
            response = requests.get('https://www.baidu.com', headers=headers, proxies=proxy, timeout=0.08)
            if response.status_code == 200:    # status code 200 means the proxy works
                can_use.append(proxy)
        except Exception as e:
            print(e)
    return can_use
  • Wiring the pieces together, the crawling part is done.
ip_list = get_ip()              # fetch the IPs
can_use = check_ip(ip_list)     # clean the IPs
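For reference, each entry of can_use is a one-key dict in exactly the shape requests expects for its proxies argument; the addresses here are illustrative:

# e.g. can_use == [{'http': '123.45.67.89:8080'}, {'http': '98.76.54.32:3128'}]
print("usable proxies:", len(can_use))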

Using the Proxy IPs

  • On a whim I thought I could boost a live-stream room's popularity by entering it through proxy IPs. After experimenting I found I was too naive: the experiment failed and the popularity did not increase. Still, you can pass in other URLs to access a fixed site through the proxy IPs; just pass the can_use list obtained above as the can_use parameter.
def start(url, can_use):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}

    for proxy in can_use:
        try:
            response = requests.get(url, headers=headers, proxies=proxy, timeout=1)
            if response.status_code == 200:
                print("Entered the live stream room...")
        except Exception as e:
            print(e)
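A hedged usage example (the URL is a placeholder; substitute whatever fixed site you want to reach through the proxies):

start('https://www.example.com/', can_use)    # placeholder URL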
  • Second, I suspect that crawling web content through proxy IPs might bypass a site's anti-crawling policy; that is just a thought, though, and I have not tried it. A sketch of what such rotation might look like follows.
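I have not tested this, but under that assumption a rotation sketch could pick a random cleaned proxy per request (reusing can_use from above; the helper name and target URL are hypothetical):

import random
import requests

def fetch_with_rotation(url, can_use):
    """Send each request through a randomly chosen cleaned proxy."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}
    proxy = random.choice(can_use)    # a different proxy may serve each call
    return requests.get(url, headers=headers, proxies=proxy, timeout=3)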
  • Alternatively, the proxy IPs can be written to a database and drawn on gradually over time.
# Save to the database
import pymysql

def engine_in(ip_list):
    conn = pymysql.connect(host='localhost', user='root', password='123', database='size', port=3306)  # connect to the database
    cursor = conn.cursor()
    for ip in ip_list:
        sql = "INSERT INTO ip(ip) VALUES (%s);"  # parameterized SQL avoids quoting problems
        cursor.execute(sql, (ip,))  # execute the SQL statement
    conn.commit()
    conn.close()
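One hedged note on input shape: proxies_list holds one-key dicts, while engine_in inserts plain strings, so flatten the dicts first (a sketch):

# Flatten {'http': 'ip:port'} dicts into plain 'ip:port' strings before inserting
flat_ips = [addr for proxy in can_use for addr in proxy.values()]
engine_in(flat_ips)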

Postscript

  • Tip: before writing a crawler, check the site's robots.txt to see whether crawling is allowed, and only crawl the appropriate data within the permitted scope.

I learned this idea of crawling proxy IPs from a learning platform; if anything here infringes, please contact me and I will delete it.


Origin: blog.csdn.net/weixin_44371842/article/details/105219917