[Python crawler] Scraping Xici IP proxies

The Xici proxy page is: http://www.xicidaili.com/nn

Notes:

 1. Never scrape Xici through a proxy. At the moment, neither 66ip proxies nor Xici's own proxies can fetch the Xici pages.
 2. Always add a User-Agent header (see the sketch after this list).
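
For note 2, here is a minimal sketch of adding the header with urllib alone; the UA string is just an example, and the full scripts below use the author's pa() helper instead:

from urllib import request

# minimal sketch: any common browser UA string will do here
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
req = request.Request('http://www.xicidaili.com/nn', headers=headers)
html = request.urlopen(req).read().decode('utf-8')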

1. Code to scrape the pages into a csv file

from urllib import request     # import the request module
from piaot import *            # custom helper package; see the blog post [python伪装定义包]伪装包
import re                      # import the regex module

# open a csv file to append the scraped data to
f = open('C:/Users/黑神/Desktop/爬虫/西刺代理ip.csv', 'a', encoding='utf-8')

# loop over the page numbers
for t in range(1,3):

    if t==1:
        url = 'http://www.xicidaili.com/nn'
    else:
        url='http://www.xicidaili.com/nn/'+str(t)

    print(url)

    # add the User-Agent header
    headers = {'User-Agent': pa()}
    res = request.Request(url, headers=headers)
    # open the URL and fetch the page
    html = request.urlopen(res)

    # decode the fetched bytes as utf-8
    html = html.read().decode('utf-8')

    # regex: pull out the <table id="ip_list"> block
    data = re.compile(r'<table id="ip_list">(.*?)</table>', re.S)
    html = data.findall(html)[-1]

    # regex: pull out the contents of every <td> cell
    data1 = re.compile(r'<td>(.*?)</td>|<td class="country">(.*?)</td>')
    html = data1.findall(html)

    a = ''
    for i in html:
        for j in i:
            if j != '':
                # skip cells that only hold the speed/latency bar images
                if not 'img src' in j:
                    a += j + ','
                # the verify-time cell (e.g. 18-08-23) ends a row: drop the
                # trailing comma and start a new line
                if '-' in j:
                    a = a[:-1]
                    a += '\n'
    # append this page's rows to the csv file
    f.write(a)

f.close()
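
The piaot import above is the author's own helper package (linked in the code comment). This post never shows pa(), but from how it is used it presumably returns a random User-Agent string. A hypothetical stand-in if you don't have the package:

import random

def pa():
    # hypothetical stand-in for piaot.pa(): return a random User-Agent string
    uas = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0',
    ]
    return random.choice(uas)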

2. Code to clean the csv and filter out the IPs and port numbers

Clean the scraped Xici proxy IP data:

    # open the csv file
    with open('C:/Users/黑神/Desktop/爬虫/西刺代理ip.csv', 'r', encoding='utf-8') as f:
        x = f.readlines()

    lbiao = []
    for i in range(len(x)):
        x1 = x[i].split(',')
        if len(x1) < 2:    # skip blank or malformed lines
            continue
        # strip stray control characters and the BOM, then keep ip:port
        row = x1[0].replace('\r', '').replace('\n', '').replace('\t', '').replace('\ufeff', '')
        lbiao.append(row + ':' + x1[1])
    print(lbiao)

    # save to a txt file
    with open('C:/Users/黑神/Desktop/爬虫/备份.txt', 'w') as f:
        f.write(str(lbiao))
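
A note on the hand-off between steps: f.write(str(lbiao)) stores the Python repr of the list, which is why step 3 below has to eval() the line back. A sketch of the same round trip with json instead (my suggestion, not the original code), which avoids eval:

    import json

    # write: json.dump instead of f.write(str(lbiao))
    with open('C:/Users/黑神/Desktop/爬虫/备份.txt', 'w', encoding='utf-8') as f:
        json.dump(lbiao, f)

    # read back in step 3: json.load instead of eval()
    with open('C:/Users/黑神/Desktop/爬虫/备份.txt', 'r', encoding='utf-8') as f:
        lst = json.load(f)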

3. Test whether the IP addresses in the txt file are reachable

from urllib import request    # re-import if this runs as a separate script
from piaot import *           # pa() comes from the custom package again

# test against 66ip, which I currently find works well
url = 'http://www.66ip.cn/'

# likewise, save the working IP addresses to a txt file
with open('C:/Users/黑神/Desktop/爬虫/备份.txt', 'r', encoding='utf-8') as f:
    x = f.readlines()
a = []

for j in x:
    # the txt file holds the str() of a list, so eval() it back
    lst = eval(j)
    for t in lst[0:50]:
        try:
            proxy = {'http': t}
            print('Using proxy: ' + proxy['http'])
            # create the ProxyHandler
            proxy_support = request.ProxyHandler(proxy)
            # create the opener
            opener = request.build_opener(proxy_support)
            # add the User-Agent header
            opener.addheaders = [('User-Agent', pa())]
            # install the opener globally
            request.install_opener(opener)
            print('Disguise in place, starting the crawl...')
            # use the opener we just installed
            response = request.urlopen(url, timeout=6)
            print(response)
        except:
            print('Crawl failed; this IP address is dropped!')
            continue
        print('Crawl finished successfully! (^-^)')
        a.append(t)
# save to file
with open('C:/Users/黑神/Desktop/爬虫/代理ip地址.txt', 'w') as f:
    f.write(str(a))
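
Once 代理ip地址.txt exists, a later crawl can pick a working proxy from it. A sketch under the same str()/eval() convention used above; the target URL is just a placeholder, and pa() is again the piaot helper:

import random
from urllib import request
from piaot import *

with open('C:/Users/黑神/Desktop/爬虫/代理ip地址.txt', 'r', encoding='utf-8') as f:
    proxies = eval(f.read())    # the file stores str() of a list, as above

t = random.choice(proxies)      # pick one verified proxy at random
opener = request.build_opener(request.ProxyHandler({'http': t}))
opener.addheaders = [('User-Agent', pa())]
request.install_opener(opener)
html = request.urlopen('http://example.com/', timeout=6).read()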

Reposted from blog.csdn.net/Black_God1/article/details/81988249