Notes from a novice: my first time using asynchronous coroutines to crawl data and build a simple local proxy pool

Table of Contents

Main libraries and tools used

  1. requests
  2. aiohttp
  3. lxml
  4. Beautiful Soup
  5. pyquery
  6. asyncio
  7. fake_useragent
  8. pymongo
  9. MongoDB
  10. python3.7

I. Introduction

  1. Analyze the page structure and crawl the proxy information (10 pages crawled);
  2. Try out different parsing libraries to extract the proxies (IP:Port and type);
  3. Test the collected proxies and filter out the unusable ones;
  4. Store the proxies that pass the test in MongoDB.

II. Process

(A) Analyzing the page code at http://www.xicidaili.com/nn/1

1. Page analysis

The first page to crawl shows a table of proxies; the IP address, port, and type are what we want to extract.

Moving on to the second page, observe how the url changes:

The url changes from http://www.xicidaili.com/nn/1 to http://www.xicidaili.com/nn/2, and so on for the following pages, so we can conclude that the number after http://www.xicidaili.com/nn/ indicates the page number.
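For example, the list of page URLs can be generated straight from this pattern (a small sketch; the full code further down builds its URLs the same way):

pages = 10
urls = ["http://www.xicidaili.com/nn/" + str(i) for i in range(1, pages + 1)]
print(urls[0])   # http://www.xicidaili.com/nn/1
print(urls[-1])  # http://www.xicidaili.com/nn/10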

Next, open the browser developer tools, switch to the Network tab, and locate the received page content.

We can see that the proxy information we want to crawl sits inside <tr></tr> tags. Looking closer, each of these tr tags has a class of either "odd" or "". Within each tr, the IP address is in the second td, the port is in the third td, and the type is in the sixth td.

Now we can try parsing the crawled page with three common parsing libraries: lxml, Beautiful Soup, and pyquery.

2. Crawling the first page

Use the requests library first (later, because a proxy had to be used for both crawling and testing, this was switched to the asynchronous coroutine request library aiohttp; see Problems and Solutions: IP address banned). Try a direct request:

import requests
response = requests.get("http://www.xicidaili.com/nn/1")
print(response.text)

The results are as follows:

<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body bgcolor="white">
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>

A 503 status code is returned, meaning the service is temporarily unavailable, so something has to be done about it. Try adding a request header:

import requests

header = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}
response = requests.get("http://www.xicidaili.com/nn/1", headers = header)
print(response.text)

Output:

<!DOCTYPE html>
<html>
<head>
  <title>国内高匿免费HTTP代理IP__第1页国内高匿</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <meta name="Description" content="国内高匿免费HTTP代理" />
  <meta name="Keywords" content="国内高匿,免费高匿代理,免费匿名代理,隐藏IP" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
    <meta name="applicable-device"content="pc,mobile">
    ......

Now the page content is retrieved normally. Next, parse the page with each of the different parsing libraries.

(B) Extracting the information with different parsing libraries

1. Parsing with lxml

import requests

def get_page():
    try:
        header = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}
        response = requests.get("http://www.xicidaili.com/nn/1", headers = header)
        get_detail(response.text)
    except Exception as e:
        print("发生错误: ", e)
        
# extract with lxml
from lxml import etree

def get_detail(html):
    html = etree.HTML(html)
    # extract the ip addresses
    print(html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()'))
    
if __name__ == "__main__":
    get_page()

First, try to get all the IP addresses on the first page. Using the XPath rule '//tr[@class="odd"]/td[2]/text()', the extraction results are as follows:

['121.40.66.129', '117.88.177.132', '117.88.176.203', '218.21.230.156', '121.31.101.41', '60.205.188.24', '221.206.100.133', '27.154.34.146', '58.254.220.116', '39.91.8.31', '221.218.102.146', '223.10.21.0', '58.56.149.198', '219.132.205.105', '221.237.37.97', '183.163.24.15', '171.80.196.14', '118.114.96.251', '114.239.91.166', '111.222.141.127', '121.237.148.133', '123.168.67.126', '118.181.226.166', '121.237.148.190', '124.200.36.118', '58.58.213.55', '49.235.253.240', '183.147.11.34', '121.40.162.239', '121.237.148.139', '121.237.148.118', '117.88.5.174', '117.88.5.234', '117.87.180.144', '119.254.94.93', '60.2.44.182', '175.155.239.23', '121.237.148.156', '118.78.196.186', '123.118.108.201', '117.88.4.71', '113.12.202.50', '117.88.177.34', '117.88.4.35', '222.128.9.235', '121.237.148.131', '121.237.149.243', '121.237.148.8', '182.61.179.157', '175.148.68.133']

The results look correct, and the port and type can be obtained in the same way:

from lxml import etree

def get_detail(html):
    html = etree.HTML(html)
    # extract the ip addresses
    print(html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()')[:10])
    # extract the ports
    print(html.xpath('//tr[@class="odd" or @class=""]/td[3]/text()')[:10])
    # extract the types
    print(html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')[:10])
    # count how many entries there are on one page
    print(len(html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')))

The output shows the first ten entries of each field, with 100 entries in total:

['121.237.149.117', '121.237.148.87', '59.44.78.30', '124.93.201.59', '1.83.117.56', '117.88.176.132', '121.40.66.129', '222.95.144.201', '117.88.177.132', '121.237.149.132']
['3000', '3000', '42335', '59618', '8118', '3000', '808', '3000', '3000', '3000']
['HTTP', 'HTTP', 'HTTP', 'HTTPS', 'HTTP', 'HTTP', 'HTTP', 'HTTP', 'HTTP', 'HTTP']
100

2. Parsing with Beautiful Soup

The table structure of the page is as follows:

<table id="ip_list">
    <tr>
      <th class="country">国家</th>
      <th>IP地址</th>
      <th>端口</th>
      <th>服务器地址</th>
      <th class="country">是否匿名</th>
      <th>类型</th>
      <th class="country">速度</th>
      <th class="country">连接时间</th>
      <th width="8%">存活时间</th>
      
      <th width="20%">验证时间</th>
    </tr>
  
    <tr class="odd">
      <td class="country"><img src="//fs.xicidaili.com/images/flag/cn.png" alt="Cn" /></td>
      <td>222.128.9.235</td>
      <td>59593</td>
      <td>
        <a href="/2018-09-26/beijing">北京</a>
      </td>
      <td class="country">高匿</td>
      <td>HTTPS</td>
      <td class="country">
        <div title="0.032秒" class="bar">
          <div class="bar_inner fast" style="width:87%">
            
          </div>
        </div>
      </td>
      <td class="country">
        <div title="0.006秒" class="bar">
          <div class="bar_inner fast" style="width:97%">
            
          </div>
        </div>
      </td>
      
      <td>533天</td>
      <td>20-03-13 15:21</td>
    </tr>
...

First select all the tr tags under the table:

from bs4 import BeautifulSoup

def get_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    c1 = soup.select('#ip_list tr')
    print(c1[1])

The results are as follows:

<tr class="odd">
<td class="country"><img alt="Cn" src="//fs.xicidaili.com/images/flag/cn.png"/></td>
<td>222.128.9.235</td>
<td>59593</td>
<td>
<a href="/2018-09-26/beijing">北京</a>
</td>
<td class="country">高匿</td>
<td>HTTPS</td>
<td class="country">
<div class="bar" title="0.032秒">
<div class="bar_inner fast" style="width:87%">
</div>
</div>
</td>
<td class="country">
<div class="bar" title="0.006秒">
<div class="bar_inner fast" style="width:97%">
</div>
</div>
</td>
<td>533天</td>
<td>20-03-13 15:21</td>
</tr>

The next step is to pick out, from each tr tag, the second td (the IP), the third td (the port), and the sixth td (the type):

from bs4 import BeautifulSoup

def get_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    c1 = soup.select('#ip_list tr')
    ls = []
    for index, tr in enumerate(c1):
        if index != 0:
            td = tr.select('td')
            ls.append({'proxies': td[1].string + ":" + td[2].string, 
                        'types': td[5].string})
    print(ls)
    print(len(ls))

The results are as follows:

[{'proxies': '222.128.9.235:59593', 'types': 'HTTPS'}, {'proxies': '115.219.105.60:8010', 'types': 'HTTP'}, {'proxies': '117.88.177.204:3000', 'types': 'HTTP'}, {'proxies': '222.95.144.235:3000', 'types': 'HTTP'}, {'proxies': '59.42.88.110:8118', 'types': 'HTTPS'}, {'proxies': '118.181.226.166:44640', 'types': 'HTTP'}, {'proxies': '121.237.149.124:3000', 'types': 'HTTP'}, {'proxies': '218.86.200.26:8118', 'types': 'HTTPS'}, {'proxies': '106.6.138.18:8118', 'types': 'HTTP'}......]
100

There are 100 entries for the page, so the result is correct.

3. Parsing with pyquery

The pyquery approach is similar to Beautiful Soup: first delete the first row of the table (the header), then select the tr tags inside the table:

from pyquery import PyQuery as pq
    
def get_detail(html):
    doc = pq(html)
    doc('tr:first-child').remove()  # remove the first row (the table header)
    items = doc('#ip_list tr')
    print(items)

The output shows the format of each item in items:

    ...
    <tr class="">
      <td class="country"><img src="//fs.xicidaili.com/images/flag/cn.png" alt="Cn"/></td>
      <td>124.205.143.210</td>
      <td>34874</td>
      <td>
        <a href="/2018-10-05/beijing">北京</a>
      </td>
      <td class="country">高匿</td>
      <td>HTTPS</td>
      <td class="country">
        <div title="0.024秒" class="bar">
          <div class="bar_inner fast" style="width:93%">
            
          </div>
        </div>
      </td>
      <td class="country">
        <div title="0.004秒" class="bar">
          <div class="bar_inner fast" style="width:99%">
            
          </div>
        </div>
      </td>
      
      <td>523天</td>
      <td>20-03-12 02:20</td>
    </tr>
    ...

Next, go through the items generator and, for each item, select the second td tag (the IP address), the third td tag (the port), and the sixth td tag (the type), storing them as a list of dictionaries.

from pyquery import PyQuery as pq
    
def get_detail(html):
    doc = pq(html)
    doc('tr:first-child').remove()  # remove the first row (the table header)
    items = doc('#ip_list tr')    
    ls = []
    for i in items.items():
        tmp1 = i('td:nth-child(2)') # select the ip address
        tmp2 = i('td:nth-child(3)') # select the port
        tmp3 = i('td:nth-child(6)') # select the type
        ls.append({'proxies': tmp1.text() + ":" + tmp2.text(),
                    'types': tmp3.text()})
    print(ls)
    print(len(ls))

Output:

[{'proxies': '222.128.9.235:59593', 'types': 'HTTPS'}, {'proxies': '115.219.105.60:8010', 'types': 'HTTP'}, {'proxies': '117.88.177.204:3000', 'types': 'HTTP'}, {'proxies': '222.95.144.235:3000', 'types': 'HTTP'}, {'proxies': '59.42.88.110:8118', 'types': 'HTTPS'}, {'proxies': '118.181.226.166:44640', 'types': 'HTTP'}, {'proxies': '121.237.149.124:3000', 'types': 'HTTP'}, {'proxies': '218.86.200.26:8118', 'types': 'HTTPS'}......
100

Again 100 entries for the page, so the result is correct.

(C) Testing the crawled proxies against Baidu

Many of the free proxies we crawl are unusable or unstable and cannot be stored as-is, so we need a site against which to test whether a request through each crawled proxy actually succeeds. I chose http://www.baidu.com as the test target: only proxies whose test request succeeds get added to the database, and proxies that fail more than three times are discarded.
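As a plain, synchronous sketch of just this screening policy (the helper names here are made up for illustration; the actual asynchronous implementation is shown below):

import requests

def check_once(proxy_url, test_url="http://www.baidu.com/"):
    # one test request through the proxy; success means HTTP 200
    try:
        resp = requests.get(test_url, proxies={"http": proxy_url}, timeout=15)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def keep_proxy(proxy_url, max_tries=3):
    # keep the proxy if any of up to three attempts succeeds, otherwise discard it
    return any(check_once(proxy_url) for _ in range(max_tries))

# e.g. keep_proxy("http://222.128.9.235:59593")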

Testing a proxy this way typically takes ten seconds or even longer, so testing them one after another with sequential requests is clearly unreasonable; the asynchronous request library aiohttp is used instead. For an introduction to asynchronous coroutines, see the article on using asynchronous coroutines in Python; for aiohttp, see the aiohttp Chinese documentation.

The two key keywords are await and async. Simply put, a spot A in the code where the thread is likely to wait is marked with await; when execution reaches it, instead of sitting there waiting, the thread goes off to run another task B, and as soon as the object awaited in A has responded it comes straight back and continues with the rest of A, setting B aside for the moment. However, the object after await must be a coroutine object, a generator that returns a coroutine object, or an iterator returned by an object's __await__ method (which is why you cannot simply put await in front of a requests call). Marking a function with async makes it return a coroutine object, so it can then be awaited without a second thought. Of course, if the place you put await is not somewhere the thread would actually block, such as waiting for a response to a request or for data to finish uploading or downloading, the await achieves nothing, although it does no harm either.
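To make the idea concrete, here is a minimal, self-contained sketch (not part of the crawler code): three coroutines "wait" concurrently, so the whole run takes about one second rather than three. Putting await directly in front of requests.get() would fail, because it returns a plain Response object rather than an awaitable.

import asyncio

async def fetch(i):
    # await hands control back to the event loop while this task is waiting
    await asyncio.sleep(1)          # stands in for a slow network request
    return "task %d done" % i

async def main():
    # run the three coroutines concurrently; total time is about 1s, not 3s
    results = await asyncio.gather(*(fetch(i) for i in range(3)))
    print(results)

asyncio.run(main())  # Python 3.7+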

The proxy-testing function is as follows:

    # test a proxy
    async def test_proxy(self, dic):
        ## build the proxy and test url according to the proxy type
        if dic["types"] == "HTTP":
            test_url = "http://www.baidu.com/"
            prop = "http://" + dic["proxies"]
        else:
            test_url = "https://www.baidu.com/"
            prop = "https://" + dic["proxies"]
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # asynchronous coroutine request
        async with aiohttp.ClientSession() as session:
            while True:
                try:
                    async with session.get(test_url, headers = header, proxy = prop, timeout = 15, verify_ssl=False) as resp:
                        if resp.status == 200:
                            self.success_test_count += 1
                            print(prop, "\033[5;36;40m===========>测试成功,写入数据库!=========%d次\033[;;m"%self.success_test_count)
                            await self.insert_to_mongo(dic) ## call the function that writes to mongodb
                            return
                except Exception as e:
                    print(prop, "==测试失败,放弃==", e)
                    break

(D) Choosing a database for storage

Since the proxy pool will be maintained further later on, MongoDB is chosen for storage; it also makes it easy to avoid inserting duplicate records. The storage function is as follows:

    # write to MongoDB
    async def insert_to_mongo(self, dic):
        db = self.client.Myproxies
        collection = db.proxies
        collection.update_one(dic,{'$set': dic}, upsert=True)   # upsert=True avoids duplicate inserts
        print("\033[5;32;40m插入记录:" + json.dumps(dic), "\033[;;m")
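For completeness, here is a small sketch (not part of the original code) of how the stored pool might later be consumed, assuming the same Myproxies database and proxies collection as above:

import random
import pymongo
import requests

client = pymongo.MongoClient('mongodb://localhost:27017/')
docs = list(client.Myproxies.proxies.find({}, {'_id': 0}))

pick = random.choice(docs)                      # e.g. {'proxies': '222.128.9.235:59593', 'types': 'HTTPS'}
scheme = pick['types'].lower()                  # 'http' or 'https'
proxies = {scheme: scheme + '://' + pick['proxies']}
print(requests.get('http://www.baidu.com/', proxies=proxies, timeout=10).status_code)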

(E) The full code

1. Version that uses a proxy in the crawling stage

Finally, the complete code is as follows. (This is the version that uses proxies to make requests during the crawling stage as well, because my own IP had been banned and I had no other choice, so the process is slow; the version that crawls without a proxy and uses proxies only for testing is posted further down.) Of the three parsing libraries covered at the beginning, lxml is the one used here:

import json
import time
import random
from fake_useragent import UserAgent
import asyncio
import aiohttp
# avoid the RuntimeError (see Problems and Solutions)
import nest_asyncio
nest_asyncio.apply()
from lxml import etree
import pymongo

class Get_prox:
    def __init__(self):
        # initialize and connect to MongoDB
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')
        self.success_get_count = 0
        self.success_test_count = 0
    
    # fetch a page (version that uses a proxy)
    async def get_page(self, session, url):
        ## a library that generates random request headers
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # load the proxy pool from a local file
        proxies_pool = self.get_proxies()
        while True:
            try:
                # since my ip was banned early on through carelessness, I had to crawl these pages
                # through a batch of 5999 proxies grabbed from another site (see the problems section), picking one at random each time
                p = 'http://' + random.choice(proxies_pool)
                async with session.get(url, headers = header, proxy = p, timeout = 10) as response:
                    await asyncio.sleep(2)
                    if response.status == 200:
                        self.success_get_count += 1
                        print("\033[5;36;40m----------------------请求成功-------------------%d次\033[;;m"%self.success_get_count)
                        return await response.text()
                    else:
                        print("\033[5;31;m", response.status, "\033[;;m")
                        continue
            except Exception as e:
                print("请求失败orz", e)    
        
    # task
    async def get(self, url):
        async with aiohttp.ClientSession() as session:
            html = await self.get_page(session, url)
            await self.get_detail(html)
    
    # test a proxy
    async def test_proxy(self, dic):
        ## build the proxy and test url according to the proxy type
        if dic["types"] == "HTTP":
            test_url = "http://www.baidu.com/"
            prop = "http://" + dic["proxies"]
        else:
            test_url = "https://www.baidu.com/"
            prop = "https://" + dic["proxies"]
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # asynchronous coroutine request
        async with aiohttp.ClientSession() as session:
            while True:
                try:
                    async with session.get(test_url, headers = header, proxy = prop, timeout = 15, verify_ssl=False) as resp:
                        if resp.status == 200:
                            self.success_test_count += 1
                            print(prop, "\033[5;36;40m===========>测试成功,写入数据库!=========%d次\033[;;m"%self.success_test_count)
                            await self.insert_to_mongo(dic) ## call the function that writes to mongodb
                            return
                except Exception as e:
                    print(prop, "==测试失败,放弃==", e)
                    break
    
    # load the proxy pool
    def get_proxies(self):
        with open("proxies.txt", "r") as f:
            ls = json.loads(f.read())
        return ls
    
    # extract with lxml
    async def get_detail(self, html):
        html = etree.HTML(html)
        dic = {}
        ip = html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()')
        port = html.xpath('//tr[@class="odd" or @class=""]/td[3]/text()')
        types = html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')
        for i in range(len(ip)):
            dic['proxies'] = ip[i] + ":" + port[i]
            dic['types'] = types[i]
            await self.test_proxy(dic)
        
    # write to MongoDB
    async def insert_to_mongo(self, dic):
        db = self.client.Myproxies
        collection = db.proxies
        collection.update_one(dic,{'$set': dic}, upsert=True)   # upsert=True avoids duplicate inserts
        print("\033[5;32;40m插入记录:" + json.dumps(dic), "\033[;;m")

    
# main thread
if __name__ == "__main__":
    urls = []
    start = time.time()
    # crawl the first 10 pages
    for i in range(1, 11):
        urls.append("http://www.xicidaili.com/nn/" + str(i))
    c = Get_prox()
    # create 10 future task objects
    tasks = [asyncio.ensure_future(c.get(url)) for url in urls]
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    end = time.time()
    total = (end - start)/60.0
    print("完成,总耗时:", total, "分钟!")

The run prints a lot of log output along the way.

In both the crawling and the testing stages, the success rate of requests through these proxy IPs is very low, so a complete run takes quite a while; the elapsed time reported at the end was 47 minutes.

A quick look at the log shows that the most recent successful insert is only the eighth record so far...

Looking at the database afterwards (this is after several repeated runs), only 50 records were inserted in total.
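A quick way to confirm the count from Python (a sketch, assuming the same database and collection names as in the code above):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
print(client.Myproxies.proxies.count_documents({}))  # about 50 after these runs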

2. Version that does not use a proxy in the crawling stage

Next is the version that does not use proxies in the crawling stage: requests is used for crawling and aiohttp only for testing the proxies, which removes the long wait caused by screening proxies during the first phase.

import json
import time
import requests
from fake_useragent import UserAgent
import asyncio
import aiohttp
# avoid the RuntimeError (see Problems and Solutions)
import nest_asyncio
nest_asyncio.apply()
from lxml import etree
import pymongo

class Get_prox:
    def __init__(self):
        # initialize and connect to MongoDB
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')
        self.success_get_count = 0
        self.success_test_count = 0            
                
    # fetch a page (version without a proxy)
    def get_page(self, url):
        ## a library that generates random request headers
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        while True:
            try:
                response = requests.get(url, headers = header, timeout = 10)
                time.sleep(1.5)
                if response.status_code == 200:
                    self.success_get_count += 1
                    print("\033[5;36;40m----------------------请求成功-------------------%d次\033[;;m"%self.success_get_count)
                    return response.text
                else:
                    print("\033[5;31;m", response.status_code, "\033[;;m")
                    continue
            except Exception as e:
                print("请求失败orz", e)
        
    # task
    def get(self, urls):
        htmls = []
        # first store all the crawled pages in a list
        for url in urls:
            htmls.append(self.get_page(url))
        # test the proxies asynchronously
        tasks = [asyncio.ensure_future(self.get_detail(html)) for html in htmls]
        loop = asyncio.get_event_loop()
        loop.run_until_complete(asyncio.wait(tasks))
    
    # test a proxy
    async def test_proxy(self, dic):
        ## build the proxy and test url according to the proxy type
        if dic["types"] == "HTTP":
            test_url = "http://www.baidu.com/"
            prop = "http://" + dic["proxies"]
        else:
            test_url = "https://www.baidu.com/"
            prop = "https://" + dic["proxies"]
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # asynchronous coroutine request
        async with aiohttp.ClientSession() as session:
            while True:
                try:
                    async with session.get(test_url, headers = header, proxy = prop, timeout = 15, verify_ssl=False) as resp:
                        if resp.status == 200:
                            self.success_test_count += 1
                            print(prop, "\033[5;36;40m===========>测试成功,写入数据库!=========%d次\033[;;m"%self.success_test_count)
                            await self.insert_to_mongo(dic) ## call the function that writes to mongodb
                            return
                except Exception as e:
                    print(prop, "==测试失败,放弃==", e)
                    break
    
    # extract with lxml
    async def get_detail(self, html):
        html = etree.HTML(html)
        dic = {}
        ip = html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()')
        port = html.xpath('//tr[@class="odd" or @class=""]/td[3]/text()')
        types = html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')
        for i in range(len(ip)):
            dic['proxies'] = ip[i] + ":" + port[i]
            dic['types'] = types[i]
            await self.test_proxy(dic)
        
    # write to MongoDB
    async def insert_to_mongo(self, dic):
        db = self.client.Myproxies
        collection = db.proxies
        collection.update_one(dic,{'$set': dic}, upsert=True)   # upsert=True avoids duplicate inserts
        print("\033[5;32;40m插入记录:" + json.dumps(dic) + "\033[;;m")

    
# main thread
if __name__ == "__main__":
    urls = []
    start = time.time()
    # crawl the first 10 pages
    for i in range(1, 11):
        urls.append("http://www.xicidaili.com/nn/" + str(i))
    c = Get_prox()
    c.get(urls)
    end = time.time()
    total = (end - start)/60.0
    print("完成,总耗时:", total, "分钟!")

The results below were measured by a friend who ran this version:

The ten page requests of the crawling stage went through very smoothly.

The total time came to 19 minutes; clearly, not having to screen proxies during the crawling stage saves a great deal of time!

IV. Problems and Solutions

(A) IP address banned

When I first used lxml to explore the parsing rules, I skipped setting a sleep time for convenience, and later, through carelessness, I also forgot to set one when crawling the pages in bulk. After crawling a number of times, the log output turned into the following:

{"proxies": "121.237.148.195:3000", "types": "HTTP"}
{"proxies": "121.234.31.44:8118", "types": "HTTPS"}
{"proxies": "117.88.4.63:3000", "types": "HTTP"}
{"proxies": "222.95.144.58:3000", "types": "HTTP"}
发生错误:  'NoneType' object has no attribute 'xpath'
发生错误:  'NoneType' object has no attribute 'xpath'
发生错误:  'NoneType' object has no attribute 'xpath'
发生错误:  'NoneType' object has no attribute 'xpath'
发生错误:  'NoneType' object has no attribute 'xpath'
发生错误:  'NoneType' object has no attribute 'xpath'
......

After terminating the program and printing the response status codes, the following appeared:

503
503
503
503
503
...


I could no longer open the site in a browser either, so it is clear that my IP had been banned by the site for crawling it too many times.

  • Solution

At first I hand-picked a few proxy IPs from other free-proxy sites, but a very large proportion of free proxies turned out to be unusable, and the existing proxy-pool projects on the internet take time to set up and configure. So I went straight to the 66ip free proxy site and used its free bulk-extraction feature to extract 6000 proxy IPs:


Clicking "extract" produces a page that directly contains the 6000 proxies, so a simple program can crawl that generated page and save the 6000 proxies (5999 were actually captured) to a local file:

import re
import json
import requests

response1 = requests.get("http://www.66ip.cn/mo.php?sxb=&tqsl=6000&port=&export=&ktip=&sxa=&submit=%CC%E1++%C8%A1&textarea=")
html = response1.text
print(response1.status_code == 200)
pattern = re.compile("br />(.*?)<", re.S)

items = re.findall(pattern, html)
for i in range(len(items)):
    items[i] = items[i].strip()
print(len(items))
with open("proxies.txt", "w") as f:
    f.write(json.dumps(items))

The crawler then reads this file back in as its proxy pool:

    # load the proxy pool
    def get_proxies(self):
        with open("proxies.txt", "r") as f:
            ls = json.loads(f.read())
        return ls

Each request then picks a proxy at random from the pool:

def get_page(ls, page):  # ls: proxy pool list, page: number of pages to crawl
    url = []
    ua = UserAgent()
    with open("proxies.txt", "r") as f:
        ls = json.loads(f.read())
    for i in range(1, page+1):
        url.append("http://www.xicidaili.com/nn/" + str(i))
    count = 1
    errcount = 1
    for u in url:
        while True:
            try:
                header = {'User-Agent': ua.random}
                handler = {'http': 'http://' + random.choice(ls)}
                response = requests.get(u, headers = header, proxies = handler, timeout = 10)
                time.sleep(1)
                get_detail(response.text)
                if response.status_code == 200:
                    print("选取ip:", handler, "请求成功---------------------------第%d次"%count)
                    count += 1
                else:
                    continue
                break
            except:
                print("选取ip:", handler, ", 第%d请求发生错误"%errcount)
                errcount += 1

There is a problem, though: with this scheduling the thread can only handle one task at a time, and many of the proxy IPs are unusable, so each attempt takes several seconds and most requests end in an error.

To solve this, a single-threaded, step-by-step way of scheduling the page crawls is not an option, so I chose the asynchronous request library aiohttp.

Following the introductory article on Python asynchronous coroutines and the aiohttp Chinese documentation, I learned to create ten coroutine task objects (one for each of the ten pages to crawl) and let the event loop schedule them asynchronously: whenever the thread hits a request, it does not wait for that task but moves on to schedule the next one, and once all ten requests have succeeded we can move on to the next function call. The total time can thus be cut by roughly a factor of ten. The approach is as follows (not all functions are listed):

    # fetch a page (version that uses a proxy)
    async def get_page(self, session, url):
        ## a library that generates random request headers
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # load the proxy pool from a local file
        proxies_pool = self.get_proxies()
        while True:
            try:
                # since my ip was banned early on through carelessness, I had to crawl these pages
                # through a batch of 5999 proxies grabbed from another site (see the problems section), picking one at random each time
                p = 'http://' + random.choice(proxies_pool)
                async with session.get(url, headers = header, proxy = p, timeout = 10) as response:
                    await asyncio.sleep(2)
                    if response.status == 200:
                        self.success_get_count += 1
                        print("\033[5;36;40m----------------------请求成功-------------------%d次\033[;;m"%self.success_get_count)
                        return await response.text()
                    else:
                        print("\033[5;31;m", response.status, "\033[;;m")
                        continue
            except Exception as e:
                print("请求失败orz", e)    
        
    # task
    async def get(self, url):
        async with aiohttp.ClientSession() as session:
            html = await self.get_page(session, url)
            await self.get_detail(html)
# main thread
if __name__ == "__main__":
    urls = []
    start = time.time()
    # crawl the first 10 pages
    for i in range(1, 11):
        urls.append("http://www.xicidaili.com/nn/" + str(i))
    c = Get_prox()
    # create 10 future task objects
    tasks = [asyncio.ensure_future(c.get(url)) for url in urls]
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    end = time.time()
    total = (end - start)/60.0
    print("完成,总耗时:", total, "分钟!")

Part of the log printed during crawling shows that requests through these proxies succeed only rarely; after that, it is just a matter of waiting.

(B) RuntimeError during asynchronous execution

When the asynchronous coroutine program was first run, the console printed the following error:

RuntimeError: asyncio.run() cannot be called from a running event loop

Searching online for a solution, the fix is to add the following at the beginning of the program:

import nest_asyncio
nest_asyncio.apply()

After that the error no longer appears, although the exact reason is unclear to me.
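A likely cause (an assumption on my part, since the environment is not recorded here) is running the script inside Jupyter/IPython, which already has an event loop running; asyncio.run() refuses to start a nested loop, and nest_asyncio.apply() patches asyncio so the nested call is tolerated. A minimal sketch of that situation:

import asyncio
import nest_asyncio

async def demo():
    await asyncio.sleep(0.1)
    return "ok"

# inside an already-running event loop (e.g. a Jupyter cell), this line raises:
#   RuntimeError: asyncio.run() cannot be called from a running event loop
# asyncio.run(demo())

nest_asyncio.apply()        # patch asyncio to tolerate the nested loop
print(asyncio.run(demo()))  # now completes normally in the same session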

V. Areas for further improvement

  • If the IP had not been banned, the crawling stage could simply request the pages directly with requests, just remember to set a sleep time. In fact, the crawling stage could also pull proxies from several different proxy sites at the same time, which would let the asynchronous request mechanism help here as well: create multiple tasks, each requesting a different site, and add them to the event loop as asynchronous coroutine tasks (a rough sketch follows this list).
  • My approach only puts the proxies into a local database, so it is static; many proxy-pool projects online are maintained dynamically and provide a web interface or an api interface, which is more complex to implement and something to study further later on.
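A rough sketch of the first point above (the site list and the idea of one parser per site are illustrative only, not tested against these sites):

import asyncio
import aiohttp

SITES = [
    "http://www.xicidaili.com/nn/1",  # each source would get its own parsing function
    "http://www.66ip.cn/",            # a second source, as used in the problems section
]

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        return url, await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, u)) for u in SITES]
        for url, html in await asyncio.gather(*tasks):
            print(url, len(html))  # hand each page to the parser for its site

asyncio.run(main())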

Origin www.cnblogs.com/PanzVor/p/12497615.html