Python Distributed Crawler Framework Scrapy 7-5: Implementing an IP Proxy Pool in Scrapy

So far, the cnblogs spider still only gets about 60 items.

Home broadband is assigned a dynamic IP, so restarting the router may change it. Aliyun gives you a static IP, while Amazon's is dynamic.

In fact, going through proxy IPs slows the crawl down, so the first thing to try is simply limiting the crawling speed of our own machine.
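For reference, Scrapy ships with throttling settings that do exactly this; a minimal settings.py sketch (the values below are only illustrative, not from the course):

DOWNLOAD_DELAY = 2                    # wait about 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True       # add jitter so the pace looks less mechanical
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # cap parallel requests to one domain
AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the delay to server latency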

Under normal circumstances, our machine sends requests straight to the target website, which therefore sees our real IP.

With an IP proxy, the request goes to the proxy server first and the proxy forwards it, so the target website only sees the proxy's IP.

Proxy IPs also differ in anonymity. With a high-anonymity (elite) proxy, the target site cannot see our machine's IP at all, whereas an ordinary (transparent) proxy may pass our local IP along to the site.

Implementing an IP proxy in Scrapy is very simple. The RandomUserAgentMiddlware we wrote earlier already handles every request in process_request; we just need to add one more line at the end of that function:

request.meta["proxy"] = "http://60.167.159.236:808"

Where do the proxy IPs come from? We can look for them on Xici (西刺), a free proxy IP site.

Of course, building a real IP proxy pool takes more work. In other words, to use proxies properly we should not rely on a single IP; we need a pool of them and pick one at random for each request. How? Quite simply, we crawl the Xici site ourselves.

Comment out the hard-coded proxy line for now:

# request.meta["proxy"] = "http://60.167.159.236:808"

Create a new package named tools:

Inside it, create a new script crawl_xici_ip.py, and install requests:

pip install -i https://pypi.douban.com/simple requests

Edit crawl_xici_ip.py:

import requests
from scrapy.selector import Selector
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="", db="Spider", charset="utf8")
cursor = conn.cursor()


def crawl_ips():
    """Crawl Xici's free proxy IPs and store them in MySQL."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    for i in range(1, 2073):
        resp = requests.get("https://www.xicidaili.com/nn/{0}".format(i), headers=headers)

        selector = Selector(text=resp.text)
        all_trs = selector.css("#ip_list tr")

        ip_list = []
        for tr in all_trs[1:]:
            # the speed column is a bar whose title attribute reads e.g. "0.5秒"
            speed_str = tr.css(".bar::attr(title)").extract_first()
            speed = float(speed_str.split("秒")[0]) if speed_str else 0.0
            all_texts = tr.css("td::text").extract()
            ip = all_texts[0]
            port = all_texts[1]
            proxy_type = all_texts[5]

            ip_list.append((ip, port, proxy_type, speed))

        # after each page, write the rows to the database
        for ip_info in ip_list:
            sql = "insert into proxy_ip(ip, port, speed, proxy_type) VALUES('{0}', '{1}', {2}, '{3}')".format(
                ip_info[0], ip_info[1], ip_info[3], ip_info[2]
            )
            try:
                # execute and commit the insert
                cursor.execute(sql)
                conn.commit()
            except:
                # roll back in case of any error (e.g. a duplicate primary key)
                conn.rollback()

        print("---------------{0} done---------------".format(i))


if __name__ == "__main__":
    crawl_ips()

This uses the requests library plus basic Python MySQL operations (note that string values in the SQL statement must be wrapped in single quotes).

Previously we called selectors directly on the response; that works because Scrapy's Response wraps a Selector internally. Here we use the Selector class directly.
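As a small illustration (not from the original post), inside a spider callback the two spellings below select the same nodes:

# the Response delegates to the Selector it wraps
trs = response.css("#ip_list tr")
# building the Selector explicitly, as crawl_xici_ip.py does
trs = Selector(text=response.text).css("#ip_list tr")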

Before running it, create a new table proxy_ip (a separate auto-increment key is not required; use ip as the first column of the primary key and port as the second):
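The post creates the table in Navicat; as a sketch, the equivalent DDL could look like this (the column names and the composite primary key come from the text above, while the types and lengths are my assumptions):

create_sql = """
CREATE TABLE proxy_ip (
    ip VARCHAR(20) NOT NULL,          -- first part of the composite primary key
    port VARCHAR(10) NOT NULL,        -- second part of the composite primary key
    speed FLOAT,
    proxy_type VARCHAR(10),
    PRIMARY KEY (ip, port)
)
"""
cursor.execute(create_sql)
conn.commit()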

Run the script.

The crawl above fills the table, but we still need a way to fetch a proxy IP for our requests. It comes from the database, so how do we get one? You can use the following SQL statement:

SELECT ip, port FROM proxy_ip
ORDER BY RAND()
LIMIT 1

You can test it in Navicat:

Then add a class (still in crawl_xici_ip.py):

class GetIP(object):
    def delete_ip(self, ip):
        # remove an invalid ip from the database
        delete_sql = """
            delete from proxy_ip where ip='{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    def judge_ip(self, ip, port, proxy_type):
        # check whether the ip is still usable
        http_url = "http://www.baidu.com"
        proxy_url = "{0}://{1}:{2}".format(proxy_type.lower(), ip, port)
        # configure the proxy for requests
        try:
            proxy_dict = {
                "http": proxy_url,
            }
            # a timeout is added here (not in the original) so a dead proxy does not hang the check
            response = requests.get(http_url, proxies=proxy_dict, timeout=10)
        except Exception as e:
            print("invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >= 200 and code < 300:
                print("effective ip")
                return True
            else:
                print("invalid ip and port")
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # fetch one random usable ip from the database
        random_sql = """
            SELECT ip, port, proxy_type FROM proxy_ip
            ORDER BY RAND()
            LIMIT 1
            """
        cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            proxy_type = ip_info[2]

            judge_re = self.judge_ip(ip, port, proxy_type)
            if judge_re:
                return "{0}://{1}:{2}".format(proxy_type.lower(), ip, port)
            else:
                return self.get_random_ip()

Change the entry-point call accordingly (keep it under __main__; otherwise the logic would run at import time):

if __name__ == "__main__":
    get_ip = GetIP()
    get_ip.get_random_ip()

Debug it; once it succeeds, go back to middlewares.py and import the class we just wrote:

from tools.crawl_xici_ip import GetIP

Add a new middleware class:

class RandomProxyMiddleware(object):
    # dynamically set an IP proxy for every request
    def process_request(self, request, spider):
        get_ip = GetIP()
        request.meta["proxy"] = get_ip.get_random_ip()

There is also an open-source library, scrapy-proxies, a Scrapy plugin that is much more powerful than ours, and its code is just a single file. It likewise defines a middleware, but it reads its proxies from a file configured through settings; reading from a file is less flexible than our database-backed approach. You can take it and adapt it.
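For reference, a scrapy-proxies setup looks roughly like the following in settings.py (a sketch from my recollection of its README; verify the exact setting names against the project page):

# sketch of a scrapy-proxies configuration -- check the plugin's README before use
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/path/to/proxy/list.txt'   # plain-text file of proxies the plugin reads
PROXY_MODE = 0                           # 0 = pick a random proxy for every request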

In addition, the Scrapy team offers the scrapy-crawlera project, which makes dynamic IP configuration even simpler, but it is a paid service.

Finally there is Tor, the onion browser. The onion network wraps our traffic in many layers; when a request passes through it, it is relayed multiple times, which achieves anonymity and is why hackers favor it. But it requires a VPN, which is a sensitive topic.

Decide, as needed, whether to register our middleware under DOWNLOADER_MIDDLEWARES in settings.py:

'Spider.middlewares.RandomProxyMiddleware': 605,

 


Origin blog.csdn.net/liujh_990807/article/details/100149898