Using an IP Proxy Pool with Scrapy

https://blog.csdn.net/u011781521/article/details/70194744?locationNum=4&fps=1

I. Manually Updating the IP Pool

1. Add an IP pool to the settings file:

 
IPPOOL = [
    {"ipaddr": "61.129.70.131:8080"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "120.204.85.29:3128"},
    {"ipaddr": "219.228.126.86:8123"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "218.82.33.225:53853"},
    {"ipaddr": "223.167.190.17:42789"}
]

These IPs can be harvested from sites such as 快代理, 代理66, 有代理, 西刺代理, and guobanjia. If you see a message like "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" or "No connection could be made because the target machine actively refused it", the proxy itself is the problem; simply swap it for another one (see the quick test sketch after the log below). As it turned out, many of the IPs above were already dead. A failed attempt looks like this:

 
2017-04-16 12:38:11 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://news.sina.com.cn/> (failed 1 times): TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
this is ip:182.241.58.70:51660
2017-04-16 12:38:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://news.sina.com.cn/> (failed 2 times): TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
this is ip:49.75.59.243:28549
2017-04-16 12:38:33 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-04-16 12:38:33 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-04-16 12:38:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-16 12:38:53 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://news.sina.com.cn/> (failed 3 times): TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
2017-04-16 12:38:54 [scrapy.core.scraper] ERROR: Error downloading <GET http://news.sina.com.cn/>
Traceback (most recent call last):
  File "f:\software\python36\lib\site-packages\twisted\internet\defer.py", line 1299, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "f:\software\python36\lib\site-packages\twisted\python\failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "f:\software\python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
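A quick way to weed out dead entries before putting them into IPPOOL is to fire one test request through each proxy. The following is only a minimal sketch using requests (the same idea as the verification class in Part II); the function name and the timeout value are my own choices, not from the original post:

import requests

def proxy_alive(ipaddr, timeout=3):
    """Return True if the HTTP proxy at ipaddr answers a simple request in time."""
    try:
        r = requests.get('http://news.sina.com.cn/',
                         proxies={'http': 'http://' + ipaddr},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

if __name__ == '__main__':
    # test a couple of the IPPOOL entries above
    for ipaddr in ["61.129.70.131:8080", "61.152.81.193:9100"]:
        print(ipaddr, proxy_alive(ipaddr))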

In Scrapy, the downloader middleware responsible for proxy configuration is HttpProxyMiddleware. The corresponding class is:

scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware

(On newer Scrapy versions the scrapy.contrib path is deprecated in favour of scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware, as the deprecation warning in the log further below also points out.)
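For reference, once HttpProxyMiddleware is enabled you do not strictly need a custom middleware to use a single proxy: it reads the proxy key straight from request.meta, so a spider can set it per request. A minimal sketch (the spider name is made up for illustration; the proxy is one of the IPPOOL entries above):

import scrapy

class MetaProxySpider(scrapy.Spider):
    name = "meta_proxy_demo"  # illustrative name only

    def start_requests(self):
        # HttpProxyMiddleware picks up the 'proxy' key from request.meta
        yield scrapy.Request(
            'http://news.sina.com.cn/',
            meta={'proxy': 'http://61.129.70.131:8080'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.status)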


 

2. Edit the middleware file middlewares.py:

 
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

import random
from scrapy import signals
from myproxies.settings import IPPOOL


class MyproxiesSpiderMiddleware(object):

    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # pick a random proxy from IPPOOL and attach it to the outgoing request
        thisip = random.choice(IPPOOL)
        print("this is ip:" + thisip["ipaddr"])
        request.meta["proxy"] = "http://" + thisip["ipaddr"]
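If a proxy requires a username and password (the free proxies above do not), the credentials have to be sent as a Proxy-Authorization header alongside the proxy meta key. A hedged sketch of that variant; the proxy address and the user/password values are placeholders, not anything from the original post:

import base64


class AuthProxyMiddleware(object):

    # placeholder values: replace with your own authenticated proxy
    proxy = "http://61.129.70.131:8080"
    user, password = "myuser", "mypass"

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy
        # basic-auth credentials are sent base64-encoded
        auth = base64.b64encode(
            ("%s:%s" % (self.user, self.password)).encode("utf-8")
        ).decode("utf-8")
        request.headers["Proxy-Authorization"] = "Basic " + auth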


 

3. Enable the middlewares via DOWNLOADER_MIDDLEWARES in settings:

 
DOWNLOADER_MIDDLEWARES = {
    # 'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 543,
    'myproxies.middlewares.MyproxiesSpiderMiddleware': 125
}


 

4. The spider file:

 
# -*- coding: utf-8 -*-
import scrapy


class ProxieSpider(scrapy.Spider):

    name = "proxie"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://news.sina.com.cn/']

    def __init__(self):
        self.headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Accept-Encoding': 'gzip, deflate',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
        }

    def parse(self, response):
        print(response.body)


 

5. Run the spider:

scrapy crawl proxie


 

The output looks like this:

 
G:\Scrapy_work\myproxies>scrapy crawl proxie
2017-04-16 12:23:14 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: myproxies)
2017-04-16 12:23:14 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myproxies', 'NEWSPIDER_MODULE': 'myproxies.spiders', 'SPIDER_MODULES': ['myproxies.spiders']}
2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-04-16 12:23:14 [py.warnings] WARNING: f:\software\python36\lib\site-packages\scrapy\utils\deprecate.py:156: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` instead
  ScrapyDeprecationWarning)

2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['myproxies.middlewares.MyproxiesSpiderMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-16 12:23:14 [scrapy.core.engine] INFO: Spider opened
2017-04-16 12:23:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-16 12:23:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
this is ip:222.92.111.234:1080
2017-04-16 12:23:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://news.sina.com.cn/> (referer: None)
b'<html>\n<head>\n<meta http-equiv="Pragma" content="no-cache">\n<meta http-equiv="Expires" content="-1">\n<meta http-equiv="Cache-Control" content="no-cache">\n<link rel="SHORTCUT ICON" href="/favicon.ico">\n\n<title>Login</title>\n<script language="JavaScript">\n ...'
(the rest of the printed response body, several thousand characters of login-page HTML returned through this proxy, is omitted here for brevity)

2017-04-16 12:23:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-16 12:23:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12111,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 16, 4, 23, 15, 198955),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 16, 4, 23, 14, 706603)}
2017-04-16 12:23:15 [scrapy.core.engine] INFO: Spider closed (finished)

G:\Scrapy_work\myproxies>


Sample project: http://download.csdn.net/detail/u011781521/9815663

II. Automatically Updating the IP Pool

Here we write a class, proxies.py, that fetches proxies automatically; running it saves the harvested IPs to a txt file:

 
# *-* coding:utf-8 *-*
import requests
from bs4 import BeautifulSoup
from multiprocessing import Process, Queue
import random
import json
import time


class Proxies(object):
    """docstring for Proxies"""

    def __init__(self, page=3):
        self.proxies = []
        self.verify_pro = []
        self.page = page
        self.headers = {
            'Accept': '*/*',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8'
        }
        self.get_proxies()
        self.get_proxies_nn()

    def get_proxies(self):
        # scrape a few random pages of the xicidaili /nt/ listing
        page = random.randint(1, 10)
        page_stop = page + self.page
        while page < page_stop:
            url = 'http://www.xicidaili.com/nt/%d' % page
            html = requests.get(url, headers=self.headers).content
            soup = BeautifulSoup(html, 'lxml')
            ip_list = soup.find(id='ip_list')
            for odd in ip_list.find_all(class_='odd'):
                protocol = odd.find_all('td')[5].get_text().lower() + '://'
                self.proxies.append(protocol + ':'.join([x.get_text() for x in odd.find_all('td')[1:3]]))
            page += 1

    def get_proxies_nn(self):
        # same as get_proxies(), but against the /nn/ (high-anonymity) listing
        page = random.randint(1, 10)
        page_stop = page + self.page
        while page < page_stop:
            url = 'http://www.xicidaili.com/nn/%d' % page
            html = requests.get(url, headers=self.headers).content
            soup = BeautifulSoup(html, 'lxml')
            ip_list = soup.find(id='ip_list')
            for odd in ip_list.find_all(class_='odd'):
                protocol = odd.find_all('td')[5].get_text().lower() + '://'
                self.proxies.append(protocol + ':'.join([x.get_text() for x in odd.find_all('td')[1:3]]))
            page += 1

    def verify_proxies(self):
        # proxies that have not been verified yet
        old_queue = Queue()
        # proxies that passed verification
        new_queue = Queue()
        print('verify proxy........')
        works = []
        for _ in range(15):
            works.append(Process(target=self.verify_one_proxy, args=(old_queue, new_queue)))
        for work in works:
            work.start()
        for proxy in self.proxies:
            old_queue.put(proxy)
        for work in works:
            old_queue.put(0)
        for work in works:
            work.join()
        self.proxies = []
        while 1:
            try:
                self.proxies.append(new_queue.get(timeout=1))
            except:
                break
        print('verify_proxies done!')

    def verify_one_proxy(self, old_queue, new_queue):
        while 1:
            proxy = old_queue.get()
            if proxy == 0:
                break
            protocol = 'https' if 'https' in proxy else 'http'
            proxies = {protocol: proxy}
            try:
                if requests.get('http://www.baidu.com', proxies=proxies, timeout=2).status_code == 200:
                    print('success %s' % proxy)
                    new_queue.put(proxy)
            except:
                print('fail %s' % proxy)


if __name__ == '__main__':
    a = Proxies()
    a.verify_proxies()
    print(a.proxies)
    proxie = a.proxies
    with open('proxies.txt', 'a') as f:
        for proxy in proxie:
            f.write(proxy + '\n')


 

Run it with: python proxies.py

The harvested IPs are saved to proxies.txt, one proxy URL per line.
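For reference, the file the script writes looks roughly like this: one proxy URL per line, with the scheme scraped from the listing, which is the format get_random_proxy() below expects (the addresses here are illustrative, reusing the Part I pool):

http://61.129.70.131:8080
https://120.204.85.29:3128
http://218.82.33.225:53853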

Then change the middleware file middlewares.py to the following:

 
import random
import time

# import logging
# logger = logging.getLogger(__name__)


class ProxyMiddleWare(object):
    """docstring for ProxyMiddleWare"""

    def process_request(self, request, spider):
        '''Attach a proxy to the outgoing request.'''
        proxy = self.get_random_proxy()
        print("this is request ip:" + proxy)
        request.meta['proxy'] = proxy

    def process_response(self, request, response, spider):
        '''Handle the returned response.'''
        # if the response status is not 200, re-issue the current request
        if response.status != 200:
            proxy = self.get_random_proxy()
            print("this is response ip:" + proxy)
            # attach a new proxy to the current request
            request.meta['proxy'] = proxy
            return request
        return response

    def get_random_proxy(self):
        '''Read a random proxy from the file.'''
        while 1:
            with open('G:\\Scrapy_work\\myproxies\\myproxies\\proxies.txt', 'r') as f:
                proxies = f.readlines()
            if proxies:
                break
            else:
                time.sleep(1)
        proxy = random.choice(proxies).strip()
        return proxy
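One thing the middleware above does not cover is a download that fails outright (for example, a TCP timeout through a dead proxy, as in the log in Part I): process_response is never called in that case. Below is a hedged sketch of a process_exception hook that could be added to the same ProxyMiddleWare class to swap proxies on such failures; it is my own addition, not part of the original post:

    def process_exception(self, request, exception, spider):
        '''If the download itself failed (timeout, refused connection, ...),
        pick a fresh proxy and retry the request instead of giving up.'''
        proxy = self.get_random_proxy()
        print("this is exception ip:" + proxy)
        request.meta['proxy'] = proxy
        # returning the request tells Scrapy to reschedule it
        return request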

Update the settings file:

 
DOWNLOADER_MIDDLEWARES = {
    # 'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
    'myproxies.middlewares.ProxyMiddleWare': 125,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None
}

Run the spider:

scrapy crawl proxie


 

The output is omitted here; a complete sample project is available at: http://download.csdn.net/detail/u011781521/9815729

III. Using Crawlera (paid)

Crawlera is a proxy service from Scrapinghub, used through a downloader middleware. It provides a large pool of servers and IPs, and Scrapy can send requests to the target site through Crawlera.

Crawlera website: http://scrapinghub.com/crawlera/
Crawlera documentation: http://doc.scrapinghub.com/crawlera.html

1. Registering on the Crawlera platform

To be clear up front: registration is free, and apart from some custom plans, usage was advertised as free.

1) Sign up at https://dash.scrapinghub.com/account/signup/

Fill in a username, password and email to register a crawlera account and activate it.

Create a new project.

Choose Scrapy....

2. Deploying to the Scrapy project

1) Install scrapy-crawlera:

pip install scrapy-crawlera


2) Edit settings.py

If you configured proxy IPs earlier, comment them out and let Crawlera handle the proxying instead. Most importantly, enable the Crawlera middleware in the settings file, as shown below:

 
DOWNLOADER_MIDDLEWARES = {
    # 'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
    # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
    # 'myproxies.middlewares.ProxyMiddleWare': 125,
    # 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None
    'scrapy_crawlera.CrawleraMiddleware': 600
}


For Crawlera to take effect, add the API credentials from your account (if you fill in the API key, the pass can simply be an empty string):

 
CRAWLERA_ENABLED = True
CRAWLERA_USER = '<API key>'
CRAWLERA_PASS = ''

CRAWLERA_USER is the API key issued after registering for Crawlera; CRAWLERA_PASS is the Crawlera password, which is normally left blank.

To crawl more efficiently, you can disable the AutoThrottle extension, raise the maximum number of concurrent requests, and set a download timeout:

 
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600

If DOWNLOAD_DELAY is set anywhere in your project, also add the following to settings.py:

CRAWLERA_PRESERVE_DELAY = True


 

If your spider keeps cookies, you need to add the following to the default request headers:
 

 
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    # 'Accept-Language': 'zh-CN,zh;q=0.8',
    'X-Crawlera-Cookies': 'disable'
}
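The same header can also be set per request from the spider rather than globally. A minimal sketch of that variant (my own addition, assuming the scrapy-crawlera setup above; the spider name is illustrative):

import scrapy

class CrawleraDemoSpider(scrapy.Spider):
    name = "crawlera_demo"  # illustrative name only
    start_urls = ['http://news.sina.com.cn/']

    def start_requests(self):
        for url in self.start_urls:
            # ask Crawlera to disable its cookie handling for this request only
            yield scrapy.Request(url, headers={'X-Crawlera-Cookies': 'disable'})

    def parse(self, response):
        print(response.status)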


 

3. Running the spider

Once everything is configured you can run your spider. All requests are now sent out through Crawlera, and the output looks like this:

 
G:\Scrapy_work\myproxies>scrapy crawl proxie
2017-04-16 15:49:40 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: myproxies)
2017-04-16 15:49:40 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myproxies', 'NEWSPIDER_MODULE': 'myproxies.spiders', 'SPIDER_MODULES': ['myproxies.spiders']}
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy_crawlera.CrawleraMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-16 15:49:40 [scrapy.core.engine] INFO: Spider opened
2017-04-16 15:49:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-16 15:49:40 [root] INFO: Using crawlera at http://proxy.crawlera.com:8010?noconnect (user: f3b8ff0381fc46c7b6834aa85956fc82)
2017-04-16 15:49:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-16 15:49:41 [scrapy.core.engine] DEBUG: Crawled (407) <GET http://www.655680.com/> (referer: None)
2017-04-16 15:49:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <407 http://www.655680.com/>: HTTP status code is not handled or not allowed
2017-04-16 15:49:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-16 15:49:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'crawlera/request': 1,
 'crawlera/request/method/GET': 1,
 'crawlera/response': 1,
 'crawlera/response/error': 1,
 'crawlera/response/error/bad_proxy_auth': 1,
 'crawlera/response/status/407': 1,
 'downloader/request_bytes': 285,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 196,
 'downloader/response_count': 1,
 'downloader/response_status_count/407': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 16, 7, 49, 41, 546403),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 16, 7, 49, 40, 827892)}
2017-04-16 15:49:41 [scrapy.core.engine] INFO: Spider closed (finished)

G:\Scrapy_work\myproxies>


A 407 error comes back. The documentation says nothing about 407, but a search on Google turned up the explanation that a 407 from Crawlera is an authentication error: there may be a typo in the API key, or the wrong credentials are being used (the stats above indeed show crawlera/response/error/bad_proxy_auth).

More searching revealed that the project I had created was a Scrapy Cloud project, not a Crawlera one; when I went to create a Crawlera account it turned out to require payment.

Since it is paid, I gave up on it. Two other strategies are mentioned online: Scrapy + GoAgent and Scrapy + Tor (a highly anonymous free proxy). I have not tried either.


Reposted from blog.csdn.net/baidu_32542573/article/details/81436647