Nginx防爬虫优化

转载总结：

方式一：创建一个robots.txt文本文件，然后在文档内设置好代码，告诉搜索引擎我网站的哪些文件你不能访问。然后上传到网站根目录下面，因为当搜索引擎蜘蛛在索引一个网站时，会先爬行查看网站根目录下是否有robots.txt文件。
#摘自京东
cat<<EOF>robots.txt
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
EOF
#摘自淘宝
cat<<EOF>robots.txt
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Allow: /$
Disallow: /product/
Disallow: /

User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

User-agent: Bingbot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

User-Agent: 360Spider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /

User-Agent: Yisouspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /

User-Agent: Sogouspider
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /ershou
Disallow: /

User-Agent: Yahoo! Slurp
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

User-Agent: *
Disallow: /
EOF

方式二：根据客户端的user-agents信息，阻止指定的爬虫爬取我们的网站。

1.阻止下载协议代理，命令如下：
##Block download agents##
if ($http_user_agent ~* LWP:Simple | BBBike | wget)
{
    return 403;
}
#说明：如果用户匹配了if后面的客户端（例如wget），就返回403.

2.根据$http_user_agent获取客户端agent，然后判断是否允许或返回指定错误码。
添加内容防止N多爬虫代理访问网站，命令如下：
#这些爬虫代理使用“|”分隔，具体要处理的爬虫可以根据需求增加或减少，添加的内容如下：
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot-Modile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Yahoo! SSlurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot")
{
    return 403;
}

3.测试禁止不同的浏览器软件访问
if ($http_user_agent ~* "Firefox|MSIE")
{
    rewrite ^(.*) http://www.wk.com/$1 permanent;
}
#如果浏览器为Firefox或IE，就会跳转到http://www.wk.com

4.限制请求方式
#Only allow these request methods
if ($request_method ! ~ ^(GET|HEAD|POST)$)
{
    return 501;
}

网络上常见的垃圾UA列表

FeedDemon 内容采集
BOT/0.1 (BOT for JCE) sql注入
CrawlDaddy sql注入
Java 内容采集
Jullo 内容采集
Feedly 内容采集
UniversalFeedParser 内容采集
ApacheBench cc攻击器
Swiftbot 无用爬虫
YandexBot 无用爬虫
AhrefsBot 无用爬虫
YisouSpider 无用爬虫
jikeSpider 无用爬虫
MJ12bot 无用爬虫
ZmEu phpmyadmin 漏洞扫描
WinHttp 采集cc攻击
EasouSpider 无用爬虫
HttpClient tcp攻击
Microsoft URL Control 扫描
YYSpider 无用爬虫
jaunty wordpress爆破扫描器
oBot 无用爬虫
Python-urllib 内容采集
Indy Library 扫描
FlightDeckReports Bot 无用爬虫
Linguee Bot 无用爬虫

猜你喜欢