Nginx anti-crawler optimization

Summary (reprinted):

Method 1: Create a text file named robots.txt containing rules that tell search engines which paths on the site they must not visit, then upload it to the web root directory. When a search engine spider indexes a website, it first checks whether a robots.txt file exists in the site's root directory and uses it to decide what it is allowed to crawl.
# Excerpt from JD.com
cat << EOF > robots.txt
User-agent: *
Disallow: /*?
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
EOF
# Excerpt from Taobao
cat << EOF > robots.txt
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Allow: /$
Disallow: /product/
Disallow: /

the User-Agent: Googlebot
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Allow:  /ershou
Allow: /$
Disallow:  /

User-agent:  Bingbot
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Allow:  /ershou
Allow: /$
Disallow:  /

User-Agent:  360Spider
Allow:  /article
Allow:  /oshtml
Allow:  /ershou
Disallow:  /

User-Agent:  Yisouspider
Allow:  /article
Allow:  /oshtml
Allow:  /ershou
Disallow: /

User-agent: Sogouspider
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /ershou
Disallow: /

User-agent: Yahoo! Slurp
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

User-agent: *
Disallow: /
EOF
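
Once generated, the file can be copied into the site's web root and fetched once to confirm it is being served. The paths below are only a sketch: they assume a default web root of /usr/share/nginx/html and the placeholder domain example.com, so adjust both to your environment. Keep in mind that robots.txt is purely advisory; well-behaved spiders honor it, but it does not technically stop a crawler.
# Sketch: publish robots.txt and verify that it is reachable
# (assumes web root /usr/share/nginx/html and placeholder domain example.com)
cp robots.txt /usr/share/nginx/html/robots.txt
curl -s http://example.com/robots.txt | head -5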

Method 2: Block specified crawlers from the site based on the user-agent information sent by the client.

1. Block download tools and HTTP libraries, with directives such as the following:
## Block download agents ##
if ($http_user_agent ~* LWP::Simple|BBBike|wget)
{
    return 403;
}
# Note: if the client's user-agent string matches one of the listed tools (e.g. wget), nginx returns 403.
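
A quick way to confirm the rule works is to send a request with a forged user-agent; the domain below is a placeholder for the server being protected.
# Sketch: simulate a wget client and expect HTTP 403 (example.com is a placeholder)
curl -I -A "Wget/1.21" http://example.com/
# A normal browser user-agent should still receive 200
curl -I -A "Mozilla/5.0" http://example.com/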

2. Read the client agent from $http_user_agent, then decide whether to allow the request or return a specified error code.
To block a whole list of crawler agents from accessing the website, add something like the following:
# Separate the crawler agents with "|"; the specific crawlers to handle can be added or removed as needed:
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot")
{
    return 403;
}
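
As a variation on the same idea (a sketch, not part of the original post), the blocked patterns can also be collected in a map block in the http context, which keeps the list in one place and avoids a long if condition; the patterns and server name below are illustrative only.
# Sketch: maintain the blocked user-agent list with a map (http context)
map $http_user_agent $block_ua {
    default            0;
    ~*qihoobot         1;
    ~*Googlebot-Mobile 1;
    ~*Googlebot-Image  1;
    ~*YoudaoBot        1;
    ~*Sosospider       1;
    # extend with further patterns as required
}

server {
    listen 80;
    server_name example.com;   # placeholder
    if ($block_ua) {
        return 403;
    }
}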

3. Test blocking (redirecting) access from specific browsers
if ($http_user_agent ~* "Firefox|MSIE")
{
    rewrite ^(.*)$ http://www.wk.com/$1 permanent;
}
# If the browser is Firefox or IE, the request is permanently redirected (301) to http://www.wk.com
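
The redirect can be checked by sending a request with a Firefox-style user-agent and inspecting the status line and Location header; example.com stands in for the server carrying this configuration.
# Sketch: a Firefox user-agent should receive a 301 pointing at www.wk.com
curl -sI -A "Firefox/115.0" http://example.com/index.html | grep -iE "^(HTTP|Location)"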

4. Restrict request methods
# Only allow these request methods
if ($request_method !~ ^(GET|HEAD|POST)$)
{
    return 501;
}
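
With this rule in place, any method outside the whitelist should be rejected; again, example.com is a placeholder.
# Sketch: methods other than GET/HEAD/POST should be answered with 501
curl -s -o /dev/null -w "%{http_code}\n" -X DELETE http://example.com/
curl -s -o /dev/null -w "%{http_code}\n" -X PUT http://example.com/
# Expected output: 501 for both requests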

 

Common junk user agents (UA) seen on the network

FeedDemon              content scraping
BOT/0.1 (BOT for JCE)  SQL injection
CrawlDaddy             SQL injection
Java                   content scraping
Jullo                  content scraping
Feedly                 content scraping
UniversalFeedParser    content scraping
ApacheBench            CC (HTTP flood) attack
Swiftbot               useless crawler
YandexBot              useless crawler
AhrefsBot              useless crawler
YisouSpider            useless crawler
jikeSpider             useless crawler
MJ12bot                useless crawler
ZmEu                   phpMyAdmin vulnerability scanning
WinHttp                scraping / CC attack
EasouSpider            useless crawler
HttpClient             TCP attack
Microsoft URL Control  scanning
YYSpider               useless crawler
jaunty                 WordPress brute-force scanner
oBot                   useless crawler
Python-urllib          content scraping
Indy Library           scanning
FlightDeckReports Bot  useless crawler
Linguee Bot            useless crawler
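
Building on the list above, the following is a sketch (not from the original post) of a matching nginx rule that denies these user agents. Review the pattern before using it, since some entries (e.g. YandexBot) are legitimate search engines in some regions, and "BOT for JCE" is matched as a plain substring to avoid regex metacharacters.
# Sketch: deny the junk user agents listed above (place inside a server or location block)
if ($http_user_agent ~* "FeedDemon|BOT for JCE|CrawlDaddy|Jullo|Feedly|UniversalFeedParser|ApacheBench|Swiftbot|YandexBot|AhrefsBot|YisouSpider|jikeSpider|MJ12bot|ZmEu|WinHttp|EasouSpider|HttpClient|Microsoft URL Control|YYSpider|jaunty|oBot|Python-urllib|Indy Library|FlightDeckReports Bot|Linguee Bot")
{
    return 403;
}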

Source: www.cnblogs.com/hrers/p/11456045.html