PJzhang: the robots protocol in practice

Cat Ning! ! !

Reference links:

https://bbs.360.cn/thread-15062960-1-1.html

https://ziyuan.baidu.com/college/courseinfo?id=150

 

When you come across the keyword robots, start with its definition. Baidu Encyclopedia introduces it as follows:

robots is an agreement between a website and crawlers. In plain, simple txt-format text, it tells crawlers what they are permitted to access; in other words, robots.txt is the first file a search engine looks at when it visits a site. When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines the scope of its visit from the contents of that file; if it does not, every search spider can access all pages on the site that are not password-protected.
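The lookup described above can be tried with Python's standard urllib.robotparser module. This is only a minimal sketch: the two rules, the host example.com, and the crawler name ExampleBot are placeholders chosen for illustration, and the text is parsed from a string instead of being fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /history
"""


def build_parser(text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, path, agent="ExampleBot"):
    """Ask whether the given crawler may fetch the path on the placeholder host."""
    return parser.can_fetch(agent, "https://example.com" + path)


site = build_parser(ROBOTS_TXT)
empty = build_parser("")  # stands in for a site whose robots.txt has no rules

print(allowed(site, "/account/settings"))   # covered by a Disallow rule
print(allowed(site, "/search"))             # not covered by any rule
print(allowed(empty, "/account/settings"))  # no rules at all
```

A real crawler would call set_url() and read() on the parser to download the file first; parsing a string keeps the sketch independent of any site.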

 

Here is an example of robots.txt in use:

https://cn.bing.com/robots.txt is the robots.txt file of Bing search; an excerpt follows.

User-agent: msnbot-media

Disallow: /

Allow: /th?

 

User-agent: Twitterbot

Disallow:

 

User-agent: *

Disallow: /account/

Disallow: /amp/

Disallow: /bfp/search

Disallow: /bing-site-safety

Disallow: /blogs/search/

Disallow: /entities/search

Disallow: /fd/

Disallow: /history

Disallow: /hotels/search

Disallow: /images?

Disallow: /images/search?

Disallow: /images/search/?

 

Sitemap: http://cn.bing.com/dict/sitemap-index.xml

 

Site operators only need robots.txt when they do not want some pages indexed by search engines; without it, the whole site can by default be crawled and indexed.

The robots.txt file is placed in the site's root directory. Its content may comprise multiple records separated by blank lines. If robots.txt is empty, the whole site may be crawled.

User-agent: * applies to all crawlers.

User-agent: Twitterbot applies only to the Twitterbot crawler.

Disallow: /bfp/search means crawlers may not access any page whose URL begins with /bfp/search, for example /bfp/search/abc.html or /bfp/searchabc.html. The rule only concerns crawlers; visiting one of these URLs directly is still possible.

Allow: /bfp/search/vip means crawlers may access all pages whose URL begins with /bfp/search/vip.

Sitemap: http://cn.bing.com/dict/sitemap-index.xml tells crawlers where the sitemap file is.

Combining Allow and Disallow gives flexible control over which pages crawlers visit and avoids a one-size-fits-all rule.
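How an overlapping Allow and Disallow pair plays out can be sketched in a few lines of Python. The text does not say how a crawler resolves the conflict, so this sketch assumes the longest-match convention (the most specific prefix wins, and Allow wins a tie) that major search engines document; the function name is_allowed and the rule list are illustrative only.

```python
def is_allowed(path, rules):
    """Decide whether a path may be crawled.

    rules is a list of (directive, prefix) pairs such as
    ("Disallow", "/bfp/search"). Assumption: the longest matching
    prefix wins, and Allow wins when two matches are equally long.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [
    ("Disallow", "/bfp/search"),
    ("Allow", "/bfp/search/vip"),
]

for path in ("/bfp/search/abc.html", "/bfp/searchabc.html",
             "/bfp/search/vip/offer.html", "/bfp/other"):
    print(path, is_allowed(path, RULES))
```

Parsers that instead apply the first matching rule in file order (Python's urllib.robotparser is one) would give a different answer for the /bfp/search/vip path, so the order of the lines can matter in practice.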

 

Uppercase and lowercase letters are strictly distinguished here.

 

* The asterisk matches zero or more characters.

$ The dollar sign marks the end of the URL.

These are the two wildcards.
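One way to see what the two wildcards do is to translate a rule path into a regular expression. The following Python sketch does that under simple assumptions: a rule is matched from the start of the URL path, * becomes "any run of characters", and a trailing $ anchors the end. The helper names pattern_to_regex and matches are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where "*" stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the rule pattern matches the path from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/abc/*.html$", "/abc/page.html"))
print(matches("/abc/*.html$", "/abc/page.html?x=1"))
print(matches("/*?*", "/list?id=3"))
print(matches("/*?*", "/list"))
```

Without a trailing $, a pattern behaves as a prefix, which is why /abc/ also covers /abc/def.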

 

Block all search engines from accessing any directory of the site.

User-agent: *

Disallow: /

 

Allow all search engines to access any directory of the site.

User-agent: *

Allow: /

 

Block Baidu from accessing any directory of the site.

User-agent: Baiduspider

Disallow: /

 

Allow only Baidu to access the site's directories. The second record is needed to block every other crawler.

User-agent: Baiduspider

Allow: /

 

User-agent: *

Disallow: /
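How crawlers pick the record that applies to them can be checked with Python's standard urllib.robotparser. The sketch below assumes a Baidu-only rule set made of a Baiduspider record plus a catch-all record that blocks everyone else; Googlebot is used only as an example of "some other crawler", and example.com is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: admit Baiduspider, turn every other crawler away.
ONLY_BAIDU = """\
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ONLY_BAIDU.splitlines())


def may_crawl(agent, path="/index.html"):
    """Check a crawler name against the rule set on a placeholder host."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(may_crawl("Baiduspider"))  # matched by its own record
print(may_crawl("Googlebot"))    # falls through to the catch-all record
```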

 

Block access to everything under the /abc/ directory, except files with the .html extension.

User-agent: *

Disallow: /abc/

Allow: /abc/*.html$

 

Block access to all dynamic pages on the site. Note that every character must be an English (half-width) character.

User-agent: *

Disallow: /*?*

 

That covers the basics. Reading the robots.txt files of well-known sites shows which paths they do not want search engines to crawl, and for penetration testing this sometimes leads to new findings or inspiration.

http://www.dianping.com/robots.txt shows that Dianping does not want any crawler to crawl several of its directories, such as coupons, albums, and member accounts. It also flatly forbids the crawlers of Aibang (a local-life information aggregator) and Koubei (Alibaba's local-life information platform) from crawling anything on the Dianping domain www.dianping.com.

The full content:

User-agent: *

 

Disallow: /coupon/

Disallow: /events/

Disallow: /thirdconnect/

Disallow: /member/

Disallow: /album/

Disallow: /dplab/

 

User-agent: www.aibang.com
Disallow: /

User-agent: aibang.com
Disallow: /

User-agent: aibang
Disallow: /

User-agent: aibangspider
Disallow: /

User-agent: aibang-spider
Disallow: /

User-agent: aibangbot
Disallow: /

User-agent: aibang-bot
Disallow: /

User-agent: koubeispider
Disallow: /

User-agent: koubei.com
Disallow: /
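When reviewing a file like this one, it can help to list the disallowed paths per crawler. The Python sketch below does so for robots.txt text held in a string. It is a simplified reading of the format (one directive per line, # starts a comment), the function name disallowed_paths is hypothetical, and the sample is a shortened version of the listing above.

```python
from collections import defaultdict


def disallowed_paths(robots_txt):
    """Map each user-agent name to the paths its records disallow."""
    groups = defaultdict(list)
    agents = []
    reading_agents = False  # True while consecutive User-agent lines arrive
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                agents = []  # a new record starts here
            agents.append(value)
            reading_agents = True
        else:
            reading_agents = False
            if field == "disallow" and value:
                for agent in agents:
                    groups[agent].append(value)
    return dict(groups)


SAMPLE = """\
User-agent: *
Disallow: /coupon/
Disallow: /member/

User-agent: aibangbot
Disallow: /

User-agent: koubeispider
Disallow: /
"""

for agent, paths in disallowed_paths(SAMPLE).items():
    print(agent, paths)
```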

 

Read a site's robots.txt carefully, and you may catch the lingering glow of past sword fights in the Internet arena.

 

Origin www.cnblogs.com/landesk/p/10984431.html