Writing methods and examples for the robots protocol file

The robots protocol is a text file (robots.txt) placed in the root directory of a website that tells search engines which content may be crawled.

Roles of the robots protocol:
   1) Guide search engine spiders to crawl the specified sections or content;
   2) Block links that are unfriendly to search engines during a site redesign or URL optimization;
   3) Block dead links and 404 error pages;
   4) Block meaningless, contentless pages;
   5) Block duplicate pages;
   6) Block pages you do not want indexed;
   7) Guide spiders to the site map;
   8) Block large files, pictures, and videos on the site to save bandwidth and improve speed.

Syntax and wildcards:
   1) User-agent: defines which search engine the rules apply to;
   2) Disallow: defines pages or directories that spiders are not allowed to crawl;
   3) Allow: defines pages or directories that spiders are allowed to crawl;
   4) $ matches the end of the URL;
   5) * matches 0 or more arbitrary characters;
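The two wildcards can be modeled by translating a rule path into a regular expression: `*` becomes `.*`, a trailing `$` becomes an end-of-string anchor, and a plain rule matches as a prefix. A minimal sketch of that matching logic (the helper name `robots_pattern_to_regex` is ours, not part of any standard library):

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt rule path into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    end of the URL; everything else matches literally as a prefix.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # restore the end-of-URL anchor
    return re.compile(regex)

# Prefix rule: 'Disallow: /admin' also matches /adminset.html
print(bool(robots_pattern_to_regex("/admin").match("/adminset.html")))   # True
# Anchored rule: '/*.jpg$' matches only URLs that end in .jpg
print(bool(robots_pattern_to_regex("/*.jpg$").match("/img/a.jpg")))      # True
print(bool(robots_pattern_to_regex("/*.jpg$").match("/img/a.jpg?v=2")))  # False
```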

File writing:
   1) User-agent: * here, * represents all search engines; (Google: Googlebot, Baidu: Baiduspider, MSN: MSNbot, Yahoo: Slurp)

   2) Disallow: /admin/ prohibits crawling of everything under the admin directory, including subdirectories;

   3) Disallow: /admin prohibits crawling of any URL beginning with /admin, such as /admin.html, /adminset.html, /admin/abc.html;

   4) Disallow: /admin/*.html prohibits crawling of all files with the .html suffix in the admin directory (including subdirectories);

   5) Disallow: /*?* prohibits crawling of all URLs containing a question mark;

   6) Disallow: /*.jpg$ prohibits crawling of all pictures in .jpg format;

   7) Disallow: /ab/abc.html prohibits crawling of the abc.html file under the ab directory;

   8) Allow: /abc/ allows crawling of everything under the abc directory;

   9) Allow: /tmp allows crawling of any URL beginning with /tmp;

   10) Allow: /*.html$ allows crawling of web pages whose URL ends with the .html suffix;

   11) Allow: /*.gif$ allows crawling of pictures in .gif format;

   12) Sitemap: <sitemap URL> tells the crawler where the site map is located;
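These directives can be tried out with Python's standard-library `urllib.robotparser`, which parses a rule list directly. Since it honors rules in the order written, the narrower Allow line is placed before the broader Disallow here; the bot name `MyBot` is only a placeholder:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /admin/seo/",   # narrower Allow first, so it wins
    "Disallow: /admin/",
])

print(rp.can_fetch("MyBot", "https://example.com/admin/seo/page.html"))  # True
print(rp.can_fetch("MyBot", "https://example.com/admin/index.html"))     # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))           # True
```

Note that `urllib.robotparser` follows the original robots.txt draft and does not interpret the `*` and `$` wildcards inside rule paths.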

Examples:
1) User-agent: *
Disallow: /admin/
Disallow: /abc/
Note: prohibits all search engines from crawling the admin and abc directories and their subdirectories;

2) User-agent: *
Allow: /admin/seo/
Disallow: /admin/
Note: prohibits all search engines from crawling the admin directory and its subdirectories, except that the seo directory under admin may be crawled (the Allow line must come before the Disallow line);

3) User-agent: *
Disallow: /abc/*.htm$
Note: prohibits all search engines from crawling URLs with the .htm suffix under the abc directory and its subdirectories;

4) User-agent: *
Disallow: /*?*
Note: prohibits all search engines from crawling URLs that contain a question mark (dynamic pages);

5) User-agent: Baiduspider
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
Note: prohibits Baiduspider from crawling all pictures in these formats;

6) User-agent: *
Disallow: /folder1/
User-agent: Mediapartners-Google
Allow: /folder1/
Note: prohibits all search engines from crawling folder1, but allows the Mediapartners-Google robot to crawl it, so AdSense ads can still be displayed on those pages;

7) User-agent: *
Disallow: /abc*/
Note: All search engines are prohibited from crawling all directories and subdirectories beginning with abc;
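Per-agent groups like example 6 can also be verified with `urllib.robotparser`: a crawler uses the group whose User-agent matches its own name and falls back to the `*` group otherwise (`SomeBot` is a made-up name for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /folder1/",
    "",
    "User-agent: Mediapartners-Google",
    "Allow: /folder1/",
])

# An ordinary bot falls back to the '*' group and is blocked.
print(rp.can_fetch("SomeBot", "https://example.com/folder1/page.html"))  # False
# Mediapartners-Google matches its own group and is allowed.
print(rp.can_fetch("Mediapartners-Google",
                   "https://example.com/folder1/page.html"))             # True
```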

Other attributes:
1) Specify the robots protocol version number:
Robot-version: Version 2.0
2) Allow search engines to crawl the specified URLs only during a given time period:
Visit-time: 0100-1300 (allow access between 1:00 and 13:00)
3) Limit the URL request frequency:
Request-rate: 40/1m 0800-1300 (between 8:00 and 13:00, allow access at a rate of 40 requests per minute)
Note that Robot-version, Visit-time, and Request-rate are nonstandard extensions and are ignored by most search engines.

Robots meta tag:

   <meta name="Robots" content="all|none|index|noindex|follow|nofollow">

Property description:
1) all: the file will be indexed and the links on the page may be followed; all is the default;

2) none: the file will not be indexed and the links on the page will not be followed;

3) index: the file will be indexed;

4) follow: the links on the page may be followed;

5) noindex: the file will not be indexed;

6) nofollow: the links on the page will not be followed;

Combined use:

	   1) This page may be indexed, and the links on it may be followed:
	       <meta name="robots" content="index,follow">
	        which can also be written as
	        <meta name="robots" content="all">

	    2) This page may not be indexed, but the links on it may be followed:
	      <meta name="robots" content="noindex,follow">

        3) This page may be indexed, but the links on it may not be followed:
         <meta name="robots" content="index,nofollow">

        4) This page may not be indexed, and the links on it may not be followed:
            <meta name="robots" content="noindex,nofollow">
            which can also be written as
            <meta name="robots" content="none">
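The combinations above reduce to two independent flags, which a crawler might extract from the content attribute like this (a sketch; `parse_robots_meta` is our own helper, not a standard API):

```python
def parse_robots_meta(content):
    """Reduce a robots meta content value to (index, follow) flags."""
    tokens = {token.strip().lower() for token in content.split(",")}
    if "none" in tokens:   # shorthand for noindex,nofollow
        return (False, False)
    if "all" in tokens:    # shorthand for index,follow
        return (True, True)
    return ("noindex" not in tokens, "nofollow" not in tokens)

print(parse_robots_meta("index,follow"))     # (True, True)
print(parse_robots_meta("noindex, follow"))  # (False, True)
print(parse_robots_meta("none"))             # (False, False)
```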

Origin: blog.csdn.net/qq_36129701/article/details/104789902