The robots protocol is implemented as robots.txt, a text file placed in the root directory of a website that tells search engines which content may be crawled;
The role of the robots protocol:
1) Guide search engine spiders to crawl specified sections or content;
2) Block links that are not search-engine friendly during a site redesign or URL optimization;
3) Block dead links and 404 error pages;
4) Block meaningless pages and pages without content;
5) Block duplicate pages;
6) Block pages that you do not want to be indexed;
7) Guide spiders to crawl the sitemap;
8) Block large files, pictures and videos on the site to save bandwidth and improve speed;
Syntax and wildcards:
1) User-agent: names the search engine crawler that the rules below it apply to;
2) Disallow: defines pages or directories that spiders are not allowed to crawl;
3) Allow: defines pages or directories that spiders are allowed to crawl;
4) $ matches the end of the URL;
5) * matches 0 or more arbitrary characters;
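The two wildcards above can be sketched in Python as a translation from a rule path to a regular expression. This is a minimal illustration, not any search engine's actual matcher; the function names robots_pattern_to_regex and path_matches are made up for this example, and it assumes a rule is otherwise matched as a prefix of the URL path.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches zero or more characters, a trailing '$' anchors the
    end of the URL, and everything else is taken literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with '.*' where '*' stood.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, which gives prefix semantics.
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    for pattern, path in [
        ("/*.jpg$", "/img/photo.jpg"),
        ("/*.jpg$", "/img/photo.jpg?size=large"),
        ("/*?", "/list?page=2"),
    ]:
        print(pattern, path, path_matches(pattern, path))
```

Without the trailing $, a pattern such as /*.jpg would also match /img/photo.jpg?size=large, because the rule only has to match the beginning of the URL.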
File writing:
1) User-agent: * the asterisk stands for all search engines; a specific crawler is addressed by its name (Google: Googlebot, Baidu: Baiduspider, MSN: MSNbot, Yahoo: Slurp);
2) Disallow: /admin/ prohibits crawling everything under the admin directory;
3) Disallow: /admin prohibits crawling every URL that begins with /admin, such as /admin.html, /adminset.html and /admin/abc.html;
4) Disallow: /admin/*.html prohibits crawling all files with the .html suffix in the admin directory (including subdirectories);
5) Disallow: /*?* prohibits crawling all URLs that contain a question mark;
6) Disallow: /*.jpg$ prohibits crawling all pictures in .jpg format;
7) Disallow: /ab/abc.html prohibits crawling the abc.html file in the ab directory;
8) Allow: /abc/ allows crawling everything under the abc directory;
9) Allow: /tmp allows crawling every URL that begins with /tmp, including the tmp directory;
10) Allow: /*.html$ allows crawling URLs that end with the .html suffix;
11) Allow: /*.gif$ allows crawling pictures in .gif format;
12) Sitemap: <sitemap URL> tells the crawler that this URL is the site map;
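The difference between Disallow: /admin/ and Disallow: /admin can be checked with Python's standard urllib.robotparser module. This is only a sketch: the host example.com and the crawler name ExampleBot are placeholders, and this parser does plain prefix matching, so it is not suitable for testing rules that rely on * or $.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Same rule, with and without the trailing slash.
with_slash = build_parser("User-agent: *\nDisallow: /admin/\n")
without_slash = build_parser("User-agent: *\nDisallow: /admin\n")

if __name__ == "__main__":
    for url in (
        "https://example.com/admin/abc.html",
        "https://example.com/admin.html",
        "https://example.com/adminset.html",
    ):
        print(
            url,
            "| /admin/ allows:", with_slash.can_fetch("ExampleBot", url),
            "| /admin allows:", without_slash.can_fetch("ExampleBot", url),
        )
```

With the trailing slash only URLs inside the directory are blocked; without it, every path that starts with /admin is blocked.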
Example:
1) User-agent: *
Disallow: /admin/
Disallow: /abc/
Note: All search engines are prohibited from crawling the admin and abc directories and subdirectories;
2) User-agent: *
Allow: /admin/seo/
Disallow: /admin/
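Note: All search engines are prohibited from crawling the admin directory and its subdirectories, but the seo directory under the admin directory can be crawled; (write Allow first and Disallow after it: parsers that stop at the first matching rule depend on this order, while Google and RFC 9309 apply the longest matching rule regardless of order)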
3) User-agent: *
Disallow: /abc/*.htm$
Note: All search engines are prohibited from crawling URLs with the .htm suffix in the abc directory and its subdirectories;
4) User-agent: *
Disallow: /*?*
Note: All search engines are prohibited from crawling pages whose URL contains a question mark;
5) User-agent: Baiduspider
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
Note: Baiduspider is prohibited from crawling all pictures on the site;
6) User-agent: *
Disallow: /folder1/
User-agent: Mediapartners-Google
Allow: /folder1/
Note: All search engines are prohibited from crawling folder1, except the Mediapartners-Google robot, which may crawl it so that AdSense ads can be displayed on those pages;
7) User-agent: *
Disallow: /abc*/
Note: All search engines are prohibited from crawling all directories and subdirectories beginning with abc;
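How the rule order in example 2 plays out depends on the parser. The sketch below compares two evaluation strategies on that example: "first match wins", and "longest match wins" as described in RFC 9309. It is a simplified illustration that only handles plain prefix rules (no * or $), and the function names are invented for this example.

```python
def first_match_allowed(rules, path):
    """Return the verdict of the first rule whose prefix matches the path."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched, so crawling is allowed


def longest_match_allowed(rules, path):
    """Return the verdict of the longest matching rule; Allow wins a tie."""
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


# Example 2 as written, and the same rules in the opposite order.
allow_first = [("allow", "/admin/seo/"), ("disallow", "/admin/")]
disallow_first = [("disallow", "/admin/"), ("allow", "/admin/seo/")]

if __name__ == "__main__":
    path = "/admin/seo/index.html"
    for name, rules in (("allow first", allow_first), ("disallow first", disallow_first)):
        print(name, first_match_allowed(rules, path), longest_match_allowed(rules, path))
```

Under the first-match strategy the seo directory is only reachable when Allow is listed first, which is why writing Allow before Disallow is the safer habit.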
Other attributes (non-standard extensions that many crawlers ignore):
1) Specify the robots protocol version number:
Robot-version: Version 2.0
2) Allow search engines to crawl only during a specified time period:
Visit-time: 0100-1300 Allow access between 1:00 and 13:00
3) Limit the URL request rate:
Request-rate: 40/1m 0800-1300 Between 8:00 and 13:00, crawl at a rate of 40 requests per minute
Robots meta tag:
<meta name="Robots" content="all|none|index|noindex|follow|nofollow">
Property description:
1) all: the page will be indexed and the links on it can be followed; this is the default;
2) none: the page will not be indexed and the links on it will not be followed;
3) index: the page will be indexed;
4) follow: the links on the page can be followed;
5) noindex: the page will not be indexed;
6) nofollow: the links on the page will not be followed;
Combined use:
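1) The page may be indexed, and its links may be followed to index other pages
<meta name="robots" content="index,follow">
This can also be written as
<meta name="robots" content="all">
2) The page may not be indexed, but its links may be followed to index other pages
<meta name="robots" content="noindex,follow">
3) The page may be indexed, but its links may not be followed to index other pages
<meta name="robots" content="index,nofollow">
4) The page may not be indexed, and its links may not be followed to index other pages
<meta name="robots" content="noindex,nofollow">
This can also be written as
<meta name="robots" content="none">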
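The combinations above can be read back from a page with Python's standard html.parser module. This is a rough sketch of how a crawler might interpret the tag, not how any particular search engine does it; the class RobotsMetaParser and the function robots_policy are names made up for this example, and only the six values listed in this article are handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (may_index, may_follow); both default to True, like 'all'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


if __name__ == "__main__":
    print(robots_policy('<meta name="robots" content="noindex,follow">'))
    print(robots_policy('<meta name="robots" content="none">'))
```

A page without the tag is treated the same as content="all", which matches the default described above.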