Crawlers and robots.txt

    robots.txt is a convention between a website and web crawlers: in a simple, plain-text format, the site tells crawlers what they are and are not permitted to fetch. robots.txt is the first file a search engine looks at when it visits a site.

When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines the scope of its crawl from the file's contents; if the file does not exist, search spiders can access every page on the site that is not otherwise protected (for example, by a password).

About robots


        A search engine robot (also known as a spider) is a program that automatically visits web pages on the Internet and collects site information.

       You can create a plain-text file named robots.txt on your website that declares which parts of the site you do not want robots to access. In this way, part or all of the site's content can be kept out of search engine indexes, or indexing can be limited to only the content you specify.

robots.txt (always lowercase) is an ASCII-encoded text file stored in the site's root directory. It tells search engine robots (also called spiders or rovers) which parts of the site may be fetched and which may not.

Because URLs are case-sensitive on some systems, the robots.txt file name should be all lowercase, and the file should be placed in the root directory of your site. If you want search engine robots to behave differently in particular subdirectories, you can merge those per-directory settings into the robots.txt in the root directory, or use robots meta tags on the affected pages.

     The robots.txt protocol is not a standard, only a convention, and it does not guarantee a site's privacy. Note that robots.txt rules are applied by string comparison against the URL, so a directory path with and without a trailing slash "/" represents two different URLs, and wildcards such as "Disallow: *.gif" are not part of the original convention. Search engines that honor the convention will usually neither index a disallowed page nor the pages it links to.

 
Where does the robots.txt file go?
 

The robots.txt file should be placed in the web root directory. When a robot visits a website, it first checks whether this file exists at the site root; if the robot finds the file, it determines the scope of its access rights from the file's contents.
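Because robots.txt always lives at the root of the host, a crawler can derive its location from any page URL it starts from. A minimal sketch using Python's standard library (the function name `robots_url` is my own, chosen for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url: str) -> str:
    # robots.txt always lives at the root of the host,
    # regardless of which page the crawler started from.
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post.html"))
# https://example.com/robots.txt
```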

 

robots.txt file format


"Robots.txt" file contains one or more records, which are separated by blank lines (in CR, CR / NL, or NL as the terminator), each record format shown below:

"<field>:<optionalspace><value><optionalspace>"

Comments can be added in this file with #, following the same convention as in UNIX. A record typically begins with one or more User-agent lines, followed by several Disallow lines, as detailed below:

User-agent:

The value names the search engine robot the record applies to. A "robots.txt" file may contain multiple User-agent records, each limiting the protocol to the named robot, and there must be at least one User-agent record. If the value is set to *, the record applies to every robot, and there may be only one "User-agent: *" record in the file.

Disallow:

The value describes a URL that should not be accessed. The URL can be a complete path or just a prefix of one; any URL beginning with the Disallow value will not be accessed by the robot. For example, "Disallow: /help" blocks search engines from both /help.html and /help/index.html, while "Disallow: /help/" allows the robot to access /help.html but not /help/index.html. If the value of a Disallow record is empty, every part of the site may be accessed; a "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search engine robots.
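The prefix-matching behavior described above, including the difference a trailing slash makes, can be checked with Python's standard-library `urllib.robotparser` (a sketch; the rules shown are the examples from this section):

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /help" blocks every URL that starts with /help.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help"])
print(rp.can_fetch("*", "/help.html"))        # False
print(rp.can_fetch("*", "/help/index.html"))  # False

# "Disallow: /help/" only blocks URLs under the /help/ directory.
rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
print(rp2.can_fetch("*", "/help.html"))        # True
print(rp2.can_fetch("*", "/help/index.html"))  # False
```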

Allow:

The value describes a set of URLs that may be accessed. As with Disallow, the value can be a complete path or a path prefix; any URL beginning with the Allow value may be accessed by the robot. For example, "Allow: /hibaidu" allows the robot to access /hibaidu.htm, /hibaiducom.html, and /hibaidu/com.html. All URLs on a site are allowed by default, so Allow is usually used together with Disallow, to permit access to some pages while blocking all other URLs.

Note in particular that the order of Disallow and Allow lines is meaningful: for a given URL, the robot decides whether to access it based on the first Allow or Disallow line that matches.
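The first-match rule described above can be demonstrated with `urllib.robotparser`, whose implementation also uses first-match order (note that some engines, such as Google, instead apply the longest matching rule; the paths below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Allow comes first, so /help/public/... is allowed even though
# the later Disallow line would also match it.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /help/public",
    "Disallow: /help/",
])

print(rp.can_fetch("*", "/help/public/faq.html"))  # True: Allow matches first
print(rp.can_fetch("*", "/help/index.html"))       # False: Disallow matches
```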

Use "*" and "$":

robots supports the use of wildcards "*" and "$" to fuzzy matching url:

"$" Match line endings.

"*" Matches zero or more of any character.

robots.txt syntax

 

Here are direct examples of the most common cases:

1. Allow all search engines to index the site:

robots.txt can simply be left empty; write nothing in it.

2. Prohibit all search engines from indexing certain directories:

User-agent: *

Disallow: /directory1/

Disallow: /directory2/

Disallow: /directory3/

3. Prohibit a specific search engine from indexing the site, for example Baidu:

User-agent: Baiduspider

Disallow: /

4. Prohibit all search engines from indexing the site:

User-agent: *

Disallow: /
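Cases 3 and 4 above can be verified with `urllib.robotparser`: a record naming Baiduspider affects only that robot, leaving others unrestricted (a sketch; `/index.html` is an arbitrary example path):

```python
from urllib.robotparser import RobotFileParser

# Case 3: only Baiduspider is blocked from the whole site.
rp = RobotFileParser()
rp.parse([
    "User-agent: Baiduspider",
    "Disallow: /",
])

print(rp.can_fetch("Baiduspider", "/index.html"))  # False: record applies
print(rp.can_fetch("Googlebot", "/index.html"))    # True: no record for it
```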

 
For example, you can view Baidu's own robots.txt file at www.baidu.com/robots.txt.

 

Origin www.cnblogs.com/benpao1314/p/11352276.html