Webmaster: The Robots Crawler Protocol (robots.txt)

The Robots Protocol (also known as the Crawler Protocol, Robot Protocol, etc.), formally the "Robots Exclusion Protocol", lets a website tell search engines which pages may be crawled and which may not. This article introduces the robots crawler protocol in detail.

 

Overview

The robots.txt file is a plain text file, and it is the first file a search engine looks at when visiting a website. The robots.txt file tells the spider which files on the server may be viewed.
When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its scope of access according to the contents of that file; if the file does not exist, all search spiders will be able to access every page of the website that is not password protected.
The protocol is built on the following principles:
1. Search technology should serve people, while respecting the wishes of information providers and safeguarding their right to privacy;
2. The website is obliged to protect the personal information and privacy of its users from infringement.
[Note] robots.txt must be placed in the site's root directory, and the file name must be all lowercase.
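
To make this behavior concrete, here is a minimal Python sketch using the standard library's urllib.robotparser module; it fetches a site's robots.txt from the root directory and asks whether given URLs may be crawled. The domain and paths are placeholders, not taken from this article.

from urllib.robotparser import RobotFileParser

# Point the parser at the robots.txt in the site's root directory
# ("https://example.com" is a placeholder domain).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the file; if it does not exist, everything is treated as allowed

# Ask whether a particular spider may fetch a particular URL.
print(rp.can_fetch("*", "https://example.com/admin/index.html"))
print(rp.can_fetch("Baiduspider", "https://example.com/index.html"))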

 

Syntax

[User-agent]
In the code below, * is a wildcard that represents all search engine types, i.e. all search robots

User-agent: *

 

The following code represents Baidu's search robot

User-agent: Baiduspider
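
To show how a crawler decides which group of rules applies to it, the hedged Python sketch below parses a small robots.txt with a Baiduspider-specific group and a catch-all * group; the rules themselves are made up for illustration.

from urllib.robotparser import RobotFileParser

# Illustrative file: Baiduspider gets its own group, every other robot uses "*".
robots_txt = """\
User-agent: Baiduspider
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Baiduspider", "/private/page.html"))  # False - matched by its own group
print(rp.can_fetch("Googlebot", "/private/page.html"))    # True  - falls back to the "*" group
print(rp.can_fetch("Googlebot", "/tmp/page.html"))        # False - blocked by the "*" group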

 

[Disallow]
The following code forbids crawling the directories under the admin directory

Disallow: /admin/

 

The following code forbids crawling any URL ending in ".jpg", i.e. all .jpg images on the site

Disallow: /*.jpg$

 

The following code forbids crawling the adc.html file under the ab directory

Disallow: /ab/adc.html

 

The following code blocks access to every URL on the site that contains a question mark (?)

Disallow: /*?*

 

The following code prohibits access to all pages of the website

Disallow: /
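
The prefix-style rules above can be verified with the following hedged Python sketch. Note that urllib.robotparser only performs simple prefix matching, so the wildcard forms with * and $ (extensions supported by major engines) are left out here, and the example paths are assumptions.

from urllib.robotparser import RobotFileParser

# Prefix rules taken from the examples above.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /ab/adc.html
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "/admin/users.html"))  # False - everything under /admin/ is blocked
print(rp.can_fetch("*", "/ab/adc.html"))       # False - this exact file is blocked
print(rp.can_fetch("*", "/ab/other.html"))     # True  - other files under /ab/ are still crawlable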

 

[Allow]
The following code allows access to URLs with the ".html" suffix

Allow: /*.html$

 

The following code allows the entire tmp directory to be crawled

Allow: /tmp
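
Allow is usually combined with Disallow to open up part of an otherwise blocked area. Below is a minimal sketch of that pattern (the paths are illustrative, not from the article); placing the Allow line before the broad Disallow keeps the result the same whether a crawler applies first-match or most-specific-match precedence.

from urllib.robotparser import RobotFileParser

# Block the whole site except the /tmp/ directory (illustrative only).
robots_txt = """\
User-agent: *
Allow: /tmp/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "/tmp/report.html"))  # True  - the Allow rule matches
print(rp.can_fetch("*", "/secret.html"))      # False - everything else is disallowed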

 

Usage
The following code means that all robots are allowed to access all pages of the website

User-agent: *
Allow: /

 

The code below means that all search engines are prohibited from accessing any part of the website

User-agent: *
Disallow: /

 

The following code prohibits Baidu's robot (Baiduspider) from accessing any directory of the website

User-agent: Baiduspider
Disallow: /

 

The following code prohibits all search engines from accessing files in three directories of the website: cgi-bin, tmp, and ~joe

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
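
A configuration like the last one can be double-checked programmatically. The sketch below feeds the example file to urllib.robotparser and prints the verdict for a few assumed paths.

from urllib.robotparser import RobotFileParser

# The multi-directory example above, verified programmatically.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for path in ("/cgi-bin/run.cgi", "/tmp/cache.html", "/~joe/index.html", "/index.html"):
    print(path, rp.can_fetch("*", path))
# The first three paths come out blocked; /index.html remains crawlable.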

 

Misunderstandings

[Misunderstanding 1]: Every file on the website needs to be crawled by spiders, so there is no need to add a robots.txt file; after all, if the file does not exist, all search spiders will by default be able to access every page on the site that is not password protected.
However, whenever a user requests a URL that does not exist, the server logs a 404 error (file not found), and it will likewise record a 404 error every time a search spider asks for a robots.txt file that does not exist. For this reason you should add a robots.txt file to the website.

[Misunderstanding 2]: If every file on the website can be crawled by search spiders, the site's indexing rate will increase.
Even if program scripts, style sheets and other such files are indexed by spiders, this does not improve the site's indexing rate; it only wastes server resources. Therefore, the robots.txt file should be set so that search spiders are not allowed to index these files.
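
As a hedged sketch of such a setup, assuming the site keeps its scripts, style sheets and CGI programs under /js/, /css/ and /cgi-bin/ (made-up paths; adjust them to the real site layout), the corresponding rules could look like this and be verified the same way:

from urllib.robotparser import RobotFileParser

# Hypothetical layout: scripts in /js/, style sheets in /css/, programs in /cgi-bin/.
robots_txt = """\
User-agent: *
Disallow: /js/
Disallow: /css/
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "/css/site.css"))    # False - style sheets are kept out of the index
print(rp.can_fetch("*", "/js/app.js"))       # False - scripts as well
print(rp.can_fetch("*", "/article/1.html"))  # True  - ordinary content pages remain crawlable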
