Robots.txt leaks sensitive information

What is the robots protocol?

The robots protocol is a convention between a website and crawlers. Through the robots.txt file, a website tells search engines which pages may be crawled and which may not.

When a search engine spider visits a site, it first checks whether a robots.txt file exists in the site's root directory, and then determines which paths it is allowed to crawl according to the rules in that file.
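
As a minimal sketch of how a well-behaved crawler consults robots.txt before fetching pages (the domain example.com and the user-agent name MySpider are placeholders, not from the original article), Python's standard urllib.robotparser can be used:

```python
from urllib import robotparser

# Point the parser at the target site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# A compliant crawler asks before requesting each URL.
print(rp.can_fetch("MySpider", "https://example.com/"))        # home page allowed?
print(rp.can_fetch("MySpider", "https://example.com/admin/"))  # disallowed path?
```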

Why does robots.txt leak sensitive information?

The robots.txt file itself is not a vulnerability; it simply tells search engine spiders which paths may and may not be crawled. The problem is that, to keep spiders away from certain areas, administrators usually list those paths explicitly in the file. Because these Disallow entries often point to the site's admin backend, database management pages, or backup directories, anyone who reads robots.txt can learn exactly where that sensitive content lives.
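
For example, a robots.txt like the hypothetical one below (all paths invented for illustration) keeps spiders out, but at the same time tells any visitor where the backend and database pages are:

```
User-agent: *
Disallow: /admin/          # backend login
Disallow: /phpmyadmin/     # database management
Disallow: /backup/         # site backups
```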

How to check a site's robots.txt for this issue:

You can use a crawler or directory-scanning tool to scan the site for sensitive files and directories and grab the robots file, or simply append /robots.txt to the site's URL and open it directly.
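
The manual check can also be scripted. This is a minimal sketch using only the Python standard library; the target URL is a placeholder:

```python
import urllib.request

target = "https://example.com"  # placeholder target site

try:
    # Append /robots.txt to the site root and fetch it.
    with urllib.request.urlopen(target.rstrip("/") + "/robots.txt", timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        # Print only the Disallow lines, which are the ones that may leak paths.
        for line in body.splitlines():
            if line.lower().startswith("disallow"):
                print(line)
except Exception as exc:
    print("robots.txt not found or request failed:", exc)
```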

How to fix it?

  1. First, be clear that robots.txt should not be used to protect or hide information. Move sensitive files and directories into a single isolated subdirectory and exclude only that directory, so that individual sensitive paths are not spelled out in robots.txt.
  2. Alternatively, set the content of robots.txt to Disallow: / to tell search engines not to crawl any part of the site (see the example below).
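
As an illustration of the second fix, a robots.txt that blocks all crawling is just two lines:

```
User-agent: *
Disallow: /
```

Keep in mind that this only affects crawlers that honor the protocol; as point 1 says, robots.txt is not an access control mechanism.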

You can also find robots.txt generators online and use them to produce a robots.txt that matches your requirements, then study the output.

I am still slowly finding my way in security, and these are just my own notes. Comments and corrections are welcome.
