The robots.txt standard was proposed by Martijn Koster in 1994, while he was working for Nexor, on a mailing list that was the main communication channel for web-related activities at the time. Koster suggested robots.txt after a badly behaved web crawler inadvertently caused a denial-of-service attack on his server.
Robots.txt is used to control the crawling of websites. If site owners want to give search engine robots instructions about crawling, they must place a text file called robots.txt in the website's root directory (e.g. www.example.com/robots.txt). You only need a robots.txt file if your site has content that you don't want search engines to index.
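As an illustration of the "root directory" rule, the short Python sketch below derives the robots.txt location from any page URL on a site. The robots_url helper and the page URL are purely illustrative, not part of any particular crawler.

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Crawlers always look for robots.txt at the root of the host,
    # regardless of which page they intend to fetch.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post.html"))
# -> https://www.example.com/robots.txt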
If a robot wants to visit a website, it first checks for robots.txt and looks at the allow and disallow rules that apply to it. For example:
User-agent: *
Disallow: /
The User-agent: * line means this section applies to all robots. The Disallow: / line tells the robot that it should not visit any pages on the site.
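To see how a well-behaved crawler evaluates these rules, the sketch below uses Python's standard-library urllib.robotparser and feeds it the two lines above directly, instead of fetching a live file. The crawler name MyCrawler is just a placeholder.

from urllib.robotparser import RobotFileParser

# The two rules above, exactly as a crawler would receive them.
rules = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

# Every path is disallowed for every robot, so this prints False.
print(parser.can_fetch("MyCrawler", "https://www.example.com/any-page.html"))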
To block the entire site from Googlebot:
User-agent: Googlebot
Disallow: /
To block a specific page:
User-agent: Googlebot
Disallow: /private-file.html
To remove a directory and everything in it from Google search:
User-agent: Googlebot
Disallow: /junk-directory/
To block a specific image from Google Images:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
To block all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
To block all files of a specific file type (for example, all .gif files):
User-agent: Googlebot
Disallow: /*.gif$
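Here the * wildcard matches any sequence of characters and the trailing $ anchors the rule to the end of the URL, rather than treating the path as a plain prefix. As a rough sketch of how such a pattern could be interpreted (disallow_pattern_to_regex is a hypothetical helper, not a standard API), a crawler might translate it into a regular expression:

import re

def disallow_pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$'
    # anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

rule = disallow_pattern_to_regex("/*.gif$")
print(bool(rule.match("/images/banner.gif")))      # True: path ends in .gif, so it is blocked
print(bool(rule.match("/images/banner.gif?v=2")))  # False: the URL does not end in .gif

This ignores the rest of the matching rules (for example, precedence between Allow and Disallow), but it shows why /*.gif$ blocks any URL ending in .gif.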
Robots.txt plays an important role in fixing crawling errors on websites. Keep in mind, though, that malware robots can ignore your /robots.txt: robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.