Robots.txt

A web crawler, also called a web spider, collects information about sites and passes it back to the service that launched it for indexing. The resulting index can be used in a variety of ways, most commonly by search engines. But can a site tell these spiders what may be indexed and what may not? Can it disallow certain parts of the site, or block a specific agent? The answer to these questions is yes. How? Through a file called robots.txt placed at the root of the site. Spiders crawling a site look for this file and are expected to act accordingly. If the file disallows a spider by its user-agent name, that spider does not collect information from the site. The file can also restrict specific user agents to only part of the site. For detailed information on robots.txt, take a look at the FAQ section of robotstxt.org. Google offers webmaster tools that give site owners and administrators details such as when Google's robots indexed the site, along with many other statistics and reports.
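To make the rules concrete, here is a minimal sketch in Python, using the standard library's urllib.robotparser, of how a well-behaved spider might read a robots.txt file before crawling. The file contents, the agent names "BadBot" and "GoodBot", and the example.com URLs are all made up for illustration; they are assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the made-up agent "BadBot" entirely,
# and keeps every other agent out of /private/ only.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# An agent that is disallowed by name may not fetch anything.
blocked_agent = parser.can_fetch("BadBot", "http://example.com/index.html")

# Any other agent falls under the "*" rules: public pages are allowed,
# pages under /private/ are not.
public_page = parser.can_fetch("GoodBot", "http://example.com/index.html")
private_page = parser.can_fetch("GoodBot", "http://example.com/private/data.html")

print(blocked_agent, public_page, private_page)  # prints: False True False
```

A real spider would fetch the file from the site first (RobotFileParser also provides set_url and read for that). Keep in mind that robots.txt is only a request: it is up to each spider to honor it.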

Note: If you are new to the terms web crawler, spider, and robot, take a look at my posts “Web Crawler” and “Spider Simulator”.





This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.