Robots.txt
A web crawler, or web spider, collects information about sites and passes it back to the site that spawned it for indexing. The index can be used in a variety of ways, most commonly by search engines. But can a site tell these spiders what may be indexed and what may not? Can it disallow certain parts of the site, or block a specific agent? The answer to these questions is yes. How? Through a file called robots.txt placed in the site's root directory. Well-behaved spiders crawling a site look for this file and act accordingly. If the file disallows a spider by name, that spider does not collect information from the site. It is also possible to allow a specific user agent to traverse only part of the site. For detailed information on robots.txt, take a look at the FAQ section of robotstxt.org. Google offers webmaster tools for site owners and administrators that show when Google's robots indexed your site, along with many other statistics and reports.
Note: If you are new to the terms web crawler, spider, and robot, take a look at my posts “Web Crawler” and “Spider Simulator”.
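
To make the rules described above a little more concrete, here is a minimal sketch in Python of how a spider might consult a robots.txt file before fetching a page. It uses the standard library's urllib.robotparser module. The agent names (BadBot, GoodBot), the paths, and the file contents are made up for illustration; a real spider would download the file from the site instead of reading it from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only:
# BadBot is blocked from the whole site, every other agent
# is kept out of /private/ but may crawl everything else.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A specific agent blocked from the entire site.
print(parser.can_fetch("BadBot", "/index.html"))          # False

# Any other agent: partial traversal is allowed.
print(parser.can_fetch("GoodBot", "/index.html"))         # True
print(parser.can_fetch("GoodBot", "/private/data.html"))  # False
```

Keep in mind that the file is only a request. A polite spider performs a check like this on its own; nothing in robots.txt itself stops a spider that chooses to ignore it.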









