
A Standard for Robot Exclusion

Table of contents:

• Status of this document
• Introduction
• Method
• Format
• Examples
• Example Code
• Author's Address

Status of this document

This document represents a consensus on 30 June 1994 on the robots mailing list (robots-
request@nexor.co.uk), between the majority of robot authors and other people with an
interest in robots. It has also been open for discussion on the Technical World Wide Web
mailing list (www-talk@info.cern.ch). This document is based on a previous working
draft under the same title.

It is not an official standard backed by a standards body, or owned by any commercial
organisation. It is not enforced by anybody, and there is no guarantee that all current and
future robots will use it. Consider it a common facility the majority of robot authors offer
the WWW community to protect WWW servers against unwanted accesses by their
robots.

The latest version of this document can be found on
http://www.robotstxt.org/wc/robots.html.

Introduction

WWW Robots (also called wanderers or spiders) are programs that traverse many pages
in the World Wide Web by recursively retrieving linked pages. For more information see
the robots page.

In 1993 and 1994 there were occasions where robots visited WWW servers where they
weren't welcome, for various reasons. Sometimes these reasons were robot-specific, e.g.
certain robots swamped servers with rapid-fire requests, or retrieved the same files
repeatedly. In other situations robots traversed parts of WWW servers that weren't
suitable, e.g. very deep virtual trees, duplicated information, temporary information, or
cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to
indicate to robots which parts of their server should not be accessed. This standard
addresses this need with an operational solution.

The Method

The method used to exclude robots from a server is to create a file on the server which
specifies an access policy for robots. This file must be accessible via HTTP on the local
URL "/robots.txt". The contents of this file are specified below.

This approach was chosen because it can be easily implemented on any existing WWW
server, and a robot can find the access policy with only a single document retrieval.
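
As an illustration of that single retrieval (this is not part of the standard; the host name
and the use of Python's urllib are assumptions made only for the sketch), a robot could
fetch the policy like this:

    # Illustrative sketch only: fetch a host's "/robots.txt" with one HTTP request.
    # The host name is a placeholder; urllib is just one convenient client library.
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    def fetch_robots_txt(host):
        """Return the text of http://<host>/robots.txt, or None if unavailable."""
        try:
            with urlopen("http://%s/robots.txt" % host) as response:
                return response.read().decode("utf-8", errors="replace")
        except (HTTPError, URLError):
            # A missing or unreachable file is treated as "no policy" by most robots.
            return None

    policy = fetch_robots_txt("www.example.com")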

A possible drawback of this single-file approach is that only a server administrator can
maintain such a list, not the individual document maintainers on the server. This can be
resolved by a local process to construct the single file from a number of others, but if, or
how, this is done is outside of the scope of this document.

The choice of the URL was motivated by several criteria:

• The filename should fit in file naming restrictions of all common operating
systems.
• The filename extension should not require extra server configuration.
• The filename should indicate the purpose of the file and be easy to remember.
• The likelihood of a clash with existing files should be minimal.

The Format

The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated
by CR,CR/NL, or NL). Each record contains lines of the form
"<field>:<optionalspace><value><optionalspace>". The field name is case
insensitive.

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character
is used to indicate that preceding space (if any) and the remainder of the line up to the
line termination are discarded. Lines containing only a comment are discarded completely,
and therefore do not indicate a record boundary.

The record starts with one or more User-agent lines, followed by one or more Disallow
lines, as detailed below. Unrecognised headers are ignored.
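
As a rough sketch of the syntax just described (this is not the standard's reference code,
and a real robot would be more tolerant), the comment, blank-line, and "<field>:<value>"
rules could be handled along these lines:

    # Minimal sketch of the record syntax described above (not reference code).
    def parse_records(text):
        """Split robots.txt text into records: lists of (field, value) pairs."""
        records, current = [], []
        for raw_line in text.splitlines():
            stripped = raw_line.strip()
            if stripped.startswith("#"):
                continue  # comment-only line: discarded, does NOT end the record
            if not stripped:
                if current:          # a blank line terminates the current record
                    records.append(current)
                    current = []
                continue
            line = raw_line.split("#", 1)[0]   # drop a trailing comment, if any
            if ":" not in line:
                continue                       # not a "<field>:<value>" line; ignore
            field, value = line.split(":", 1)
            current.append((field.strip().lower(), value.strip()))
        if current:
            records.append(current)
        return records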

User-agent
The value of this field is the name of the robot the record is describing access
policy for.

If more than one User-agent field is present, the record describes an identical
access policy for more than one robot. At least one field needs to be present per
record.

The robot should be liberal in interpreting this field. A case-insensitive substring
match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that
has not matched any of the other records. It is not allowed to have multiple such
records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be
a full path, or a partial path; any URL that starts with this value will not be
retrieved. For example, Disallow: /help disallows both /help.html and
/help/index.html, whereas Disallow: /help/ would disallow
/help/index.html but allow /help.html.

An empty value indicates that all URLs can be retrieved. At least one Disallow
field needs to be present in a record.
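
The prefix rule for Disallow can be sketched in a couple of lines; this assumes the path has
already been extracted from the URL and simply reproduces the /help example above:

    # Sketch of the Disallow prefix rule (assumes the URL path is already isolated).
    def is_disallowed(path, disallow_values):
        """True if the path starts with any non-empty Disallow value."""
        return any(value and path.startswith(value) for value in disallow_values)

    # From the example above: "Disallow: /help" blocks both paths,
    # while "Disallow: /help/" blocks only the second one.
    print(is_disallowed("/help.html",       ["/help"]))   # True
    print(is_disallowed("/help/index.html", ["/help"]))   # True
    print(is_disallowed("/help.html",       ["/help/"]))  # False
    print(is_disallowed("/help/index.html", ["/help/"]))  # True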

The presence of an empty "/robots.txt" file has no explicit associated semantics; it will
be treated as if it were not present, i.e. all robots will consider themselves welcome.
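
Putting the pieces together, a file following this format might look like the sample below;
the paths and robot names are invented for illustration. Python's standard
urllib.robotparser module, used here only to check the sample, applies essentially the
semantics described above:

    # Hypothetical "/robots.txt" checked with Python's standard urllib.robotparser.
    from urllib.robotparser import RobotFileParser

    SAMPLE = """\
    # robots.txt for http://www.example.com/  (paths and names are invented)
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/

    User-agent: badbot
    Disallow: /
    """

    parser = RobotFileParser()
    parser.parse(SAMPLE.splitlines())
    print(parser.can_fetch("goodbot", "http://www.example.com/index.html"))  # True
    print(parser.can_fetch("goodbot", "http://www.example.com/tmp/x.html"))  # False
    print(parser.can_fetch("badbot",  "http://www.example.com/index.html"))  # False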

http://www.robotstxt.org/orig.html

firewalls

In construction, a firewall is a non-flammable wall that prevents fires from spreading
throughout a building. Homes, for example, may have a firewall between the garage and
the rest of the house to prevent garage fires from threatening other rooms.

With the rapid popularization of the internet, the term firewall is more commonly used in
computer networking. Like the firewalls used in homes and buildings, computer firewalls
act as a barrier between computers on a network. For companies with a computer
network or for individuals with a permanent connection to the internet (such as through
DSL or cable), a firewall is critical. Without a firewall, intruders on the network would
likely be able to destroy, tamper with or gain access to the files on your computer.

Firewalls can come in the form of hardware or software. Without getting into the
complex details of how firewalls work, suffice it to say that they function with a set of
filters that are constantly monitoring traffic on the network. Whenever a packet of
information triggers one of the filters, the firewall prevents it from passing through in an
attempt to prevent damage. Of course, firewalls sometimes block wanted traffic, and
through a continual process of refinement, the filters can be customized to improve their
efficacy.

Many computer users who access the internet via a broadband router may already be
benefitting from a firewall. The router itself may be configured to serve as a firewall; any
nefarious attacks from the network are halted at the router, thereby sparing the computer
any ill effects. Such a hardware firewall can be further bolstered with a secondary line
of defense in the form of a software firewall; you can never be too safe when using the
internet!
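
As a toy illustration only (the addresses, ports, and rules below are invented, and real
firewalls are far more sophisticated), a rule-based packet filter of the kind described can
be pictured like this:

    # Toy illustration of rule-based packet filtering; rules and addresses are invented.
    BLOCKED_PORTS = {23, 135, 445}          # e.g. telnet and some Windows services
    BLOCKED_SOURCES = {"203.0.113.7"}       # a hypothetical known-bad address

    def allow_packet(source_ip, dest_port):
        """Return True if the packet passes every filter, False if it is dropped."""
        if source_ip in BLOCKED_SOURCES:
            return False
        if dest_port in BLOCKED_PORTS:
            return False
        return True

    print(allow_packet("198.51.100.2", 80))    # True: ordinary web traffic
    print(allow_packet("203.0.113.7", 80))     # False: blocked source address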

http://www.wisegeek.com/what-are-firewalls.htm

If you have been using the Internet for any length of time, and especially if you work at a
larger company and browse the Web while you are at work, you have probably heard the
term firewall used. For example, you often hear people in companies say things like, "I
can't use that site because they won't let it through the firewall."

If you have a fast Internet connection into your home (either a DSL connection or a cable
modem), you may have found yourself hearing about firewalls for your home network as
well. It turns out that a small home network has many of the same security issues that a
large corporate network does. You can use a firewall to protect your home network and
family from offensive Web sites and potential hackers.


Basically, a firewall is a barrier to keep destructive forces away from your property. In
fact, that's why it's called a firewall. Its job is similar to a physical firewall that keeps a
fire from spreading from one area to the next. As you read through this article, you will
learn more about firewalls, how they work and what kinds of threats they can protect you
from.

http://www.howstuffworks.com/firewall.htm

web robots

A Web robot is a program that automatically and recursively traverses a Web site retrieving
document content and information. The most common types of Web robots are the search
engine spiders. These robots visit Web sites and follow the links to add more information
to the search engine database.

Web robots often go by different names. You may hear them called:

• spiders
• bots
• crawlers

All these terms mean the same thing, but robot is the clearest: it does not imply that the
program is wandering through the Web site on its own; rather, it is programmed to move
systematically through a site.

Web Robots Follow Rules

While it is possible to write a robot that ignores the rules, most Web robots are written to
obey certain rules set down in a specific text file on your site. This file is the robots.txt
file. It is usually found in the root of your Web server and acts as the gateway for the
robots. It tells them which areas of the site they can and cannot traverse.

Keep in mind that while most Web robots follow the rules that you lay out in your
robots.txt file, some do not. If you have sensitive information, you should control access
to it with a password or on an intranet rather than relying on robots not to spider it.
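
For example, a well-behaved robot typically consults the site's robots.txt before requesting
each page. The sketch below assumes an invented robot name and start URL and uses Python's
standard urllib.robotparser; it is an illustration, not a complete crawler:

    # Sketch of a robot honouring robots.txt before each request (names are invented).
    from urllib.robotparser import RobotFileParser
    from urllib.request import urlopen

    ROBOT_NAME = "examplebot"                     # hypothetical robot name
    SITE = "http://www.example.com"

    rules = RobotFileParser(SITE + "/robots.txt")
    rules.read()                                  # fetch and parse the file once

    def polite_fetch(url):
        """Fetch url only if the site's robots.txt permits it."""
        if not rules.can_fetch(ROBOT_NAME, url):
            return None                           # respect the exclusion rules
        with urlopen(url) as response:
            return response.read()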

How are Web Robots Used

The most common use for Web robots is to index a site for a search engine. But robots
can be used for other purposes as well. Some of the more common uses are:

• Link validation - Robots can follow all the links on a site or a page, testing them
to make sure they return a valid page code. The advantage of doing this
programmatically is obvious: the robot can visit all the links on a page in a minute
or two and report the results far more quickly than a human could manually (a
small sketch follows this list).
• HTML validation - Similar to link validation, robots can be sent to various pages
on your site to evaluate the HTML coding.
• Change monitoring - There are services available on the Web that will tell you
when a Web page has changed. These services work by sending a robot to the
page periodically to check whether the content has changed. When it is different,
the robot files a report.
• Web site mirroring - Similar to the change monitoring robots, these robots
evaluate a site, and when there is a change, the robot will transfer the changed
information to the mirror site location.
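
As promised above, here is a rough sketch of a link-validation robot; the URLs are
placeholders and the error handling is deliberately minimal:

    # Rough sketch of a link-validation robot (URLs are placeholders).
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    def check_links(urls):
        """Return a report mapping each URL to its HTTP status or an error string."""
        report = {}
        for url in urls:
            try:
                with urlopen(url) as response:
                    report[url] = response.status        # e.g. 200 for a valid page
            except HTTPError as err:
                report[url] = err.code                   # e.g. 404 for a broken link
            except URLError as err:
                report[url] = "unreachable: %s" % err.reason
        return report

    print(check_links(["http://www.example.com/", "http://www.example.com/missing"]))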

http://webdesign.about.com/od/promotion/a/aa020705.htm

Web robots are software programs that automatically traverse the hyperlink structure of
the World Wide Web in order to locate and retrieve information. There are many reasons
why it is important to identify visits by Web robots and distinguish them from other
users. First of all, e-commerce retailers are particularly concerned about the unauthorized
deployment of robots for gathering business intelligence at their Web sites. In addition,
Web robots tend to consume considerable network bandwidth at the expense of other
users. Sessions due to Web robots also make it more difficult to perform clickstream
analysis effectively on the Web data. Conventional techniques for detecting Web robots
are often based on identifying the IP address and user agent of the Web clients. While
these techniques are applicable to many well-known robots, they may not be sufficient to
detect camouflaged and previously unknown robots.
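
A conventional check of that kind can be sketched as a simple user-agent test; the
substrings below are assumptions made for illustration, and, as noted, camouflaged robots
that fake their user agent will slip past it:

    # Illustrative user-agent heuristic only; the known substrings are assumptions,
    # and camouflaged robots that fake their user agent will not be caught this way.
    KNOWN_ROBOT_SUBSTRINGS = ("bot", "crawler", "spider")

    def looks_like_robot(user_agent):
        """Heuristic: does the user-agent string name a known kind of robot?"""
        agent = user_agent.lower()
        return any(token in agent for token in KNOWN_ROBOT_SUBSTRINGS)

    print(looks_like_robot("ExampleBot/2.1 (+http://www.example.com/bot.html)"))  # True
    print(looks_like_robot("Mozilla/5.0 (Windows NT 10.0)"))                      # False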
