
 
 9.1 Robot Exclusion
 ===================
 
 It is extremely easy to make Wget wander aimlessly around a web site,
 sucking up all the available data in the process.  ‘wget -r SITE’, and
 you’re set.  Great?  Not for the server admin.
 
    As long as Wget is only retrieving static pages, and doing it at a
 reasonable rate (see the ‘--wait’ option), there’s not much of a
 problem.  The trouble is that Wget can’t tell the difference between the
 smallest static page and the most demanding CGI.  A site I know has a
 section handled by a CGI Perl script that converts Info files to HTML on
 the fly.  The script is slow, but works well enough for human users
 viewing an occasional Info file.  However, when someone’s recursive Wget
 download stumbles upon the index page that links to all the Info files
 through the script, the system is brought to its knees without providing
 anything useful to the user.  (This task of converting Info files could
 be done locally; Info documentation for all the GNU software installed
 on a system is available through the ‘info’ command.)
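 
    If you do need to mirror such a site, it is polite to throttle the
 recursion.  A sketch, with purely illustrative values:
 
      wget -r --wait=2 --limit-rate=50k http://www.example.com/
 
    Here ‘--wait=2’ pauses two seconds between retrievals and
 ‘--limit-rate=50k’ caps the bandwidth Wget uses; ‘--random-wait’ can
 vary the pause further.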
 
    To avoid this kind of accident, as well as to preserve privacy for
 documents that need to be protected from well-behaved robots, the
 concept of “robot exclusion” was invented.  The idea is that the server
 administrators and document authors can specify which portions of the
 site they wish to protect from robots and which they will allow robots
 to access.
 
    The most popular mechanism, and the de facto standard supported by
 all the major robots, is the “Robots Exclusion Standard” (RES) written
 by Martijn Koster et al.  in 1994.  It specifies the format of a text
 file containing directives that instruct the robots which URL paths to
 avoid.  To be found by the robots, the specifications must be placed in
 ‘/robots.txt’ in the server root, which the robots are expected to
 download and parse.
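 
    For illustration, a minimal ‘robots.txt’ in that format might look
 like this (the paths are made up for the example):
 
      User-agent: *
      Disallow: /cgi-bin/
      Disallow: /private/
 
    The ‘User-agent’ line names the robots the record applies to (‘*’
 means all of them), and each ‘Disallow’ line gives a URL path prefix
 that those robots should not retrieve.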
 
    Although Wget is not a web robot in the strictest sense of the word,
 it can download large parts of a site without the user having to
 request each individual page.  Because of that, Wget honors RES when
 downloading recursively.  For instance, when you issue:
 
      wget -r http://www.example.com/
 
    First the index of ‘www.example.com’ will be downloaded.  If Wget
 finds that it wants to download more documents from that server, it will
 request ‘http://www.example.com/robots.txt’ and, if found, use it for
 further downloads.  ‘robots.txt’ is loaded only once per server.
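 
    This also means a server administrator can single out Wget in the
 exclusion file.  A sketch, assuming Wget’s usual ‘Wget/VERSION’
 ‘User-Agent’ identification:
 
      User-agent: Wget
      Disallow: /
 
    A record like this asks recursive Wget runs to stay away from the
 whole site, while other robots are governed by their own records.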
 
    Until version 1.8, Wget supported the first version of the standard,
 written by Martijn Koster in 1994 and available at
 <http://www.robotstxt.org/orig.html>.  As of version 1.8, Wget has
 supported the additional directives specified in the internet draft
 ‘<draft-koster-robots-00.txt>’ titled “A Method for Web Robots Control”.
 The draft, which has, as far as I know, never made it to an RFC, is
 available at <http://www.robotstxt.org/norobots-rfc.txt>.
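 
    One directive that draft adds is ‘Allow’, which lets an otherwise
 disallowed area have exceptions.  A sketch with made-up paths (the
 draft matches the lines in order, first match wins, so the ‘Allow’
 line comes first):
 
      User-agent: *
      Allow: /docs/public/
      Disallow: /docs/
 
    A robot honoring the draft may then retrieve URLs under
 ‘/docs/public/’ while skipping the rest of ‘/docs/’.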
 
    This manual no longer includes the text of the Robot Exclusion
 Standard.
 
    The second, lesser-known mechanism enables the author of an
 individual document to specify whether they want the links in the file
 to be followed by a robot.  This is achieved using the ‘META’ tag, like
 this:
 
      <meta name="robots" content="nofollow">
 
    This is explained in some detail at
 <http://www.robotstxt.org/meta.html>.  Wget supports this method of
 robot exclusion in addition to the usual ‘/robots.txt’ exclusion.
 
    If you know what you are doing and really, really wish to turn off
 robot exclusion, set the ‘robots’ variable to ‘off’ in your ‘.wgetrc’.
 You can achieve the same effect from the command line using the ‘-e’
 switch, e.g.  ‘wget -e robots=off URL...’.
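 
    In ‘.wgetrc’ syntax that is a single line (‘#’ starts a comment):
 
      # Ignore /robots.txt and the robots META tag.
      robots = off
 
    Combined with recursion, that becomes, for example,
 ‘wget -e robots=off -r http://www.example.com/’.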