wget: Directory-Based Limits

 
 4.3 Directory-Based Limits
 ==========================
 
 Regardless of other link-following facilities, it is often useful to
 restrict which files are retrieved based on the directories they are
 stored in.  There can be many reasons for this: the home pages may be
 organized in a reasonable directory structure, or some directories may
 contain useless information, e.g. ‘/cgi-bin’ or ‘/dev’ directories.
 
    Wget offers three different options to deal with this requirement.
 Each option description lists a short name, a long name, and the
 equivalent command in ‘.wgetrc’.
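
    As a rough sketch of the ‘.wgetrc’ form, the three settings described
 in this section could be written to a startup file as below.  The
 temporary file, the ‘/pub’ paths and the URL in the comment are
 placeholders chosen for illustration; pointing Wget at the file through
 the ‘WGETRC’ environment variable is one possible way to try it without
 touching your real ‘.wgetrc’.

```shell
# Write a throwaway startup file; the directory names are placeholders.
rc=$(mktemp)
cat > "$rc" <<'EOF'
# Follow only links under /pub ...
include_directories = /pub
# ... but skip this subtree ...
exclude_directories = /pub/worthless
# ... and never ascend above the starting directory.
no_parent = on
EOF

# To try it against a real server (hypothetical URL), one could run:
#   WGETRC="$rc" wget -r http://host/pub/
cat "$rc"
```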
 
 ‘-I LIST’
 ‘--include-directories=LIST’
 ‘include_directories = LIST’
      The ‘-I’ option accepts a comma-separated list of directories to
      include in the retrieval.  Any other directories will simply be
      ignored.  The directories are absolute paths.
 
      So, if you wish to download from ‘http://host/people/bozo/’
      following only links to bozo’s colleagues in the ‘/people’
      directory and the bogus scripts in ‘/cgi-bin’, you can specify:
 
           wget -I /people,/cgi-bin http://host/people/bozo/
 
 ‘-X LIST’
 ‘--exclude-directories=LIST’
 ‘exclude_directories = LIST’
      The ‘-X’ option is exactly the reverse of ‘-I’: it takes a list of
      directories _excluded_ from the download.  E.g., if you do not want
      Wget to download things from the ‘/cgi-bin’ directory, specify ‘-X
      /cgi-bin’ on the command line.
 
      As with ‘-A’/‘-R’, these two options can be combined to fine-tune
      which subdirectories are downloaded.  E.g., if you want to download
      all the files from the ‘/pub’ hierarchy except for
      ‘/pub/worthless’, specify ‘-I/pub -X/pub/worthless’.
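
      The combination above can be wrapped in a small shell function.
      This is only a sketch: the function name ‘fetch_pub’ and the host
      in the usage comment are made up for illustration, and ‘-r’ is
      added on the assumption that a recursive download is wanted.

```shell
# Hypothetical helper: recursively fetch the /pub hierarchy of the given
# URL while skipping /pub/worthless.
fetch_pub() {
    # -I limits the retrieval to /pub; -X then carves out one subtree.
    wget -r -I /pub -X /pub/worthless "$1"
}

# Example call (placeholder host, left commented out):
#   fetch_pub http://host/pub/
```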
 
 ‘-np’
 ‘--no-parent’
 ‘no_parent = on’
      The simplest, and often very useful, way of limiting directories is
      to disallow retrieval of links that refer to the hierarchy “above”
      the beginning directory, i.e. to disallow ascent to the parent
      directory or directories.
 
      The ‘--no-parent’ option (short ‘-np’) is useful in this case.
      Using it guarantees that you will never leave the existing
      hierarchy.  Suppose you invoke Wget with:
 
           wget -r --no-parent http://somehost/~luzer/my-archive/
 
      You may rest assured that none of the references to
      ‘/~his-girls-homepage/’ or ‘/~luzer/all-my-mpegs/’ will be
      followed.  Only the archive you are interested in will be
      downloaded.  Essentially, ‘--no-parent’ is similar to
      ‘-I/~luzer/my-archive’, only it handles redirections in a more
      intelligent fashion.
 
      *Note* that, for HTTP (and HTTPS), the trailing slash is very
      important to ‘--no-parent’.  HTTP has no concept of a
      “directory”—Wget relies on you to indicate what’s a directory and
      what isn’t.  In ‘http://foo/bar/’, Wget will consider ‘bar’ to be a
      directory, while in ‘http://foo/bar’ (no trailing slash), ‘bar’
      will be considered a filename (so ‘--no-parent’ would be
      meaningless, as its parent is ‘/’).
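
      The trailing-slash rule can be illustrated with a few lines of
      shell.  The function below is a hypothetical helper, not Wget’s
      actual code: it simply treats everything up to and including the
      last slash of the URL path as the starting directory, which is the
      behavior described above.  It assumes a plain ‘scheme://host/path’
      URL and ignores query strings and fragments.

```shell
# Illustration only: mimics the rule described above, not Wget's code.
# Prints the directory part of a URL's path, i.e. everything up to and
# including the last slash.
start_dir() {
    url=$1
    path=${url#*://}               # drop the scheme: "foo/bar/"
    case $path in
        */*) path=/${path#*/} ;;   # drop the host: "/bar/"
        *)   path=/ ;;             # no path given at all
    esac
    printf '%s\n' "${path%/*}/"    # cut after the last slash
}

start_dir http://foo/bar/   # prints /bar/
start_dir http://foo/bar    # prints /
```

      With the trailing slash the starting directory is ‘/bar/’, so
      ‘--no-parent’ has something to confine the retrieval to; without
      it the starting directory is ‘/’, and nothing is excluded.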