wget: Types of Files

 
 4.2 Types of Files
 ==================
 
 When downloading material from the web, you will often want to restrict
 the retrieval to only certain file types.  For example, if you are
 interested in downloading GIFs, you will not be overjoyed to get loads
 of PostScript documents, and vice versa.
 
    Wget offers two options to deal with this problem.  Each option
 description lists a short name, a long name, and the equivalent command
 in ‘.wgetrc’.
 
 ‘-A ACCLIST’
 ‘--accept ACCLIST’
 ‘accept = ACCLIST’
 ‘--accept-regex URLREGEX’
 ‘accept-regex = URLREGEX’
      The argument to the ‘--accept’ option is a list of file suffixes or
      patterns that Wget will download during recursive retrieval.  A
      suffix is the ending part of a file name, and consists of “normal”
      letters, e.g.  ‘gif’ or ‘.jpg’.  A matching pattern contains
      shell-like wildcards, e.g.  ‘books*’ or ‘zelazny*196[0-9]*’.
 
      So, specifying ‘wget -A gif,jpg’ will make Wget download only the
      files ending with ‘gif’ or ‘jpg’, i.e.  GIFs and JPEGs.  On the
      other hand, ‘wget -A "zelazny*196[0-9]*"’ will download only files
      beginning with ‘zelazny’ and containing numbers from 1960 to 1969
      anywhere within.  Look up the manual of your shell for a
      description of how pattern matching works.
 
      Of course, any number of suffixes and patterns can be combined into
      a comma-separated list, and given as an argument to ‘-A’.
 
      The argument to the ‘--accept-regex’ option is a regular expression
      which is matched against the complete URL.
 
 ‘-R REJLIST’
 ‘--reject REJLIST’
 ‘reject = REJLIST’
 ‘--reject-regex URLREGEX’
 ‘reject-regex = URLREGEX’
      The ‘--reject’ option works the same way as ‘--accept’, only its
      logic is the reverse; Wget will download all files _except_ the
      ones matching the suffixes (or patterns) in the list.
 
      So, if you want to download a whole page except for the cumbersome
      MPEGs and .AU files, you can use ‘wget -R mpg,mpeg,au’.
      Analogously, to download all files except the ones beginning with
      ‘bjork’, use ‘wget -R "bjork*"’.  The quotes are to prevent
      expansion by the shell.
 
    The argument to the ‘--reject-regex’ option is a regular expression
 which is matched against the complete URL.
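
 As a rough illustration of what matching against the complete URL
 means, the Python sketch below models the two regex options with
 ‘re.search’.  The function name ‘url_allowed’ and the example URLs are
 hypothetical, and Python's regex syntax differs in places from the
 regex flavors Wget supports, so this is a model of the idea under those
 assumptions, not Wget's implementation.

```python
import re

def url_allowed(url, accept_regex=None, reject_regex=None):
    """Hypothetical model of --accept-regex / --reject-regex."""
    # Assumption: the pattern may match anywhere in the URL (unanchored).
    if accept_regex is not None and not re.search(accept_regex, url):
        return False
    if reject_regex is not None and re.search(reject_regex, url):
        return False
    return True

# Illustrative URLs only.
pattern = r"/art/.*\.jpg$"
print(url_allowed("https://example.com/art/cover.jpg", accept_regex=pattern))         # True
# The query string is part of the complete URL, so "$" no longer follows ".jpg".
print(url_allowed("https://example.com/art/cover.jpg?size=2", accept_regex=pattern))  # False
print(url_allowed("https://example.com/tmp/cover.jpg", reject_regex=r"/tmp/"))        # False
```

 Because the whole URL is examined, directory components and query
 strings can influence the result, unlike with the suffix and pattern
 lists.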
 
 The ‘-A’ and ‘-R’ options may be combined to achieve even better
 fine-tuning of which files to retrieve.  E.g.  ‘wget -A "*zelazny*" -R
 .ps’ will download all the files having ‘zelazny’ as a part of their
 name, but _not_ the PostScript files.
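
 The way suffixes, wildcard patterns, and the ‘-A’/‘-R’ combination
 interact can be sketched in Python as follows.  This is a simplified
 model under stated assumptions, not Wget's source: the helper names
 ‘in_list’ and ‘acceptable’ are hypothetical, Python's ‘fnmatch’ stands
 in for shell-style matching, and the special treatment of HTML files is
 left out.

```python
from fnmatch import fnmatchcase

WILDCARDS = set("*?[]")

def in_list(filename, patterns):
    """True if filename matches any entry of an accept/reject list."""
    for pat in patterns:
        if WILDCARDS & set(pat):
            # Entry contains shell-like wildcards: match the whole name.
            if fnmatchcase(filename, pat):
                return True
        elif filename.endswith(pat):
            # Plain entry: treated as a suffix of the file name.
            return True
    return False

def acceptable(filename, accept=(), reject=()):
    """A name must match the accept list (if any) and no reject entry."""
    if accept and not in_list(filename, accept):
        return False
    if reject and in_list(filename, reject):
        return False
    return True

# Models: wget -A "*zelazny*" -R .ps
print(acceptable("zelazny-lord.txt", accept=["*zelazny*"], reject=[".ps"]))  # True
print(acceptable("zelazny-lord.ps", accept=["*zelazny*"], reject=[".ps"]))   # False
# Models: wget -A gif,jpg
print(acceptable("photo.jpg", accept=["gif", "jpg"]))  # True
print(acceptable("paper.ps", accept=["gif", "jpg"]))   # False
```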
 
    Note that these two options do not affect the downloading of HTML
 files (as determined by a ‘.htm’ or ‘.html’ filename suffix).  This
 behavior may not be desirable for all users, and may be changed for
 future versions of Wget.
 
    Note, too, that query strings (strings at the end of a URL beginning
 with a question mark, ‘?’) are not included as part of the filename for
 accept/reject rules, even though these will actually contribute to the
 name chosen for the local file.  It is expected that a future version of
 Wget will provide an option to allow matching against query strings.
 
    Finally, it’s worth noting that the accept/reject lists are matched
 _twice_ against downloaded files: once against the URL’s filename
 portion, to determine if the file should be downloaded in the first
 place; then, after it has been accepted and successfully downloaded, the
 local file’s name is also checked against the accept/reject lists to see
 if it should be removed.  The rationale was that, since ‘.htm’ and
 ‘.html’ files are always downloaded regardless of accept/reject rules,
 they should be removed _after_ being downloaded and scanned for links,
 if they did match the accept/reject lists.  However, this can lead to
 unexpected results, since the local filenames can differ from the
 original URL filenames in the following ways, all of which can change
 whether an accept/reject rule matches:
 
    • If the local file already exists and ‘--no-directories’ was
      specified, a numeric suffix will be appended to the original name.
    • If ‘--adjust-extension’ was specified, the local filename might
      have ‘.html’ appended to it.  If Wget is invoked with ‘-E -A.php’,
      a filename such as ‘index.php’ will be accepted, but upon
      download will be named ‘index.php.html’, which no longer matches,
      and so the file will be deleted.
    • Query strings do not contribute to URL matching, but are included
      in local filenames, and so _do_ contribute to filename matching.
 
 This behavior, too, is considered less-than-desirable, and may change in
 a future version of Wget.
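
 The double matching described above can be modeled with a short Python
 sketch.  All function names are hypothetical, and several
 simplifications are assumed: accept entries are plain suffixes only,
 every download is treated as HTML for the purpose of
 ‘--adjust-extension’, and the numeric-suffix case is not modeled.  It
 is meant to show why a file can pass the first check and fail the
 second, not to reproduce Wget's code.

```python
import posixpath
from urllib.parse import urlsplit

def matches(name, accept):
    # Simplification: accept entries are plain suffixes (no wildcards).
    return any(name.endswith(suffix) for suffix in accept)

def url_filename(url):
    """Filename portion of the URL; the query string is not part of it."""
    return posixpath.basename(urlsplit(url).path)

def local_filename(url, adjust_extension=False):
    """Assumed local name: query string kept, '.html' optionally appended."""
    parts = urlsplit(url)
    name = posixpath.basename(parts.path)
    if parts.query:
        name += "?" + parts.query
    if adjust_extension and not name.endswith((".html", ".htm")):
        name += ".html"
    return name

def survives(url, accept, adjust_extension=False):
    """Pass 1 checks the URL filename; pass 2 checks the local filename."""
    if not matches(url_filename(url), accept):
        return False  # never downloaded
    # Downloaded; kept only if the local name still matches.
    return matches(local_filename(url, adjust_extension), accept)

# Models: wget -A.php
print(survives("https://example.com/index.php", [".php"]))                         # True
# Models: wget -E -A.php  (index.php becomes index.php.html and is deleted)
print(survives("https://example.com/index.php", [".php"], adjust_extension=True))  # False
# The query string changes the local name, so the second pass fails.
print(survives("https://example.com/index.php?id=3", [".php"]))                    # False
```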