 10.1 Deleting Files
 ===================
 
 One of the most common tasks that 'find' is used for is locating files
 that can be deleted.  This might include:
 
    * Files last modified more than 3 years ago which haven't been
      accessed for at least 2 years
    * Files belonging to a certain user
    * Temporary files which are no longer required
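
    For illustration only, each of these criteria corresponds to one or
 more 'find' tests; some hypothetical examples (the numbers, user name
 and pattern here are made up) might be:

      find /var/tmp/stuff -mtime +1095 -atime +730 -print
      find /var/tmp/stuff -user joe -print
      find /var/tmp/stuff -name '*.tmp' -print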
 
    This example concentrates on the actual deletion task rather than on
 sophisticated ways of locating the files that need to be deleted.  We'll
 assume that the files we want to delete are old files underneath
 '/var/tmp/stuff'.
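
    If you want to experiment safely while following along, you can
 create some suitably old test files first.  This sketch assumes GNU
 'touch', whose '-d' option accepts relative dates such as '100 days
 ago':

      mkdir -p /var/tmp/stuff
      touch -d '100 days ago' /var/tmp/stuff/A /var/tmp/stuff/B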
 
 10.1.1 The Traditional Way
 --------------------------
 
 The traditional way to delete files in '/var/tmp/stuff' that have not
 been modified in over 90 days would have been:
 
      find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \;
 
    The above command uses '-exec' to run the '/bin/rm' command to remove
 each file.  This approach works and in fact would have worked in Version
 7 Unix in 1979.  However, there are a number of problems with this
 approach.
 
    The most obvious problem with the approach above is that it causes
 'find' to fork every time it finds a file that needs to be deleted,
 and the child process then has to use the 'exec' system call to launch
 '/bin/rm'.  All this is quite inefficient.  If we are going to use
 '/bin/rm' to do this job, it is better to make it delete more than one
 file at a time.
 
    The most obvious way of doing this is to use the shell's command
 expansion feature:
 
      /bin/rm `find /var/tmp/stuff -mtime +90 -print`
    or you could use the more modern form:
      /bin/rm $(find /var/tmp/stuff -mtime +90 -print)
 
    The commands above are much more efficient than the first attempt.
 However, there is a problem with them.  The operating system imposes a
 maximum combined length on the argument list of a new command (this is
 the 'ARG_MAX' limit; the actual value varies between systems).  This
 means that while the command expansion technique will usually work, it
 will suddenly fail when there are lots of files to delete.  Since the
 task is to delete unwanted files, this is precisely the time we don't
 want things to go wrong.
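
    On most systems you can inspect the limit yourself; POSIX systems
 expose it through 'getconf':

      getconf ARG_MAX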
 
 10.1.2 Making Use of 'xargs'
 ----------------------------
 
 So, is there a way to be more efficient in the use of 'fork()' and
 'exec()' without running up against this limit?  Yes, we can be almost
 optimally efficient by making use of the 'xargs' command.  The 'xargs'
 command reads arguments from its standard input and builds them into
 command lines.  We can use it like this:
 
      find /var/tmp/stuff -mtime +90 -print | xargs /bin/rm
 
    For example, if the files found by 'find' are '/var/tmp/stuff/A',
 '/var/tmp/stuff/B' and '/var/tmp/stuff/C' then 'xargs' might issue the
 commands
 
      /bin/rm /var/tmp/stuff/A /var/tmp/stuff/B
      /bin/rm /var/tmp/stuff/C
 
    The above assumes that 'xargs' has a very small maximum command line
 length.  The real limit is much larger but the idea is that 'xargs' will
 run '/bin/rm' as many times as necessary to get the job done, given the
 limits on command line length.
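
    You can observe this splitting behaviour harmlessly by substituting
 'echo' for '/bin/rm' and artificially limiting the number of arguments
 per command line with the '-n' option:

      find /var/tmp/stuff -mtime +90 -print | xargs -n 2 echo /bin/rm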
 
    This usage of 'xargs' is pretty efficient, and the 'xargs' command is
 widely implemented (all modern versions of Unix offer it).  So far then,
 the news is all good.  However, there is bad news too.
 
 10.1.3 Unusual characters in filenames
 --------------------------------------
 
 Unix-like systems allow any characters to appear in file names with the
 exception of the ASCII NUL character and the slash.  Slashes can occur
 in path names (as the directory separator) but not in the names of
 actual directory entries.  This means that the list of files that
 'xargs' reads could in fact contain white space characters - spaces,
 tabs and newline characters.  Since, by default, 'xargs' assumes that the
 list of files it is reading uses white space as an argument separator,
 it cannot correctly handle the case where a filename actually includes
 white space.  This makes the default behaviour of 'xargs' almost useless
 for handling arbitrary data.
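
    A hypothetical session makes the failure concrete.  Suppose one of
 the old files is called 'old file' (with a space); the exact error
 text varies between systems, but with GNU tools you might see
 something like:

      $ find /var/tmp/stuff -mtime +90 -print | xargs /bin/rm
      /bin/rm: cannot remove '/var/tmp/stuff/old': No such file or directory
      /bin/rm: cannot remove 'file': No such file or directory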
 
    To solve this problem, GNU findutils introduced the '-print0' action
 for 'find'.  This uses the ASCII NUL character to separate the entries
 in the file list that it produces.  This is the ideal choice of
 separator since it is the only character that cannot appear within a
 path name.  The '-0' option to 'xargs' makes it assume that arguments
 are separated with ASCII NUL instead of white space.  It also turns off
 another misfeature in the default behaviour of 'xargs', which is that it
 pays attention to quote characters in its input.  Some versions of
 'xargs' also terminate when they see a lone '_' in the input, but GNU
 'xargs' no longer does that (since it has become an optional behaviour
 in the Unix standard).
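
    The quote misfeature is easy to trip over, too.  A file whose name
 contains an apostrophe (say, "don't-delete-me") will make GNU 'xargs'
 (without '-0') stop with an error along these lines:

      $ echo "don't-delete-me" | xargs echo /bin/rm
      xargs: unmatched single quote; by default quotes are special to
      xargs unless you use the -0 option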
 
    So, putting 'find -print0' together with 'xargs -0' we get this
 command:
 
      find /var/tmp/stuff -mtime +90 -print0 | xargs -0 /bin/rm
 
    The result is an efficient way of proceeding that correctly handles
 all the possible characters that could appear in the list of files to
 delete.  This is good news.  However, there is, as I'm sure you're
 expecting, also more bad news.  The problem is that this is not a
 portable construct; although other versions of Unix (notably BSD-derived
 ones) support '-print0', it's not universal.  So, is there a more
 universal mechanism?
 
 10.1.4 Going back to '-exec'
 ----------------------------
 
 There is indeed a more universal mechanism, which is a slight
 modification to the '-exec' action.  The normal '-exec' action assumes
 that the command to run is terminated with a semicolon (the semicolon
 normally has to be quoted in order to protect it from interpretation as
 the shell command separator).  The SVR4 edition of Unix introduced a
 slight variation, which involves terminating the command with '+'
 instead:
 
      find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
 
    The above use of '-exec' causes 'find' to build up a long command
 line and then issue it.  This can be less efficient than some uses of
 'xargs'; for example 'xargs' allows building up new command lines while
 the previous command is still executing, and allows specifying a number
 of commands to run in parallel.  However, the 'find ... -exec ... +'
 construct has the advantage of wide portability.  GNU findutils did not
 support '-exec ... +' until version 4.2.12; one of the reasons for this
 is that it already had the '-print0' action in any case.
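
    As an aside, the parallelism mentioned above is available in GNU
 'xargs' via the '-P' option, which runs up to the given number of
 commands at once (combined here with '-n' so that there are several
 command lines to run in parallel):

      find /var/tmp/stuff -mtime +90 -print0 | xargs -0 -n 100 -P 4 /bin/rm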
 
 10.1.5 A more secure version of '-exec'
 ---------------------------------------
 
 The command above seems to be efficient and portable.  However, a
 security problem lurks within it - one shared, in fact, by all the
 commands we've tried in this worked example so far.  The security
 problem is a race condition: if somebody can manipulate the filesystem
 that you are searching while you are searching it, they can persuade
 your 'find' command to cause the deletion of a file that you can
 delete but they normally cannot.
 
    The problem occurs because the '-exec' action is defined by the POSIX
 standard to invoke its command with the same working directory as 'find'
 had when it was started.  This means that the arguments which replace
 the '{}' include a relative path from 'find''s starting point down to
 the file that needs to be deleted.  For example,
 
      find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
 
    might actually issue the command:
 
      /bin/rm /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/passwd
 
    Notice the file '/var/tmp/stuff/passwd'.  Likewise, the command:
 
      cd /var/tmp && find stuff -mtime +90 -exec /bin/rm {} \+
 
    might actually issue the command:
 
      /bin/rm stuff/A stuff/B stuff/passwd
 
    If an attacker can rename 'stuff' to something else (making use of
 their write permissions in '/var/tmp') they can replace it with a
 symbolic link to '/etc'.  That means that the '/bin/rm' command will be
 invoked on '/etc/passwd'.  If you are running your 'find' command as
 root, the attacker has just managed to delete a vital file.  All they
 needed to do to achieve this was replace a subdirectory with a symbolic
 link at the vital moment.
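
    To make the race concrete, here is a sketch of what the attacker
 might run in the window between 'find' matching 'stuff/passwd' and
 '/bin/rm' acting on it (hypothetical and timing-dependent):

      # Attacker, with write access to /var/tmp, racing the find command:
      mv /var/tmp/stuff /var/tmp/stuff.orig
      ln -s /etc /var/tmp/stuff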
 
    There is, however, a simple solution to the problem: an action
 which works much like '-exec' but doesn't need to traverse a chain of
 directories to reach the file that it needs to work on.  This is the
 '-execdir' action, which was introduced by the BSD family of operating
 systems.  The command,
 
      find /var/tmp/stuff -mtime +90 -execdir /bin/rm {} \+
 
    might delete a set of files by performing these actions:
 
   1. Change directory to '/var/tmp/stuff/foo'
   2. Invoke '/bin/rm ./file1 ./file2 ./file3'
   3. Change directory to '/var/tmp/stuff/bar'
   4. Invoke '/bin/rm ./file99 ./file100 ./file101'
 
    This is a much more secure method.  We are no longer exposed to a
 race condition.  For many typical uses of 'find', this is the best
 strategy.  It's reasonably efficient, but the length of the command line
 is limited not just by the operating system limits, but also by how many
 files we actually need to delete from each directory.
 
    Is it possible to do any better?  In the case of general file
 processing, no.  However, in the specific case of deleting files it is
 indeed possible to do better.
 
 10.1.6 Using the '-delete' action
 ---------------------------------
 
 The most efficient and secure method of solving this problem is to use
 the '-delete' action:
 
      find /var/tmp/stuff -mtime +90 -delete
 
    This alternative is more efficient than any of the '-exec' or
 '-execdir' actions, since it entirely avoids the overhead of forking a
 new process and using 'exec' to run '/bin/rm'.  It is also normally more
 efficient than 'xargs' for the same reason.  The file deletion is
 performed from the directory containing the entry to be deleted, so the
 '-delete' action has the same security advantages as the '-execdir'
 action has.
 
    The '-delete' action was introduced by the BSD family of operating
 systems.
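
    Two cautions apply to GNU 'find' here.  First, '-delete' implies
 the '-depth' option.  Second, because the expression is evaluated left
 to right, writing '-delete' before the tests would delete everything
 under the starting point.  A careful habit is to preview with '-print'
 before switching to '-delete':

      find /var/tmp/stuff -mtime +90 -print
      find /var/tmp/stuff -mtime +90 -delete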
 
 10.1.7 Improving things still further
 -------------------------------------
 
 Is it possible to improve things still further?  Not without either
 modifying the system library or the operating system, or having more
 specific knowledge of the layout of the filesystem and disk I/O
 subsystem, or both.
 
    The 'find' command traverses the filesystem, reading directories.  It
 then issues a separate system call for each file to be deleted.  If we
 could modify the operating system, there are potential gains that could
 be made:
 
    * We could have a system call to which we pass more than one filename
      for deletion
    * Alternatively, we could pass in a list of inode numbers (on
      GNU/Linux systems, 'readdir()' also returns the inode number of
      each directory entry) to be deleted.
 
    The above possibilities sound interesting, but from the kernel's
 point of view it is difficult to enforce standard Unix access controls
 for such processing by inode number.  Such a facility would probably
 need to be restricted to the superuser.
 
    Another way of improving performance would be to increase the
 parallelism of the process.  For example if the directory hierarchy we
 are searching is actually spread across a number of disks, we might
 somehow be able to arrange for 'find' to process each disk in parallel.
 In practice GNU 'find' doesn't have such an intimate understanding of
 the system's filesystem layout and disk I/O subsystem.
 
    However, since the system administrator can have such an
 understanding they can take advantage of it like so:
 
      find /var/tmp/stuff1 -mtime +90 -delete &
      find /var/tmp/stuff2 -mtime +90 -delete &
      find /var/tmp/stuff3 -mtime +90 -delete &
      find /var/tmp/stuff4 -mtime +90 -delete &
      wait
 
    In the example above, four separate instances of 'find' are used to
 search four subdirectories in parallel.  The 'wait' command simply waits
 for all of these to complete.  Whether this approach is more or less
 efficient than a single instance of 'find' depends on a number of
 things:
 
    * Are the directories being searched in parallel actually on separate
      disks?  If not, this parallel search might just result in a lot of
      disk head movement and so the speed might even be slower.
    * Other activity - are other programs also doing things on those
      disks?
 
 10.1.8 Conclusion
 -----------------
 
 The fastest and most secure way to delete files with the help of 'find'
 is to use '-delete'.  Using 'xargs -0 -P N' can also make effective use
 of the disk, but it is not as secure.
 
    In the case where we're doing things other than deleting files, the
 most secure alternative is '-execdir ... +', but this is not as portable
 as the insecure action '-exec ... +'.
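
    For instance, if the task were to compress old files rather than
 delete them, a hypothetical secure variant would be:

      find /var/tmp/stuff -mtime +90 -execdir gzip -9 {} \+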
 
    The '-delete' action is not completely portable, but the only other
 possibility which is as secure ('-execdir') is no more portable.  The
 most efficient portable alternative is '-exec ... +', but this is
 insecure and isn't supported by versions of GNU findutils prior to
 4.2.12.