Info: (find) Controlling Parallelism

Info Catalog
find: Limiting Command Size
find: Multiple Files
find: Interspersing File Names
find: Controlling Parallelism

 
 3.3.2.5 Controlling Parallelism
 ...............................
 
 Normally, ‘xargs’ runs one command at a time.  This is called "serial"
 execution; the commands happen in a series, one after another.  If you'd
 like ‘xargs’ to do things in "parallel", you can ask it to do so, either
 when you invoke it, or later while it is running.  Running several
 commands at one time can make the entire operation go more quickly, if
 the commands are independent, and if your system has enough resources to
 handle the load.  When parallelism works in your application, ‘xargs’
 provides an easy way to get your work done faster.
 
 ‘--max-procs=MAX-PROCS’
 ‘-P MAX-PROCS’
      Run up to MAX-PROCS processes at a time; the default is 1.  If
      MAX-PROCS is 0, ‘xargs’ will run as many processes as possible at a
      time.  Use the ‘-n’, ‘-s’, or ‘-L’ option with ‘-P’; otherwise
      chances are that the command will be run only once.
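
    The advice about combining ‘-P’ with ‘-n’ can be checked with a small
 experiment.  The sketch below is illustrative only: it uses ‘echo’ as a
 stand-in command and counts output lines, on the assumption that each
 ‘echo’ invocation prints exactly one line.

```shell
# Four short arguments fit on a single command line, so without -n
# xargs runs echo only once and -P 4 has nothing to run in parallel.
runs_without_n=$(printf '%s\n' a b c d | xargs -P 4 echo | wc -l | tr -d ' ')

# With -n 1 each argument gets its own echo, so there are four runs.
runs_with_n=$(printf '%s\n' a b c d | xargs -n 1 -P 4 echo | wc -l | tr -d ' ')

echo "without -n: $runs_without_n run(s); with -n 1: $runs_with_n run(s)"
```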
 
    For example, suppose you have a directory tree of large image files
 and a ‘makeallsizes’ script that takes a single file name and creates
 various sized images from it (thumbnail-sized, web-page-sized,
 printer-sized, and the original large file).  The script is doing enough
 work that it takes significant time to run, even on a single image.  You
 could run:
 
      find originals -name '*.jpg' | xargs -L 1 makeallsizes
 
    This will run ‘makeallsizes FILENAME’ once for each ‘.jpg’ file in
 the ‘originals’ directory.  However, if your system has two central
 processors, this script will only keep one of them busy.  Instead, you
 could probably finish in about half the time by running:
 
      find originals -name '*.jpg' | xargs -L 1 -P 2 makeallsizes
 
    ‘xargs’ will run the first two commands in parallel, and then
 whenever one of them terminates, it will start another one, until the
 entire job is done.
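
    The effect on elapsed time can be observed with a rough experiment.
 In the sketch below, ‘sleep 1’ is only a stand-in for a slow command such
 as ‘makeallsizes’, and the timing uses whole seconds from ‘date +%s’, so
 the figures are approximate.

```shell
# Each input line "1" becomes the argument of one 'sleep' command.
start=$(date +%s)
printf '%s\n' 1 1 1 1 | xargs -n 1 sleep
serial_secs=$(( $(date +%s) - start ))

# Same four jobs, but up to two of them run at the same time.
start=$(date +%s)
printf '%s\n' 1 1 1 1 | xargs -n 1 -P 2 sleep
parallel_secs=$(( $(date +%s) - start ))

echo "serial: ${serial_secs}s, two at a time: ${parallel_secs}s"
```

    Because ‘sleep’ uses almost no processor time, this shows the
 scheduling behaviour of ‘xargs’ rather than a real processor-bound
 speedup.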
 
    The same idea can be generalized to as many processors as you have
 handy.  It also generalizes to other resources besides processors.  For
 example, if ‘xargs’ is running commands that are waiting for a response
 from a distant network connection, running a few in parallel may reduce
 the overall latency by overlapping their waiting time.
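
    One way to match ‘-P’ to the number of processors is to ask the
 system for it.  The sketch below assumes that ‘getconf
 _NPROCESSORS_ONLN’ is available (it is common but not universal) and
 falls back to 1 otherwise; the temporary directory and the use of ‘echo’
 in place of ‘makeallsizes’ are purely for illustration.

```shell
# Number of online processors, or 1 if the system cannot tell us.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Build a small throwaway tree so the example is self-contained.
workdir=$(mktemp -d)
touch "$workdir/a.jpg" "$workdir/b.jpg" "$workdir/c.jpg"

# Run one stand-in command per file, at most $ncpu at a time.
processed=$(find "$workdir" -name '*.jpg' |
    xargs -n 1 -P "$ncpu" echo | wc -l | tr -d ' ')

rm -rf "$workdir"
echo "processed $processed files using up to $ncpu processes"
```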
 
    If you are running commands in parallel, you need to think about how
 they should arbitrate access to any resources that they share.  For
 example, if more than one of them tries to print to stdout, the output
 will be produced in an indeterminate order (and very likely mixed up)
 unless the processes collaborate in some way to prevent this.  Using
 some kind of locking scheme is one way to prevent such problems.  In
 general, using a locking scheme will help ensure correct output but
 reduce performance.  If you don't want to tolerate the performance
 difference, simply arrange for each process to produce a separate output
 file (or otherwise use separate resources).
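
    The separate-output-file approach might look like the following
 sketch.  The ‘sh -c’ wrapper, the ‘worker’ label, and the ‘.out’ naming
 scheme are assumptions made for the example, not something ‘xargs’
 requires.

```shell
outdir=$(mktemp -d)

# Each worker writes to its own file, named after its input item, so
# no two processes ever write to the same place.  The sh -c wrapper
# receives the output directory as $1 and the item from xargs as $2.
printf '%s\n' alpha beta gamma |
    xargs -n 1 -P 3 sh -c 'echo "processed $2" > "$1/$2.out"' worker "$outdir"

# xargs exits only after all workers finish; combine in a fixed order.
combined=$(cat "$outdir/alpha.out" "$outdir/beta.out" "$outdir/gamma.out")
rm -rf "$outdir"
printf '%s\n' "$combined"
```

    Since the files are concatenated in a fixed order afterwards, the
 combined result does not depend on which worker happened to finish
 first.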
 
    ‘xargs’ also allows "turning up" or "turning down" its parallelism in
 the middle of a run.  Suppose you are keeping your four-processor system
 busy for hours, processing thousands of images using ‘-P 4’.  Now, in
 the middle of the run, you or someone else wants you to reduce your load
 on the system, so that something else will run faster.  If you interrupt
 ‘xargs’, your job will be half-done, and it may take significant manual
 work to resume it only for the remaining images.  If you suspend ‘xargs’
 using your shell's job controls (e.g.  ‘control-Z’), then it will get no
 work done while suspended.
 
    Instead, you can adjust the parallelism while ‘xargs’ keeps running.
 Find out the process ID of the ‘xargs’ process, either from your shell
 or with the ‘ps’ command.  After you send it the signal ‘SIGUSR2’,
 ‘xargs’ will run one fewer command in parallel.  If you send it the
 signal ‘SIGUSR1’, it will run one more command in parallel.  For
 example:
 
      shell$ xargs <allimages -L 1 -P 4 makeallsizes &
      [4] 27643
         ... at some later point ...
      shell$ kill -USR2 27643
      shell$ kill -USR2 %4
 
    The first ‘kill’ command will cause ‘xargs’ to wait for two commands
 to terminate before starting the next command (reducing the parallelism
 from 4 to 3).  The second ‘kill’ will reduce it from 3 to 2.  (‘%4’
 works in some shells as a shorthand for the process ID of the background
 job labeled ‘[4]’.)
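
    The same thing can be done from a script, where ‘$!’ supplies the
 process ID.  This is a sketch that assumes GNU ‘xargs’ (other
 implementations may not handle these signals and could simply be
 terminated by them); the two-second ‘sleep’ jobs are stand-ins for real
 work.

```shell
# Five two-second stand-in jobs, at most three at a time.
printf '%s\n' 2 2 2 2 2 | xargs -n 1 -P 3 sleep &
xargs_pid=$!              # PID of xargs, the last command in the pipeline

sleep 1                   # let xargs start up and launch its first batch
kill -USR2 "$xargs_pid"   # from now on, at most two commands at a time

# Running commands are not killed; xargs just starts fewer new ones.
wait "$xargs_pid"
xargs_status=$?
echo "xargs exited with status $xargs_status"
```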
 
    Similarly, if you started a long ‘xargs’ job without parallelism, you
 can easily switch it to start running two commands in parallel by
 sending it a ‘SIGUSR1’.
 
    ‘xargs’ will never terminate any existing commands when you ask it to
 run fewer processes.  It merely waits for the excess commands to finish.
 If you ask it to run more commands, it will start the next one
 immediately (if it has more work to do).  If the degree of parallelism
 is already 1, sending ‘SIGUSR2’ will have no further effect: the next
 lower value, ‘--max-procs=0’, would mean no limit on the number of
 processes rather than fewer of them.
 
    There is an implementation-defined limit on the number of processes.
 This limit is shown with ‘xargs --show-limits’.  The limit is at least
 127 on all systems (and on the author's system it is 2147483647).
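
    To look at the limits without running anything, something like the
 following sketch should work with GNU ‘xargs’.  It relies on the report
 being written to standard error, and uses ‘-r’ with empty input so that
 no command is actually run; the exact wording of the report varies
 between versions, so none is shown here.

```shell
# Capture the limits report; </dev/null means xargs has no input to
# wait for, and -r stops it from running the default command once.
limits=$(xargs -r --show-limits </dev/null 2>&1)
printf '%s\n' "$limits"
```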
 
    If you send several identical signals quickly, the operating system
 does not guarantee that each of them will be delivered to ‘xargs’.  This
 means that you can't rapidly increase or decrease the parallelism by
 more than one command at a time.  You can avoid this problem by sending
 a signal, observing the result, then sending the next one; or merely by
 delaying for a few seconds between signals (unless your system is very
 heavily loaded).
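
    A paced sequence of signals might be scripted as in the sketch below.
 It assumes GNU ‘xargs’, uses one-second ‘sleep’ jobs as stand-ins, and
 pauses for only one second between signals to keep the demonstration
 short; a longer pause is safer on a busy system.

```shell
# Start serially (-P 1) with six one-second stand-in jobs.
printf '%s\n' 1 1 1 1 1 1 | xargs -n 1 -P 1 sleep &
xargs_pid=$!

# Raise the parallelism from 1 to 3, one signal at a time, pausing
# before each signal so that the previous one has been delivered.
for step in 1 2; do
    sleep 1
    kill -USR1 "$xargs_pid"
done

wait "$xargs_pid"
xargs_status=$?
echo "xargs exited with status $xargs_status"
```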
 
    Whether or not parallel execution will work well for you depends on
 the nature of the command you are running in parallel, on the
 configuration of the system on which you are running the command, and on
 the other work being done on the system at the time.