find: Old Database Format

 
 4.2.4 Old Database Format
 -------------------------
 
 The old database format is used by Unix 'locate' and 'find' programs and
 pre-4.0 releases of GNU findutils.  'locate' understands this format,
 though 'updatedb' will no longer produce it.
 
    The old format differs from 'LOCATE02' in the following ways.
 Entries do not start with an offset-differential count byte and end
 with a null.  Instead, byte values from 0 through 28 indicate
 offset-differential counts from -14 through 14.  The byte value
 indicating that a long offset-differential count follows is 0x1e (30),
 not 0x80.  The long counts are stored in host byte order, which is not
 necessarily network byte order, and host integer word size, which is
 usually 4 bytes.  They are also biased by 14: the count represented is
 14 less than the stored value.  The database entries have no
 termination byte; the start of the next entry is indicated by its first
 byte having a value <= 30.
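 
    The count encoding described above can be illustrated with a short
 Python sketch.  It is not taken from the 'locate' sources: the function
 name 'read_old_count' and its parameters are hypothetical, the byte
 order and word size must be supplied by the caller, and the long count
 is assumed to be a signed integer.

```python
OLD_ESCAPE = 0x1E   # byte announcing that a long count follows
OLD_OFFSET = 14     # bias applied to both short and long counts


def read_old_count(data, pos, byteorder="little", word_size=4):
    """Decode one offset-differential count starting at data[pos].

    Returns (count, next_pos).  byteorder and word_size describe the
    host that wrote the database; both are assumptions of this sketch.
    """
    b = data[pos]
    if b == OLD_ESCAPE:
        raw = data[pos + 1:pos + 1 + word_size]
        if len(raw) != word_size:
            raise ValueError("truncated long count")
        # Assumption: the stored word is a signed integer.
        value = int.from_bytes(raw, byteorder, signed=True)
        return value - OLD_OFFSET, pos + 1 + word_size
    if b <= 28:
        # Short form: byte values 0..28 map to counts -14..14.
        return b - OLD_OFFSET, pos + 1
    # Other values are not described as count bytes in the text.
    raise ValueError("not a count byte: %d" % b)
```

    For example, a single byte 0 decodes to a count of -14, and the
 byte 0x1e followed by the 4-byte word 114 decodes to a count of 100.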
 
    In addition, instead of starting with a dummy entry, the old database
 format starts with a 256 byte table containing the 128 most common
 bigrams in the file list.  A bigram is a pair of adjacent bytes.  Bytes
 in the database that have the high bit set are indexes (with the high
 bit cleared) into the bigram table.  The bigram and offset-differential
 count coding makes these databases 20-25% smaller than the new format,
 but makes them not 8-bit clean.  Any byte in a file name that is in the
 ranges used for the special codes is replaced in the database by a
 question mark, which not coincidentally is the shell wildcard to match a
 single character.  The old format therefore cannot faithfully store
 entries with non-ASCII characters.
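 
    As a rough illustration of the bigram coding, the Python sketch
 below expands bigram indexes and shows the question-mark substitution.
 It makes assumptions the text does not state: the 256-byte table is
 taken to hold 128 consecutive two-byte pairs, and the "special" ranges
 are taken to be byte values <= 30 and >= 128.  The names
 'expand_bigrams' and 'sanitize_name' are hypothetical.

```python
def expand_bigrams(table, name_bytes):
    """Expand bigram indexes in the name part of an entry.

    table is the 256-byte bigram table; name_bytes holds the bytes of
    an entry after its count has been removed.
    """
    if len(table) != 256:
        raise ValueError("bigram table must be 256 bytes")
    out = bytearray()
    for b in name_bytes:
        if b & 0x80:
            # High bit set: the low 7 bits index the bigram table.
            i = b & 0x7F
            out += table[2 * i:2 * i + 2]
        else:
            out.append(b)
    return bytes(out)


def sanitize_name(name):
    """Replace bytes that collide with the special codes by '?'."""
    return bytes(ord("?") if (b <= 30 or b >= 0x80) else b for b in name)
```

    With a table whose first two bigrams are "ab" and "/u", the bytes
 0x80, 'c', 0x81 expand to "abc/u".  A UTF-8 name such as "café" loses
 both bytes of the accented character, which is why such names cannot
 be stored faithfully.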
 
    Because the long counts are stored as native-order machine words, the
 database format is not easily used in environments which differ in terms
 of byte order.  If locate databases are to be shared between machines,
 the 'LOCATE02' database format should be used.  This has other benefits
 as discussed above.  However, the length of the file name currently
 being processed normally places reasonable limits on the long counts,
 and 'locate' uses this information to help it guess the byte ordering
 of an old-format database.  Unless it finds evidence to
 the contrary, 'locate' will assume that the byte order of the database
 is the same as the native byte order of the machine running 'locate'.
 The output of 'locate --statistics' also includes information about the
 byte order of old-format databases.
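 
    The idea behind the byte-order guess can be sketched as follows.
 This is only an illustration of the principle, not the heuristic that
 'locate' actually implements: the function 'plausible_orders' and its
 parameters are hypothetical, and the plausibility test (the adjusted
 prefix length must lie between 0 and the length of the current file
 name) is an assumption.

```python
def plausible_orders(raw, prefix_len, name_len):
    """Return the byte orders under which a long count looks sane.

    raw is the stored machine word, prefix_len is the prefix length
    before this entry, and name_len is the length of the file name
    currently being processed.
    """
    orders = []
    for order in ("little", "big"):
        # Long counts are stored with a bias of 14.
        count = int.from_bytes(raw, order, signed=True) - 14
        # A sane count keeps the shared prefix inside the current name.
        if 0 <= prefix_len + count <= name_len:
            orders.append(order)
    return orders
```

    A word that yields a sensible count in only one order identifies
 the byte order of the database.  If both orders are plausible, the
 entry gives no evidence either way, which matches the default
 described above of assuming the native byte order.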
 
    The output of 'locate --statistics' will give an incorrect count of
 the number of file names containing newlines or high-bit characters for
 old-format databases.
 
    Old versions of GNU 'locate' fail to correctly handle very long file
 names, possibly leading to security problems relating to a heap buffer
 overrun.  See 'Security Considerations for locate' for a detailed
 explanation.