Binsort - sort files by binary similarity

Copyright (c) 2011 by Timm S. Müller
Licensed under the 3-clause BSD license, see COPYRIGHT

Scans the contents of a directory, groups the files by binary
similarity, generates a filelist and prints the list to stdout. 

One possible application is to pass the list to an archiving tool, 

$ binsort <dir> | tar -T- --no-recursion -czf out.tar.gz

This can improve compression rates considerably, although sorting is
in no way optimized for a particular compression algorithm.

Usage: binsort [options] dir
  -o          Optimization level [1...1000], default: 15
  -t          Number of threads [1...128], default: 3
  -q          Quiet operation, no progress indicators
  -d          Do not include directories in the output list
  -h  --help  This help

Note: Results are not stable unless you specify -t 1.

Memory consumption: Converges to the squared number of files, in
bytes. Please consider this before sorting e.g. 100.000 files.

Performance example:

Binsorting the distribution of abiword 2.8.6 (44029103 bytes in 3391
files) on a quadcore CPU takes approx. 12s, and produces a tar.gz
more than 14% smaller than without processing.

Further notes:

This is a research project combining threshold accepting,
shingleprinting, and massive multithreading. It uses simhash by Bart
Massey [1], and Tiny Mersenne Twister by Mutsuo Saito and Makoto
Matsumoto [2]. See COPYRIGHT for the respective copyright holders'
licensing terms.

Binsort performs three time-consuming substeps: Hashing, distance
calculation, and optimizing. All three have been parallelized and can
be performed by any number of threads simultaneously. 

Hashing has linear time complexity, but may be time-consuming
nevertheless. It is the only substep that may be I/O bound. Delta
calculation and optimizing have quadratic time complexity. 

Hashing and distances calculation are completely parallelizable, and
do not suffer from an increased overhead for synchronization. During
optimization, threads compete for certain fields in the same data
structure, but locking can be reduced to short code sections, so this
step is (to some extent) parallalizable as well. Too many threads,
however, can impair the performance.

Results are not stable with more than one thread because binsort uses
a heuristic optimization algorithm, which depends on large numbers of
high-quality pseudo random numbers. Each thread operates on a number
generator of its own, and if two threads race for an overlapping
range of data points, one is locked out and will compute a new pair
of random numbers, hence their unpredictability.

Timm S. Müller <>



Source code repository:

References and further reading:

See also bibliography in simhash.c