Binsort - sort files by binary similarity Copyright (c) 2011 by Timm S. Müller Licensed under the 3-clause BSD license, see COPYRIGHT Scans the contents of a directory, groups the files by binary similarity, generates a filelist and prints the list to stdout. One possible application is to pass the list to an archiving tool, e.g.: $ binsort <dir> | tar -T- --no-recursion -czf out.tar.gz This can improve compression rates considerably, although sorting is in no way optimized for a particular compression algorithm. Usage: binsort [options] dir Options: -o Optimization level [1...1000], default: 15 -t Number of threads [1...128], default: 3 -q Quiet operation, no progress indicators -d Do not include directories in the output list -h --help This help Note: Results are not stable unless you specify -t 1. Memory consumption: Converges to the squared number of files, in bytes. Please consider this before sorting e.g. 100.000 files. Performance example: Binsorting the distribution of abiword 2.8.6 (44029103 bytes in 3391 files) on a quadcore CPU takes approx. 12s, and produces a tar.gz more than 14% smaller than without processing. Further notes: This is a research project combining threshold accepting, shingleprinting, and massive multithreading. It uses simhash by Bart Massey [1], and Tiny Mersenne Twister by Mutsuo Saito and Makoto Matsumoto [2]. See COPYRIGHT for the respective copyright holders' licensing terms. Binsort performs three time-consuming substeps: Hashing, distance calculation, and optimizing. All three have been parallelized and can be performed by any number of threads simultaneously. Hashing has linear time complexity, but may be time-consuming nevertheless. It is the only substep that may be I/O bound. Delta calculation and optimizing have quadratic time complexity. Hashing and distances calculation are completely parallelizable, and do not suffer from an increased overhead for synchronization. During optimization, threads compete for certain fields in the same data structure, but locking can be reduced to short code sections, so this step is (to some extent) parallalizable as well. Too many threads, however, can impair the performance. Results are not stable with more than one thread because binsort uses a heuristic optimization algorithm, which depends on large numbers of high-quality pseudo random numbers. Each thread operates on a number generator of its own, and if two threads race for an overlapping range of data points, one is locked out and will compute a new pair of random numbers, hence their unpredictability. Contact: Timm S. Müller <tmueller@schulze-mueller.de> Homepage: http://neoscientists.org/~tmueller/binsort/ Download: http://neoscientists.org/~tmueller/binsort/download/ Source code repository: http://hg.neoscientists.org/binsort/ References and further reading: [1] http://svcs.cs.pdx.edu/gitweb/simhash.git [2] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/TINYMT/index.html See also bibliography in simhash.c