Patrick Useldinger wrote:
Brent Frère wrote:

Do a find. For each file, compute an md5sum. Sort the results. Detect the sets of files having matching md5sums. Do a binary compare of each pair of such files. If they match, you have found it!
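
As a rough illustration of the first steps of that recipe (find, md5sum, sort, detect matching sums), here is a minimal sketch. It assumes GNU md5sum and GNU uniq (for the -w and -D options); full_sum_groups is a made-up name, not an existing tool.

```shell
#!/bin/sh
# full_sum_groups DIR: hypothetical helper, assumes GNU md5sum and GNU uniq.
full_sum_groups() {
    # One "sum  path" line per file, sorted so that equal sums become adjacent.
    # uniq -w 32 compares only the 32-character MD5 hex digest,
    # and -D prints every line that belongs to a group of repeats.
    find "$1" -type f -exec md5sum {} + | sort | uniq -w 32 -D
}
```

Each group printed this way is only a set of candidates; the binary compare (cmp) still has to confirm them. File names containing newlines or backslashes would need extra care.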

I am going to write one, as I haven't found what I was looking for.

However, I haven't found a reason why I should use md5sum. It means that I have to read each file at least once entirely to compute the hash, and possibly twice if the hashes match.
Why not compare them directly (blockwise) if their length matches? And stop as soon as they differ?
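
A minimal sketch of that direct comparison, assuming a POSIX shell with wc and cmp available; same_content is a hypothetical helper name.

```shell
#!/bin/sh
# same_content A B: hypothetical helper.
# Returns 0 if both files have identical contents, 1 if they differ,
# 2 if one of them cannot be read.
same_content() {
    # Cheap test first: files of different length cannot be identical.
    size_a=$(wc -c < "$1") || return 2
    size_b=$(wc -c < "$2") || return 2
    [ "$size_a" -eq "$size_b" ] || return 1
    # cmp reads both files side by side and stops at the first difference;
    # -s suppresses its output so only the exit status is used.
    cmp -s "$1" "$2"
}
```

With this approach no file is read further than its first differing byte, and files of unequal length are never compared byte by byte at all.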

Maybe a (non-patentable) idea here: what about computing the md5sum on the first KB of each file?

Roughly speaking:

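# find . -type f -exec sh -c 'printf "%s  %s\n" "$(head -c 1024 "$1" | md5sum | cut -c1-32)" "$1"' sh {} \; | sort > md5sum.lst
# uniq -w 32 -D md5sum.lst > md5sum.dup
# while read -r sum file; do
>    [ "$sum" = "$prev" ] && cmp -s "$prevfile" "$file" && echo "$prevfile = $file"
>    prev=$sum; prevfile=$file
>    done < md5sum.dup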

Tuning the 1k value is a compromise between the cost of reading the files in the first pass (the find command) and the risk of running unnecessary cmp commands in the comparison loop. The chance of files sharing their first 1 kB is low in usual situations (except maybe with some obscure M$ formats?), and the first pass should be fast (any read from a hard disk fetches several blocks of 512 B each anyway).
The exact value is irrelevant as long as the files have unrelated contents. Otherwise, consider re-running the find with a larger 'head' size if the number of pairs left to compare is too large, or try tail instead of head.
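
To make that tuning easier to experiment with, here is a minimal sketch in which the byte count and the choice between head and tail are parameters. It assumes GNU md5sum and GNU uniq (-w, -D); partial_sum and candidates are hypothetical names, and the 1024-byte default simply mirrors the 1k used above.

```shell
#!/bin/sh
# partial_sum FILE [BYTES] [head|tail]: hypothetical helper.
# Prints the MD5 hex digest of the first (head) or last (tail) BYTES bytes.
partial_sum() {
    file=$1
    bytes=${2:-1024}
    tool=${3:-head}
    "$tool" -c "$bytes" "$file" | md5sum | cut -c1-32
}

# candidates DIR [BYTES] [head|tail]: hypothetical helper.
# Prints "sum  path" lines for the files whose partial sums collide.
candidates() {
    find "$1" -type f | while IFS= read -r f; do
        printf '%s  %s\n' "$(partial_sum "$f" "${2:-1024}" "${3:-head}")" "$f"
    done | sort | uniq -w 32 -D
}
```

For example, "candidates . 4096 tail | wc -l" would show how many files remain to be compared when hashing the last 4 kB instead of the first 1 kB. File names containing newlines are not handled.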

I leave this part of the job to you. If you do it, why not integrate it into RSYNC, so that renaming folders no longer kills the performance of the tool?
Brent Frère
