[Lilux-help] Duplicate files

Brent Frère brent at bfrere.net
Sun Feb 20 14:58:03 CET 2005


Patrick Useldinger a écrit :

> Brent Frère wrote:
>
>> Do a find. For each file, compute a md5sum. Do a sort of it. Detect 
>> the sets of files having matching md5sum. Do a binary compare of each 
>> couple of such files. If it matches, you found it !
>
>
> I am going to write one, as I haven't found what I was looking for.
>
> However, I haven't found a reason why I should use md5sum. It means 
> that I have to read each file at least once entirely to compute the 
> hash, and possibly twice if the hashes match.
> Why not compare them directly (blockwise) if their length matches? And 
> stop as soon as they differ?
>
> -pu
> _______________________________________________
> Lilux-help mailing list
> Lilux-help at lilux.lu
> http://lilux.lu/mailman/listinfo/lilux-help

Great idea.

You have 100.000 files in an average filesystem. If you compare each 
couple, this gives you
100.000 * (100.000-1) / 2 combinations, so about 5.000.000.000 file 
comparaisons, or 10.000.000.000 file read. As example, the first file 
only will have to be compared do 99.999 others...
Instead, the md5sum stuff reads all the files (indeed) but only once, 
leading to 100.000 file read. About 100.000 times more performant that 
your proposal.
The actual file comparison is only a confirmation that the files are 
indeed exactly the same, and not only sharing the same md5sum by chance 
(very unlikely), so will be done more than probably on files that HAVE 
the same content. No waste of time then, because it is the first time 
the two involved files will be actually compared. You don't wish to flag 
as identical files the ones that are just sharing the same md5sum and 
file length, I guess ? Doing so would lead to a M$t-like system: 
something that works properly sometimes, and has strange behaviour in 
some unpredictable, unidentified circumstances, and even sometimes a non 
causal behaviour. Do your choice.

For very small "filesystems" (~10 files), your algorithm might be more 
efficient, depending on the file content (your comparaison might indeed 
not read the file up to the end if properly implemented), but computer 
science learns us that an algorithm having a lower computational 
complexity 
<http://en.wikipedia.org/wiki/Computational_complexity_theory> is always 
the good choice in long-term.

Yours,

-- 
Brent Frère

Private e-mail:  Brent at BFrere.net

Postal address: 5, rue de Mamer
                L-8280 Kehlen
                Grand-Duchy of Luxembourg
                European Union

Mobile: +352-021/29.05.98
Fax:    +352-26.30.05.96
Home:   +352-307.341
URL:    http://BFrere.net

If you have problem with my digital signature, please install the appropriate authority certificate by browsing https://www.cacert.org/certs/root.crt.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lilux.lu/pipermail/lilux-help/attachments/20050220/c1223f71/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: brent.vcf
Type: text/x-vcard
Size: 216 bytes
Desc: not available
URL: <http://lilux.lu/pipermail/lilux-help/attachments/20050220/c1223f71/attachment.vcf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3383 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lilux.lu/pipermail/lilux-help/attachments/20050220/c1223f71/attachment.bin>


More information about the Lilux-help mailing list