[Lilux-help] Duplicate files
Brent Frère
brent at bfrere.net
Sun Feb 20 14:58:03 CET 2005
Patrick Useldinger wrote:
> Brent Frère wrote:
>
>> Do a find. For each file, compute a md5sum. Do a sort of it. Detect
>> the sets of files having matching md5sum. Do a binary compare of each
>> couple of such files. If it matches, you found it!
>
>
> I am going to write one, as I haven't found what I was looking for.
>
> However, I haven't found a reason why I should use md5sum. It means
> that I have to read each file at least once entirely to compute the
> hash, and possibly twice if the hashes match.
> Why not compare them directly (blockwise) if their length matches? And
> stop as soon as they differ?
>
> -pu
Great idea.
You have 100,000 files in an average filesystem. If you compare each
couple, this gives you 100,000 * (100,000 - 1) / 2 combinations, so about
5,000,000,000 file comparisons, or 10,000,000,000 file reads. As an
example, the first file alone has to be compared to 99,999 others...
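The counting argument above is easy to check with a few lines of Python (the figures are the worst case, assuming every pair is compared and each comparison reads both files in full):

```python
# Worst-case cost of comparing every pair of n files:
# n * (n - 1) / 2 pairs, each pair reading two files.
n = 100_000
pairs = n * (n - 1) // 2
reads = 2 * pairs

print(pairs)  # 4999950000 -- about 5 * 10**9 comparisons
print(reads)  # 9999900000 -- about 10**10 file reads
```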
Instead, the md5sum approach reads all the files too, but only once each,
leading to 100,000 file reads: about 100,000 times fewer than your
proposal.
The actual file comparison is only a confirmation that the files are
indeed exactly the same, and not merely sharing the same md5sum by chance
(very unlikely), so it will almost certainly be done on files that HAVE
the same content. No time is wasted, because it is the first time the two
files involved are actually compared. You don't want to flag as identical
files that merely share the same md5sum and file length, I guess? Doing
so would lead to a M$-like system: something that works properly
sometimes, behaves strangely in unpredictable, unidentified
circumstances, and sometimes even non-causally. Make your choice.
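The scheme described in this thread (hash every file once, group by hash, then confirm candidates with a byte-wise compare) could be sketched in Python roughly as follows; `find_duplicates` and `md5_of` are illustrative names, not an existing tool:

```python
import hashlib
import os
import filecmp
from collections import defaultdict

def md5_of(path, chunk=1 << 20):
    """Hash a file in fixed-size chunks so each file is read exactly once,
    without ever loading it whole into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    """Group files under root by md5sum, then confirm each candidate pair
    with a byte-wise comparison (the 'confirmation' step above)."""
    by_hash = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p):
                by_hash[md5_of(p)].append(p)

    duplicates = []
    for paths in by_hash.values():
        if len(paths) > 1:
            first = paths[0]
            for other in paths[1:]:
                # shallow=False forces an actual content comparison,
                # not just a stat() check.
                if filecmp.cmp(first, other, shallow=False):
                    duplicates.append((first, other))
    return duplicates
```

Note that the byte-wise confirmation only runs on files that already share a hash, so in practice it touches almost exclusively true duplicates, as argued above.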
For very small "filesystems" (~10 files), your algorithm might be more
efficient, depending on the file contents (your comparison might indeed
stop before reading a file to the end, if properly implemented), but
computer science teaches us that the algorithm with the lower
computational complexity
<http://en.wikipedia.org/wiki/Computational_complexity_theory> is the
better choice in the long term.
Yours,
--
Brent Frère
Private e-mail: Brent at BFrere.net
Postal address: 5, rue de Mamer
L-8280 Kehlen
Grand-Duchy of Luxembourg
European Union
Mobile: +352-021/29.05.98
Fax: +352-26.30.05.96
Home: +352-307.341
URL: http://BFrere.net
If you have problem with my digital signature, please install the appropriate authority certificate by browsing https://www.cacert.org/certs/root.crt.