If you are dealing with data containing numerous duplicates, I highly recommend checking out rdfind. This utility replaces duplicates with symlinks or removes them entirely.
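For example, a minimal sketch of the link-replacement mode (the path here is just a placeholder):

```
# Replace every duplicate found under ~/data with a symlink to the copy rdfind keeps.
rdfind -makesymlinks true ~/data
```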
If you are looking for deduplication specifically for backups, I suggest addressing the root cause by using BorgBackup. BorgBackup natively handles deduplication, compression, and encryption.
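A minimal BorgBackup sketch, assuming a local repository path and archive naming of your own choosing:

```
# One-time setup: an encrypted, deduplicating repository.
borg init --encryption=repokey /mnt/backup/borg-repo

# Back up a directory; identical chunks are stored only once.
borg create --compression zstd /mnt/backup/borg-repo::home-{now} ~/
```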
My Workflow
- Consolidate potentially duplicated files into a single location (extract all archives first!).
- Run `updatedb` to ensure fast indexing with `locate` (see the example after this list).
- Manually review and remove files that are definitely safe to delete (calculating checksums is time-consuming).
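For instance, assuming mlocate or plocate is installed (the search pattern is only an illustration):

```
# Rebuild the locate database, then query it instantly.
sudo updatedb
locate node_modules | less
```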
Note
The following directory/file suggestions are subjective and serve only as a general guideline. Some may contain critical data depending on your environment. Ensure you know exactly what you are doing.
- `.cache`
- `.cargo`
- `go`
- `usr`
- `bin`
- `sbin`
- `boot`
- `lib`
- `lib64`
- `tmp`
- `nix`
- `.next`
- `/var/cache/distfiles`
- `/var/cache/pacman`
- `/var/db/repos`
- `SecLists`
- `node_modules`
- `*initrd*`
- `*linux*`
- `*vmlinuz*`
- `restore`
- `*~`
- `to_remove`
- `*.iso`
- `*.img`
- `*.img.zst`
- `cache2`
- `Cache`
- `.npm` or `.npm/_cacache`
- `*/_build`
- `.venv`
- `venv`
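Before deleting anything, it can help to see how much space a candidate actually occupies; a rough sketch using one pattern from the list above:

```
# Show the size of every indexed node_modules directory, largest last.
locate -b node_modules | xargs -r -d '\n' du -sh 2>/dev/null | sort -h
```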
Deduplication with rdfind
Once you have cleared out the obvious clutter:
```
rdfind -checksum md5 -deterministic false -deleteduplicates true *
```
Using md5 and disabling deterministic ranking carries a marginal risk of collisions; adjust these parameters according to how critical your data is.
The command above deletes duplicates instead of replacing them with links to avoid unnecessary overhead, such as redundant inodes.
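If the data is more critical, a more conservative sketch (hypothetical path) keeps the default deterministic ranking, switches to sha256, and previews the result first:

```
# Dry run: report what would be removed without touching anything.
rdfind -dryrun true -checksum sha256 -deleteduplicates true ~/consolidated

# If the report looks right, run it for real.
rdfind -checksum sha256 -deleteduplicates true ~/consolidated
```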