If you are dealing with data that contains numerous duplicates, I highly recommend checking out rdfind. This utility can replace duplicates with hardlinks or symlinks, or remove them entirely.
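
To get a feel for it, a dry run is a sensible first step; the path below is a placeholder and -makesymlinks is only one of the available modes:

# report what rdfind would do without modifying anything
rdfind -dryrun true -makesymlinks true /path/to/data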

If you are looking for deduplication specifically for backups, I suggest addressing the root cause and using BorgBackup, which natively handles deduplication, compression, and encryption.
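
As a rough sketch (the repository path and source directory are placeholders), setting this up looks roughly like:

# create an encrypted repository, then back it up with compression;
# borg deduplicates identical chunks across archives automatically
borg init --encryption=repokey /path/to/repo
borg create --stats --compression zstd /path/to/repo::'{hostname}-{now}' ~/data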

My Workflow

  1. Consolidate potentially duplicated files into a single location (extract all archives first!).
  2. Run updatedb so that locate can index and find files quickly (see the example after this list).
  3. Manually review and remove files that are definitely safe to delete first; calculating checksums for everything is time-consuming.
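
For step 2, something like the following is enough (node_modules is only an example of a pattern worth hunting down):

sudo updatedb
locate node_modules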

Note

The following directory/file suggestions are subjective and serve only as a general guideline. Some may contain critical data depending on your environment. Ensure you know exactly what you are doing.

.cache
.cargo
go
usr
bin
sbin
boot
lib
lib64
tmp
nix
.next

/var/cache/distfiles
/var/cache/pacman
/var/db/repos
SecLists

node_modules
*initrd*
*linux*
*vmlinuz*

restore
*~
to_remove

*.iso
*.img
*.img.zst

cache2
Cache
.npm or .npm/_cacache
*/_build
.venv
venv
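
To review candidates from the list above before deleting anything, a find invocation along these lines helps (node_modules and .venv are just two of the patterns listed):

# print matching directories only; delete manually after reviewing the output
find . -type d \( -name node_modules -o -name .venv \) -prune -print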

Deduplication with rdfind

Once you have cleared out the obvious clutter:

rdfind -checksum md5 -deterministic false -deleteduplicates true *

Using md5 carries a marginal risk of hash collisions, and -deterministic false makes the choice of which copy is kept dependent on scan order; adjust these parameters based on how critical your data is.

The command above deletes duplicates outright rather than replacing them with links, avoiding unnecessary overhead such as the extra inodes and indirection that links would introduce.
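
If the data is more critical, a more cautious variant of the command above is to preview first and use a stronger checksum (assuming your rdfind build supports sha256):

# preview only; switch -dryrun to false once the report looks right
rdfind -dryrun true -checksum sha256 -deleteduplicates true *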