Coding Challenge #4: File Duplicate Finder
Your Downloads folder has grown to 50 GB and your laptop is running out of space. You suspect duplicate files are eating up storage: the same PDF downloaded five times, duplicate photos, identical software installers with different names.
Manual cleanup would take forever, and you can't trust filenames, since document.pdf and final_report_v2.pdf might be identical. You need a tool that finds true duplicates by analyzing file content, not just names.
Your Mission
Build a duplicate file detector that uses content hashing to identify identical files, even when they have different names or are in different folders.
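As a starting point, content hashing can be sketched as below. The function name `hash_file` and the 1 MiB chunk size are assumptions made for illustration, not part of the challenge; the only library used is Python's standard `hashlib`.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file's content, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file),
        # so at most chunk_size bytes are held in memory at a time.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same bytes produce the same digest regardless of their names or locations, which is what lets the tool ignore filenames entirely.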
Requirements
Your tool must:
Scan directories recursively for all files
Calculate content hashes (MD5 or SHA256) to identify identical files
Group duplicate files together and show potential space savings
Handle large files efficiently (read them in chunks rather than loading an entire file into memory)
Skip symbolic links and handle permission errors gracefully
Generate a summary report with duplicate groups and sizes
Optionally delete duplicates (with confirmation prompts)
Support exclusion patterns (ignore certain file types or folders)
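The scanning requirements (recursive traversal, skipping symbolic links, tolerating permission errors, and honoring exclusion patterns) could be approached roughly like this. The names `iter_files` and `_is_excluded` are hypothetical, and matching patterns against bare file and folder names with `fnmatch` is one assumed interpretation of "exclusion patterns".

```python
import fnmatch
import os


def _is_excluded(name, patterns):
    """True if a file or folder name matches any shell-style pattern (e.g. '*.tmp')."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def iter_files(root, exclude=()):
    """Yield (path, size) for regular files under root, skipping symlinks and excluded names."""

    def report_error(error):
        # os.walk calls this when a directory cannot be listed (e.g. permission denied).
        print(f"Skipping {error.filename}: {error.strerror}")

    for dirpath, dirnames, filenames in os.walk(root, onerror=report_error, followlinks=False):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if not _is_excluded(d, exclude)]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if _is_excluded(name, exclude) or os.path.islink(path):
                continue
            try:
                yield path, os.path.getsize(path)
            except OSError as error:
                # The file may have vanished or be unreadable; report it and move on.
                print(f"Skipping {path}: {error.strerror}")
```

A call such as `iter_files("test_folder", exclude=("*.tmp", ".git"))` would then feed paths and sizes to the hashing step. Recording the size here is cheap and allows files with a unique size to be skipped before any hashing is done.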
Sample Directory Structure
test_folder/
├── documents/
│ ├── report.pdf (1.2MB)
│ ├── final_report.pdf (1.2MB) [duplicate of report.pdf]
│ └── meeting_notes.txt (5KB)
├── downloads/
│ ├── installer.exe (50MB)
│ ├── software_v2.exe (50MB) [duplicate of installer.exe]
│ ├── photo.jpg (2.3MB)
│ └── vacation.jpg (2.3MB) [duplicate of photo.jpg]
└── backup/
├── old_report.pdf (1.2MB) [duplicate of report.pdf]
└── unique_file.docx (800KB)
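To experiment against this layout without real 50 MB installers, a small fixture script can recreate it with scaled-down files. This is only a sketch: `build_sample_tree` is a hypothetical helper, and the payloads are placeholder bytes whose sizes do not match the sizes listed above. Files marked as duplicates share a payload; everything else is unique.

```python
from pathlib import Path


def build_sample_tree(root):
    """Create the sample folder layout under root, using tiny placeholder contents."""
    # One payload per distinct content; duplicates reuse the same key.
    payloads = {
        "report": b"report-bytes" * 100,
        "notes": b"meeting-notes",
        "installer": b"installer-bytes" * 400,
        "photo": b"photo-bytes" * 200,
        "unique": b"unique-docx-bytes" * 50,
    }
    layout = {
        "documents/report.pdf": "report",
        "documents/final_report.pdf": "report",
        "documents/meeting_notes.txt": "notes",
        "downloads/installer.exe": "installer",
        "downloads/software_v2.exe": "installer",
        "downloads/photo.jpg": "photo",
        "downloads/vacation.jpg": "photo",
        "backup/old_report.pdf": "report",
        "backup/unique_file.docx": "unique",
    }
    root = Path(root)
    for relative, key in layout.items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payloads[key])
    return sorted(layout)
```

Running `build_sample_tree("test_folder")` gives a tree with three duplicate groups to test a detector against.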
Expected Output
File Duplicate Analysis Report
==============================
📁 Scanned: test_folder/
📊 Total files: 9
📦 Total size: 109.5 MB
⏱️ Scan time: 1.2 seconds
🔍 Duplicate Groups Found:
==========================
Group 1: report.pdf (3 duplicates)
--------------------------------
Size: 1.2 MB each | Total waste: 2.4 MB
├── documents/report.pdf (original)
├── documents/final_report.pdf
└── backup/old_report.pdf
Group 2: installer.exe (2 duplicates)
------------------------------------
Size: 50.0 MB each | Total waste: 50.0 MB
├── downloads/installer.exe (original)
└── downloads/software_v2.exe
Group 3: photo.jpg (2 duplicates)
---------------------------------
Size: 2.3 MB each | Total waste: 2.3 MB
├── downloads/photo.jpg (original)
└── downloads/vacation.jpg
📈 Summary:
===========
Duplicate groups: 3
Duplicate files: 4
Wasted space: 54.7 MB (49.9% of total)
Potential savings: 54.7 MB
🗑️ Delete duplicates? [y/N]:
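The grouping and "Total waste" figures in a report like this can be derived as sketched below. It assumes each scanned file has already been reduced to a `(path, size, digest)` tuple; `summarize_duplicates` and the dictionary keys are hypothetical names. Waste for a group is taken as the file size times the number of copies beyond the first, and treating the first path seen as the "original" is an arbitrary policy of this sketch.

```python
from collections import defaultdict


def summarize_duplicates(records):
    """Group (path, size, digest) records by content and compute reclaimable space."""
    by_content = defaultdict(list)
    total_size = 0
    for path, size, digest in records:
        # Keying on size as well as digest guards against treating a hash collision
        # between files of different lengths as a duplicate.
        by_content[(size, digest)].append(path)
        total_size += size

    groups = []
    for (size, _digest), paths in by_content.items():
        if len(paths) > 1:
            # Keeping one copy means every additional copy is wasted space.
            groups.append({"size": size, "paths": paths, "waste": size * (len(paths) - 1)})
    groups.sort(key=lambda group: group["waste"], reverse=True)

    wasted = sum(group["waste"] for group in groups)
    summary = {
        "groups": len(groups),
        "redundant_files": sum(len(group["paths"]) - 1 for group in groups),
        "wasted_bytes": wasted,
        "wasted_percent": 100 * wasted / total_size if total_size else 0.0,
    }
    return groups, summary
```

The report printer and the deletion prompt can then work purely from `groups` and `summary`, which keeps the destructive step separate from the analysis.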
Python Solution