Crack That Weekly

Coding Challenge #4: File Duplicate Finder

Sharon Sahadevan
Jul 25, 2025

Your Downloads folder has grown to 50 GB and your laptop is running out of space. You suspect duplicate files are eating up the storage: the same PDF downloaded five times, duplicate photos, identical software installers with different names.

Manual cleanup would take forever, and you can't trust filenames since document.pdf and final_report_v2.pdf might be identical. You need a tool that can find true duplicates by analyzing file content, not just names.

Your Mission

Build a duplicate file detector that uses content hashing to identify identical files, even when they have different names or are in different folders.
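Content hashing is the core of the mission, so here is a minimal sketch of it, assuming SHA-256 from Python's standard `hashlib` module. The function name `hash_file` and the 1 MiB chunk size are illustrative choices, not something the challenge prescribes.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def hash_file(path, algorithm="sha256", chunk_size=CHUNK_SIZE):
    """Return the hex digest of a file's content, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"",
        # so only one chunk is held in memory at a time.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same bytes produce the same digest whatever they are called, which is why document.pdf and final_report_v2.pdf can be matched without looking at their names.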

Requirements

Your tool must:

  • Scan directories recursively for all files

  • Calculate content hashes (MD5 or SHA256) to identify identical files

  • Group duplicate files together and show potential space savings

  • Handle large files efficiently (don't load the entire file into memory)

  • Skip symbolic links and handle permission errors gracefully

  • Generate a summary report with duplicate groups and sizes

  • Optionally delete duplicates (with confirmation prompts)

  • Support exclusion patterns (ignore certain file types or folders)
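One way the scanning, exclusion, and grouping requirements could fit together is sketched below. The names `iter_files`, `is_excluded`, and `find_duplicates` are hypothetical. Grouping by file size before hashing is an optimization assumed here, not something the requirements ask for.

```python
import fnmatch
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1024 * 1024):
    """SHA-256 of a file's content, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_excluded(name, patterns):
    """True if a file or folder name matches any glob pattern, e.g. '*.tmp'."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def iter_files(root, exclude=()):
    """Yield (path, size) for regular files under root, skipping symlinks."""
    # os.walk does not follow directory symlinks by default; the onerror
    # callback ignores directories that cannot be listed.
    for dirpath, dirnames, filenames in os.walk(root, onerror=lambda err: None):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if not is_excluded(d, exclude)]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_excluded(name, exclude) or os.path.islink(path):
                continue
            if not os.path.isfile(path):
                continue  # sockets, FIFOs, and other special files
            try:
                yield path, os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it


def find_duplicates(root, exclude=()):
    """Return a sorted list of groups, each a sorted list of identical files."""
    by_size = defaultdict(list)
    for path, size in iter_files(root, exclude):
        by_size[size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[hash_file(path)].append(path)
            except OSError:
                continue  # e.g. permission denied while reading
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

A call such as `find_duplicates("test_folder", exclude=("*.tmp", ".git"))` would skip both the matching files and the matching folder. Hashing only files that share a size avoids reading most unique files at all.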

Sample Directory Structure

test_folder/
├── documents/
│   ├── report.pdf (1.2MB)
│   ├── final_report.pdf (1.2MB) [duplicate of report.pdf]
│   └── meeting_notes.txt (5KB)
├── downloads/
│   ├── installer.exe (50MB)
│   ├── software_v2.exe (50MB) [duplicate of installer.exe]
│   ├── photo.jpg (2.3MB)
│   └── vacation.jpg (2.3MB) [duplicate of photo.jpg]
└── backup/
    ├── old_report.pdf (1.2MB) [duplicate of report.pdf]
    └── unique_file.docx (800KB)
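To try a detector against this layout, the tree can be recreated with small placeholder files. The sketch below is only a test fixture: the byte contents are made up and far smaller than the sizes listed above, and `build_sample_tree` is a hypothetical helper name. Files marked as duplicates get identical bytes.

```python
import os

# Hypothetical stand-in contents; files sharing a key get identical bytes.
CONTENT = {
    "report": b"quarterly report\n" * 64,
    "notes": b"meeting notes\n" * 8,
    "installer": b"\x00installer\x00" * 256,
    "photo": b"\xff\xd8placeholder image" * 32,
    "unique": b"one of a kind\n" * 16,
}

# Relative path -> content key, mirroring the sample directory structure.
LAYOUT = {
    "documents/report.pdf": "report",
    "documents/final_report.pdf": "report",
    "documents/meeting_notes.txt": "notes",
    "downloads/installer.exe": "installer",
    "downloads/software_v2.exe": "installer",
    "downloads/photo.jpg": "photo",
    "downloads/vacation.jpg": "photo",
    "backup/old_report.pdf": "report",
    "backup/unique_file.docx": "unique",
}


def build_sample_tree(root):
    """Create the sample files under root and return the created paths."""
    created = []
    for rel_path, key in LAYOUT.items():
        path = os.path.join(root, *rel_path.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(CONTENT[key])
        created.append(path)
    return created
```

Calling `build_sample_tree("test_folder")` writes nine files, of which four are redundant copies.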

Expected Output

File Duplicate Analysis Report
==============================
📁 Scanned: test_folder/
📊 Total files: 9
📦 Total size: 109.5 MB
⏱️  Scan time: 1.2 seconds

🔍 Duplicate Groups Found:
==========================

Group 1: report.pdf (3 duplicates)
--------------------------------
Size: 1.2 MB each | Total waste: 2.4 MB
├── documents/report.pdf (original)
├── documents/final_report.pdf
└── backup/old_report.pdf

Group 2: installer.exe (2 duplicates)
------------------------------------
Size: 50.0 MB each | Total waste: 50.0 MB
├── downloads/installer.exe (original)
└── downloads/software_v2.exe

Group 3: photo.jpg (2 duplicates)
---------------------------------
Size: 2.3 MB each | Total waste: 2.3 MB
├── downloads/photo.jpg (original)
└── downloads/vacation.jpg

📈 Summary:
===========
Duplicate groups: 3
Duplicate files: 4 (excluding the 3 originals)
Wasted space: 54.7 MB (49.9% of total)
Potential savings: 54.7 MB

🗑️  Delete duplicates? [y/N]: 
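The wasted-space figures in the report follow from counting every copy beyond the first in each group: 2 × 1.2 MB + 1 × 50.0 MB + 1 × 2.3 MB = 54.7 MB. A minimal sketch of that arithmetic and of a confirmed delete step follows. It assumes each group is a list of paths with the file to keep in first position. The helper names and the binary-unit formatting are illustrative choices.

```python
import os


def format_size(num_bytes):
    """Format a byte count in binary units (1 MB = 1024 * 1024 bytes here)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024


def wasted_bytes(groups):
    """Sum the size of every copy beyond the first in each group."""
    return sum(os.path.getsize(group[0]) * (len(group) - 1) for group in groups)


def delete_duplicates(groups, ask=input):
    """Delete all but the first file of each group after a y/N confirmation."""
    answer = ask("Delete duplicates? [y/N]: ").strip().lower()
    if answer != "y":
        return []  # anything other than an explicit "y" keeps every file
    deleted = []
    for group in groups:
        for path in group[1:]:  # group[0] is treated as the original
            try:
                os.remove(path)
                deleted.append(path)
            except OSError as err:
                print(f"Could not delete {path}: {err}")
    return deleted
```

Passing the prompt function as the `ask` parameter keeps the default safe (pressing Enter deletes nothing) and lets the delete step be exercised in tests without a terminal.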

Python Solution
