🦞
file-deduplicator

Find and remove duplicate files intelligently.
SKILL.md
---
name: file-deduplicator
description: Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.
metadata:
  {
    "openclaw":
      {
        "version": "1.0.0",
        "author": "Vernox",
        "license": "MIT",
        "tags": ["deduplication", "storage", "cleanup", "file-management", "duplicate", "disk-space"],
        "category": "tools"
      }
  }
---

# File-Deduplicator - Find and Remove Duplicates

**Vernox Utility Skill - Clean up your digital hoard.**

## Overview

File-Deduplicator is an intelligent file duplicate finder and remover. Uses content hashing to identify identical files across directories, then provides options to remove duplicates safely.

## Features

### ✅ Duplicate Detection
- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)

### ✅ Removal Options
- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)

### ✅ Analysis Tools
- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation

### ✅ Safety Features
- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (don't remove huge files by mistake)
- Whitelist important directories
- Undo functionality (log for recovery)

## Installation

```bash
clawhub install file-deduplicator
```

## Quick Start

### Find Duplicates in Directory

```javascript
const result = await findDuplicates({
  directories: ['./documents', './downloads', './projects'],
  options: {
    method: 'content',  // content-based comparison
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);
```

### Remove Duplicates Automatically

```javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',  // keep newest, delete oldest
    action: 'delete',  // or 'move' to archive
    autoConfirm: false  // show confirmation for each
  }
});

console.log(`Removed ${result.filesRemoved} duplicates`);
console.log(`Space saved: ${result.spaceSaved}`);
```

### Dry-Run Preview

```javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',
    action: 'delete',
    dryRun: true  // Preview without actual deletion
  }
});

console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
  console.log(`${i+1}. ${dup.file}`);
});
```

## Tool Functions

### `findDuplicates`
Find duplicate files across directories.

**Parameters:**
- `directories` (array|string, required): Directory paths to scan
- `options` (object, optional):
  - `method` (string): 'content' | 'size' | 'name' - comparison method
  - `includeSubdirs` (boolean): Scan recursively (default: true)
  - `minSize` (number): Minimum size in bytes (default: 0)
  - `maxSize` (number): Maximum size in bytes (default: 0)
  - `excludePatterns` (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
  - `whitelist` (array): Directories to never scan (default: [])

**Returns:**
- `duplicates` (array): Array of duplicate groups
  - `duplicateCount` (number): Number of duplicate groups found
  - `totalFiles` (number): Total files scanned
  - `scanDuration` (number): Time taken to scan (ms)
  - `spaceWasted` (number): Total bytes wasted by duplicates
  - `spaceSaved` (number): Potential savings if duplicates removed

### `removeDuplicates`
Remove duplicate files based on findings.

**Parameters:**
- `directories` (array|string, required): Same as findDuplicates
- `options` (object, optional):
  - `keep` (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
  - `action` (string): 'delete' | 'move' | 'archive'
  - `archivePath` (string): Where to move files when action='move'
  - `dryRun` (boolean): Preview without actual action
  - `autoConfirm` (boolean): Auto-confirm deletions
  - `sizeThreshold` (number): Don't remove files larger than this

**Returns:**
- `filesRemoved` (number): Number of files removed/moved
- `spaceSaved` (number): Bytes saved
- `groupsProcessed` (number): Number of duplicate groups handled
- `logPath` (string): Path to action log
- `errors` (array): Any errors encountered

### `analyzeDirectory`
Analyze a single directory for duplicates.

**Parameters:**
- `directory` (string, required): Path to directory
- `options` (object, optional): Same as findDuplicates options

**Returns:**
- `fileCount` (number): Total files in directory
- `totalSize` (number): Total bytes in directory
- `duplicateSize` (number): Bytes in duplicate files
- `duplicateRatio` (number): Percentage of files that are duplicates

## Use Cases

### Digital Hoarder Cleanup
- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders

### Document Management
- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat

### Project Cleanup
- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD

### Backup Optimization
- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives

## Configuration

### Edit `config.json`:
```json
{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,  // exact match only
    "nameSimilarity": 0.7,  // 0-1, lower = more similar
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,  // 10MB threshold
    "autoConfirm": false,
    "dryRunDefault": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}
```

## Methods

### Content-Based (Recommended)
- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives

### Size-Based
- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- Finds near-duplicates (similar but not exact)

### Name-Based
- Compares filenames
- Detects similar named files
- Good for finding version duplicates (file_v1, file_v2)

## Examples

### Find Duplicates in Documents
```javascript
const result = await findDuplicates({
  directories: '~/Documents',
  options: {
    method: 'content',
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate sets`);
result.duplicates.slice(0, 5).forEach((set, i) => {
  console.log(`Set ${i+1}: ${set.files.length} files`);
  console.log(`  Total size: ${set.totalSize} bytes`);
});
```

### Remove Duplicates, Keep Newest
```javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    keep: 'newest',
    action: 'delete'
  }
});

console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);
```

### Move to Archive Instead of Delete
```javascript
const result = await removeDuplicates({
  directories: '~/Downloads',
  options: {
    keep: 'newest',
    action: 'move',
    archivePath: '~/Documents/Archive'
  }
});

console.log(`Archived ${result.filesRemoved} files`);
console.log(`Safe in: ~/Documents/Archive`);
```

### Dry-Run Preview Changes
```javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    dryRun: true  // Just show what would happen
  }
});

console.log('=== Dry Run Preview ===');
result.duplicates.forEach((set, i) => {
  console.log(`Would delete: ${set.toDelete.join(', ')}`);
});
```

## Performance

### Scanning Speed
- **Small directories** (<1000 files): <1s
- **Medium directories** (1000-10000 files): 1-5s
- **Large directories** (10000+ files): 5-20s

### Detection Accuracy
- **Content-based:** 100% (exact duplicates)
- **Size-based:** Fast but may miss renamed files
- **Name-based:** Detects naming patterns only

### Memory Usage
- **Hash cache:** ~1MB per 100,000 files
- **Batch processing:** Processes 1000 files at a time
- **Peak memory:** ~200MB for 1M files

## Safety Features

### Size Thresholding
Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.

### Archive Mode
Move files to archive directory instead of deleting. No data loss, full recoverability.

### Action Logging
All deletions/moves are logged to file for recovery and audit.

### Undo Functionality
Log file can be used to restore accidentally deleted files (limited undo window).

## Error Handling

### Permission Errors
- Clear error message
- Suggest running with sudo
- Skip files that can't be accessed

### File Lock Errors
- Detect locked files
- Skip and report
- Suggest closing applications using files

### Space Errors
- Check available disk space before deletion
- Warn if space is critically low
- Prevent disk-full scenarios

## Troubleshooting

### Not Finding Expected Duplicates
- Check detection method (content vs size vs name)
- Verify exclude patterns aren't too broad
- Check if files are in whitelisted directories
- Try with includeSubdirs: false

### Deletion Not Working
- Check write permissions on directories
- Verify action isn't 'delete' with autoConfirm: true
- Check size threshold isn't blocking all deletions
- Check file locks (is another program using files?)

### Slow Scanning
- Reduce includeSubdi

... (truncated)