🦞
skill-reviewer

Review and audit agent skills (SKILL.md files)
SKILL.md
---
name: skill-reviewer
description: Review and audit agent skills (SKILL.md files) for quality, correctness, and effectiveness. Use when evaluating a skill before publishing, reviewing someone else's skill, scoring skill quality, identifying defects in skill content, or improving an existing skill.
metadata: {"clawdbot":{"emoji":"🔍","requires":{"anyBins":["npx"]},"os":["linux","darwin","win32"]}}
---

# Skill Reviewer

Audit agent skills (SKILL.md files) for quality, correctness, and completeness. Provides a structured review framework with scoring rubric, defect checklists, and improvement recommendations.

## When to Use

- Reviewing a skill before publishing to the registry
- Evaluating a skill you downloaded from the registry
- Auditing your own skills for quality improvements
- Comparing skills in the same category
- Deciding whether a skill is worth installing

## Review Process

### Step 1: Structural Check

Verify the skill has the required structure. Read the file and check each item:

```
STRUCTURAL CHECKLIST:
[ ] Valid YAML frontmatter (opens and closes with ---)
[ ] `name` field present and is a valid slug (lowercase, hyphenated)
[ ] `description` field present and non-empty
[ ] `metadata` field present with valid JSON
[ ] `metadata.clawdbot.emoji` is a single emoji
[ ] `metadata.clawdbot.requires.anyBins` lists real CLI tools
[ ] Title heading (# Title) immediately after frontmatter
[ ] Summary paragraph after title
[ ] "When to Use" section present
[ ] At least 3 main content sections
[ ] "Tips" section present at the end
```

### Step 2: Frontmatter Quality

#### Description field audit

The description is the most impactful field. Evaluate it against these criteria:

```
DESCRIPTION SCORING:

[2] Starts with what the skill does (active verb)
    GOOD: "Write Makefiles for any project type."
    BAD:  "This skill covers Makefiles."
    BAD:  "A comprehensive guide to Make."

[2] Includes trigger phrases ("Use when...")
    GOOD: "Use when setting up build automation, defining multi-target builds"
    BAD:  No trigger phrases at all

[2] Specific scope (mentions concrete tools, languages, or operations)
    GOOD: "SQLite/PostgreSQL/MySQL — schema design, queries, CTEs, window functions"
    BAD:  "Database stuff"

[1] Reasonable length (50-200 characters)
    TOO SHORT: "Make things" (no search surface)
    TOO LONG:  300+ characters (gets truncated)

[1] Contains searchable keywords naturally
    GOOD: "cron jobs, systemd timers, scheduling"
    BAD:  Keywords stuffed unnaturally

Score: __/8
```

#### Metadata audit

```
METADATA SCORING:

[1] emoji is relevant to the skill topic
[1] requires.anyBins lists tools the skill actually uses (not generic tools like "bash")
[1] os array is accurate (don't claim win32 if commands are Linux-only)
[1] JSON is valid (test with a JSON parser)

Score: __/4
```

### Step 3: Content Quality

#### Example density

Count code blocks and total lines:

```
EXAMPLE DENSITY:

Lines:       ___
Code blocks: ___
Ratio:       1 code block per ___ lines

TARGET: 1 code block per 8-15 lines
< 8  lines per block: possibly over-fragmented
> 20 lines per block: needs more examples
```

#### Example quality

For each code block, check:

```
EXAMPLE QUALITY CHECKLIST:

[ ] Language tag specified (```bash, ```python, etc.)
[ ] Command is syntactically correct
[ ] Output shown in comments where helpful
[ ] Uses realistic values (not foo/bar/baz)
[ ] No placeholder values left (TODO, FIXME, xxx)
[ ] Self-contained (doesn't depend on undefined variables)
    OR setup is shown/referenced
[ ] Covers the common case (not just edge cases)
```

Score each example 0-3:
- **0**: Broken or misleading
- **1**: Works but minimal (no output, no context)
- **2**: Good (correct, has output or explanation)
- **3**: Excellent (copy-pasteable, realistic, covers edge case)

#### Section organization

```
ORGANIZATION SCORING:

[2] Organized by task/scenario (not by abstract concept)
    GOOD: "## Encode and Decode" → "## Inspect Characters" → "## Convert Formats"
    BAD:  "## Theory" → "## Types" → "## Advanced"

[2] Most common operations come first
    GOOD: Basic usage → Variations → Advanced → Edge cases
    BAD:  Configuration → Theory → Finally the basic usage

[1] Sections are self-contained (can be used independently)

[1] Consistent depth (not mixing h2 with h4 randomly)

Score: __/6
```

#### Cross-platform accuracy

```
PLATFORM CHECKLIST:

[ ] macOS differences noted where relevant
    (sed -i '' vs sed -i, brew vs apt, BSD vs GNU flags)
[ ] Linux distro variations noted (apt vs yum vs pacman)
[ ] Windows compatibility addressed if os includes "win32"
[ ] Tool version assumptions stated (Docker v2 syntax, Python 3.x)
```

### Step 4: Actionability Assessment

The core question: can an agent follow these instructions to produce correct results?

```
ACTIONABILITY SCORING:

[3] Instructions are imperative ("Run X", "Create Y")
    NOT: "You might consider..." or "It's recommended to..."

[3] Steps are ordered logically (prerequisites before actions)

[2] Error cases addressed (what to do when something fails)

[2] Output/result described (how to verify it worked)

Score: __/10
```

### Step 5: Tips Section Quality

```
TIPS SCORING:

[2] 5-10 tips present

[2] Tips are non-obvious (not "read the documentation")
    GOOD: "The number one Makefile bug: spaces instead of tabs"
    BAD:  "Make sure to test your code"

[2] Tips are specific and actionable
    GOOD: "Use flock to prevent overlapping cron runs"
    BAD:  "Be careful with concurrent execution"

[1] No tips contradict the main content

[1] Tips cover gotchas/footguns specific to this topic

Score: __/8
```

## Scoring Summary

```
SKILL REVIEW SCORECARD
═══════════════════════════════════════
Skill: [name]
Reviewer: [agent/human]
Date: [date]

Category              Score    Max
─────────────────────────────────────
Structure             __       11
Description           __        8
Metadata              __        4
Example density       __        3*
Example quality       __        3*
Organization          __        6
Actionability         __       10
Tips                  __        8
─────────────────────────────────────
TOTAL                 __       53+

* Example density and quality are per-sample,
  not summed. Use the average across all examples.

RATING:
  45+  Excellent — publish-ready
  35-44 Good — minor improvements needed
  25-34 Fair — significant gaps to address
  < 25  Poor — needs major rework

VERDICT: [PUBLISH / REVISE / REWORK]
```

## Common Defects

### Critical (blocks publishing)

```
DEFECT: Invalid frontmatter
DETECT: YAML parse error, missing required fields
FIX:    Validate YAML, ensure name/description/metadata all present

DEFECT: Broken code examples
DETECT: Syntax errors, undefined variables, wrong flags
FIX:    Test every command in a clean environment

DEFECT: Wrong tool requirements
DETECT: metadata.requires lists tools not used in content, or omits tools that are used
FIX:    Grep content for command names, update requires to match

DEFECT: Misleading description
DETECT: Description promises coverage the content doesn't deliver
FIX:    Align description with actual content, or add missing content
```

### Major (should fix before publishing)

```
DEFECT: No "When to Use" section
IMPACT: Agent doesn't know when to activate the skill
FIX:    Add 4-8 bullet points describing trigger scenarios

DEFECT: Text walls without examples
DETECT: Any section > 10 lines with no code block
FIX:    Add concrete examples for every concept described

DEFECT: Examples missing language tags
DETECT: ``` without language identifier
FIX:    Add bash, python, javascript, yaml, etc. to every code fence

DEFECT: No Tips section
IMPACT: Missing the distilled expertise that makes a skill valuable
FIX:    Add 5-10 non-obvious, actionable tips

DEFECT: Abstract organization
DETECT: Sections named "Theory", "Overview", "Background", "Introduction"
FIX:    Reorganize by task/operation: what the user is trying to DO
```

### Minor (nice to fix)

```
DEFECT: Placeholder values
DETECT: foo, bar, baz, example.com, 1.2.3.4, TODO, FIXME
FIX:    Replace with realistic values (myapp, api.example.com, 192.168.1.100)

DEFECT: Inconsistent formatting
DETECT: Mixed heading levels, inconsistent code block style
FIX:    Standardize heading hierarchy and formatting

DEFECT: Missing cross-references
DETECT: Mentions tools/concepts covered by other skills without referencing them
FIX:    Add "See the X skill for more on Y" notes

DEFECT: Outdated commands
DETECT: docker-compose (v1), python (not python3), npm -g without npx alternative
FIX:    Update to current tool versions and syntax
```

## Comparative Review

When comparing skills in the same category:

```
COMPARATIVE CRITERIA:

1. Coverage breadth
   Which skill covers more use cases?

2. Example quality
   Which has more runnable, realistic examples?

3. Depth on common operations
   Which handles the 80% case better?

4. Edge case coverage
   Which addresses more gotchas and failure modes?

5. Cross-platform support
   Which works across more environments?

6. Freshness
   Which uses current tool versions and syntax?

WINNER: [skill A / skill B / tie]
REASON: [1-2 sentence justification]
```

## Quick Review Template

For a fast review when you don't need full scoring:

```markdown
## Quick Review: [skill-name]

**Structure**: [OK / Issues: ...]
**Description**: [Strong / Weak: reason]
**Examples**: [X code blocks across Y lines — density OK/low/high]
**Actionability**: [Agent can/cannot follow these instructions because...]
**Top defect**: [The single most impactful thing to fix]
**Verdict**: [PUBLISH / REVISE / REWORK]
```

## Review Workflow

### Reviewing your own skill before publishing

```bash
# 1. Validate frontmatter
head -20 skills/my-skill/SKILL.md
# Visually confirm YAML is valid

# 2. Count code blocks
grep -c '```' skills/my-skill/SKILL.md
# Divide total lines by thi

... (truncated)