
    error-guard

    System safety and control-plane skill that prevents agent deadlocks

    By @amar1432
    SKILL.md
    ---
    name: error-guard
    description: >
      System safety and control-plane skill that prevents agent deadlocks and freezes.
      Provides non-LLM control commands to inspect task state, flush message queues,
      cancel long-running work, and recover safely without restarting the container.
      Use when implementing or operating long-running tasks, sub-agents, benchmarks,
      background monitors (e.g., Moltbook, PNR checks), or when the system becomes
      unresponsive and needs immediate recovery controls.
    ---
    
    # error-guard
    
    ⚠️ **System-level skill (advanced users)**

    > **Warning:** This skill operates at the agent control-plane level.
    > Install it only if you understand OpenClaw's execution model and run workloads that can block, hang, or run for extended periods.

    This skill defines the **control-plane safety primitives** for OpenClaw.
    It is intentionally minimal and non-blocking, designed to prevent agent freezes, deadlocks, and unrecoverable states when running long-lived or high-risk workloads.

    ## Design Principles
    
    - **Main agent never blocks**: no long-running exec calls, no external I/O, no LLM calls.
    - **Event-driven**: workers emit events; the control plane listens.
    - **Fail-safe first**: recovery commands must always respond.
    - **Minimal state**: track only task metadata (never payloads).
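
The principles above can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not OpenClaw's actual API: the `ControlPlane` and `TaskRecord` classes, the event dictionary shape, and the use of a standard-library queue are all hypothetical. The event names are borrowed from the structured event protocol listed under Future Extensions.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """Metadata only: no prompts, payloads, or model outputs (hypothetical shape)."""
    task_id: str
    task_type: str
    state: str = "running"
    started_at: float = field(default_factory=time.monotonic)
    last_heartbeat: float = field(default_factory=time.monotonic)


class ControlPlane:
    """Hypothetical control plane: workers push events, the main agent drains them."""

    def __init__(self) -> None:
        self.events: queue.Queue = queue.Queue()
        self.registry: dict[str, TaskRecord] = {}

    def drain_events(self, max_events: int = 100) -> int:
        """Apply queued worker events without blocking; returns how many were handled."""
        handled = 0
        while handled < max_events:
            try:
                event = self.events.get_nowait()  # never waits on a worker
            except queue.Empty:
                break
            self._apply(event)
            handled += 1
        return handled

    def _apply(self, event: dict) -> None:
        task_id = event["taskId"]
        kind = event["event"]
        if kind == "task.started":
            self.registry[task_id] = TaskRecord(task_id, event.get("type", "unknown"))
        elif task_id in self.registry:
            record = self.registry[task_id]
            if kind == "task.heartbeat":
                record.last_heartbeat = time.monotonic()
            elif kind == "task.completed":
                record.state = "completed"


cp = ControlPlane()
cp.events.put({"event": "task.started", "taskId": "t1", "type": "benchmark"})
cp.events.put({"event": "task.heartbeat", "taskId": "t1"})
cp.events.put({"event": "task.completed", "taskId": "t1"})
handled = cp.drain_events()
```

In this sketch `drain_events` caps the work done per call and uses `get_nowait`, so the main agent returns promptly even when workers are silent or flooding the queue.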
    
    ## Command Surface (Phase 1)
    
    ### /status
    
    Report current system health and task registry state.
    
    Returns:
    - Active tasks (taskId, type, state)
    - Start time and last heartbeat
    - Flags for stalled or overdue tasks
    
    Constraints:
    - Must run in bounded time: read only the in-memory task registry and never wait on workers
    - Must not call any model or external API
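
One way `/status` could be computed is sketched below. The `TaskRecord` shape, the `status` function, and the two thresholds (`STALL_AFTER_S`, `OVERDUE_AFTER_S`) are hypothetical; the skill does not specify how "stalled" or "overdue" are defined.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    task_type: str
    state: str
    started_at: float
    last_heartbeat: float


STALL_AFTER_S = 30.0     # assumption: heartbeat silence beyond this means "stalled"
OVERDUE_AFTER_S = 600.0  # assumption: total runtime beyond this means "overdue"


def status(registry: dict[str, TaskRecord], now: float) -> list[dict]:
    """Build a /status report from in-memory metadata only (no I/O, no model calls)."""
    report = []
    for rec in registry.values():
        active = rec.state == "running"
        report.append({
            "taskId": rec.task_id,
            "type": rec.task_type,
            "state": rec.state,
            "startedAt": rec.started_at,
            "lastHeartbeat": rec.last_heartbeat,
            "stalled": active and now - rec.last_heartbeat > STALL_AFTER_S,
            "overdue": active and now - rec.started_at > OVERDUE_AFTER_S,
        })
    return report


registry = {
    "t1": TaskRecord("t1", "benchmark", "running", started_at=0.0, last_heartbeat=95.0),
    "t2": TaskRecord("t2", "monitor", "running", started_at=0.0, last_heartbeat=10.0),
}
report = status(registry, now=100.0)
```

With `now=100.0`, task `t2` has been silent for 90 seconds and is flagged as stalled, while `t1` is not. The work per task is a handful of comparisons, and the caller passes the clock value in, which keeps the function easy to test.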
    
    ### /flush
    
    Emergency stop.
    
    Immediately:
    - Cancel all active tasks
    - Kill active exec/process sessions
    - Clear pending message queue
    - Reset in-memory task registry
    
    Constraints:
    - Must always respond
    - No waiting on workers
    - No model calls
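
The `/flush` steps might look like the following sketch. It assumes cooperative cancellation through `threading.Event` flags and child processes tracked as `subprocess.Popen` objects; the `ControlPlane` class and its attribute names are hypothetical and do not describe OpenClaw internals.

```python
import queue
import subprocess
import sys
import threading


class ControlPlane:
    """Hypothetical holder for task metadata, cancel flags, processes, and messages."""

    def __init__(self) -> None:
        self.registry: dict[str, dict] = {}
        self.cancel_flags: dict[str, threading.Event] = {}
        self.processes: list[subprocess.Popen] = []
        self.pending: queue.Queue = queue.Queue()

    def flush(self) -> dict:
        """Emergency stop: signal everything, then return without waiting."""
        cancelled = 0
        for flag in self.cancel_flags.values():
            flag.set()               # cooperative cancel; workers poll this flag
            cancelled += 1

        killed = 0
        for proc in self.processes:
            if proc.poll() is None:  # still running
                proc.kill()          # sends the kill signal; does not wait for exit
                killed += 1

        dropped = 0
        while True:
            try:
                self.pending.get_nowait()
                dropped += 1
            except queue.Empty:
                break

        self.registry.clear()
        self.cancel_flags.clear()
        self.processes.clear()
        return {"cancelled": cancelled, "killed": killed, "dropped": dropped}


cp = ControlPlane()
flag = threading.Event()
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
cp.registry["t1"] = {"state": "running"}
cp.cancel_flags["t1"] = flag
cp.processes.append(proc)
cp.pending.put("message 1")
cp.pending.put("message 2")
result = cp.flush()
```

Nothing in `flush` joins a thread or waits on a process, which is what lets it respond even when workers are wedged. On POSIX systems a killed child stays a zombie until it is waited on, so a fuller implementation would likely reap processes later, outside this path.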
    
    ### /recover
    
    Safe recovery sequence.
    
    Steps:
    1. Execute `/flush`
    2. Reset control-plane state
    3. Optionally reload skills/state (no container restart)
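
The recovery sequence above can be sketched as follows. The `ControlPlane` class, the `mode` field, and the `reload_skills` callback are hypothetical stand-ins, and `flush` here is a simplified placeholder for the real `/flush` command.

```python
from typing import Callable, Optional


class ControlPlane:
    def __init__(self) -> None:
        self.registry: dict[str, dict] = {"t1": {"state": "running"}}
        self.pending: list[str] = ["queued message"]
        self.mode = "degraded"          # hypothetical control-plane state

    def flush(self) -> None:
        """Placeholder for /flush: drop tasks and queued messages."""
        self.registry.clear()
        self.pending.clear()

    def recover(self, reload_skills: Optional[Callable[[], None]] = None) -> list[str]:
        """Run the recovery steps in order and report which ones ran."""
        steps = []
        self.flush()                    # 1. execute /flush
        steps.append("flush")
        self.mode = "ready"             # 2. reset control-plane state
        steps.append("reset")
        if reload_skills is not None:   # 3. optional in-process reload
            try:
                reload_skills()
                steps.append("reload")
            except Exception:
                # A failed reload must not stop /recover from responding.
                steps.append("reload_failed")
        return steps


def failing_reload() -> None:
    raise RuntimeError("skill reload failed")


cp = ControlPlane()
loaded: list[str] = []
steps = cp.recover(reload_skills=lambda: loaded.append("skills"))
steps_without_reload = ControlPlane().recover()
steps_with_failure = ControlPlane().recover(reload_skills=failing_reload)
```

Catching errors from the optional reload is a design choice in this sketch, in keeping with the fail-safe principle: steps 1 and 2 have already run, so the command still reports a result.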
    
    ## Future Extensions (Not Implemented Yet)
    
    - Sub-agent runner helper (event-driven)
    - Task watchdogs with TTL and silence detection
    - Structured event protocol (task.started, task.heartbeat, task.completed, ...)
    - Back-pressure and task classes (interactive / batch / background)
    
    ## Security & Privacy
    
    - This skill **does not** store payloads, prompts, messages, or model outputs
    - Only minimal task metadata is persisted (taskId, timestamps, state)
    - No API keys, credentials, or user data are read or written
    - Safe to publish and share publicly
    
    ## Non-Goals
    
    - No business logic
    - No background polling loops
    - No user-facing features
    - No LLM reasoning paths
    
    This skill is the **last line of defense**. Keep it small, fast, and reliable.