Back to Skills
    🦞

    anycrawl

    AnyCrawl API integration for OpenClaw - Scrape, Crawl, and Search web

    By @techlaai
    View on GitHub
    SKILL.md
    # AnyCrawl Skill
    
    AnyCrawl API integration for OpenClaw - Scrape, Crawl, and Search web content with high-performance multi-threaded crawling.
    
    ## Setup
    
    ### Method 1: Environment variable (Recommended)
    
    ```bash
    export ANYCRAWL_API_KEY="your-api-key"
    ```
    
    Make it permanent by adding to `~/.bashrc` or `~/.zshrc`:
    ```bash
    echo 'export ANYCRAWL_API_KEY="your-api-key"' >> ~/.bashrc
    source ~/.bashrc
    ```
    
    Get your API key at: https://anycrawl.dev
    
    ### Method 2: OpenClaw gateway config
    
    ```bash
    openclaw config.patch --set ANYCRAWL_API_KEY="your-api-key"
    ```
    
    ## Functions
    
    ### 1. anycrawl_scrape
    
    Scrape a single URL and convert to LLM-ready structured data.
    
    **Parameters:**
    - `url` (string, required): URL to scrape
    - `engine` (string, optional): Scraping engine - `"cheerio"` (default), `"playwright"`, `"puppeteer"`
    - `formats` (array, optional): Output formats - `["markdown"]`, `["html"]`, `["text"]`, `["json"]`, `["screenshot"]`
    - `timeout` (number, optional): Timeout in milliseconds (default: 30000)
    - `wait_for` (number, optional): Delay before extraction in ms (browser engines only)
    - `wait_for_selector` (string/object/array, optional): Wait for CSS selectors
    - `include_tags` (array, optional): Include only these HTML tags (e.g., `["h1", "p", "article"]`)
    - `exclude_tags` (array, optional): Exclude these HTML tags
    - `proxy` (string, optional): Proxy URL (e.g., `"http://proxy:port"`)
    - `json_options` (object, optional): JSON extraction with schema/prompt
    - `extract_source` (string, optional): `"markdown"` (default) or `"html"`
    
    **Examples:**
    
    ```javascript
    // Basic scrape with default cheerio
    anycrawl_scrape({ url: "https://example.com" })
    
    // Scrape SPA with Playwright
    anycrawl_scrape({ 
      url: "https://spa-example.com",
      engine: "playwright",
      formats: ["markdown", "screenshot"]
    })
    
    // Extract structured JSON
    anycrawl_scrape({
      url: "https://product-page.com",
      engine: "cheerio",
      json_options: {
        schema: {
          type: "object",
          properties: {
            product_name: { type: "string" },
            price: { type: "number" },
            description: { type: "string" }
          },
          required: ["product_name", "price"]
        },
        user_prompt: "Extract product details from this page"
      }
    })
    ```
    
    ### 2. anycrawl_search
    
    Search Google and return structured results.
    
    **Parameters:**
    - `query` (string, required): Search query
    - `engine` (string, optional): Search engine - `"google"` (default)
    - `limit` (number, optional): Max results per page (default: 10)
    - `offset` (number, optional): Number of results to skip (default: 0)
    - `pages` (number, optional): Number of pages to retrieve (default: 1, max: 20)
    - `lang` (string, optional): Language locale (e.g., `"en"`, `"zh"`, `"vi"`)
    - `safe_search` (number, optional): 0 (off), 1 (medium), 2 (high)
    - `scrape_options` (object, optional): Scrape each result URL with these options
    
    **Examples:**
    
    ```javascript
    // Basic search
    anycrawl_search({ query: "OpenAI ChatGPT" })
    
    // Multi-page search in Vietnamese
    anycrawl_search({ 
      query: "hướng dẫn Node.js",
      pages: 3,
      lang: "vi"
    })
    
    // Search and auto-scrape results
    anycrawl_search({
      query: "best AI tools 2026",
      limit: 5,
      scrape_options: {
        engine: "cheerio",
        formats: ["markdown"]
      }
    })
    ```
    
    ### 3. anycrawl_crawl_start
    
    Start crawling an entire website (async job).
    
    **Parameters:**
    - `url` (string, required): Seed URL to start crawling
    - `engine` (string, optional): `"cheerio"` (default), `"playwright"`, `"puppeteer"`
    - `strategy` (string, optional): `"all"`, `"same-domain"` (default), `"same-hostname"`, `"same-origin"`
    - `max_depth` (number, optional): Max depth from seed URL (default: 10)
    - `limit` (number, optional): Max pages to crawl (default: 100)
    - `include_paths` (array, optional): Path patterns to include (e.g., `["/blog/*"]`)
    - `exclude_paths` (array, optional): Path patterns to exclude (e.g., `["/admin/*"]`)
    - `scrape_paths` (array, optional): Only scrape URLs matching these patterns
    - `scrape_options` (object, optional): Per-page scrape options
    
    **Examples:**
    
    ```javascript
    // Crawl entire website
    anycrawl_crawl_start({ 
      url: "https://docs.example.com",
      engine: "cheerio",
      max_depth: 5,
      limit: 50
    })
    
    // Crawl only blog posts
    anycrawl_crawl_start({
      url: "https://example.com",
      strategy: "same-domain",
      include_paths: ["/blog/*"],
      exclude_paths: ["/blog/tags/*"],
      scrape_options: {
        formats: ["markdown"]
      }
    })
    
    // Crawl product pages only
    anycrawl_crawl_start({
      url: "https://shop.example.com",
      strategy: "same-domain",
      scrape_paths: ["/products/*"],
      limit: 200
    })
    ```
    
    ### 4. anycrawl_crawl_status
    
    Check crawl job status.
    
    **Parameters:**
    - `job_id` (string, required): Crawl job ID
    
    **Example:**
    ```javascript
    anycrawl_crawl_status({ job_id: "7a2e165d-8f81-4be6-9ef7-23222330a396" })
    ```
    
    ### 5. anycrawl_crawl_results
    
    Get crawl results (paginated).
    
    **Parameters:**
    - `job_id` (string, required): Crawl job ID
    - `skip` (number, optional): Number of results to skip (default: 0)
    
    **Example:**
    ```javascript
    // Get first 100 results
    anycrawl_crawl_results({ job_id: "xxx", skip: 0 })
    
    // Get next 100 results
    anycrawl_crawl_results({ job_id: "xxx", skip: 100 })
    ```
    
    ### 6. anycrawl_crawl_cancel
    
    Cancel a running crawl job.
    
    **Parameters:**
    - `job_id` (string, required): Crawl job ID
    
    ### 7. anycrawl_search_and_scrape
    
    Quick helper: Search Google then scrape top results.
    
    **Parameters:**
    - `query` (string, required): Search query
    - `max_results` (number, optional): Max results to scrape (default: 3)
    - `scrape_engine` (string, optional): Engine for scraping (default: `"cheerio"`)
    - `formats` (array, optional): Output formats (default: `["markdown"]`)
    - `lang` (string, optional): Search language
    
    **Example:**
    ```javascript
    anycrawl_search_and_scrape({
      query: "latest AI news",
      max_results: 5,
      formats: ["markdown"]
    })
    ```
    
    ## Engine Selection Guide
    
    | Engine | Best For | Speed | JS Rendering |
    |--------|----------|-------|--------------|
    | `cheerio` | Static HTML, news, blogs | ⚡ Fastest | ❌ No |
    | `playwright` | SPAs, complex web apps | 🐢 Slower | ✅ Yes |
    | `puppeteer` | Chrome-specific, metrics | 🐢 Slower | ✅ Yes |
    
    ## Response Format
    
    All responses follow this structure:
    
    ```json
    {
      "success": true,
      "data": { ... },
      "message": "Optional message"
    }
    ```
    
    Error response:
    ```json
    {
      "success": false,
      "error": "Error type",
      "message": "Human-readable message"
    }
    ```
    
    ## Common Error Codes
    
    - `400` - Bad Request (validation errors)
    - `401` - Unauthorized (invalid API key)
    - `402` - Payment Required (insufficient credits)
    - `404` - Not Found
    - `429` - Rate limit exceeded
    - `500` - Internal server error
    
    ## API Limits
    
    - Rate limits apply based on your plan
    - Crawl jobs expire after 24 hours
    - Max crawl limit: depends on credits
    
    ## Links
    
    - API Docs: https://docs.anycrawl.dev
    - Website: https://anycrawl.dev
    - Playground: https://anycrawl.dev/playground