    tabstack-extractor

    Extract structured data from websites using Tabstack

    By @noblepayne
    SKILL.md
    ---
    name: tabstack-extractor
    description: Extract structured data from websites using Tabstack API. Use when you need to scrape job listings, news articles, product pages, or any structured web content. Provides JSON schema-based extraction and clean markdown conversion. Requires TABSTACK_API_KEY environment variable.
    ---
    
    # Tabstack Extractor
    
    ## Overview
    
    This skill enables structured data extraction from websites using the Tabstack API. It's ideal for web scraping tasks where you need consistent, schema-based data extraction from job boards, news sites, product pages, or any structured content.
    
    ## Quick Start
    
    ### 1. Install Babashka (if needed)
    ```bash
    # Option A: From GitHub (recommended for sharing)
    curl -s https://raw.githubusercontent.com/babashka/babashka/master/install | bash
    
    # Option B: From Nix
    nix-shell -p babashka
    
    # Option C: From Homebrew
    brew install borkdude/brew/babashka
    ```
    
    ### 2. Set up API Key
    
    **Option A: Environment variable (recommended)**
    ```bash
    export TABSTACK_API_KEY="your_api_key_here"
    ```
    
    **Option B: Configuration file**
    ```bash
    mkdir -p ~/.config/tabstack
    echo '{:api-key "your_api_key_here"}' > ~/.config/tabstack/config.edn
    ```
    
    **Get an API key:** Sign up at [Tabstack Console](https://console.tabstack.ai/signup)
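
The two options above can coexist. Below is a minimal sketch of one possible lookup order, assuming the environment variable takes precedence over the config file (the skill's actual precedence is not stated here). The helper name `resolve_api_key` is hypothetical, and the regex only handles the single-key EDN shape shown in Option B.

```python
import os
import re
from pathlib import Path


def resolve_api_key(env=None, config_path=None):
    """Return the API key from the environment, else from config.edn, else None."""
    env = os.environ if env is None else env
    key = env.get("TABSTACK_API_KEY")
    if key:
        return key  # assumed precedence: environment variable wins
    if config_path is None:
        config_path = Path.home() / ".config" / "tabstack" / "config.edn"
    path = Path(config_path)
    if path.is_file():
        # Matches the {:api-key "..."} form written in Option B.
        match = re.search(r':api-key\s+"([^"]+)"', path.read_text(encoding="utf-8"))
        if match:
            return match.group(1)
    return None
```

Calling `resolve_api_key()` with no arguments reads the real environment and the default config path; a `None` result means neither option has been set up.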
    
    ### 3. Test Connection
    ```bash
    bb scripts/tabstack.clj test
    ```
    
    ### 4. Extract Markdown (Simple)
    ```bash
    bb scripts/tabstack.clj markdown "https://example.com"
    ```
    
    ### 5. Extract JSON (Start Simple)
    ```bash
    # Start with simple schema (fast, reliable)
    bb scripts/tabstack.clj json "https://example.com" references/simple_article.json
    
    # Try more complex schemas (may be slower)
    bb scripts/tabstack.clj json "https://news.site" references/news_schema.json
    ```
    
    ### 6. Advanced Features
    ```bash
    # Extract with retry logic (3 retries, 1s delay)
    bb scripts/tabstack.clj json-retry "https://example.com" references/simple_article.json
    
    # Extract with caching (24-hour cache)
    bb scripts/tabstack.clj json-cache "https://example.com" references/simple_article.json
    
    # Batch extract from URLs file
    echo "https://example.com" > urls.txt
    echo "https://example.org" >> urls.txt
    bb scripts/tabstack.clj batch urls.txt references/simple_article.json
    ```
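
To illustrate the idea behind the 24-hour cache, here is a rough sketch of a time-limited file cache. It is not necessarily how `json-cache` is implemented: the function name `cached_extract`, the cache key (a hash of URL plus schema path), and the on-disk layout are all assumptions, and `fetch` stands in for whatever actually performs the extraction.

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour window mentioned above


def cached_extract(url, schema_path, fetch, cache_dir, now=time.time):
    """Return cached data if it is fresh, otherwise call fetch and store the result."""
    key = hashlib.sha256(f"{url}\n{schema_path}".encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.is_file():
        entry = json.loads(cache_file.read_text(encoding="utf-8"))
        if now() - entry["stored_at"] < CACHE_TTL_SECONDS:
            return entry["data"]  # fresh hit: no extraction call needed
    data = fetch(url, schema_path)  # hypothetical callable supplied by the caller
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(
        json.dumps({"stored_at": now(), "data": data}), encoding="utf-8"
    )
    return data
```

Keying on both the URL and the schema path means the same page extracted with a different schema is treated as a separate cache entry.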
    
    ## Core Capabilities
    
    ### 1. Markdown Extraction
    Extract clean, readable markdown from any webpage. Useful for content analysis, summarization, or archiving.
    
    **When to use:** When you need the textual content of a page without the HTML clutter.
    
    **Example use cases:**
    - Extract article content for summarization
    - Archive webpage content
    - Analyze blog post content
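
For the archiving use case, a small sketch of saving extracted markdown to disk. It assumes the `markdown` command prints the page content to stdout, which is not confirmed here; `archive_name`, `run_markdown`, and `archive_page` are hypothetical helper names.

```python
import re
import subprocess
from pathlib import Path
from urllib.parse import urlparse


def archive_name(url):
    """Derive a filesystem-safe .md filename from a URL's host and path."""
    parts = urlparse(url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parts.netloc + parts.path).strip("-")
    return f"{slug or 'page'}.md"


def run_markdown(url):
    """Call the skill's CLI; assumes markdown is written to stdout."""
    result = subprocess.run(
        ["bb", "scripts/tabstack.clj", "markdown", url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def archive_page(url, out_dir, extract=run_markdown):
    """Extract a page and write it under out_dir, returning the file path."""
    out_path = Path(out_dir) / archive_name(url)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(extract(url), encoding="utf-8")
    return out_path
```

For example, `archive_page("https://example.com/blog/post-1", "archive")` would write `archive/example-com-blog-post-1.md`.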
    
    ### 2. JSON Schema Extraction
    Extract structured data using JSON schemas. Define exactly what data you want and get it in a consistent format.
    
    **When to use:** When scraping job listings, product pages, news articles, or any structured data.
    
    **Example use cases:**
    - Scrape job listings from BuiltIn/LinkedIn
    - Extract product details from e-commerce sites
    - Gather news articles with consistent metadata
    
    ### 3. Schema Templates
    Pre-built schemas for common scraping tasks. See `references/` directory for templates.
    
    **Available schemas:**
    - Job listing schema (see `references/job_schema.json`)
    - News article schema (see `references/news_schema.json`)
    - Product page schema
    - Contact information schema
    
    ## Workflow: Job Scraping Example
    
    Follow this workflow to scrape job listings:
    
    1. **Identify target sites** - BuiltIn, LinkedIn, company career pages
    2. **Choose or create schema** - Use `references/job_schema.json` or customize
    3. **Test extraction** - Run a single page to verify schema works
    4. **Scale up** - Process multiple URLs
    5. **Store results** - Save to database or file
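
Steps 3 to 5 of the workflow above could be sketched as follows. This assumes the `json` command prints the extracted record as JSON on stdout, which is not confirmed here; the helper names and the JSON Lines output format are illustrative choices, not part of the skill.

```python
import json
import subprocess


def build_command(url, schema_path):
    """Command line for a single JSON extraction, as used in Quick Start."""
    return ["bb", "scripts/tabstack.clj", "json", url, schema_path]


def run_cli(command):
    """Run the CLI and return its stdout (assumed to be JSON)."""
    return subprocess.run(command, capture_output=True, text=True, check=True).stdout


def scrape_jobs(urls, schema_path, out_path, run=run_cli):
    """Extract each URL and append one JSON line per success; return the count saved."""
    saved = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            try:
                record = json.loads(run(build_command(url, schema_path)))
            except (subprocess.CalledProcessError, json.JSONDecodeError) as err:
                print(f"skipped {url}: {err}")  # keep going on individual failures
                continue
            out.write(json.dumps({"url": url, "data": record}) + "\n")
            saved += 1
    return saved
```

Start by passing a single URL to confirm the schema works, then pass the full list; failures are reported and skipped rather than aborting the run.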
    
    **Example job schema:**
    ```json
    {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "description": {"type": "string"},
        "salary": {"type": "string"},
        "apply_url": {"type": "string"},
        "posted_date": {"type": "string"},
        "requirements": {"type": "array", "items": {"type": "string"}}
      }
    }
    ```
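
Once a record comes back, it can help to check it against the schema before storing it. The sketch below is a deliberately small checker, not a full JSON Schema validator: it only understands the `string`, `array`, `object`, and `boolean` types, and it reports absent fields even though the schema above does not mark any as required. The function name `find_problems` and the sample record are made up for illustration.

```python
# Maps the JSON Schema type names handled by this sketch to Python types.
PYTHON_TYPES = {"string": str, "array": list, "object": dict, "boolean": bool}


def find_problems(record, schema):
    """Return a list of human-readable mismatches between record and schema."""
    problems = []
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        expected = PYTHON_TYPES.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}")
        elif spec.get("type") == "array" and "items" in spec:
            item_type = PYTHON_TYPES.get(spec["items"].get("type"))
            if item_type and not all(isinstance(item, item_type) for item in value):
                problems.append(f"{name}: expected array of {spec['items']['type']}")
    return problems


# A trimmed version of the job schema and an invented record to check.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "requirements": {"type": "array", "items": {"type": "string"}},
    },
}
sample = {"title": "Engineer", "requirements": ["Clojure", "SQL"]}
problems = find_problems(sample, schema)
```

An empty list means the record matches the parts of the schema this sketch understands.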
    
    ## Integration with Other Skills
    
    ### Combine with Web Search
    1. Use `web_search` to find relevant URLs
    2. Use Tabstack to extract structured data from those URLs
    3. Store results in Datalevin (future skill)
    
    ### Combine with Browser Automation
    1. Use `browser` tool to navigate complex sites
    2. Extract page URLs
    3. Use Tabstack for structured extraction
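
URLs gathered from search or browser steps often contain duplicates, fragments, or non-web links. As a sketch of the hand-off, the hypothetical helpers below clean a candidate list and write it in the one-URL-per-line format that the `batch` command reads in Quick Start.

```python
from urllib.parse import urldefrag, urlparse


def prepare_urls(candidates):
    """Drop fragments, non-http(s) links, and duplicates while keeping order."""
    seen = set()
    cleaned = []
    for raw in candidates:
        url, _fragment = urldefrag(raw.strip())
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


def write_urls_file(urls, path="urls.txt"):
    """Write one URL per line, matching the urls.txt example in Quick Start."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

The resulting file can then be passed to `bb scripts/tabstack.clj batch urls.txt references/simple_article.json`.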
    
    ## Error Handling
    
    Common issues and solutions:
    
    1. **Authentication failed** - Check that `TABSTACK_API_KEY` is set, or that `~/.config/tabstack/config.edn` contains a valid key
    2. **Invalid URL** - Ensure URL is accessible and correct
    3. **Schema mismatch** - Adjust schema to match page structure
    4. **Rate limiting** - Add delays between requests
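
For transient failures such as rate limiting, a fixed-delay retry is often enough. The sketch below mirrors the "3 retries, 1s delay" settings mentioned for `json-retry`, interpreted here as one initial attempt plus up to three retries; that interpretation, and the helper name `with_retry`, are assumptions rather than a description of the script's internals.

```python
import time


def with_retry(operation, retries=3, delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait `delay` seconds and retry up to `retries` times."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            attempt += 1
            sleep(delay)
```

This retries on any exception for brevity. In practice, authentication and schema errors will not fix themselves, so it is worth retrying only on errors that are plausibly temporary.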
    
    ## Resources
    
    ### scripts/
    - `tabstack.clj` - **Main API wrapper in Babashka** (recommended; includes retry logic, caching, and batch processing)
    - `tabstack_curl.sh` - Bash/curl fallback (simple, no dependencies)
    - `tabstack_api.py` - Python API wrapper (requires the `requests` module)
    
    ### references/
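    - `job_schema.json` - Template schema for job listings
    - `simple_article.json` - Minimal article schema (4 basic fields)
    - `news_schema.json` - News article schema
    - `schema_guide.md` - Schema Creation Guide
    - `api_reference.md` - Tabstack API documentation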
    
    ## Best Practices
    
    1. **Start small** - Test with single pages before scaling
    2. **Respect robots.txt** - Check site scraping policies
    3. **Add delays** - Avoid overwhelming target sites
    4. **Validate schemas** - Test schemas on sample pages
    5. **Handle errors gracefully** - Implement retry logic for failed requests
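
Practices 2 and 3 can be partly automated. The sketch below uses Python's standard `urllib.robotparser` to test URLs against a robots.txt body and spaces out the ones that pass. It assumes all URLs share one host (and therefore one robots.txt) and takes the file's text as an argument so the check works offline; the helper names and the one-second default delay are illustrative.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_urls(urls, robots_txt, delay=1.0, sleep=time.sleep):
    """Yield only the allowed URLs, pausing `delay` seconds between them."""
    first = True
    for url in urls:
        if not is_allowed(robots_txt, url):
            continue
        if not first:
            sleep(delay)
        first = False
        yield url
```

To check a live site instead, `RobotFileParser` can fetch the file itself via `set_url(...)` followed by `read()`.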
    
    ## Teaching Focus: How to Create Schemas
    
    This skill is designed to teach agents how to use the Tabstack API effectively. The key is learning to create appropriate JSON schemas for different websites.
    
    ### Learning Path
    1. **Start Simple** - Use `references/simple_article.json` (4 basic fields)
    2. **Test Extensively** - Try schemas on multiple page types
    3. **Iterate** - Add fields based on what the page actually contains
    4. **Optimize** - Remove unnecessary fields for speed
    
    See [Schema Creation Guide](references/schema_guide.md) for detailed instructions and examples.
    
    ### Common Mistakes to Avoid
    - **Over-complex schemas** - Start with 2-3 fields, not 20
    - **Requiring absent fields** - Don't mark fields as required if they don't exist on the page
    - **No testing** - Always test with example.com first, then target sites
    - **Ignoring timeouts** - Complex schemas take longer and can hit the 45s timeout
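
Some of these mistakes can be caught before any request is sent. The following is a sketch of a small schema check; the function name `lint_schema` and the default threshold of 10 fields are arbitrary assumptions chosen for illustration, and the check cannot know what a target page actually contains.

```python
def lint_schema(schema, max_fields=10):
    """Return warnings about schema shapes that tend to cause slow or failed extractions."""
    warnings = []
    properties = schema.get("properties", {})
    required = schema.get("required", [])
    if len(properties) > max_fields:
        warnings.append(
            f"{len(properties)} fields defined; consider starting with 2-3"
        )
    undefined = [name for name in required if name not in properties]
    if undefined:
        warnings.append("required but not defined: " + ", ".join(undefined))
    if required:
        warnings.append(
            f"{len(required)} required field(s); confirm each exists on the target page"
        )
    return warnings
```

An empty result does not guarantee a good schema; it only means none of these specific warning signs are present.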
    
    ## Babashka Advantages
    
    Using Babashka for this skill provides:
    
    1. **Single binary** - Easy to share/install (GitHub releases, brew, nix)
    2. **Fast startup** - No JVM warmup, ~50ms startup time
    3. **Built-in HTTP client** - No external dependencies
    4. **Clojure syntax** - Expressive, and familiar to anyone who already writes Clojure
    5. **Retry logic & caching** - Built into the skill
    6. **Batch processing** - Parallel extraction for multiple URLs
    
    ## Example User Requests
    
    **Requests that should trigger this skill:**
    - "Scrape job listings from Docker careers page"
    - "Extract the main content from this article"
    - "Get structured product data from this e-commerce page"
    - "Pull all the news articles from this site"
    - "Extract contact information from this company page"
    - "Batch extract job listings from these 20 URLs"
    - "Get cached results for this page (avoid API calls)"