
    atl-mobile

    Mobile browser and native app automation via ATL (iOS Simulator).

    By @jordancoin
    SKILL.md
    ---
    name: atl-browser
    description: Mobile browser and native app automation via ATL (iOS Simulator). Navigate, click, screenshot, and automate web and native app tasks on iPhone/iPad simulators.
    metadata:
      openclaw:
        emoji: "📱"
        requires:
          bins: ["xcrun", "xcodebuild", "curl"]
        install:
          - id: "atl-clone"
            kind: "shell"
            command: "git clone https://github.com/JordanCoin/Atl ~/Atl"
            label: "Clone ATL repository"
          - id: "atl-setup"
            kind: "shell" 
            command: "~/.openclaw/skills/atl-browser/scripts/setup.sh"
            label: "Build and install ATL to simulator"
    ---
    
    # ATL — Agent Touch Layer
    
    > The automation layer between AI agents and iOS
    
    ATL provides HTTP-based automation for iOS Simulator — both **browser** (mobile Safari) and **native apps**. Think Playwright, but for mobile.
    
    ## 🔀 Two Servers: Browser & Native
    
    ATL uses **two separate servers** for browser and native app automation:
    
    | Server | Port | Use Case | Key Commands |
    |--------|------|----------|--------------|
    | **Browser** | `9222` | Web automation in mobile Safari | `goto`, `markElements`, `clickMark`, `evaluate` |
    | **Native** | `9223` | iOS app automation (Settings, Contacts, any app) | `openApp`, `snapshot`, `tapRef`, `find` |
    
    ```
    ┌────────────────────────────────────────────────────────────┐
    │  BROWSER SERVER (9222)     │     NATIVE SERVER (9223)      │
    │  (mobile Safari/WebView)   │     (iOS apps via XCTest)     │
    │                            │                               │
    │  markElements + clickMark  │     snapshot + tapRef         │
    │  CSS selectors             │     accessibility tree        │
    │  DOM evaluation            │     element references        │
    │  tap, swipe, screenshot    │     tap, swipe, screenshot    │
    └────────────────────────────────────────────────────────────┘
    ```
    
    **Why two ports?** Native app automation requires XCTest APIs (XCUIApplication, XCUIElement), which are only available in UI Test bundles. The native server therefore runs as a UI Test that exposes an HTTP API.
    
    ### Starting the Servers
    
    ```bash
    # Browser server (starts automatically with AtlBrowser app)
    xcrun simctl launch booted com.atl.browser
    curl http://localhost:9222/ping  # → {"status":"ok"}
    
    # Native server (run as UI Test)
    cd ~/Atl/core/AtlBrowser
    xcodebuild test -workspace AtlBrowser.xcworkspace \
      -scheme AtlBrowser \
      -destination 'id=<SIMULATOR_UDID>' \
      -only-testing:AtlBrowserUITests/NativeServer/testNativeServer &
      
    # Wait for it to start, then:
    curl http://localhost:9223/ping  # → {"status":"ok","mode":"native"}
    ```
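
The "Wait for it to start" step above leaves the wait itself open. A minimal sketch of one way to handle it, assuming only the `/ping` endpoint shown above: poll until the server answers or a retry budget runs out. The function name `wait_for_server` and the `ATL_CURL` override variable are hypothetical and not part of ATL.

```bash
# Hypothetical helper (not part of ATL): poll /ping until the server answers.
# Usage: wait_for_server PORT [ATTEMPTS] [DELAY_SECONDS]
# ATL_CURL may be set to another command (for example in tests); default is curl.
wait_for_server() {
  local port="$1"
  local attempts="${2:-30}"
  local delay="${3:-1}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s silences progress output, -f turns HTTP errors into a non-zero exit
    if "${ATL_CURL:-curl}" -sf "http://localhost:${port}/ping" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "server on port ${port} did not respond after ${attempts} attempts" >&2
  return 1
}
```

For the native server this could be called as `wait_for_server 9223 60 1` right after the backgrounded `xcodebuild test` command. The retry count is a guess; build times vary by machine.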
    
    ### Quick Port Reference
    
    | Task | Port | Example |
    |------|------|---------|
    | Browse websites | 9222 | `curl localhost:9222/command -d '{"method":"goto",...}'` |
    | Open native app | 9223 | `curl localhost:9223/command -d '{"method":"openApp",...}'` |
    | Screenshot (browser) | 9222 | `curl localhost:9222/command -d '{"method":"screenshot"}'` |
    | Screenshot (native) | 9223 | `curl localhost:9223/command -d '{"method":"screenshot"}'` |
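
The table reduces to a mode-to-port mapping, which can be wrapped so that scripts name the mode instead of hard-coding a port. This is a sketch under that assumption; `port_for_mode` and `send_command` are hypothetical names, not functions shipped with ATL.

```bash
# Hypothetical helpers (not part of ATL).

# Print the port for a mode: "browser" or "native".
port_for_mode() {
  case "$1" in
    browser) echo 9222 ;;
    native)  echo 9223 ;;
    *) echo "unknown mode: $1 (expected browser or native)" >&2; return 1 ;;
  esac
}

# POST a JSON command to the server for the given mode.
# Usage: send_command MODE JSON
send_command() {
  local port
  port=$(port_for_mode "$1") || return 1
  curl -s -X POST "http://localhost:${port}/command" -d "$2"
}
```

With these in place, `send_command native '{"method":"screenshot"}'` and `send_command browser '{"method":"screenshot"}'` differ only in the mode argument.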
    
    ---
    
    ## 📱 Native App Automation (Port 9223)
    
    Native automation uses **port 9223** and automates **any iOS app** using the accessibility tree — no DOM, no JavaScript, just direct element interaction.
    
    ### Opening & Closing Apps
    
    ```bash
    # Open an app by bundle ID
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"openApp","params":{"bundleId":"com.apple.Preferences"}}'
    # → {"success":true,"result":{"bundleId":"com.apple.Preferences","mode":"native","state":"running"}}
    
    # Check current app state
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"appState"}'
    # → {"success":true,"result":{"mode":"native","bundleId":"com.apple.Preferences","state":"running"}}
    
    # Close current app
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"closeApp"}'
    # → {"success":true,"result":{"closed":true}}
    ```
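
A script usually wants to branch on the `state` field rather than print the whole response. The sketch below extracts it with `sed`, assuming the compact single-line JSON shape shown in the examples above; `app_state_of` is a hypothetical name. States other than `running` are not documented here, so the helper only returns whatever string the server sent.

```bash
# Hypothetical helper (not part of ATL): print result.state from an
# appState or openApp response passed as the first argument.
# Assumes compact one-line JSON like the sample responses above.
app_state_of() {
  printf '%s\n' "$1" | sed -n 's/.*"state":"\([^"]*\)".*/\1/p'
}
```

If `jq` is available, `jq -r '.result.state'` is the more robust choice, since it does not depend on how the JSON is laid out.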
    
    ### Common Bundle IDs
    
    | App | Bundle ID |
    |-----|-----------|
    | Settings | `com.apple.Preferences` |
    | Contacts | `com.apple.MobileAddressBook` |
    | Calculator | `com.apple.calculator` |
    | Calendar | `com.apple.mobilecal` |
    | Photos | `com.apple.mobileslideshow` |
    | Notes | `com.apple.mobilenotes` |
    | Reminders | `com.apple.reminders` |
    | Clock | `com.apple.mobiletimer` |
    | Maps | `com.apple.Maps` |
    | Safari | `com.apple.mobilesafari` |
    
    ### The `snapshot` Command
    
    `snapshot` returns the accessibility tree: all visible elements with their properties and tappable references.
    
    ```bash
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"snapshot","params":{"interactiveOnly":true}}' | jq '.result'
    ```
    
    **Example output:**
    ```json
    {
      "count": 12,
      "elements": [
        {
          "ref": "e0",
          "type": "cell",
          "label": "Wi-Fi",
          "value": "MyNetwork",
          "identifier": "",
          "x": 0,
          "y": 142,
          "width": 393,
          "height": 44,
          "isHittable": true,
          "isEnabled": true
        },
        {
          "ref": "e1",
          "type": "cell",
          "label": "Bluetooth",
          "value": "On",
          "identifier": "",
          "x": 0,
          "y": 186,
          "width": 393,
          "height": 44,
          "isHittable": true,
          "isEnabled": true
        },
        {
          "ref": "e2",
          "type": "button",
          "label": "Back",
          "value": null,
          "identifier": "Back",
          "x": 0,
          "y": 44,
          "width": 80,
          "height": 44,
          "isHittable": true,
          "isEnabled": true
        }
      ]
    }
    ```
    
    **Parameters:**
    - `interactiveOnly` (bool, default: `false`) — Only return hittable elements
    - `maxDepth` (int, optional) — Limit tree traversal depth
    
    ### The `tapRef` Command
    
    Tap an element by its reference from the last `snapshot`:
    
    ```bash
    # Take snapshot first
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"snapshot","params":{"interactiveOnly":true}}'
    
    # Tap element e0 (Wi-Fi cell from example above)
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"tapRef","params":{"ref":"e0"}}'
    # → {"success":true}
    ```
    
    ### The `find` Command
    
    Find and interact with elements by text, with no need to parse the snapshot manually:
    
    ```bash
    # Find and tap "Wi-Fi"
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"find","params":{"text":"Wi-Fi","action":"tap"}}'
    # → {"success":true,"result":{"found":true,"ref":"e0"}}
    
    # Check if an element exists
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"find","params":{"text":"Bluetooth","action":"exists"}}'
    # → {"success":true,"result":{"found":true,"ref":"e1"}}
    
    # Find and fill a text field
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"find","params":{"text":"First name","action":"fill","value":"John"}}'
    
    # Get element info without interacting
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"find","params":{"text":"Cancel","action":"get"}}'
    # → {"success":true,"result":{"found":true,"ref":"e5","element":{...}}}
    ```
    
    **Parameters:**
    - `text` (string) — Text to search for (matches label, value, or identifier)
    - `action` (string) — One of: `tap`, `fill`, `exists`, `get`
    - `value` (string, optional) — Text to fill (required for `action:"fill"`)
    - `by` (string, optional) — Narrow search: `label`, `value`, `identifier`, `type`, or `any` (default)
    
    ---
    
    ## 🔄 Native App Workflow Example
    
    Here's a complete flow that opens Settings, navigates to Wi-Fi, and takes a screenshot:
    
    ```bash
    # 1. Open Settings app
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"openApp","params":{"bundleId":"com.apple.Preferences"}}'
    
    # 2. Wait for app to launch
    sleep 1
    
    # 3. Take snapshot to see available elements
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"snapshot","params":{"interactiveOnly":true}}' | jq '.result.elements[:5]'
    
    # 4. Find and tap Wi-Fi
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"find","params":{"text":"Wi-Fi","action":"tap"}}'
    
    # 5. Wait for navigation
    sleep 0.5
    
    # 6. Take screenshot of Wi-Fi settings
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"screenshot"}' | jq -r '.result.data' | base64 -d > /tmp/wifi-settings.png
    
    # 7. Navigate back (swipe right from left edge)
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"swipe","params":{"direction":"right"}}'
    
    # 8. Close the app
    curl -s -X POST http://localhost:9223/command \
      -d '{"method":"closeApp"}'
    ```
    
    ### Helper Script Version
    
    ```bash
    source ~/.openclaw/skills/atl-browser/scripts/atl-helper.sh
    
    atl_openapp "com.apple.Preferences"
    sleep 1
    atl_find "Wi-Fi" tap
    sleep 0.5
    atl_screenshot /tmp/wifi-settings.png
    atl_swipe right
    atl_closeapp
    ```
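
Both versions above rely on fixed `sleep` calls, which can be too short on a slow simulator and needlessly long on a fast one. A sketch of an alternative is to poll `find` with `action: "exists"` until the element appears. The names `query_exists` and `wait_for_element` are hypothetical. The sketch also assumes a miss never contains `"found":true`, since the source only shows the response for a hit.

```bash
# Hypothetical helpers (not part of ATL).

# Ask the native server whether an element with this text exists.
# Note: TEXT is inserted as-is, so it must not contain " or \ characters.
query_exists() {
  curl -s -X POST http://localhost:9223/command \
    -d "{\"method\":\"find\",\"params\":{\"text\":\"$1\",\"action\":\"exists\"}}"
}

# Poll until the element is reported as found, or give up.
# Usage: wait_for_element TEXT [ATTEMPTS] [DELAY_SECONDS]
wait_for_element() {
  local text="$1"
  local attempts="${2:-10}"
  local delay="${3:-0.5}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    case "$(query_exists "$text")" in
      *'"found":true'*) return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

In the workflow, `wait_for_element "Wi-Fi"` could stand in for the `sleep 1` after `openApp`, so the script proceeds as soon as the Settings list is on screen.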
    
    ---
    
    ## 💡 Core Insight: Vision-Free Automation
    
    ATL's killer feature is **spatial understanding without vision models**:
    
    ```
    ┌─────────────────────────────────────────────────────────────┐
    │  markElements + captureForVision = COMPLETE PAGE KNOWLEDGE  │
    └─────────────────────────────────────────────────────────────┘
    
    1. markElements  → Numbers every interactive element [1] [2] [3]
    2. captureForVision → PDF with text layer + element coordinates
    3. tap x=234 y=567 → Pixel-perfect touch at exact position
    ```
    
    **Why this matters:**
    - **No vision API calls** — zero token cost for "seeing" the page
    - **Faster** — no round-trip to GPT-4V/Claude Vision
    - **Deterministic** — same page = same coordinates, every time
    - **Reliable** — pixel-perfect coordinates vs. vision interpretation
    
    ### The Vision-Free Workflow
    
    ```bash
    # 1. Mark elements (adds numbered labels + stores coordinates)
    curl -s -X POST http://localhost:9222/command \
      -d '{"id":"1","method":"markElements","params":{}}'
    
    # 2. Capture PDF with text layer (machine-readable, has coordinates)
    curl -s -X POST http://localhost:9222/command \
      -d '{"id":"2","method":"captureForVision","params":{"savePath":"/tmp","name":"page"}}' \
      | jq -r '.result.path'
    # → /tmp/page.pdf (text-selectable, contains element positions)
    
    # 3. Get specific element's position by mark label
    curl -s -X POST http://localhost:9222/command \
      -d '{"id":"3","method":"getMarkInfo","params":{"label":5}}' | jq '
    
    ... (truncated)