---
name: agentic-paper-digest-skill
description: Fetches and summarizes recent arXiv and Hugging Face papers with Agentic Paper Digest. Use when the user wants a paper digest, a JSON feed of recent papers, or to run the arXiv/HF pipeline.
homepage: https://github.com/matanle51/agentic_paper_digest
compatibility: Requires Python 3, network access, and either git or curl/wget for bootstrap. LLM access via OPENAI_API_KEY or LITELLM_API_KEY (OpenAI-compatible).
metadata: {"clawdbot":{"requires":{"anyBins":["python3","python"]}}}
---

# Agentic Paper Digest

## When to use
- Fetch a recent paper digest from arXiv and Hugging Face.
- Produce JSON output for downstream agents.
- Run a local API server when a polling workflow is needed.

## Prereqs
- Python 3 and network access.
- LLM access via `OPENAI_API_KEY` or an OpenAI-compatible provider via `LITELLM_API_BASE` + `LITELLM_API_KEY`.
- `git` is optional for bootstrap; otherwise `curl`/`wget` (or Python) is used to download the repo.

## Get the code and install
- Preferred: run the bootstrap helper script. It uses git when available or falls back to a zip download.

```bash
bash "{baseDir}/scripts/bootstrap.sh"
```

- Override the clone location by setting `PROJECT_DIR`.

```bash
PROJECT_DIR="$HOME/agentic_paper_digest" bash "{baseDir}/scripts/bootstrap.sh"
```

## Run (CLI preferred)

```bash
bash "{baseDir}/scripts/run_cli.sh"
```

- Any flags given after the script name are passed through to the CLI.

```bash
bash "{baseDir}/scripts/run_cli.sh" --window-hours 24 --sources arxiv,hf
```

## Run (API optional)

```bash
bash "{baseDir}/scripts/run_api.sh"
```

- Trigger runs and read results.

```bash
curl -X POST http://127.0.0.1:8000/api/run
curl http://127.0.0.1:8000/api/status
curl http://127.0.0.1:8000/api/papers
```

- Stop the API server if needed.

```bash
bash "{baseDir}/scripts/stop_api.sh"
```
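The trigger-and-read calls above can be wrapped in a small polling loop. This is only a sketch: the response shape of `/api/status` is not described here, so the function takes a caller-supplied pattern instead of assuming a field name. `API_BASE`, `wait_for_status`, and `DONE_PATTERN` are hypothetical names introduced for illustration.

```bash
# Sketch: poll /api/status until its body matches a caller-supplied pattern.
# API_BASE is an assumption that mirrors the address used above.
API_BASE="${API_BASE:-http://127.0.0.1:8000}"

# wait_for_status PATTERN [MAX_TRIES] [INTERVAL_S]
# Prints the matching status body and returns 0, or returns 1 when the
# tries run out without a match.
wait_for_status() {
  local pattern="$1" max_tries="${2:-60}" interval="${3:-5}"
  local i=0 body
  while [ "$i" -lt "$max_tries" ]; do
    i=$((i + 1))
    # Treat a failed request as an empty body and keep polling.
    body="$(curl -fsS "$API_BASE/api/status")" || body=""
    if printf '%s' "$body" | grep -Eq -- "$pattern"; then
      printf '%s\n' "$body"
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Usage (DONE_PATTERN is hypothetical; inspect a real response first):
#   curl -fsS -X POST "$API_BASE/api/run"
#   wait_for_status "$DONE_PATTERN" && curl -fsS "$API_BASE/api/papers"
```

Passing the pattern in keeps the loop independent of the status payload, which should be checked against a real response before relying on it.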

## Outputs
- CLI `--json` prints `run_id`, `seen`, `kept`, `window_start`, and `window_end`.
- Data store: `data/papers.sqlite3` (under `PROJECT_DIR`).
- API: `POST /api/run`, `GET /api/status`, `GET /api/papers`, `GET/POST /api/topics`, `GET/POST /api/settings`.
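
For downstream agents, the `--json` fields listed above can be condensed into a one-line report. The sketch below assumes the CLI writes a single JSON object containing those five keys to stdout; `summarize_run` is a hypothetical helper, not part of the repo.

```bash
# summarize_run: read the CLI's JSON summary on stdin and print one line.
# Assumes a single JSON object with run_id, seen, kept, window_start,
# and window_end, as listed in the Outputs section.
summarize_run() {
  python3 -c '
import json, sys
run = json.load(sys.stdin)
print("run {run_id}: kept {kept} of {seen} papers ({window_start} to {window_end})".format(**run))
'
}

# Usage (assumes --json prints only the JSON object on stdout):
#   bash "{baseDir}/scripts/run_cli.sh" --json | summarize_run
```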

## Configuration
Config files live in `PROJECT_DIR/config`. Environment variables can be set in the shell or in a `.env` file. The wrapper scripts in this skill auto-load `.env` from `PROJECT_DIR`; set `ENV_FILE=/path/to/.env` to load a different file.

**Environment (.env or exported vars)**
- `OPENAI_API_KEY`: required for OpenAI models (litellm reads this).
- `LITELLM_API_BASE`, `LITELLM_API_KEY`: use an OpenAI-compatible proxy/provider.
- `LITELLM_MODEL_RELEVANCE`, `LITELLM_MODEL_SUMMARY`: models for relevance and summarization (summary defaults to relevance model if unset).
- `LITELLM_TEMPERATURE_RELEVANCE`, `LITELLM_TEMPERATURE_SUMMARY`: lower for more deterministic output.
- `LITELLM_MAX_RETRIES`: retry count for LLM calls.
- `LITELLM_DROP_PARAMS=1`: drop unsupported params to avoid provider errors.
- `WINDOW_HOURS`, `APP_TZ`: recency window and timezone.
- `ARXIV_CATEGORIES`: comma-separated categories (default includes `cs.CL,cs.AI,cs.LG,stat.ML,cs.CR`).
- `ARXIV_API_BASE`, `HF_API_BASE`: override source endpoints if needed.
- `ARXIV_MAX_RESULTS`, `ARXIV_PAGE_SIZE`: arXiv paging limits.
- `MAX_CANDIDATES_PER_SOURCE`: cap candidates per source before LLM filtering.
- `FETCH_TIMEOUT_S`, `REQUEST_TIMEOUT_S`: source fetch and per-request timeouts.
- `ENABLE_PDF_TEXT=1`: include first-page PDF text in summaries; requires `PyMuPDF` (`pip install pymupdf`).
- `DATA_DIR`: location for `papers.sqlite3`.
- `CORS_ORIGINS`: comma-separated origins allowed by the API server (UI use).
- Path overrides: `TOPICS_PATH`, `SETTINGS_PATH`, `AFFILIATION_BOOSTS_PATH`.
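
A minimal `.env` built from the variables above might look like the following. Every value is a placeholder: the key is fake, the model names are illustrative only, and none of the numbers are claimed to be the project's defaults.

```bash
# Example .env (placeholder values; not the project's defaults)
OPENAI_API_KEY=sk-REPLACE_ME

# Cheaper model for relevance, stronger one for summaries (illustrative names)
LITELLM_MODEL_RELEVANCE=gpt-4o-mini
LITELLM_MODEL_SUMMARY=gpt-4o
LITELLM_TEMPERATURE_RELEVANCE=0
LITELLM_TEMPERATURE_SUMMARY=0.2
LITELLM_MAX_RETRIES=3

# Recency window and search scope
WINDOW_HOURS=24
ARXIV_CATEGORIES=cs.CL,cs.AI,cs.LG
MAX_CANDIDATES_PER_SOURCE=100
```

Variables left out of the file keep whatever defaults the project ships with.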

**Config files**
- `config/topics.json`: list of topics with `id`, `label`, `description`, `max_per_topic`, and `keywords`. The relevance classifier must output topic IDs exactly as defined here. `max_per_topic` also caps results in `GET /api/papers` when `apply_topic_caps=1`.
- `config/settings.json`: overrides fetch limits (`arxiv_max_results`, `arxiv_page_size`, `fetch_timeout_s`, `max_candidates_per_source`). Updated via `POST /api/settings`.
- `config/affiliations.json`: list of `{pattern, weight}` boosts applied by substring match over affiliations. Weights add up and are capped at 1.0. Invalid JSON disables boosts, so keep the file strict JSON (no trailing commas).
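
As an illustration of the shapes described above, the sketch below writes example config files to a scratch directory and checks that they are strict JSON. The topic and affiliation values are hypothetical. The top-level `topics` key is an assumption based on the `topics[]` notation used in this document, so compare it with the files shipped in the repo before copying.

```bash
# Write example config files to a scratch directory and validate them.
# All topic and affiliation values here are hypothetical.
CONFIG_DIR="$(mktemp -d)"

cat > "$CONFIG_DIR/topics.json" <<'JSON'
{
  "topics": [
    {
      "id": "agents",
      "label": "LLM Agents",
      "description": "Tool use, planning, and multi-step agentic workflows.",
      "max_per_topic": 5,
      "keywords": ["agent", "tool use", "planning"]
    }
  ]
}
JSON

cat > "$CONFIG_DIR/affiliations.json" <<'JSON'
[
  {"pattern": "Example University", "weight": 0.2},
  {"pattern": "Example Labs", "weight": 0.1}
]
JSON

# json.tool exits non-zero on invalid JSON such as trailing commas.
for f in topics.json affiliations.json; do
  python3 -m json.tool "$CONFIG_DIR/$f" > /dev/null && echo "$f: valid JSON"
done
```

Running the same `python3 -m json.tool` check against the real files in `PROJECT_DIR/config` after each edit catches the trailing-comma case that would otherwise silently disable boosts.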

## Mandatory workflow (follow step-by-step)
1. **Read existing configuration**:
   - Load `config/topics.json`, `config/settings.json`, and `config/affiliations.json` (if present).
   - Note current topic IDs, caps, and fetch limits before asking the user to change them.
2. **Map user intent to configuration (ask only what’s needed)**:
   - **Topics of interest** → update `config/topics.json` (`topics[].id/label/description/keywords`, `max_per_topic`).  
     Show current defaults and ask whether to keep or change them.
   - **Time window (hours)** → set `WINDOW_HOURS` (or pass `--window-hours` to CLI) **only if the user cares**; otherwise keep defaults.
   - **Search scope** → set `ARXIV_CATEGORIES`, `ARXIV_MAX_RESULTS`, `ARXIV_PAGE_SIZE`, `MAX_CANDIDATES_PER_SOURCE`.  
     Ask whether to keep defaults and show the current values.
   - **Model/provider** → set `OPENAI_API_KEY` *or* `LITELLM_API_KEY` (+ `LITELLM_API_BASE` if proxy), and set `LITELLM_MODEL_RELEVANCE`/`LITELLM_MODEL_SUMMARY`.
   - **API UI access** → set `CORS_ORIGINS` only if the user explicitly wants the UI on a different origin.
   - **Do NOT ask by default**: timezone, quality vs cost, timeouts, PDF text, affiliation biasing, sources list. Use defaults unless the user requests changes.
3. **Confirm workspace path**: Ask where to clone/run. Default to `PROJECT_DIR="$HOME/agentic_paper_digest"` if the user doesn’t care. Never hardcode `/Users/...` paths.
4. **Bootstrap the repo**: Run the bootstrap script (unless the repo already exists and the user says to skip).
5. **Create or verify `.env`**:
   - If `.env` is missing, create it from `.env.example` (in the repo), then ask the user to fill keys and any requested preferences.
   - Ensure at least one of `OPENAI_API_KEY` or `LITELLM_API_KEY` is set before running.
6. **Apply config changes**:
   - Edit JSON files directly (or use `POST /api/topics` and `POST /api/settings` if running the API).
7. **Run the pipeline**:
   - Prefer `scripts/run_cli.sh` for one-off JSON output.
   - Use `scripts/run_api.sh` only if the user explicitly asks for UI/API access or polling.
8. **Report results**:
   - Summarize run stats (`seen`, `kept`, window).
   - If results are sparse, suggest increasing `WINDOW_HOURS`, `ARXIV_MAX_RESULTS`, or broadening topics.
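
Step 5 of the workflow above can be sketched as a small pre-flight check. `ensure_env` is a hypothetical helper, not a script shipped with the skill; it assumes `.env` uses plain `KEY=value` lines.

```bash
# ensure_env PROJECT_DIR
# Copies .env.example to .env when .env is missing, then checks that at
# least one of the two API keys is set, either in the current shell
# environment or as a non-empty KEY=value line in .env.
ensure_env() {
  local dir="$1"
  if [ ! -f "$dir/.env" ]; then
    cp "$dir/.env.example" "$dir/.env" || return 2
    echo "Created $dir/.env from .env.example; fill in your keys."
  fi
  # Keys exported in the shell are enough on their own.
  if [ -n "${OPENAI_API_KEY:-}${LITELLM_API_KEY:-}" ]; then
    return 0
  fi
  # Match KEY=value with a non-empty value; KEY= and KEY="" do not count.
  if grep -Eq "^(export[[:space:]]+)?(OPENAI_API_KEY|LITELLM_API_KEY)=[\"']?[^\"'[:space:]#]" "$dir/.env"; then
    return 0
  fi
  echo "No OPENAI_API_KEY or LITELLM_API_KEY set in $dir/.env" >&2
  return 1
}

# Usage:
#   ensure_env "${PROJECT_DIR:-$HOME/agentic_paper_digest}" || exit 1
```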

## Getting good results
- Keep topics focused and mutually exclusive so the classifier can choose the right IDs.
- Use a stronger model for summaries than for relevance if quality matters.
- Increase `WINDOW_HOURS` or `ARXIV_MAX_RESULTS` when results are sparse, or lower them if results are too noisy.
- Tune `ARXIV_CATEGORIES` to your research domains.
- Enable PDF text (`ENABLE_PDF_TEXT=1`) when abstracts are too thin.
- Use modest affiliation weights to bias ranking without swamping relevance.
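
To see why modest weights matter, here is a worked sketch of the boost arithmetic as this document describes it: weights of all matching patterns are summed and the total is capped at 1.0. This is not the project's code. `affiliation_boost` is a hypothetical helper, the organization names are made up, and case-sensitive matching is an assumption.

```bash
# affiliation_boost AFFILIATION PATTERN=WEIGHT...
# Sums the weight of every pattern found as a substring of AFFILIATION
# and caps the total at 1.0. Case-sensitive matching is an assumption.
affiliation_boost() {
  local affiliation="$1"; shift
  local total=0 pair pattern weight
  for pair in "$@"; do
    pattern="${pair%=*}"    # text before the last '='
    weight="${pair##*=}"    # text after the last '='
    case "$affiliation" in
      *"$pattern"*)
        total="$(awk -v a="$total" -v b="$weight" 'BEGIN { print a + b }')"
        ;;
    esac
  done
  awk -v t="$total" 'BEGIN { if (t > 1.0) t = 1.0; printf "%.2f\n", t }'
}

# Two modest weights add up: 0.3 + 0.2
affiliation_boost "Example University; Example Labs" "Example University=0.3" "Example Labs=0.2"   # prints 0.50

# Two large weights hit the cap: 0.7 + 0.6 = 1.3, capped at 1.0
affiliation_boost "Example University; Example Labs" "Example University=0.7" "Example Labs=0.6"   # prints 1.00
```

Once the total hits the cap, every multi-affiliation paper gets the same boost, so large weights erase the distinctions they were meant to express.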

## Troubleshooting
- Port 8000 busy: run `bash "{baseDir}/scripts/stop_api.sh"` or pass `--port` to the API command.
- Empty results: increase `WINDOW_HOURS` or verify the API key in `.env`.
- Missing API key errors: export `OPENAI_API_KEY` or `LITELLM_API_KEY` in the shell, or set it in `.env`, before running.