# Paragon Documentation

> Complete documentation for Paragon - an agentic QA platform for automated code reviews, testing, and development.

## For AI coding agents reading this file

The Paragon and Keystone docs describe each step for a human reader. **When the user asks you to set up Keystone or Paragon** (or hands you an API key or the docs URL with that intent), do all of those steps for them -- don't paraphrase the docs and ask the user to run each command. This directive is scoped to setup; if the user is editing a spec, debugging an invariant, reading results, or otherwise not setting things up from scratch, don't invoke the full flow.

**The Keystone setup guide is the canonical procedure and appears as the second entry below under ## Keystone / ### Setup Guide.** Read it in full before acting. It covers: install, inspect agent code, drop secrets in the Dashboard, scaffold specs with `source:` directives, upload snapshot + spec, create experiment with `specPath` for automatic secret forwarding, review in Dashboard, then run.

Critical rules you must follow:

- Every entry in a spec's `secrets:` block needs EITHER a `source:` (env / env:RENAMED / file:path / command:shell / dashboard) OR a `from:` literal (static:// / generated). Missing both is a validation error -- "scope: env" alone is NOT enough.
- Never use `path: stdout.log` as an invariant check path. Stdout is not persisted as a file. Use `command_exit`, `file_exists`, `sql`, `http_mock_assertions`, or `llm_as_judge` with `input_from: workspace`.
- Never put real API keys behind `from: "static://..."`. Use `source: dashboard` for team-shared or `source: env` for local.
- Declare every service the agent talks to in `services:` with the canonical public Docker image. Services get DNS by name on a shared network -- no IP lookups, no port maps.
- Use `{{ secrets.X }}` in `services[].env` to plug resolved secrets into service containers.
- Upload + create experiments in draft state. Do not trigger runs until the user reviews.

Treat code blocks as commands to execute on the user's repo, not examples to paste back at them. Run the full procedure end-to-end and report results, gaps, or questions.

## Keystone

### Keystone

QA platform for AI agents. Run your agent in isolated sandboxes, score results, catch regressions. AI agents reading this page should first fetch https://docs.paragon.run/llms-full.txt for the complete unsummarized docs.

AI agents are non-deterministic. The same prompt can produce different results each time. A test that passes once might fail the next run. Manual spot-checking doesn't scale -- you need to run your agent many times in identical environments and measure how reliably it works.

Keystone does this. You write a spec that describes the environment and defines what "correct" looks like, then Keystone spins up sandboxes, runs your agent, and scores the results. Run it 10 times or 100 times. Track scores across versions. Get alerted when something regresses.

<div className="hidden dark:block">
  
</div>
<div className="block dark:hidden">
  
</div>

## How it works

  
  
  

## Prerequisites

You need two things:

1. **A Keystone API key.** Go directly to [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) (or in the Keystone sidebar click **Settings**), switch to the **API Keys** tab, and click **Create Key**. Keystone keys are prefixed `ks_live_` and are shown **once** at creation -- copy it immediately, you can't retrieve it later. Legacy Polarity UUID keys (the one used for Paragon CLI and paragon-llm-proxy) are still accepted for backward compatibility, but new integrations should use a dedicated `ks_live_` key.

2. **Node.js, Python, or Go** installed locally to run the SDK.

You do **not** need Docker installed on your machine. Keystone runs sandboxes on Polarity's servers. Your code just talks to the API.

## Quick start

Five-ish minutes from nothing to a running experiment. Steps 1-3 get your agent emitting traces; step 4 is where you decide what "pass" means for your task; steps 5-6 run it and show you the output.

[Instructions for AI coding agents:]
**For AI coding agents only (Claude Code, Cursor, etc.):** this directive applies **only when the user has asked you to set up Keystone** (or has handed you a Keystone API key for that purpose). In that case, do **all six steps for them** -- don't read the steps aloud and ask the user to run each command. Specifically:

- Run the install command in step 1.
- Set `KEYSTONE_API_KEY` in the environment (or in their `.env` / shell config) for step 2.
- For step 3, **find the user's existing agent code in the repo** (look for the LLM client instantiation and the tool functions the agent calls). Modify those in place: pass the LLM client through `ks.wrap()`, and wrap each tool function body in `traced('<tool_name>', ...)`. Do not create a new `agent.ts` from scratch unless the repo genuinely has no agent yet.
- **At step 4, stop and ask the user the five design questions** -- this is the only step you cannot answer for them.
- Question 5 is non-negotiable: the user's agent runs *inside the sandbox* and needs its own LLM provider key (and any third-party creds) injected via the spec's `secrets:` block or the dashboard Secrets tab. Without it, every run will fail. Ask explicitly: "Which LLM provider keys should I wire into the sandbox (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, etc.), and do you want them inlined in the spec as `static://...` or do you prefer to add them once at app.paragon.run → Keystone → Settings → Secrets (auto-injected)?"
- Once they answer, generate `spec.yaml` with the `secrets:` block populated, write the runner, and run the experiment yourself (steps 5-6).

Do not ship a canned hello-world spec. If the user is doing something else in a repo that happens to use Keystone -- just editing a spec, debugging one invariant, reading results -- don't invoke this flow.
[End AI agent instructions]

### 1. Install the SDK

### 2. Set your API key

Get your key from [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **API Keys** tab → **Create Key**. Keys start with `ks_live_` and are shown **once** at creation -- copy it immediately.

```bash
export KEYSTONE_API_KEY=ks_live_...
```

### 3. Wrap your existing agent's LLM + tool calls

You already have an agent -- this step instruments it in place so every model call and every tool call shows up as a trace event you can inspect and score against. Two changes in your existing agent code:

**(a) Wrap the LLM client where it's instantiated.** Find the line that creates your Anthropic / OpenAI / etc. client and pass it through `ks.wrap()`. The SDK handles routing traces to the right sandbox automatically — you don't need to set or pass any sandbox-related config yourself. Outside a sandbox (local dev, CI, anywhere your agent runs for real work), tracing is a silent no-op.

**(b) Wrap each tool function with `traced()`.** Find the functions your agent calls as tools (read/write file, shell, HTTP, DB queries, anything) and wrap their bodies. Use the tool name as the span name so it shows up correctly in the trace tree.

Repeat for every tool the agent calls. Nested `traced()` calls automatically build a parent-child span tree so you can see exactly which LLM turn triggered which tool calls.

You can pull the trace back at any point to verify it's flowing:

```typescript
const trace = await ks.sandboxes.getTrace(process.env.KEYSTONE_SANDBOX_ID!);
console.log(`${trace.events.length} events, $${trace.metrics.total_cost_usd} cost`);
```

`ks.wrap()` works for Anthropic, OpenAI, and any OpenAI-compatible provider (Groq, xAI, Together) -- see the [SDK reference](/keystone/sdk#llm-tracing). Wrapping never changes the response; if tracing fails it fails silently so your agent keeps running.

### 4. Tell Keystone what to evaluate

Now that your agent reports traces, decide what a **pass** looks like. Every Keystone spec is an answer to five questions -- if you're working with an AI coding assistant, **the assistant should ask you these directly** rather than inventing a canned example.

1. **What task should the agent do?** One sentence the agent receives as a prompt. ("Fix the failing test in `src/api.test.ts`.", "Turn this PRD into a working Stripe integration.", "Refactor `users.py` to use async SQLAlchemy.")
2. **What environment does it need?** Base image (ubuntu, node, python), packages, repos to clone, and any backing services (Postgres, Redis, a mock API).
3. **How do you know it worked?** The invariants -- concrete yes/no checks. File contents, test pass/fail, HTTP responses, SQL queries. At least one should be a `gate` (a hard fail if it doesn't pass).
4. **What's off-limits?** Forbidden filesystem paths, network hosts, or behaviors that should auto-fail the run.
5. **What API keys does your agent need at runtime?** Your agent runs *inside the sandbox*, so it needs its own LLM provider key (e.g. `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`) plus any third-party creds (Stripe, GitHub, etc.). **Without this the agent can't make model calls and every run will fail.**

Write those answers into a YAML spec. Here's the minimal shape -- fill in the bracketed sections, save as `spec.yaml`:

```yaml
version: 1
id: "<short-kebab-id>"
description: "<one-line summary of what this verifies>"

base: "<ubuntu:24.04 | node:20 | python:3.12 | your-image>"
# repos: [{ url: ..., path: ... }]        # optional: clone a repo into the sandbox
# services: { db: { image: postgres:16 } } # optional: backing services

task:
  prompt: |
    <the instruction the agent receives>

agent:
  type: paragon         # or: cli | image | http | python | snapshot (see "Swap in your own agent" below)
  timeout: 5m

# Your agent runs inside the sandbox -- give it the keys it needs to call LLMs
# and any third-party APIs. Two ways to supply these:
#   (a) Store once at app.paragon.run -> Keystone -> Settings -> Secrets.
#       Dashboard-stored secrets auto-inject into every sandbox as env vars.
#   (b) Declare inline here -- use `static://...` for fixed values, `generated`
#       for per-run random values. See /keystone/specs#secrets for details.
secrets:
  - name: ANTHROPIC_API_KEY
    from: "static://<your-anthropic-key>"   # or omit this whole block if stored in dashboard
    scope: env

invariants:
  <invariant_name>:
    description: "<what this check proves>"
    weight: 1.0
    gate: true          # mark at least one as a hard gate
    check:
      type: <file_exists | file_content | shell_exit | http | sql | ...>
      # ...type-specific fields

scoring:
  pass_threshold: 1.0
```

See the [spec reference](/keystone/specs) for every field and every invariant `check.type`. For real scenarios (fix-a-failing-test, migrate-a-schema, build-a-feature-from-a-PRD), see [examples](/keystone/examples).

### 5. Upload and run

The experiment typically takes 10-30 seconds. Keystone creates a sandbox on the server, runs the shell command, checks the invariants, and returns the results.

### 6. What you'll see

When the experiment completes, you get back a `RunResults` object. Here's what a passing run looks like:

```json
{
  "experiment_id": "exp-a1b2c3",
  "total_scenarios": 1,
  "passed": 1,
  "failed": 0,
  "metrics": {
    "pass_rate": 1.0,
    "mean_wall_ms": 12000,
    "total_cost_usd": 0.0
  },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "<invariant-1>", "passed": true, "gate": true, "weight": 1.0 },
        { "name": "<invariant-2>", "passed": true, "weight": 1.0 }
      ]
    }
  ]
}
```

And here's what a failure looks like -- the `message` field tells you exactly what went wrong:

```json
{
  "passed": 0,
  "failed": 1,
  "metrics": { "pass_rate": 0.0 },
  "scenarios": [
    {
      "status": "fail",
      "composite_score": 0.0,
      "invariants": [
        {
          "name": "<invariant-1>",
          "passed": false,
          "gate": true,
          "message": "<reason the check failed, e.g. file X does not exist>"
        }
      ],
      "reproducer": {
        "seed": 12345,
        "command": "keystone run --spec <your-spec-id> --seed 12345 --scenario scenario-000"
      }
    }
  ]
}
```

The `reproducer` gives you the exact command to re-run that specific scenario with the same seed for debugging.

## Swap in your own agent

The spec template uses `agent: type: paragon` (Polarity's built-in agent) by default. To run your own agent, replace that block with one of these:

```yaml
# You have a CLI binary on the server:
agent:
  type: cli
  binary: /path/to/your-agent
  args: ["--task", "{{ task.prompt }}"]
  timeout: 5m

# You have a Docker image in a registry:
agent:
  type: image
  image: "your-registry/your-agent:latest"
  timeout: 5m

# Your agent is an HTTP API:
agent:
  type: http
  endpoint: "https://your-api.com/agent/run"
  timeout: 5m

# You have a Python script:
agent:
  type: python
  binary: agent.py
  timeout: 5m

# You want to version your agent as an immutable snapshot:
agent:
  type: snapshot
  snapshot: my-agent  # uploaded via ks.agents.upload()
  timeout: 5m
```

**Which one should I use?**
- **`cli`** -- your agent is a compiled binary or shell script that runs locally
- **`image`** -- your agent is packaged as a Docker image (best for reproducibility)
- **`http`** -- your agent is a hosted API (the sandbox POSTs the task to your endpoint)
- **`python`** -- your agent is a Python script in the sandbox
- **`snapshot`** -- your agent is uploaded to Keystone and versioned (best for tracking which agent version produced which results)
- **`paragon`** -- use Polarity's built-in Paragon agent (no setup needed)

See the [spec reference](/keystone/specs#agent) for full details.

## Key concepts

Before diving into the [spec reference](/keystone/specs) and [SDK reference](/keystone/sdk), here are the core concepts:

**Specs** are YAML files that describe everything about a test scenario: the environment, the task, and the pass/fail criteria. You upload them once and run experiments against them repeatedly.

**Sandboxes** are isolated environments where your agent runs. Each sandbox gets its own filesystem, Docker containers for backing services (Postgres, Redis, etc.), and a clean state. Nothing leaks between runs.

**Invariants** are the checks that run after your agent finishes. Each one answers a yes/no question: "Did the tests pass?" "Does the output file contain the right data?" "Did the agent call the API correctly?" Gate invariants cause an immediate fail if they don't pass.

**Forbidden rules** define what the agent must NOT do. If it writes to a file outside the allowed list, makes HTTP calls to unauthorized hosts, or leaks secrets in stdout, the run fails regardless of the invariant scores.

**Experiments** run your spec one or more times and aggregate the results. Run 10 replicas to measure consistency. Use a matrix to test different parameters. Compare experiments to catch regressions.

**Alerts** notify you via Slack or webhook when metrics cross a threshold. Set `pass_rate < 0.8` and get a message when your agent starts failing more often.

## Troubleshooting

### Experiment fails with "spec not found"

You need to upload the spec before creating an experiment. Run `ks.specs.create(yaml)` first, then `ks.experiments.create()`.

### Invariant says "file does not exist" but the agent should have created it

The invariant runs in the sandbox workspace directory. Make sure your agent writes files to the current working directory, not an absolute path. The workspace is the root -- if your agent writes to `/tmp/hello.txt`, the invariant looking for `hello.txt` won't find it.

### Agent times out

The default agent timeout is 5 minutes. For long-running tasks, increase it in the spec:

```yaml
agent:
  timeout: 15m
```

The sandbox-level timeout (under `resources.timeout`) must also be large enough to cover setup + agent execution + scoring.

### "sandbox rejected: at capacity"

The server has a limit on concurrent sandboxes. Wait for current experiments to finish or contact support to increase your limit.

## Next steps

- [Spec reference](/keystone/specs) -- every section of a spec file explained with examples
- [SDK reference](/keystone/sdk) -- all SDK methods across TypeScript, Python, and Go
- [Examples](/keystone/examples) -- full real-world specs and how to build agents that run inside Keystone

---

### Setup Guide

The complete procedure for wiring Keystone into an existing agent — written for humans and for AI coding assistants. Follow this page top to bottom.

You have an existing agent and you want to evaluate it with Keystone. This page is a strict procedure — follow it in order and you'll end with a working spec that passes on the first run.

---

## Step 0 — Install the SDK and set your API key

Generate the key at [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **API Keys**.

---

## Step 1 — Inspect the agent code

Before you write any spec, you need three facts about the agent:

Write these three facts down before proceeding. The rest of the guide is mechanical — fill in a template with what you found.

---

## Step 2 — Drop your provider keys into the Dashboard

For the secrets the team shares (`XAI_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.), go to [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **Secrets** tab and paste the values. They're AES-256-GCM encrypted at rest and shared with your teammates on the same billing account.

---

## Step 3 — Scaffold the spec from the template

Copy this template to `specs/scenario-1.yaml` and fill in the marked sections. Everything else stays as-is.

```yaml
version: 1
id: "

---

## Step 4 — Upload the agent and the spec

After this step, the experiment appears in the Dashboard as **Draft** — nothing is running yet.

---

## Step 5 — Review in the Dashboard, then run

Open [app.paragon.run/app/keystone/experiments](https://app.paragon.run/app/keystone/experiments). Click into your new experiment. You'll see:

- The spec YAML that was uploaded
- The secret names it declared (with warnings if any aren't set yet)
- The services it will boot
- The invariants that will score it

When it looks right, hit **Run**. The experiment moves from `draft` → `running` → `completed` (or `failed`). Watch results stream in.

Programmatic equivalent:

---

## Common first-run failures + fixes

| What you see | Cause | Fix |
| --- | --- | --- |
| `secret "X" declared in spec but not set in Dashboard or forwarded from .env` | The name in `secrets:` has no value at any layer | Either add it to the Dashboard or export it in your shell / `.env` |
| Sandbox boot hangs on `container_started`, agent never runs | No egress to your package registry for `npm install` / `pip install` | Either include `node_modules` / `venv` in the snapshot tarball OR add `registry.npmjs.org` / `pypi.org` to `network.egress.allow` |
| Every invariant fails with "file not found" | Invariant paths assume files the agent didn't actually write | Run the sandbox once manually, `ls /workspace`, see what's there, rewrite the invariant paths |
| Agent connects to `localhost:5432` and errors "connection refused" | Code uses `localhost`, but services are reachable by service name (`db`, `redis`, etc.) | Change connection strings to `postgres://db:5432/...`, or read `KEYSTONE_SERVICE_DB_HOST` env var |
| `llm_as_judge` fails with "no card on file" | Judges are billable and need a payment method | Add a card at [app.paragon.run/app/keystone/data/billing](https://app.paragon.run/app/keystone/data/billing) |
| Experiment stuck at `draft` forever | You uploaded it but never called `run` | Click **Run** in the Dashboard, or call `ks.experiments.run(exp.id)` / `runAndWait` |

---

## Related reading

- [Spec schema](/keystone/specs) — full field-by-field reference
- [Examples](/keystone/examples) — the canonical end-to-end example with all primitives
- [SDK methods](/keystone/sdk) — every method across TS/Python/Go
- [Billing](/keystone/billing) — how compute + LLM judge costs are billed

---

### Examples

Keystone specs from simple to complex, plus how to build agents that run inside Keystone. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content.

Examples are ordered from simple to complex. Start with the hello world to verify your setup, then work up to real-world scenarios.

## Canonical end-to-end example

This is the single pattern that covers 90% of real agent evals: your agent (uploaded as a snapshot), a Postgres database you seed, a mocked external API, a secret read from the caller's local `.env`, and invariants that verify the agent did the right thing.

Copy this spec and adapt — it's the "looks right" template most scenarios grow from.

### Project layout

```
my-agent/
├── .env                  # local secrets (gitignored)
├── src/
│   └── index.ts          # your agent code
├── seed.sql              # DB fixtures
└── specs/
    └── scenario-1.yaml
```

### `.env` (on the caller's machine, not committed)

```bash
XAI_API_KEY=xai-...
DB_PASSWORD=localdevpass
STRIPE_SECRET=sk_test_...
```

### `specs/scenario-1.yaml`

```yaml
version: 1
id: email-agent-renewal-alice
description: "Agent drafts a renewal email to a specific customer"

agent:
  type: snapshot
  snapshot: email-agent          # uploaded via ks.agents.upload()

# Declare every secret + where to pull it from.
# SDK resolves these on the caller's machine and forwards in the request.
secrets:
  - name: XAI_API_KEY
    source: env                  # from $XAI_API_KEY in your local env
  - name: DB_PASSWORD
    source: env                  # same — will flow into the service below
  - name: STRIPE_SECRET
    source: env
  - name: STRIPE_LIVE_KEY
    source: dashboard            # server-side only; never forwarded from local

# Services Keystone pulls from Docker Hub and runs on a shared network.
# The agent reaches each by `name` — e.g. postgres://db:5432.
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: northwind
    ports: [5432]
    wait_for: "pg_isready -h localhost"

  - name: stripe_mock
    image: stripe/stripe-mock:latest
    env:
      STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
    ports: [12111]

# Seed the DB before the agent starts.
fixtures:
  - type: sql
    service: db
    path: seed.sql

# Allowlist only the domains your agent needs to reach externally.
network:
  egress:
    default: deny
    allow:
      - api.x.ai                   # your LLM provider
      - keystone.polarity.so       # trace ingestion (auto-allowed in blessed runtime images)

task:
  prompt: "Draft a subscription renewal reminder email to alice@northwind.co."

invariants:
  draft_created:
    description: "Exactly one draft file"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "test $(ls drafts/*.md 2>/dev/null | wc -l) -eq 1"

  addressed_to_alice:
    description: "Draft is addressed to alice@northwind.co"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "grep -l 'To: alice@northwind.co' drafts/*.md"

  reads_professional:
    description: "Draft is professional and mentions the product"
    weight: 0.5
    check:
      type: llm_as_judge
      input_from: workspace
      criteria: "The draft is professional and references alice's product."
      # model defaults to paragon-md; billed per-token to your Keystone invoice

scoring:
  pass_threshold: 0.85
```

### Running it

### What happens behind the scenes

1. **SDK resolves secrets** on your machine per `source:` directives. `STRIPE_LIVE_KEY` isn't forwarded — server reads it from the Dashboard. `XAI_API_KEY`, `DB_PASSWORD`, `STRIPE_SECRET` are forwarded from `.env`.
2. **Keystone server** pulls `postgres:16`, `stripe/stripe-mock:latest`, and the blessed Node runtime image from Docker Hub.
3. **A new Docker network** `keystone-<id>` is created. All containers join it with their `name` as the DNS alias.
4. **Services start first**, with `{{ secrets.X }}` substituted to actual values. `seed.sql` runs against the Postgres container.
5. **Agent container** starts with your tarball mounted at `/workspace`, secrets injected as env vars (including `KEYSTONE_SERVICE_DB_HOST=db`, `KEYSTONE_SERVICE_DB_PORT=5432`, etc. for service discovery).
6. **Agent runs** the entrypoint, reads the task prompt from stdin, produces files in `/workspace/drafts/`. Traces stream to `keystone.polarity.so` over HTTPS.
7. **Scorer runs** invariants against the post-run workspace. `llm_as_judge` calls Paragon's model proxy (billed to your Keystone invoice).
8. **Teardown** — containers destroyed, network removed, workspace retained per your retention policy.

The entire lifecycle is declared in one spec + one `.env`. No Dockerfile, no separate build step for services, no manual secret entry.

---

## Hello world

The simplest possible spec. No external repos, no agent binary, no services. A shell command creates a file and the invariants check it exists with the right content.

```yaml
version: 1
id: "hello-world"
description: "Verify a command can create a file"

base: "ubuntu:24.04"

task:
  prompt: "Create hello.txt containing 'Hello, Keystone!'"

agent:
  type: cli
  binary: sh
  args: ["-c", "echo 'Hello, Keystone!' > hello.txt"]
  timeout: 1m

invariants:
  file_created:
    description: "hello.txt exists"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: hello.txt

  correct_content:
    description: "Contains the expected text"
    weight: 1.0
    check:
      type: file_content
      path: hello.txt
      contains: "Hello, Keystone!"

scoring:
  pass_threshold: 1.0
```

```typescript
const ks = new Keystone();
await ks.specs.create(readFileSync('hello.yaml', 'utf-8'));
const exp = await ks.experiments.create({ name: 'hello', spec_id: 'hello-world' });
const results = await ks.experiments.runAndWait(exp.id);
console.log(results.passed > 0 ? 'PASSED' : 'FAILED');
```

If this passes, your setup is working. Move on to a real scenario.

---

## Fix a failing test

A realistic spec that clones a repo, installs dependencies, and checks whether the agent can fix a broken test. Uses Paragon as the agent.

```yaml
version: 1
id: "fix-unit-test"
description: "Agent fixes a failing unit test"

base: "ubuntu:24.04"
setup:
  packages: [nodejs, npm]
  commands: ["npm install"]

fixtures:
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "broken-test"

task:
  prompt: |
    The test in src/utils.test.ts is failing. Fix it.
    Do not change the implementation in src/utils.ts.

agent:
  type: paragon
  model: paragon-fast
  timeout: 3m

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"

  no_impl_changes:
    description: "Implementation file was not modified"
    weight: 0.5
    check:
      type: command_exit
      command: "git diff --name-only | grep -v test"
      exit_code: 1

scoring:
  pass_threshold: 1.0
```

**What's new here:**
- `fixtures` clones a real git repo into the sandbox
- `setup.commands` installs npm dependencies before the agent runs
- The second invariant (`no_impl_changes`) verifies the agent only changed test files, not the implementation -- it expects `grep` to find no matches (exit code 1)

---

## Full stack with services

A complex spec that tests whether an agent can reconcile two database tables and send a summary email. Uses a real Postgres database, a mock SMTP server, LLM-as-judge scoring, and forbidden rules.

```yaml
version: 1
id: "reconciliation"
description: "Agent reconciles two databases and emails a summary"

base: "ubuntu:24.04"
setup:
  packages: [nodejs, npm, python3]
  env:
    DATABASE_URL: "postgres://postgres:test@db:5432/testdb"

services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready -U postgres"
  - name: smtp
    type: http_mock
    ports: [9090]
    record: true
    routes:
      - method: POST
        path: /v1/send
        response: '{"ok": true}'

secrets:
  - name: DB_PASSWORD
    from: generated
    scope: env

fixtures:
  - type: sql
    service: db
    path: seeds/schema.sql
  - type: drift
    target: db.customers_a
    strategy: random_mismatches
    count: 15
    seed: "42"

network:
  egress:
    default: deny
    allow: [registry.npmjs.org]
  dns_overrides:
    smtp.sendgrid.net: smtp.services.internal

task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co

agent:
  type: paragon
  model: paragon-max
  timeout: 5m

invariants:
  databases_match:
    description: "No mismatches after reconciliation"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0

  email_sent:
    description: "Exactly one summary email was sent"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1

  email_quality:
    description: "Email is professional and accurate"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Professional tone, accurate mismatch count"
      input_from: smtp.last_request.body

forbidden:
  db_writes_outside: [customers_a, customers_b, audit_log]
  http_except: [smtp]
  secrets_in_logs: deny

scoring:
  pass_threshold: 0.95

parallelism:
  replicas: 10
  isolation: per_run

determinism:
  seed: 42

teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/db.sql"
```

**What's new here:**
- `services` starts a real Postgres container and a mock SMTP server on a shared Docker network
- `fixtures` loads SQL seeds into Postgres and injects random data drift
- `secrets` generates a fresh random password each run
- `network` blocks all outbound traffic except npm and redirects `smtp.sendgrid.net` to the mock
- `forbidden` ensures the agent only writes to allowed tables and doesn't leak secrets
- `invariants` uses three different check types: SQL query, mock assertion, and LLM-as-judge
- `parallelism` runs 10 replicas to measure consistency
- `teardown` exports the audit log and a database dump after every run

---

## Comparing agents

Use `matrix` to test the same task with different configurations side by side.

```yaml
version: 1
id: "agent-comparison"
description: "Compare agent performance across models"

base: "ubuntu:24.04"
setup:
  packages: [python3, pip]
  commands: ["pip install -r requirements.txt"]

fixtures:
  - type: git_repo
    url: "https://github.com/your-org/ml-pipeline"
    branch: "main"

task:
  prompt: "Fix the data pipeline bug that causes duplicate records."

agent:
  type: paragon
  timeout: 5m

invariants:
  no_duplicates:
    description: "No duplicate records in output"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "python3 check_duplicates.py"

scoring:
  pass_threshold: 0.8

parallelism:
  replicas: 5
  matrix:
    - model: "paragon-fast"
    - model: "paragon-max"
```

This creates 10 total runs (2 models x 5 replicas). After both finish, compare them:

```typescript
const comparison = await ks.experiments.compare(fastExpId, maxExpId);
console.log(`Regressed: ${comparison.regressed}`);
comparison.metrics.forEach(m => {
  console.log(`${m.name}: ${m.baseline} -> ${m.candidate} (${m.direction})`);
});
```

---

## Setting up alerts

Get notified when your agent's performance drops.

```typescript
// Slack Bot alert (posts to a channel)
await ks.alerts.create({
  name: 'pass-rate-drop',
  eval_id: 'reconciliation',
  condition: 'pass_rate < 0.9',
  notify: 'slack',
  slack_channel: '#agent-alerts',
});

// Webhook alert (Slack incoming webhook auto-detected)
await ks.alerts.create({
  name: 'cost-spike',
  condition: 'mean_cost_per_run_usd > 2.00',
  notify: 'webhook',
  webhook_url: 'https://hooks.slack.com/services/T00/B00/xxx',
});
```

Alerts evaluate after every experiment run. See [SDK reference > Alerts](/keystone/sdk#alerts) for all supported metrics and operators.

---

## Using sandboxes directly

You don't have to run full experiments. You can create a sandbox from a spec, interact with it manually, and inspect the results. This is useful for development and debugging.

```typescript

const ks = new Keystone();

// Create a sandbox
const sb = await ks.sandboxes.create({ spec_id: 'fix-unit-test' });
console.log(`Sandbox ${sb.id} is ${sb.state}`);
console.log(`Services:`, sb.services);
// { db: { host: "db", port: 5432, ready: true } }

// Run commands
const test = await ks.sandboxes.runCommand(sb.id, { command: 'npm test' });
console.log(`Exit code: ${test.exit_code}`);
console.log(`Output: ${test.stdout}`);

// Read and write files
const src = await ks.sandboxes.readFile(sb.id, 'src/utils.ts');
await ks.sandboxes.writeFile(sb.id, 'src/utils.ts', fixedCode);

// See what changed
const diff = await ks.sandboxes.diff(sb.id);
console.log(`Modified: ${diff.modified}`);
console.log(`Added: ${diff.added}`);

// Clean up
await ks.sandboxes.destroy(sb.id);
```

This gives you a remote dev environment with services pre-configured. You can use it for manual testing, scripted integration tests, or building custom evaluation pipelines.

---

## Building an agent for Keystone

If you're building an agent that's designed to run inside Keystone sandboxes, here's what you need to know from the agent's perspective.

### What your agent gets

When Keystone runs your agent, it sets up the environment before your code starts:

1. **A workspace directory** -- your agent's working directory, seeded with any fixtures (git repos, files)
2. **Backing services** -- Postgres, Redis, etc. are already running and reachable by name
3. **Environment variables** -- connection info for every service, the sandbox ID, and the Keystone API URL
4. **The task prompt** -- passed via stdin (for cli/paragon/python agents) or in the HTTP request body (for http agents)

### Environment variables

Keystone injects these into every agent process:

| Variable | Example | Description |
|----------|---------|-------------|
| `KEYSTONE_SANDBOX_ID` | `sb-abc123` | Current sandbox ID |
| `KEYSTONE_BASE_URL` | `http://keystone:8012` | Keystone API URL |
| `KEYSTONE_SERVICE_DB_HOST` | `db` | Hostname for the "db" service |
| `KEYSTONE_SERVICE_DB_PORT` | `5432` | Port for the "db" service |
| `KEYSTONE_SERVICE_CACHE_HOST` | `cache` | Hostname for the "cache" service |
| `KEYSTONE_SERVICE_CACHE_PORT` | `6379` | Port for the "cache" service |

One pair of `HOST`/`PORT` vars is created for each service in the spec, with the service name uppercased.

### Discovering services with the SDK

Instead of parsing env vars manually, use `Keystone.fromSandbox()`:

### Auto-tracing LLM calls

Wrap your LLM client so Keystone automatically captures token usage, latency, and tool calls. This data shows up in traces and can be queried by agent name and version.

```typescript

const { client: ks, sandbox } = await Keystone.fromSandbox();
const anthropic = ks.wrap(new Anthropic());   // sandbox id auto-detected from env

// Every messages.create() call now auto-reports to Keystone
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  messages: [{ role: 'user', content: 'Fix the bug in main.py' }],
});
```

### Custom trace spans

For non-LLM operations (file I/O, API calls, test execution), use `traced()` to capture timing and errors:

```typescript

const testOutput = await traced('run_tests', async () => {
  return execSync('npm test').toString();
});

const analysis = await traced('analyze_results', async () => {
  return parseTestOutput(testOutput);
});
```

Nested `traced()` calls create parent-child spans automatically, so you get a full execution tree in the trace viewer.

### Stdout and stderr

Everything your agent prints to stdout and stderr is captured and returned in the experiment results:

- `scenarios[].agent_output` -- your agent's stdout (truncated to 50KB)
- `scenarios[].agent_stderr` -- your agent's stderr (truncated to 50KB)

Use stdout for normal output and stderr for debug logging. If `audit.stdout_capture` is enabled in the spec, stdout is also scanned by the `secrets_in_logs` forbidden rule.

You can view the captured output when debugging failures:

```typescript
const results = await ks.experiments.runAndWait(exp.id);
for (const scenario of results.scenarios) {
  if (scenario.status === 'fail') {
    console.log('Agent output:', scenario.agent_output);
    console.log('Agent stderr:', scenario.agent_stderr);
  }
}
```

### What your agent should NOT do

If the spec has `forbidden` rules, Keystone monitors your agent's behavior via the audit log. Common restrictions:

- **Don't write to files outside the allowed list** -- if `file_writes_outside: [src/, config/]` is set, writing to `/etc/passwd` fails the run
- **Don't make HTTP calls to unauthorized hosts** -- if `http_except: [payment-api]` is set, calling `api.stripe.com` directly fails the run
- **Don't print secrets to stdout** -- if `secrets_in_logs: deny` is set, any output containing AWS keys, private keys, or Stripe tokens fails the run

These rules are evaluated after your agent finishes. Your agent won't be interrupted mid-execution, but the scenario will be marked as failed if any rule is violated.

---

### Spec Reference

Every section of a Keystone spec file, explained with examples. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content.

A spec is a YAML file that describes a complete test scenario. It tells Keystone what environment to create, what the agent should do inside it, and how to determine whether the agent succeeded.

A minimal spec needs just 5 fields: `version`, `id`, `base`, `task`, and `invariants`. Everything else is optional and adds capabilities as you need them.

## Sections overview

**Required for every spec:**

| Section | What it does |
|---------|-------------|
| `task` | The prompt and context for what the agent should do |
| `base` | Base Docker image for the sandbox (e.g. `ubuntu:24.04`) |
| `agent` | How the agent connects and runs |
| `invariants` | Pass/fail checks that run after the agent finishes |
| `scoring` | How check results are combined into a final score |

**Optional -- add when you need them:**

| Section | What it does | When to use |
|---------|-------------|-------------|
| `setup` | Packages to install, commands to run, files to create | When the agent needs dependencies pre-installed |
| `resources` | Limits: timeout, memory, CPU, disk | When defaults (2 CPU, 2GB, 10min) aren't enough |
| `fixtures` | Seed data: clone a repo, load SQL, copy files | When the agent needs a codebase or database to work with |
| `services` | Databases, caches, mock APIs alongside the sandbox | When the agent interacts with external services |
| `secrets` | Test credentials -- generated fresh or from a vault | When the agent needs API keys or passwords |
| `network` | Control outbound access, redirect APIs to mocks | When you want to block real API calls during testing |
| `audit` | Record DB writes, HTTP calls, file changes | When you need forbidden checks or detailed logs |
| `forbidden` | Things the agent must NOT do | When you want to enforce boundaries on agent behavior |
| `parallelism` | Run multiple times with different parameters | When you want to measure consistency or compare configs |
| `determinism` | Pin the clock, seed randomness, freeze DNS | When you need reproducible runs |
| `teardown` | Export audit logs and database dumps | When you want to keep artifacts after each run |

---

## `task`

The prompt that gets sent to the agent. This is the only required field.

```yaml
task:
  prompt: |
    Fix the failing test in src/utils.test.ts so that all tests pass.
    Do not modify the source code in src/utils.ts.
  context:
    repo: "acme/backend"
    language: "typescript"
```

The `context` field is optional key-value metadata passed alongside the prompt. Useful for giving the agent additional structured information.

## `base`

The Docker image used as the sandbox's operating system. Required.

```yaml
base: "ubuntu:24.04"
```

Any public Docker image works. For custom images, make sure the keystone server can pull from your registry.

## `setup`

Runs before the agent starts. Use this to install dependencies, create config files, and set environment variables.

```yaml
setup:
  packages: [nodejs, npm, python3, git]
  commands:
    - "npm install"
    - "python3 -m pip install pytest"
  files:
    - path: ".env"
      content: |
        DATABASE_URL=postgres://postgres:test@db:5432/testdb
        NODE_ENV=test
  env:
    CI: "true"
```

Setup commands run in the sandbox workspace directory. If any command fails, sandbox creation fails and the experiment reports an error.

## `resources`

Resource limits for the sandbox. If not specified, defaults to 2 CPUs, 2GB memory, and a 10-minute timeout.

```yaml
resources:
  timeout: 10m
  memory: 4Gi
  cpu: 4
  disk: 20Gi
```

The timeout controls how long the entire sandbox lifecycle runs (including setup, agent execution, and scoring). If the agent exceeds this, it gets killed and the scenario is marked as an error.

## `fixtures`

Seed data that gets loaded into the sandbox before the agent starts. Fixtures run after services boot but before the snapshot is extracted into `/workspace`, so anything that touches the filesystem needs to either inline its data or ship it alongside the spec.

```yaml
fixtures:
  # Load SQL inline — preferred for test seeds. Self-contained in the spec.
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers (
        email TEXT PRIMARY KEY,
        name  TEXT NOT NULL
      );
      INSERT INTO customers VALUES
        ('alice@example.com', 'Alice'),
        ('ben@example.com',   'Ben');

  # Clone a git repo into the workspace
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "main"
    depth: 1

  # Copy a directory into the sandbox
  - type: directory
    source: /path/to/test-data
    target: data/

  # Inject random data drift for adversarial testing
  - type: drift
    target: db.customers
    strategy: random_mismatches
    count: 15
    seed: "42"
```

**Fixture types:**

| Type | What it does |
|------|-------------|
| `sql` | Runs SQL against a database service. Provide **either** `sql:` (inline script, preferred) **or** `path:` (file in workspace). Requires `service`. Credentials come from the service's declared `POSTGRES_USER` / `POSTGRES_PASSWORD` / `POSTGRES_DB` env (defaults to `postgres` / `test` / `testdb`). |
| `git_repo` | Clones a repository into the sandbox. Supports `branch`, `depth`, and `path`. |
| `directory` | Copies files from `source` to `target` inside the sandbox. Both paths are relative to the sandbox workspace directory. |
| `drift` | Injects random data modifications for adversarial testing. Uses `strategy`, `count`, and `seed`. |

## `services`

Services are backing containers your agent talks to during a run — Postgres, Redis, Stripe-mock, anything. You declare them in the spec; Keystone pulls the image and starts the container on a shared Docker network. Your agent reaches each by its `name` (DNS alias), so the same connection strings work across every run.

### Any public or private Docker image works

```yaml
services:
  - name: db
    image: postgres:16                           # Docker Hub
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: northwind
    ports: [5432]
    wait_for: "pg_isready -U postgres"

  - name: cache
    image: redis:7                               # Docker Hub
    ports: [6379]

  - name: stripe_mock
    image: stripe/stripe-mock:latest             # third-party public image
    env:
      STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
    ports: [12111]

  - name: vector
    image: ghcr.io/qdrant/qdrant:v1.7.4          # any registry — GHCR, ECR, GAR
    ports: [6333]
```

No Dockerfile. No `docker push`. Keystone pulls on demand (cached after first run) and wires the container into the sandbox network with `--network keystone-<sandbox-id> --network-alias <service-name>`.

### Inside the agent, reach services by name

Once the sandbox is up, each service's `name` resolves over DNS on the shared network:

```typescript
// Agent code — no IP lookup, no port mapping, no env configuration needed
const db     = new Client({ connectionString: "postgres://postgres:secret@db:5432/northwind" });
const cache  = new Redis("redis://cache:6379");
const stripe = new Stripe(process.env.STRIPE_SECRET, { host: "stripe_mock", port: 12111 });
```

Keystone also injects `KEYSTONE_SERVICE_

## `secrets`

The spec is the single source of truth for what secrets your agent needs and where each value comes from. You declare the name and (optionally) the `source:` — the SDK resolves, forwards in the create request, and the Keystone server injects into the sandbox container env.

```yaml
secrets:
  - name: XAI_API_KEY
    source: env                            # read from $XAI_API_KEY on the caller's machine

  - name: DB_PASSWORD
    source: env:MY_DB_PASS                 # rename — read $MY_DB_PASS, inject as DB_PASSWORD

  - name: OPENAI_API_KEY
    source: "file:~/.openai/key"           # read + trim file contents

  - name: INTERNAL_TOKEN
    source: command:op read "op://Dev/Keystone/token"
                                           # exec a shell command, capture stdout

  - name: STRIPE_LIVE_KEY
    source: dashboard                      # server-side only — SDK refuses local overrides

  - name: TEST_FIXTURE_TOKEN
    from: "static://fake-test-value"       # spec-owned literal (deterministic)

  - name: GENERATED_DB_PASSWORD
    from: generated                        # random 32-byte hex per run
```

### Source types

| Source | Resolution | Use for |
|--------|-----------|---------|
| `env` *(default if omitted)* | `process.env[NAME]` | Local `.env` during dev; `export KEY=...` in CI |
| `env:OTHER_NAME` | `process.env[OTHER_NAME]` | Rename (env var on your machine uses a different name than the sandbox expects) |
| `file:path` | Trimmed contents of the file (supports `~/`) | Secrets managed by an external tool that writes to a file (e.g. Cloud SDK auth) |
| `command:<shell>` | Exec the shell command, trim stdout | CLI-based secret managers (1Password, Doppler, Infisical, Vault CLI) |
| `dashboard` | Server decrypts Dashboard-stored value; SDK forwards nothing | Prod-critical keys that must never be overridable from a local machine |
| `from: "static://..."` | Spec-owned literal; wins over every other source | Deterministic test fixtures |
| `from: generated` | Random 32-byte hex, unique per run | Ephemeral test passwords |

### Precedence

Highest wins:

1. **Spec literal** (`from: static://...` or `from: generated`) — deterministic, never overridable
2. **SDK-forwarded source value** (`env`, `env:X`, `file:`, `command:`) — resolved on the caller's machine
3. **Dashboard Secret** — server-side fallback when the source is `dashboard` or resolution failed

A declared secret that can't be resolved by any layer fails the sandbox boot with the missing name — no silent empties.

### How the SDK reads it

When you call `ks.experiments.create()` or `ks.sandboxes.create()` with a `specPath` (TS) / `spec_path` (Python), the SDK:

1. Parses the spec's `secrets:` block
2. For every entry with a `source:` that isn't `dashboard`, resolves the value locally
3. Forwards `{NAME: value}` in the create request body
4. Server merges with Dashboard secrets (forwarded values win), injects as container env

Entries with `from:` (literals) and `source: dashboard` are not forwarded — the server resolves those at sandbox boot.

### Using secrets in services

Service env values can reference any resolved secret via `{{ secrets.NAME }}`:

```yaml
secrets:
  - name: DB_PASSWORD
    source: env

services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"   # substituted at boot
      POSTGRES_DB: northwind
    ports: [5432]
```

Same source and precedence rules apply — the substitution uses whatever value the resolver produced.

### Dashboard Secrets tab (team sharing + prod baseline)

Go to [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **Secrets** tab to store values encrypted at rest. Dashboard values:

- Auto-inject into every sandbox when `source: dashboard` or the caller has no local value for a declared name
- Are scoped to the billing owner: on a team, all teammates share the same secrets
- Are AES-256-GCM encrypted; decrypted only in-process on the Keystone server
- Show a warning next to each key that's declared by a spec but not set

## `network`

Controls what the sandbox can access on the network. Use this to prevent your agent from calling real APIs during testing.

```yaml
network:
  egress:
    default: deny           # block all outbound by default
    allow:
      - registry.npmjs.org  # let npm install work
      - github.com
      - "*.services.internal"  # internal service network
  dns_overrides:
    smtp.sendgrid.net: smtp.services.internal  # redirect real API to mock
```

When `egress.default` is `deny`, only explicitly allowed hosts are reachable. DNS overrides redirect hostnames to different targets -- useful for pointing real API domains at mock services.

## `audit`

Records everything the agent does inside the sandbox. The audit log is used by forbidden checks and can be exported during teardown.

```yaml
audit:
  db_writes: true           # track INSERT/UPDATE/DELETE
  http_calls: true          # track outbound HTTP requests
  process_spawns: true      # track child processes
  stdout_capture: true      # capture stdout for secret detection
  file_system:
    watch: ["src/", "config/"]  # directories to monitor
    track: [writes, reads, deletes]
```

## `agent`

Defines how the agent runs inside the sandbox. Keystone supports six agent types.

```yaml
# Run the Paragon CLI
agent:
  type: paragon
  model: paragon-fast
  timeout: 5m

# Run any CLI binary
agent:
  type: cli
  binary: /usr/local/bin/my-agent
  args: ["--task", "{{ task.prompt }}"]
  timeout: 5m

# Call an HTTP endpoint
agent:
  type: http
  endpoint: "https://api.example.com/agent/run"
  auth:
    bearer: "{{ secrets.API_KEY }}"
  timeout: 5m

# Run a Python script
agent:
  type: python
  binary: agent.py
  timeout: 5m

# Run a Docker image
agent:
  type: image
  image: "myregistry.io/my-agent:v3"
  timeout: 5m

# Run from an uploaded agent snapshot
agent:
  type: snapshot
  snapshot: my-agent        # resolves to latest version
  # or pin a specific version:
  # snapshot_id: snap_abc123
  timeout: 5m
```

**Agent types:**

| Type | What it means |
|------|-------------|
| `paragon` | Runs the Paragon CLI with the task prompt on stdin |
| `cli` | Runs any binary with args. Template variables like `{{ task.prompt }}` are substituted |
| `http` | POSTs a JSON payload to an HTTP endpoint |
| `python` | Runs a Python script with the task as JSON on stdin |
| `image` | Pulls and runs a Docker image on the sandbox's service network |
| `snapshot` | Downloads and runs an immutable agent snapshot uploaded via `ks.agents.upload()` |

The task prompt is passed to the agent via stdin (for cli, paragon, python) or in the request body (for http). The agent has full access to the sandbox filesystem and can run commands, write files, and connect to services.

## `invariants`

The checks that determine whether the agent passed or failed. Each invariant has a `weight` (how much it matters) and a `check` (what to evaluate).

Gate invariants (`gate: true`) cause an immediate fail if they don't pass, regardless of other scores.

```yaml
invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true              # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"

  output_correct:
    description: "Output file contains expected data"
    weight: 0.5
    check:
      type: file_content
      path: output.json
      contains: '"status": "success"'

  no_debug_code:
    description: "No console.log left in source"
    weight: 0.2
    check:
      type: file_content
      path: src/main.ts
      not_contains: "console.log"

  email_sent:
    description: "Exactly one email was sent"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "user@example.com" }
          equals: 1

  code_quality:
    description: "Code is clean and well-structured"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Evaluate the code for readability, correctness, and minimal diff"
      input_from: src/main.ts
      rubric:
        pass: "Clean, minimal change that fixes the issue"
        fail: "Over-engineered, introduces unnecessary complexity"
      pass_threshold: 0.6
```

**Invariant types:**

| Type | What it checks |
|------|-------------|
| `command_exit` | Runs a command and checks the exit code (default: 0) |
| `file_exists` | Checks that a file exists at a path |
| `file_absent` | Checks that a file does NOT exist |
| `file_content` | Checks that a file contains/doesn't contain a string or regex pattern |
| `sql` | Runs a SQL query against a database service and checks the result |
| `http_mock_assertions` | Checks requests recorded by a mock service |
| `custom` | Runs a Python script that returns `{"passed": true/false, "reason": "..."}` |
| `llm_as_judge` | Uses an LLM to evaluate subjective criteria with a rubric |

### How scoring works

Each invariant produces a score between 0 and 1 (pass = 1, fail = 0, llm_as_judge = continuous). The composite score is the weighted average:

```
composite = sum(weight * score) / sum(weight)
```

If any gate invariant fails, the composite score is forced to 0 regardless of other results. The scenario passes if the composite score meets the `scoring.pass_threshold`.

## `forbidden`

Trajectory constraints that check the agent's behavior, not just its output. These use the audit log to detect unauthorized actions.

```yaml
forbidden:
  # Only allow DB writes to these tables
  db_writes_outside: [users, orders, audit_log]

  # Only allow HTTP calls to these services
  http_except: [payment-api, smtp]

  # Fail if secrets appear in stdout
  secrets_in_logs: deny

  # Only allow file writes to these paths
  file_writes_outside: [src/, config/, output/]
```

If any forbidden rule is violated, the scenario fails regardless of the invariant score.

## `scoring`

How invariant results are combined into a final verdict.

```yaml
scoring:
  pass_threshold: 0.9          # composite score must be >= 0.9 to pass

  # For multi-replica experiments
  replica_aggregation:
    strategy: majority          # all_must_pass, majority, or percentage
    min_pass_rate: 0.8
```

## `parallelism`

Run the same scenario multiple times and/or with different parameters.

```yaml
parallelism:
  replicas: 10                 # run 10 times
  isolation: per_run           # fresh sandbox per run

  # Test with different parameters
  matrix:
    - model: "gpt-4o"
    - model: "claude-sonnet-4-20250514"
    - model: "gemini-pro"
```

With this config, Keystone runs 3 (matrix entries) x 10 (replicas) = 30 total scenarios.

## `determinism`

Pin sources of non-determinism for reproducible runs.

```yaml
determinism:
  clock: "2026-01-01T00:00:00Z"   # fixed timestamp
  seed: 42                         # deterministic RNG
  dns: static                     # static DNS resolution
```

When a scenario fails, the `reproducer` in the results includes the seed, so you can re-run the exact same scenario.

## `teardown`

Export artifacts after each run. Runs even if the agent fails when `always_run` is set.

```yaml
teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/db.sql"
    - type: snapshot
      to: "results/final-state/"
    - type: mock_requests
      service: payment-api
      to: "results/api-calls.json"
```

---

### SDK Reference

All Keystone SDK methods across TypeScript, Python, and Go. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content.

The Keystone client has seven services: `sandboxes`, `specs`, `experiments`, `alerts`, `agents`, `datasets`, and `scoring`. All examples use TypeScript -- Python and Go follow the same patterns.

## Client setup

Get your API key from [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **API Keys** tab → **Create Key**. Keys start with `ks_live_` and are shown once at creation. Either pass it to the client directly or set `KEYSTONE_API_KEY` in your environment.

---

## Secrets

Secrets are declared in your spec with a `source:` field that tells the SDK where to pull each value from — your local `.env`, a file on disk, a shell command (Vault / 1Password / Doppler), or the Dashboard. See the full list of source types in [Specs → secrets](/keystone/specs#secrets).

### Auto-forwarding from a spec file

Pass `specPath` (TS) / `spec_path` (Python) and the SDK reads the spec's `secrets:` block, resolves each declared source, and forwards the resulting `{name: value}` map in the create request.

### Precedence

Highest wins:

1. **Spec literal** (`from: static://...`) — deterministic fixtures
2. **SDK-forwarded** source value (`env`, `env:X`, `file:`, `command:`)
3. **Dashboard Secret** — server-side fallback for missing or `source: dashboard` entries

A declared secret that resolves to nothing at any layer fails the sandbox boot loudly — no silent empties.

### Dashboard as the team/prod baseline

The [Dashboard Secrets tab](https://app.paragon.run/app/keystone/settings) stores AES-256-GCM-encrypted values scoped to the billing owner. Use it when:

- A secret must be shared across teammates without everyone maintaining their own `.env`
- Running in CI/prod where no `.env` exists on the machine
- A prod-critical key must refuse any local override (declare with `source: dashboard`)

---

## Sandboxes

Sandboxes are isolated environments where your agent runs. Create one from a spec, interact with it, then destroy it.

### `sandboxes.create(opts)`

```typescript
const sb = await ks.sandboxes.create({
  spec_id: 'fix-failing-test',  // required: which spec to use
  timeout: '10m',               // optional: auto-cleanup timer
  metadata: { run: 'test-1' },  // optional: key-value pairs for tracking
});
// Returns: { id, spec_id, state, path, url, created_at, metadata, services }
```

The `services` field contains connection info for any backing services defined in the spec:
```typescript
sb.services.db   // { host: "db", port: 5432, ready: true }
sb.services.cache // { host: "cache", port: 6379, ready: true }
```

### `sandboxes.get(id)` / `sandboxes.list()` / `sandboxes.destroy(id)`

```typescript
const sb = await ks.sandboxes.get('sb-abc123');
// sb.state: 'creating' | 'ready' | 'running' | 'stopped' | 'error'

const all = await ks.sandboxes.list();

await ks.sandboxes.destroy('sb-abc123');
```

### `sandboxes.runCommand(id, opts)`

Run a shell command inside the sandbox.

```typescript
const result = await ks.sandboxes.runCommand('sb-abc123', {
  command: 'npm test',
  timeout: '2m',
});
// Returns: { command, stdout, stderr, exit_code, duration_ms }
```

### File operations

```typescript
// Read
const content = await ks.sandboxes.readFile('sb-abc123', 'src/utils.ts');

// Write
await ks.sandboxes.writeFile('sb-abc123', 'src/utils.ts', 'const x = 1;');

// Delete
await ks.sandboxes.deleteFile('sb-abc123', 'tmp/debug.log');
```

### State and diffing

```typescript
// Full filesystem snapshot (files + checksums)
const snapshot = await ks.sandboxes.state('sb-abc123');
// Returns: { captured_at, files: { [path]: { size, mode, checksum } } }

// What changed since sandbox creation
const diff = await ks.sandboxes.diff('sb-abc123');
// Returns: { added: string[], removed: string[], modified: string[] }
```

### Trace ingestion

Post trace events to a sandbox. The `wrap()` helper does this automatically for LLM calls, but you can also call it directly.

```typescript
await ks.sandboxes.ingestTrace('sb-abc123', [
  { event_type: 'tool_call', tool: 'write_file', phase: 'end', status: 'ok', duration_ms: 120 },
]);

const trace = await ks.sandboxes.getTrace('sb-abc123');
// Returns: { events: TraceEvent[], metrics: TraceMetrics }
```

### Real-time events (SSE)

Stream sandbox lifecycle events in real-time using Server-Sent Events:

```
GET /v1/sandboxes/:id/events
```

Events include status changes (`creating`, `ready`, `running`, `destroyed`), service startup, fixture application, and command execution. Useful for building dashboards or progress indicators.

```typescript
const eventSource = new EventSource(
  `${baseUrl}/v1/sandboxes/sb-abc123/events`
);
eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(`[${data.event_type}]`, data.data);
};
```

---

## Specs

Upload and manage spec YAML files.

```typescript
// Upload a spec
const spec = await ks.specs.create(readFileSync('my-spec.yaml', 'utf-8'));

// Get, list, delete
const spec = await ks.specs.get('fix-failing-test');
const specs = await ks.specs.list();
await ks.specs.delete('fix-failing-test');
```

Specs are versioned automatically. Each upload to the same `id` creates a new version.

---

## Experiments

Run your spec across scenarios and score the results.

### `experiments.create(opts)` / `experiments.run(id)`

```typescript
const exp = await ks.experiments.create({
  name: 'baseline-v1',
  spec_id: 'fix-failing-test',
});

// Trigger async (returns immediately)
await ks.experiments.run(exp.id);
```

### `experiments.runAndWait(id, opts?)`

Trigger and poll until complete. This is the most common way to run experiments.

### Results structure

| Field | Type | Description |
|-------|------|-------------|
| `total_scenarios` | number | Total scenarios run |
| `passed` / `failed` / `errors` | number | Counts |
| `metrics.pass_rate` | number | 0.0 to 1.0 |
| `metrics.mean_wall_ms` | number | Average latency |
| `metrics.p95_wall_ms` | number | 95th percentile latency |
| `metrics.total_cost_usd` | number | Total cost |
| `metrics.mean_cost_per_run_usd` | number | Cost per scenario |
| `metrics.tool_success_rate` | number | 0.0 to 1.0 |
| `scenarios` | array | Per-scenario results with invariants and reproducers |

### `experiments.compare(baselineId, candidateId)`

Compare two experiments. Detects regressions in pass rate, cost, and latency.

```typescript
const comparison = await ks.experiments.compare('exp-baseline', 'exp-new');
// Returns: {
//   regressed: boolean,
//   regressions: ["pass_rate dropped from 90% to 60%"],
//   metrics: [{ name, baseline, candidate, delta, direction }]
// }
```

### `experiments.metrics(id)`

Detailed metrics with tool breakdown and trends over time.

```typescript
const metrics = await ks.experiments.metrics(exp.id);
// Returns: { summary, tool_breakdown, cost_trend, pass_rate_trend }
```

---

## Alerts

Alert rules notify you when experiment metrics cross a threshold. Alerts are persisted and survive server restarts.

Conditions use the format `<metric> <operator> <value>`.

**Metrics:** `pass_rate`, `mean_wall_ms`, `p95_wall_ms`, `total_cost_usd`, `mean_cost_per_run_usd`, `tool_success_rate`, `side_effect_violations`, `mean_tool_calls`

**Operators:** `<`, `<=`, `>`, `>=`, `==`, `!=`

### Webhook alerts

```typescript
await ks.alerts.create({
  name: 'pass-rate-drop',
  eval_id: 'fix-failing-test',      // optional: only fire for this spec
  condition: 'pass_rate < 0.8',
  notify: 'webhook',
  webhook_url: 'https://hooks.slack.com/services/T00/B00/xxx',
});
```

Slack webhook URLs are auto-detected and receive rich Block Kit messages. Other URLs receive the raw JSON payload.

### Slack Bot alerts

Post directly to a Slack channel using a bot token (`SLACK_BOT_TOKEN` env var on the server):

```typescript
await ks.alerts.create({
  name: 'cost-spike',
  condition: 'mean_cost_per_run_usd > 2.00',
  notify: 'slack',
  slack_channel: '#agent-alerts',
});
```

### `alerts.list()` / `alerts.delete(id)`

```typescript
const alerts = await ks.alerts.list();
await ks.alerts.delete('alert-abc123');
```

---

## Agents

Agent snapshots are immutable, versioned bundles of your agent code. Upload them and reference them in specs with `agent.type: snapshot`.

### `agents.upload(opts)`

```typescript
const snapshot = await ks.agents.upload({
  name: 'my-agent',
  entrypoint: ['python', 'main.py'],
  runtime: 'python3.12',
  tag: 'latest',
  bundle: tarballBytes,  // Uint8Array of the .tar.gz
});
// Returns: { id, name, version, tag, digest, size_bytes, entrypoint, created_at }
```

### `agents.get(name, opts?)`

```typescript
const latest = await ks.agents.get('my-agent');
const tagged = await ks.agents.get('my-agent', { tag: 'stable' });
const specific = await ks.agents.get('my-agent', { version: 3 });
```

### `agents.list(opts?)` / `agents.listVersions(name, opts?)`

```typescript
const page = await ks.agents.list({ limit: 50 });
// Returns: { items: AgentSnapshot[], next_cursor?: string }

const versions = await ks.agents.listVersions('my-agent');
```

### `agents.delete(snapshot)`

Pass the full snapshot object, not just the ID.

```typescript
const snapshot = await ks.agents.get('my-agent', { version: 1 });
await ks.agents.delete(snapshot);
```

### Agent traces

Every trace is tagged with the agent that produced it. Query by agent name and version:

```
GET /v1/agents/my-agent/traces
GET /v1/agents/my-agent/traces?version=3
GET /v1/agents/my-agent/traces?limit=100
```

Returns traces plus computed metrics (tool success rate, latency percentiles, per-tool breakdown).

---

## LLM tracing

### `ks.wrap(client)`

Wrap an Anthropic or OpenAI client so every call automatically reports traces to the current sandbox. Sandbox routing is automatic — the SDK reads `KEYSTONE_SANDBOX_ID` from the environment (Keystone injects it when your agent runs inside a sandbox). Outside a sandbox there's nothing to route to, so `wrap()` returns the client untouched and your code runs as normal.

```typescript
const anthropic = ks.wrap(new Anthropic());
const openai = ks.wrap(new OpenAI());
// .create() calls now auto-report LLM usage, tool calls, and latency
// when running inside a sandbox. Locally, they pass through unchanged.
```

Works with any OpenAI-compatible provider:

| Provider | How to wrap |
|----------|-------------|
| Anthropic | `ks.wrap(new Anthropic())` |
| OpenAI | `ks.wrap(new OpenAI())` |
| Groq | `ks.wrap(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), { sandboxId })` |
| xAI | `ks.wrap(new OpenAI({ baseURL: 'https://api.x.ai/v1' }), { sandboxId })` |
| Together | `ks.wrap(new OpenAI({ baseURL: 'https://api.together.xyz/v1' }), { sandboxId })` |

### `ks.initTracing(sandboxId)` and `traced(name, fn)`

For non-LLM operations, use `traced()` to capture custom spans.

```typescript

const ks = new Keystone();
ks.initTracing('sb-xxx');

const result = await traced('write_config', async () => {
  await fs.writeFile('config.json', JSON.stringify(config));
  return 'ok';
});
```

Nested `traced()` calls create parent-child spans automatically.

### `Keystone.fromSandbox()`

If your agent is running inside a Keystone sandbox, use this to get a pre-configured client. It reads `KEYSTONE_BASE_URL` and `KEYSTONE_SANDBOX_ID` from the environment that Keystone injects automatically.

```typescript
const { client, sandbox } = await Keystone.fromSandbox();
// client: ready-to-use Keystone instance
// sandbox.services.db: { host: "db", port: 5432, ready: true }
```

Your agent also gets environment variables for each service:
- `KEYSTONE_SANDBOX_ID` -- the sandbox ID
- `KEYSTONE_BASE_URL` -- the Keystone API URL
- `KEYSTONE_SERVICE_DB_HOST` / `KEYSTONE_SERVICE_DB_PORT` -- per-service connection info

---

### Billing and Usage

How Keystone meters sandbox runs, plan rates, and where to manage payment. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content.

Keystone meters three things: **vCPU-seconds** (how long your sandboxes run), **GiB-seconds** (how much memory they hold), and **LLM judge tokens** (when you use `llm_as_judge` invariants, since Keystone pays the provider on your behalf). Storage above the included quota is charged per GiB-month. That's it -- there are no per-experiment, per-scenario, or per-API-call fees, and your own agent's LLM calls stay on your own provider bill (Keystone traces them so you can see them, but doesn't re-bill them).

A card on file is required before you can create a Keystone API key. Once you have one, usage is billed against a rolling 30-day cycle, with invoices generated at the end of each cycle.

## Where to find it

In the dashboard, go to [app.paragon.run/app/keystone/data/billing](https://app.paragon.run/app/keystone/data/billing) -- or in the Keystone sidebar, under **Data**, click **Usage & Billing**.

The page shows:

- **Total spend** for the current cycle, updated within 30 seconds of every sandbox run.
- **Billing cycle picker** -- flip between the current cycle and any previous cycles to compare.
- **Usage breakdown** -- daily spend split between compute (vCPU + memory) and storage.
- **Resource breakdown** -- daily vCPU-seconds and GiB-seconds consumed.
- **Manage payment details** and **View invoices** buttons that open the Stripe billing portal.

## Plans

| Plan | Price | vCPU rate | Memory rate | Included storage | Storage overage |
| --- | --- | --- | --- | --- | --- |
| **Free** | $0/mo | $0.00003942 / vCPU·s | $0.00000672 / GiB·s | 1 GiB | $5 / GiB-month |
| **Pro** | $149/mo | $0.00003942 / vCPU·s | $0.00000672 / GiB·s | 5 GiB | $3 / GiB-month |
| **Enterprise** | Contact us | $0.00003942 / vCPU·s | $0.00000672 / GiB·s | 5 GiB | $0 (included) |

Compute rates are the same on every plan -- you pay exactly for what your sandboxes use. The plan difference is the included storage quota and the per-GiB overage rate for snapshots and cached sandbox state.

### LLM judge rates

When your spec uses an `llm_as_judge` invariant, Keystone runs Paragon as the judge and bills you per token. The default model is **paragon-md** (Claude Sonnet 4.6, strong judge); you can override per-invariant with `check.model`.

| Model | Input (per 1k tokens) | Output (per 1k tokens) |
| --- | --- | --- |
| `paragon-fast` | $0.0010 | $0.0040 |
| `paragon-md` (default) | $0.0030 | $0.0150 |
| `paragon-max` | $0.0200 | $0.0800 |

```yaml
invariants:
  draft_quality:
    check:
      type: llm_as_judge
      model: paragon-fast     # optional; default is paragon-md
      criteria: "The draft email is professional and accurate."
```

A typical judge call on `paragon-md` with a ~2k-token input and ~200-token output costs around $0.009. Tokens are stamped on every call and aggregated into the cycle's invoice alongside compute and storage.

## Billing cycles

Your first sandbox run starts the billing anchor. From there, cycles are 30 days long and roll forward automatically. The dashboard always shows the current cycle by default; use the picker to look at any previous cycle.

Current-cycle usage refreshes every 30 seconds. Historical days are frozen once the nightly rollup runs, so past cycles never change.

## Adding a payment method

1. Go to [Usage & Billing](https://app.paragon.run/app/keystone/data/billing) in the dashboard.
2. Click **Manage payment details** -- this opens the Stripe billing portal in a new tab.
3. Add or update your card in the portal.
4. Close the portal tab and return to Keystone. Your API keys page will unlock and you can create a `ks_live_` key.

Until a card is attached, [the keys page](https://app.paragon.run/app/keystone/keys) blocks key creation and links you straight to the setup flow.

## Upgrading to Pro

From the [Usage & Billing](https://app.paragon.run/app/keystone/data/billing) page, click **Upgrade to Pro**. You'll be charged $149 on the first of each cycle. A card on file is required; team accounts can only be upgraded by the team admin.

Unlimited/comped teams see a `comped` badge on the total spend -- usage still counts, but the invoice is zeroed out.

## Invoices

Click **View invoices** on the billing page to open the Stripe customer portal. From there you can:

- Download past invoices as PDFs.
- Update your billing email and tax ID.
- Cancel your subscription (Pro only).

## Team billing

If your user account belongs to a team, **Keystone usage bills the team, not the individual**. The team's admin controls the payment method and sees the unified usage. Sandboxes created by any team member roll up into the same billing cycle.

A few rules of thumb:

- Only the team admin can upgrade, add a card, or cancel.
- All team members can see the Usage & Billing page and their own usage contribution.
- Leaving a team means your future usage bills against your personal account (which will need its own card).

## What's not billed

- **Your agent's LLM provider costs** -- when your agent calls Anthropic, OpenAI, xAI, etc. from inside a sandbox using your own keys, you pay the provider directly. Keystone captures token counts and estimated USD via `ks.wrap()` so you can see it in traces, but does not re-bill it. (This is different from `llm_as_judge` invariants, which Keystone runs on its own Paragon judge and bills through your Keystone invoice — see above.)
- **API calls against the Keystone control plane** (listing experiments, fetching results, creating specs). These are free.
- **Experiments that fail to start** -- if the sandbox never boots, no vCPU or memory time is counted.

## Troubleshooting

### "Payment method required" when creating an API key

You haven't attached a card yet. Go to [Usage & Billing](https://app.paragon.run/app/keystone/data/billing), click **Manage payment details**, add a card in the Stripe portal, then retry.

### Usage number hasn't updated after a run

Current-cycle numbers poll every 30 seconds. If you just finished an experiment, wait a moment and refresh. Historical days don't update after the nightly rollup -- they're frozen.

### The dashboard shows "No usage yet"

Your billing anchor hasn't been set, which means no sandbox has ever run for you (or your team). Create an API key and run one experiment to initialize the cycle.

### I'm on an unlimited plan -- why do I see a cost?

The dashboard shows **shadow cost** ("what you'd be paying") so you can still gauge real compute spend. Your invoice stays at $0; the `comped` badge on the total spend confirms it.

---

## Getting started

### Overview

Agentic QA platform for bulletproof code.

Paragon is an agentic QA platform built with breakthrough research to help teams ship bulletproof code at unprecedented speed.

Paragon analyzes your GitHub repositories and creates intelligent pull requests with improvements for code quality, performance, and security.

<div className="hidden dark:block">
  
</div>
<div className="block dark:hidden">
  
</div>

## Core Features

  
  
  

## Get Started

  

## Platform

  
  
  
  

## Tools Integration

---

### Paragon MCP

Connect Paragon to your coding tools via MCP

The Paragon MCP server gives your coding tools access to code review, testing, and analysis capabilities. Connect it to Claude Code, Cursor, or Windsurf to use Paragon directly from your development environment.

## Prerequisites

- [Paragon CLI installed](/paragon/overview)
- An API key from [app.paragon.run](https://app.paragon.run)

## Get your API key

1. Go to [app.paragon.run](https://app.paragon.run)
2. Sign in or create an account
3. Navigate to **Settings** → **API Keys**
4. Copy your API key

## Connect to your tool

## Available tools

Once connected, Paragon exposes 12 tools to your coding assistant:

### Code review

| Tool | Description |
|------|-------------|
| `run_paragon` | Run a Paragon prompt — reads/writes files, runs shell commands, and performs complex coding tasks. Requires API key. |
| `list_reviewed_prs` | List pull requests with Paragon review comments in the current repository. Supports filtering by state (`open`, `closed`, `all`). |
| `get_review_comments` | Get parsed review findings from a PR, including severity, descriptions, file locations, and suggested fixes. Auto-detects PR from current branch if omitted. |
| `resolve_review_comment` | Reply to a review comment and optionally resolve the thread. |

### Testing

| Tool | Description |
|------|-------------|
| `detect_test_framework` | Scan the repository and detect test frameworks in use. Returns framework details, run commands, file patterns, and confidence level. |
| `find_tests` | Discover existing test files using framework-aware file patterns. Returns file paths and count. |
| `run_tests` | Execute unit/integration tests using the detected or specified framework. Returns pass/fail status and full output. |
| `run_e2e_tests` | Run Playwright E2E tests locally. Installs browsers if needed, optionally starts a dev server, and captures results. |
| `generate_tests` | Generate unit, integration, or E2E tests using Paragon. Detects patterns, writes tests, verifies compilation, runs them, and fixes failures. Requires API key. |

### Test suite management

| Tool | Description |
|------|-------------|
| `list_test_suites` | List your test repositories and suites from the Paragon cloud dashboard. Requires API key. |
| `save_to_suite` | Save locally generated test files to a Paragon cloud test suite. Requires API key. |

### Utility

| Tool | Description |
|------|-------------|
| `check_setup` | Check if the MCP server is properly configured. Reports the status of the API key, Paragon binary, and GitHub CLI. |

## Example prompts

After setup, your tool can automatically use Paragon when relevant. You can also request tools directly:

```
Review the comments on my current PR and fix the issues

Generate unit tests for src/auth.ts

Run the test suite and show me what's failing

Find all Playwright tests and run them in headed mode
```

---

## PR Reviews

### Dashboard

Your central hub for PR reviews, testing, the Paragon Agent, and team settings.

The Paragon Dashboard is your central hub for automated PR reviews, test generation and execution, the Paragon Agent, and team management.

The dashboard is organized into the following pages:

- **Home**: Credit balance, usage statistics, and CLI installation
- **PR Reviews**: View all automatically reviewed pull requests and discovered issues
- **Testing**: Generate, run, and evolve tests for your repositories
- **Agent**: An agentic QA engine that finds bugs, writes tests, reviews code, and opens pull requests with fixes
- **Settings**: Configure your account, PR reviews, severity filters, custom rules, and integrations

<div className="hidden dark:block">
  
</div>
<div className="block dark:hidden">
  
</div>

## Quick Start

1. Install the Paragon CLI globally with npm
2. Connect your GitHub account in **Settings**
3. Enable automatic PR reviews and add repositories
4. View reviewed PRs in **PR Reviews**
5. Generate and manage tests in **Testing**
6. Run the **Agent** to find bugs and open fix PRs

## Installation & Setup

Get started with Paragon by installing the CLI and authenticating with your API key.

### Step 1: Install CLI

```bash
npm i -g @polarityinc/paragon
```

### Step 2: Authenticate

```bash
paragon auth login
```

### Step 3: Verify Installation

```bash
paragon
```

## Home

The Home page provides an overview of your account status and quick access to key features.

### Credit Balance & Plan

Your credit balance and current plan are displayed at the top:

- Current credit balance with visual indicator
- Active plan name (Free, Developer, Startup)
- "Add Credits" button for paid plans

### Statistics

Click "Show All Stats" to expand your usage statistics:

- Issues caught by severity (Critical, High, Medium, Low)
- Total tokens used
- Messages sent
- Estimated hours saved

### Install Paragon CLI

Click the install button to open a modal with copy-paste commands for CLI setup.

---

### Reviews

View and manage all automatically reviewed pull requests and discovered issues.

The Issues page is your central hub for viewing all automatically reviewed pull requests and the issues found.

### PR List

View all PRs that have been automatically reviewed:

- **Search**: Filter PRs by title
- **Repository Filter**: Show PRs from specific repositories
- **Sentiment Score**: Color-coded status (Good/Fair/Poor/Critical)
- **Author**: PR author with avatar
- **Date**: Relative timestamp (e.g., "5 min ago")

## Analytics Dashboard

View metrics and trends for your PR reviews:

  
  
  

The trend chart visualizes review activity over time.

### Issue Detail View

Click on any PR to view detailed issues found:

- Issues sorted by severity (Critical → High → Medium → Low)
- Issue title and description
- File path and line number
- Code snippet display
- Link to GitHub comment thread
- Resolution status

## Trigger Reviews Manually

You can trigger a PR review on-demand by tagging `@paragon-review` in any comment on a pull request. This is useful when:

- You want to re-run a review after making changes
- The repository doesn't have automatic reviews enabled
- You want a review on a specific PR without enabling auto-reviews

---

### Repos

Manage monitored repositories and per-repo settings for PR reviews and the agent.

Manage which repositories Paragon monitors and configure per-repo settings for automatic PR reviews, agent behavior, and access control.

## Adding Repositories

1. Enable the "Automatic PR Reviews" toggle on the PR Reviews dashboard
2. Click "Manage Repos" to open the repository manager
3. Search and select repositories with checkboxes
4. Selected repos are auto-saved after a short delay

## Per-Repository Settings

Each repository has individual toggles that control how Paragon interacts with it. These settings are managed from the Repos page.

### Auto Review

When enabled, Paragon automatically reviews every new pull request opened against this repository. Disable to only receive reviews when manually triggered via `@paragon-review`.

### Open Access

Allow external contributors (outside your GitHub organization) to trigger Paragon on this repository. By default, only org members can trigger reviews and agent runs.

### Agent Monitoring

When enabled, the Paragon Agent continuously monitors this repository for issues. The agent watches for patterns like security vulnerabilities, performance regressions, and code quality problems, and can flag them proactively.

### Agent Auto-Fix

When enabled alongside Agent Monitoring, the agent automatically creates fixes for issues it discovers instead of just reporting them. The agent commits changes to a new branch and opens a pull request for your review.

### Agent Push-as-PR

Controls how the agent delivers code changes. When enabled, the agent always pushes changes as a new pull request rather than committing directly to the working branch. This ensures all agent-generated changes go through your standard code review process.

### Auto-Resolve

When enabled, Paragon automatically resolves review comments that have been addressed in subsequent commits. If a flagged issue is fixed in a follow-up push, the corresponding review comment is marked as resolved.

Enabled by default for new repositories.

## GitHub Connection

Connect your GitHub account to enable repository access:

- View connection status in "Code Host Connections"
- Click "Connect" or "Reconnect" to authorize
- Tokens are encrypted and auto-refresh

---

### Settings

Configure your account, review options, severity filters, guardrails, and branch protection.

Configure your account and review settings from the Settings page.

## Account

- **Name**: Editable display name
- **Email**: Synced from GitHub (read-only)
- **Avatar**: Synced from GitHub profile

### API Key

Your API key is displayed in Settings. Use it for CLI authentication with `paragon auth login`.

## Review Model Depth

Choose the model tier used for automatic PR reviews. Each tier has a different depth of analysis and credit cost.

| Model | Description | Credit Cost |
|-------|-------------|-------------|
| **Auto** | Automatically adapts to PR complexity — uses a lighter model for simple changes and a deeper model for complex ones | Varies |
| **Max** | Deepest analysis with extended reasoning | 10x |
| **MD** | Balanced depth and speed | 3x |
| **Fast** | Quick scan for straightforward PRs | 1x |

## Review Options

Configure what to include in automatic PR reviews:

  

  

  

  

  

  

  

  

  

## Severity Filters

Control which types of issues are reported in your PR reviews.

### Issue Severity Levels

  
  
  

## Guardrails

Guardrails are custom rules that Paragon enforces during every review. Unlike severity filters (which control what to report), guardrails define specific standards your code must meet.

### Enabling Guardrails

Toggle guardrails on from the Configure page. When enabled, Paragon checks every PR against your active guardrails and flags violations alongside its standard review.

### Templates

Paragon provides built-in guardrail templates you can activate with one click:

| Template | Category | Severity | Rule |
|----------|----------|----------|------|
| **File Length Limit** | Code Quality | Warning | Files must be under 300 lines of code |
| **Function Length Limit** | Code Quality | Warning | Functions must be under 50 lines |
| **No Console Logs** | Code Quality | Error | No console.log statements in production code |
| **No Any Types** | Code Quality | Warning | TypeScript files must not use the "any" type |
| **No Hardcoded Secrets** | Security | Error | No hardcoded API keys, passwords, or secrets |
| **JSDoc Required** | Documentation | Warning | All public functions must have JSDoc comments |
| **React Componentization** | Code Quality | Warning | React components must be under 100 lines |
| **Test File Required** | Code Quality | Warning | New feature files must have corresponding test files |

### Custom Guardrails

Create your own guardrails with:

- **Name** — A short label for the rule
- **Description** — What the rule checks for
- **Rule** — The specific standard to enforce (this is what the reviewer evaluates against)
- **Category** — Code Quality, Style, Security, or Documentation
- **Severity** — Error (must fix) or Warning (should fix)

### Managing Guardrails

Each guardrail can be individually enabled or disabled without deleting it. Edit the name, description, rule, category, or severity at any time. Delete guardrails you no longer need.

## Branch Protection

Prevent Paragon's agent from pushing directly to critical branches. When a branch is protected, the agent will create a separate branch and open a pull request instead of pushing directly.

### Configuration

Branch protection is configured per repository:

1. Go to the **Configure** page
2. Scroll to **Branch Protection**
3. Expand a repository
4. Add branches to protect (e.g., `main`, `master`, `production`)

### Patterns

Branch protection supports:

- **Exact match** — e.g., `main`, `master`, `develop`
- **Wildcard patterns** — e.g., `release/*`, `hotfix/*`

### Quick Add

Common branches (`main`, `master`, `develop`, `staging`, `production`) are suggested for quick selection when adding protection rules.

---

### Custom Rules

Define custom coding standards to follow during code reviews.

Define custom coding standards that Paragon follows during code reviews.

## Upload Documents

Upload existing documentation to automatically extract rules:

- **Supported formats**: PDF, DOCX, MD, TXT
- Coding rules are extracted automatically
- Rules tagged with "upload" source

## Manual Entry

Add rules by typing directly:

1. Click "Manual Entry"
2. Type your rule or guideline
3. Save to add to active rules

## Example Rules

```
All API endpoints must include error handling with try-catch
Use TypeScript strict mode for all new files
Database queries must use parameterized statements
```

## Managing Rules

- Click any rule to view full content
- Copy rule content to clipboard
- Delete rules no longer needed
- View source (upload/manual) and creation date

---

## Agent

### Overview

An agentic QA engine that finds bugs, writes tests, reviews code, and opens pull requests with fixes.

The Paragon Agent is an agentic QA engine that clones your repository, analyzes your code for issues, writes tests, and opens pull requests with fixes. Use it from the dashboard, trigger it from GitHub comments and Slack, or automate it on schedules and events.

## What It Can Do

- **Find and fix bugs** — Point it at an issue, error, or flaky test and it investigates and patches
- **Write and improve tests** — Generate unit, integration, and E2E tests and increase coverage across your codebase
- **Review code quality** — Catch security vulnerabilities, performance issues, and regressions before they ship
- **Validate changes** — Run builds, execute test suites, and verify nothing breaks
- **Browse and interact with web apps** — Use a real browser in a sandbox to test UI, fill forms, and validate behavior

## How It Works

1. **Select a repository** — Choose a connected GitHub repository and branch
2. **Describe the task** — Tell the agent what you want reviewed, tested, or fixed
3. **Agent executes** — The agent clones your repo in an isolated sandbox, analyzes the code, runs tests, and makes changes
4. **Review the PR** — Review the diff in the dashboard or on the generated pull request

## Session Modes

| Mode | Best For | Duration | How It Works |
|------|----------|----------|--------------|
| **Standard** | Quick bug fixes, test generation, code review | Minutes | Single agent processes your prompt directly |
| **Grind** | Comprehensive test coverage, large-scale refactors, multi-file fixes | Hours | Plans first, then delegates to parallel workers (LRA) |

Grind mode is powered by the **Long Running Agent (LRA)** — a 2-phase orchestrator that plans, gets your approval, then executes with parallel workers.

## Models

| Model | Speed | PCU Cost | Best For |
|-------|-------|----------|----------|
| **Max** | Slowest | Highest | Complex debugging, nuanced code review, architecture-level issues |
| **Mid** | Balanced | Moderate | Most tasks — good balance of quality and speed |
| **Fast** | Fastest | Lowest | Simple fixes, test generation, straightforward patches |

## MCP Integrations

The agent can connect to external services via MCP (Model Context Protocol) to expand its capabilities:

- **Linear** — Read and update issues
- **Slack** — Send messages and read channels
- **Jira** — Manage tickets
- **Notion** — Access documentation
- **Sentry** — Investigate errors
- **Vercel** — Check deployments
- **Supabase** — Query databases
- **Cloudflare** — Manage workers and DNS
- **AWS** — Interact with AWS services
- **MongoDB** — Query collections
- **Custom MCP servers** — Connect any MCP-compatible service via stdio, HTTP, or SSE

Configure MCP integrations from the agent settings panel.

## Usage Analytics

The **Usage** panel in the sidebar shows:

- **PR metrics** — PRs created, merged, merge rate, PCUs per merged PR
- **Session metrics** — Total sessions, tokens used, total cost, average duration
- **Pull request chart** — Status breakdown (merged, open, closed) over time
- **Model usage breakdown** — Sessions, tokens, and cost per model tier
- **Time range** — Daily or weekly aggregation over the last 90 days

## Next Steps

---

### Sessions

Create agent sessions, interact with the agent in real time, and review fixes and test changes.

A session is a single conversation with the Paragon Agent. Each session works on one repository and branch, processes your prompt, and produces code changes — bug fixes, new tests, or quality improvements.

<video src="/images/agent-sessions-demo.mp4" autoPlay loop muted controls playsInline style={{ borderRadius: "0.75rem", marginTop: "1.5rem", marginBottom: "1.5rem", border: "1px solid rgba(255, 255, 255, 0.08)", boxShadow: "0 4px 24px rgba(0, 0, 0, 0.4)", maxWidth: "100%", height: "auto" }} />

## Creating a Session

1. Navigate to the **Agent** page in the dashboard
2. Select a **repository** and **branch** from the dropdown
3. Choose a **model** (Max, Mid, or Fast)
4. Type your prompt and send

You can attach images to your prompt for additional context (screenshots of bugs, error logs, test failures).

## Interacting with a Session

### Real-Time Streaming

As the agent works, you see its progress in real time:

- **Thinking** — The agent's reasoning process (collapsible)
- **Tool calls** — Commands run, files read, searches performed
- **Files changed** — Live diff counts as files are modified
- **Parallel agents** — When workers are spawned, see each agent's current activity and status

### Questions

The agent may ask clarifying questions before proceeding. Questions appear inline with the following types:

- **Single select** — Pick one option from a list
- **Multi select** — Pick multiple options
- **Text** — Free-form answer

Each question can include option descriptions for additional context. You can skip optional questions.

### Follow-Up Messages

You can send additional messages to a running session to:

- Provide clarification the agent asked for
- Redirect the agent's approach
- Add requirements you forgot to mention

Completed, stopped, or failed sessions can also be resumed with follow-up messages.

### Plan Approval

For Grind mode sessions, the agent produces a plan before making changes. When this happens:

- The session enters an **awaiting plan approval** state
- Review the proposed plan with phases and subtasks
- **Approve** to let the agent proceed with execution
- **Reject** to have the agent revise its approach (you can request revisions multiple times)
- Send a **message** with additional context or corrections

Toggle between **manual approval** and **auto-accept** mode using the plan approval toggle in the input bar.

## Session Statuses

| Status | Description |
|--------|-------------|
| **Pending** | Session created, not yet started |
| **Cloning** | Repository is being cloned |
| **Running** | Agent is actively working |
| **Awaiting Plan Approval** | Grind mode plan ready for review |
| **Committing** | Agent is committing and pushing changes |
| **Completed** | Session finished successfully |
| **Stopped** | Manually stopped by user |
| **Failed** | Session encountered an error |

## Reviewing Results

### Diff Panel

When the agent modifies code, a side panel shows file-by-file diffs with syntax highlighting. Each file displays:

- File path
- Lines added and removed
- Full diff with additions and deletions highlighted

Click any file to see its complete diff.

### Pull Requests

If the agent creates a PR, a link appears in the session header with the PR status (open, merged, closed). Click it to go directly to the GitHub pull request.

### Session History

All sessions are listed in the sidebar, grouped by date. Each shows:

- Repository and branch
- Status indicator
- Number of files changed

Team members can see each other's sessions.

## Stopping a Session

Click the **Stop** button to halt a running session or cancel a plan approval wait. The agent stops execution and saves its progress. You can resume by sending a follow-up message.

---

### Grind Mode

Tackle large tasks with parallel agents that plan and execute over hours.

Grind mode is designed for large tasks that require planning, coordination, and extended execution time. Powered by the **Long Running Agent (LRA)**, Grind uses a 2-phase approach: a **planner** investigates your codebase and creates a structured plan, then delegates subtasks to **parallel workers** for execution.

<video src="/images/agent-grind-demo.mp4" autoPlay loop muted controls playsInline style={{ borderRadius: "0.75rem", marginTop: "1.5rem", marginBottom: "1.5rem", border: "1px solid rgba(255, 255, 255, 0.08)", boxShadow: "0 4px 24px rgba(0, 0, 0, 0.4)", maxWidth: "100%", height: "auto" }} />

## When to Use Grind

| Use Standard Mode | Use Grind Mode |
|---|---|
| Single bug fix | Multi-file bug investigation |
| Generate a few tests | Comprehensive test coverage across a codebase |
| Quick code review | Full security or performance audit |
| Simple patch | Cross-cutting refactors and regression fixes |

## How Grind Works

### Phase 1: Planning

The agent investigates your codebase — reading files, running builds, checking tests — and produces a structured plan with phases and subtasks in Markdown.

The plan is presented for your review before any code changes are made.

### Plan Approval

You control when and how plans are approved:

- **Manual approval** — Review the plan and approve or reject it
- **Auto-accept** — Plans execute automatically without waiting for approval

Toggle between modes using the **plan approval toggle** in the input bar.

When a plan is awaiting approval, you can:
- **Approve** — The agent begins execution
- **Reject** — The agent revises its approach based on your feedback
- Send a **message** with additional context or corrections

You can request revisions multiple times until the plan meets your requirements.

### Phase 2: Execution

Once approved, the planner delegates subtasks to parallel workers:

- Multiple workers run simultaneously on different parts of the codebase
- Each worker is scoped to specific directories to avoid conflicts
- Workers can be assigned different model tiers based on task complexity
- Sub-planners can recursively delegate for deeply complex tasks

You can track each parallel agent's real-time activity and status in the dashboard.

### Verification

After all workers complete, the agent:

1. Runs the build to verify compilation
2. Runs tests and compares against the baseline
3. Runs linters and checks for regressions
4. Commits and pushes changes
5. Creates a pull request

## Enabling Grind Mode

1. Open the **Agent** page in the dashboard
2. Toggle **Grind** in the input bar
3. Select a **time limit**:
   - **3 hours**
   - **5 hours**
   - **10 hours**
   - **Until done** (no time limit)
4. Choose your **plan approval** preference (manual or auto-accept)
5. Send your prompt

## Sandbox Environment

Grind sessions run in an isolated sandbox with a full desktop environment. The **Sandbox Panel** gives you:

- **Desktop view** — Live stream of the agent's virtual desktop
- **File changes** — Real-time view of modified files with diffs
- **Plan phases** — Track progress through the execution plan
- **Terminal** — See command output
- **Parallel agents** — Monitor each worker's current activity

### Browser Support

The sandbox includes a real browser that the agent can use to:

- Navigate to URLs and interact with web applications
- Click elements, fill forms, type text, and scroll
- Take screenshots for visual validation
- Run E2E tests against a running application

This is especially useful for UI validation, as the agent can interact with web applications visually and verify behavior in a real browser.

## CLI Usage

Grind mode is also available via the CLI with the `--lra` flag:

```bash
paragon --lra "your instruction" --model paragon-md --timeout 3h
```

Add `--browser` to enable browser support and `--auto-accept` to skip plan approval.

## Tips

- **Be specific** — Detailed prompts produce better plans. Include what to test, expected behavior, edge cases, and constraints.
- **Start with auto-accept off** — Review the first few plans manually to calibrate quality, then switch to auto-accept for trusted workflows.
- **Use for coverage goals** — Grind is ideal for "add tests for all untested modules" or "fix all flaky tests in this repo" type tasks.
- **Multiple repos** — You can pre-clone multiple repositories for cross-repo coordination.

---

### Automations

Run the agent automatically on schedules, GitHub events, Slack messages, or manual triggers.

Automations let you configure the agent to run automatically in response to events or on a schedule. Each automation has a prompt, target repositories, and one or more triggers.

<video src="/images/agent-automations-demo.mp4" autoPlay loop muted controls playsInline style={{ borderRadius: "0.75rem", marginTop: "1.5rem", marginBottom: "1.5rem", border: "1px solid rgba(255, 255, 255, 0.08)", boxShadow: "0 4px 24px rgba(0, 0, 0, 0.4)", maxWidth: "100%", height: "auto" }} />

## Creating an Automation

1. Go to **Agent > Automations** in the dashboard
2. Click **New Automation**
3. Configure the automation:
   - **Name** — A descriptive name for the automation
   - **Prompt** — The instruction the agent will execute
   - **Repositories** — Which repos the automation targets
   - **Model** — Agent model tier (Max, Mid, Fast)
   - **Triggers** — When the automation should run
   - **Environment variables** — Optional env vars for the session
   - **Tools** — Memory, MCP integrations, and custom commands available to the agent

## Triggers

### Schedule Triggers

Run the agent on a recurring schedule:

- **Daily** — Runs at a specific time each day
- **Weekly** — Runs on selected days at a specific time
- **Cron** — Custom cron expressions for advanced scheduling

All schedules support timezone configuration.

### GitHub Event Triggers

Run the agent in response to GitHub activity:

| Trigger | Description |
|---------|-------------|
| **PR Created** | A new pull request is opened |
| **PR Merged** | A pull request is merged |
| **Push** | Code is pushed to a branch |
| **Comment Created** | A comment is posted on a PR or issue |
| **Review Submitted** | A code review is submitted (approved, changes requested, or commented) |

#### Filters

Narrow down when triggers fire:

- **Branch patterns** — Only trigger on branches matching a pattern (e.g., `feature/*`)
- **Label filters** — Only trigger on PRs with specific labels

### Slack Triggers

Run the agent when a message is posted in a Slack channel:

- **Channel filter** — Only trigger on messages in a specific channel
- **Thread-only** — Only trigger on threaded messages
- **Text includes** — Only trigger when the message contains specific text

### Manual Triggers

Click **Run** on any automation to trigger it immediately. This creates an agent session with the automation's prompt and settings.

## Templates

Paragon provides preset automation templates for common workflows:

| Template | Description |
|----------|-------------|
| **Find Vulnerabilities** | Scan for security vulnerabilities |
| **Assign PR Reviewers** | Auto-assign reviewers to new PRs |
| **Summarize Changes Daily** | Daily summary of all code changes |
| **Fix Bugs Reported in Slack** | Respond to bug reports in Slack |
| **Add Test Coverage** | Generate tests for untested code |
| **Clean Up Feature Flags** | Remove stale feature flags |
| **Find Critical Bugs** | Deep scan for critical bugs |
| **Fix CI Failures** | Investigate and fix failing CI pipelines |
| **Generate Docs** | Auto-generate documentation for new code |
| **Investigate PagerDuty Incidents** | Triage PagerDuty alerts |
| **Investigate Datadog Errors** | Analyze Datadog error reports |
| **Triage Linear Issues** | Auto-triage incoming Linear issues |

Browse templates when creating a new automation to get started quickly.

## Slack Reporting

Configure Slack notifications for automation runs:

- **Enable reporting** — Toggle Slack notifications on/off
- **Channel** — Select which Slack channel receives reports
- **Include output** — Optionally include the full agent output in the message

## Viewing Runs

Go to **Agent > Automations > Runs** to see all automation runs with:

- **Status** — Pending, running, success, or failed
- **Trigger type** — What triggered the run (scheduled, manual, PR event, Slack, etc.)
- **Automation name** — Which automation was executed
- **Session link** — Link to the agent session for full details
- **Error messages** — Inline error display for failed runs
- **Filters** — Filter by scope (mine/all), status, trigger type, automation name, or date range

## Automation States

| Status | Description |
|--------|-------------|
| **Active** | Running on triggers and schedules |
| **Paused** | Triggers ignored until resumed |
| **Archived** | Disabled and hidden from active list |

## Examples

**Nightly QA review:**
> Review all changes pushed today. Flag security vulnerabilities, missing tests, performance regressions, and potential bugs. Open a PR with fixes.

*Trigger: Daily at 9:00 AM*

**Auto-fix on review:**
> Address the requested changes in this review. Follow the reviewer's feedback exactly.

*Trigger: Review submitted with "changes requested"*

**Test generation on new PRs:**
> Generate unit tests for all new or modified functions in this PR. Ensure edge cases and error paths are covered.

*Trigger: PR created*

**Fix bugs from Slack:**
> Investigate the bug described in this Slack message. Reproduce it, find the root cause, and open a PR with a fix.

*Trigger: Slack message in #bugs channel*

---

## Paragon CLI

### Overview

Terminal assistant for code development, featuring model switching, MCP integrations, and workflow automation.

Paragon is a terminal assistant that helps you write, review, and manage code directly from your command line. It features model switching, MCP server integrations, custom agents, automated workflows, and intelligent code assistance.

## Installation

Install Paragon globally via npm:

```bash
npm i -g @polarityinc/paragon
```

After installation, authenticate with your Paragon account:

```bash
paragon auth login
```

You can find the API key in the dashboard under the settings tab at the bottom of the page.

```bash
paragon
```

## Models

Paragon supports multiple models with different capabilities:

| Model | Description |
|-------|-------------|
| `paragon-fast` | Quick responses, lower cost |
| `paragon-mid` | Balanced performance and cost |
| `paragon-high` | Higher quality responses |
| `paragon-max` | Maximum capability with extended thinking |

Models with thinking/reasoning capabilities will show a budget indicator.

## Keyboard Shortcuts

| Shortcut | Action |
|----------|--------|
| `ctrl+p` | Open command palette |
| `ctrl+n` | New session |
| `ctrl+s` | Switch session |
| `ctrl+t` | Reopen tasks dialog |
| `ctrl+f` | File picker |
| `ctrl+o` | External editor |
| `ctrl+g` | Toggle help |
| `ctrl+c` | Quit |
| `ctrl+z` | Suspend to shell |

## Additional Features

### File Attachments

Type `@` in the input to get file path completions. Attach images or files to your message for Paragon to analyze.

### Preference Memory

| Prefix | Scope | Example |
|--------|-------|---------|
| `#` | Global (all projects) | `# Always use TypeScript for new files` |
| `##` | Local (this repository only) | `## Use Vitest for testing` |

---

### Slash Commands

Complete reference for all Paragon CLI slash commands.

Type these commands directly in the input field. Start typing `/` to see autocomplete suggestions.

## `/model`

**Switch to a different model.** Opens a model selection dialog where you can choose between available LLM models.

## `/session`

**Switch to a different session.** Opens a session picker showing all your previous conversation sessions. Each session maintains its own context and history.

## `/new`

**Start a new session.** Creates a fresh conversation session with no prior context.

## `/compact`

**Summarize current session.** Creates a summary of the current conversation and starts a new session with that summary as context. Use when:

- Approaching context window limits (auto-prompted at 95%)
- Want to condense a long conversation
- Starting a related but distinct task

## `/approvals`

**Toggle auto-approve (yolo) mode.** Toggles automatic approval of all tool executions:

| Mode | Behavior |
|------|----------|
| **OFF** (default) | Paragon asks permission before running commands, editing files, etc. |
| **ON** (yolo mode) | All tool calls are automatically approved |

## `/feature`

**Toggle feature mode.** Enables/disables Feature Mode which changes how the agent approaches tasks:

| Mode | Behavior |
|------|----------|
| **OFF** | Standard assistant behavior |
| **ON** | Agent focuses on implementing complete features with better planning |

## `/init`

**Create/Update PARAGON.md memory file.** Generates or updates a `PARAGON.md` file in your project root containing:

- Project overview and architecture
- Key conventions and patterns
- Important files and their purposes
- Development guidelines

This file is automatically read by Paragon to understand your project context.

## `/infra`

**Generate INFRA.md and infrastructure diagram.** Analyzes your project and creates:

- `INFRA.md` - Infrastructure documentation
- Mermaid diagram showing system architecture

## `/status`

**View account status and credits.** Opens a dialog showing:

- Current user and team information
- Remaining token credits
- Usage statistics
- Plan/subscription details

## `/help`

**Toggle help display.** Shows/hides the keyboard shortcut help bar at the bottom of the screen.

## `/quit`

**Quit Paragon.** Exits the application. Same as `ctrl+c` or typing `exit`/`quit` in the input.

---

### Tools & Automation

MCP servers, custom agents, automations, and scheduled monitors.

## `/mcp` - MCP Servers

**Manage MCP (Model Context Protocol) servers.** Opens the MCP Manager dialog where you can configure external tool integrations.

### Adding MCP Servers

1. Select `/mcp` to open the manager
2. Choose "Add Server" or press `a`
3. Select server type:
   - **stdio** - Local process (most common)
   - **http** - HTTP endpoint
   - **sse** - Server-Sent Events

### For stdio servers

| Field | Description |
|-------|-------------|
| **Name** | Unique identifier for the server |
| **Command** | The executable (e.g., `npx`, `uvx`, `node`) |
| **Args** | Command arguments as JSON array |
| **Env** | Environment variables as JSON object |

### Common MCP Server Examples

**Filesystem access:**
```
Command: npx
Args: ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/projects"]
```

**GitHub:**
```
Command: npx
Args: ["-y", "@modelcontextprotocol/server-github"]
Env: {"GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_xxx"}
```

**PostgreSQL:**
```
Command: npx
Args: ["-y", "@modelcontextprotocol/server-postgres", "postgresql://user:pass@localhost/db"]
```

### Managing Servers

- **Enable/Disable**: Toggle servers on/off without removing them
- **Authenticate**: Some servers require OAuth or API key authentication
- **Delete**: Remove servers you no longer need

## `/agents` - Custom Agents

**Manage custom subagents.** Opens the Agents (Droids) manager where you can create specialized agents with:

- **Custom system prompts** - Define the agent's personality and expertise
- **Specific models** - Choose which model powers the agent
- **Tool restrictions** - Limit which tools the agent can use

Use cases:
- Code reviewer agent with strict guidelines
- Documentation writer with specific style
- Test generator focused on edge cases

## `/automations` - Workflow Automation

**Run saved E2E/regression flows.** Opens the Automation Scenarios dialog. Automations are predefined sequences of prompts that run automatically.

Configure in `paragon.json`:

```json
{
  "automations": [
    {
      "name": "Full Test Suite",
      "prompts": [
        "Run all unit tests",
        "Run integration tests",
        "Generate coverage report"
      ]
    }
  ]
}
```

## `/monitors` - Scheduled Tasks

**Manage scheduled monitors.** Opens the Monitors dialog for creating automated scheduled tasks.

### Creating a Monitor

1. Press `a` to add new monitor
2. Enter a **prompt** - what you want the monitor to check/do
3. Select **cadence**:
   - **Hourly** - Runs every hour
   - **Daily** - Runs once per day
   - **Weekly** - Runs once per week
4. Press `Enter` to save

### Managing Monitors

| Key | Action |
|-----|--------|
| `a` | Add new monitor |
| `Enter` | Edit selected monitor |
| `t` | Toggle enabled/disabled |
| `d` | Delete monitor |
| `↑/↓` | Navigate list |
| `Esc` | Close dialog |

### Use Cases

- Daily code quality checks
- Weekly dependency audits
- Hourly log monitoring
- Scheduled report generation

---

### E2E Testing

Generate and run end-to-end tests using natural language.

Paragon generates and runs Playwright tests from plain English. Describe what you want to test, and Paragon handles the rest.

## Quick Start

Just tell Paragon what to test:

```
Test the checkout flow with an expired credit card
```

Paragon writes the Playwright test, runs it, and opens an HTML report with video recordings and tracing so you can see exactly what happened.

## How It Works

1. **Describe the test** in natural language
2. **Paragon generates** Playwright tests and saves a scenario to `paragon.json`
3. **View results** with video recordings, tracing, and screenshots

## Running Saved Tests

Once a scenario is saved, run it anytime:

| Method | How |
|--------|-----|
| **CLI** | `paragon tests run <scenario-id>` |
| **Chat** | "run my e2e tests" |
| **TUI** | `ctrl+p` → Automation Scenarios |

## `paragon test`

Run exported Playwright E2E tests locally from the terminal. This command wraps `npx playwright test` with your `paragon-tests/` configuration and handles authentication with Polarity for agentic test steps.

```bash
paragon test [file-pattern] [flags]
```

### Prerequisites

Before running `paragon test`, make sure:

1. **You're authenticated** — run `paragon auth login` if you haven't already. Agentic test steps require a valid Polarity session.
2. **You have a `paragon-tests/` directory** in your project root containing a `playwright.config.ts`. Export your tests from the [Paragon Dashboard](https://home.polarity.cc/app/testing/e2e) to generate this directory.
3. **Playwright is available** — the command runs via `npx`, so Playwright will be resolved from your project's `node_modules` or fetched automatically.

### Basic Usage

```bash
# Run all tests in paragon-tests/
paragon test

# Run a specific test file
paragon test login.spec.ts

# Run multiple files or use glob patterns
paragon test auth.spec.ts checkout.spec.ts
paragon test **/*.spec.ts
```

The optional `[file-pattern]` argument accepts one or more file paths or glob patterns. When omitted, all tests in the `paragon-tests/` directory are run.

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--url` | `string` | — | Base URL for tests. Sets the `PARAGON_TEST_URL` environment variable, overriding the URL configured in `playwright.config.ts`. Also overridable via the `PARAGON_TEST_URL` env var directly. |
| `--headed` | `bool` | `false` | Run tests with a visible browser window instead of headless mode. Useful for watching tests execute in real time. |
| `--grep` | `string` | — | Filter tests by name or pattern. Only tests whose name matches the pattern will run. Passed directly to Playwright's `--grep` flag. |
| `--workers` | `int` | `0` (auto) | Number of parallel worker processes. When set to `0`, Playwright chooses automatically. Set to `1` to run tests sequentially. |
| `--retries` | `int` | `0` | Number of times to retry failed tests. Useful for flaky tests or unstable environments. |
| `--timeout` | `int` | `0` | Test timeout in milliseconds. Overrides the timeout set in your Playwright config. When `0`, the config default is used (typically 300,000ms / 5 minutes). |
| `--project` | `string` | — | Run tests for a specific browser project defined in your Playwright config (e.g. `chromium`, `chrome`, `firefox`). |
| `--debug` | `bool` | `false` | Launch Playwright in debug mode. Opens the Playwright Inspector, allowing you to step through tests, inspect selectors, and view logs interactively. |
| `--ui` | `bool` | `false` | Open Playwright's interactive UI mode. Provides a visual interface for browsing, running, and debugging tests with a built-in trace viewer. |

### Examples

```bash
# Run all tests against your local dev server
paragon test --url http://localhost:3000

# Watch tests run in a visible browser
paragon test --headed

# Filter to only run tests with "login" in the name
paragon test --grep "login"

# Run a single file in Chrome only
paragon test checkout.spec.ts --project chromium

# Retry flaky tests up to 3 times with a longer timeout
paragon test --retries 3 --timeout 120000

# Step through a test interactively with the Playwright Inspector
paragon test --debug --headed

# Open the full Playwright UI for exploring all tests
paragon test --ui

# Combine flags for a targeted local run
paragon test --url http://localhost:3000 --headed --grep "checkout" --workers 4
```

### Running Against Production

Use the `--url` flag to point tests at a deployed environment:

```bash
# Run against a staging environment
paragon test --url https://staging.yourapp.com

# Run against production
paragon test --url https://yourapp.com --project chromium --retries 2
```

You can also set the `PARAGON_TEST_URL` environment variable instead of passing `--url` each time:

```bash
export PARAGON_TEST_URL=https://staging.yourapp.com
paragon test
```

When `--url` is provided, it takes precedence over the env var.

### Default Playwright Configuration

Exported tests come with a pre-configured `playwright.config.ts` that includes sensible defaults:

| Setting | Default |
|---------|---------|
| Timeout | 300,000ms (5 minutes) |
| Expect timeout | 30,000ms |
| Workers | 1 (sequential) |
| Retries | 0 |
| Video recording | Enabled |
| Trace collection | Enabled |
| Screenshots | Enabled |
| Viewport | 1280 x 720 |
| Browsers | Chromium, Chrome |

Test results including videos, traces, and screenshots are written to a `test-results/` directory.

---

### Autotest

Spawn agents to autonomously explore and test a live web application.

Spawn parallel agents that navigate a live web application using a real browser, find bugs, take screenshots, and optionally create GitHub issues or PRs with their findings.

```
paragon autotest [description] [flags]
```

Unlike `paragon test` which runs pre-written Playwright tests, `autotest` uses agents that explore your app autonomously — no test files needed. Each agent gets its own isolated browser instance.

## How It Works

1. **Planning** — A planning agent analyzes the target app and distributes testing work across agents.
2. **Exploration** — Each agent navigates the app in a real browser, performing actions like clicking, typing, scrolling, and navigating between pages.
3. **Reporting** — Findings are compiled into a report with screenshots, reproduction steps, and severity ratings. Playwright test code is auto-generated from the agent's actions.
4. **GitHub integration** (optional) — Findings are filed as GitHub issues or bundled into a PR.

## Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--url` | `string` | **(required)** | Target URL to test. |
| `--prompt` | `string` | — | What to test (e.g. `"test the checkout flow"`). |
| `--agents` | `int` | `1` | Number of parallel agents (1–8). Each agent gets its own browser. |
| `--model` | `string` | `paragon-max` | Model to use for the agents. |
| `--timeout` | `int` | `15` | Per-agent timeout in minutes. |
| `--auth` | `string` | — | Path to a cookies JSON file (browser-use export format) for pre-authenticated sessions. |
| `--login` | `string` | — | Login instructions for the agent in plain English (e.g. `"click Login, type user@test.com into email, type pass123 into password, click Submit"`). |
| `--session` | `string` | — | Team auth session name or ID stored in Supabase. |
| `--repo` | `string` | — | GitHub repo for issue/PR creation (`owner/repo`). |
| `--pr` | `bool` | `false` | Create a PR with findings instead of an issue. Defaults to `true` when `--repo` is set. |
| `--skip-github` | `bool` | `false` | Skip GitHub issue/PR creation even if `--repo` is set. |
| `--json` | `bool` | `false` | Output results as structured JSON. |
| `-q, --quiet` | `bool` | `false` | Minimal output. |

## Examples

```
# Basic: test a local app
paragon autotest --url http://localhost:3000 --prompt "make sure the page loads"

# Run against production with multiple agents
paragon autotest --url https://yourapp.com --prompt "test the checkout flow" --agents 4

# Test with login instructions
paragon autotest --url https://yourapp.com --login "click Login, type user@test.com into email, type pass123 into password, click Submit" --prompt "test the dashboard"

# Use a cookies file for auth
paragon autotest --url https://yourapp.com --auth ./cookies.json --prompt "test account settings"

# Use a shared team auth session
paragon autotest --url https://yourapp.com --session "staging-admin" --prompt "test admin panel"

# File findings as a GitHub PR
paragon autotest --url https://yourapp.com --prompt "test user registration" --repo acme/webapp --pr

# Skip GitHub, just see findings locally
paragon autotest --url http://localhost:3000 --prompt "test login" --skip-github

# JSON output for CI pipelines
paragon autotest --url https://yourapp.com --prompt "smoke test" --json --quiet
```

## Authentication

Autotest supports three ways to authenticate agents with your app:

| Method | Flag | When to use |
|--------|------|-------------|
| **Login instructions** | `--login` | Simple login forms. Describe the steps in plain English. |
| **Cookies file** | `--auth` | Export cookies from browser-use and pass the JSON file. |
| **Team session** | `--session` | Shared, encrypted auth sessions stored in Supabase for team reuse. |

## Managing Team Auth Sessions

Use `paragon autotest auth` to manage shared authentication sessions:

```
# List all team auth sessions
paragon autotest auth list

# Capture a new session (opens a browser — log in manually, then close)
paragon autotest auth capture https://yourapp.com --name "staging-admin"

# Use the captured session
paragon autotest --url https://yourapp.com --session "staging-admin" --prompt "test dashboard"

# Delete a session
paragon autotest auth delete "staging-admin"
```

## Output

Each finding includes:

- **Title and description** of the bug
- **Severity** rating (`critical`, `high`, `medium`, `low`)
- **Screenshots** captured during the session
- **Reproduction steps** from the agent's browser actions
- **URL** where the issue was found
- **Auto-generated Playwright code** to reproduce the bug

When `--repo` is set, findings are automatically filed as GitHub issues or bundled into a PR with the generated Playwright tests.

---

### Configuration

Configure Paragon with project files and authentication.

## paragon.json

Create a `paragon.json` file in your project root to configure Paragon behavior:

```json
{
  "automations": [
    {
      "name": "Build & Test",
      "prompts": [
        "Run the build",
        "Run all tests",
        "Report any failures"
      ]
    }
  ]
}
```

## PARAGON.md

Use `/init` to generate a `PARAGON.md` file that provides project context to Paragon. This file should contain:

- Project structure overview
- Key technologies and frameworks
- Development conventions
- Important architectural decisions

## Authentication

### Login

```bash
paragon auth login
```

Opens a browser-based authentication flow to connect your Paragon account.

### Status

```bash
paragon auth status
```

Displays current authentication state and account information.

### Logout

```bash
paragon auth logout
```

Removes stored credentials.

## FAQ

## Getting Help

- Use `/help` to view keyboard shortcuts
- Run `paragon --help` for CLI options
- Contact support@polarity.so for assistance

---

## Tools

### Claude Code

Connect Paragon to Claude Code via MCP

Add Paragon as an MCP server in Claude Code to get code review, testing, and analysis tools directly in your terminal.

## Prerequisites

- Active Claude subscription (Pro, Max, or API access)
- [Paragon CLI installed](/paragon/overview)

## Get your API key

1. Go to [app.paragon.run](https://app.paragon.run)
2. Sign in or create an account
3. Navigate to **Settings** → **API Keys**
4. Copy your API key

## Setup

1. Install Claude Code:

```bash
npm install -g @anthropic-ai/claude-code
```

2. Add Paragon as an MCP server:

```bash
claude mcp add-json paragon '{"command":"paragon","args":["mcp-server"],"env":{"POLARITY_API_KEY":"your-api-key-here"}}'
```

Replace `your-api-key-here` with your API key from [app.paragon.run](https://app.paragon.run).

3. Verify the server is connected:

```bash
claude mcp list
```

You should see `paragon` listed as a configured MCP server.

## Available tools

Once connected, Paragon exposes 12 tools to Claude Code:

### Code review

| Tool | Description |
|------|-------------|
| `run_paragon` | Run a Paragon prompt — reads/writes files, runs shell commands, and performs complex coding tasks. Requires API key. |
| `list_reviewed_prs` | List pull requests with Paragon review comments in the current repository. Supports filtering by state (`open`, `closed`, `all`). |
| `get_review_comments` | Get parsed review findings from a PR, including severity, descriptions, file locations, and suggested fixes. Auto-detects PR from current branch if omitted. |
| `resolve_review_comment` | Reply to a review comment and optionally resolve the thread. |

### Testing

| Tool | Description |
|------|-------------|
| `detect_test_framework` | Scan the repository and detect test frameworks in use. Returns framework details, run commands, file patterns, and confidence level. |
| `find_tests` | Discover existing test files using framework-aware file patterns. Returns file paths and count. |
| `run_tests` | Execute unit/integration tests using the detected or specified framework. Returns pass/fail status and full output. |
| `run_e2e_tests` | Run Playwright E2E tests locally. Installs browsers if needed, optionally starts a dev server, and captures results. |
| `generate_tests` | Generate unit, integration, or E2E tests using Paragon. Detects patterns, writes tests, verifies compilation, runs them, and fixes failures. Requires API key. |

### Test suite management

| Tool | Description |
|------|-------------|
| `list_test_suites` | List your test repositories and suites from the Paragon cloud dashboard. Requires API key. |
| `save_to_suite` | Save locally generated test files to a Paragon cloud test suite. Requires API key. |

### Utility

| Tool | Description |
|------|-------------|
| `check_setup` | Check if the MCP server is properly configured. Reports the status of the API key, Paragon binary, and GitHub CLI. |

## Usage

After setup, Claude Code can automatically use Paragon tools when relevant. You can also request them directly:

```
> Review the comments on my current PR and fix the issues

> Generate unit tests for src/auth.ts

> Run the test suite and show me what's failing

> Find all Playwright tests and run them in headed mode
```

## Configuration

To update your API key or other environment variables, remove and re-add the server:

```bash
claude mcp remove paragon
claude mcp add-json paragon '{"command":"paragon","args":["mcp-server"],"env":{"POLARITY_API_KEY":"your-new-api-key"}}'
```

---

### Cursor

Connect Paragon to Cursor via MCP

Add Paragon as an MCP server in Cursor to get code review, testing, and analysis tools directly in your editor.

## Prerequisites

- Cursor editor installed
- [Paragon CLI installed](/paragon/overview)

## Get your API key

1. Go to [app.paragon.run](https://app.paragon.run)
2. Sign in or create an account
3. Navigate to **Settings** → **API Keys**
4. Copy your API key

## Setup

1. Open Cursor and go to **Settings > MCP**. Click **+ Add new MCP server**.

2. Select **Type: command** and add the following configuration:

```json
{
  "mcpServers": {
    "paragon": {
      "command": "paragon",
      "args": ["mcp-server"],
      "env": {
        "POLARITY_API_KEY": "your-api-key-here"
      }
    }
  }
}
```

Replace `your-api-key-here` with your API key from [app.paragon.run](https://app.paragon.run). Alternatively, create a `.cursor/mcp.json` file in your project root with the same configuration.

3. After saving, you should see **paragon** listed in your MCP servers with a green status indicator.

## Available tools

Once connected, Paragon exposes 12 tools to Cursor:

### Code review

| Tool | Description |
|------|-------------|
| `run_paragon` | Run a Paragon prompt — reads/writes files, runs shell commands, and performs complex coding tasks. Requires API key. |
| `list_reviewed_prs` | List pull requests with Paragon review comments in the current repository. Supports filtering by state (`open`, `closed`, `all`). |
| `get_review_comments` | Get parsed review findings from a PR, including severity, descriptions, file locations, and suggested fixes. Auto-detects PR from current branch if omitted. |
| `resolve_review_comment` | Reply to a review comment and optionally resolve the thread. |

### Testing

| Tool | Description |
|------|-------------|
| `detect_test_framework` | Scan the repository and detect test frameworks in use. Returns framework details, run commands, file patterns, and confidence level. |
| `find_tests` | Discover existing test files using framework-aware file patterns. Returns file paths and count. |
| `run_tests` | Execute unit/integration tests using the detected or specified framework. Returns pass/fail status and full output. |
| `run_e2e_tests` | Run Playwright E2E tests locally. Installs browsers if needed, optionally starts a dev server, and captures results. |
| `generate_tests` | Generate unit, integration, or E2E tests using Paragon. Detects patterns, writes tests, verifies compilation, runs them, and fixes failures. Requires API key. |

### Test suite management

| Tool | Description |
|------|-------------|
| `list_test_suites` | List your test repositories and suites from the Paragon cloud dashboard. Requires API key. |
| `save_to_suite` | Save locally generated test files to a Paragon cloud test suite. Requires API key. |

### Utility

| Tool | Description |
|------|-------------|
| `check_setup` | Check if the MCP server is properly configured. Reports the status of the API key, Paragon binary, and GitHub CLI. |

## Usage

After setup, Cursor's agent mode can automatically use Paragon tools when relevant. You can also request them directly in chat:

```
Review the comments on my current PR and fix the issues

Generate unit tests for src/auth.ts

Run the test suite and show me what's failing

Find all Playwright tests and run them in headed mode
```

---

## Support

### FAQ

Common questions and troubleshooting.

## Getting Started

  
  

  

## PR Reviews

  

  

  

## Paragon CLI

  

  

## Billing & Credits

  

## Troubleshooting

  

  

## Contact Support

Still have questions? Reach out to us:

- **Email**: [support@polarity.so](mailto:support@polarity.so)
- **GitHub Issues**: Report bugs or request features

---