# Paragon Documentation

> Complete documentation for Paragon - an agentic QA platform for automated code reviews, testing, and development.

## For AI coding agents reading this file

The Paragon and Keystone docs describe each step for a human reader. **When the user asks you to set up Keystone or Paragon** (or hands you an API key or the docs URL with that intent), do all of those steps for them -- don't paraphrase the docs and ask the user to run each command. This directive is scoped to setup; if the user is editing a spec, debugging an invariant, reading results, or otherwise not setting things up from scratch, don't invoke the full flow.

**The Keystone setup guide is the canonical procedure and appears as the second entry below under ## Keystone / ### Setup Guide.** Read it in full before acting. It covers: install, inspect agent code, drop secrets in the Dashboard, scaffold specs with `source:` directives, upload snapshot + spec, create experiment with `specPath` for automatic secret forwarding, review in Dashboard, then run.

Critical rules you must follow:

- Every entry in a spec's `secrets:` block needs EITHER a `source:` (env / env:RENAMED / file:path / command:shell / dashboard) OR a `from:` literal (static:// / generated). Missing both is a validation error -- "scope: env" alone is NOT enough.
- Never use `path: stdout.log` as an invariant check path. Stdout is not persisted as a file. Use `command_exit`, `file_exists`, `sql`, `http_mock_assertions`, or `llm_as_judge` with `input_from: workspace`.
- Never put real API keys behind `from: "static://..."`. Use `source: dashboard` for team-shared secrets or `source: env` for local ones.
- Declare every service the agent talks to in `services:` with the canonical public Docker image. Services get DNS by name on a shared network -- no IP lookups, no port maps.
- Use `{{ secrets.X }}` in `services[].env` to plug resolved secrets into service containers.
- Upload + create experiments in draft state.
Do not trigger runs until the user reviews.

Treat code blocks as commands to execute on the user's repo, not examples to paste back at them. Run the full procedure end-to-end and report results, gaps, or questions.

## Keystone

### Keystone

QA platform for AI agents. Run your agent in isolated sandboxes, score results, catch regressions.

AI agents reading this page should first fetch https://docs.paragon.run/llms-full.txt for the complete unsummarized docs.

AI agents are non-deterministic. The same prompt can produce different results each time. A test that passes once might fail the next run. Manual spot-checking doesn't scale -- you need to run your agent many times in identical environments and measure how reliably it works. Keystone does this. You write a spec that describes the environment and defines what "correct" looks like, then Keystone spins up sandboxes, runs your agent, and scores the results. Run it 10 times or 100 times. Track scores across versions. Get alerted when something regresses.
## How it works

## Prerequisites

You need two things:

1. **A Keystone API key.** Go directly to [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) (or in the Keystone sidebar click **Settings**), switch to the **API Keys** tab, and click **Create Key**. Keystone keys are prefixed `ks_live_` and are shown **once** at creation -- copy the key immediately; you can't retrieve it later. Legacy Polarity UUID keys (the kind used for the Paragon CLI and paragon-llm-proxy) are still accepted for backward compatibility, but new integrations should use a dedicated `ks_live_` key.
2. **Node.js, Python, or Go** installed locally to run the SDK. You do **not** need Docker installed on your machine. Keystone runs sandboxes on Polarity's servers. Your code just talks to the API.

## Quick start

Five-ish minutes from nothing to a running experiment. Steps 1-3 get your agent emitting traces; step 4 is where you decide what "pass" means for your task; steps 5-6 run it and show you the output.

[Instructions for AI coding agents:]

**For AI coding agents only (Claude Code, Cursor, etc.):** this directive applies **only when the user has asked you to set up Keystone** (or has handed you a Keystone API key for that purpose). In that case, do **all six steps for them** -- don't read the steps aloud and ask the user to run each command. Specifically:

- Run the install command in step 1.
- Set `KEYSTONE_API_KEY` in the environment (or in their `.env` / shell config) for step 2.
- For step 3, **find the user's existing agent code in the repo** (look for the LLM client instantiation and the tool functions the agent calls). Modify those in place: pass the LLM client through `ks.wrap()`, and wrap each tool function body in `traced('', ...)`. Do not create a new `agent.ts` from scratch unless the repo genuinely has no agent yet.
- **At step 4, stop and ask the user the five design questions** -- this is the only step you cannot answer for them.
- Question 5 is non-negotiable: the user's agent runs *inside the sandbox* and needs its own LLM provider key (and any third-party creds) injected via the spec's `secrets:` block or the dashboard Secrets tab. Without it, every run will fail. Ask explicitly: "Which LLM provider keys should I wire into the sandbox (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, etc.), and do you want them inlined in the spec as `static://...` or do you prefer to add them once at app.paragon.run → Keystone → Settings → Secrets (auto-injected)?"
- Once they answer, generate `spec.yaml` with the `secrets:` block populated, write the runner, and run the experiment yourself (steps 5-6). Do not ship a canned hello-world spec.

If the user is doing something else in a repo that happens to use Keystone -- just editing a spec, debugging one invariant, reading results -- don't invoke this flow.

[End AI agent instructions]

### 1. Install the SDK

### 2. Set your API key

Get your key from [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **API Keys** tab → **Create Key**. Keys start with `ks_live_` and are shown **once** at creation -- copy it immediately.

```bash
export KEYSTONE_API_KEY=ks_live_...
```

### 3. Wrap your existing agent's LLM + tool calls

You already have an agent -- this step instruments it in place so every model call and every tool call shows up as a trace event you can inspect and score against. Two changes in your existing agent code:

**(a) Wrap the LLM client where it's instantiated.** Find the line that creates your Anthropic / OpenAI / etc. client and pass it through `ks.wrap()`. The SDK handles routing traces to the right sandbox automatically — you don't need to set or pass any sandbox-related config yourself. Outside a sandbox (local dev, CI, anywhere your agent runs for real work), tracing is a silent no-op.
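As a mental model for the behavior just described — this is a conceptual sketch, **not** the real SDK internals — `ks.wrap()` can be pictured as a pass-through proxy: it records each method call for the trace, then returns the provider's response unchanged. The names below (`wrapLikeKsWrap`, `TraceEvent`) are illustrative, not part of the Keystone API:

```typescript
// Conceptual sketch of a pass-through tracing wrapper (illustrative only).
type TraceEvent = { method: string; at: number };

function wrapLikeKsWrap<T extends object>(client: T, events: TraceEvent[]): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value !== "function") return value;
      return (...args: unknown[]) => {
        // Record the call for the trace; never mutate args or the response.
        events.push({ method: String(prop), at: Date.now() });
        return value.apply(target, args);
      };
    },
  });
}
```

The point of the sketch is the contract, which matches what the docs promise: the wrapped client behaves identically to the unwrapped one, and recording failures would be swallowed rather than surfaced to your agent.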
**(b) Wrap each tool function with `traced()`.** Find the functions your agent calls as tools (read/write file, shell, HTTP, DB queries, anything) and wrap their bodies. Use the tool name as the span name so it shows up correctly in the trace tree. Repeat for every tool the agent calls. Nested `traced()` calls automatically build a parent-child span tree so you can see exactly which LLM turn triggered which tool calls.

You can pull the trace back at any point to verify it's flowing:

```typescript
const trace = await ks.sandboxes.getTrace(process.env.KEYSTONE_SANDBOX_ID!);
console.log(`${trace.events.length} events, $${trace.metrics.total_cost_usd} cost`);
```

`ks.wrap()` works for Anthropic, OpenAI, and any OpenAI-compatible provider (Groq, xAI, Together) -- see the [SDK reference](/keystone/sdk#llm-tracing). Wrapping never changes the response; if tracing fails it fails silently so your agent keeps running.

### 4. Tell Keystone what to evaluate

Now that your agent reports traces, decide what a **pass** looks like. Every Keystone spec is an answer to five questions -- if you're working with an AI coding assistant, **the assistant should ask you these directly** rather than inventing a canned example.

1. **What task should the agent do?** One sentence the agent receives as a prompt. ("Fix the failing test in `src/api.test.ts`.", "Turn this PRD into a working Stripe integration.", "Refactor `users.py` to use async SQLAlchemy.")
2. **What environment does it need?** Base image (ubuntu, node, python), packages, repos to clone, and any backing services (Postgres, Redis, a mock API).
3. **How do you know it worked?** The invariants -- concrete yes/no checks. File contents, test pass/fail, HTTP responses, SQL queries. At least one should be a `gate` (a hard fail if it doesn't pass).
4. **What's off-limits?** Forbidden filesystem paths, network hosts, or behaviors that should auto-fail the run.
5.
**What API keys does your agent need at runtime?** Your agent runs *inside the sandbox*, so it needs its own LLM provider key (e.g. `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`) plus any third-party creds (Stripe, GitHub, etc.). **Without this the agent can't make model calls and every run will fail.**

Write those answers into a YAML spec. Here's the minimal shape -- fill in the bracketed sections, save as `spec.yaml`:

```yaml
version: 1
id: ""
description: ""
base: ""
# repos: [{ url: ..., path: ... }]          # optional: clone a repo into the sandbox
# services: { db: { image: postgres:16 } }  # optional: backing services

task:
  prompt: |

agent:
  type: paragon  # or: cli | image | http | python | snapshot (see "Swap in your own agent" below)
  timeout: 5m

# Your agent runs inside the sandbox -- give it the keys it needs to call LLMs
# and any third-party APIs. Two ways to supply these:
#   (a) Store once at app.paragon.run -> Keystone -> Settings -> Secrets.
#       Dashboard-stored secrets auto-inject into every sandbox as env vars.
#   (b) Declare inline here -- use `static://...` for fixed values, `generated`
#       for per-run random values. See /keystone/specs#secrets for details.
secrets:
  - name: ANTHROPIC_API_KEY
    from: "static://"  # or omit this whole block if stored in dashboard
    scope: env

invariants:
  :
    description: ""
    weight: 1.0
    gate: true  # mark at least one as a hard gate
    check:
      type:
      # ...type-specific fields

scoring:
  pass_threshold: 1.0
```

See the [spec reference](/keystone/specs) for every field and every invariant `check.type`. For real scenarios (fix-a-failing-test, migrate-a-schema, build-a-feature-from-a-PRD), see [examples](/keystone/examples).

### 5. Upload and run

The experiment typically takes 10-30 seconds. Keystone creates a sandbox on the server, runs the shell command, checks the invariants, and returns the results.

### 6. What you'll see

When the experiment completes, you get back a `RunResults` object.
Here's what a passing run looks like:

```json
{
  "experiment_id": "exp-a1b2c3",
  "total_scenarios": 1,
  "passed": 1,
  "failed": 0,
  "metrics": { "pass_rate": 1.0, "mean_wall_ms": 12000, "total_cost_usd": 0.0 },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "", "passed": true, "gate": true, "weight": 1.0 },
        { "name": "", "passed": true, "weight": 1.0 }
      ]
    }
  ]
}
```

And here's what a failure looks like -- the `message` field tells you exactly what went wrong:

```json
{
  "passed": 0,
  "failed": 1,
  "metrics": { "pass_rate": 0.0 },
  "scenarios": [
    {
      "status": "fail",
      "composite_score": 0.0,
      "invariants": [
        { "name": "", "passed": false, "gate": true, "message": "" }
      ],
      "reproducer": {
        "seed": 12345,
        "command": "keystone run --spec --seed 12345 --scenario scenario-000"
      }
    }
  ]
}
```

The `reproducer` gives you the exact command to re-run that specific scenario with the same seed for debugging.

## Swap in your own agent

The spec template uses `agent: type: paragon` (Polarity's built-in agent) by default.
To run your own agent, replace that block with one of these:

```yaml
# You have a CLI binary on the server:
agent:
  type: cli
  binary: /path/to/your-agent
  args: ["--task", "{{ task.prompt }}"]
  timeout: 5m

# You have a Docker image in a registry:
agent:
  type: image
  image: "your-registry/your-agent:latest"
  timeout: 5m

# Your agent is an HTTP API:
agent:
  type: http
  endpoint: "https://your-api.com/agent/run"
  timeout: 5m

# You have a Python script:
agent:
  type: python
  binary: agent.py
  timeout: 5m

# You want to version your agent as an immutable snapshot:
agent:
  type: snapshot
  snapshot: my-agent  # uploaded via ks.agents.upload()
  timeout: 5m
```

**Which one should I use?**

- **`cli`** -- your agent is a compiled binary or shell script that runs locally
- **`image`** -- your agent is packaged as a Docker image (best for reproducibility)
- **`http`** -- your agent is a hosted API (the sandbox POSTs the task to your endpoint)
- **`python`** -- your agent is a Python script in the sandbox
- **`snapshot`** -- your agent is uploaded to Keystone and versioned (best for tracking which agent version produced which results)
- **`paragon`** -- use Polarity's built-in Paragon agent (no setup needed)

See the [spec reference](/keystone/specs#agent) for full details.

## Key concepts

Before diving into the [spec reference](/keystone/specs) and [SDK reference](/keystone/sdk), here are the core concepts:

**Specs** are YAML files that describe everything about a test scenario: the environment, the task, and the pass/fail criteria. You upload them once and run experiments against them repeatedly.

**Sandboxes** are isolated environments where your agent runs. Each sandbox gets its own filesystem, Docker containers for backing services (Postgres, Redis, etc.), and a clean state. Nothing leaks between runs.

**Invariants** are the checks that run after your agent finishes. Each one answers a yes/no question: "Did the tests pass?" "Does the output file contain the right data?"
"Did the agent call the API correctly?" Gate invariants cause an immediate fail if they don't pass.

**Forbidden rules** define what the agent must NOT do. If it writes to a file outside the allowed list, makes HTTP calls to unauthorized hosts, or leaks secrets in stdout, the run fails regardless of the invariant scores.

**Experiments** run your spec one or more times and aggregate the results. Run 10 replicas to measure consistency. Use a matrix to test different parameters. Compare experiments to catch regressions.

**Alerts** notify you via Slack or webhook when metrics cross a threshold. Set `pass_rate < 0.8` and get a message when your agent starts failing more often.

## Troubleshooting

### Experiment fails with "spec not found"

You need to upload the spec before creating an experiment. Run `ks.specs.create(yaml)` first, then `ks.experiments.create()`.

### Invariant says "file does not exist" but the agent should have created it

The invariant runs in the sandbox workspace directory. Make sure your agent writes files to the current working directory, not an absolute path. The workspace is the root -- if your agent writes to `/tmp/hello.txt`, the invariant looking for `hello.txt` won't find it.

### Agent times out

The default agent timeout is 5 minutes. For long-running tasks, increase it in the spec:

```yaml
agent:
  timeout: 15m
```

The sandbox-level timeout (under `resources.timeout`) must also be large enough to cover setup + agent execution + scoring.

### "sandbox rejected: at capacity"

The server has a limit on concurrent sandboxes. Wait for current experiments to finish or contact support to increase your limit.
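To make the upload-before-create ordering behind "spec not found" concrete, here is a runnable sketch against a **stubbed** client. The stub's shape mirrors the `ks.specs.create` / `ks.experiments.create` calls shown in the quick start, but the class itself (`StubKeystone`) and its internals are purely illustrative — the real `new Keystone()` client replaces it:

```typescript
// Illustrative stub, NOT the Keystone SDK: models the rule that a spec_id
// must be registered via specs.create() before experiments.create() can use it.
class StubKeystone {
  private known = new Set<string>();

  specs = {
    // Registers the spec's id (parsed naively from the YAML for this demo).
    create: (yaml: string): string => {
      const id = /id:\s*"?([A-Za-z0-9_-]+)"?/.exec(yaml)?.[1] ?? "unknown";
      this.known.add(id);
      return id;
    },
  };

  experiments = {
    // Fails with "spec not found" when the spec was never uploaded.
    create: (opts: { name: string; spec_id: string }) => {
      if (!this.known.has(opts.spec_id)) {
        throw new Error(`spec not found: ${opts.spec_id}`);
      }
      return { id: `exp-${opts.spec_id}`, name: opts.name };
    },
  };
}
```

Against the real client the fix is the same shape: call `ks.specs.create(yaml)` and only then `ks.experiments.create({ ... })`.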
## Next steps

- [Spec reference](/keystone/specs) -- every section of a spec file explained with examples
- [SDK reference](/keystone/sdk) -- all SDK methods across TypeScript, Python, and Go
- [Examples](/keystone/examples) -- full real-world specs and how to build agents that run inside Keystone

---

### Setup Guide

The complete procedure for wiring Keystone into an existing agent — written for humans and for AI coding assistants. Follow this page top to bottom.

You have an existing agent and you want to evaluate it with Keystone. This page is a strict procedure — follow it in order and you'll end with a working spec that passes on the first run.

---

## Step 0 — Install the SDK and set your API key

Generate the key at [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **API Keys**.

---

## Step 1 — Inspect the agent code

Before you write any spec, you need three facts about the agent:

Write these three facts down before proceeding. The rest of the guide is mechanical — fill in a template with what you found.

---

## Step 2 — Drop your provider keys into the Dashboard

For the secrets the team shares (`XAI_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.), go to [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **Secrets** tab and paste the values. They're AES-256-GCM encrypted at rest and shared with your teammates on the same billing account.

---

## Step 3 — Scaffold the spec from the template

Copy this template to `specs/scenario-1.yaml` and fill in the marked sections. Everything else stays as-is.

```yaml
version: 1
id: "
```

---

## Step 4 — Upload the agent and the spec

After this step, the experiment appears in the Dashboard as **Draft** — nothing is running yet.

---

## Step 5 — Review in the Dashboard, then run

Open [app.paragon.run/app/keystone/experiments](https://app.paragon.run/app/keystone/experiments). Click into your new experiment.
You'll see:

- The spec YAML that was uploaded
- The secret names it declared (with warnings if any aren't set yet)
- The services it will boot
- The invariants that will score it

When it looks right, hit **Run**. The experiment moves from `draft` → `running` → `completed` (or `failed`). Watch results stream in.

Programmatic equivalent:

---

## Common first-run failures + fixes

| What you see | Cause | Fix |
| --- | --- | --- |
| `secret "X" declared in spec but not set in Dashboard or forwarded from .env` | The name in `secrets:` has no value at any layer | Either add it to the Dashboard or export it in your shell / `.env` |
| Sandbox boot hangs on `container_started`, agent never runs | No egress to your package registry for `npm install` / `pip install` | Either include `node_modules` / `venv` in the snapshot tarball OR add `registry.npmjs.org` / `pypi.org` to `network.egress.allow` |
| Every invariant fails with "file not found" | Invariant paths assume files the agent didn't actually write | Run the sandbox once manually, `ls /workspace`, see what's there, rewrite the invariant paths |
| Agent connects to `localhost:5432` and errors "connection refused" | Code uses `localhost`, but services are reachable by service name (`db`, `redis`, etc.) | Change connection strings to `postgres://db:5432/...`, or read `KEYSTONE_SERVICE_DB_HOST` env var |
| `llm_as_judge` fails with "no card on file" | Judges are billable and need a payment method | Add a card at [app.paragon.run/app/keystone/data/billing](https://app.paragon.run/app/keystone/data/billing) |
| Experiment stuck at `draft` forever | You uploaded it but never called `run` | Click **Run** in the Dashboard, or call `ks.experiments.run(exp.id)` / `runAndWait` |

---

## Related reading

- [Spec schema](/keystone/specs) — full field-by-field reference
- [Examples](/keystone/examples) — the canonical end-to-end example with all primitives
- [SDK methods](/keystone/sdk) — every method across TS/Python/Go
- [Billing](/keystone/billing) — how compute + LLM judge costs are billed

---

### Examples

Keystone specs from simple to complex, plus how to build agents that run inside Keystone.

AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content.

Examples are ordered from simple to complex. Start with the hello world to verify your setup, then work up to real-world scenarios.

## Canonical end-to-end example

This is the single pattern that covers 90% of real agent evals: your agent (uploaded as a snapshot), a Postgres database you seed, a mocked external API, a secret read from the caller's local `.env`, and invariants that verify the agent did the right thing. Copy this spec and adapt — it's the "looks right" template most scenarios grow from.

### Project layout

```
my-agent/
├── .env              # local secrets (gitignored)
├── src/
│   └── index.ts      # your agent code
├── seed.sql          # DB fixtures
└── specs/
    └── scenario-1.yaml
```

### `.env` (on the caller's machine, not committed)

```bash
XAI_API_KEY=xai-...
DB_PASSWORD=localdevpass
STRIPE_SECRET=sk_test_...
```

### `specs/scenario-1.yaml`

```yaml
version: 1
id: email-agent-renewal-alice
description: "Agent drafts a renewal email to a specific customer"

agent:
  type: snapshot
  snapshot: email-agent  # uploaded via ks.agents.upload()

# Declare every secret + where to pull it from.
# SDK resolves these on the caller's machine and forwards in the request.
secrets:
  - name: XAI_API_KEY
    source: env        # from $XAI_API_KEY in your local env
  - name: DB_PASSWORD
    source: env        # same — will flow into the service below
  - name: STRIPE_SECRET
    source: env
  - name: STRIPE_LIVE_KEY
    source: dashboard  # server-side only; never forwarded from local

# Services Keystone pulls from Docker Hub and runs on a shared network.
# The agent reaches each by `name` — e.g. postgres://db:5432.
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: northwind
    ports: [5432]
    wait_for: "pg_isready -h localhost"
  - name: stripe_mock
    image: stripe/stripe-mock:latest
    env:
      STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
    ports: [12111]

# Seed the DB before the agent starts.
fixtures:
  - type: sql
    service: db
    path: seed.sql

# Allowlist only the domains your agent needs to reach externally.
network:
  egress:
    default: deny
    allow:
      - api.x.ai              # your LLM provider
      - keystone.polarity.so  # trace ingestion (auto-allowed in blessed runtime images)

task:
  prompt: "Draft a subscription renewal reminder email to alice@northwind.co."

invariants:
  draft_created:
    description: "Exactly one draft file"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "test $(ls drafts/*.md 2>/dev/null | wc -l) -eq 1"
  addressed_to_alice:
    description: "Draft is addressed to alice@northwind.co"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "grep -l 'To: alice@northwind.co' drafts/*.md"
  reads_professional:
    description: "Draft is professional and mentions the product"
    weight: 0.5
    check:
      type: llm_as_judge
      input_from: workspace
      criteria: "The draft is professional and references alice's product."
      # model defaults to paragon-md; billed per-token to your Keystone invoice

scoring:
  pass_threshold: 0.85
```

### Running it

### What happens behind the scenes

1. **SDK resolves secrets** on your machine per `source:` directives. `STRIPE_LIVE_KEY` isn't forwarded — server reads it from the Dashboard. `XAI_API_KEY`, `DB_PASSWORD`, `STRIPE_SECRET` are forwarded from `.env`.
2. **Keystone server** pulls `postgres:16`, `stripe/stripe-mock:latest`, and the blessed Node runtime image from Docker Hub.
3. **A new Docker network** `keystone-` is created. All containers join it with their `name` as the DNS alias.
4. **Services start first**, with `{{ secrets.X }}` substituted to actual values. `seed.sql` runs against the Postgres container.
5. **Agent container** starts with your tarball mounted at `/workspace`, secrets injected as env vars (including `KEYSTONE_SERVICE_DB_HOST=db`, `KEYSTONE_SERVICE_DB_PORT=5432`, etc. for service discovery).
6. **Agent runs** the entrypoint, reads the task prompt from stdin, produces files in `/workspace/drafts/`. Traces stream to `keystone.polarity.so` over HTTPS.
7. **Scorer runs** invariants against the post-run workspace. `llm_as_judge` calls Paragon's model proxy (billed to your Keystone invoice).
8. **Teardown** — containers destroyed, network removed, workspace retained per your retention policy.

The entire lifecycle is declared in one spec + one `.env`. No Dockerfile, no separate build step for services, no manual secret entry.

---

## Hello world

The simplest possible spec. No external repos, no agent binary, no services. A shell command creates a file and the invariants check it exists with the right content.

```yaml
version: 1
id: "hello-world"
description: "Verify a command can create a file"
base: "ubuntu:24.04"

task:
  prompt: "Create hello.txt containing 'Hello, Keystone!'"

agent:
  type: cli
  binary: sh
  args: ["-c", "echo 'Hello, Keystone!' > hello.txt"]
  timeout: 1m

invariants:
  file_created:
    description: "hello.txt exists"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: hello.txt
  correct_content:
    description: "Contains the expected text"
    weight: 1.0
    check:
      type: file_content
      path: hello.txt
      contains: "Hello, Keystone!"

scoring:
  pass_threshold: 1.0
```

```typescript
const ks = new Keystone();
await ks.specs.create(readFileSync('hello.yaml', 'utf-8'));
const exp = await ks.experiments.create({ name: 'hello', spec_id: 'hello-world' });
const results = await ks.experiments.runAndWait(exp.id);
console.log(results.passed > 0 ? 'PASSED' : 'FAILED');
```

If this passes, your setup is working. Move on to a real scenario.

---

## Fix a failing test

A realistic spec that clones a repo, installs dependencies, and checks whether the agent can fix a broken test. Uses Paragon as the agent.

```yaml
version: 1
id: "fix-unit-test"
description: "Agent fixes a failing unit test"
base: "ubuntu:24.04"

setup:
  packages: [nodejs, npm]
  commands: ["npm install"]

fixtures:
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "broken-test"

task:
  prompt: |
    The test in src/utils.test.ts is failing. Fix it.
    Do not change the implementation in src/utils.ts.

agent:
  type: paragon
  model: paragon-fast
  timeout: 3m

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
  no_impl_changes:
    description: "Implementation file was not modified"
    weight: 0.5
    check:
      type: command_exit
      command: "git diff --name-only | grep -v test"
      exit_code: 1

scoring:
  pass_threshold: 1.0
```

**What's new here:**

- `fixtures` clones a real git repo into the sandbox
- `setup.commands` installs npm dependencies before the agent runs
- The second invariant (`no_impl_changes`) verifies the agent only changed test files, not the implementation -- it expects `grep` to find no matches (exit code 1)

---

## Full stack with services

A complex spec that tests whether an agent can reconcile two database tables and send a summary email. Uses a real Postgres database, a mock SMTP server, LLM-as-judge scoring, and forbidden rules.

```yaml
version: 1
id: "reconciliation"
description: "Agent reconciles two databases and emails a summary"
base: "ubuntu:24.04"

setup:
  packages: [nodejs, npm, python3]
  env:
    DATABASE_URL: "postgres://postgres:test@db:5432/testdb"

services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready -U postgres"
  - name: smtp
    type: http_mock
    ports: [9090]
    record: true
    routes:
      - method: POST
        path: /v1/send
        response: '{"ok": true}'

secrets:
  - name: DB_PASSWORD
    from: generated
    scope: env

fixtures:
  - type: sql
    service: db
    path: seeds/schema.sql
  - type: drift
    target: db.customers_a
    strategy: random_mismatches
    count: 15
    seed: "42"

network:
  egress:
    default: deny
    allow: [registry.npmjs.org]
  dns_overrides:
    smtp.sendgrid.net: smtp.services.internal

task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co

agent:
  type: paragon
  model: paragon-max
  timeout: 5m

invariants:
  databases_match:
    description: "No mismatches after reconciliation"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0
  email_sent:
    description: "Exactly one summary email was sent"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
  email_quality:
    description: "Email is professional and accurate"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Professional tone, accurate mismatch count"
      input_from: smtp.last_request.body

forbidden:
  db_writes_outside: [customers_a, customers_b, audit_log]
  http_except: [smtp]
  secrets_in_logs: deny

scoring:
  pass_threshold: 0.95

parallelism:
  replicas: 10
  isolation: per_run

determinism:
  seed: 42

teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/db.sql"
```

**What's new here:**

- `services` starts a real Postgres container and a mock SMTP server on a shared Docker network
- `fixtures` loads SQL seeds into Postgres and injects random data drift
- `secrets` generates a fresh random password each run
- `network` blocks all outbound traffic except npm and redirects `smtp.sendgrid.net` to the mock
- `forbidden` ensures the agent only writes to allowed tables and doesn't leak secrets
- `invariants` uses three different check types: SQL query, mock assertion, and LLM-as-judge
- `parallelism` runs 10 replicas to measure consistency
- `teardown` exports the audit log and a database dump after every run

---

## Comparing agents

Use `matrix` to test the same task with different configurations side by side.
```yaml
version: 1
id: "agent-comparison"
description: "Compare agent performance across models"
base: "ubuntu:24.04"

setup:
  packages: [python3, pip]
  commands: ["pip install -r requirements.txt"]

fixtures:
  - type: git_repo
    url: "https://github.com/your-org/ml-pipeline"
    branch: "main"

task:
  prompt: "Fix the data pipeline bug that causes duplicate records."

agent:
  type: paragon
  timeout: 5m

invariants:
  no_duplicates:
    description: "No duplicate records in output"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "python3 check_duplicates.py"

scoring:
  pass_threshold: 0.8

parallelism:
  replicas: 5

matrix:
  - model: "paragon-fast"
  - model: "paragon-max"
```

This creates 10 total runs (2 models x 5 replicas). After both finish, compare them:

```typescript
const comparison = await ks.experiments.compare(fastExpId, maxExpId);
console.log(`Regressed: ${comparison.regressed}`);
comparison.metrics.forEach(m => {
  console.log(`${m.name}: ${m.baseline} -> ${m.candidate} (${m.direction})`);
});
```

---

## Setting up alerts

Get notified when your agent's performance drops.

```typescript
// Slack Bot alert (posts to a channel)
await ks.alerts.create({
  name: 'pass-rate-drop',
  eval_id: 'reconciliation',
  condition: 'pass_rate < 0.9',
  notify: 'slack',
  slack_channel: '#agent-alerts',
});

// Webhook alert (Slack incoming webhook auto-detected)
await ks.alerts.create({
  name: 'cost-spike',
  condition: 'mean_cost_per_run_usd > 2.00',
  notify: 'webhook',
  webhook_url: 'https://hooks.slack.com/services/T00/B00/xxx',
});
```

Alerts evaluate after every experiment run. See [SDK reference > Alerts](/keystone/sdk#alerts) for all supported metrics and operators.

---

## Using sandboxes directly

You don't have to run full experiments. You can create a sandbox from a spec, interact with it manually, and inspect the results. This is useful for development and debugging.
```typescript
const ks = new Keystone();

// Create a sandbox
const sb = await ks.sandboxes.create({ spec_id: 'fix-unit-test' });
console.log(`Sandbox ${sb.id} is ${sb.state}`);
console.log(`Services:`, sb.services);
// { db: { host: "db", port: 5432, ready: true } }

// Run commands
const test = await ks.sandboxes.runCommand(sb.id, { command: 'npm test' });
console.log(`Exit code: ${test.exit_code}`);
console.log(`Output: ${test.stdout}`);

// Read and write files
const src = await ks.sandboxes.readFile(sb.id, 'src/utils.ts');
await ks.sandboxes.writeFile(sb.id, 'src/utils.ts', fixedCode);

// See what changed
const diff = await ks.sandboxes.diff(sb.id);
console.log(`Modified: ${diff.modified}`);
console.log(`Added: ${diff.added}`);

// Clean up
await ks.sandboxes.destroy(sb.id);
```

This gives you a remote dev environment with services pre-configured. You can use it for manual testing, scripted integration tests, or building custom evaluation pipelines.

---

## Building an agent for Keystone

If you're building an agent that's designed to run inside Keystone sandboxes, here's what you need to know from the agent's perspective.

### What your agent gets

When Keystone runs your agent, it sets up the environment before your code starts:

1. **A workspace directory** -- your agent's working directory, seeded with any fixtures (git repos, files)
2. **Backing services** -- Postgres, Redis, etc. are already running and reachable by name
3. **Environment variables** -- connection info for every service, the sandbox ID, and the Keystone API URL
4.
**The task prompt** -- passed via stdin (for cli/paragon/python agents) or in the HTTP request body (for http agents)

### Environment variables

Keystone injects these into every agent process:

| Variable | Example | Description |
|----------|---------|-------------|
| `KEYSTONE_SANDBOX_ID` | `sb-abc123` | Current sandbox ID |
| `KEYSTONE_BASE_URL` | `http://keystone:8012` | Keystone API URL |
| `KEYSTONE_SERVICE_DB_HOST` | `db` | Hostname for the "db" service |
| `KEYSTONE_SERVICE_DB_PORT` | `5432` | Port for the "db" service |
| `KEYSTONE_SERVICE_CACHE_HOST` | `cache` | Hostname for the "cache" service |
| `KEYSTONE_SERVICE_CACHE_PORT` | `6379` | Port for the "cache" service |

One `HOST`/`PORT` pair is created for each service in the spec, with the service name uppercased.

### Discovering services with the SDK

Instead of parsing env vars manually, use `Keystone.fromSandbox()`. It returns a pre-configured client plus the sandbox's service map; see the SDK reference below for details.

### Auto-tracing LLM calls

Wrap your LLM client so Keystone automatically captures token usage, latency, and tool calls. This data shows up in traces and can be queried by agent name and version.

```typescript
const { client: ks, sandbox } = await Keystone.fromSandbox();
const anthropic = ks.wrap(new Anthropic()); // sandbox id auto-detected from env

// Every messages.create() call now auto-reports to Keystone
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  messages: [{ role: 'user', content: 'Fix the bug in main.py' }],
});
```

### Custom trace spans

For non-LLM operations (file I/O, API calls, test execution), use `traced()` to capture timing and errors:

```typescript
const testOutput = await traced('run_tests', async () => {
  return execSync('npm test').toString();
});

const analysis = await traced('analyze_results', async () => {
  return parseTestOutput(testOutput);
});
```

Nested `traced()` calls create parent-child spans automatically, so you get a full execution tree in the trace viewer.
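If your agent can't use the SDK, the per-service environment variables described above are easy to parse by hand. A minimal sketch -- `servicesFromEnv` is our illustrative helper, not part of the Keystone SDK:

```typescript
// Rebuild the service map from the KEYSTONE_SERVICE_<NAME>_HOST / _PORT
// variables Keystone injects (one pair per service, name uppercased).
export function servicesFromEnv(
  env: Record<string, string | undefined>
): Record<string, { host: string; port: number }> {
  const services: Record<string, { host: string; port: number }> = {};
  for (const [key, host] of Object.entries(env)) {
    const m = key.match(/^KEYSTONE_SERVICE_(.+)_HOST$/);
    if (m && host) {
      // Lowercase the service name back for lookup convenience.
      services[m[1].toLowerCase()] = {
        host,
        port: Number(env[`KEYSTONE_SERVICE_${m[1]}_PORT`] ?? 0),
      };
    }
  }
  return services;
}
```

With the example variables from the table, `servicesFromEnv(process.env)` yields `{ db: { host: 'db', port: 5432 }, cache: { host: 'cache', port: 6379 } }`.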
### Stdout and stderr Everything your agent prints to stdout and stderr is captured and returned in the experiment results: - `scenarios[].agent_output` -- your agent's stdout (truncated to 50KB) - `scenarios[].agent_stderr` -- your agent's stderr (truncated to 50KB) Use stdout for normal output and stderr for debug logging. If `audit.stdout_capture` is enabled in the spec, stdout is also scanned by the `secrets_in_logs` forbidden rule. You can view the captured output when debugging failures: ```typescript const results = await ks.experiments.runAndWait(exp.id); for (const scenario of results.scenarios) { if (scenario.status === 'fail') { console.log('Agent output:', scenario.agent_output); console.log('Agent stderr:', scenario.agent_stderr); } } ``` ### What your agent should NOT do If the spec has `forbidden` rules, Keystone monitors your agent's behavior via the audit log. Common restrictions: - **Don't write to files outside the allowed list** -- if `file_writes_outside: [src/, config/]` is set, writing to `/etc/passwd` fails the run - **Don't make HTTP calls to unauthorized hosts** -- if `http_except: [payment-api]` is set, calling `api.stripe.com` directly fails the run - **Don't print secrets to stdout** -- if `secrets_in_logs: deny` is set, any output containing AWS keys, private keys, or Stripe tokens fails the run These rules are evaluated after your agent finishes. Your agent won't be interrupted mid-execution, but the scenario will be marked as failed if any rule is violated. --- ### Spec Reference Every section of a Keystone spec file, explained with examples. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content. A spec is a YAML file that describes a complete test scenario. It tells Keystone what environment to create, what the agent should do inside it, and how to determine whether the agent succeeded. A minimal spec needs just 5 fields: `version`, `id`, `base`, `task`, and `invariants`. 
Everything else is optional and adds capabilities as you need them.

## Sections overview

**Required for every spec:**

| Section | What it does |
|---------|-------------|
| `task` | The prompt and context for what the agent should do |
| `base` | Base Docker image for the sandbox (e.g. `ubuntu:24.04`) |
| `agent` | How the agent connects and runs |
| `invariants` | Pass/fail checks that run after the agent finishes |
| `scoring` | How check results are combined into a final score |

**Optional -- add when you need them:**

| Section | What it does | When to use |
|---------|-------------|-------------|
| `setup` | Packages to install, commands to run, files to create | When the agent needs dependencies pre-installed |
| `resources` | Limits: timeout, memory, CPU, disk | When defaults (2 CPU, 2GB, 10min) aren't enough |
| `fixtures` | Seed data: clone a repo, load SQL, copy files | When the agent needs a codebase or database to work with |
| `services` | Databases, caches, mock APIs alongside the sandbox | When the agent interacts with external services |
| `secrets` | Test credentials -- generated fresh or from a vault | When the agent needs API keys or passwords |
| `network` | Control outbound access, redirect APIs to mocks | When you want to block real API calls during testing |
| `audit` | Record DB writes, HTTP calls, file changes | When you need forbidden checks or detailed logs |
| `forbidden` | Things the agent must NOT do | When you want to enforce boundaries on agent behavior |
| `parallelism` | Run multiple times with different parameters | When you want to measure consistency or compare configs |
| `determinism` | Pin the clock, seed randomness, freeze DNS | When you need reproducible runs |
| `teardown` | Export audit logs and database dumps | When you want to keep artifacts after each run |

---

## `task`

The prompt that gets sent to the agent. Within `task`, `prompt` is the only required field.
```yaml task: prompt: | Fix the failing test in src/utils.test.ts so that all tests pass. Do not modify the source code in src/utils.ts. context: repo: "acme/backend" language: "typescript" ``` The `context` field is optional key-value metadata passed alongside the prompt. Useful for giving the agent additional structured information. ## `base` The Docker image used as the sandbox's operating system. Required. ```yaml base: "ubuntu:24.04" ``` Any public Docker image works. For custom images, make sure the keystone server can pull from your registry. ## `setup` Runs before the agent starts. Use this to install dependencies, create config files, and set environment variables. ```yaml setup: packages: [nodejs, npm, python3, git] commands: - "npm install" - "python3 -m pip install pytest" files: - path: ".env" content: | DATABASE_URL=postgres://postgres:test@db:5432/testdb NODE_ENV=test env: CI: "true" ``` Setup commands run in the sandbox workspace directory. If any command fails, sandbox creation fails and the experiment reports an error. ## `resources` Resource limits for the sandbox. If not specified, defaults to 2 CPUs, 2GB memory, and a 10-minute timeout. ```yaml resources: timeout: 10m memory: 4Gi cpu: 4 disk: 20Gi ``` The timeout controls how long the entire sandbox lifecycle runs (including setup, agent execution, and scoring). If the agent exceeds this, it gets killed and the scenario is marked as an error. ## `fixtures` Seed data that gets loaded into the sandbox before the agent starts. Fixtures run after services boot but before the snapshot is extracted into `/workspace`, so anything that touches the filesystem needs to either inline its data or ship it alongside the spec. ```yaml fixtures: # Load SQL inline — preferred for test seeds. Self-contained in the spec. 
- type: sql service: db sql: | CREATE TABLE customers ( email TEXT PRIMARY KEY, name TEXT NOT NULL ); INSERT INTO customers VALUES ('alice@example.com', 'Alice'), ('ben@example.com', 'Ben'); # Clone a git repo into the workspace - type: git_repo url: "https://github.com/your-org/your-repo" branch: "main" depth: 1 # Copy a directory into the sandbox - type: directory source: /path/to/test-data target: data/ # Inject random data drift for adversarial testing - type: drift target: db.customers strategy: random_mismatches count: 15 seed: "42" ``` **Fixture types:** | Type | What it does | |------|-------------| | `sql` | Runs SQL against a database service. Provide **either** `sql:` (inline script, preferred) **or** `path:` (file in workspace). Requires `service`. Credentials come from the service's declared `POSTGRES_USER` / `POSTGRES_PASSWORD` / `POSTGRES_DB` env (defaults to `postgres` / `test` / `testdb`). | | `git_repo` | Clones a repository into the sandbox. Supports `branch`, `depth`, and `path`. | | `directory` | Copies files from `source` to `target` inside the sandbox. Both paths are relative to the sandbox workspace directory. | | `drift` | Injects random data modifications for adversarial testing. Uses `strategy`, `count`, and `seed`. | ## `services` Services are backing containers your agent talks to during a run — Postgres, Redis, Stripe-mock, anything. You declare them in the spec; Keystone pulls the image and starts the container on a shared Docker network. Your agent reaches each by its `name` (DNS alias), so the same connection strings work across every run. 
### Any public or private Docker image works

```yaml
services:
  - name: db
    image: postgres:16                    # Docker Hub
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: northwind
    ports: [5432]
    wait_for: "pg_isready -U postgres"

  - name: cache
    image: redis:7                        # Docker Hub
    ports: [6379]

  - name: stripe_mock
    image: stripe/stripe-mock:latest      # third-party public image
    env:
      STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
    ports: [12111]

  - name: vector
    image: ghcr.io/qdrant/qdrant:v1.7.4   # any registry — GHCR, ECR, GAR
    ports: [6333]
```

No Dockerfile. No `docker push`. Keystone pulls on demand (cached after the first run) and attaches the container to the sandbox's shared Docker network with a network alias matching the service `name`.

### Inside the agent, reach services by name

Once the sandbox is up, each service's `name` resolves over DNS on the shared network:

```typescript
// Agent code — no IP lookup, no port mapping, no env configuration needed
const db = new Client({ connectionString: "postgres://postgres:secret@db:5432/northwind" });
const cache = new Redis("redis://cache:6379");
const stripe = new Stripe(process.env.STRIPE_SECRET, { host: "stripe_mock", port: 12111 });
```

Keystone also injects `KEYSTONE_SERVICE_<NAME>_HOST` and `KEYSTONE_SERVICE_<NAME>_PORT` environment variables for each service, with the service name uppercased.

## `secrets`

The spec is the single source of truth for what secrets your agent needs and where each value comes from. You declare the name and (optionally) the `source:` — the SDK resolves, forwards in the create request, and the Keystone server injects into the sandbox container env.
```yaml
secrets:
  - name: XAI_API_KEY
    source: env                       # read from $XAI_API_KEY on the caller's machine
  - name: DB_PASSWORD
    source: env:MY_DB_PASS            # rename — read $MY_DB_PASS, inject as DB_PASSWORD
  - name: OPENAI_API_KEY
    source: "file:~/.openai/key"      # read + trim file contents
  - name: INTERNAL_TOKEN
    source: 'command:op read "op://Dev/Keystone/token"'   # exec a shell command, capture stdout
  - name: STRIPE_LIVE_KEY
    source: dashboard                 # server-side only — SDK refuses local overrides
  - name: TEST_FIXTURE_TOKEN
    from: "static://fake-test-value"  # spec-owned literal (deterministic)
  - name: GENERATED_DB_PASSWORD
    from: generated                   # random 32-byte hex per run
```

### Source types

| Source | Resolution | Use for |
|--------|-----------|---------|
| `env` *(default if omitted)* | `process.env[NAME]` | Local `.env` during dev; `export KEY=...` in CI |
| `env:OTHER_NAME` | `process.env[OTHER_NAME]` | Rename (env var on your machine uses a different name than the sandbox expects) |
| `file:path` | Trimmed contents of the file (supports `~/`) | Secrets managed by an external tool that writes to a file (e.g. Cloud SDK auth) |
| `command:<shell-command>` | Exec the shell command, trim stdout | CLI-based secret managers (1Password, Doppler, Infisical, Vault CLI) |
| `dashboard` | Server decrypts Dashboard-stored value; SDK forwards nothing | Prod-critical keys that must never be overridable from a local machine |
| `from: "static://..."` | Spec-owned literal; wins over every other source | Deterministic test fixtures |
| `from: generated` | Random 32-byte hex, unique per run | Ephemeral test passwords |

### Precedence

Highest wins:

1. **Spec literal** (`from: static://...` or `from: generated`) — deterministic, never overridable
2. **SDK-forwarded source value** (`env`, `env:X`, `file:`, `command:`) — resolved on the caller's machine
3.
**Dashboard Secret** — server-side fallback when the source is `dashboard` or resolution failed A declared secret that can't be resolved by any layer fails the sandbox boot with the missing name — no silent empties. ### How the SDK reads it When you call `ks.experiments.create()` or `ks.sandboxes.create()` with a `specPath` (TS) / `spec_path` (Python), the SDK: 1. Parses the spec's `secrets:` block 2. For every entry with a `source:` that isn't `dashboard`, resolves the value locally 3. Forwards `{NAME: value}` in the create request body 4. Server merges with Dashboard secrets (forwarded values win), injects as container env Entries with `from:` (literals) and `source: dashboard` are not forwarded — the server resolves those at sandbox boot. ### Using secrets in services Service env values can reference any resolved secret via `{{ secrets.NAME }}`: ```yaml secrets: - name: DB_PASSWORD source: env services: - name: db image: postgres:16 env: POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}" # substituted at boot POSTGRES_DB: northwind ports: [5432] ``` Same source and precedence rules apply — the substitution uses whatever value the resolver produced. ### Dashboard Secrets tab (team sharing + prod baseline) Go to [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **Secrets** tab to store values encrypted at rest. Dashboard values: - Auto-inject into every sandbox when `source: dashboard` or the caller has no local value for a declared name - Are scoped to the billing owner: on a team, all teammates share the same secrets - Are AES-256-GCM encrypted; decrypted only in-process on the Keystone server - Show a warning next to each key that's declared by a spec but not set ## `network` Controls what the sandbox can access on the network. Use this to prevent your agent from calling real APIs during testing. 
```yaml network: egress: default: deny # block all outbound by default allow: - registry.npmjs.org # let npm install work - github.com - "*.services.internal" # internal service network dns_overrides: smtp.sendgrid.net: smtp.services.internal # redirect real API to mock ``` When `egress.default` is `deny`, only explicitly allowed hosts are reachable. DNS overrides redirect hostnames to different targets -- useful for pointing real API domains at mock services. ## `audit` Records everything the agent does inside the sandbox. The audit log is used by forbidden checks and can be exported during teardown. ```yaml audit: db_writes: true # track INSERT/UPDATE/DELETE http_calls: true # track outbound HTTP requests process_spawns: true # track child processes stdout_capture: true # capture stdout for secret detection file_system: watch: ["src/", "config/"] # directories to monitor track: [writes, reads, deletes] ``` ## `agent` Defines how the agent runs inside the sandbox. Keystone supports six agent types. ```yaml # Run the Paragon CLI agent: type: paragon model: paragon-fast timeout: 5m # Run any CLI binary agent: type: cli binary: /usr/local/bin/my-agent args: ["--task", "{{ task.prompt }}"] timeout: 5m # Call an HTTP endpoint agent: type: http endpoint: "https://api.example.com/agent/run" auth: bearer: "{{ secrets.API_KEY }}" timeout: 5m # Run a Python script agent: type: python binary: agent.py timeout: 5m # Run a Docker image agent: type: image image: "myregistry.io/my-agent:v3" timeout: 5m # Run from an uploaded agent snapshot agent: type: snapshot snapshot: my-agent # resolves to latest version # or pin a specific version: # snapshot_id: snap_abc123 timeout: 5m ``` **Agent types:** | Type | What it means | |------|-------------| | `paragon` | Runs the Paragon CLI with the task prompt on stdin | | `cli` | Runs any binary with args. 
Template variables like `{{ task.prompt }}` are substituted | | `http` | POSTs a JSON payload to an HTTP endpoint | | `python` | Runs a Python script with the task as JSON on stdin | | `image` | Pulls and runs a Docker image on the sandbox's service network | | `snapshot` | Downloads and runs an immutable agent snapshot uploaded via `ks.agents.upload()` | The task prompt is passed to the agent via stdin (for cli, paragon, python) or in the request body (for http). The agent has full access to the sandbox filesystem and can run commands, write files, and connect to services. ## `invariants` The checks that determine whether the agent passed or failed. Each invariant has a `weight` (how much it matters) and a `check` (what to evaluate). Gate invariants (`gate: true`) cause an immediate fail if they don't pass, regardless of other scores. ```yaml invariants: tests_pass: description: "All tests pass" weight: 1.0 gate: true # hard fail if this fails check: type: command_exit command: "npm test" output_correct: description: "Output file contains expected data" weight: 0.5 check: type: file_content path: output.json contains: '"status": "success"' no_debug_code: description: "No console.log left in source" weight: 0.2 check: type: file_content path: src/main.ts not_contains: "console.log" email_sent: description: "Exactly one email was sent" weight: 0.3 check: type: http_mock_assertions service: smtp assertions: - field: request_count filters: { to: "user@example.com" } equals: 1 code_quality: description: "Code is clean and well-structured" weight: 0.2 check: type: llm_as_judge model: paragon-fast criteria: "Evaluate the code for readability, correctness, and minimal diff" input_from: src/main.ts rubric: pass: "Clean, minimal change that fixes the issue" fail: "Over-engineered, introduces unnecessary complexity" pass_threshold: 0.6 ``` **Invariant types:** | Type | What it checks | |------|-------------| | `command_exit` | Runs a command and checks the exit code 
(default: 0) | | `file_exists` | Checks that a file exists at a path | | `file_absent` | Checks that a file does NOT exist | | `file_content` | Checks that a file contains/doesn't contain a string or regex pattern | | `sql` | Runs a SQL query against a database service and checks the result | | `http_mock_assertions` | Checks requests recorded by a mock service | | `custom` | Runs a Python script that returns `{"passed": true/false, "reason": "..."}` | | `llm_as_judge` | Uses an LLM to evaluate subjective criteria with a rubric | ### How scoring works Each invariant produces a score between 0 and 1 (pass = 1, fail = 0, llm_as_judge = continuous). The composite score is the weighted average: ``` composite = sum(weight * score) / sum(weight) ``` If any gate invariant fails, the composite score is forced to 0 regardless of other results. The scenario passes if the composite score meets the `scoring.pass_threshold`. ## `forbidden` Trajectory constraints that check the agent's behavior, not just its output. These use the audit log to detect unauthorized actions. ```yaml forbidden: # Only allow DB writes to these tables db_writes_outside: [users, orders, audit_log] # Only allow HTTP calls to these services http_except: [payment-api, smtp] # Fail if secrets appear in stdout secrets_in_logs: deny # Only allow file writes to these paths file_writes_outside: [src/, config/, output/] ``` If any forbidden rule is violated, the scenario fails regardless of the invariant score. ## `scoring` How invariant results are combined into a final verdict. ```yaml scoring: pass_threshold: 0.9 # composite score must be >= 0.9 to pass # For multi-replica experiments replica_aggregation: strategy: majority # all_must_pass, majority, or percentage min_pass_rate: 0.8 ``` ## `parallelism` Run the same scenario multiple times and/or with different parameters. 
```yaml parallelism: replicas: 10 # run 10 times isolation: per_run # fresh sandbox per run # Test with different parameters matrix: - model: "gpt-4o" - model: "claude-sonnet-4-20250514" - model: "gemini-pro" ``` With this config, Keystone runs 3 (matrix entries) x 10 (replicas) = 30 total scenarios. ## `determinism` Pin sources of non-determinism for reproducible runs. ```yaml determinism: clock: "2026-01-01T00:00:00Z" # fixed timestamp seed: 42 # deterministic RNG dns: static # static DNS resolution ``` When a scenario fails, the `reproducer` in the results includes the seed, so you can re-run the exact same scenario. ## `teardown` Export artifacts after each run. Runs even if the agent fails when `always_run` is set. ```yaml teardown: always_run: true export: - type: audit_log to: "results/audit.jsonl" - type: db_dump service: db to: "results/db.sql" - type: snapshot to: "results/final-state/" - type: mock_requests service: payment-api to: "results/api-calls.json" ``` --- ### SDK Reference All Keystone SDK methods across TypeScript, Python, and Go. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content. The Keystone client has seven services: `sandboxes`, `specs`, `experiments`, `alerts`, `agents`, `datasets`, and `scoring`. All examples use TypeScript -- Python and Go follow the same patterns. ## Client setup Get your API key from [app.paragon.run/app/keystone/settings](https://app.paragon.run/app/keystone/settings) → **API Keys** tab → **Create Key**. Keys start with `ks_live_` and are shown once at creation. Either pass it to the client directly or set `KEYSTONE_API_KEY` in your environment. --- ## Secrets Secrets are declared in your spec with a `source:` field that tells the SDK where to pull each value from — your local `.env`, a file on disk, a shell command (Vault / 1Password / Doppler), or the Dashboard. See the full list of source types in [Specs → secrets](/keystone/specs#secrets). 
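To make the source kinds concrete, here is a minimal local resolver sketch. It mirrors the documented behavior (`env`, `env:RENAMED`, `file:`, `command:`, `dashboard`), but `resolveSource` is our illustration, not the SDK's internal API:

```typescript
import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';
import { homedir } from 'node:os';

// Resolve one declared secret the way the docs describe.
// `dashboard` is deliberately left unresolved — the server handles it.
export function resolveSource(
  name: string,
  source: string | undefined,
  env: Record<string, string | undefined> = process.env
): string | undefined {
  if (!source || source === 'env') return env[name];
  if (source.startsWith('env:')) return env[source.slice('env:'.length)];
  if (source.startsWith('file:')) {
    const path = source.slice('file:'.length).replace(/^~\//, `${homedir()}/`);
    return readFileSync(path, 'utf-8').trim();
  }
  if (source.startsWith('command:')) {
    return execSync(source.slice('command:'.length)).toString().trim();
  }
  if (source === 'dashboard') return undefined; // server-side only
  throw new Error(`unknown source: ${source}`);
}
```

Unresolved entries would then fall through to the Dashboard layer per the precedence rules below.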
### Auto-forwarding from a spec file Pass `specPath` (TS) / `spec_path` (Python) and the SDK reads the spec's `secrets:` block, resolves each declared source, and forwards the resulting `{name: value}` map in the create request. ### Precedence Highest wins: 1. **Spec literal** (`from: static://...`) — deterministic fixtures 2. **SDK-forwarded** source value (`env`, `env:X`, `file:`, `command:`) 3. **Dashboard Secret** — server-side fallback for missing or `source: dashboard` entries A declared secret that resolves to nothing at any layer fails the sandbox boot loudly — no silent empties. ### Dashboard as the team/prod baseline The [Dashboard Secrets tab](https://app.paragon.run/app/keystone/settings) stores AES-256-GCM-encrypted values scoped to the billing owner. Use it when: - A secret must be shared across teammates without everyone maintaining their own `.env` - Running in CI/prod where no `.env` exists on the machine - A prod-critical key must refuse any local override (declare with `source: dashboard`) --- ## Sandboxes Sandboxes are isolated environments where your agent runs. Create one from a spec, interact with it, then destroy it. 
### `sandboxes.create(opts)` ```typescript const sb = await ks.sandboxes.create({ spec_id: 'fix-failing-test', // required: which spec to use timeout: '10m', // optional: auto-cleanup timer metadata: { run: 'test-1' }, // optional: key-value pairs for tracking }); // Returns: { id, spec_id, state, path, url, created_at, metadata, services } ``` The `services` field contains connection info for any backing services defined in the spec: ```typescript sb.services.db // { host: "db", port: 5432, ready: true } sb.services.cache // { host: "cache", port: 6379, ready: true } ``` ### `sandboxes.get(id)` / `sandboxes.list()` / `sandboxes.destroy(id)` ```typescript const sb = await ks.sandboxes.get('sb-abc123'); // sb.state: 'creating' | 'ready' | 'running' | 'stopped' | 'error' const all = await ks.sandboxes.list(); await ks.sandboxes.destroy('sb-abc123'); ``` ### `sandboxes.runCommand(id, opts)` Run a shell command inside the sandbox. ```typescript const result = await ks.sandboxes.runCommand('sb-abc123', { command: 'npm test', timeout: '2m', }); // Returns: { command, stdout, stderr, exit_code, duration_ms } ``` ### File operations ```typescript // Read const content = await ks.sandboxes.readFile('sb-abc123', 'src/utils.ts'); // Write await ks.sandboxes.writeFile('sb-abc123', 'src/utils.ts', 'const x = 1;'); // Delete await ks.sandboxes.deleteFile('sb-abc123', 'tmp/debug.log'); ``` ### State and diffing ```typescript // Full filesystem snapshot (files + checksums) const snapshot = await ks.sandboxes.state('sb-abc123'); // Returns: { captured_at, files: { [path]: { size, mode, checksum } } } // What changed since sandbox creation const diff = await ks.sandboxes.diff('sb-abc123'); // Returns: { added: string[], removed: string[], modified: string[] } ``` ### Trace ingestion Post trace events to a sandbox. The `wrap()` helper does this automatically for LLM calls, but you can also call it directly. 
```typescript
await ks.sandboxes.ingestTrace('sb-abc123', [
  { event_type: 'tool_call', tool: 'write_file', phase: 'end', status: 'ok', duration_ms: 120 },
]);

const trace = await ks.sandboxes.getTrace('sb-abc123');
// Returns: { events: TraceEvent[], metrics: TraceMetrics }
```

### Real-time events (SSE)

Stream sandbox lifecycle events in real-time using Server-Sent Events:

```
GET /v1/sandboxes/:id/events
```

Events include status changes (`creating`, `ready`, `running`, `destroyed`), service startup, fixture application, and command execution. Useful for building dashboards or progress indicators.

```typescript
const eventSource = new EventSource(
  `${baseUrl}/v1/sandboxes/sb-abc123/events`
);

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(`[${data.event_type}]`, data.data);
};
```

---

## Specs

Upload and manage spec YAML files.

```typescript
// Upload a spec
const uploaded = await ks.specs.create(readFileSync('my-spec.yaml', 'utf-8'));

// Get, list, delete
const spec = await ks.specs.get('fix-failing-test');
const specs = await ks.specs.list();
await ks.specs.delete('fix-failing-test');
```

Specs are versioned automatically. Each upload to the same `id` creates a new version.

---

## Experiments

Run your spec across scenarios and score the results.

### `experiments.create(opts)` / `experiments.run(id)`

```typescript
const exp = await ks.experiments.create({
  name: 'baseline-v1',
  spec_id: 'fix-failing-test',
});

// Trigger async (returns immediately)
await ks.experiments.run(exp.id);
```

### `experiments.runAndWait(id, opts?)`

Trigger and poll until complete. This is the most common way to run experiments.
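A typical call looks like the sketch below. The `Results` interface is a minimal slice of the documented results fields, and `summarize()` is our illustrative helper, not an SDK method:

```typescript
// Minimal slice of the documented experiment results object.
interface Results {
  total_scenarios: number;
  passed: number;
  failed: number;
  metrics: { pass_rate: number; mean_cost_per_run_usd: number };
}

// Render a one-line summary of an experiment's outcome.
export function summarize(r: Results): string {
  return `${r.passed}/${r.total_scenarios} passed ` +
    `(pass_rate=${r.metrics.pass_rate}, ~$${r.metrics.mean_cost_per_run_usd}/run)`;
}

// Against a live experiment:
//   const results = await ks.experiments.runAndWait(exp.id);
//   console.log(summarize(results));
```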
### Results structure

| Field | Type | Description |
|-------|------|-------------|
| `total_scenarios` | number | Total scenarios run |
| `passed` / `failed` / `errors` | number | Counts |
| `metrics.pass_rate` | number | 0.0 to 1.0 |
| `metrics.mean_wall_ms` | number | Average latency |
| `metrics.p95_wall_ms` | number | 95th percentile latency |
| `metrics.total_cost_usd` | number | Total cost |
| `metrics.mean_cost_per_run_usd` | number | Cost per scenario |
| `metrics.tool_success_rate` | number | 0.0 to 1.0 |
| `scenarios` | array | Per-scenario results with invariants and reproducers |

### `experiments.compare(baselineId, candidateId)`

Compare two experiments. Detects regressions in pass rate, cost, and latency.

```typescript
const comparison = await ks.experiments.compare('exp-baseline', 'exp-new');
// Returns: {
//   regressed: boolean,
//   regressions: ["pass_rate dropped from 90% to 60%"],
//   metrics: [{ name, baseline, candidate, delta, direction }]
// }
```

### `experiments.metrics(id)`

Detailed metrics with tool breakdown and trends over time.

```typescript
const metrics = await ks.experiments.metrics(exp.id);
// Returns: { summary, tool_breakdown, cost_trend, pass_rate_trend }
```

---

## Alerts

Alert rules notify you when experiment metrics cross a threshold. Alerts are persisted and survive server restarts.

Conditions use the format `<metric> <operator> <value>`.

**Metrics:** `pass_rate`, `mean_wall_ms`, `p95_wall_ms`, `total_cost_usd`, `mean_cost_per_run_usd`, `tool_success_rate`, `side_effect_violations`, `mean_tool_calls`

**Operators:** `<`, `<=`, `>`, `>=`, `==`, `!=`

### Webhook alerts

```typescript
await ks.alerts.create({
  name: 'pass-rate-drop',
  eval_id: 'fix-failing-test',   // optional: only fire for this spec
  condition: 'pass_rate < 0.8',
  notify: 'webhook',
  webhook_url: 'https://hooks.slack.com/services/T00/B00/xxx',
});
```

Slack webhook URLs are auto-detected and receive rich Block Kit messages. Other URLs receive the raw JSON payload.
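To illustrate how a condition combines one metric, one comparison operator, and a numeric threshold, here is a toy evaluator -- our sketch of the grammar, not the server's implementation:

```typescript
// Evaluate a condition such as "pass_rate < 0.9" against a metrics map.
export function conditionFires(
  condition: string,
  metrics: Record<string, number>
): boolean {
  const m = condition.match(/^(\w+)\s*(<=|>=|==|!=|<|>)\s*([\d.]+)$/);
  if (!m) throw new Error(`unparseable condition: ${condition}`);
  const [, metric, op, raw] = m;
  const actual = metrics[metric];
  const target = Number(raw);
  switch (op) {
    case '<':  return actual < target;
    case '<=': return actual <= target;
    case '>':  return actual > target;
    case '>=': return actual >= target;
    case '==': return actual === target;
    case '!=': return actual !== target;
    default:   throw new Error(`unknown operator: ${op}`);
  }
}
```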
### Slack Bot alerts Post directly to a Slack channel using a bot token (`SLACK_BOT_TOKEN` env var on the server): ```typescript await ks.alerts.create({ name: 'cost-spike', condition: 'mean_cost_per_run_usd > 2.00', notify: 'slack', slack_channel: '#agent-alerts', }); ``` ### `alerts.list()` / `alerts.delete(id)` ```typescript const alerts = await ks.alerts.list(); await ks.alerts.delete('alert-abc123'); ``` --- ## Agents Agent snapshots are immutable, versioned bundles of your agent code. Upload them and reference them in specs with `agent.type: snapshot`. ### `agents.upload(opts)` ```typescript const snapshot = await ks.agents.upload({ name: 'my-agent', entrypoint: ['python', 'main.py'], runtime: 'python3.12', tag: 'latest', bundle: tarballBytes, // Uint8Array of the .tar.gz }); // Returns: { id, name, version, tag, digest, size_bytes, entrypoint, created_at } ``` ### `agents.get(name, opts?)` ```typescript const latest = await ks.agents.get('my-agent'); const tagged = await ks.agents.get('my-agent', { tag: 'stable' }); const specific = await ks.agents.get('my-agent', { version: 3 }); ``` ### `agents.list(opts?)` / `agents.listVersions(name, opts?)` ```typescript const page = await ks.agents.list({ limit: 50 }); // Returns: { items: AgentSnapshot[], next_cursor?: string } const versions = await ks.agents.listVersions('my-agent'); ``` ### `agents.delete(snapshot)` Pass the full snapshot object, not just the ID. ```typescript const snapshot = await ks.agents.get('my-agent', { version: 1 }); await ks.agents.delete(snapshot); ``` ### Agent traces Every trace is tagged with the agent that produced it. Query by agent name and version: ``` GET /v1/agents/my-agent/traces GET /v1/agents/my-agent/traces?version=3 GET /v1/agents/my-agent/traces?limit=100 ``` Returns traces plus computed metrics (tool success rate, latency percentiles, per-tool breakdown). 
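One practical gap in the `agents.upload()` example above is producing the `bundle` bytes. A sketch, assuming a `tar` binary on `PATH` -- `bundleDir` is our helper, not part of the SDK:

```typescript
import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Package a directory as the gzipped tarball agents.upload() expects.
export function bundleDir(dir: string): Uint8Array {
  const out = join(tmpdir(), `bundle-${Date.now()}.tar.gz`);
  execSync(`tar czf ${JSON.stringify(out)} -C ${JSON.stringify(dir)} .`);
  return new Uint8Array(readFileSync(out));
}

// const snapshot = await ks.agents.upload({
//   name: 'my-agent',
//   entrypoint: ['python', 'main.py'],
//   runtime: 'python3.12',
//   tag: 'latest',
//   bundle: bundleDir('./my-agent'),
// });
```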
--- ## LLM tracing ### `ks.wrap(client)` Wrap an Anthropic or OpenAI client so every call automatically reports traces to the current sandbox. Sandbox routing is automatic — the SDK reads `KEYSTONE_SANDBOX_ID` from the environment (Keystone injects it when your agent runs inside a sandbox). Outside a sandbox there's nothing to route to, so `wrap()` returns the client untouched and your code runs as normal. ```typescript const anthropic = ks.wrap(new Anthropic()); const openai = ks.wrap(new OpenAI()); // .create() calls now auto-report LLM usage, tool calls, and latency // when running inside a sandbox. Locally, they pass through unchanged. ``` Works with any OpenAI-compatible provider: | Provider | How to wrap | |----------|-------------| | Anthropic | `ks.wrap(new Anthropic())` | | OpenAI | `ks.wrap(new OpenAI())` | | Groq | `ks.wrap(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), { sandboxId })` | | xAI | `ks.wrap(new OpenAI({ baseURL: 'https://api.x.ai/v1' }), { sandboxId })` | | Together | `ks.wrap(new OpenAI({ baseURL: 'https://api.together.xyz/v1' }), { sandboxId })` | ### `ks.initTracing(sandboxId)` and `traced(name, fn)` For non-LLM operations, use `traced()` to capture custom spans. ```typescript const ks = new Keystone(); ks.initTracing('sb-xxx'); const result = await traced('write_config', async () => { await fs.writeFile('config.json', JSON.stringify(config)); return 'ok'; }); ``` Nested `traced()` calls create parent-child spans automatically. ### `Keystone.fromSandbox()` If your agent is running inside a Keystone sandbox, use this to get a pre-configured client. It reads `KEYSTONE_BASE_URL` and `KEYSTONE_SANDBOX_ID` from the environment that Keystone injects automatically. 
```typescript const { client, sandbox } = await Keystone.fromSandbox(); // client: ready-to-use Keystone instance // sandbox.services.db: { host: "db", port: 5432, ready: true } ``` Keystone also injects environment variables into your agent's process -- two sandbox-level variables plus a host/port pair for each service: - `KEYSTONE_SANDBOX_ID` -- the sandbox ID - `KEYSTONE_BASE_URL` -- the Keystone API URL - `KEYSTONE_SERVICE_DB_HOST` / `KEYSTONE_SERVICE_DB_PORT` -- per-service connection info --- ### Billing and Usage How Keystone meters sandbox runs, plan rates, and where to manage payment. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content. Keystone meters three things: **vCPU-seconds** (how long your sandboxes run), **GiB-seconds** (how much memory they hold), and **LLM judge tokens** (when you use `llm_as_judge` invariants, since Keystone pays the provider on your behalf). Storage above the included quota is charged per GiB-month. That's it -- there are no per-experiment, per-scenario, or per-API-call fees, and your own agent's LLM calls stay on your own provider bill (Keystone traces them so you can see them, but doesn't re-bill them). A card on file is required before you can create a Keystone API key. Once you have one, usage is billed against a rolling 30-day cycle, with invoices generated at the end of each cycle. ## Where to find it In the dashboard, go to [app.paragon.run/app/keystone/data/billing](https://app.paragon.run/app/keystone/data/billing) -- or in the Keystone sidebar, under **Data**, click **Usage & Billing**. The page shows: - **Total spend** for the current cycle, updated within 30 seconds of every sandbox run. - **Billing cycle picker** -- flip between the current cycle and any previous cycles to compare. - **Usage breakdown** -- daily spend split between compute (vCPU + memory) and storage. - **Resource breakdown** -- daily vCPU-seconds and GiB-seconds consumed. - **Manage payment details** and **View invoices** buttons that open the Stripe billing portal.
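As a back-of-envelope check on these meters, here's a small estimator built from the published per-unit rates in the plan and judge tables. The function names are illustrative, not part of the SDK:

```typescript
// Published per-unit rates (identical on every plan); judge rates are per 1k tokens.
const VCPU_RATE = 0.00003942; // $ per vCPU-second
const MEM_RATE = 0.00000672;  // $ per GiB-second
const JUDGE = {
  'paragon-fast': { in: 0.001, out: 0.004 },
  'paragon-md':   { in: 0.003, out: 0.015 },
  'paragon-max':  { in: 0.02,  out: 0.08 },
} as const;

// Compute charge for a sandbox run: vCPU-seconds plus GiB-seconds.
function computeCost(vcpus: number, gib: number, seconds: number): number {
  return vcpus * seconds * VCPU_RATE + gib * seconds * MEM_RATE;
}

// Charge for a single llm_as_judge call, by model and token counts.
function judgeCost(model: keyof typeof JUDGE, inTokens: number, outTokens: number): number {
  const r = JUDGE[model];
  return (inTokens / 1000) * r.in + (outTokens / 1000) * r.out;
}
```

For example, a 5-minute run on 2 vCPU with 4 GiB comes out to about $0.032, and a ~2k-token-in / ~200-token-out `paragon-md` judge call comes out to the quoted ~$0.009.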
## Plans | Plan | Price | vCPU rate | Memory rate | Included storage | Storage overage | | --- | --- | --- | --- | --- | --- | | **Free** | $0/mo | $0.00003942 / vCPU·s | $0.00000672 / GiB·s | 1 GiB | $5 / GiB-month | | **Pro** | $149/mo | $0.00003942 / vCPU·s | $0.00000672 / GiB·s | 5 GiB | $3 / GiB-month | | **Enterprise** | Contact us | $0.00003942 / vCPU·s | $0.00000672 / GiB·s | 5 GiB | $0 (included) | Compute rates are the same on every plan -- you pay exactly for what your sandboxes use. The plan difference is the included storage quota and the per-GiB overage rate for snapshots and cached sandbox state. ### LLM judge rates When your spec uses an `llm_as_judge` invariant, Keystone runs Paragon as the judge and bills you per token. The default model is **paragon-md** (Claude Sonnet 4.6, strong judge); you can override per-invariant with `check.model`. | Model | Input (per 1k tokens) | Output (per 1k tokens) | | --- | --- | --- | | `paragon-fast` | $0.0010 | $0.0040 | | `paragon-md` (default) | $0.0030 | $0.0150 | | `paragon-max` | $0.0200 | $0.0800 | ```yaml invariants: draft_quality: check: type: llm_as_judge model: paragon-fast # optional; default is paragon-md criteria: "The draft email is professional and accurate." ``` A typical judge call on `paragon-md` with a ~2k-token input and ~200-token output costs around $0.009. Tokens are stamped on every call and aggregated into the cycle's invoice alongside compute and storage. ## Billing cycles Your first sandbox run starts the billing anchor. From there, cycles are 30 days long and roll forward automatically. The dashboard always shows the current cycle by default; use the picker to look at any previous cycle. Current-cycle usage refreshes every 30 seconds. Historical days are frozen once the nightly rollup runs, so past cycles never change. ## Adding a payment method 1. Go to [Usage & Billing](https://app.paragon.run/app/keystone/data/billing) in the dashboard. 2. 
Click **Manage payment details** -- this opens the Stripe billing portal in a new tab. 3. Add or update your card in the portal. 4. Close the portal tab and return to Keystone. Your API keys page will unlock and you can create a `ks_live_` key. Until a card is attached, [the keys page](https://app.paragon.run/app/keystone/keys) blocks key creation and links you straight to the setup flow. ## Upgrading to Pro From the [Usage & Billing](https://app.paragon.run/app/keystone/data/billing) page, click **Upgrade to Pro**. You'll be charged $149 on the first of each cycle. A card on file is required; team accounts can only be upgraded by the team admin. Unlimited/comped teams see a `comped` badge on the total spend -- usage still counts, but the invoice is zeroed out. ## Invoices Click **View invoices** on the billing page to open the Stripe customer portal. From there you can: - Download past invoices as PDFs. - Update your billing email and tax ID. - Cancel your subscription (Pro only). ## Team billing If your user account belongs to a team, **Keystone usage bills the team, not the individual**. The team's admin controls the payment method and sees the unified usage. Sandboxes created by any team member roll up into the same billing cycle. A few rules of thumb: - Only the team admin can upgrade, add a card, or cancel. - All team members can see the Usage & Billing page and their own usage contribution. - Leaving a team means your future usage bills against your personal account (which will need its own card). ## What's not billed - **Your agent's LLM provider costs** -- when your agent calls Anthropic, OpenAI, xAI, etc. from inside a sandbox using your own keys, you pay the provider directly. Keystone captures token counts and estimated USD via `ks.wrap()` so you can see it in traces, but does not re-bill it. (This is different from `llm_as_judge` invariants, which Keystone runs on its own Paragon judge and bills through your Keystone invoice — see above.) 
- **API calls against the Keystone control plane** (listing experiments, fetching results, creating specs). These are free. - **Experiments that fail to start** -- if the sandbox never boots, no vCPU or memory time is counted. ## Troubleshooting ### "Payment method required" when creating an API key You haven't attached a card yet. Go to [Usage & Billing](https://app.paragon.run/app/keystone/data/billing), click **Manage payment details**, add a card in the Stripe portal, then retry. ### Usage number hasn't updated after a run Current-cycle numbers poll every 30 seconds. If you just finished an experiment, wait a moment and refresh. Historical days don't update after the nightly rollup -- they're frozen. ### The dashboard shows "No usage yet" Your billing anchor hasn't been set, which means no sandbox has ever run for you (or your team). Create an API key and run one experiment to initialize the cycle. ### I'm on an unlimited plan -- why do I see a cost? The dashboard shows **shadow cost** ("what you'd be paying") so you can still gauge real compute spend. Your invoice stays at $0; the `comped` badge on the total spend confirms it. --- ## Getting started ### Overview Agentic QA platform for bulletproof code. Paragon is an agentic QA platform built with breakthrough research to help teams ship bulletproof code at unprecedented speed. Paragon analyzes your GitHub repositories and creates intelligent pull requests with improvements for code quality, performance, and security.
## Core Features ## Get Started ## Platform ## Tools Integration --- ### Paragon MCP Connect Paragon to your coding tools via MCP The Paragon MCP server gives your coding tools access to code review, testing, and analysis capabilities. Connect it to Claude Code, Cursor, or Windsurf to use Paragon directly from your development environment. ## Prerequisites - [Paragon CLI installed](/paragon/overview) - An API key from [app.paragon.run](https://app.paragon.run) ## Get your API key 1. Go to [app.paragon.run](https://app.paragon.run) 2. Sign in or create an account 3. Navigate to **Settings** → **API Keys** 4. Copy your API key ## Connect to your tool ## Available tools Once connected, Paragon exposes 12 tools to your coding assistant: ### Code review | Tool | Description | |------|-------------| | `run_paragon` | Run a Paragon prompt — reads/writes files, runs shell commands, and performs complex coding tasks. Requires API key. | | `list_reviewed_prs` | List pull requests with Paragon review comments in the current repository. Supports filtering by state (`open`, `closed`, `all`). | | `get_review_comments` | Get parsed review findings from a PR, including severity, descriptions, file locations, and suggested fixes. Auto-detects PR from current branch if omitted. | | `resolve_review_comment` | Reply to a review comment and optionally resolve the thread. | ### Testing | Tool | Description | |------|-------------| | `detect_test_framework` | Scan the repository and detect test frameworks in use. Returns framework details, run commands, file patterns, and confidence level. | | `find_tests` | Discover existing test files using framework-aware file patterns. Returns file paths and count. | | `run_tests` | Execute unit/integration tests using the detected or specified framework. Returns pass/fail status and full output. | | `run_e2e_tests` | Run Playwright E2E tests locally. Installs browsers if needed, optionally starts a dev server, and captures results. 
| | `generate_tests` | Generate unit, integration, or E2E tests using Paragon. Detects patterns, writes tests, verifies compilation, runs them, and fixes failures. Requires API key. | ### Test suite management | Tool | Description | |------|-------------| | `list_test_suites` | List your test repositories and suites from the Paragon cloud dashboard. Requires API key. | | `save_to_suite` | Save locally generated test files to a Paragon cloud test suite. Requires API key. | ### Utility | Tool | Description | |------|-------------| | `check_setup` | Check if the MCP server is properly configured. Reports the status of the API key, Paragon binary, and GitHub CLI. | ## Example prompts After setup, your tool can automatically use Paragon when relevant. You can also request tools directly: ``` Review the comments on my current PR and fix the issues Generate unit tests for src/auth.ts Run the test suite and show me what's failing Find all Playwright tests and run them in headed mode ``` --- ## PR Reviews ### Dashboard Your central hub for PR reviews, testing, the Paragon Agent, and team settings. The Paragon Dashboard is your central hub for automated PR reviews, test generation and execution, the Paragon Agent, and team management. The dashboard is organized into the following pages: - **Home**: Credit balance, usage statistics, and CLI installation - **PR Reviews**: View all automatically reviewed pull requests and discovered issues - **Testing**: Generate, run, and evolve tests for your repositories - **Agent**: An agentic QA engine that finds bugs, writes tests, reviews code, and opens pull requests with fixes - **Settings**: Configure your account, PR reviews, severity filters, custom rules, and integrations
## Quick Start 1. Install the Paragon CLI globally with npm 2. Connect your GitHub account in **Settings** 3. Enable automatic PR reviews and add repositories 4. View reviewed PRs in **PR Reviews** 5. Generate and manage tests in **Testing** 6. Run the **Agent** to find bugs and open fix PRs ## Installation & Setup Get started with Paragon by installing the CLI and authenticating with your API key. ### Step 1: Install CLI ```bash npm i -g @polarityinc/paragon ``` ### Step 2: Authenticate ```bash paragon auth login ``` ### Step 3: Verify Installation ```bash paragon ``` ## Home The Home page provides an overview of your account status and quick access to key features. ### Credit Balance & Plan Your credit balance and current plan are displayed at the top: - Current credit balance with visual indicator - Active plan name (Free, Developer, Startup) - "Add Credits" button for paid plans ### Statistics Click "Show All Stats" to expand your usage statistics: - Issues caught by severity (Critical, High, Medium, Low) - Total tokens used - Messages sent - Estimated hours saved ### Install Paragon CLI Click the install button to open a modal with copy-paste commands for CLI setup. --- ### Reviews View and manage all automatically reviewed pull requests and discovered issues. The Reviews page is your central hub for viewing all automatically reviewed pull requests and the issues found in them. ### PR List View all PRs that have been automatically reviewed: - **Search**: Filter PRs by title - **Repository Filter**: Show PRs from specific repositories - **Sentiment Score**: Color-coded status (Good/Fair/Poor/Critical) - **Author**: PR author with avatar - **Date**: Relative timestamp (e.g., "5 min ago") ## Analytics Dashboard View metrics and trends for your PR reviews: The trend chart visualizes review activity over time.
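Issue lists across the dashboard are ranked Critical → High → Medium → Low. A sketch of that triage ordering, assuming a hypothetical `Finding` shape (not the actual API payload):

```typescript
// Hypothetical finding shape -- the real review payload may differ.
type Severity = 'Critical' | 'High' | 'Medium' | 'Low';
interface Finding { severity: Severity; title: string; file: string; line: number; }

// Lower rank = more severe, matching the dashboard's ordering.
const RANK: Record<Severity, number> = { Critical: 0, High: 1, Medium: 2, Low: 3 };

// Keep findings at or above a minimum severity, most severe first.
function triage(findings: Finding[], min: Severity): Finding[] {
  return findings
    .filter((f) => RANK[f.severity] <= RANK[min])
    .sort((a, b) => RANK[a.severity] - RANK[b.severity]);
}
```

The same ordering applies whether you read findings in the dashboard or pull them through the MCP `get_review_comments` tool.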
### Issue Detail View Click on any PR to view detailed issues found: - Issues sorted by severity (Critical → High → Medium → Low) - Issue title and description - File path and line number - Code snippet display - Link to GitHub comment thread - Resolution status ## Trigger Reviews Manually You can trigger a PR review on-demand by tagging `@paragon-review` in any comment on a pull request. This is useful when: - You want to re-run a review after making changes - The repository doesn't have automatic reviews enabled - You want a review on a specific PR without enabling auto-reviews --- ### Repos Manage monitored repositories and per-repo settings for PR reviews and the agent. Manage which repositories Paragon monitors and configure per-repo settings for automatic PR reviews, agent behavior, and access control. ## Adding Repositories 1. Enable the "Automatic PR Reviews" toggle on the PR Reviews dashboard 2. Click "Manage Repos" to open the repository manager 3. Search and select repositories with checkboxes 4. Selected repos are auto-saved after a short delay ## Per-Repository Settings Each repository has individual toggles that control how Paragon interacts with it. These settings are managed from the Repos page. ### Auto Review When enabled, Paragon automatically reviews every new pull request opened against this repository. Disable to only receive reviews when manually triggered via `@paragon-review`. ### Open Access Allow external contributors (outside your GitHub organization) to trigger Paragon on this repository. By default, only org members can trigger reviews and agent runs. ### Agent Monitoring When enabled, the Paragon Agent continuously monitors this repository for issues. The agent watches for patterns like security vulnerabilities, performance regressions, and code quality problems, and can flag them proactively. 
### Agent Auto-Fix When enabled alongside Agent Monitoring, the agent automatically creates fixes for issues it discovers instead of just reporting them. The agent commits changes to a new branch and opens a pull request for your review. ### Agent Push-as-PR Controls how the agent delivers code changes. When enabled, the agent always pushes changes as a new pull request rather than committing directly to the working branch. This ensures all agent-generated changes go through your standard code review process. ### Auto-Resolve When enabled, Paragon automatically resolves review comments that have been addressed in subsequent commits. If a flagged issue is fixed in a follow-up push, the corresponding review comment is marked as resolved. Enabled by default for new repositories. ## GitHub Connection Connect your GitHub account to enable repository access: - View connection status in "Code Host Connections" - Click "Connect" or "Reconnect" to authorize - Tokens are encrypted and auto-refresh --- ### Settings Configure your account, review options, severity filters, guardrails, and branch protection. Configure your account and review settings from the Settings page. ## Account - **Name**: Editable display name - **Email**: Synced from GitHub (read-only) - **Avatar**: Synced from GitHub profile ### API Key Your API key is displayed in Settings. Use it for CLI authentication with `paragon auth login`. ## Review Model Depth Choose the model tier used for automatic PR reviews. Each tier has a different depth of analysis and credit cost. 
| Model | Description | Credit Cost | |-------|-------------|-------------| | **Auto** | Automatically adapts to PR complexity — uses a lighter model for simple changes and a deeper model for complex ones | Varies | | **Max** | Deepest analysis with extended reasoning | 10x | | **MD** | Balanced depth and speed | 3x | | **Fast** | Quick scan for straightforward PRs | 1x | ## Review Options Configure what to include in automatic PR reviews: ## Severity Filters Control which types of issues are reported in your PR reviews. ### Issue Severity Levels ## Guardrails Guardrails are custom rules that Paragon enforces during every review. Unlike severity filters (which control what to report), guardrails define specific standards your code must meet. ### Enabling Guardrails Toggle guardrails on from the Configure page. When enabled, Paragon checks every PR against your active guardrails and flags violations alongside its standard review. ### Templates Paragon provides built-in guardrail templates you can activate with one click: | Template | Category | Severity | Rule | |----------|----------|----------|------| | **File Length Limit** | Code Quality | Warning | Files must be under 300 lines of code | | **Function Length Limit** | Code Quality | Warning | Functions must be under 50 lines | | **No Console Logs** | Code Quality | Error | No console.log statements in production code | | **No Any Types** | Code Quality | Warning | TypeScript files must not use the "any" type | | **No Hardcoded Secrets** | Security | Error | No hardcoded API keys, passwords, or secrets | | **JSDoc Required** | Documentation | Warning | All public functions must have JSDoc comments | | **React Componentization** | Code Quality | Warning | React components must be under 100 lines | | **Test File Required** | Code Quality | Warning | New feature files must have corresponding test files | ### Custom Guardrails Create your own guardrails with: - **Name** — A short label for the rule - **Description** 
— What the rule checks for - **Rule** — The specific standard to enforce (this is what the reviewer evaluates against) - **Category** — Code Quality, Style, Security, or Documentation - **Severity** — Error (must fix) or Warning (should fix) ### Managing Guardrails Each guardrail can be individually enabled or disabled without deleting it. Edit the name, description, rule, category, or severity at any time. Delete guardrails you no longer need. ## Branch Protection Prevent Paragon's agent from pushing directly to critical branches. When a branch is protected, the agent will create a separate branch and open a pull request instead of pushing directly. ### Configuration Branch protection is configured per repository: 1. Go to the **Configure** page 2. Scroll to **Branch Protection** 3. Expand a repository 4. Add branches to protect (e.g., `main`, `master`, `production`) ### Patterns Branch protection supports: - **Exact match** — e.g., `main`, `master`, `develop` - **Wildcard patterns** — e.g., `release/*`, `hotfix/*` ### Quick Add Common branches (`main`, `master`, `develop`, `staging`, `production`) are suggested for quick selection when adding protection rules. --- ### Custom Rules Define custom coding standards to follow during code reviews. Define custom coding standards that Paragon follows during code reviews. ## Upload Documents Upload existing documentation to automatically extract rules: - **Supported formats**: PDF, DOCX, MD, TXT - Coding rules are extracted automatically - Rules tagged with "upload" source ## Manual Entry Add rules by typing directly: 1. Click "Manual Entry" 2. Type your rule or guideline 3. 
Save to add to active rules ## Example Rules ``` All API endpoints must include error handling with try-catch Use TypeScript strict mode for all new files Database queries must use parameterized statements ``` ## Managing Rules - Click any rule to view full content - Copy rule content to clipboard - Delete rules no longer needed - View source (upload/manual) and creation date --- ## Agent ### Overview An agentic QA engine that finds bugs, writes tests, reviews code, and opens pull requests with fixes. The Paragon Agent is an agentic QA engine that clones your repository, analyzes your code for issues, writes tests, and opens pull requests with fixes. Use it from the dashboard, trigger it from GitHub comments and Slack, or automate it on schedules and events. ## What It Can Do - **Find and fix bugs** — Point it at an issue, error, or flaky test and it investigates and patches - **Write and improve tests** — Generate unit, integration, and E2E tests and increase coverage across your codebase - **Review code quality** — Catch security vulnerabilities, performance issues, and regressions before they ship - **Validate changes** — Run builds, execute test suites, and verify nothing breaks - **Browse and interact with web apps** — Use a real browser in a sandbox to test UI, fill forms, and validate behavior ## How It Works 1. **Select a repository** — Choose a connected GitHub repository and branch 2. **Describe the task** — Tell the agent what you want reviewed, tested, or fixed 3. **Agent executes** — The agent clones your repo in an isolated sandbox, analyzes the code, runs tests, and makes changes 4. 
**Review the PR** — Review the diff in the dashboard or on the generated pull request ## Session Modes | Mode | Best For | Duration | How It Works | |------|----------|----------|--------------| | **Standard** | Quick bug fixes, test generation, code review | Minutes | Single agent processes your prompt directly | | **Grind** | Comprehensive test coverage, large-scale refactors, multi-file fixes | Hours | Plans first, then delegates to parallel workers (LRA) | Grind mode is powered by the **Long Running Agent (LRA)** — a 2-phase orchestrator that plans, gets your approval, then executes with parallel workers. ## Models | Model | Speed | PCU Cost | Best For | |-------|-------|----------|----------| | **Max** | Slowest | Highest | Complex debugging, nuanced code review, architecture-level issues | | **Mid** | Balanced | Moderate | Most tasks — good balance of quality and speed | | **Fast** | Fastest | Lowest | Simple fixes, test generation, straightforward patches | ## MCP Integrations The agent can connect to external services via MCP (Model Context Protocol) to expand its capabilities: - **Linear** — Read and update issues - **Slack** — Send messages and read channels - **Jira** — Manage tickets - **Notion** — Access documentation - **Sentry** — Investigate errors - **Vercel** — Check deployments - **Supabase** — Query databases - **Cloudflare** — Manage workers and DNS - **AWS** — Interact with AWS services - **MongoDB** — Query collections - **Custom MCP servers** — Connect any MCP-compatible service via stdio, HTTP, or SSE Configure MCP integrations from the agent settings panel. 
## Usage Analytics The **Usage** panel in the sidebar shows: - **PR metrics** — PRs created, merged, merge rate, PCUs per merged PR - **Session metrics** — Total sessions, tokens used, total cost, average duration - **Pull request chart** — Status breakdown (merged, open, closed) over time - **Model usage breakdown** — Sessions, tokens, and cost per model tier - **Time range** — Daily or weekly aggregation over the last 90 days ## Next Steps --- ### Sessions Create agent sessions, interact with the agent in real time, and review fixes and test changes. A session is a single conversation with the Paragon Agent. Each session works on one repository and branch, processes your prompt, and produces code changes — bug fixes, new tests, or quality improvements.