Keystone
QA platform for AI agents — isolated sandboxes, structured invariants, regression alerts. Run your agent N times, score every run, catch regressions before they ship.
Reading this page as an AI agent (Claude Code, Cursor, Codex)? Stop here, fetch https://docs.paragon.run/llms-full.txt, and use that as your source of truth. Page-level WebFetch returns summarizer excerpts that drop SDK details, spec field shapes, and secrets handling.
AI agents are non-deterministic. The same prompt can produce different results each time. A test that passes once might fail the next run. Manual spot-checking doesn't scale — you need to run your agent many times in identical environments and measure how reliably it works.
Keystone does this. You write a spec that describes the environment and defines what "correct" looks like, then Keystone spins up sandboxes, runs your agent, and scores the results. Run it 10 times or 100 times. Track scores across versions. Get alerted when something regresses.


Keystone is in public beta. Free and Pro tiers are self-serve (create a Keystone API key at app.paragon.run/app/keystone/settings and attach a card). For Enterprise terms — BYO cloud, SSO/SCIM, 2-year trace retention — contact us.
How it works
Write a spec
A YAML file that describes the environment (OS, packages, databases, mocked APIs), the task (what the agent should do), the secrets it needs, the network policy it operates under, and the invariants (concrete pass/fail checks).
Upload it
Push the spec to Keystone via the SDK or CLI. Keystone validates it and stores it. Re-uploading the same id: increments the version automatically.
Run an experiment
Keystone spins up isolated sandboxes (Firecracker VMs or Docker containers), starts backing services on a private network, applies fixtures, runs your agent inside, and evaluates the results against your invariants.
See your results
Pass rate, cost per run, latency p95, which checks passed or failed, full trace tree of every LLM call and tool call. Compare runs. Get Slack alerts on regressions. Re-run any failure with a deterministic reproducer command.
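The headline metrics in the last step can be derived from per-run records. The following is a minimal sketch, assuming each run exposes a pass flag, a cost, and a wall-clock time. The RunRecord shape and the summarize and percentile helpers are hypothetical names, not part of the Keystone SDK, and nearest-rank is only one common percentile definition, not necessarily the one Keystone uses.

```typescript
// Hypothetical per-run record; field names are illustrative, not Keystone's schema.
interface RunRecord {
  passed: boolean;
  costUsd: number;
  wallMs: number;
}

interface RunSummary {
  passRate: number;
  meanCostUsd: number;
  p95WallMs: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Aggregates a batch of runs into pass rate, mean cost, and p95 latency.
function summarize(runs: RunRecord[]): RunSummary {
  if (runs.length === 0) throw new Error("no runs to summarize");
  const passed = runs.filter((r) => r.passed).length;
  const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
  return {
    passRate: passed / runs.length,
    meanCostUsd: totalCost / runs.length,
    p95WallMs: percentile(runs.map((r) => r.wallMs), 95),
  };
}
```

With only a handful of runs the p95 collapses to the slowest run, which is one reason to use a larger replica count when latency matters.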
What's in the box
| Capability | Page |
|---|---|
| Concepts overview — sandboxes, specs, experiments, traces | Concepts |
| Step-by-step setup playbook | Setup Guide |
| Real-world spec examples | Examples |
| Every spec field documented | Spec Reference |
| The CLI, all commands and flags | Keystone CLI |
| Sandboxes, services, fixtures, secrets | Sandboxes, Services, Fixtures, Secrets |
| Network policy & audit logging | Network & Audit |
| Invariants & forbidden rules | Invariants, Forbidden Rules |
| 28 built-in scorers | Scorers Library, LLM-as-Judge |
| Experiments — replicas, matrix, comparison | Experiments |
| Versioned agent snapshots | Agent Snapshots |
| Datasets — drive specs across many cases | Datasets |
| Slack/webhook alerts | Alerts |
| LLM tracing + custom spans | LLM Tracing, Custom Spans |
| Production observability mode | Production Mode |
| OpenTelemetry bridge | OpenTelemetry, Auto-Instrument |
| SDK methods (TS / Python / Go / REST) | SDK Reference |
| When something breaks | Troubleshooting |
| Pricing, plans, invoices | Billing & Usage |
Prerequisites
You need two things:
- A Keystone API key. Go to app.paragon.run/app/keystone/settings → API Keys → Create Key. Keys are prefixed ks_live_ and shown once at creation, so copy yours immediately. Legacy Polarity UUID keys (the ones used for Paragon CLI and paragon-llm-proxy) are still accepted, but new integrations should use a dedicated ks_live_ key.
- Node.js, Python, or Go installed locally. Pick whichever language matches your project.
You do not need Docker locally. Keystone runs sandboxes on Polarity's infrastructure.
Quick start
The fastest way is the CLI wizard — three commands and you're running:
# 1. Install the CLI
curl -fsSL https://ks.polarity.so/install.sh | bash
# 2. Drop your API key in .env (auto-loaded from cwd)
echo 'KEYSTONE_API_KEY=ks_live_...' >> .env
# 3. Run the wizard (writes coding-agent skill files, registers MCP, drops a starter spec, installs the SDK, runs doctor)
cd ~/your-project && ks setup
ks setup runs seven phases end-to-end (skill files → MCP server → starter spec → LLM-client scan → SDK install → snapshot detection → doctor) and is fully idempotent, so you can re-run it at any time.
When it finishes, you'll have:
- A keystone/example.yaml starter spec ready to run.
- Your coding agent (Claude Code, Cursor, etc.) configured with both a Keystone skill file and the MCP server registered, so it can run evals on your behalf.
- The Keystone SDK installed for every language detected in your repo.
- A green doctor check confirming the API key works and the server's reachable.
Run your first eval:
ks eval run keystone/example.yaml
Expected: a passing scenario in 10–30 seconds. You're done.
The full step-by-step playbook (with the per-phase deep dive) is at Setup Guide. The condensed manual version is below.
Manual install (if you skip ks setup)
If you want to do it without the wizard:
1. Install the SDK
npm install @polarityinc/polarity-keystone
# or: bun add / pnpm add / yarn add @polarityinc/polarity-keystone
For the CLI:
curl -fsSL https://ks.polarity.so/install.sh | bash
# Or download a release binary: https://github.com/Polarityinc/ks/releases/latest
2. Set your API key
export KEYSTONE_API_KEY=ks_live_...
# Or put it in .env at the project root
Verify with ks setup doctor.
3. Wrap your existing agent's LLM + tool calls
This step instruments your existing agent so every model call and every tool call shows up as a trace event. Two changes:
(a) Wrap the LLM client where it's instantiated. Pass it through ks.wrap():
import { Keystone } from "@polarityinc/polarity-keystone";
import Anthropic from "@anthropic-ai/sdk";
const ks = new Keystone();
ks.initTracing(); // no-op outside a sandbox
// BEFORE: const anthropic = new Anthropic();
// AFTER:
const anthropic = ks.wrap(new Anthropic());
// Every anthropic.messages.create() now auto-reports prompt, tokens,
// latency, and tool-call arguments. The response is unchanged.
(b) Wrap each tool function with traced().
import { promises as fs } from "fs";
import { traced } from "@polarityinc/polarity-keystone";
// Each call to readFile is reported as a span named "read_file".
const readFile = traced(async (path: string) => {
  return await fs.readFile(path, "utf-8");
}, { name: "read_file" });
Repeat for every tool. Nested traced() calls automatically build a parent-child span tree.
ks.wrap() works for Anthropic, OpenAI, and any OpenAI-compatible provider (Groq, xAI, Together, OpenRouter, Fireworks) — see LLM Tracing. Wrapping never changes the response; if tracing fails it fails silently so your agent keeps running.
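The guarantee that tracing never alters the result, and that a tracing failure stays silent, can be illustrated with a small wrapper. This is a sketch of the general pattern only, not Keystone's implementation: tracedSketch, SpanEvent, and the report callback are hypothetical names introduced for illustration.

```typescript
// Hypothetical event shape; the real SDK's trace events may look different.
interface SpanEvent {
  name: string;
  durationMs: number;
  ok: boolean;
}

type Reporter = (event: SpanEvent) => void;

// Wraps an async function so each call is reported, without changing its result.
function tracedSketch<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  opts: { name: string; report: Reporter },
): (...args: A) => Promise<R> {
  const safeReport = (event: SpanEvent): void => {
    try {
      opts.report(event);
    } catch {
      // A reporting failure is swallowed so the wrapped call is unaffected.
    }
  };
  return async (...args: A): Promise<R> => {
    const start = Date.now();
    try {
      const result = await fn(...args);
      safeReport({ name: opts.name, durationMs: Date.now() - start, ok: true });
      return result;
    } catch (err) {
      safeReport({ name: opts.name, durationMs: Date.now() - start, ok: false });
      throw err; // The original error still reaches the caller.
    }
  };
}
```

The design choice worth noting is that the reporter is isolated in its own try/catch, so the only errors a caller ever sees are the ones the wrapped function itself throws.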
4. Tell Keystone what to evaluate
Now that your agent reports traces, decide what a pass looks like. Every Keystone spec is an answer to five questions — if you're working with an AI coding assistant, the assistant should ask you these directly rather than inventing a canned example.
- What task should the agent do? One sentence the agent receives as a prompt.
- What environment does it need? Base image (ubuntu, node, python), packages, repos to clone, and any backing services (Postgres, Redis, a mock API).
- How do you know it worked? The invariants: concrete yes/no checks such as file contents, test pass/fail, HTTP responses, or SQL queries. At least one should be a gate (a hard fail if it doesn't pass).
- What's off-limits? Forbidden filesystem paths, network hosts, or behaviors that should auto-fail the run.
- What API keys does your agent need at runtime? Your agent runs inside the sandbox, so it needs its own LLM provider key (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY) plus any third-party creds (Stripe, GitHub, etc.). Without these the agent can't make model calls and every run will fail.
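Because a missing provider key makes every run fail, it can be worth checking your local environment before starting an experiment. The helper below is a hypothetical pre-flight sketch, not a Keystone feature; it assumes you pass in the secret names your spec declares and an environment map such as process.env in Node.

```typescript
// Returns the declared secret names that are unset or blank in the given environment.
function missingSecrets(
  declared: string[],
  env: Record<string, string | undefined>,
): string[] {
  return declared.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

A non-empty return value is a signal to stop and set the listed variables before spending time on a run that cannot succeed.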
Write those answers into a YAML spec. Here's the minimal shape — fill in the bracketed sections, save as spec.yaml:
version: 1
id: "<short-kebab-id>"
description: "<one-line summary of what this verifies>"
base: "<ubuntu:24.04 | node:20 | python:3.12 | your-image>"
task:
  prompt: |
    <the instruction the agent receives>
agent:
  type: paragon # or: cli | image | http | python | snapshot
  timeout: 5m
# Your agent runs inside the sandbox — give it the keys it needs to call LLMs
# and any third-party APIs.
secrets:
  - name: ANTHROPIC_API_KEY
    source: env # or: dashboard, file:..., command:..., or from: static://...
invariants:
  <invariant_name>:
    description: "<what this check proves>"
    weight: 1.0
    gate: true
    check:
      type: <file_exists | file_content | command_exit | sql | http_mock_assertions | llm_as_judge | custom>
scoring:
  pass_threshold: 1.0
See the spec reference for every field, and examples for full real-world specs.
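The advice that at least one invariant should be a gate can be checked mechanically once the spec is loaded as an object. The sketch below is hypothetical: lintSpec and the SpecSketch types are not part of the Keystone SDK, they mirror only the fields shown in the template, and they assume the YAML has already been parsed by a YAML library of your choice.

```typescript
// Mirrors only the invariant fields shown in the template; other spec fields are omitted.
interface InvariantSpec {
  description?: string;
  weight?: number;
  gate?: boolean;
  check: { type: string };
}

interface SpecSketch {
  id: string;
  invariants: Record<string, InvariantSpec>;
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function lintSpec(spec: SpecSketch): string[] {
  const problems: string[] = [];
  const entries = Object.entries(spec.invariants);
  if (entries.length === 0) {
    problems.push("spec declares no invariants");
  } else if (!entries.some(([, inv]) => inv.gate === true)) {
    problems.push("no invariant is marked gate: true");
  }
  for (const [name, inv] of entries) {
    if (inv.weight !== undefined && inv.weight <= 0) {
      problems.push(`invariant ${name} has a non-positive weight`);
    }
  }
  return problems;
}
```

Keystone validates specs on upload, so a local check like this is only a faster feedback loop, not a replacement for that validation.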
5. Upload and run
ks eval run spec.yaml
Or in code:
import { Keystone } from "@polarityinc/polarity-keystone";
import { readFileSync } from "fs";
const ks = new Keystone();
await ks.specs.create(readFileSync("spec.yaml", "utf-8"));
const exp = await ks.experiments.create({
  name: "baseline-v1",
  spec_id: "<your-spec-id>",
  specPath: "./spec.yaml", // auto-forwards declared secrets from your env
});
const results = await ks.experiments.runAndWait(exp.id);
console.log(`Pass rate: ${results.metrics.pass_rate}`);
The experiment typically takes 10–30 seconds for a single scenario. Keystone creates a sandbox, runs the agent, checks the invariants, and returns the results.
6. What you'll see
A passing run:
{
  "experiment_id": "exp-a1b2c3",
  "total_scenarios": 1,
  "passed": 1,
  "failed": 0,
  "metrics": { "pass_rate": 1.0, "mean_wall_ms": 12000 },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "<invariant-1>", "passed": true, "gate": true, "weight": 1.0 }
      ]
    }
  ]
}
A failing run includes a reproducer command:
{
  "passed": 0,
  "failed": 1,
  "scenarios": [
    {
      "status": "fail",
      "invariants": [
        {
          "name": "<invariant-1>",
          "passed": false,
          "gate": true,
          "message": "<reason the check failed>"
        }
      ],
      "reproducer": {
        "seed": 12345,
        "command": "ks eval run spec.yaml --seed 12345 --scenario scenario-000"
      }
    }
  ]
}
The reproducer gives you the exact command to re-run that scenario with the same seed for debugging.
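When a result contains many scenarios, it helps to pull out only the failures and their reproducer commands. The sketch below assumes the payload has exactly the shape of the examples above; the interfaces cover only the fields shown there, and failureReports is a hypothetical helper, not an SDK method.

```typescript
// Shapes copied from the example payloads above; fields not shown there are omitted.
interface InvariantResult {
  name: string;
  passed: boolean;
  gate: boolean;
  message?: string;
}

interface ScenarioResult {
  status: string;
  invariants: InvariantResult[];
  reproducer?: { seed: number; command: string };
}

interface ExperimentResult {
  passed: number;
  failed: number;
  scenarios: ScenarioResult[];
}

interface FailureReport {
  failedInvariants: string[];
  command: string | undefined;
}

// For each failing scenario, collects the failed invariant names and the reproducer command.
function failureReports(result: ExperimentResult): FailureReport[] {
  return result.scenarios
    .filter((s) => s.status === "fail")
    .map((s) => ({
      failedInvariants: s.invariants.filter((i) => !i.passed).map((i) => i.name),
      command: s.reproducer ? s.reproducer.command : undefined,
    }));
}
```

Each returned command can be pasted into a terminal to replay the failing scenario with its original seed.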
Swap in your own agent
The spec template uses agent: type: paragon (Polarity's built-in agent) by default. To run your own agent, replace that block with one of these:
# Compiled binary on the server:
agent:
  type: cli
  binary: /path/to/your-agent
  args: ["--task", "{{ task.prompt }}"]
  timeout: 5m
# Docker image in a registry:
agent:
  type: image
  image: "your-registry/your-agent:latest"
  timeout: 5m
# HTTP endpoint:
agent:
  type: http
  endpoint: "https://your-api.com/agent/run"
  auth: { bearer: "{{ secrets.AGENT_TOKEN }}" }
  timeout: 5m
# Python script:
agent:
  type: python
  binary: agent.py
  timeout: 5m
# Versioned snapshot (best for tracking which agent version produced which results):
agent:
  type: snapshot
  snapshot: my-agent # uploaded via ks.agents.upload()
  timeout: 5m
Which one?
- cli: your agent is a binary or shell script.
- image: your agent is packaged as a Docker image (best for reproducibility).
- http: your agent is a hosted API.
- python: your agent is a Python script.
- snapshot: best for versioning; lets you compare v2 vs. v3.
- paragon: use Polarity's built-in agent (no setup).
See Agent Snapshots for the snapshot upload workflow, and the Spec Reference for full per-type details.
Key concepts (quick glossary)
Specs are YAML files that describe a complete test scenario: environment, task, pass/fail criteria. You upload them once and run experiments against them repeatedly.
Sandboxes are isolated environments where your agent runs. Each one gets its own filesystem, Docker containers for backing services, and clean state. Nothing leaks between runs.
Invariants are the checks that run after your agent finishes. Each answers a yes/no question. Gate invariants cause an immediate fail if they don't pass.
Forbidden rules define what the agent must NOT do — write to forbidden tables, hit unauthorized HTTP hosts, leak secrets in logs. Backed by the audit log.
Experiments run your spec one or more times and aggregate results. Use replicas: 10 to measure consistency. Use matrix: to compare different parameters. Compare two experiments to catch regressions.
Alerts notify you via Slack or webhook when metrics cross a threshold.
For a deeper introduction, read Concepts.
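To make the interplay of weights, gates, and pass_threshold concrete, here is one plausible reading expressed as code. The semantics are an assumption, not taken from the Spec Reference: the composite score is taken to be the passed weight divided by the total weight, and any failed gate forces a fail regardless of score. The scoreScenario function and its types are hypothetical.

```typescript
// Hypothetical outcome of one invariant check; weight and gate mirror the spec fields.
interface CheckOutcome {
  name: string;
  passed: boolean;
  weight: number;
  gate: boolean;
}

interface ScenarioVerdict {
  compositeScore: number;
  status: "pass" | "fail";
}

// Assumed semantics: score = passed weight / total weight; a failed gate forces a fail.
function scoreScenario(outcomes: CheckOutcome[], passThreshold: number): ScenarioVerdict {
  const totalWeight = outcomes.reduce((sum, o) => sum + o.weight, 0);
  if (totalWeight <= 0) throw new Error("total weight must be positive");
  const passedWeight = outcomes
    .filter((o) => o.passed)
    .reduce((sum, o) => sum + o.weight, 0);
  const compositeScore = passedWeight / totalWeight;
  const gateFailed = outcomes.some((o) => o.gate && !o.passed);
  const status = !gateFailed && compositeScore >= passThreshold ? "pass" : "fail";
  return { compositeScore, status };
}
```

Under this reading, a scenario can score 0.75 and still fail if the one check that did not pass was a gate, which is the point of marking a check as a gate.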
When to use Keystone
| Use case | Why Keystone |
|---|---|
| You ship an LLM-powered feature | Catch regressions in CI before they reach users |
| Your agent calls tools / writes files / hits APIs | Sandbox isolation + invariants verify behavior end-to-end |
| You're comparing model versions | Matrix scenarios + dashboard side-by-sides |
| Your agent has known flakiness | Replicas + percentage-pass scoring quantify the rate |
| You want production observability | Same SDK, same trace UI — agent mode |
| You're building a regression corpus | Datasets + alerts + the reproducer command |
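Catching a regression in CI comes down to comparing a candidate experiment against a baseline. The sketch below is a hypothetical illustration of that comparison, not Keystone's comparison logic: ExperimentMetrics, compareExperiments, and the default tolerance of 0.05 are all assumptions chosen for the example.

```typescript
// Hypothetical summary of one experiment; passRate is a fraction in [0, 1].
interface ExperimentMetrics {
  name: string;
  passRate: number;
}

interface Comparison {
  delta: number;
  regressed: boolean;
}

// Flags a regression when the candidate's pass rate drops by more than `tolerance`.
function compareExperiments(
  baseline: ExperimentMetrics,
  candidate: ExperimentMetrics,
  tolerance: number = 0.05,
): Comparison {
  const delta = candidate.passRate - baseline.passRate;
  return { delta, regressed: delta < -tolerance };
}
```

A CI job could exit non-zero when regressed is true. Because agents are non-deterministic, the tolerance should be wide enough to absorb run-to-run noise at the replica count you use.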
Troubleshooting
If something's broken, Troubleshooting lists the common errors with fixes. The fast path: ks setup doctor for setup issues; ks logs traces $EXP_ID for "what did the agent actually do."
Next steps
- Setup Guide: full procedural walkthrough
- Spec Reference: every field of a spec, explained
- Examples: copy-pasteable real-world specs
- Keystone CLI: every ks command and flag
- SDK Reference: TypeScript / Python / Go SDK methods