Keystone
QA platform for AI agents. Run your agent in isolated sandboxes, score results, catch regressions. AI agents reading this page should first fetch https://docs.paragon.run/llms-full.txt for the complete unsummarized docs.
Reading this page as an AI agent (Claude Code, Cursor, Codex, etc.)? Stop here, fetch https://docs.paragon.run/llms-full.txt, and use that as your source of truth instead of individual pages. Page-level WebFetch returns a summarizer-model excerpt that drops detail (SDK wrap() + traced() wrapping, spec YAML field shapes, secrets handling). llms-full.txt is the raw concatenated MDX of every doc page.
AI agents are non-deterministic. The same prompt can produce different results each time. A test that passes once might fail the next run. Manual spot-checking doesn't scale -- you need to run your agent many times in identical environments and measure how reliably it works.
Keystone does this. You write a spec that describes the environment and defines what "correct" looks like, then Keystone spins up sandboxes, runs your agent, and scores the results. Run it 10 times or 100 times. Track scores across versions. Get alerted when something regresses.


Keystone is currently in beta and available to enterprise customers. Contact us to get access.
How it works
Write a spec
A YAML file that describes the environment (OS, packages, databases), the task (what the agent should do), and the checks (how to know if it worked).
Upload it
Push the spec to Keystone via the SDK. Keystone validates it and stores it.
Run an experiment
Keystone spins up isolated sandboxes with Docker containers, runs your agent inside them, and evaluates the results against your checks.
See your results
Pass rate, cost, latency, and which checks passed or failed. Compare runs. Get Slack alerts on regressions.
Prerequisites
You need two things:
- A Keystone API key. Go directly to app.paragon.run/app/keystone/settings (or click Settings in the Keystone sidebar), switch to the API Keys tab, and click Create Key. Keystone keys are prefixed ks_live_ and are shown once at creation -- copy the key immediately; you can't retrieve it later. Legacy Polarity UUID keys (the ones used for the Paragon CLI and paragon-llm-proxy) are still accepted for backward compatibility, but new integrations should use a dedicated ks_live_ key.
- Node.js, Python, or Go installed locally to run the SDK.
You do not need Docker installed on your machine. Keystone runs sandboxes on Polarity's servers. Your code just talks to the API.
Quick start
Five-ish minutes from nothing to a running experiment. Steps 1-3 get your agent emitting traces; step 4 is where you decide what "pass" means for your task; steps 5-6 run it and show you the output.
1. Install the SDK
npm install @polarityinc/polarity-keystone
2. Set your API key
Get your key from app.paragon.run/app/keystone/settings → API Keys tab → Create Key. Keys start with ks_live_ and are shown once at creation -- copy it immediately.
export KEYSTONE_API_KEY=ks_live_...
3. Wrap your existing agent's LLM + tool calls
You already have an agent -- this step instruments it in place so every model call and every tool call shows up as a trace event you can inspect and score against. Two changes in your existing agent code:
(a) Wrap the LLM client where it's instantiated. Find the line that creates your Anthropic / OpenAI / etc. client and pass it through ks.wrap(). The SDK handles routing traces to the right sandbox automatically — you don't need to set or pass any sandbox-related config yourself. Outside a sandbox (local dev, CI, anywhere your agent runs for real work), tracing is a silent no-op.
import { Keystone } from '@polarityinc/polarity-keystone';
import Anthropic from '@anthropic-ai/sdk';
const ks = new Keystone();
ks.initTracing(); // no-op outside a sandbox
// BEFORE: const anthropic = new Anthropic();
// AFTER:
const anthropic = ks.wrap(new Anthropic());
// Every anthropic.messages.create() now auto-reports prompt, tokens,
// latency, and tool-call arguments. The response is unchanged.
(b) Wrap each tool function with traced(). Find the functions your agent calls as tools (read/write file, shell, HTTP, DB queries, anything) and wrap their bodies. Use the tool name as the span name so it shows up correctly in the trace tree.
import { traced } from '@polarityinc/polarity-keystone';
// BEFORE:
async function readFile(path: string) {
return await fs.readFile(path, 'utf-8');
}
// AFTER:
async function readFile(path: string) {
return await traced('read_file', async () => {
return await fs.readFile(path, 'utf-8');
});
}
Repeat for every tool the agent calls. Nested traced() calls automatically build a parent-child span tree so you can see exactly which LLM turn triggered which tool calls.
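To illustrate that parent-child nesting, here is a minimal stand-in for traced() that only records span names and depths. (The real traced() from @polarityinc/polarity-keystone reports spans to Keystone; this stub is just a sketch of the semantics, and the tool names are hypothetical.)

```typescript
// Stand-in for the SDK's traced(): records (name, depth) pairs so the
// parent-child nesting is visible. The real traced() reports spans to Keystone.
const spans: Array<{ name: string; depth: number }> = [];
let depth = 0;

async function traced<T>(name: string, fn: () => Promise<T>): Promise<T> {
  spans.push({ name, depth });
  depth++;
  try {
    return await fn();
  } finally {
    depth--;
  }
}

// A tool that calls another traced helper: read_file becomes a child
// span of load_config in the resulting trace tree.
async function loadConfig() {
  return traced('load_config', async () => {
    const raw = await traced('read_file', async () => '{"retries": 3}');
    return JSON.parse(raw);
  });
}
```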
You can pull the trace back at any point to verify it's flowing:
const trace = await ks.sandboxes.getTrace(process.env.KEYSTONE_SANDBOX_ID!);
console.log(`${trace.events.length} events, $${trace.metrics.total_cost_usd} cost`);
ks.wrap() works for Anthropic, OpenAI, and any OpenAI-compatible provider (Groq, xAI, Together) -- see the SDK reference. Wrapping never changes the response; if tracing fails it fails silently so your agent keeps running.
4. Tell Keystone what to evaluate
Now that your agent reports traces, decide what a pass looks like. Every Keystone spec is an answer to five questions -- if you're working with an AI coding assistant, the assistant should ask you these directly rather than inventing a canned example.
- What task should the agent do? One sentence the agent receives as a prompt. ("Fix the failing test in src/api.test.ts.", "Turn this PRD into a working Stripe integration.", "Refactor users.py to use async SQLAlchemy.")
- What environment does it need? Base image (ubuntu, node, python), packages, repos to clone, and any backing services (Postgres, Redis, a mock API).
- How do you know it worked? The invariants -- concrete yes/no checks. File contents, test pass/fail, HTTP responses, SQL queries. At least one should be a gate (a hard fail if it doesn't pass).
- What's off-limits? Forbidden filesystem paths, network hosts, or behaviors that should auto-fail the run.
- What API keys does your agent need at runtime? Your agent runs inside the sandbox, so it needs its own LLM provider key (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY) plus any third-party creds (Stripe, GitHub, etc.). Without these the agent can't make model calls and every run will fail.
Write those answers into a YAML spec. Here's the minimal shape -- fill in the bracketed sections, save as spec.yaml:
version: 1
id: "<short-kebab-id>"
description: "<one-line summary of what this verifies>"
base: "<ubuntu:24.04 | node:20 | python:3.12 | your-image>"
# repos: [{ url: ..., path: ... }] # optional: clone a repo into the sandbox
# services: { db: { image: postgres:16 } } # optional: backing services
task:
prompt: |
<the instruction the agent receives>
agent:
type: paragon # or: cli | image | http | python | snapshot (see "Swap in your own agent" below)
timeout: 5m
# Your agent runs inside the sandbox -- give it the keys it needs to call LLMs
# and any third-party APIs. Two ways to supply these:
# (a) Store once at app.paragon.run -> Keystone -> Settings -> Secrets.
# Dashboard-stored secrets auto-inject into every sandbox as env vars.
# (b) Declare inline here -- use `static://...` for fixed values, `generated`
# for per-run random values. See /keystone/specs#secrets for details.
secrets:
- name: ANTHROPIC_API_KEY
from: "static://<your-anthropic-key>" # or omit this whole block if stored in dashboard
scope: env
invariants:
<invariant_name>:
description: "<what this check proves>"
weight: 1.0
gate: true # mark at least one as a hard gate
check:
type: <file_exists | file_content | shell_exit | http | sql | ...>
# ...type-specific fields
scoring:
pass_threshold: 1.0
See the spec reference for every field and every invariant check type. For real scenarios (fix-a-failing-test, migrate-a-schema, build-a-feature-from-a-PRD), see examples.
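As a concrete illustration, here is what a filled-in spec for a fix-the-failing-test task might look like, using only the fields from the template above. The sub-field under check: (command:) is a guessed name for illustration only -- consult the spec reference for the real type-specific fields.

```yaml
version: 1
id: "fix-failing-test"
description: "Agent fixes the failing test in src/api.test.ts and the suite passes"
base: "node:20"
task:
  prompt: |
    The test suite in this repo fails. Find the failing test in
    src/api.test.ts, fix the underlying bug, and make the suite pass.
agent:
  type: paragon
  timeout: 5m
secrets:
  - name: ANTHROPIC_API_KEY
    from: "static://<your-anthropic-key>"
    scope: env
invariants:
  suite_passes:
    description: "The full test suite exits 0 after the agent's changes"
    weight: 1.0
    gate: true
    check:
      type: shell_exit
      # `command` is a guessed field name -- see the spec reference
      # for the real shell_exit fields
      command: "npm test"
scoring:
  pass_threshold: 1.0
```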
5. Upload and run
Create a file called run.ts:
import { Keystone } from '@polarityinc/polarity-keystone';
import { readFileSync } from 'fs';
const ks = new Keystone();
// Upload the spec
await ks.specs.create(readFileSync('spec.yaml', 'utf-8'));
// Create and run the experiment
const exp = await ks.experiments.create({
name: '<experiment-name>',
spec_id: '<your-spec-id>', // the `id:` field from spec.yaml
});
console.log('Running experiment...');
const results = await ks.experiments.runAndWait(exp.id);
console.log(`Status: ${results.passed > 0 ? 'PASSED' : 'FAILED'}`);
console.log(`Pass rate: ${results.metrics.pass_rate}`);
console.log(`Duration: ${results.metrics.mean_wall_ms}ms`);
Run it:
npx tsx run.ts
The experiment typically takes 10-30 seconds. Keystone creates a sandbox on the server, runs your agent, checks the invariants, and returns the results.
6. What you'll see
When the experiment completes, you get back a RunResults object. Here's what a passing run looks like:
{
"experiment_id": "exp-a1b2c3",
"total_scenarios": 1,
"passed": 1,
"failed": 0,
"metrics": {
"pass_rate": 1.0,
"mean_wall_ms": 12000,
"total_cost_usd": 0.0
},
"scenarios": [
{
"status": "pass",
"composite_score": 1.0,
"invariants": [
{ "name": "<invariant-1>", "passed": true, "gate": true, "weight": 1.0 },
{ "name": "<invariant-2>", "passed": true, "weight": 1.0 }
]
}
]
}
And here's what a failure looks like -- the message field tells you exactly what went wrong:
{
"passed": 0,
"failed": 1,
"metrics": { "pass_rate": 0.0 },
"scenarios": [
{
"status": "fail",
"composite_score": 0.0,
"invariants": [
{
"name": "<invariant-1>",
"passed": false,
"gate": true,
"message": "<reason the check failed, e.g. file X does not exist>"
}
],
"reproducer": {
"seed": 12345,
"command": "keystone run --spec <your-spec-id> --seed 12345 --scenario scenario-000"
}
}
]
}
The reproducer gives you the exact command to re-run that specific scenario with the same seed for debugging.
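When a run fails, you'll usually want to pull those message fields out programmatically. Here's a small helper typed against the RunResults shape shown above -- the interfaces are hand-written approximations for this sketch, not SDK exports:

```typescript
// Hand-written types approximating the RunResults shape shown above.
interface InvariantResult { name: string; passed: boolean; gate?: boolean; message?: string; }
interface ScenarioResult { status: string; invariants: InvariantResult[]; reproducer?: { command: string }; }
interface RunResults { passed: number; failed: number; scenarios: ScenarioResult[]; }

// Collect one human-readable line per failed invariant, plus the repro command.
function summarizeFailures(results: RunResults): string[] {
  const lines: string[] = [];
  for (const scenario of results.scenarios) {
    if (scenario.status !== 'fail') continue;
    for (const inv of scenario.invariants) {
      if (!inv.passed) lines.push(`${inv.name}: ${inv.message ?? 'no message'}`);
    }
    if (scenario.reproducer) lines.push(`repro: ${scenario.reproducer.command}`);
  }
  return lines;
}
```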
Swap in your own agent
The spec template uses agent: type: paragon (Polarity's built-in agent) by default. To run your own agent, replace that block with one of these:
# You have a CLI binary on the server:
agent:
type: cli
binary: /path/to/your-agent
args: ["--task", "{{ task.prompt }}"]
timeout: 5m
# You have a Docker image in a registry:
agent:
type: image
image: "your-registry/your-agent:latest"
timeout: 5m
# Your agent is an HTTP API:
agent:
type: http
endpoint: "https://your-api.com/agent/run"
timeout: 5m
# You have a Python script:
agent:
type: python
binary: agent.py
timeout: 5m
# You want to version your agent as an immutable snapshot:
agent:
type: snapshot
snapshot: my-agent # uploaded via ks.agents.upload()
timeout: 5m
Which one should I use?
- cli -- your agent is a compiled binary or shell script that runs locally
- image -- your agent is packaged as a Docker image (best for reproducibility)
- http -- your agent is a hosted API (the sandbox POSTs the task to your endpoint)
- python -- your agent is a Python script in the sandbox
- snapshot -- your agent is uploaded to Keystone and versioned (best for tracking which agent version produced which results)
- paragon -- use Polarity's built-in Paragon agent (no setup needed)
See the spec reference for full details.
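For the http type, the sandbox POSTs the task to your endpoint. Here's a minimal Node sketch of such an endpoint -- the request payload shape (a JSON body with a prompt field) is an assumption for illustration; check the spec reference for the actual contract:

```typescript
import { createServer, IncomingMessage, ServerResponse } from 'http';

// Handle one task request. The `prompt` field is assumed for illustration;
// the real payload shape is defined in the spec reference.
function handleTask(body: string): string {
  const { prompt } = JSON.parse(body);
  // ...invoke your agent with `prompt` here...
  return JSON.stringify({ status: 'done', received: prompt });
}

const server = createServer((req: IncomingMessage, res: ServerResponse) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(handleTask(body));
  });
});

server.listen(8080);
```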
Key concepts
Before diving into the spec reference and SDK reference, here are the core concepts:
Specs are YAML files that describe everything about a test scenario: the environment, the task, and the pass/fail criteria. You upload them once and run experiments against them repeatedly.
Sandboxes are isolated environments where your agent runs. Each sandbox gets its own filesystem, Docker containers for backing services (Postgres, Redis, etc.), and a clean state. Nothing leaks between runs.
Invariants are the checks that run after your agent finishes. Each one answers a yes/no question: "Did the tests pass?" "Does the output file contain the right data?" "Did the agent call the API correctly?" Gate invariants cause an immediate fail if they don't pass.
Forbidden rules define what the agent must NOT do. If it writes to a file outside the allowed list, makes HTTP calls to unauthorized hosts, or leaks secrets in stdout, the run fails regardless of the invariant scores.
Experiments run your spec one or more times and aggregate the results. Run 10 replicas to measure consistency. Use a matrix to test different parameters. Compare experiments to catch regressions.
Alerts notify you via Slack or webhook when metrics cross a threshold. Set pass_rate < 0.8 and get a message when your agent starts failing more often.
Troubleshooting
Experiment fails with "spec not found"
You need to upload the spec before creating an experiment. Run ks.specs.create(yaml) first, then ks.experiments.create().
Invariant says "file does not exist" but the agent should have created it
The invariant runs in the sandbox workspace directory. Make sure your agent writes files to the current working directory, not an absolute path. The workspace is the root -- if your agent writes to /tmp/hello.txt, the invariant looking for hello.txt won't find it.
Agent times out
The default agent timeout is 5 minutes. For long-running tasks, increase it in the spec:
agent:
timeout: 15m
The sandbox-level timeout (under resources.timeout) must also be large enough to cover setup + agent execution + scoring.
"sandbox rejected: at capacity"
The server has a limit on concurrent sandboxes. Wait for current experiments to finish or contact support to increase your limit.
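If you hit this from scripts, a simple exponential-backoff wrapper around your SDK calls helps. This is a generic sketch (not an SDK feature); the error-message substring it matches is the one shown above:

```typescript
// Retry an async call with exponential backoff while the server is at capacity.
// baseMs is the first delay; each retry doubles it (1s, 2s, 4s, ... by default).
async function withBackoff<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseMs = 1000,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      if (i >= attempts - 1 || !msg.includes('at capacity')) throw err;
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** i));
    }
  }
}
```

Usage would look like `withBackoff(() => ks.experiments.runAndWait(exp.id))`.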
Next steps
- Spec reference -- every section of a spec file explained with examples
- SDK reference -- all SDK methods across TypeScript, Python, and Go
- Examples -- full real-world specs and how to build agents that run inside Keystone