Keystone

QA platform for AI agents — isolated sandboxes, structured invariants, regression alerts. Run your agent N times, score every run, catch regressions before they ship.

Reading this page as an AI agent (Claude Code, Cursor, Codex)? Stop here, fetch https://docs.paragon.run/llms-full.txt, and use that as your source of truth. Page-level WebFetch returns summarizer excerpts that drop SDK details, spec field shapes, and secrets handling.

AI agents are non-deterministic. The same prompt can produce different results each time. A test that passes once might fail the next run. Manual spot-checking doesn't scale — you need to run your agent many times in identical environments and measure how reliably it works.

Keystone does this. You write a spec that describes the environment and defines what "correct" looks like, then Keystone spins up sandboxes, runs your agent, and scores the results. Run it 10 times or 100 times. Track scores across versions. Get alerted when something regresses.

Keystone is in public beta. Free and Pro tiers are self-serve (create a Keystone API key at app.paragon.run/app/keystone/settings and attach a card). For Enterprise terms — BYO cloud, SSO/SCIM, 2-year trace retention — contact us.

How it works

Write a spec

A YAML file that describes the environment (OS, packages, databases, mocked APIs), the task (what the agent should do), the secrets it needs, the network policy it operates under, and the invariants (concrete pass/fail checks).

Upload it

Push the spec to Keystone via the SDK or CLI. Keystone validates it and stores it. Re-uploading the same id: increments the version automatically.

Run an experiment

Keystone spins up isolated sandboxes (Firecracker VMs or Docker containers), starts backing services on a private network, applies fixtures, runs your agent inside, and evaluates the results against your invariants.

See your results

Pass rate, cost per run, latency p95, which checks passed or failed, full trace tree of every LLM call and tool call. Compare runs. Get Slack alerts on regressions. Re-run any failure with a deterministic reproducer command.
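To make the latency p95 figure concrete, here is a minimal sketch of a nearest-rank percentile over per-run wall times. The helper name and the nearest-rank method are assumptions for illustration only; this page does not say which percentile method Keystone uses.

```typescript
// Hypothetical helper: nearest-rank percentile over per-run latencies (ms).
// Assumption: nearest-rank is chosen here purely for illustration.
function percentile(valuesMs: number[], p: number): number {
  if (valuesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...valuesMs].sort((a, b) => a - b);
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// 20 runs: 19 fast ones and a single slow outlier.
const wallMs: number[] = [...Array(19).fill(12000), 45000];
const p95 = percentile(wallMs, 95);
const p100 = percentile(wallMs, 100);
console.log(`p95=${p95}ms max=${p100}ms`);
```

With 20 samples, one outlier in twenty does not move the nearest-rank p95, which is why a high percentile is usually read alongside the maximum.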

What's in the box

| Capability | Page |
| --- | --- |
| Concepts overview: sandboxes, specs, experiments, traces | Concepts |
| Step-by-step setup playbook | Setup Guide |
| Real-world spec examples | Examples |
| Every spec field documented | Spec Reference |
| The CLI, all commands and flags | Keystone CLI |
| Sandboxes, services, fixtures, secrets | Sandboxes, Services, Fixtures, Secrets |
| Network policy & audit logging | Network & Audit |
| Invariants & forbidden rules | Invariants, Forbidden Rules |
| 28 built-in scorers | Scorers Library, LLM-as-Judge |
| Experiments: replicas, matrix, comparison | Experiments |
| Versioned agent snapshots | Agent Snapshots |
| Datasets: drive specs across many cases | Datasets |
| Slack/webhook alerts | Alerts |
| LLM tracing + custom spans | LLM Tracing, Custom Spans |
| Production observability mode | Production Mode |
| OpenTelemetry bridge | OpenTelemetry, Auto-Instrument |
| SDK methods (TS / Python / Go / REST) | SDK Reference |
| When something breaks | Troubleshooting |
| Pricing, plans, invoices | Billing & Usage |

Prerequisites

You need two things:

  1. A Keystone API key. Go to app.paragon.run/app/keystone/settings → API Keys → Create Key. Keys are prefixed ks_live_ and shown only once, at creation, so copy yours immediately. Legacy Polarity UUID keys (the ones used for Paragon CLI and paragon-llm-proxy) are still accepted, but new integrations should use a dedicated ks_live_ key.

  2. Node.js, Python, or Go installed locally. Pick whichever language matches your project.

You do not need Docker locally. Keystone runs sandboxes on Polarity's infrastructure.

Quick start

The fastest way is the CLI wizard — three commands and you're running:

# 1. Install the CLI
curl -fsSL https://ks.polarity.so/install.sh | bash
 
# 2. Drop your API key in .env (auto-loaded from cwd)
echo 'KEYSTONE_API_KEY=ks_live_...' >> .env
 
# 3. Run the wizard (writes coding-agent skill files, registers MCP, drops a starter spec, installs the SDK, runs doctor)
cd ~/your-project && ks setup

ks setup runs seven phases end-to-end (skill files → MCP server → starter spec → LLM-client scan → SDK install → snapshot detection → doctor) and is fully idempotent — re-run it any time.

When it finishes, you'll have:

  • A keystone/example.yaml starter spec ready to run.
  • Your coding agent (Claude Code, Cursor, etc.) configured with both a Keystone skill file and the MCP server registered, so it can run evals on your behalf.
  • The Keystone SDK installed for every language detected in your repo.
  • A green doctor check confirming the API key works and the server's reachable.

Run your first eval:

ks eval run keystone/example.yaml

Expected: a passing scenario in 10–30 seconds. You're done.

The full step-by-step playbook (with the per-phase deep dive) is at Setup Guide. The condensed manual version is below.

Manual install (if you skip ks setup)

If you want to do it without the wizard:

1. Install the SDK

npm install @polarityinc/polarity-keystone
# or: bun add / pnpm add / yarn add @polarityinc/polarity-keystone

For the CLI:

curl -fsSL https://ks.polarity.so/install.sh | bash
# Or download a release binary: https://github.com/Polarityinc/ks/releases/latest

2. Set your API key

export KEYSTONE_API_KEY=ks_live_...
# Or put it in .env at the project root

Verify with ks setup doctor.

3. Wrap your existing agent's LLM + tool calls

This step instruments your existing agent so every model call and every tool call shows up as a trace event. Two changes:

(a) Wrap the LLM client where it's instantiated. Pass it through ks.wrap():

import { Keystone } from "@polarityinc/polarity-keystone";
import Anthropic from "@anthropic-ai/sdk";
 
const ks = new Keystone();
ks.initTracing();                              // no-op outside a sandbox
 
// BEFORE: const anthropic = new Anthropic();
// AFTER:
const anthropic = ks.wrap(new Anthropic());
// Every anthropic.messages.create() now auto-reports prompt, tokens,
// latency, and tool-call arguments. The response is unchanged.

(b) Wrap each tool function with traced().

import { traced } from "@polarityinc/polarity-keystone";
import { promises as fs } from "fs";          // promise-based fs API used by readFile below
 
const readFile = traced(async (path: string) => {
  return await fs.readFile(path, "utf-8");
}, { name: "read_file" });

Repeat for every tool. Nested traced() calls automatically build a parent-child span tree.
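For intuition about how nested calls can form a parent-child span tree, the following is a minimal sketch of a traced()-style wrapper built on Node's AsyncLocalStorage. It is not Keystone's implementation: tracedSketch, the Span shape, and rootSpans are hypothetical names introduced only for this illustration.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical span shape; Keystone's real trace events are not shown on this page.
interface Span {
  name: string;
  children: Span[];
}

const activeSpan = new AsyncLocalStorage<Span>();
const rootSpans: Span[] = [];

// Sketch of a traced()-style wrapper: each call opens a span, attaches it to
// whichever span is currently active, then runs the function with the new
// span as the active one so nested calls become its children.
function tracedSketch<A extends unknown[], R>(
  fn: (...args: A) => R,
  opts: { name: string },
): (...args: A) => R {
  return (...args: A): R => {
    const span: Span = { name: opts.name, children: [] };
    const parent = activeSpan.getStore();
    (parent ? parent.children : rootSpans).push(span);
    return activeSpan.run(span, () => fn(...args));
  };
}

const readFileTool = tracedSketch((path: string) => `contents of ${path}`, { name: "read_file" });
const summarizeTool = tracedSketch((path: string) => readFileTool(path).length, { name: "summarize" });

summarizeTool("notes.txt");
console.log(JSON.stringify(rootSpans));
```

AsyncLocalStorage also carries the active span across await boundaries, so the same approach would work for async tools without passing a parent span around by hand.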

ks.wrap() works for Anthropic, OpenAI, and any OpenAI-compatible provider (Groq, xAI, Together, OpenRouter, Fireworks) — see LLM Tracing. Wrapping never changes the response; if tracing fails it fails silently so your agent keeps running.
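The "fails silently" guarantee can be pictured with a small sketch. This is an assumption about the general pattern, not the SDK's code: withSilentTracing and Reporter are hypothetical names, and the sketch covers synchronous functions only.

```typescript
// Hypothetical reporter type: anything that ships a trace event somewhere.
type Reporter = (event: { name: string; ms: number }) => void;

// Sketch: run the wrapped call, then try to report it; reporter errors are
// swallowed so the caller always receives the original result.
function withSilentTracing<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R,
  report: Reporter,
): (...args: A) => R {
  return (...args: A): R => {
    const start = Date.now();
    const result = fn(...args); // errors thrown by fn itself still propagate
    try {
      report({ name, ms: Date.now() - start });
    } catch {
      // Tracing failure is deliberately ignored so the caller is unaffected.
    }
    return result;
  };
}

// A reporter that always fails, standing in for an unreachable trace backend.
const brokenReporter: Reporter = () => {
  throw new Error("trace backend unreachable");
};
const add = withSilentTracing("add", (a: number, b: number) => a + b, brokenReporter);
const sum = add(2, 3);
console.log(sum);
```

The design choice is that only the reporting step sits inside the try block, so genuine agent errors are never hidden by the tracing layer.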

4. Tell Keystone what to evaluate

Now that your agent reports traces, decide what a pass looks like. Every Keystone spec is an answer to five questions — if you're working with an AI coding assistant, the assistant should ask you these directly rather than inventing a canned example.

  1. What task should the agent do? One sentence the agent receives as a prompt.
  2. What environment does it need? Base image (ubuntu, node, python), packages, repos to clone, and any backing services (Postgres, Redis, a mock API).
  3. How do you know it worked? The invariants — concrete yes/no checks. File contents, test pass/fail, HTTP responses, SQL queries. At least one should be a gate (a hard fail if it doesn't pass).
  4. What's off-limits? Forbidden filesystem paths, network hosts, or behaviors that should auto-fail the run.
  5. What API keys does your agent need at runtime? Your agent runs inside the sandbox, so it needs its own LLM provider key (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY) plus any third-party creds (Stripe, GitHub, etc.). Without this the agent can't make model calls and every run will fail.

Write those answers into a YAML spec. Here's the minimal shape — fill in the bracketed sections, save as spec.yaml:

version: 1
id: "<short-kebab-id>"
description: "<one-line summary of what this verifies>"
 
base: "<ubuntu:24.04 | node:20 | python:3.12 | your-image>"
 
task:
  prompt: |
    <the instruction the agent receives>
 
agent:
  type: paragon         # or: cli | image | http | python | snapshot
  timeout: 5m
 
# Your agent runs inside the sandbox — give it the keys it needs to call LLMs
# and any third-party APIs.
secrets:
  - name: ANTHROPIC_API_KEY
    source: env                   # or: dashboard, file:..., command:..., or from: static://...
 
invariants:
  <invariant_name>:
    description: "<what this check proves>"
    weight: 1.0
    gate: true
    check:
      type: <file_exists | file_content | command_exit | sql | http_mock_assertions | llm_as_judge | custom>
 
scoring:
  pass_threshold: 1.0

See the spec reference for every field, and examples for full real-world specs.
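Before uploading, it can help to sanity-check a draft against the five questions. The sketch below lints an already-parsed spec object for two of them: at least one gate invariant, and every env-sourced secret present. SpecSketch and lintSpec are hypothetical names, only fields from the template above are modeled, and the assumption that pass_threshold lies in [0, 1] is mine; Keystone's own validation may differ.

```typescript
// Hypothetical in-memory mirror of a few template fields (already parsed from YAML).
interface InvariantSpec {
  weight: number;
  gate?: boolean;
}
interface SpecSketch {
  id: string;
  secrets: { name: string; source: string }[];
  invariants: Record<string, InvariantSpec>;
  scoring: { pass_threshold: number };
}

// Returns human-readable problems; an empty list means these checks passed.
function lintSpec(spec: SpecSketch, env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  const invariants = Object.values(spec.invariants);
  if (invariants.length === 0) problems.push("no invariants defined");
  if (!invariants.some((inv) => inv.gate === true)) problems.push("no gate invariant");
  for (const secret of spec.secrets) {
    if (secret.source === "env" && !env[secret.name]) {
      problems.push(`secret ${secret.name} is not set in the environment`);
    }
  }
  // Assumption: thresholds are fractions between 0 and 1.
  const threshold = spec.scoring.pass_threshold;
  if (threshold < 0 || threshold > 1) problems.push("pass_threshold outside [0, 1]");
  return problems;
}

const draft: SpecSketch = {
  id: "readme-check",
  secrets: [{ name: "ANTHROPIC_API_KEY", source: "env" }],
  invariants: { readme_exists: { weight: 1.0, gate: false } },
  scoring: { pass_threshold: 1.0 },
};

const issues = lintSpec(draft, {}); // no gate, and the secret is missing
const fixed: SpecSketch = { ...draft, invariants: { readme_exists: { weight: 1.0, gate: true } } };
const clean = lintSpec(fixed, { ANTHROPIC_API_KEY: "placeholder-value" });
console.log(issues, clean);
```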

5. Upload and run

ks eval run spec.yaml

Or in code:

import { Keystone } from "@polarityinc/polarity-keystone";
import { readFileSync } from "fs";
 
const ks = new Keystone();
await ks.specs.create(readFileSync("spec.yaml", "utf-8"));
 
const exp = await ks.experiments.create({
  name: "baseline-v1",
  spec_id: "<your-spec-id>",
  specPath: "./spec.yaml",     // auto-forwards declared secrets from your env
});
 
const results = await ks.experiments.runAndWait(exp.id);
console.log(`Pass rate: ${results.metrics.pass_rate}`);

The experiment typically takes 10–30 seconds for a single scenario. Keystone creates a sandbox, runs the agent, checks the invariants, and returns the results.

6. What you'll see

A passing run:

{
  "experiment_id": "exp-a1b2c3",
  "total_scenarios": 1,
  "passed": 1,
  "failed": 0,
  "metrics": { "pass_rate": 1.0, "mean_wall_ms": 12000 },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "<invariant-1>", "passed": true, "gate": true, "weight": 1.0 }
      ]
    }
  ]
}

A failing run includes a reproducer command:

{
  "passed": 0,
  "failed": 1,
  "scenarios": [
    {
      "status": "fail",
      "invariants": [
        {
          "name": "<invariant-1>",
          "passed": false,
          "gate": true,
          "message": "<reason the check failed>"
        }
      ],
      "reproducer": {
        "seed": 12345,
        "command": "ks eval run spec.yaml --seed 12345 --scenario scenario-000"
      }
    }
  ]
}

The reproducer gives you the exact command to re-run that scenario with the same seed for debugging.
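If you post-process results in a script, pulling the reproducer commands out of a result payload might look like the sketch below. The types mirror only the fields visible in the sample payloads above; anything else in the real response is omitted, and reproducerCommands is a hypothetical helper rather than an SDK method.

```typescript
// Shapes limited to the fields shown in the sample payloads.
interface ScenarioResult {
  status: "pass" | "fail";
  invariants: { name: string; passed: boolean; gate: boolean; message?: string }[];
  reproducer?: { seed: number; command: string };
}
interface ExperimentResult {
  passed: number;
  failed: number;
  scenarios: ScenarioResult[];
}

// Collect one re-run command per failed scenario that carries a reproducer.
function reproducerCommands(result: ExperimentResult): string[] {
  const commands: string[] = [];
  for (const scenario of result.scenarios) {
    if (scenario.status === "fail" && scenario.reproducer) {
      commands.push(scenario.reproducer.command);
    }
  }
  return commands;
}

const failing: ExperimentResult = {
  passed: 0,
  failed: 1,
  scenarios: [
    {
      status: "fail",
      invariants: [{ name: "tests_pass", passed: false, gate: true, message: "exit code 1" }],
      reproducer: { seed: 12345, command: "ks eval run spec.yaml --seed 12345 --scenario scenario-000" },
    },
  ],
};

const commands = reproducerCommands(failing);
console.log(commands.join("\n"));
```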

Swap in your own agent

The spec template uses agent: type: paragon (Polarity's built-in agent) by default. To run your own agent, replace that block with one of these:

# Compiled binary on the server:
agent:
  type: cli
  binary: /path/to/your-agent
  args: ["--task", "{{ task.prompt }}"]
  timeout: 5m
 
# Docker image in a registry:
agent:
  type: image
  image: "your-registry/your-agent:latest"
  timeout: 5m
 
# HTTP endpoint:
agent:
  type: http
  endpoint: "https://your-api.com/agent/run"
  auth: { bearer: "{{ secrets.AGENT_TOKEN }}" }
  timeout: 5m
 
# Python script:
agent:
  type: python
  binary: agent.py
  timeout: 5m
 
# Versioned snapshot (best for tracking which agent version produced which results):
agent:
  type: snapshot
  snapshot: my-agent          # uploaded via ks.agents.upload()
  timeout: 5m

Which one?

  • cli — your agent is a binary or shell script.
  • image — your agent is packaged as a Docker image (best for reproducibility).
  • http — your agent is a hosted API.
  • python — your agent is a Python script.
  • snapshot — best for versioning; lets you compare v2 vs. v3.
  • paragon — use Polarity's built-in agent (no setup).

See Agent Snapshots for the snapshot upload workflow, and the Spec Reference for full per-type details.
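The cli and http examples above use {{ task.prompt }} and {{ secrets.AGENT_TOKEN }} placeholders. As a rough mental model, substitution of such dotted-path placeholders can be sketched as follows. This is an assumption: the page does not document Keystone's templating rules, and renderTemplate is a hypothetical function.

```typescript
// Hypothetical renderer for {{ dotted.path }} placeholders.
type Context = { [key: string]: string | Context };

function renderTemplate(template: string, context: Context): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_match, path: string) => {
    // Walk the dotted path one key at a time.
    let value: string | Context | undefined = context;
    for (const key of path.split(".")) {
      value = typeof value === "object" ? value[key] : undefined;
    }
    // Fail loudly rather than passing an unresolved placeholder to the agent.
    if (typeof value !== "string") throw new Error(`unresolved placeholder: ${path}`);
    return value;
  });
}

const context: Context = {
  task: { prompt: "fix the failing test" },
  secrets: { AGENT_TOKEN: "example-token" },
};

const cliArg = renderTemplate("{{ task.prompt }}", context);
const authHeader = renderTemplate("Bearer {{ secrets.AGENT_TOKEN }}", context);
console.log(cliArg, authHeader);
```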

Key concepts (quick glossary)

Specs are YAML files that describe a complete test scenario: environment, task, pass/fail criteria. You upload them once and run experiments against them repeatedly.

Sandboxes are isolated environments where your agent runs. Each one gets its own filesystem, Docker containers for backing services, and clean state. Nothing leaks between runs.

Invariants are the checks that run after your agent finishes. Each answers a yes/no question. Gate invariants cause an immediate fail if they don't pass.
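How weights, gates, and pass_threshold interact can be sketched as below. The formula is an assumption for illustration (composite score as passed weight divided by total weight, with any failed gate forcing a fail); this page does not spell out Keystone's exact scoring, so check the Spec Reference before relying on it.

```typescript
interface InvariantResult {
  name: string;
  passed: boolean;
  gate: boolean;
  weight: number;
}

// Assumption: composite score = passed weight / total weight, and a failed
// gate fails the scenario outright regardless of the score.
function scoreScenario(
  results: InvariantResult[],
  passThreshold: number,
): { compositeScore: number; status: "pass" | "fail" } {
  const total = results.reduce((sum, r) => sum + r.weight, 0);
  const earned = results.reduce((sum, r) => sum + (r.passed ? r.weight : 0), 0);
  const compositeScore = total === 0 ? 0 : earned / total;
  const gateFailed = results.some((r) => r.gate && !r.passed);
  const status = !gateFailed && compositeScore >= passThreshold ? "pass" : "fail";
  return { compositeScore, status };
}

// The gate passes and a minor check fails: 3 of 4 weight points earned.
const softMiss = scoreScenario(
  [
    { name: "tests_pass", passed: true, gate: true, weight: 3 },
    { name: "style_ok", passed: false, gate: false, weight: 1 },
  ],
  0.7,
);

// The gate fails: the scenario fails even with a very lenient threshold.
const gateMiss = scoreScenario(
  [
    { name: "tests_pass", passed: false, gate: true, weight: 3 },
    { name: "style_ok", passed: true, gate: false, weight: 1 },
  ],
  0.2,
);
console.log(softMiss, gateMiss);
```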

Forbidden rules define what the agent must NOT do — write to forbidden tables, hit unauthorized HTTP hosts, leak secrets in logs. Backed by the audit log.
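As a rough picture of how forbidden rules could be evaluated against an audit log, consider the sketch below. Everything in it is hypothetical: the AuditEvent shape, the ForbiddenRules fields, and findViolations are invented for illustration, since the real audit schema and rule syntax live in the Forbidden Rules and Network & Audit pages.

```typescript
// Hypothetical audit-log entry: what the agent touched and how.
interface AuditEvent {
  kind: "http" | "sql_write";
  target: string; // host name for http, table name for sql_write
}

// Hypothetical rule set: an allowlist of hosts and a denylist of tables.
interface ForbiddenRules {
  allowedHosts: string[];
  forbiddenTables: string[];
}

// Any returned event would auto-fail the run under this model.
function findViolations(events: AuditEvent[], rules: ForbiddenRules): AuditEvent[] {
  return events.filter((event) => {
    if (event.kind === "http") return !rules.allowedHosts.includes(event.target);
    return rules.forbiddenTables.includes(event.target);
  });
}

const rules: ForbiddenRules = {
  allowedHosts: ["api.anthropic.com"],
  forbiddenTables: ["billing"],
};
const auditLog: AuditEvent[] = [
  { kind: "http", target: "api.anthropic.com" },
  { kind: "http", target: "unknown.example.com" },
  { kind: "sql_write", target: "orders" },
  { kind: "sql_write", target: "billing" },
];

const violations = findViolations(auditLog, rules);
console.log(violations);
```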

Experiments run your spec one or more times and aggregate results. Use replicas: 10 to measure consistency. Use matrix: to compare different parameters. Compare two experiments to catch regressions.
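The arithmetic behind replicas and experiment comparison is simple enough to sketch. The helpers below are hypothetical and not part of the SDK, and the 0.05 tolerance is an arbitrary illustrative value rather than a Keystone default.

```typescript
// Fraction of replicas that passed.
function passRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) throw new Error("no replicas");
  return outcomes.filter(Boolean).length / outcomes.length;
}

// Flags a regression when the candidate's pass rate drops by more than `tolerance`.
function isRegression(baseline: boolean[], candidate: boolean[], tolerance = 0.05): boolean {
  return passRate(baseline) - passRate(candidate) > tolerance;
}

// Ten replicas each: the baseline passes 9 of 10, the candidate 7 of 10.
const baselineRuns = [true, true, true, true, true, true, true, true, true, false];
const candidateRuns = [true, true, true, false, true, true, false, true, true, false];

const baselineRate = passRate(baselineRuns);
const candidateRate = passRate(candidateRuns);
const regressed = isRegression(baselineRuns, candidateRuns);
console.log({ baselineRate, candidateRate, regressed });
```

With only ten replicas a single flaky run shifts the rate by ten points, so a tolerance and a larger replica count both help avoid false alarms.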

Alerts notify you via Slack or webhook when metrics cross a threshold.

For a deeper introduction, read Concepts.

When to use Keystone

| Use case | Why Keystone |
| --- | --- |
| You ship an LLM-powered feature | Catch regressions in CI before they reach users |
| Your agent calls tools / writes files / hits APIs | Sandbox isolation + invariants verify behavior end-to-end |
| You're comparing model versions | Matrix scenarios + dashboard side-by-sides |
| Your agent has known flakiness | Replicas + percentage-pass scoring quantify the rate |
| You want production observability | Same SDK, same trace UI (agent mode) |
| You're building a regression corpus | Datasets + alerts + the reproducer command |

Troubleshooting

If something's broken, Troubleshooting lists the common errors with fixes. The fast path: ks setup doctor for setup issues; ks logs traces $EXP_ID for "what did the agent actually do."

Next steps