# Concepts
Mental model for Keystone — sandboxes, specs, invariants, experiments, traces, and how they fit together.
AI agents: use https://docs.paragon.run/llms-full.txt as your source of truth — page-level WebFetch returns summarizer excerpts.
Before you write your first spec, it helps to know how the pieces relate. Keystone has a small set of primitives — eight of them — and every workflow recombines them. Read this once and the rest of the docs will read like footnotes.
## The unit of work: a scenario
A scenario is one run of your agent inside one sandbox, scored against one set of invariants. Everything else in Keystone is some combination of scenarios:
- Sandbox — the world the agent runs inside (filesystem, services, network).
- Spec — a YAML file declaring a sandbox's shape and what "correct" looks like.
- Invariant — a yes/no check that runs after the agent finishes.
- Experiment — N scenarios fired off together, results aggregated.
- Replica — one of many runs of the same scenario (used to measure flakiness).
- Matrix — variants of a scenario across parameters (model, locale, input size).
- Trace — the structured event log of what the agent actually did.
- Score — how invariants combine into a single composite verdict.
The product is the loop: spec → sandbox → agent → trace → invariants → score → trend.
## Sandboxes
A sandbox is an isolated, ephemeral environment. Every sandbox gets:
- Workspace — a writable filesystem rooted at `/workspace`.
- Services — backing containers (Postgres, Redis, mock APIs) on a private Docker network named `keystone-<sandbox-id>`. The agent reaches them by service name (`db:5432`).
- Audit log — a JSONL stream of every command, file write, DB write, and HTTP call.
- Network policy — a default-deny egress allowlist plus DNS overrides.
- Determinism — frozen clock, seeded RNG, static DNS (when configured).
- Resource limits — CPU, memory, disk, timeout.
Lifecycle: creating → ready → running → stopped (or error). Created from a spec via POST /v1/sandboxes, destroyed via DELETE /v1/sandboxes/:id. Nothing leaks between sandboxes — the network, the filesystem, and the secrets are all per-sandbox.
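A sketch of that lifecycle over REST, using the two endpoints above. The request body, the response fields, and the `KEYSTONE_API_URL` variable are assumptions; check the API reference for the real schema.

```python
import os
import requests

API = os.environ["KEYSTONE_API_URL"]  # base-URL variable name is an assumption
HEADERS = {"Authorization": f"Bearer {os.environ['KEYSTONE_API_KEY']}"}

# Create a sandbox from a spec. The body shown (a spec id) is an assumed shape.
sandbox = requests.post(f"{API}/v1/sandboxes",
                        json={"spec_id": "fix-failing-test"},
                        headers=HEADERS).json()
try:
    pass  # run the agent against the sandbox here
finally:
    # Sandboxes are ephemeral; destroy them so nothing lingers.
    requests.delete(f"{API}/v1/sandboxes/{sandbox['id']}", headers=HEADERS)
```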
The runtime under the hood is one of docker, firecracker, nomad, or local (auto-detected by the server). For self-hosted deployments you can pin a specific runtime; on Polarity-hosted it's Firecracker by default.
## Specs
A spec is a YAML file that describes a complete scenario. Minimum required fields are five — version, id, base, task, invariants — but a real spec usually declares 10+:
```yaml
version: 1
id: fix-failing-test
base: ubuntu:24.04            # base Docker image
task:
  prompt: "Fix the failing test in src/api.test.ts"
agent:
  type: paragon
  timeout: 5m
services: [...]               # optional backing containers
fixtures: [...]               # optional seed data (SQL, repos, files)
secrets: [...]                # optional credentials
network: { egress: ... }      # optional egress rules
audit: { ... }                # optional audit configuration
invariants: { ... }           # required pass/fail checks
forbidden: { ... }            # optional trajectory constraints
scoring: { ... }              # required composite scoring
parallelism: { ... }          # optional replicas + matrix
determinism: { ... }          # optional clock/seed/dns
teardown: { ... }             # optional artifact export
```

Specs are versioned automatically — uploading the same `id:` increments the version. The full reference is in Spec Reference.
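Uploading the spec is a single SDK call. A minimal sketch, assuming a Python client whose construction lives in the SDK docs (only `ks.specs.create()` is named on this page; the import path and constructor are assumptions):

```python
from pathlib import Path
import keystone  # import path assumed; see the SDK docs

ks = keystone.Client()  # constructor assumed; reads KEYSTONE_API_KEY from the env

# Re-uploading the same id: increments the spec version automatically.
spec = ks.specs.create(Path("fix-failing-test.yaml").read_text())
```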
## Invariants
An invariant is a postcondition that runs after the agent finishes. Each one returns a score in [0, 1] (1 = pass, 0 = fail, anything in between for graded checks like LLM-as-judge).
Eight built-in check types ship with the server:
| Type | What it asks |
|---|---|
| `command_exit` | Does the command return exit code 0 (or a specified code)? |
| `file_exists` | Does this file exist? |
| `file_absent` | Is this file missing? |
| `file_content` | Does this file contain a substring or regex? |
| `sql` | Does this SQL query return the expected value? |
| `http_mock_assertions` | Was this mock service called the right way? |
| `custom` | Did your Python script return `passed: true`? |
| `llm_as_judge` | Did the model rate the output above the threshold? |
Two flags shape behavior:
- `weight` — relative importance in the composite score. Defaults to 1.0.
- `gate: true` — if this invariant fails, the entire scenario fails regardless of other scores.
Composite score is the weighted average. If any gate invariant fails, the composite is forced to 0.
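For instance, a suite might weight a style check below a gated test run. The block below is an illustrative shape only: the per-check parameter names and the map-of-names layout are assumptions; the real fields live in the Spec Reference.

```yaml
invariants:
  tests_pass:
    type: command_exit
    command: "npm test"     # parameter name assumed
    gate: true              # a failure here forces the composite to 0
  changelog_updated:
    type: file_content
    path: "CHANGELOG.md"    # parameter names assumed
    contains: "fix"
    weight: 0.5             # counts half as much as a default-weight (1.0) check
```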
## Forbidden rules
Where invariants check end-state, forbidden rules check trajectory — what the agent did along the way. Backed by the audit log:
```yaml
forbidden:
  db_writes_outside: [users, orders]     # only allow writes to these tables
  http_except: [stripe-mock, smtp]       # only allow HTTP calls to these services
  secrets_in_logs: deny                  # fail if any secret appears in stdout
  file_writes_outside: [src/, output/]   # only allow writes inside these prefixes
```

Any violation auto-fails the scenario — even if every invariant passes.
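Conceptually, each rule is a predicate over audit-log events. A minimal sketch of how a `file_writes_outside` check could scan the JSONL stream, with an assumed event shape (the real audit schema is not shown on this page):

```python
import json

ALLOWED = ("src/", "output/")

def file_write_violations(audit_path: str) -> list[str]:
    """Paths written outside the allowed prefixes; any hit fails the scenario."""
    violations = []
    with open(audit_path) as f:
        for line in f:
            event = json.loads(line)
            # "type" and "path" are assumed field names for illustration.
            if event.get("type") == "file_write" and not event["path"].startswith(ALLOWED):
                violations.append(event["path"])
    return violations
```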
## Experiments
An experiment is a named batch of scenarios run against a spec. Two flavors:
- Single scenario with `replicas: 10` — runs the same spec 10 times to measure flakiness.
- Matrix with parameter combinations — runs each combination, optionally with replicas, to compare configurations.
```yaml
parallelism:
  replicas: 5
  matrix:
    - { model: claude-sonnet-4-5, locale: en_US }
    - { model: claude-opus-4-5, locale: en_US }
    - { model: gpt-4o, locale: ja_JP }
  # 3 matrix entries × 5 replicas = 15 total runs
```

Run via the SDK (`ks.experiments.create()` + `runAndWait()`), the CLI (`ks eval run spec.yaml`), or REST (`POST /v1/experiments` then `POST /v1/experiments/:id/run`). Results come back as a `RunResults` object with per-scenario invariant outcomes, traces, costs, and a reproducer command for any failure.
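In SDK terms, that flow looks roughly like this. Only the method names and the `RunResults` object come from this page; the arguments and result fields are assumptions.

```python
# ks is a constructed Keystone client (see the SDK docs).
exp = ks.experiments.create(spec="fix-failing-test")  # argument name assumed
results = exp.runAndWait()  # blocks until all scenarios finish; returns RunResults

for scenario in results.scenarios:  # field names assumed
    print(scenario.verdict, scenario.cost_usd)
    if scenario.verdict == "fail":
        print(scenario.reproducer)  # command that reproduces the failure locally
```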
## Traces
A trace is the JSONL event log of one scenario. Every wrapped LLM call and every traced() span emits two events: start and end, linked by span_id and parent_span_id. The result is a tree.
Two trace destinations exist depending on context:
- Sandbox mode — `POST /v1/sandboxes/:id/trace`. The default while running an experiment. Events are stamped with the sandbox ID and nest under the run.
- Agent mode — `POST /v1/traces`. The default when there's no sandbox (i.e., production). Events are stamped with the API key. Use this to instrument a real agent in prod and get the same observability you get during evals.
The SDK picks automatically: if KEYSTONE_SANDBOX_ID is in the environment, sandbox mode; otherwise, agent mode (assuming KEYSTONE_API_KEY is set). Same code, both paths.
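In code, instrumentation is two calls, sketched here with assumed signatures: `ks.wrap()` and `traced()` are named on this page, but their exact parameters and the `traced` import path are not.

```python
import anthropic
from keystone import traced  # import path assumed

# Wrapping the client makes every LLM call emit paired start/end events.
client = ks.wrap(anthropic.Anthropic())

with traced("plan_fix"):  # custom span; context-manager form assumed
    client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Fix the failing test in src/api.test.ts"}],
    )
```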
Event schema (abridged):
```ts
{
  ts: "2026-04-28T22:00:00Z",
  event_type: "llm_call" | "tool_use" | "tool_call",
  tool: "anthropic.create",        // or your tool name
  phase: "start" | "end" | "complete" | "invoked",
  span_id: "span_a1b2c3...",
  parent_span_id?: "span_...",     // links to enclosing span
  duration_ms: 245,
  status: "ok" | "error",
  cost?: { input_tokens, output_tokens, model, estimated_usd },
  input?: string,                  // truncated to ~4KB
  output?: string,
}
```

The dashboard renders the event tree, computes per-tool aggregates, and lets you filter by API key (in agent mode) or experiment (in sandbox mode).
## Scoring
The scoring: block is how invariants combine into a verdict.
```yaml
scoring:
  pass_threshold: 0.9        # composite >= 0.9 is "pass"
  replica_aggregation:
    strategy: majority       # or all_must_pass, percentage
    min_pass_rate: 0.8       # for percentage strategy
```

For a single scenario:

- Each invariant produces a score in `[0, 1]`.
- Composite = `sum(weight × score) / sum(weight)`.
- If any gate fails, composite is forced to 0.
- `composite >= pass_threshold` → status `pass`. Else `fail`.
For multi-replica experiments, replica_aggregation decides how the replica statuses combine into one scenario verdict. majority is a good default for noisy agents.
Scenario verdicts: pass, fail, flaky (some replicas passed, some didn't), error (sandbox itself failed to boot or the agent exceeded its timeout).
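The arithmetic is small enough to restate directly. A sketch of the composite rule and the verdict split from above (the data shapes are illustrative, and treating any gated score below 1.0 as a gate failure is an assumption):

```python
def composite(invariants: list[dict]) -> float:
    """invariants: [{"score": 0..1, "weight": float, "gate": bool}, ...]"""
    if any(inv["gate"] and inv["score"] < 1.0 for inv in invariants):
        return 0.0  # any failed gate forces the composite to 0
    total = sum(inv["weight"] for inv in invariants)
    return sum(inv["weight"] * inv["score"] for inv in invariants) / total

def verdict(replica_composites: list[float], pass_threshold: float = 0.9) -> str:
    passes = [c >= pass_threshold for c in replica_composites]
    if all(passes):
        return "pass"
    if not any(passes):
        return "fail"
    return "flaky"  # mixed replicas; replica_aggregation decides the final call
```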
## Datasets
A dataset is a versioned collection of (input, expected) pairs. Use them to drive a spec across many cases without writing a separate spec per case:
```python
ds = ks.datasets.create("customer-emails", "Renewal email scenarios")
ks.datasets.add_records(ds.id, [
    {"input": {"customer_id": "alice"}, "expected": {"subject_contains": "Renewal"}},
    {"input": {"customer_id": "ben"}, "expected": {"subject_contains": "Renewal"}},
])
```

A spec can iterate over dataset rows (treat `input` as the task prompt template, `expected` as parameters to invariants). Datasets are versioned — `add_records()` auto-increments — so an old experiment always sees the rows it ran against.
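To make the iteration concrete, a spec fragment could bind a dataset like this. Every field below (`dataset`, the `{{ ... }}` templating, the invariant parameters) is a hypothetical shape for illustration; the real binding syntax is in the Spec Reference.

```yaml
# All field names and the templating syntax here are assumptions.
dataset: customer-emails
task:
  prompt: "Draft a renewal email for customer {{ input.customer_id }}"
invariants:
  subject_ok:
    type: file_content
    path: "output/email.txt"
    contains: "{{ expected.subject_contains }}"
```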
## Alerts
An alert fires when a metric crosses a threshold:
```yaml
alerts:
  - name: pass-rate-drop
    eval_id: fix-failing-test
    condition: pass_rate < 0.8
    notify: slack
    slack_channel: "#agent-alerts"
```

Conditions have the form `<metric> <operator> <value>`, where metric is one of `pass_rate`, `mean_wall_ms`, `p95_wall_ms`, `total_cost_usd`, `mean_cost_per_run_usd`, `tool_success_rate`, `side_effect_violations`, `mean_tool_calls`. Notify via webhook (raw JSON or Slack Block Kit) or Slack Bot (channel post). See Alerts.
## Agent snapshots
An agent snapshot is an immutable, content-addressed bundle of your agent's code. Upload once, reference by name or version in any spec:
```yaml
agent:
  type: snapshot
  snapshot: my-agent          # latest version
  # or:
  snapshot_id: snap_abc123    # pin a specific content hash
```
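A regression check between two snapshots then uses `ks.experiments.compare(baseline, candidate)` from the workflow below. How experiments are pinned to a snapshot, and the version selector syntax, are assumptions in this sketch:

```python
# snapshot= and the name@version selector are assumed shapes.
baseline = ks.experiments.create(spec="fix-failing-test", snapshot="my-agent@v2")
candidate = ks.experiments.create(spec="fix-failing-test", snapshot="my-agent@v3")
baseline.runAndWait()
candidate.runAndWait()

diff = ks.experiments.compare(baseline, candidate)  # named in the workflow below
```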
## Putting it together
A typical workflow:
- Write a spec for one scenario: the task, the environment, the invariants.
- Upload it. `ks.specs.create(yaml)`.
- Wrap your agent with `ks.wrap()` and `traced()` so it emits trace events.
- Run an experiment. `ks.experiments.create()` + `runAndWait()`. Maybe with `replicas: 10`.
- Inspect the results. Pass rate, cost, trace tree, reproducer for any failure.
- Iterate. Tune the prompt, the model, the tools. Run again.
- Compare runs. `ks.experiments.compare(baseline, candidate)`. Catch regressions before they ship.
- Set up alerts. Get pinged when the pass rate slips in CI or production.
- Trace prod. Same SDK, no sandbox — get full observability on real traffic.
That's it. Every page from here on is a deeper look at one of these primitives.