Spec Reference
Every section of a Keystone spec file, explained with examples. AI agents: use https://docs.paragon.run/llms-full.txt as your source of truth — page-level WebFetch returns summarizer excerpts that drop spec field shapes and invariant details.
A spec is a YAML file that describes a complete test scenario. It tells Keystone what environment to create, what the agent should do inside it, and how to determine whether the agent succeeded.
A minimal spec needs just 5 fields: version, id, base, task, and invariants. Everything else is optional and adds capabilities as you need them.
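Putting those five fields together, a minimal spec might look like this sketch (the version value and id are illustrative, not prescribed):

```yaml
version: 1                  # illustrative — use the spec version your server expects
id: fix-failing-test
base: "ubuntu:24.04"
task:
  prompt: |
    Fix the failing test in src/utils.test.ts so that all tests pass.
invariants:
  tests_pass:
    weight: 1.0
    check:
      type: command_exit
      command: "npm test"
```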
Sections overview
Required for every spec:
| Section | What it does |
|---|---|
| task | The prompt and context for what the agent should do |
| base | Base Docker image for the sandbox (e.g. ubuntu:24.04) |
| agent | How the agent connects and runs |
| invariants | Pass/fail checks that run after the agent finishes |
| scoring | How check results are combined into a final score |
Optional -- add when you need them:
| Section | What it does | When to use |
|---|---|---|
| setup | Packages to install, commands to run, files to create | When the agent needs dependencies pre-installed |
| resources | Limits: timeout, memory, CPU, disk | When defaults (2 CPU, 2GB, 10min) aren't enough |
| fixtures | Seed data: clone a repo, load SQL, copy files | When the agent needs a codebase or database to work with |
| services | Databases, caches, mock APIs alongside the sandbox | When the agent interacts with external services |
| secrets | Test credentials -- generated fresh or from a vault | When the agent needs API keys or passwords |
| network | Control outbound access, redirect APIs to mocks | When you want to block real API calls during testing |
| audit | Record DB writes, HTTP calls, file changes | When you need forbidden checks or detailed logs |
| forbidden | Things the agent must NOT do | When you want to enforce boundaries on agent behavior |
| parallelism | Run multiple times with different parameters | When you want to measure consistency or compare configs |
| determinism | Pin the clock, seed randomness, freeze DNS | When you need reproducible runs |
| teardown | Export audit logs and database dumps | When you want to keep artifacts after each run |
task
The prompt that gets sent to the agent. Within this section, prompt is the only required field.
task:
prompt: |
Fix the failing test in src/utils.test.ts so that all tests pass.
Do not modify the source code in src/utils.ts.
context:
repo: "acme/backend"
language: "typescript"
The context field is optional key-value metadata passed alongside the prompt. Useful for giving the agent additional structured information.
base
The Docker image used as the sandbox's operating system. Required.
base: "ubuntu:24.04"
Any public Docker image works. For custom images, make sure the keystone server can pull from your registry.
setup
Runs before the agent starts. Use this to install dependencies, create config files, and set environment variables.
setup:
packages: [nodejs, npm, python3, git]
commands:
- "npm install"
- "python3 -m pip install pytest"
files:
- path: ".env"
content: |
DATABASE_URL=postgres://postgres:test@db:5432/testdb
NODE_ENV=test
env:
CI: "true"
Setup commands run in the sandbox workspace directory. If any command fails, sandbox creation fails and the experiment reports an error.
resources
Resource limits for the sandbox. If not specified, defaults to 2 CPUs, 2GB memory, and a 10-minute timeout.
resources:
timeout: 10m
memory: 4Gi
cpu: 4
disk: 20Gi
The timeout controls how long the entire sandbox lifecycle runs (including setup, agent execution, and scoring). If the agent exceeds this, it gets killed and the scenario is marked as an error.
fixtures
Seed data that gets loaded into the sandbox before the agent starts. Fixtures run after services boot but before the snapshot is extracted into /workspace, so anything that touches the filesystem needs to either inline its data or ship it alongside the spec.
fixtures:
# Load SQL inline — preferred for test seeds. Self-contained in the spec.
- type: sql
service: db
sql: |
CREATE TABLE customers (
email TEXT PRIMARY KEY,
name TEXT NOT NULL
);
INSERT INTO customers VALUES
('alice@example.com', 'Alice'),
('ben@example.com', 'Ben');
# Clone a git repo into the workspace
- type: git_repo
url: "https://github.com/your-org/your-repo"
branch: "main"
depth: 1
# Copy a directory into the sandbox
- type: directory
source: /path/to/test-data
target: data/
# Inject random data drift for adversarial testing
- type: drift
target: db.customers
strategy: random_mismatches
count: 15
seed: "42"
Fixture types:
| Type | What it does |
|---|---|
| sql | Runs SQL against a database service. Provide either sql: (inline script, preferred) or path: (file in workspace). Requires service. Credentials come from the service's declared POSTGRES_USER / POSTGRES_PASSWORD / POSTGRES_DB env (defaults to postgres / test / testdb). |
| git_repo | Clones a repository into the sandbox. Supports branch, depth, and path. |
| directory | Copies files from source to target inside the sandbox. Both paths are relative to the sandbox workspace directory. |
| drift | Injects random data modifications for adversarial testing. Uses strategy, count, and seed. |
Prefer inline sql: over path:. Fixtures run before the snapshot is extracted into /workspace, so a path: pointing at a file bundled in your snapshot will fail with ENOENT. Use inline SQL for seed data under ~1000 rows. For larger datasets, restore from an external source inside setup.commands (which run after setup file writes) or via the agent itself at start-up.
Service credentials are derived from the service spec's env: block. If you set POSTGRES_DB: crm on the service, fixtures run against crm — not the old testdb default. Same for POSTGRES_USER / POSTGRES_PASSWORD. Match your fixture SQL's CREATE TABLE to the database you declared.
services
Services are backing containers your agent talks to during a run — Postgres, Redis, Stripe-mock, anything. You declare them in the spec; Keystone pulls the image and starts the container on a shared Docker network. Your agent reaches each by its name (DNS alias), so the same connection strings work across every run.
Any public or private Docker image works
services:
- name: db
image: postgres:16 # Docker Hub
env:
POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
POSTGRES_DB: northwind
ports: [5432]
wait_for: "pg_isready -U postgres"
- name: cache
image: redis:7 # Docker Hub
ports: [6379]
- name: stripe_mock
image: stripe/stripe-mock:latest # third-party public image
env:
STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
ports: [12111]
- name: vector
image: ghcr.io/qdrant/qdrant:v1.7.4 # any registry — GHCR, ECR, GAR
ports: [6333]
No Dockerfile. No docker push. Keystone pulls on demand (cached after first run) and wires the container into the sandbox network with --network keystone-<sandbox-id> --network-alias <service-name>.
Inside the agent, reach services by name
Once the sandbox is up, each service's name resolves over DNS on the shared network:
// Agent code — no IP lookup, no port mapping, no env configuration needed
const db = new Client({ connectionString: "postgres://postgres:secret@db:5432/northwind" });
const cache = new Redis("redis://cache:6379");
const stripe = new Stripe(process.env.STRIPE_SECRET, { host: "stripe_mock", port: 12111 });
Keystone also injects KEYSTONE_SERVICE_<NAME>_HOST and KEYSTONE_SERVICE_<NAME>_PORT env vars automatically, so agents that prefer discovery over hardcoding can use process.env.KEYSTONE_SERVICE_DB_HOST (= "db").
Service env with secret interpolation
Service env: values support {{ secrets.NAME }} substitution — the value comes from whatever source the secrets: block declares (local .env, file, command, Dashboard, etc. — see secrets). Spec-literal strings work too.
secrets:
- name: DB_PASSWORD
source: env # from $DB_PASSWORD on the caller's machine
services:
- name: db
image: postgres:16
env:
POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}" # substituted at boot
POSTGRES_DB: northwind
This means you never bake real credentials into the spec. Rotate a key? Change your .env or Dashboard value — specs don't need to change.
Full field reference
| Field | Type | Required | What it does |
|---|---|---|---|
| name | string | yes | DNS alias on the shared network (reachable as <name>:<port> from the agent) |
| image | string | yes (unless type: http_mock) | Any Docker image — Docker Hub, GHCR, ECR, GAR, a private registry |
| env | map | no | Container env. Supports {{ secrets.X }} interpolation |
| ports | list of ints | no | Container ports to expose on the network (internal only — not published to the host) |
| wait_for | string | no | Shell command run inside the container every second, up to 60s, to gate readiness |
| type: http_mock | — | no | Switches to Keystone's built-in Go HTTP mock (no image needed). See below. |
Built-in HTTP mocks
When you don't want to run a real image and just need a scripted HTTP responder:
services:
- name: payment-api
type: http_mock
ports: [9090]
default_response: 200
record: true # capture every request for later assertion
routes:
- method: POST
path: /v1/charge
response: '{"status":"ok","charge_id":"ch_test_123"}'
status: 200
- method: GET
path: /v1/balance
response: '{"balance":10000}'
With record: true, invariants can later assert on what the agent sent:
invariants:
charged_once:
check:
type: http_mock_assertions
service: payment-api
assertions:
- field: request_count
filters: { path: "/v1/charge" }
equals: 1
Networking rules
- Every sandbox gets a dedicated Docker network keystone-<sandbox-id>. Services and the agent container all join it.
- Service containers are not reachable from the host — only from other containers in the same sandbox. No port conflicts across parallel sandboxes.
- Outbound traffic from services and the agent is governed by the top-level network.egress allowlist. If your agent needs to call api.stripe.com from within the sandbox, allow it there; by default all egress is denied.
Reuse the same image across scenarios. Keystone caches image pulls on the host — if 10 scenarios all use postgres:16, the image is pulled once and started 10 times. Don't worry about boot time for widely-used images.
Do not publish service ports with -p host:container syntax. Keystone manages ports internally. Declaring ports: [5432] exposes the port on the sandbox network only. This is intentional — it keeps parallel sandboxes from colliding on host ports.
secrets
The spec is the single source of truth for what secrets your agent needs and where each value comes from. You declare the name and (optionally) the source: — the SDK resolves, forwards in the create request, and the Keystone server injects into the sandbox container env.
secrets:
- name: XAI_API_KEY
source: env # read from $XAI_API_KEY on the caller's machine
- name: DB_PASSWORD
source: env:MY_DB_PASS # rename — read $MY_DB_PASS, inject as DB_PASSWORD
- name: OPENAI_API_KEY
source: "file:~/.openai/key" # read + trim file contents
- name: INTERNAL_TOKEN
source: command:op read "op://Dev/Keystone/token"
# exec a shell command, capture stdout
- name: STRIPE_LIVE_KEY
source: dashboard # server-side only — SDK refuses local overrides
- name: TEST_FIXTURE_TOKEN
from: "static://fake-test-value" # spec-owned literal (deterministic)
- name: GENERATED_DB_PASSWORD
from: generated # random 32-byte hex per run
Source types
| Source | Resolution | Use for |
|---|---|---|
| env (default if omitted) | process.env[NAME] | Local .env during dev; export KEY=... in CI |
| env:OTHER_NAME | process.env[OTHER_NAME] | Rename (env var on your machine uses a different name than the sandbox expects) |
| file:path | Trimmed contents of the file (supports ~/) | Secrets managed by an external tool that writes to a file (e.g. Cloud SDK auth) |
| command:<shell> | Exec the shell command, trim stdout | CLI-based secret managers (1Password, Doppler, Infisical, Vault CLI) |
| dashboard | Server decrypts Dashboard-stored value; SDK forwards nothing | Prod-critical keys that must never be overridable from a local machine |
| from: "static://..." | Spec-owned literal; wins over every other source | Deterministic test fixtures |
| from: generated | Random 32-byte hex, unique per run | Ephemeral test passwords |
Precedence
Highest wins:
- Spec literal (from: static://... or from: generated) — deterministic, never overridable
- SDK-forwarded source value (env, env:X, file:, command:) — resolved on the caller's machine
- Dashboard Secret — server-side fallback when the source is dashboard or resolution failed
A declared secret that can't be resolved by any layer fails the sandbox boot with the missing name — no silent empties.
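That layering can be sketched as a small resolver (a hypothetical helper for illustration — not the SDK's actual code):

```python
def resolve_secret(name, spec_literal=None, forwarded=None, dashboard=None):
    """Resolve one declared secret by the documented precedence:
    spec literal > SDK-forwarded source value > Dashboard secret.
    Raises if no layer produced a value — no silent empties."""
    for value in (spec_literal, forwarded, dashboard):
        if value is not None:
            return value
    # Mirrors the documented behavior: boot fails loudly with the missing name
    raise RuntimeError(f"secret {name!r} could not be resolved by any layer")
```

For example, a from: static:// literal wins even when a local env value was forwarded, and the Dashboard value is used only when the other two layers are empty.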
Never put real API keys behind from: "static://...". That block is for test fixtures only — it gets committed alongside the spec. Use source: env (values stay in your .env) or source: dashboard (encrypted at rest) for anything real.
Use source: dashboard for prod/shared keys to explicitly opt out of local overrides. A teammate's stray .env with STRIPE_LIVE_KEY=... can't accidentally be forwarded.
How the SDK reads it
When you call ks.experiments.create() or ks.sandboxes.create() with a specPath (TS) / spec_path (Python), the SDK:
- Parses the spec's secrets: block
- For every entry with a source: that isn't dashboard, resolves the value locally
- Forwards {NAME: value} in the create request body
- Server merges with Dashboard secrets (forwarded values win), injects as container env
Entries with from: (literals) and source: dashboard are not forwarded — the server resolves those at sandbox boot.
Using secrets in services
Service env values can reference any resolved secret via {{ secrets.NAME }}:
secrets:
- name: DB_PASSWORD
source: env
services:
- name: db
image: postgres:16
env:
POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}" # substituted at boot
POSTGRES_DB: northwind
ports: [5432]
Same source and precedence rules apply — the substitution uses whatever value the resolver produced.
Dashboard Secrets tab (team sharing + prod baseline)
Go to app.paragon.run/app/keystone/settings → Secrets tab to store values encrypted at rest. Dashboard values:
- Auto-inject into every sandbox when source: dashboard or the caller has no local value for a declared name
- Are scoped to the billing owner: on a team, all teammates share the same secrets
- Are AES-256-GCM encrypted; decrypted only in-process on the Keystone server
- Show a warning next to each key that's declared by a spec but not set
network
Controls what the sandbox can access on the network. Use this to prevent your agent from calling real APIs during testing.
network:
egress:
default: deny # block all outbound by default
allow:
- registry.npmjs.org # let npm install work
- github.com
- "*.services.internal" # internal service network
dns_overrides:
smtp.sendgrid.net: smtp.services.internal # redirect real API to mock
When egress.default is deny, only explicitly allowed hosts are reachable. DNS overrides redirect hostnames to different targets -- useful for pointing real API domains at mock services.
audit
Records everything the agent does inside the sandbox. The audit log is used by forbidden checks and can be exported during teardown.
audit:
db_writes: true # track INSERT/UPDATE/DELETE
http_calls: true # track outbound HTTP requests
process_spawns: true # track child processes
stdout_capture: true # capture stdout for secret detection
file_system:
watch: ["src/", "config/"] # directories to monitor
track: [writes, reads, deletes]
agent
Defines how the agent runs inside the sandbox. Keystone supports six agent types.
# Run the Paragon CLI
agent:
type: paragon
model: paragon-fast
timeout: 5m
# Run any CLI binary
agent:
type: cli
binary: /usr/local/bin/my-agent
args: ["--task", "{{ task.prompt }}"]
timeout: 5m
# Call an HTTP endpoint
agent:
type: http
endpoint: "https://api.example.com/agent/run"
auth:
bearer: "{{ secrets.API_KEY }}"
timeout: 5m
# Run a Python script
agent:
type: python
binary: agent.py
timeout: 5m
# Run a Docker image
agent:
type: image
image: "myregistry.io/my-agent:v3"
timeout: 5m
# Run from an uploaded agent snapshot
agent:
type: snapshot
snapshot: my-agent # resolves to latest version
# or pin a specific version:
# snapshot_id: snap_abc123
timeout: 5m
Agent types:
| Type | What it means |
|---|---|
| paragon | Runs the Paragon CLI with the task prompt on stdin |
| cli | Runs any binary with args. Template variables like {{ task.prompt }} are substituted |
| http | POSTs a JSON payload to an HTTP endpoint |
| python | Runs a Python script with the task as JSON on stdin |
| image | Pulls and runs a Docker image on the sandbox's service network |
| snapshot | Downloads and runs an immutable agent snapshot uploaded via ks.agents.upload() |
The task prompt is passed to the agent via stdin (for cli, paragon, python) or in the request body (for http). The agent has full access to the sandbox filesystem and can run commands, write files, and connect to services.
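For a python-type agent, reading that stdin payload can be sketched as follows. The exact field names ("prompt", "context") are assumptions for illustration — match them to the payload your Keystone server actually sends:

```python
import json
import sys


def read_task(stream=None):
    """Parse the task JSON that Keystone pipes to a python-type agent on stdin.

    Assumed payload shape (illustrative): {"prompt": "...", "context": {...}}
    """
    payload = json.load(stream or sys.stdin)
    return payload.get("prompt", ""), payload.get("context", {})
```

An agent script would call read_task() at startup, do its work against the sandbox filesystem and services, and exit.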
invariants
The checks that determine whether the agent passed or failed. Each invariant has a weight (how much it matters) and a check (what to evaluate).
Gate invariants (gate: true) cause an immediate fail if they don't pass, regardless of other scores.
invariants:
tests_pass:
description: "All tests pass"
weight: 1.0
gate: true # hard fail if this fails
check:
type: command_exit
command: "npm test"
output_correct:
description: "Output file contains expected data"
weight: 0.5
check:
type: file_content
path: output.json
contains: '"status": "success"'
no_debug_code:
description: "No console.log left in source"
weight: 0.2
check:
type: file_content
path: src/main.ts
not_contains: "console.log"
email_sent:
description: "Exactly one email was sent"
weight: 0.3
check:
type: http_mock_assertions
service: smtp
assertions:
- field: request_count
filters: { to: "user@example.com" }
equals: 1
code_quality:
description: "Code is clean and well-structured"
weight: 0.2
check:
type: llm_as_judge
model: paragon-fast
criteria: "Evaluate the code for readability, correctness, and minimal diff"
input_from: src/main.ts
rubric:
pass: "Clean, minimal change that fixes the issue"
fail: "Over-engineered, introduces unnecessary complexity"
pass_threshold: 0.6
Invariant types:
| Type | What it checks |
|---|---|
| command_exit | Runs a command and checks the exit code (default: 0) |
| file_exists | Checks that a file exists at a path |
| file_absent | Checks that a file does NOT exist |
| file_content | Checks that a file contains/doesn't contain a string or regex pattern |
| sql | Runs a SQL query against a database service and checks the result |
| http_mock_assertions | Checks requests recorded by a mock service |
| custom | Runs a Python script that returns {"passed": true/false, "reason": "..."} |
| llm_as_judge | Uses an LLM to evaluate subjective criteria with a rubric |
How scoring works
Each invariant produces a score between 0 and 1 (pass = 1, fail = 0, llm_as_judge = continuous). The composite score is the weighted average:
composite = sum(weight * score) / sum(weight)
If any gate invariant fails, the composite score is forced to 0 regardless of other results. The scenario passes if the composite score meets the scoring.pass_threshold.
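As a worked example of the formula, here is a minimal sketch (assuming pass/fail gates score 0 or 1) using the weights from the invariants example above:

```python
def composite_score(results):
    """Weighted average of invariant scores; a failed gate forces 0.

    results: list of (weight, score, is_gate) tuples, score in [0, 1].
    Assumes gate invariants are pass/fail (score 0 or 1), per the docs.
    """
    if any(is_gate and score == 0 for _, score, is_gate in results):
        return 0.0  # gate failure overrides everything else
    total_weight = sum(w for w, _, _ in results)
    return sum(w * s for w, s, _ in results) / total_weight

# tests_pass (gate, w=1.0) passed, output_correct (w=0.5) failed,
# no_debug_code (w=0.2) passed -> 1.2 / 1.7
print(round(composite_score([(1.0, 1, True), (0.5, 0, False), (0.2, 1, False)]), 3))
```

With a pass_threshold of 0.6, that 0.706 composite would pass; if the gate had failed instead, the score would be forced to 0 and the scenario would fail.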
forbidden
Trajectory constraints that check the agent's behavior, not just its output. These use the audit log to detect unauthorized actions.
forbidden:
# Only allow DB writes to these tables
db_writes_outside: [users, orders, audit_log]
# Only allow HTTP calls to these services
http_except: [payment-api, smtp]
# Fail if secrets appear in stdout
secrets_in_logs: deny
# Only allow file writes to these paths
file_writes_outside: [src/, config/, output/]
If any forbidden rule is violated, the scenario fails regardless of the invariant score.
scoring
How invariant results are combined into a final verdict.
scoring:
pass_threshold: 0.9 # composite score must be >= 0.9 to pass
# For multi-replica experiments
replica_aggregation:
strategy: majority # all_must_pass, majority, or percentage
min_pass_rate: 0.8
parallelism
Run the same scenario multiple times and/or with different parameters.
parallelism:
replicas: 10 # run 10 times
isolation: per_run # fresh sandbox per run
# Test with different parameters
matrix:
- model: "gpt-4o"
- model: "claude-sonnet-4-20250514"
- model: "gemini-pro"
With this config, Keystone runs 3 (matrix entries) x 10 (replicas) = 30 total scenarios.
determinism
Pin sources of non-determinism for reproducible runs.
determinism:
clock: "2026-01-01T00:00:00Z" # fixed timestamp
seed: 42 # deterministic RNG
dns: static # static DNS resolution
When a scenario fails, the reproducer in the results includes the seed, so you can re-run the exact same scenario.
teardown
Export artifacts after each run. Runs even if the agent fails when always_run is set.
teardown:
always_run: true
export:
- type: audit_log
to: "results/audit.jsonl"
- type: db_dump
service: db
to: "results/db.sql"
- type: snapshot
to: "results/final-state/"
- type: mock_requests
service: payment-api
to: "results/api-calls.json"