Keystone

Spec Reference

Every section of a Keystone spec file, explained with examples.

AI agents: use https://docs.paragon.run/llms-full.txt as your source of truth — page-level WebFetch returns summarizer excerpts that drop spec field shapes and invariant details.

A spec is a YAML file that describes a complete test scenario. It tells Keystone what environment to create, what the agent should do inside it, and how to determine whether the agent succeeded.

A minimal spec needs just 5 fields: version, id, base, task, and invariants. Everything else is optional and adds capabilities as you need them.

Sections overview

Required for every spec:

| Section | What it does |
|---|---|
| task | The prompt and context for what the agent should do |
| base | Base Docker image for the sandbox (e.g. ubuntu:24.04) |
| agent | How the agent connects and runs |
| invariants | Pass/fail checks that run after the agent finishes |
| scoring | How check results are combined into a final score |

Optional -- add when you need them:

| Section | What it does | When to use |
|---|---|---|
| setup | Packages to install, commands to run, files to create | When the agent needs dependencies pre-installed |
| resources | Limits: timeout, memory, CPU, disk | When defaults (2 CPU, 2GB, 10min) aren't enough |
| fixtures | Seed data: clone a repo, load SQL, copy files | When the agent needs a codebase or database to work with |
| services | Databases, caches, mock APIs alongside the sandbox | When the agent interacts with external services |
| secrets | Test credentials -- generated fresh or from a vault | When the agent needs API keys or passwords |
| network | Control outbound access, redirect APIs to mocks | When you want to block real API calls during testing |
| audit | Record DB writes, HTTP calls, file changes | When you need forbidden checks or detailed logs |
| forbidden | Things the agent must NOT do | When you want to enforce boundaries on agent behavior |
| parallelism | Run multiple times with different parameters | When you want to measure consistency or compare configs |
| determinism | Pin the clock, seed randomness, freeze DNS | When you need reproducible runs |
| teardown | Export audit logs and database dumps | When you want to keep artifacts after each run |

task

The prompt that gets sent to the agent. Within task, prompt is the only required field.

task:
  prompt: |
    Fix the failing test in src/utils.test.ts so that all tests pass.
    Do not modify the source code in src/utils.ts.
  context:
    repo: "acme/backend"
    language: "typescript"

The context field is optional key-value metadata passed alongside the prompt. Useful for giving the agent additional structured information.

base

The Docker image used as the sandbox's operating system. Required.

base: "ubuntu:24.04"

Any public Docker image works. For custom images, make sure the Keystone server can pull from your registry.

setup

Runs before the agent starts. Use this to install dependencies, create config files, and set environment variables.

setup:
  packages: [nodejs, npm, python3, git]
  commands:
    - "npm install"
    - "python3 -m pip install pytest"
  files:
    - path: ".env"
      content: |
        DATABASE_URL=postgres://postgres:test@db:5432/testdb
        NODE_ENV=test
  env:
    CI: "true"

Setup commands run in the sandbox workspace directory. If any command fails, sandbox creation fails and the experiment reports an error.

resources

Resource limits for the sandbox. If not specified, defaults to 2 CPUs, 2GB memory, and a 10-minute timeout.

resources:
  timeout: 10m
  memory: 4Gi
  cpu: 4
  disk: 20Gi

The timeout controls how long the entire sandbox lifecycle runs (including setup, agent execution, and scoring). If the limit is exceeded, the sandbox is killed and the scenario is marked as an error.

fixtures

Seed data that gets loaded into the sandbox before the agent starts. Fixtures run after services boot but before the snapshot is extracted into /workspace, so anything that touches the filesystem needs to either inline its data or ship it alongside the spec.

fixtures:
  # Load SQL inline — preferred for test seeds. Self-contained in the spec.
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers (
        email TEXT PRIMARY KEY,
        name  TEXT NOT NULL
      );
      INSERT INTO customers VALUES
        ('alice@example.com', 'Alice'),
        ('ben@example.com',   'Ben');
 
  # Clone a git repo into the workspace
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "main"
    depth: 1
 
  # Copy a directory into the sandbox
  - type: directory
    source: /path/to/test-data
    target: data/
 
  # Inject random data drift for adversarial testing
  - type: drift
    target: db.customers
    strategy: random_mismatches
    count: 15
    seed: "42"

Fixture types:

| Type | What it does |
|---|---|
| sql | Runs SQL against a database service. Provide either sql: (inline script, preferred) or path: (file in workspace). Requires service. Credentials come from the service's declared POSTGRES_USER / POSTGRES_PASSWORD / POSTGRES_DB env (defaults to postgres / test / testdb). |
| git_repo | Clones a repository into the sandbox. Supports branch, depth, and path. |
| directory | Copies files from source to target inside the sandbox. Both paths are relative to the sandbox workspace directory. |
| drift | Injects random data modifications for adversarial testing. Uses strategy, count, and seed. |

Prefer inline sql: over path:. Fixtures run before the snapshot is extracted into /workspace, so a path: pointing at a file bundled in your snapshot will fail with ENOENT. Use inline SQL for seed data under ~1000 rows. For larger datasets, restore from an external source inside setup.commands (which runs after setup file writes) or via the agent itself at start-up.

Service credentials are derived from the service spec's env: block. If you set POSTGRES_DB: crm on the service, fixtures run against crm — not the old testdb default. Same for POSTGRES_USER / POSTGRES_PASSWORD. Match your fixture SQL's CREATE TABLE to the database you declared.
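For example, a service that declares a custom database paired with a fixture that matches it (the crm database and accounts table are illustrative names, not defaults):

```yaml
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: crm            # fixtures now run against "crm", not "testdb"
    ports: [5432]
    wait_for: "pg_isready -U postgres"

fixtures:
  - type: sql
    service: db                   # must name the service declared above
    sql: |
      CREATE TABLE accounts (id SERIAL PRIMARY KEY, name TEXT NOT NULL);
```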

services

Services are backing containers your agent talks to during a run — Postgres, Redis, Stripe-mock, anything. You declare them in the spec; Keystone pulls the image and starts the container on a shared Docker network. Your agent reaches each by its name (DNS alias), so the same connection strings work across every run.

Any public or private Docker image works

services:
  - name: db
    image: postgres:16                           # Docker Hub
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: northwind
    ports: [5432]
    wait_for: "pg_isready -U postgres"
 
  - name: cache
    image: redis:7                               # Docker Hub
    ports: [6379]
 
  - name: stripe_mock
    image: stripe/stripe-mock:latest             # third-party public image
    env:
      STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
    ports: [12111]
 
  - name: vector
    image: ghcr.io/qdrant/qdrant:v1.7.4          # any registry — GHCR, ECR, GAR
    ports: [6333]

No Dockerfile. No docker push. Keystone pulls on demand (cached after first run) and wires the container into the sandbox network with --network keystone-<sandbox-id> --network-alias <service-name>.

Inside the agent, reach services by name

Once the sandbox is up, each service's name resolves over DNS on the shared network:

// Agent code — no IP lookup, no port mapping, no env configuration needed
const db     = new Client({ connectionString: "postgres://postgres:secret@db:5432/northwind" });
const cache  = new Redis("redis://cache:6379");
const stripe = new Stripe(process.env.STRIPE_SECRET, { host: "stripe_mock", port: 12111 });

Keystone also injects KEYSTONE_SERVICE_<NAME>_HOST and KEYSTONE_SERVICE_<NAME>_PORT env vars automatically, so agents that prefer discovery over hardcoding can use process.env.KEYSTONE_SERVICE_DB_HOST (= "db").

Service env with secret interpolation

Service env: values support {{ secrets.NAME }} substitution — the value comes from whatever source the secrets: block declares (local .env, file, command, Dashboard, etc. — see secrets). Spec-literal strings work too.

secrets:
  - name: DB_PASSWORD
    source: env              # from $DB_PASSWORD on the caller's machine
 
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"   # substituted at boot
      POSTGRES_DB: northwind

This means you never bake real credentials into the spec. Rotate a key? Change your .env or Dashboard value — specs don't need to change.

Full field reference

| Field | Type | Required | What it does |
|---|---|---|---|
| name | string | yes | DNS alias on the shared network (reachable as <name>:<port> from the agent) |
| image | string | yes (unless type: http_mock) | Any Docker image -- Docker Hub, GHCR, ECR, GAR, a private registry |
| env | map | no | Container env. Supports {{ secrets.X }} interpolation |
| ports | list of ints | no | Container ports to expose on the network (internal only -- not published to the host) |
| wait_for | string | no | Shell command run inside the container every second, up to 60s, to gate readiness |
| type | string | no | Set to http_mock to switch to Keystone's built-in Go HTTP mock (no image needed). See below. |

Built-in HTTP mocks

When you don't want to run a real image and just need a scripted HTTP responder:

services:
  - name: payment-api
    type: http_mock
    ports: [9090]
    default_response: 200
    record: true                # capture every request for later assertion
    routes:
      - method: POST
        path: /v1/charge
        response: '{"status":"ok","charge_id":"ch_test_123"}'
        status: 200
      - method: GET
        path: /v1/balance
        response: '{"balance":10000}'

With record: true, invariants can later assert on what the agent sent:

invariants:
  charged_once:
    check:
      type: http_mock_assertions
      service: payment-api
      assertions:
        - field: request_count
          filters: { path: "/v1/charge" }
          equals: 1

Networking rules

  • Every sandbox gets a dedicated Docker network keystone-<sandbox-id>. Services and the agent container all join it.
  • Service containers are not reachable from the host — only from other containers in the same sandbox. No port conflicts across parallel sandboxes.
  • Outbound traffic from services and the agent is governed by the top-level network.egress allowlist. If your agent needs to call api.stripe.com from within the sandbox, allow it there; by default all egress is denied.

Reuse the same image across scenarios. Keystone caches image pulls on the host — if 10 scenarios all use postgres:16, the image is pulled once and started 10 times. Don't worry about boot time for widely-used images.

Do not publish service ports with -p host:container syntax. Keystone manages ports internally. Declaring ports: [5432] exposes the port on the sandbox network only. This is intentional — it keeps parallel sandboxes from colliding on host ports.

secrets

The spec is the single source of truth for what secrets your agent needs and where each value comes from. You declare the name and (optionally) the source: — the SDK resolves, forwards in the create request, and the Keystone server injects into the sandbox container env.

secrets:
  - name: XAI_API_KEY
    source: env                            # read from $XAI_API_KEY on the caller's machine
 
  - name: DB_PASSWORD
    source: env:MY_DB_PASS                 # rename — read $MY_DB_PASS, inject as DB_PASSWORD
 
  - name: OPENAI_API_KEY
    source: "file:~/.openai/key"           # read + trim file contents
 
  - name: INTERNAL_TOKEN
    source: command:op read "op://Dev/Keystone/token"
                                           # exec a shell command, capture stdout
 
  - name: STRIPE_LIVE_KEY
    source: dashboard                      # server-side only — SDK refuses local overrides
 
  - name: TEST_FIXTURE_TOKEN
    from: "static://fake-test-value"       # spec-owned literal (deterministic)
 
  - name: GENERATED_DB_PASSWORD
    from: generated                        # random 32-byte hex per run

Source types

| Source | Resolution | Use for |
|---|---|---|
| env (default if omitted) | process.env[NAME] | Local .env during dev; export KEY=... in CI |
| env:OTHER_NAME | process.env[OTHER_NAME] | Rename (env var on your machine uses a different name than the sandbox expects) |
| file:path | Trimmed contents of the file (supports ~/) | Secrets managed by an external tool that writes to a file (e.g. Cloud SDK auth) |
| command:<shell> | Exec the shell command, trim stdout | CLI-based secret managers (1Password, Doppler, Infisical, Vault CLI) |
| dashboard | Server decrypts Dashboard-stored value; SDK forwards nothing | Prod-critical keys that must never be overridable from a local machine |
| from: "static://..." | Spec-owned literal; wins over every other source | Deterministic test fixtures |
| from: generated | Random 32-byte hex, unique per run | Ephemeral test passwords |

Precedence

Highest wins:

  1. Spec literal (from: static://... or from: generated) — deterministic, never overridable
  2. SDK-forwarded source value (env, env:X, file:, command:) — resolved on the caller's machine
  3. Dashboard Secret — server-side fallback when the source is dashboard or resolution failed

A declared secret that can't be resolved by any layer fails the sandbox boot with the missing name — no silent empties.
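The layering above can be sketched in a few lines of Python. The function name and data shapes here are illustrative, not the SDK's actual internals:

```python
def resolve_secrets(declared, local_values, dashboard_values):
    """Merge secret values by precedence: spec literal > SDK-forwarded > Dashboard.

    declared: list of dicts mirroring the spec's `secrets:` entries.
    local_values: {name: value} resolved on the caller's machine (env/file/command).
    dashboard_values: {name: value} stored server-side.
    """
    resolved = {}
    for entry in declared:
        name = entry["name"]
        if "from" in entry:
            # 1. Spec literal wins: static:// strips the scheme; generated is a
            #    placeholder here (the real server draws random bytes per run).
            lit = entry["from"]
            resolved[name] = (
                lit.removeprefix("static://") if lit.startswith("static://") else "<generated>"
            )
        elif name in local_values and entry.get("source") != "dashboard":
            # 2. SDK-forwarded value (env, env:X, file:, command:)
            resolved[name] = local_values[name]
        elif name in dashboard_values:
            # 3. Dashboard fallback
            resolved[name] = dashboard_values[name]
        else:
            # No layer resolved it: fail the boot with the missing name
            raise KeyError(f"secret {name!r} could not be resolved")
    return resolved
```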

Never put real API keys behind from: "static://...". That block is for test fixtures only — it gets committed alongside the spec. Use source: env (values stay in your .env) or source: dashboard (encrypted at rest) for anything real.

Use source: dashboard for prod/shared keys to explicitly opt out of local overrides. A teammate's stray .env with STRIPE_LIVE_KEY=... can't accidentally be forwarded.

How the SDK reads it

When you call ks.experiments.create() or ks.sandboxes.create() with a specPath (TS) / spec_path (Python), the SDK:

  1. Parses the spec's secrets: block
  2. For every entry with a source: that isn't dashboard, resolves the value locally
  3. Forwards {NAME: value} in the create request body
  4. Server merges with Dashboard secrets (forwarded values win), injects as container env

Entries with from: (literals) and source: dashboard are not forwarded — the server resolves those at sandbox boot.

Using secrets in services

Service env values can reference any resolved secret via {{ secrets.NAME }}:

secrets:
  - name: DB_PASSWORD
    source: env
 
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"   # substituted at boot
      POSTGRES_DB: northwind
    ports: [5432]

Same source and precedence rules apply — the substitution uses whatever value the resolver produced.
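The substitution itself amounts to a simple template pass. A hedged sketch of the idea in Python (not Keystone's actual implementation):

```python
import re

def interpolate(value, secrets):
    """Replace {{ secrets.NAME }} placeholders in a service env value."""
    def repl(match):
        name = match.group(1)
        if name not in secrets:
            raise KeyError(f"undeclared secret {name!r} referenced in service env")
        return secrets[name]
    # Allow optional whitespace inside the braces, as in the spec examples
    return re.sub(r"\{\{\s*secrets\.(\w+)\s*\}\}", repl, value)
```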

Dashboard Secrets tab (team sharing + prod baseline)

Go to app.paragon.run/app/keystone/settings and open the Secrets tab to store values encrypted at rest. Dashboard values:

  • Auto-inject into every sandbox when source: dashboard or the caller has no local value for a declared name
  • Are scoped to the billing owner: on a team, all teammates share the same secrets
  • Are AES-256-GCM encrypted; decrypted only in-process on the Keystone server
  • Show a warning next to each key that's declared by a spec but not set

network

Controls what the sandbox can access on the network. Use this to prevent your agent from calling real APIs during testing.

network:
  egress:
    default: deny           # block all outbound by default
    allow:
      - registry.npmjs.org  # let npm install work
      - github.com
      - "*.services.internal"  # internal service network
  dns_overrides:
    smtp.sendgrid.net: smtp.services.internal  # redirect real API to mock

When egress.default is deny, only explicitly allowed hosts are reachable. DNS overrides redirect hostnames to different targets -- useful for pointing real API domains at mock services.
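The allowlist semantics can be sketched with shell-style wildcard matching. This is an illustration of the matching behavior, not Keystone's enforcement code:

```python
from fnmatch import fnmatch

def egress_allowed(host, allow, default="deny"):
    """Return True if outbound traffic to `host` should be permitted.

    `allow` is the list from network.egress.allow; patterns like
    "*.services.internal" match any subdomain.
    """
    if any(fnmatch(host, pattern) for pattern in allow):
        return True
    return default == "allow"
```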

audit

Records everything the agent does inside the sandbox. The audit log is used by forbidden checks and can be exported during teardown.

audit:
  db_writes: true           # track INSERT/UPDATE/DELETE
  http_calls: true          # track outbound HTTP requests
  process_spawns: true      # track child processes
  stdout_capture: true      # capture stdout for secret detection
  file_system:
    watch: ["src/", "config/"]  # directories to monitor
    track: [writes, reads, deletes]

agent

Defines how the agent runs inside the sandbox. Keystone supports six agent types.

# Run the Paragon CLI
agent:
  type: paragon
  model: paragon-fast
  timeout: 5m
 
# Run any CLI binary
agent:
  type: cli
  binary: /usr/local/bin/my-agent
  args: ["--task", "{{ task.prompt }}"]
  timeout: 5m
 
# Call an HTTP endpoint
agent:
  type: http
  endpoint: "https://api.example.com/agent/run"
  auth:
    bearer: "{{ secrets.API_KEY }}"
  timeout: 5m
 
# Run a Python script
agent:
  type: python
  binary: agent.py
  timeout: 5m
 
# Run a Docker image
agent:
  type: image
  image: "myregistry.io/my-agent:v3"
  timeout: 5m
 
# Run from an uploaded agent snapshot
agent:
  type: snapshot
  snapshot: my-agent        # resolves to latest version
  # or pin a specific version:
  # snapshot_id: snap_abc123
  timeout: 5m

Agent types:

| Type | What it means |
|---|---|
| paragon | Runs the Paragon CLI with the task prompt on stdin |
| cli | Runs any binary with args. Template variables like {{ task.prompt }} are substituted |
| http | POSTs a JSON payload to an HTTP endpoint |
| python | Runs a Python script with the task as JSON on stdin |
| image | Pulls and runs a Docker image on the sandbox's service network |
| snapshot | Downloads and runs an immutable agent snapshot uploaded via ks.agents.upload() |

The task prompt is passed to the agent via stdin (for cli, paragon, python) or in the request body (for http). The agent has full access to the sandbox filesystem and can run commands, write files, and connect to services.
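A python-type agent therefore just parses its stdin. A minimal sketch, assuming the payload is a JSON object with top-level prompt and context keys (the exact shape is an assumption for illustration):

```python
import json
import sys

def parse_task(raw):
    """Parse the task payload Keystone writes to a python-type agent's stdin."""
    task = json.loads(raw)
    return task["prompt"], task.get("context", {})

# In a real agent entry point you would call:
#   prompt, context = parse_task(sys.stdin.read())
sample = '{"prompt": "Fix the failing test", "context": {"repo": "acme/backend"}}'
prompt, context = parse_task(sample)
```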

invariants

The checks that determine whether the agent passed or failed. Each invariant has a weight (how much it matters) and a check (what to evaluate).

Gate invariants (gate: true) cause an immediate fail if they don't pass, regardless of other scores.

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true              # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"
 
  output_correct:
    description: "Output file contains expected data"
    weight: 0.5
    check:
      type: file_content
      path: output.json
      contains: '"status": "success"'
 
  no_debug_code:
    description: "No console.log left in source"
    weight: 0.2
    check:
      type: file_content
      path: src/main.ts
      not_contains: "console.log"
 
  email_sent:
    description: "Exactly one email was sent"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "user@example.com" }
          equals: 1
 
  code_quality:
    description: "Code is clean and well-structured"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Evaluate the code for readability, correctness, and minimal diff"
      input_from: src/main.ts
      rubric:
        pass: "Clean, minimal change that fixes the issue"
        fail: "Over-engineered, introduces unnecessary complexity"
      pass_threshold: 0.6

Invariant types:

| Type | What it checks |
|---|---|
| command_exit | Runs a command and checks the exit code (default: 0) |
| file_exists | Checks that a file exists at a path |
| file_absent | Checks that a file does NOT exist |
| file_content | Checks that a file contains/doesn't contain a string or regex pattern |
| sql | Runs a SQL query against a database service and checks the result |
| http_mock_assertions | Checks requests recorded by a mock service |
| custom | Runs a Python script that returns {"passed": true/false, "reason": "..."} |
| llm_as_judge | Uses an LLM to evaluate subjective criteria with a rubric |
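A custom check is just a script that prints the verdict JSON described above. A minimal sketch (the artifact path is an illustrative example, not a Keystone convention):

```python
import json
import os

def run_check(path="dist/bundle.js"):
    """Example custom invariant: a build artifact must exist and be non-empty."""
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return {"passed": True, "reason": f"{path} exists and is non-empty"}
    return {"passed": False, "reason": f"{path} missing or empty"}

if __name__ == "__main__":
    # Keystone reads this JSON object from the script's stdout
    print(json.dumps(run_check()))
```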

How scoring works

Each invariant produces a score between 0 and 1 (pass = 1, fail = 0, llm_as_judge = continuous). The composite score is the weighted average:

composite = sum(weight * score) / sum(weight)

If any gate invariant fails, the composite score is forced to 0 regardless of other results. The scenario passes if the composite score meets the scoring.pass_threshold.
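The composite computation above, as a runnable sketch (assuming gates are pass/fail checks, so a failing gate scores 0):

```python
def composite_score(invariants):
    """Weighted average of invariant scores, with gate override.

    invariants: list of (weight, score, gate) tuples; each score is in [0, 1].
    """
    # A failed gate forces the composite to 0 regardless of other results
    if any(gate and score == 0.0 for _, score, gate in invariants):
        return 0.0
    total_weight = sum(weight for weight, _, _ in invariants)
    return sum(weight * score for weight, score, _ in invariants) / total_weight
```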

forbidden

Trajectory constraints that check the agent's behavior, not just its output. These use the audit log to detect unauthorized actions.

forbidden:
  # Only allow DB writes to these tables
  db_writes_outside: [users, orders, audit_log]
 
  # Only allow HTTP calls to these services
  http_except: [payment-api, smtp]
 
  # Fail if secrets appear in stdout
  secrets_in_logs: deny
 
  # Only allow file writes to these paths
  file_writes_outside: [src/, config/, output/]

If any forbidden rule is violated, the scenario fails regardless of the invariant score.

scoring

How invariant results are combined into a final verdict.

scoring:
  pass_threshold: 0.9          # composite score must be >= 0.9 to pass
 
  # For multi-replica experiments
  replica_aggregation:
    strategy: majority          # all_must_pass, majority, or percentage
    min_pass_rate: 0.8

parallelism

Run the same scenario multiple times and/or with different parameters.

parallelism:
  replicas: 10                 # run 10 times
  isolation: per_run           # fresh sandbox per run
 
  # Test with different parameters
  matrix:
    - model: "gpt-4o"
    - model: "claude-sonnet-4-20250514"
    - model: "gemini-pro"

With this config, Keystone runs 3 (matrix entries) x 10 (replicas) = 30 total scenarios.
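The expansion is a simple cross product of matrix entries and replica indices; a sketch of the bookkeeping (not Keystone's scheduler):

```python
def expand_runs(matrix, replicas):
    """Cross every matrix entry with the replica count to get concrete runs."""
    entries = matrix or [{}]  # no matrix: a single parameterless entry
    return [(params, i) for params in entries for i in range(replicas)]

runs = expand_runs(
    [
        {"model": "gpt-4o"},
        {"model": "claude-sonnet-4-20250514"},
        {"model": "gemini-pro"},
    ],
    replicas=10,
)
```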

determinism

Pin sources of non-determinism for reproducible runs.

determinism:
  clock: "2026-01-01T00:00:00Z"   # fixed timestamp
  seed: 42                         # deterministic RNG
  dns: static                     # static DNS resolution

When a scenario fails, the reproducer in the results includes the seed, so you can re-run the exact same scenario.

teardown

Export artifacts after each run. When always_run is set, teardown runs even if the agent fails.

teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/db.sql"
    - type: snapshot
      to: "results/final-state/"
    - type: mock_requests
      service: payment-api
      to: "results/api-calls.json"