Spec Reference

Every section of a Keystone spec — fields, defaults, types, and worked examples for each block.

A Keystone spec is a YAML file that describes everything about one scenario: the environment to spin up, the task the agent should do, the credentials it needs, the policies it operates under, and how to score the result. This page is the complete field-by-field reference. For an introduction to the concepts, start with Concepts; for end-to-end examples, see Examples.

The minimum viable spec

Seven required fields (version, id, base, task, agent, invariants, scoring) are enough to run a hello-world:

version: 1
id: hello-world
description: "Smoke test — does the agent write a file?"
 
base: "ubuntu:24.04"
 
task:
  prompt: "Create a file called hello.txt with the text 'Hello, World!' inside."
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  hello_file_exists:
    description: "hello.txt was created"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: hello.txt
 
  hello_file_correct:
    description: "hello.txt contains the right text"
    weight: 1.0
    check:
      type: file_content
      path: hello.txt
      contains: "Hello, World!"
 
scoring:
  pass_threshold: 1.0

Run with ks eval run hello.yaml. Every other section in the spec is optional — add what you need.

Top-level fields

Field	Type	Required	Default	Notes
`version`	int	yes	—	Always `1` for now
`id`	string	yes	—	Kebab-case ID. Re-uploading the same id increments version
`description`	string	no	""	Human-readable description
`extends`	string	no	""	Path to a base spec to inherit from (planned for v1.1)
`task`	object	yes	—	What the agent should do
`base`	string	yes	—	Base Docker image
`agent`	object	yes	—	How the agent gets invoked
`invariants`	map	yes	—	Pass/fail checks
`scoring`	object	yes	—	Composite scoring + replica aggregation
`setup`	object	no	—	Packages, commands, files, env vars
`resources`	object	no	(see below)	CPU/memory/disk/timeout
`fixtures`	list	no	[]	Seed data (git repos, SQL, files)
`services`	list	no	[]	Backing containers (DB, cache, mocks)
`secrets`	list	no	[]	Credentials
`network`	object	no	(see below)	Egress/ingress rules, DNS overrides
`audit`	object	no	(see below)	Structured capture of agent activity
`snapshots`	object	no	(see below)	World-state checkpointing
`forbidden`	object	no	—	Trajectory constraints
`parallelism`	object	no	(see below)	Replicas + matrix
`determinism`	object	no	—	Clock/seed/DNS pinning
`retention`	object	no	(see below)	How long to keep artifacts
`teardown`	object	no	—	Export actions after run

The next sections walk through every block.

`task` — what the agent should do

task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co.
  context:
    ticket_url: "https://linear.app/co/issue/ENG-1234"
    success_criteria: "All rows match, exactly one summary email sent"
    repo: "acme/backend"
    language: "typescript"

Field	Type	Required	Notes
`prompt`	string	yes	The instruction the agent receives, verbatim
`context`	map of strings	no	Extra structured info passed alongside the prompt

context is free-form — your agent decides how to use it. Common keys:

success_criteria — what counts as done (also used in some LLM judges)
ticket_url — link to the originating ticket
repo / language / framework — type hints
style_guide — link or inline guidance

The prompt + context are passed to the agent via stdin (for cli, paragon, python agents) or in the request body (for http agents).

Best practices

Be specific. "Fix the bug" is too vague. "Fix the failing test in src/api.test.ts so npm test returns exit 0 without modifying src/api.ts" is testable.
Restate the invariants in plain English. If the invariant requires a file at output.json, say so in the prompt. The model cannot read the spec; it only sees the prompt and context.
Bound the work. "Refactor the codebase" is unbounded; "Refactor the users.py module to use async SQLAlchemy" is bounded.

`base` — the base Docker image

base: "ubuntu:24.04"

Any Docker image works. Common choices:

Image	When to use
`ubuntu:24.04`	General Linux, install whatever you need in `setup`
`node:20` / `node:22`	Node/TS agents and tasks
`python:3.12` / `python:3.11`	Python agents and tasks
`golang:1.22`	Go agents and tasks
`mcr.microsoft.com/playwright`	Browser automation
`your-registry.io/your-image:tag`	Pre-baked custom image

Custom images need to be reachable by the Keystone server (Docker Hub, GHCR, ECR, GAR, or a registry the server has credentials for).

base: is a string — there is no nested object form. To customize the environment beyond the bare image, use the setup: block (packages, commands, files, env).

`setup` — environment prep

setup:
  packages: [nodejs, npm, python3, git, postgresql-client]
  commands:
    - "npm install -g typescript"
    - "pip install pytest"
  files:
    - path: ".env"
      content: |
        DATABASE_URL=postgres://postgres:test@db:5432/testdb
        NODE_ENV=test
    - path: "tsconfig.json"
      content: '{"compilerOptions": {"strict": true}}'
  env:
    CI: "true"
    NODE_ENV: "test"

Field	Type	Notes
`packages`	list of strings	apt/apk packages installed before any commands run
`commands`	list of strings	Shell commands run sequentially in the workspace
`files`	list of `{path, content, template?}`	Files written into the workspace
`env`	map	Env vars set for the agent (and for `setup.commands`)

Order: packages are installed first, then files are written, then commands run. If any command fails, the sandbox fails to boot.

Setup vs. fixtures: setup is for static prep (install Node, write tsconfig). Fixtures are for seed data (clone a repo, load SQL). The two run at different boot stages — see Fixtures.

`setup.files` — templating

setup:
  files:
    - path: "config.json"
      template: '{"key": "{{ secrets.API_KEY }}"}'

template interpolates {{ secrets.NAME }} placeholders before writing. Use template: instead of content: when you need secret values inside the file. Otherwise prefer content:.

`agent` — how the agent gets invoked

Six agent types ship with Keystone. Pick whichever fits how your agent is packaged.

`type: paragon` — Polarity's built-in agent

agent:
  type: paragon
  model: paragon-fast      # or paragon-max, claude-sonnet-4-5, gpt-4o, etc.
  args: ["--yolo", "--quiet"]
  timeout: 5m
  env:
    PARAGON_TOOL_TRACE_FILE: "{{ sandbox.trace_path }}"

The Paragon CLI runs inside the sandbox with the task prompt on stdin. It is the default choice when you just want a coding agent.

`type: cli` — any binary

agent:
  type: cli
  binary: /workspace/my-agent
  args: ["--task", "{{ task.prompt }}", "--workspace", "{{ sandbox.path }}"]
  timeout: 5m
  env:
    LOG_LEVEL: "info"

Runs an executable. Template variables: {{ task.prompt }}, {{ task.context.<key> }}, {{ sandbox.path }}.

`type: http` — call an HTTP endpoint

agent:
  type: http
  endpoint: "https://my-agent.internal/run"
  auth:
    bearer: "{{ secrets.AGENT_TOKEN }}"
  input_template: |
    {
      "task": "{{ task.prompt }}",
      "workspace": "{{ sandbox.url }}",
      "context": {{ task.context | tojson }}
    }
  timeout: 5m

POSTs the rendered input_template to endpoint. Use this when your agent runs as a service outside the sandbox.

`type: python` — Python script

agent:
  type: python
  binary: agent.py
  args: ["--mode", "eval"]
  timeout: 5m

Runs python3 agent.py [args] with the task as JSON on stdin.
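As a rough illustration of the receiving side, the sketch below parses the task from a text stream. The key names `prompt` and `context` are assumptions made for this example; this page only states that the task arrives as JSON on stdin.

```python
import io
import json
from typing import TextIO


def read_task(stream: TextIO) -> tuple[str, dict]:
    """Parse a task payload from a text stream such as sys.stdin.

    ASSUMPTION: the payload is a JSON object with a "prompt" string and an
    optional "context" map. These key names are hypothetical.
    """
    payload = json.loads(stream.read())
    return payload["prompt"], payload.get("context") or {}


# Local demo with an in-memory stream; a real agent.py would pass sys.stdin.
demo = io.StringIO('{"prompt": "Create hello.txt", "context": {"language": "python"}}')
prompt, context = read_task(demo)
print(prompt, context)
```

Keeping the parser separate from the stream makes the agent easy to exercise locally before running it in a sandbox.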

`type: image` — Docker image

agent:
  type: image
  image: "your-registry/your-agent:v3"
  entrypoint: ["python", "main.py"]   # optional override
  timeout: 5m
  env:
    LOG_LEVEL: "debug"

Pulls the image and runs it on the sandbox's service network. Best for reproducibility.

`type: snapshot` — uploaded agent

agent:
  type: snapshot
  snapshot: my-agent              # latest version
  # or pin a specific version:
  # snapshot_id: snap_abc123
  entrypoint: ["python", "main.py"]   # optional override
  timeout: 5m

Resolves to an agent snapshot you uploaded via ks.agents.upload(). Best for tracking which agent version produced which results.

Common fields across all agent types

Field	Type	Default	Notes
`type`	string	required	`paragon`, `cli`, `http`, `python`, `image`, `snapshot`
`timeout`	duration	`5m`	Max wall time before agent is killed
`env`	map		Env vars for the agent process

`resources` — compute & timeout limits

resources:
  timeout: "10m"
  memory: "4Gi"
  cpu: 4
  disk: "20Gi"
  desktop: false               # true = XFCE desktop with browser available
  concurrency_limit: 5         # max sandboxes for this spec at once

Field	Type	Default	Notes
`timeout`	duration	`10m`	Total sandbox lifecycle (setup + agent + scoring)
`memory`	string	`2Gi`	Memory limit (Ki/Mi/Gi)
`cpu`	int	`2`	CPU cores
`disk`	string	`10Gi`	Disk quota
`desktop`	bool	false	Spin up XFCE for browser-using agents
`concurrency_limit`	int	tenant default	Max parallel sandboxes for this spec

The timeout here covers the entire sandbox lifecycle, not just the agent. If your agent has a 5-minute timeout and your invariants need another 2 minutes, set resources.timeout: 8m (or more, for safety).
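The budget arithmetic can be checked with a small helper. This is only a sketch: the accepted duration units (ms, s, m, h, d) are inferred from the examples on this page, and `parse_duration` and `timeout_headroom` are hypothetical names, not part of Keystone.

```python
import re

# ASSUMPTION: unit suffixes inferred from values such as "5m", "24h", "7d", "0ms".
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> float:
    """Convert a duration string like "5m" into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def timeout_headroom(resources_timeout: str, agent_timeout: str, other_phases: list[str]) -> float:
    """Seconds left in the sandbox lifecycle after the agent and other phases."""
    used = parse_duration(agent_timeout) + sum(parse_duration(p) for p in other_phases)
    return parse_duration(resources_timeout) - used
```

With the numbers from the paragraph above, `timeout_headroom("8m", "5m", ["2m"])` leaves one minute for setup; a negative result means the sandbox would be killed before scoring finishes.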

`fixtures` — seed data

Fixtures load data into the sandbox after services boot but before the agent runs. Four types: git_repo, sql, directory, drift. See Fixtures for the deep dive.

fixtures:
  # Clone a git repository into /workspace
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "main"
    depth: 1
    path: /workspace
 
  # Run inline SQL against a service (preferred over path:)
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers (
        email TEXT PRIMARY KEY,
        name  TEXT NOT NULL
      );
      INSERT INTO customers VALUES
        ('alice@example.com', 'Alice'),
        ('ben@example.com',   'Ben');
 
  # Copy a directory from the Keystone server's filesystem
  - type: directory
    source: /var/keystone/seeds/test-data
    target: data/
 
  # Inject random data corruption for adversarial testing
  - type: drift
    target: db.customers
    strategy: random_mismatches      # or random_nulls, duplicate_rows
    count: 15
    seed: "{{ determinism.seed }}"

Fixtures run in declaration order. A failure aborts sandbox boot.

`services` — backing containers

services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready -U postgres"
 
  - name: cache
    image: redis:7
    ports: [6379]
    wait_for: "redis-cli ping"
 
  - name: smtp
    image: mailhog/mailhog
    ports: [1025, 8025]    # SMTP + HTTP UI
 
  - name: stripe-mock
    type: http_mock
    record: true
    default_response: 404
    routes:
      - method: POST
        path: /v1/charge
        response: '{"id":"ch_test","status":"succeeded"}'
        status: 200
      - method: ANY
        path: "/v1/webhooks/.*"   # regex match
        response: '{"ok": true}'

The full reference is in Services. Service field summary:

Field	Type	Required	Notes
`name`	string	yes	DNS alias on the shared network
`image`	string	yes (unless `type: http_mock`)	Any Docker image
`type`	string	no	Empty (default) or `http_mock` for built-in mocks
`env`	map	no	Container env. Supports `{{ secrets.X }}`
`ports`	list of ints	no	Container ports exposed on the network
`wait_for`	string	no	Shell command run inside; gates readiness
`record`	bool	no	(`http_mock` only) capture all requests
`default_response`	int	no	(`http_mock`) status for unmatched routes
`routes`	list	no	(`http_mock`) per-method/path responders

Services are reachable from the agent by name: postgres://db:5432, redis://cache:6379, http://stripe-mock:9090. Auto-injected env vars: KEYSTONE_SERVICE_<NAME>_HOST, KEYSTONE_SERVICE_<NAME>_PORT.
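From inside the agent, the injected variables can be read roughly as follows. How a service name is turned into `<NAME>` is not specified here, so the uppercasing and the hyphen-to-underscore mapping below are assumptions, and `service_address` is a hypothetical helper.

```python
import os
import re
from typing import Mapping


def service_env_prefix(name: str) -> str:
    """Build the KEYSTONE_SERVICE_<NAME> prefix for a service.

    ASSUMPTION: <NAME> is the service name uppercased, with every character
    that is not a letter or digit replaced by "_".
    """
    return "KEYSTONE_SERVICE_" + re.sub(r"[^A-Za-z0-9]", "_", name).upper()


def service_address(name: str, env: Mapping[str, str] = os.environ) -> tuple[str, int]:
    """Return (host, port) for a service from the auto-injected env vars."""
    prefix = service_env_prefix(name)
    return env[f"{prefix}_HOST"], int(env[f"{prefix}_PORT"])
```

Passing `env` explicitly keeps the helper testable outside a sandbox; inside one, `service_address("db")` would read the real environment.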

`secrets` — credentials

secrets:
  - name: ANTHROPIC_API_KEY
    source: env                  # default; pulls $ANTHROPIC_API_KEY from caller's env
 
  - name: DB_PASSWORD
    source: env:LOCAL_DB_PASS    # rename — pull $LOCAL_DB_PASS, expose as DB_PASSWORD
 
  - name: GCP_KEY
    source: "file:~/.config/gcloud/sa.json"
 
  - name: VAULT_TOKEN
    source: 'command:vault token lookup -field=id'
 
  - name: STRIPE_LIVE_KEY
    source: dashboard            # server-side only; SDK refuses local override
 
  - name: TEST_FIXTURE_TOKEN
    from: "static://fake-test-value"   # spec literal; deterministic
 
  - name: GENERATED_DB_PASSWORD
    from: generated              # random per-run

Field	Type	Notes
`name`	string	Env var name inside the sandbox
`source`	string	`env`, `env:OTHER`, `file:...`, `command:...`, `dashboard`
`from`	string	`static://<value>` or `generated` (mutually exclusive with `source`)
`scope`	string or `{env, file_template}`	Where the secret is exposed (default: `env`)

Precedence: from: > source: (resolved locally) > Dashboard. A secret that resolves to nothing fails the sandbox boot. See Secrets for the full deep dive.
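To make the `source:` forms concrete, here is a sketch of how a client could interpret them. It is not the SDK's implementation: the function names are hypothetical, and details such as running commands without a shell and trimming whitespace are assumptions.

```python
import os
import shlex
import subprocess
from pathlib import Path


def parse_source(name: str, source: str = "env") -> tuple[str, str]:
    """Split a `source:` value into (kind, argument)."""
    if source == "env":
        return "env", name        # bare `env` reads the variable named after the secret
    if source == "dashboard":
        return "dashboard", ""    # resolved server-side, nothing to read locally
    kind, sep, arg = source.partition(":")
    if sep and arg and kind in {"env", "file", "command"}:
        return kind, arg
    raise ValueError(f"unsupported source: {source!r}")


def resolve_locally(name: str, source: str = "env") -> str:
    """Resolve a secret on the caller's machine; raise if it resolves to nothing."""
    kind, arg = parse_source(name, source)
    if kind == "env":
        value = os.environ.get(arg, "")
    elif kind == "file":
        value = Path(arg).expanduser().read_text()
    elif kind == "command":
        # ASSUMPTION: no shell is involved and surrounding whitespace is trimmed.
        result = subprocess.run(shlex.split(arg), capture_output=True, text=True, check=True)
        value = result.stdout.strip()
    else:
        raise LookupError(f"{name} is dashboard-only and cannot be resolved locally")
    if not value:
        raise LookupError(f"secret {name} resolved to nothing")
    return value
```

The final check mirrors the rule above that a secret resolving to nothing fails the boot rather than silently becoming an empty string.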

Scope examples

secrets:
  # Default: env var only
  - name: API_KEY
    source: env
 
  # Env var AND templated into a config file
  - name: STRIPE_KEY
    from: "static://sk_test_xxx"
    scope:
      env: true
      file_template: "config/stripe.json"

file_template: is templated at sandbox boot — {{ STRIPE_KEY }} placeholders in the named file are replaced with the resolved value.

`network` — egress, ingress, DNS overrides

network:
  egress:
    default: deny           # block all outbound by default
    allow:
      - registry.npmjs.org
      - github.com
      - "*.services.internal"  # internal service network
  ingress:
    default: deny
    allow:
      - from: host
        to_port: 3000
  dns_overrides:
    api.stripe.com: stripe-mock.services.internal
    smtp.sendgrid.net: smtp.services.internal

Section	What it controls
`egress.default`	`deny` (block all outbound) or `allow` (let everything through)
`egress.allow`	Hostnames or globs that bypass the default
`ingress.default`	Inbound from outside the sandbox
`ingress.allow`	Specific inbound rules (e.g., from host to port)
`dns_overrides`	Map real hostnames to in-sandbox services

Default-deny + DNS overrides is the canonical pattern: the agent's existing code calls https://api.stripe.com/..., but inside the sandbox that resolves to your stripe-mock container. See Network & Audit.
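The interaction between DNS overrides and the egress allow-list can be modelled in a few lines. This is a sketch under assumptions: that overrides are applied before the egress check, and that globs behave like shell-style patterns (where `*` also matches dots). `route` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Optional


def egress_allowed(host: str, default: str, allow: list[str]) -> bool:
    """Apply `egress.default` and `egress.allow` to one hostname."""
    if default == "allow":
        return True
    host = host.lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in allow)


def route(host: str, network: dict) -> Optional[str]:
    """Return the hostname a request ends up at, or None if it is blocked.

    ASSUMPTION: dns_overrides are applied first, then the egress rules are
    checked against the overridden name.
    """
    target = network.get("dns_overrides", {}).get(host, host)
    egress = network.get("egress", {})
    # ASSUMPTION: a missing egress.default is treated as "deny" in this sketch.
    if egress_allowed(target, egress.get("default", "deny"), egress.get("allow", [])):
        return target
    return None
```

With the example block above, `api.stripe.com` routes to `stripe-mock.services.internal` because the override lands inside the `*.services.internal` allow rule, while an unlisted host is blocked.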

`audit` — capture activity

audit:
  db_writes: true              # log every INSERT/UPDATE/DELETE
  http_calls: true             # log every outbound HTTP
  process_spawns: true         # log every child process
  stdout_capture: true         # capture stdout for secret detection
  file_system:
    watch: ["src/", "config/"]
    track: [writes, reads, deletes]

Field	Type	Default	Notes
`db_writes`	bool	false	Required for `forbidden.db_writes_outside`
`http_calls`	bool	false	Required for `forbidden.http_except`
`process_spawns`	bool	false	Logs every child process exec
`stdout_capture`	bool	false	Required for `forbidden.secrets_in_logs`
`file_system.watch`	list of strings	[]	Directories to monitor
`file_system.track`	list of strings	[]	`writes`, `reads`, `deletes`

Audit events stream to <workspace>/.keystone/audit.jsonl and to the /v1/sandboxes/:id/events SSE endpoint. See Network & Audit.
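Because the log is JSON Lines, it can be summarised with the standard library alone. The event schema is not described on this page, so the `type` key used below is a hypothetical field name chosen for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str], key: str = "type") -> Counter:
    """Count audit events by the value of `key`.

    ASSUMPTION: each non-empty line is one JSON object; events that lack
    `key` are counted under "unknown".
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[str(event.get(key, "unknown"))] += 1
    return counts
```

Any iterable of lines works, so an open file handle for `.keystone/audit.jsonl` can be passed directly.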

`snapshots` — world-state checkpointing

snapshots:
  before_run: true              # always capture before agent starts
  checkpoints: per_action       # snapshot after each agent tool call
  retain_on: [failure]          # only persist snapshots for failed runs

Field	Type	Default	Notes
`before_run`	bool	true	Snapshot taken once when sandbox enters `ready`
`checkpoints`	string	`none`	`none` or `per_action` (after every tool call)
`retain_on`	list	`[failure]`	When to persist — `[failure]`, `[always]`, or `[]`

Snapshots enable sandboxes.diff() (what did the agent change?) and the per-action replay in the dashboard's trace viewer. They are stored under <workspace>/.keystone/snapshots/ until the sandbox is destroyed.

`invariants` — pass/fail checks

The most important block. Eight check types:

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true                 # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"
      exit_code: 0
 
  output_correct:
    description: "Output file contains expected data"
    weight: 0.5
    check:
      type: file_content
      path: output.json
      contains: '"status": "success"'
      not_contains: "ERROR"
 
  no_orphan_orders:
    description: "Every order has a matching customer"
    weight: 0.3
    gate: true
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM orders WHERE customer_id NOT IN (SELECT id FROM customers)"
      equals: 0
 
  email_sent:
    description: "Exactly one email to finance"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  custom_check:
    description: "Custom Python validation"
    weight: 0.2
    check:
      type: custom
      script: checks/validate.py
      runs_in: host           # host (default) or sandbox
 
  code_quality:
    description: "Code is well-structured"
    weight: 0.4
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Is the diff minimal and idiomatic?"
      input_from: workspace
      rubric:
        pass: "Clean, minimal change"
        fail: "Over-engineered or invasive"
      pass_threshold: 0.7

Each invariant has:

Field	Type	Required	Notes
`description`	string	yes	Human-readable
`weight`	float	no (default 1.0)	Relative importance
`gate`	bool	no (default false)	If true, scenario fails when this fails
`check`	object	yes	The actual check (type-specific)

The full reference is in Invariants.

`check.type` — full table

Type	Asks
`command_exit`	Does the command return the expected exit code?
`file_exists`	Does this file exist?
`file_absent`	Is this file missing?
`file_content`	Does this file contain (or not contain) a string or regex?
`sql`	Does this SQL query return the expected value?
`http_mock_assertions`	Was this mock service called the right way?
`custom`	Did your Python script return `passed: true`?
`llm_as_judge`	Did the model rate the output above the threshold?

`forbidden` — trajectory constraints

Things the agent must NOT do. Backed by the audit log; any violation auto-fails the scenario.

forbidden:
  db_writes_outside: [users, orders, audit_log]
  http_except: [stripe-mock, smtp]
  secrets_in_logs: deny
  file_writes_outside: [src/, output/, .keystone/]

See Forbidden Rules.

`scoring` — composite + replica aggregation

scoring:
  pass_threshold: 0.9          # composite must be >= 0.9 to pass
  replica_aggregation:
    strategy: majority          # all_must_pass | majority | percentage
    min_pass_rate: 0.85         # only used by the percentage strategy

Field	Type	Default	Notes
`pass_threshold`	float in `[0, 1]`	required	Composite minimum
`replica_aggregation.strategy`	string	`all_must_pass`	How replica statuses combine
`replica_aggregation.min_pass_rate`	float	0.5	For `percentage` strategy

Composite computation

For each scenario:

Each invariant produces a score in [0, 1].
Composite = sum(weight × score) / sum(weight).
Any gate failure → composite forced to 0.
Any forbidden violation → composite forced to 0.
composite >= pass_threshold → status pass. Else fail.
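The steps above can be sketched as a short function. The names here are hypothetical, and treating a zero total weight as a composite of 0 is an assumption; everything else follows the rules as listed.

```python
from dataclasses import dataclass


@dataclass
class InvariantResult:
    score: float          # in [0, 1]
    passed: bool          # whether the check itself passed
    weight: float = 1.0
    gate: bool = False


def composite_score(results: list[InvariantResult], forbidden_violations: int = 0) -> float:
    """Weighted mean of invariant scores, forced to 0 by gates or violations."""
    if forbidden_violations > 0:
        return 0.0
    if any(r.gate and not r.passed for r in results):
        return 0.0
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0  # ASSUMPTION: behaviour for all-zero weights is not documented
    return sum(r.weight * r.score for r in results) / total_weight


def scenario_status(composite: float, pass_threshold: float) -> str:
    return "pass" if composite >= pass_threshold else "fail"
```

For example, weights 0.5, 0.3, and 0.2 with scores 1.0, 1.0, and 0.8 give (0.5 + 0.3 + 0.16) / 1.0 = 0.96, which passes a 0.95 threshold; if the 0.5-weight gate fails instead, the composite is 0 regardless of the other scores.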

Replica aggregation strategies

Strategy	Verdict
`all_must_pass`	Every replica must pass. Otherwise `fail`.
`majority`	More than 50% pass → `pass`; else `fail` (ties → `flaky`)
`percentage`	At least `min_pass_rate` fraction pass → `pass`. Otherwise check spread → `flaky` if mixed, else `fail`

flaky is its own status — distinct from fail — for "passed sometimes, failed sometimes" cases that you typically want to investigate, not gate on.
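The three strategies can be written out as one function to make the edge cases explicit. This is a sketch of the table above, not the server's code; `aggregate` is a hypothetical name, and "mixed" is read as "at least one pass and at least one non-pass".

```python
def aggregate(statuses: list[str], strategy: str = "all_must_pass", min_pass_rate: float = 0.5) -> str:
    """Combine per-replica statuses ("pass" or "fail") into one verdict."""
    if not statuses:
        raise ValueError("at least one replica status is required")
    total = len(statuses)
    passes = sum(1 for s in statuses if s == "pass")

    if strategy == "all_must_pass":
        return "pass" if passes == total else "fail"
    if strategy == "majority":
        if 2 * passes > total:
            return "pass"
        return "flaky" if 2 * passes == total else "fail"   # exact tie -> flaky
    if strategy == "percentage":
        if passes / total >= min_pass_rate:
            return "pass"
        return "flaky" if 0 < passes < total else "fail"    # mixed -> flaky
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Nine passes out of ten clears a `min_pass_rate` of 0.85 under `percentage`, but the same replicas fail under `all_must_pass`.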

`parallelism` — replicas and matrix

parallelism:
  replicas: 10                 # runs PER matrix entry
  isolation: per_run           # per_run = own sandbox | shared = reuse
  matrix:
    - { input_size: small, locale: en_US }
    - { input_size: large, locale: en_US }
    - { input_size: small, locale: ja_JP }

Field	Type	Default	Notes
`replicas`	int	1	Runs per matrix entry
`isolation`	string	`per_run`	`per_run` (own sandbox) or `shared` (reuse)
`matrix`	list of maps	[]	Parameter combinations

replicas: 10 × 3 matrix entries = 30 total scenarios.
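The multiplication can be illustrated by enumerating the scenarios. The ordering (matrix entry first, then replica) and the way IDs are assigned are assumptions; only the `scenario-000` ID format appears elsewhere on this page.

```python
from typing import Optional


def expand_scenarios(replicas: int = 1, matrix: Optional[list[dict]] = None) -> list[dict]:
    """List one scenario per (matrix entry, replica) pair."""
    entries = matrix or [{}]   # ASSUMPTION: an empty matrix means one unparameterised entry
    scenarios: list[dict] = []
    for entry in entries:
        for replica in range(replicas):
            scenarios.append({
                "scenario_id": f"scenario-{len(scenarios):03d}",
                "matrix": entry,
                "replica": replica,
            })
    return scenarios
```

Calling it with `replicas=10` and the three matrix entries above yields 30 scenarios.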

Matrix interpolation

Matrix values are available as {{ matrix.<key> }} in any string field:

parallelism:
  matrix:
    - { model: claude-sonnet-4-5 }
    - { model: gpt-4o }
 
agent:
  type: cli
  binary: my-agent
  args: ["--model", "{{ matrix.model }}"]

Isolation: `per_run` vs `shared`

per_run (default) — every scenario gets its own sandbox + services. Slow but accurate.
shared — sandboxes are reused across scenarios. Fast but state can leak.

Use per_run for any real measurement. Use shared only for cheap smoke tests where state pollution is acceptable.

`determinism` — pin sources of non-determinism

determinism:
  clock: "2026-01-01T00:00:00Z"   # frozen time (libfaketime)
  seed: 42                         # PRNG seed
  network_latency: 0ms              # injected latency
  dns: static                       # static or live

See Determinism.

`retention` — artifact lifetime

retention:
  audit_logs: 24h
  snapshots: 7d
  teardown_exports: 30d
  traces: 30d

Field	Default
`audit_logs`	24h
`snapshots`	7d
`teardown_exports`	30d (Free), 6mo (Pro), 2y (Enterprise)
`traces`	30d (Free), 6mo (Pro), 2y (Enterprise)

`teardown` — export artifacts after run

teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: snapshot
      to: "results/{{ run_id }}/final_state.snap"
    - type: mock_requests
      service: stripe-mock
      to: "results/{{ run_id }}/stripe-calls.jsonl"

Field	Type	Notes
`always_run`	bool	Run teardown even on agent failure
`export[].type`	string	`audit_log`, `db_dump`, `snapshot`, `mock_requests`
`export[].service`	string	(for `db_dump`/`mock_requests`) which service
`export[].to`	string	Destination path; `{{ run_id }}` and `{{ scenario_id }}` interpolated

Destination paths are relative to the artifact store the server is configured for (S3, local FS).

Templating reference

These template variables are available in every string field of the spec:

Variable	Resolves to
`{{ secrets.NAME }}`	The resolved value of secret `NAME`
`{{ matrix.KEY }}`	The matrix entry's value for `KEY`
`{{ task.prompt }}`	The task prompt
`{{ task.context.KEY }}`	A `task.context` value
`{{ sandbox.path }}`	The workspace path inside the sandbox
`{{ sandbox.url }}`	The Keystone API URL for this sandbox
`{{ sandbox.trace_path }}`	Path to the trace file inside the sandbox
`{{ run_id }}`	The current scenario run ID (scoped to one replica)
`{{ scenario_id }}`	The scenario ID (`scenario-000`, etc.)
`{{ determinism.seed }}`	The seed for this scenario
`{{ determinism.clock }}`	The frozen clock value

Templates work in: setup.commands, setup.files[].content, setup.env, services[].env, agent.args, agent.endpoint, agent.input_template, fixtures[].seed, teardown.export[].to.
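To show what interpolation amounts to, here is a minimal renderer for dotted placeholders. It is a sketch, not Keystone's template engine: `render` is a hypothetical name, unknown variables raise `KeyError`, and expressions with filters (such as `| tojson`) are deliberately left untouched.

```python
import re

# Matches {{ name }} or {{ dotted.name }} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace {{ a.b }} placeholders with values looked up in nested dicts."""
    def lookup(match: re.Match) -> str:
        value = variables
        for part in match.group(1).split("."):
            value = value[part]   # KeyError if the variable is not defined
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)
```

For instance, `render("results/{{ run_id }}/audit.jsonl", {"run_id": "r1"})` produces a per-run export path, and a nested mapping such as `{"secrets": {"API_KEY": "abc"}}` satisfies `{{ secrets.API_KEY }}`.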

A complete real-world spec

Here's a full spec from the open-source examples — agent reconciles two databases, sends a summary email, must avoid forbidden DB writes:

version: 1
id: reconciliation-scenario
description: "Agent reconciles two customer databases and emails a summary"
 
task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co
  context:
    ticket_url: "https://linear.app/co/issue/ENG-1234"
    success_criteria: "All rows match after reconciliation, exactly one summary email sent"
 
base: "ubuntu:24.04"
 
setup:
  packages: [nodejs, npm, git, python3]
  commands:
    - "npm install -g typescript"
  env:
    NODE_ENV: "test"
    DATABASE_URL: "postgres://postgres:{{ secrets.DB_PASSWORD }}@db:5432/testdb"
 
resources:
  timeout: "10m"
  memory: "2Gi"
  cpu: 2
  concurrency_limit: 5
 
fixtures:
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers_a (id INT PRIMARY KEY, email TEXT, name TEXT);
      CREATE TABLE customers_b (id INT PRIMARY KEY, email TEXT, name TEXT);
      INSERT INTO customers_a VALUES
        (1, 'a@co', 'Alice'), (2, 'b@co', 'Ben'), (3, 'c@co', 'Carol');
      INSERT INTO customers_b SELECT * FROM customers_a;
  - type: drift
    target: db.customers_a
    strategy: random_mismatches
    count: 15
    seed: "{{ determinism.seed }}"
 
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready"
  - name: smtp
    image: mailhog/mailhog
    ports: [1025, 8025]
  - name: mock_api
    type: http_mock
    record: true
    default_response: 404
    routes:
      - method: GET
        path: /v1/customers
        response: '{"customers": []}'
        status: 200
 
secrets:
  - name: DB_PASSWORD
    from: generated
    scope: env
 
network:
  egress:
    default: deny
    allow: ["*.services.internal"]
  dns_overrides:
    api.stripe.com: mock_api.services.internal
 
audit:
  db_writes: true
  http_calls: true
  file_system:
    watch: [/workspace]
    track: [writes, deletes]
  process_spawns: true
  stdout_capture: true
 
snapshots:
  before_run: true
  checkpoints: per_action
  retain_on: [failure]
 
agent:
  type: paragon
  model: paragon-max
  args: ["--yolo", "--quiet"]
  timeout: 5m
 
invariants:
  databases_match:
    description: "No drift between customers_a and customers_b"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0
 
  email_sent:
    description: "Exactly one summary email to finance"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  code_quality:
    description: "Solution is clean and idiomatic"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Is the reconciliation code clean and well-structured?"
      input_from: workspace
      rubric:
        pass: "Clean, handles edge cases"
        fail: "Messy, fragile, or incomplete"
      temperature: 0
 
forbidden:
  db_writes_outside: [customers_a, customers_b, reconciliation_log]
  http_except: [smtp, mock_api]
  secrets_in_logs: deny
  file_writes_outside: [/workspace/src, /workspace/logs]
 
scoring:
  pass_threshold: 0.95
  replica_aggregation:
    strategy: all_must_pass
    min_pass_rate: 0.90        # only used by the percentage strategy
 
parallelism:
  replicas: 3
  isolation: per_run
  matrix:
    - { input_size: small, locale: en_US }
    - { input_size: large, locale: en_US }
 
determinism:
  clock: "2026-01-01T00:00:00Z"
  seed: 42
  network_latency: 0ms
  dns: static
 
retention:
  audit_logs: 24h
  snapshots: 7d
  teardown_exports: 30d
  traces: 30d
 
teardown:
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: mock_requests
      service: mock_api
      to: "results/{{ run_id }}/mock.jsonl"
  always_run: true

This single spec covers every block. Use it as a copy-paste template and remove what you don't need.

Validation

Specs are validated server-side at upload time. Common errors:

Error	Cause
`version: must be 1`	Missing or wrong `version:`
`id: must be kebab-case`	Spaces, underscores, or capitals in `id:`
`task.prompt: required`	Missing prompt
`invariants: must have at least one`	Empty invariants block
`scoring.pass_threshold: out of range`	Not in `[0, 1]`
`services[*].name: duplicate`	Two services with the same name
`fixtures[*].service: not found`	Service name doesn't match
`agent.type: unknown`	Not one of the six supported types
`secrets[*].name: not in scope`	Referenced via `{{ secrets.X }}` but not declared

Run ks specs validate path/to/spec.yaml to validate locally before uploading. ks eval run also validates the spec implicitly.
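Several of the errors in the table are simple structural rules, which the following sketch reproduces for an already-parsed spec (a plain dict). It is an illustration of the rules only, not a substitute for the real validator; `precheck` and the kebab-case regex are assumptions.

```python
import re

AGENT_TYPES = {"paragon", "cli", "http", "python", "image", "snapshot"}
# ASSUMPTION: kebab-case means lowercase alphanumeric groups joined by single hyphens.
KEBAB = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def precheck(spec: dict) -> list[str]:
    """Return the subset of validation errors that can be detected structurally."""
    errors = []
    if spec.get("version") != 1:
        errors.append("version: must be 1")
    if not KEBAB.fullmatch(str(spec.get("id", ""))):
        errors.append("id: must be kebab-case")
    if not (spec.get("task") or {}).get("prompt"):
        errors.append("task.prompt: required")
    if not spec.get("invariants"):
        errors.append("invariants: must have at least one")
    threshold = (spec.get("scoring") or {}).get("pass_threshold")
    if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
        errors.append("scoring.pass_threshold: out of range")
    names = [s.get("name") for s in spec.get("services") or []]
    if len(names) != len(set(names)):
        errors.append("services[*].name: duplicate")
    if (spec.get("agent") or {}).get("type") not in AGENT_TYPES:
        errors.append("agent.type: unknown")
    return errors
```

An empty list means none of these structural rules were violated; cross-reference errors such as undeclared secrets are not covered here.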
