Spec Reference

Every section of a Keystone spec — fields, defaults, types, and worked examples for each block.

A Keystone spec is a YAML file that describes everything about one scenario: the environment to spin up, the task the agent should do, the credentials it needs, the policies it operates under, and how to score the result. This page is the complete field-by-field reference. For an introduction to the concepts, start with Concepts; for end-to-end examples, see Examples.

The minimum viable spec

Seven required fields (version, id, base, task, agent, invariants, scoring) are enough to run a hello-world:

version: 1
id: hello-world
description: "Smoke test — does the agent write a file?"
 
base: "ubuntu:24.04"
 
task:
  prompt: "Create a file called hello.txt with the text 'Hello, World!' inside."
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  hello_file_exists:
    description: "hello.txt was created"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: hello.txt
 
  hello_file_correct:
    description: "hello.txt contains the right text"
    weight: 1.0
    check:
      type: file_content
      path: hello.txt
      contains: "Hello, World!"
 
scoring:
  pass_threshold: 1.0

Run with ks eval run hello.yaml. Every other section in the spec is optional — add what you need.

Top-level fields

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| version | int | yes | | Always 1 for now |
| id | string | yes | | Kebab-case ID. Re-uploading the same id increments version |
| description | string | no | "" | Human-readable description |
| extends | string | no | "" | Path to a base spec to inherit from (v1.1 future) |
| task | object | yes | | What the agent should do |
| base | string | yes | | Base Docker image |
| agent | object | yes | | How the agent gets invoked |
| invariants | map | yes | | Pass/fail checks |
| scoring | object | yes | | Composite scoring + replica aggregation |
| setup | object | no | | Packages, commands, files, env vars |
| resources | object | no | (see below) | CPU/memory/disk/timeout |
| fixtures | list | no | [] | Seed data (git repos, SQL, files) |
| services | list | no | [] | Backing containers (DB, cache, mocks) |
| secrets | list | no | [] | Credentials |
| network | object | no | (see below) | Egress/ingress rules, DNS overrides |
| audit | object | no | (see below) | Structured capture of agent activity |
| snapshots | object | no | (see below) | World-state checkpointing |
| forbidden | object | no | | Trajectory constraints |
| parallelism | object | no | (see below) | Replicas + matrix |
| determinism | object | no | | Clock/seed/DNS pinning |
| retention | object | no | (see below) | How long to keep artifacts |
| teardown | object | no | | Export actions after run |

The next sections walk through every block.


task — what the agent should do

task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co.
  context:
    ticket_url: "https://linear.app/co/issue/ENG-1234"
    success_criteria: "All rows match, exactly one summary email sent"
    repo: "acme/backend"
    language: "typescript"

| Field | Type | Required | Notes |
|---|---|---|---|
| prompt | string | yes | The instruction the agent receives, verbatim |
| context | map of strings | no | Extra structured info passed alongside the prompt |

context is free-form — your agent decides how to use it. Common keys:

  • success_criteria — what counts as done (also used in some LLM judges)
  • ticket_url — link to the originating ticket
  • repo / language / framework — type hints
  • style_guide — link or inline guidance

The prompt + context are passed to the agent via stdin (for cli, paragon, python agents) or in the request body (for http agents).

Best practices

  • Be specific. "Fix the bug" is too vague. "Fix the failing test in src/api.test.ts so npm test returns exit 0 without modifying src/api.ts" is testable.
  • Restate the invariants in plain English. If an invariant requires a file at output.json, say so in the prompt. The model cannot read the spec; it sees only the prompt and context.
  • Bound the work. "Refactor the codebase" is unbounded; "Refactor the users.py module to use async SQLAlchemy" is bounded.

base — the base Docker image

base: "ubuntu:24.04"

Any Docker image works. Common choices:

| Image | When to use |
|---|---|
| ubuntu:24.04 | General Linux, install whatever you need in setup |
| node:20 / node:22 | Node/TS agents and tasks |
| python:3.12 / python:3.11 | Python agents and tasks |
| golang:1.22 | Go agents and tasks |
| mcr.microsoft.com/playwright | Browser automation |
| your-registry.io/your-image:tag | Pre-baked custom image |

Custom images need to be reachable by the Keystone server (Hub, GHCR, ECR, GAR, or a registry the server has credentials for).

base: is a string — there is no nested object form. To customize the environment beyond the bare image, use the setup: block (packages, commands, files, env).


setup — environment prep

setup:
  packages: [nodejs, npm, python3, git, postgresql-client]
  commands:
    - "npm install -g typescript"
    - "pip install pytest"
  files:
    - path: ".env"
      content: |
        DATABASE_URL=postgres://postgres:test@db:5432/testdb
        NODE_ENV=test
    - path: "tsconfig.json"
      content: '{"compilerOptions": {"strict": true}}'
  env:
    CI: "true"
    NODE_ENV: "test"

| Field | Type | Notes |
|---|---|---|
| packages | list of strings | apt/apk packages installed before any commands run |
| commands | list of strings | Shell commands run sequentially in the workspace |
| files | list of {path, content, template?} | Files written into the workspace |
| env | map | Env vars set for the agent (and for setup.commands) |

Order: packages are installed first, then files are written, then commands run. If any command fails, the sandbox fails to boot.

Setup vs. fixtures: setup is for static prep (install Node, write tsconfig). Fixtures are for seed data (clone a repo, load SQL). The two run at different boot stages — see Fixtures.

setup.files — templating

setup:
  files:
    - path: "config.json"
      template: '{"key": "{{ secrets.API_KEY }}"}'

template interpolates {{ secrets.NAME }} placeholders before writing. Use template: instead of content: when you need secret values inside the file. Otherwise prefer content:.


agent — how the agent gets invoked

Six agent types ship with Keystone. Pick whichever fits how your agent is packaged.

type: paragon — Polarity's built-in agent

agent:
  type: paragon
  model: paragon-fast      # or paragon-max, claude-sonnet-4-5, gpt-4o, etc.
  args: ["--yolo", "--quiet"]
  timeout: 5m
  env:
    PARAGON_TOOL_TRACE_FILE: "{{ sandbox.trace_path }}"

The Paragon CLI runs inside the sandbox with the task prompt on stdin. Default for "I just want a coding agent."

type: cli — any binary

agent:
  type: cli
  binary: /workspace/my-agent
  args: ["--task", "{{ task.prompt }}", "--workspace", "{{ sandbox.path }}"]
  timeout: 5m
  env:
    LOG_LEVEL: "info"

Runs an executable. Template variables: {{ task.prompt }}, {{ task.context.<key> }}, {{ sandbox.path }}.

type: http — call an HTTP endpoint

agent:
  type: http
  endpoint: "https://my-agent.internal/run"
  auth:
    bearer: "{{ secrets.AGENT_TOKEN }}"
  input_template: |
    {
      "task": "{{ task.prompt }}",
      "workspace": "{{ sandbox.url }}",
      "context": {{ task.context | tojson }}
    }
  timeout: 5m

POSTs the rendered input_template to endpoint. Use this when your agent runs as a service outside the sandbox.

type: python — Python script

agent:
  type: python
  binary: agent.py
  args: ["--mode", "eval"]
  timeout: 5m

Runs python3 agent.py [args] with the task as JSON on stdin.
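
A minimal sketch of how such an agent.py might read its input. The source only says the task arrives as JSON on stdin; the prompt and context key names below are assumed to mirror the task block, and read_task and main are hypothetical helpers.

```python
import json
import sys


def read_task(raw):
    """Parse the task payload; the 'prompt' and 'context' keys are assumptions."""
    payload = json.loads(raw)
    return payload["prompt"], payload.get("context", {})


def main(stream=sys.stdin):
    prompt, context = read_task(stream.read())
    # A real agent would act on the prompt here; this one only reports it.
    print(f"received task: {prompt!r} with {len(context)} context keys")


# In agent.py, finish with a call to main().
```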

type: image — Docker image

agent:
  type: image
  image: "your-registry/your-agent:v3"
  entrypoint: ["python", "main.py"]   # optional override
  timeout: 5m
  env:
    LOG_LEVEL: "debug"

Pulls the image and runs it on the sandbox's service network. Best for reproducibility.

type: snapshot — uploaded agent

agent:
  type: snapshot
  snapshot: my-agent              # latest version
  # or pin a specific version:
  # snapshot_id: snap_abc123
  entrypoint: ["python", "main.py"]   # optional override
  timeout: 5m

Resolves to an agent snapshot you uploaded via ks.agents.upload(). Best for tracking which agent version produced which results.

Common fields across all agent types

| Field | Type | Default | Notes |
|---|---|---|---|
| type | string | required | paragon, cli, http, python, image, snapshot |
| timeout | duration | 5m | Max wall time before agent is killed |
| env | map | | Env vars for the agent process |

resources — compute & timeout limits

resources:
  timeout: "10m"
  memory: "4Gi"
  cpu: 4
  disk: "20Gi"
  desktop: false               # true = XFCE desktop with browser available
  concurrency_limit: 5         # max sandboxes for this spec at once

| Field | Type | Default | Notes |
|---|---|---|---|
| timeout | duration | 10m | Total sandbox lifecycle (setup + agent + scoring) |
| memory | string | 2Gi | Memory limit (Ki/Mi/Gi) |
| cpu | int | 2 | CPU cores |
| disk | string | 10Gi | Disk quota |
| desktop | bool | false | Spin up XFCE for browser-using agents |
| concurrency_limit | int | tenant default | Max parallel sandboxes for this spec |

The timeout here covers the entire sandbox lifecycle, not just the agent. If your agent has a 5-minute timeout and your invariants need another 2 minutes, set resources.timeout: 8m (or more, for safety).
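
The budget arithmetic above can be checked with a few lines of Python. This is an illustrative sketch, not part of Keystone: to_seconds and budget_fits are hypothetical helpers, and only the s, m, and h suffixes are handled.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration):
    """Convert a duration such as '5m' to seconds (only s, m, h handled here)."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def budget_fits(sandbox_timeout, phases):
    """True when the sandbox timeout covers the sum of the phase durations."""
    return to_seconds(sandbox_timeout) >= sum(to_seconds(p) for p in phases)


# 5m agent + 2m invariants = 7m, so 8m leaves one minute of slack for setup.
print(budget_fits("8m", ["5m", "2m"]))  # True
print(budget_fits("6m", ["5m", "2m"]))  # False
```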


fixtures — seed data

Fixtures load data into the sandbox after services boot but before the agent runs. Four types: git_repo, sql, directory, drift. See Fixtures for the deep dive.

fixtures:
  # Clone a git repository into /workspace
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "main"
    depth: 1
    path: /workspace
 
  # Run inline SQL against a service (preferred over path:)
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers (
        email TEXT PRIMARY KEY,
        name  TEXT NOT NULL
      );
      INSERT INTO customers VALUES
        ('alice@example.com', 'Alice'),
        ('ben@example.com',   'Ben');
 
  # Copy a directory from the Keystone server's filesystem
  - type: directory
    source: /var/keystone/seeds/test-data
    target: data/
 
  # Inject random data corruption for adversarial testing
  - type: drift
    target: db.customers
    strategy: random_mismatches      # or random_nulls, duplicate_rows
    count: 15
    seed: "{{ determinism.seed }}"

Fixtures run in declaration order. A failure aborts sandbox boot.


services — backing containers

services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready -U postgres"
 
  - name: cache
    image: redis:7
    ports: [6379]
    wait_for: "redis-cli ping"
 
  - name: smtp
    image: mailhog/mailhog
    ports: [1025, 8025]    # SMTP + HTTP UI
 
  - name: stripe-mock
    type: http_mock
    record: true
    default_response: 404
    routes:
      - method: POST
        path: /v1/charge
        response: '{"id":"ch_test","status":"succeeded"}'
        status: 200
      - method: ANY
        path: "/v1/webhooks/.*"   # regex match
        response: '{"ok": true}'

The full reference is in Services. Service field summary:

| Field | Type | Required | Notes |
|---|---|---|---|
| name | string | yes | DNS alias on the shared network |
| image | string | yes (unless type: http_mock) | Any Docker image |
| type | string | no | Empty (default) or http_mock for built-in mocks |
| env | map | no | Container env. Supports {{ secrets.X }} |
| ports | list of ints | no | Container ports exposed on the network |
| wait_for | string | no | Shell command run inside; gates readiness |
| record | bool | no | (http_mock only) capture all requests |
| default_response | int | no | (http_mock) status for unmatched routes |
| routes | list | no | (http_mock) per-method/path responders |

Services are reachable from the agent by name: postgres://db:5432, redis://cache:6379, http://stripe-mock:9090. Auto-injected env vars: KEYSTONE_SERVICE_<NAME>_HOST, KEYSTONE_SERVICE_<NAME>_PORT.
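
From inside the sandbox, agent code could read the auto-injected variables along these lines. This is a sketch under one stated assumption: that <NAME> is the service name upper-cased. The source does not say how names containing hyphens (such as stripe-mock) are mapped, so the example sticks to db. service_address is a hypothetical helper.

```python
import os


def service_address(name, env=os.environ):
    """Build 'host:port' from the auto-injected variables for one service.

    Assumption: <NAME> is the service name upper-cased.
    """
    key = name.upper()
    host = env[f"KEYSTONE_SERVICE_{key}_HOST"]
    port = env[f"KEYSTONE_SERVICE_{key}_PORT"]
    return f"{host}:{port}"


# Example with a hand-built environment instead of a live sandbox.
fake_env = {"KEYSTONE_SERVICE_DB_HOST": "db", "KEYSTONE_SERVICE_DB_PORT": "5432"}
print(service_address("db", fake_env))  # db:5432
```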


secrets — credentials

secrets:
  - name: ANTHROPIC_API_KEY
    source: env                  # default; pulls $ANTHROPIC_API_KEY from caller's env
 
  - name: DB_PASSWORD
    source: env:LOCAL_DB_PASS    # rename — pull $LOCAL_DB_PASS, expose as DB_PASSWORD
 
  - name: GCP_KEY
    source: "file:~/.config/gcloud/sa.json"
 
  - name: VAULT_TOKEN
    source: 'command:vault token lookup -field=id'
 
  - name: STRIPE_LIVE_KEY
    source: dashboard            # server-side only; SDK refuses local override
 
  - name: TEST_FIXTURE_TOKEN
    from: "static://fake-test-value"   # spec literal; deterministic
 
  - name: GENERATED_DB_PASSWORD
    from: generated              # random per-run

| Field | Type | Notes |
|---|---|---|
| name | string | Env var name inside the sandbox |
| source | string | env, env:OTHER, file:..., command:..., dashboard |
| from | string | static://<value> or generated (mutually exclusive with source) |
| scope | string or {env, file_template} | Where the secret is exposed (default: env) |

Precedence: from: > source: (resolved locally) > Dashboard. A secret that resolves to nothing fails the sandbox boot. See Secrets for the full deep dive.
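
To make the source: forms concrete, here is a rough sketch of how the strings listed above could be split into a kind and an argument. It is not Keystone's resolver; parse_source is a hypothetical helper, and it only classifies the string without reading any secret.

```python
def parse_source(name, source="env"):
    """Split a secrets[].source string into (kind, argument)."""
    kind, _, arg = source.partition(":")
    if kind == "env":
        # Bare 'env' reads the variable with the same name as the secret.
        return "env", arg or name
    if kind in ("file", "command"):
        return kind, arg
    if kind == "dashboard":
        return "dashboard", None
    raise ValueError(f"unknown source: {source!r}")


print(parse_source("ANTHROPIC_API_KEY"))                # ('env', 'ANTHROPIC_API_KEY')
print(parse_source("DB_PASSWORD", "env:LOCAL_DB_PASS"))  # ('env', 'LOCAL_DB_PASS')
```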

Scope examples

secrets:
  # Default: env var only
  - name: API_KEY
    source: env
 
  # Env var AND templated into a config file
  - name: STRIPE_KEY
    from: "static://sk_test_xxx"
    scope:
      env: true
      file_template: "config/stripe.json"

file_template: is templated at sandbox boot — {{ STRIPE_KEY }} placeholders in the named file are replaced with the resolved value.


network — egress, ingress, DNS overrides

network:
  egress:
    default: deny           # block all outbound by default
    allow:
      - registry.npmjs.org
      - github.com
      - "*.services.internal"  # internal service network
  ingress:
    default: deny
    allow:
      - from: host
        to_port: 3000
  dns_overrides:
    api.stripe.com: stripe-mock.services.internal
    smtp.sendgrid.net: smtp.services.internal

| Section | What it controls |
|---|---|
| egress.default | deny (block all outbound) or allow (let everything through) |
| egress.allow | Hostnames or globs that bypass the default |
| ingress.default | Inbound from outside the sandbox |
| ingress.allow | Specific inbound rules (e.g., from host to port) |
| dns_overrides | Map real hostnames to in-sandbox services |

Default-deny + DNS overrides is the canonical pattern: the agent's existing code calls https://api.stripe.com/..., but inside the sandbox that resolves to your stripe-mock container. See Network & Audit.
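
As a mental model for the egress rules, the following sketch evaluates a hostname against a default plus an allow list. The glob semantics are an assumption: it uses Python's fnmatch, where * also matches dots, which may differ from how Keystone matches patterns. egress_allowed is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def egress_allowed(host, default, allow):
    """Decide whether an outbound connection to host would be permitted."""
    host = host.lower()
    if any(fnmatchcase(host, pattern) for pattern in allow):
        return True
    # No allow entry matched, so the default decides.
    return default == "allow"


allow = ["registry.npmjs.org", "github.com", "*.services.internal"]
print(egress_allowed("db.services.internal", "deny", allow))  # True
print(egress_allowed("api.stripe.com", "deny", allow))        # False
```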


audit — capture activity

audit:
  db_writes: true              # log every INSERT/UPDATE/DELETE
  http_calls: true             # log every outbound HTTP
  process_spawns: true         # log every child process
  stdout_capture: true         # capture stdout for secret detection
  file_system:
    watch: ["src/", "config/"]
    track: [writes, reads, deletes]

| Field | Type | Default | Notes |
|---|---|---|---|
| db_writes | bool | false | Required for forbidden.db_writes_outside |
| http_calls | bool | false | Required for forbidden.http_except |
| process_spawns | bool | false | Logs every child process exec |
| stdout_capture | bool | false | Required for forbidden.secrets_in_logs |
| file_system.watch | list of strings | [] | Directories to monitor |
| file_system.track | list of strings | [] | writes, reads, deletes |

Audit events stream to <workspace>/.keystone/audit.jsonl and to the /v1/sandboxes/:id/events SSE endpoint. See Network & Audit.


snapshots — world-state checkpointing

snapshots:
  before_run: true              # always capture before agent starts
  checkpoints: per_action       # snapshot after each agent tool call
  retain_on: [failure]          # only persist snapshots for failed runs

| Field | Type | Default | Notes |
|---|---|---|---|
| before_run | bool | true | Snapshot taken once when sandbox enters ready |
| checkpoints | string | none | none or per_action (after every tool call) |
| retain_on | list | [failure] | When to persist: [failure], [always], or [] |

Snapshots enable sandboxes.diff() (what did the agent change?) and the per-action replay in the dashboard's trace viewer. They are stored under <workspace>/.keystone/snapshots/ until the sandbox is destroyed.


invariants — pass/fail checks

The most important block. Eight check types:

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true                 # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"
      exit_code: 0
 
  output_correct:
    description: "Output file contains expected data"
    weight: 0.5
    check:
      type: file_content
      path: output.json
      contains: '"status": "success"'
      not_contains: "ERROR"
 
  no_orphan_orders:
    description: "Every order has a matching customer"
    weight: 0.3
    gate: true
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM orders WHERE customer_id NOT IN (SELECT id FROM customers)"
      equals: 0
 
  email_sent:
    description: "Exactly one email to finance"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  custom_check:
    description: "Custom Python validation"
    weight: 0.2
    check:
      type: custom
      script: checks/validate.py
      runs_in: host           # host (default) or sandbox
 
  code_quality:
    description: "Code is well-structured"
    weight: 0.4
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Is the diff minimal and idiomatic?"
      input_from: workspace
      rubric:
        pass: "Clean, minimal change"
        fail: "Over-engineered or invasive"
      pass_threshold: 0.7

Each invariant has:

| Field | Type | Required | Notes |
|---|---|---|---|
| description | string | yes | Human-readable |
| weight | float | no (default 1.0) | Relative importance |
| gate | bool | no (default false) | If true, scenario fails when this fails |
| check | object | yes | The actual check (type-specific) |

The full reference is in Invariants.

check.type — full table

| Type | Asks |
|---|---|
| command_exit | Does the command return the expected exit code? |
| file_exists | Does this file exist? |
| file_absent | Is this file missing? |
| file_content | Does this file contain (or not contain) a string or regex? |
| sql | Does this SQL query return the expected value? |
| http_mock_assertions | Was this mock service called the right way? |
| custom | Did your Python script return passed: true? |
| llm_as_judge | Did the model rate the output above the threshold? |

forbidden — trajectory constraints

Things the agent must NOT do. Backed by the audit log; any violation auto-fails the scenario.

forbidden:
  db_writes_outside: [users, orders, audit_log]
  http_except: [stripe-mock, smtp]
  secrets_in_logs: deny
  file_writes_outside: [src/, output/, .keystone/]

See Forbidden Rules.


scoring — composite + replica aggregation

scoring:
  pass_threshold: 0.9          # composite must be >= 0.9 to pass
  replica_aggregation:
    strategy: majority          # all_must_pass | majority | percentage
    min_pass_rate: 0.85         # only consulted by the percentage strategy

| Field | Type | Default | Notes |
|---|---|---|---|
| pass_threshold | float in [0, 1] | required | Composite minimum |
| replica_aggregation.strategy | string | all_must_pass | How replica statuses combine |
| replica_aggregation.min_pass_rate | float | 0.5 | For percentage strategy |

Composite computation

For each scenario:

  1. Each invariant produces a score in [0, 1].
  2. Composite = sum(weight × score) / sum(weight).
  3. Any gate failure → composite forced to 0.
  4. Any forbidden violation → composite forced to 0.
  5. composite >= pass_threshold → status pass. Else fail.
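
The five steps above can be sketched in Python as follows. This is an illustration of the stated rules, not Keystone's scorer: the result record shape (weight, score, passed, gate) and the function names are assumptions made for the example.

```python
def composite_score(results, forbidden_violation=False):
    """Weighted mean of invariant scores, forced to 0 by a gate failure or violation.

    Each result is a dict with 'weight', 'score' in [0, 1], 'passed', and
    optionally 'gate'. This record shape is an assumption for the sketch.
    """
    gate_failed = any(r.get("gate", False) and not r["passed"] for r in results)
    if gate_failed or forbidden_violation:
        return 0.0
    total_weight = sum(r["weight"] for r in results)
    return sum(r["weight"] * r["score"] for r in results) / total_weight


def status(composite, pass_threshold):
    return "pass" if composite >= pass_threshold else "fail"


results = [
    {"weight": 0.5, "score": 1.0, "passed": True, "gate": True},
    {"weight": 0.3, "score": 1.0, "passed": True},
    {"weight": 0.2, "score": 0.8, "passed": True},
]
composite = composite_score(results)  # (0.5 + 0.3 + 0.16) / 1.0, about 0.96
print(status(composite, 0.95))        # pass
```

Because gates and forbidden violations zero the composite outright, a high weighted mean cannot compensate for either.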

Replica aggregation strategies

| Strategy | Verdict |
|---|---|
| all_must_pass | Every replica must pass. Otherwise fail. |
| majority | More than 50% pass → pass; else fail (ties → flaky) |
| percentage | At least min_pass_rate fraction pass → pass. Otherwise check spread → flaky if mixed, else fail |

flaky is its own status — distinct from fail — for "passed sometimes, failed sometimes" cases that you typically want to investigate, not gate on.
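
One way to read the three strategies is the sketch below. It follows the verdict rules as stated, with one labeled assumption: "mixed" is taken to mean at least one pass and at least one fail. aggregate is a hypothetical function, not a Keystone API.

```python
def aggregate(statuses, strategy="all_must_pass", min_pass_rate=0.5):
    """Combine per-replica 'pass'/'fail' statuses into one verdict."""
    total = len(statuses)
    if total == 0:
        raise ValueError("at least one replica status is required")
    passed = sum(1 for s in statuses if s == "pass")
    if strategy == "all_must_pass":
        return "pass" if passed == total else "fail"
    if strategy == "majority":
        if 2 * passed > total:
            return "pass"
        return "flaky" if 2 * passed == total else "fail"
    if strategy == "percentage":
        if passed / total >= min_pass_rate:
            return "pass"
        # Assumption: "mixed" means some replicas passed and some failed.
        return "flaky" if 0 < passed < total else "fail"
    raise ValueError(f"unknown strategy: {strategy!r}")


print(aggregate(["pass"] * 9 + ["fail"], "percentage", 0.85))      # pass
print(aggregate(["pass"] * 8 + ["fail"] * 2, "percentage", 0.85))  # flaky
print(aggregate(["pass", "fail"], "majority"))                     # flaky
```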


parallelism — replicas and matrix

parallelism:
  replicas: 10                 # runs PER matrix entry
  isolation: per_run           # per_run = own sandbox | shared = reuse
  matrix:
    - { input_size: small, locale: en_US }
    - { input_size: large, locale: en_US }
    - { input_size: small, locale: ja_JP }

| Field | Type | Default | Notes |
|---|---|---|---|
| replicas | int | 1 | Runs per matrix entry |
| isolation | string | per_run | per_run (own sandbox) or shared (reuse) |
| matrix | list of maps | [] | Parameter combinations |

replicas: 10 × 3 matrix entries = 30 total scenarios.
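
The expansion can be pictured as a simple cross product. In this sketch, expand is a hypothetical helper, and treating an empty matrix as a single unparameterized entry is an assumption based on the defaults (replicas: 1, matrix: []).

```python
def expand(replicas, matrix):
    """Return one (matrix_entry, replica_index) pair per scenario."""
    # Assumption: an empty matrix behaves like one entry with no parameters.
    entries = matrix or [{}]
    return [(entry, i) for entry in entries for i in range(replicas)]


matrix = [
    {"input_size": "small", "locale": "en_US"},
    {"input_size": "large", "locale": "en_US"},
    {"input_size": "small", "locale": "ja_JP"},
]
print(len(expand(10, matrix)))  # 30
```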

Matrix interpolation

Matrix values are available as {{ matrix.<key> }} in any string field:

parallelism:
  matrix:
    - { model: claude-sonnet-4-5 }
    - { model: gpt-4o }
 
agent:
  type: cli
  binary: my-agent
  args: ["--model", "{{ matrix.model }}"]

Isolation: per_run vs shared

  • per_run (default) — every scenario gets its own sandbox + services. Slow but accurate.
  • shared — sandboxes are reused across scenarios. Fast but state can leak.

Use per_run for any real measurement. Use shared only for cheap smoke tests where state pollution is acceptable.


determinism — pin sources of non-determinism

determinism:
  clock: "2026-01-01T00:00:00Z"   # frozen time (libfaketime)
  seed: 42                         # PRNG seed
  network_latency: 0ms              # injected latency
  dns: static                       # static or live

See Determinism.


retention — artifact lifetime

retention:
  audit_logs: 24h
  snapshots: 7d
  teardown_exports: 30d
  traces: 30d

| Field | Default |
|---|---|
| audit_logs | 24h |
| snapshots | 7d |
| teardown_exports | 30d (Free), 6mo (Pro), 2y (Enterprise) |
| traces | 30d (Free), 6mo (Pro), 2y (Enterprise) |

teardown — export artifacts after run

teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: snapshot
      to: "results/{{ run_id }}/final_state.snap"
    - type: mock_requests
      service: stripe-mock
      to: "results/{{ run_id }}/stripe-calls.jsonl"

| Field | Type | Notes |
|---|---|---|
| always_run | bool | Run teardown even on agent failure |
| export[].type | string | audit_log, db_dump, snapshot, mock_requests |
| export[].service | string | (for db_dump/mock_requests) which service |
| export[].to | string | Destination path; {{ run_id }} and {{ scenario_id }} interpolated |

Destination paths are relative to the artifact store the server is configured for (S3, local FS).


Templating reference

These template variables are available in every string field of the spec:

| Variable | Resolves to |
|---|---|
| {{ secrets.NAME }} | The resolved value of secret NAME |
| {{ matrix.KEY }} | The matrix entry's value for KEY |
| {{ task.prompt }} | The task prompt |
| {{ task.context.KEY }} | A task.context value |
| {{ sandbox.path }} | The workspace path inside the sandbox |
| {{ sandbox.url }} | The Keystone API URL for this sandbox |
| {{ sandbox.trace_path }} | Path to the trace file inside the sandbox |
| {{ run_id }} | The current scenario run ID (scoped to one replica) |
| {{ scenario_id }} | The scenario ID (scenario-000, etc.) |
| {{ determinism.seed }} | The seed for this scenario |
| {{ determinism.clock }} | The frozen clock value |

Templates work in: setup.commands, setup.files[].content, setup.env, services[].env, agent.args, agent.endpoint, agent.input_template, fixtures[].seed, teardown.export[].to.
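
For intuition about how placeholders resolve, here is a small stand-in renderer. It is not Keystone's template engine: it handles only plain {{ dotted.name }} placeholders and leaves anything with a filter (such as | tojson) untouched. render is a hypothetical helper, and the variable values are made up.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, variables):
    """Replace {{ dotted.name }} placeholders by walking nested dicts."""
    def lookup(match):
        value = variables
        for part in match.group(1).split("."):
            value = value[part]  # KeyError signals an undeclared variable
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)


variables = {"secrets": {"API_KEY": "abc123"}, "run_id": "run-1"}
print(render('{"key": "{{ secrets.API_KEY }}"}', variables))  # {"key": "abc123"}
print(render("results/{{ run_id }}/audit.jsonl", variables))  # results/run-1/audit.jsonl
```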


A complete real-world spec

Here's a full spec from the open-source examples — agent reconciles two databases, sends a summary email, must avoid forbidden DB writes:

version: 1
id: reconciliation-scenario
description: "Agent reconciles two customer databases and emails a summary"
 
task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co
  context:
    ticket_url: "https://linear.app/co/issue/ENG-1234"
    success_criteria: "All rows match after reconciliation, exactly one summary email sent"
 
base: "ubuntu:24.04"
 
setup:
  packages: [nodejs, npm, git, python3]
  commands:
    - "npm install -g typescript"
  env:
    NODE_ENV: "test"
    DATABASE_URL: "postgres://postgres:{{ secrets.DB_PASSWORD }}@db:5432/testdb"
 
resources:
  timeout: "10m"
  memory: "2Gi"
  cpu: 2
  concurrency_limit: 5
 
fixtures:
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers_a (id INT PRIMARY KEY, email TEXT, name TEXT);
      CREATE TABLE customers_b (id INT PRIMARY KEY, email TEXT, name TEXT);
      INSERT INTO customers_a VALUES
        (1, 'a@co', 'Alice'), (2, 'b@co', 'Ben'), (3, 'c@co', 'Carol');
      INSERT INTO customers_b SELECT * FROM customers_a;
  - type: drift
    target: db.customers_a
    strategy: random_mismatches
    count: 15
    seed: "{{ determinism.seed }}"
 
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready"
  - name: smtp
    image: mailhog/mailhog
    ports: [1025, 8025]
  - name: mock_api
    type: http_mock
    record: true
    default_response: 404
    routes:
      - method: GET
        path: /v1/customers
        response: '{"customers": []}'
        status: 200
 
secrets:
  - name: DB_PASSWORD
    from: generated
    scope: env
 
network:
  egress:
    default: deny
    allow: ["*.services.internal"]
  dns_overrides:
    api.stripe.com: mock_api.services.internal
 
audit:
  db_writes: true
  http_calls: true
  file_system:
    watch: [/workspace]
    track: [writes, deletes]
  process_spawns: true
  stdout_capture: true
 
snapshots:
  before_run: true
  checkpoints: per_action
  retain_on: [failure]
 
agent:
  type: paragon
  model: paragon-max
  args: ["--yolo", "--quiet"]
  timeout: 5m
 
invariants:
  databases_match:
    description: "No drift between customers_a and customers_b"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0
 
  email_sent:
    description: "Exactly one summary email to finance"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  code_quality:
    description: "Solution is clean and idiomatic"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Is the reconciliation code clean and well-structured?"
      input_from: workspace
      rubric:
        pass: "Clean, handles edge cases"
        fail: "Messy, fragile, or incomplete"
      temperature: 0
 
forbidden:
  db_writes_outside: [customers_a, customers_b, reconciliation_log]
  http_except: [smtp, mock_api]
  secrets_in_logs: deny
  file_writes_outside: [/workspace/src, /workspace/logs]
 
scoring:
  pass_threshold: 0.95
  replica_aggregation:
    strategy: all_must_pass
    min_pass_rate: 0.90
 
parallelism:
  replicas: 3
  isolation: per_run
  matrix:
    - { input_size: small, locale: en_US }
    - { input_size: large, locale: en_US }
 
determinism:
  clock: "2026-01-01T00:00:00Z"
  seed: 42
  network_latency: 0ms
  dns: static
 
retention:
  audit_logs: 24h
  snapshots: 7d
  teardown_exports: 30d
  traces: 30d
 
teardown:
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: mock_requests
      service: mock_api
      to: "results/{{ run_id }}/mock.jsonl"
  always_run: true

This single spec covers every block. Use it as a copy-paste template and remove what you don't need.


Validation

Specs are validated server-side at upload time. Common errors:

| Error | Cause |
|---|---|
| version: must be 1 | Missing or wrong version: |
| id: must be kebab-case | Spaces, underscores, or capitals in id: |
| task.prompt: required | Missing prompt |
| invariants: must have at least one | Empty invariants block |
| scoring.pass_threshold: out of range | Not in [0, 1] |
| services[*].name: duplicate | Two services with the same name |
| fixtures[*].service: not found | Service name doesn't match |
| agent.type: unknown | Not one of the six supported types |
| secrets[*].name: not in scope | Referenced via {{ secrets.X }} but not declared |

Run ks specs validate path/to/spec.yaml to validate locally before uploading. ks eval run also validates the spec implicitly.
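
For a rough idea of what several of these checks amount to, the sketch below runs them over a spec that has already been parsed into a dict. It is not the server's validator: precheck is a hypothetical helper, the exact kebab-case pattern is an assumption, and the messages simply reuse the error strings from the table.

```python
import re

KEBAB = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")  # assumed definition of kebab-case
AGENT_TYPES = {"paragon", "cli", "http", "python", "image", "snapshot"}


def precheck(spec):
    """Return a list of error strings; an empty list means no problem was found."""
    errors = []
    if spec.get("version") != 1:
        errors.append("version: must be 1")
    if not KEBAB.fullmatch(str(spec.get("id", ""))):
        errors.append("id: must be kebab-case")
    if not spec.get("task", {}).get("prompt"):
        errors.append("task.prompt: required")
    if not spec.get("invariants"):
        errors.append("invariants: must have at least one")
    threshold = spec.get("scoring", {}).get("pass_threshold")
    if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
        errors.append("scoring.pass_threshold: out of range")
    names = [s.get("name") for s in spec.get("services", [])]
    if len(names) != len(set(names)):
        errors.append("services[*].name: duplicate")
    if spec.get("agent", {}).get("type") not in AGENT_TYPES:
        errors.append("agent.type: unknown")
    return errors


hello = {
    "version": 1,
    "id": "hello-world",
    "task": {"prompt": "Create a file called hello.txt"},
    "agent": {"type": "paragon"},
    "invariants": {"hello_file_exists": {"check": {"type": "file_exists"}}},
    "scoring": {"pass_threshold": 1.0},
}
print(precheck(hello))  # []
```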