Spec Reference
Every section of a Keystone spec — fields, defaults, types, and worked examples for each block.
A Keystone spec is a YAML file that describes everything about one scenario: the environment to spin up, the task the agent should do, the credentials it needs, the policies it operates under, and how to score the result. This page is the complete field-by-field reference. For an introduction to the concepts, start with Concepts; for end-to-end examples, see Examples.
The minimum viable spec
Seven required fields (version, id, base, task, agent, invariants, scoring) are enough to run a hello-world:
version: 1
id: hello-world
description: "Smoke test: does the agent write a file?"
base: "ubuntu:24.04"
task:
  prompt: "Create a file called hello.txt with the text 'Hello, World!' inside."
agent:
  type: paragon
  timeout: 5m
invariants:
  hello_file_exists:
    description: "hello.txt was created"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: hello.txt
  hello_file_correct:
    description: "hello.txt contains the right text"
    weight: 1.0
    check:
      type: file_content
      path: hello.txt
      contains: "Hello, World!"
scoring:
  pass_threshold: 1.0

Run with ks eval run hello.yaml. Every other section in the spec is optional; add what you need.
Top-level fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| version | int | yes | - | Always 1 for now |
| id | string | yes | - | Kebab-case ID. Re-uploading the same id increments version |
| description | string | no | "" | Human-readable description |
| extends | string | no | "" | Path to a base spec to inherit from (v1.1 future) |
| task | object | yes | - | What the agent should do |
| base | string | yes | - | Base Docker image |
| agent | object | yes | - | How the agent gets invoked |
| invariants | map | yes | - | Pass/fail checks |
| scoring | object | yes | - | Composite scoring + replica aggregation |
| setup | object | no | - | Packages, commands, files, env vars |
| resources | object | no | (see below) | CPU/memory/disk/timeout |
| fixtures | list | no | [] | Seed data (git repos, SQL, files) |
| services | list | no | [] | Backing containers (DB, cache, mocks) |
| secrets | list | no | [] | Credentials |
| network | object | no | (see below) | Egress/ingress rules, DNS overrides |
| audit | object | no | (see below) | Structured capture of agent activity |
| snapshots | object | no | (see below) | World-state checkpointing |
| forbidden | object | no | - | Trajectory constraints |
| parallelism | object | no | (see below) | Replicas + matrix |
| determinism | object | no | - | Clock/seed/DNS pinning |
| retention | object | no | (see below) | How long to keep artifacts |
| teardown | object | no | - | Export actions after run |
The next sections walk through every block.
task — what the agent should do
task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co.
  context:
    ticket_url: "https://linear.app/co/issue/ENG-1234"
    success_criteria: "All rows match, exactly one summary email sent"
    repo: "acme/backend"
    language: "typescript"

| Field | Type | Required | Notes |
|---|---|---|---|
| prompt | string | yes | The instruction the agent receives, verbatim |
| context | map of strings | no | Extra structured info passed alongside the prompt |

context is free-form: your agent decides how to use it. Common keys:
- success_criteria: what counts as done (also used in some LLM judges)
- ticket_url: link to the originating ticket
- repo / language / framework: type hints
- style_guide: link or inline guidance
The prompt + context are passed to the agent via stdin (for cli, paragon, python agents) or in the request body (for http agents).
Best practices
- Be specific. "Fix the bug" is too vague. "Fix the failing test in src/api.test.ts so npm test returns exit 0 without modifying src/api.ts" is testable.
- Restate the invariants in plain English. If the invariant requires a file at output.json, say so in the prompt. The model can't read the spec.
- Bound the work. "Refactor the codebase" is unbounded; "Refactor the users.py module to use async SQLAlchemy" is bounded.
base — the base Docker image
base: "ubuntu:24.04"

Any Docker image works. Common choices:
| Image | When to use |
|---|---|
| ubuntu:24.04 | General Linux, install whatever you need in setup |
| node:20 / node:22 | Node/TS agents and tasks |
| python:3.12 / python:3.11 | Python agents and tasks |
| golang:1.22 | Go agents and tasks |
| mcr.microsoft.com/playwright | Browser automation |
| your-registry.io/your-image:tag | Pre-baked custom image |
Custom images need to be reachable by the Keystone server (Hub, GHCR, ECR, GAR, or a registry the server has credentials for).
base: is a string — there is no nested object form. To customize the environment beyond the bare image, use the setup: block (packages, commands, files, env).
setup — environment prep
setup:
  packages: [nodejs, npm, python3, git, postgresql-client]
  commands:
    - "npm install -g typescript"
    - "pip install pytest"
  files:
    - path: ".env"
      content: |
        DATABASE_URL=postgres://postgres:test@db:5432/testdb
        NODE_ENV=test
    - path: "tsconfig.json"
      content: '{"compilerOptions": {"strict": true}}'
  env:
    CI: "true"
    NODE_ENV: "test"

| Field | Type | Notes |
|---|---|---|
| packages | list of strings | apt/apk packages installed before any commands run |
| commands | list of strings | Shell commands run sequentially in the workspace |
| files | list of {path, content, template?} | Files written into the workspace |
| env | map | Env vars set for the agent (and for setup.commands) |
Order: packages first, then files written, then commands run. If any command fails, the sandbox fails to boot.
Setup vs. fixtures: setup is for static prep (install Node, write tsconfig). Fixtures are for seed data (clone a repo, load SQL). The two run at different boot stages — see Fixtures.
setup.files — templating
setup:
  files:
    - path: "config.json"
      template: '{"key": "{{ secrets.API_KEY }}"}'

template interpolates {{ secrets.NAME }} placeholders before writing. Use template: instead of content: when you need secret values inside the file. Otherwise prefer content:.
agent — how the agent gets invoked
Six agent types ship with Keystone. Pick whichever fits how your agent is packaged.
type: paragon — Polarity's built-in agent
agent:
  type: paragon
  model: paragon-fast   # or paragon-max, claude-sonnet-4-5, gpt-4o, etc.
  args: ["--yolo", "--quiet"]
  timeout: 5m
  env:
    PARAGON_TOOL_TRACE_FILE: "{{ sandbox.trace_path }}"

The Paragon CLI runs inside the sandbox with the task prompt on stdin. Default for "I just want a coding agent."
type: cli — any binary
agent:
  type: cli
  binary: /workspace/my-agent
  args: ["--task", "{{ task.prompt }}", "--workspace", "{{ sandbox.path }}"]
  timeout: 5m
  env:
    LOG_LEVEL: "info"

Runs an executable. Template variables: {{ task.prompt }}, {{ task.context.<key> }}, {{ sandbox.path }}.
type: http — call an HTTP endpoint
agent:
  type: http
  endpoint: "https://my-agent.internal/run"
  auth:
    bearer: "{{ secrets.AGENT_TOKEN }}"
  input_template: |
    {
      "task": "{{ task.prompt }}",
      "workspace": "{{ sandbox.url }}",
      "context": {{ task.context | tojson }}
    }
  timeout: 5m

POSTs the rendered input_template to endpoint. Use this when your agent runs as a service outside the sandbox.
type: python — Python script
agent:
  type: python
  binary: agent.py
  args: ["--mode", "eval"]
  timeout: 5m

Runs python3 agent.py [args] with the task as JSON on stdin.
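A minimal sketch of how agent.py might read that payload. The key names prompt and context are assumptions that mirror the spec's task block; this reference does not document the exact JSON shape, so check what your Keystone version actually sends.

```python
import io
import json
from typing import Any, Dict, TextIO


def read_task(stream: TextIO) -> Dict[str, Any]:
    """Parse the task payload from a text stream such as sys.stdin."""
    payload = json.load(stream)
    # "prompt" and "context" are assumed key names, not confirmed by the reference.
    return {
        "prompt": payload.get("prompt", ""),
        "context": payload.get("context", {}),
    }


# Stand-in for stdin so the sketch runs anywhere.
sample = io.StringIO('{"prompt": "Create hello.txt", "context": {"language": "python"}}')
task = read_task(sample)
print(task["prompt"])
print(task["context"]["language"])
```

Inside a real agent.py you would call read_task(sys.stdin) once at startup and then do the work in the workspace.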
type: image — Docker image
agent:
  type: image
  image: "your-registry/your-agent:v3"
  entrypoint: ["python", "main.py"]   # optional override
  timeout: 5m
  env:
    LOG_LEVEL: "debug"

Pulls the image and runs it on the sandbox's service network. Best for reproducibility.
type: snapshot — uploaded agent
agent:
  type: snapshot
  snapshot: my-agent                  # latest version
  # or pin a specific version:
  # snapshot_id: snap_abc123
  entrypoint: ["python", "main.py"]   # optional override
  timeout: 5m

Resolves to an agent snapshot you uploaded via ks.agents.upload(). Best for tracking which agent version produced which results.
Common fields across all agent types
| Field | Type | Default | Notes |
|---|---|---|---|
| type | string | required | paragon, cli, http, python, image, snapshot |
| timeout | duration | 5m | Max wall time before agent is killed |
| env | map | | Env vars for the agent process |
resources — compute & timeout limits
resources:
  timeout: "10m"
  memory: "4Gi"
  cpu: 4
  disk: "20Gi"
  desktop: false          # true = XFCE desktop with browser available
  concurrency_limit: 5    # max sandboxes for this spec at once

| Field | Type | Default | Notes |
|---|---|---|---|
| timeout | duration | 10m | Total sandbox lifecycle (setup + agent + scoring) |
| memory | string | 2Gi | Memory limit (Ki/Mi/Gi) |
| cpu | int | 2 | CPU cores |
| disk | string | 10Gi | Disk quota |
| desktop | bool | false | Spin up XFCE for browser-using agents |
| concurrency_limit | int | tenant default | Max parallel sandboxes for this spec |
The timeout here covers the entire sandbox lifecycle, not just the agent. If your agent has a 5-minute timeout and your invariants need another 2 minutes, set resources.timeout: 8m (or more, for safety).
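The budget arithmetic can be checked with a few lines of Python. This is an illustrative sketch, not part of Keystone: parse_duration and timeout_headroom are hypothetical helpers, and the duration grammar (a whole number followed by s, m, or h) is an assumption based on the values used on this page.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration such as '5m', '90s', or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_headroom(resources_timeout: str, agent_timeout: str, scoring_estimate: str) -> int:
    """Seconds left over once the agent and scoring budgets are subtracted."""
    return parse_duration(resources_timeout) - (
        parse_duration(agent_timeout) + parse_duration(scoring_estimate)
    )


print(timeout_headroom("8m", "5m", "2m"))  # 60
print(timeout_headroom("6m", "5m", "2m"))  # -60
```

A negative result means the sandbox would be torn down before scoring finishes. Setup time is not modelled here, so leave extra room for package installs and fixtures.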
fixtures — seed data
Fixtures load data into the sandbox after services boot but before the agent runs. Four types: git_repo, sql, directory, drift. See Fixtures for the deep dive.
fixtures:
  # Clone a git repository into /workspace
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "main"
    depth: 1
    path: /workspace
  # Run inline SQL against a service (preferred over path:)
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers (
        email TEXT PRIMARY KEY,
        name TEXT NOT NULL
      );
      INSERT INTO customers VALUES
        ('alice@example.com', 'Alice'),
        ('ben@example.com', 'Ben');
  # Copy a directory from the Keystone server's filesystem
  - type: directory
    source: /var/keystone/seeds/test-data
    target: data/
  # Inject random data corruption for adversarial testing
  - type: drift
    target: db.customers
    strategy: random_mismatches   # or random_nulls, duplicate_rows
    count: 15
    seed: "{{ determinism.seed }}"

Fixtures run in declaration order. A failure aborts sandbox boot.
services — backing containers
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready -U postgres"
  - name: cache
    image: redis:7
    ports: [6379]
    wait_for: "redis-cli ping"
  - name: smtp
    image: mailhog/mailhog
    ports: [1025, 8025]   # SMTP + HTTP UI
  - name: stripe-mock
    type: http_mock
    record: true
    default_response: 404
    routes:
      - method: POST
        path: /v1/charge
        response: '{"id":"ch_test","status":"succeeded"}'
        status: 200
      - method: ANY
        path: "/v1/webhooks/.*"   # regex match
        response: '{"ok": true}'

The full reference is in Services. Service field summary:
| Field | Type | Required | Notes |
|---|---|---|---|
| name | string | yes | DNS alias on the shared network |
| image | string | yes (unless type: http_mock) | Any Docker image |
| type | string | no | Empty (default) or http_mock for built-in mocks |
| env | map | no | Container env. Supports {{ secrets.X }} |
| ports | list of ints | no | Container ports exposed on the network |
| wait_for | string | no | Shell command run inside; gates readiness |
| record | bool | no | (http_mock only) capture all requests |
| default_response | int | no | (http_mock) status for unmatched routes |
| routes | list | no | (http_mock) per-method/path responders |
Services are reachable from the agent by name: postgres://db:5432, redis://cache:6379, http://stripe-mock:9090. Auto-injected env vars: KEYSTONE_SERVICE_<NAME>_HOST, KEYSTONE_SERVICE_<NAME>_PORT.
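As a sketch of how agent code might consume the auto-injected variables: service_address below is a hypothetical helper, and the way a service name maps onto <NAME> (upper-cased, with hyphens turned into underscores) is an assumption this page does not confirm.

```python
import os
from typing import Mapping


def service_address(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Build host:port for a service from the KEYSTONE_SERVICE_* variables."""
    # Assumed normalisation: upper-case, with "-" replaced by "_".
    key = name.upper().replace("-", "_")
    host = env[f"KEYSTONE_SERVICE_{key}_HOST"]
    port = env[f"KEYSTONE_SERVICE_{key}_PORT"]
    return f"{host}:{port}"


# Stand-in environment so the sketch runs outside a sandbox.
fake_env = {"KEYSTONE_SERVICE_DB_HOST": "db", "KEYSTONE_SERVICE_DB_PORT": "5432"}
print(service_address("db", fake_env))  # db:5432
```

Reading the variables instead of hard-coding db:5432 keeps agent code working if a service is renamed in the spec.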
secrets — credentials
secrets:
  - name: ANTHROPIC_API_KEY
    source: env                        # default; pulls $ANTHROPIC_API_KEY from caller's env
  - name: DB_PASSWORD
    source: env:LOCAL_DB_PASS          # rename: pull $LOCAL_DB_PASS, expose as DB_PASSWORD
  - name: GCP_KEY
    source: "file:~/.config/gcloud/sa.json"
  - name: VAULT_TOKEN
    source: 'command:vault token lookup -field=id'
  - name: STRIPE_LIVE_KEY
    source: dashboard                  # server-side only; SDK refuses local override
  - name: TEST_FIXTURE_TOKEN
    from: "static://fake-test-value"   # spec literal; deterministic
  - name: GENERATED_DB_PASSWORD
    from: generated                    # random per-run

| Field | Type | Notes |
|---|---|---|
| name | string | Env var name inside the sandbox |
| source | string | env, env:OTHER, file:..., command:..., dashboard |
| from | string | static://<value> or generated (mutually exclusive with source) |
| scope | string or {env, file_template} | Where the secret is exposed (default: env) |
Precedence: from: > source: (resolved locally) > Dashboard. A secret that resolves to nothing fails the sandbox boot. See Secrets for the full deep dive.
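The local part of that resolution order can be sketched roughly as follows. This illustrates the rules above and is not Keystone's resolver: resolve_secret is a hypothetical name, and details such as stripping whitespace from file contents or running command: through a shell are assumptions.

```python
import os
import secrets as pysecrets
import subprocess
from pathlib import Path
from typing import Mapping, Optional


def resolve_secret(entry: Mapping[str, str], env: Mapping[str, str] = os.environ) -> str:
    """Resolve one secrets[] entry locally; raise if it resolves to nothing."""
    name = entry["name"]
    origin = entry.get("from")
    source = entry.get("source", "env")
    value: Optional[str]

    if origin is not None:  # from: takes precedence over source:
        if origin.startswith("static://"):
            value = origin[len("static://"):]
        elif origin == "generated":
            value = pysecrets.token_urlsafe(24)  # random per run
        else:
            raise ValueError(f"{name}: unknown from: {origin!r}")
    elif source == "env":
        value = env.get(name)
    elif source.startswith("env:"):
        value = env.get(source[len("env:"):])
    elif source.startswith("file:"):
        value = Path(source[len("file:"):]).expanduser().read_text().strip()
    elif source.startswith("command:"):
        result = subprocess.run(source[len("command:"):], shell=True,
                                capture_output=True, text=True, check=True)
        value = result.stdout.strip()
    else:
        # e.g. dashboard, which is resolved server-side only
        raise ValueError(f"{name}: source {source!r} is not resolvable locally")

    if not value:
        raise RuntimeError(f"secret {name} resolved to nothing")
    return value


caller_env = {"LOCAL_DB_PASS": "hunter2"}
print(resolve_secret({"name": "DB_PASSWORD", "source": "env:LOCAL_DB_PASS"}, caller_env))
print(resolve_secret({"name": "TEST_FIXTURE_TOKEN", "from": "static://fake-test-value"}))
```

The final check mirrors the rule that a secret resolving to nothing fails the sandbox boot.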
Scope examples
secrets:
  # Default: env var only
  - name: API_KEY
    source: env
  # Env var AND templated into a config file
  - name: STRIPE_KEY
    from: "static://sk_test_xxx"
    scope:
      env: true
      file_template: "config/stripe.json"

file_template: is templated at sandbox boot: {{ STRIPE_KEY }} placeholders in the named file are replaced with the resolved value.
network — egress, ingress, DNS overrides
network:
  egress:
    default: deny                 # block all outbound by default
    allow:
      - registry.npmjs.org
      - github.com
      - "*.services.internal"     # internal service network
  ingress:
    default: deny
    allow:
      - from: host
        to_port: 3000
  dns_overrides:
    api.stripe.com: stripe-mock.services.internal
    smtp.sendgrid.net: smtp.services.internal

| Section | What it controls |
|---|---|
| egress.default | deny (block all outbound) or allow (let everything through) |
| egress.allow | Hostnames or globs that bypass the default |
| ingress.default | Inbound from outside the sandbox |
| ingress.allow | Specific inbound rules (e.g., from host to port) |
| dns_overrides | Map real hostnames to in-sandbox services |
Default-deny + DNS overrides is the canonical pattern: the agent's existing code calls https://api.stripe.com/..., but inside the sandbox that resolves to your stripe-mock container. See Network & Audit.
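To make the egress rule concrete, here is a small sketch of how an allow list with globs could be evaluated. It uses Python's fnmatch for glob matching, which is an assumption; Keystone's own matcher may treat wildcards differently (fnmatch lets * span dots, for example). egress_allowed is a hypothetical helper.

```python
from fnmatch import fnmatch
from typing import Iterable


def egress_allowed(host: str, default: str, allow: Iterable[str]) -> bool:
    """Return True when an outbound connection to host would be permitted."""
    if any(fnmatch(host, pattern) for pattern in allow):
        return True
    # No allow entry matched, so the default policy decides.
    return default == "allow"


allow = ["registry.npmjs.org", "github.com", "*.services.internal"]
print(egress_allowed("github.com", "deny", allow))                     # True
print(egress_allowed("stripe-mock.services.internal", "deny", allow))  # True
print(egress_allowed("api.stripe.com", "deny", allow))                 # False
```

In the canonical pattern the last case never reaches the network check, because the DNS override has already pointed api.stripe.com at an in-sandbox service.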
audit — capture activity
audit:
  db_writes: true        # log every INSERT/UPDATE/DELETE
  http_calls: true       # log every outbound HTTP
  process_spawns: true   # log every child process
  stdout_capture: true   # capture stdout for secret detection
  file_system:
    watch: ["src/", "config/"]
    track: [writes, reads, deletes]

| Field | Type | Default | Notes |
|---|---|---|---|
| db_writes | bool | false | Required for forbidden.db_writes_outside |
| http_calls | bool | false | Required for forbidden.http_except |
| process_spawns | bool | false | Logs every child process exec |
| stdout_capture | bool | false | Required for forbidden.secrets_in_logs |
| file_system.watch | list of strings | [] | Directories to monitor |
| file_system.track | list of strings | [] | writes, reads, deletes |
Audit events stream to <workspace>/.keystone/audit.jsonl and to the /v1/sandboxes/:id/events SSE endpoint. See Network & Audit.
snapshots — world-state checkpointing
snapshots:
  before_run: true          # always capture before agent starts
  checkpoints: per_action   # snapshot after each agent tool call
  retain_on: [failure]      # only persist snapshots for failed runs

| Field | Type | Default | Notes |
|---|---|---|---|
| before_run | bool | true | Snapshot taken once when sandbox enters ready |
| checkpoints | string | none | none or per_action (after every tool call) |
| retain_on | list | [failure] | When to persist: [failure], [always], or [] |
Snapshots enable sandboxes.diff() (what did the agent change?) and the per-action replay in the dashboard's trace viewer. Stored under <workspace>/.keystone/snapshots/ until destroyed.
invariants — pass/fail checks
The most important block. Eight check types are available; the example below uses six of them:
invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true   # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"
      exit_code: 0
  output_correct:
    description: "Output file contains expected data"
    weight: 0.5
    check:
      type: file_content
      path: output.json
      contains: '"status": "success"'
      not_contains: "ERROR"
  no_orphan_orders:
    description: "Every order has a matching customer"
    weight: 0.3
    gate: true
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM orders WHERE customer_id NOT IN (SELECT id FROM customers)"
      equals: 0
  email_sent:
    description: "Exactly one email to finance"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
  custom_check:
    description: "Custom Python validation"
    weight: 0.2
    check:
      type: custom
      script: checks/validate.py
      runs_in: host   # host (default) or sandbox
  code_quality:
    description: "Code is well-structured"
    weight: 0.4
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Is the diff minimal and idiomatic?"
      input_from: workspace
      rubric:
        pass: "Clean, minimal change"
        fail: "Over-engineered or invasive"
      pass_threshold: 0.7

Each invariant has:
| Field | Type | Required | Notes |
|---|---|---|---|
| description | string | yes | Human-readable |
| weight | float | no (default 1.0) | Relative importance |
| gate | bool | no (default false) | If true, scenario fails when this fails |
| check | object | yes | The actual check (type-specific) |
The full reference is in Invariants.
check.type — full table
| Type | Asks |
|---|---|
| command_exit | Does the command return the expected exit code? |
| file_exists | Does this file exist? |
| file_absent | Is this file missing? |
| file_content | Does this file contain (or not contain) a string or regex? |
| sql | Does this SQL query return the expected value? |
| http_mock_assertions | Was this mock service called the right way? |
| custom | Did your Python script return passed: true? |
| llm_as_judge | Did the model rate the output above the threshold? |
forbidden — trajectory constraints
Things the agent must NOT do. Backed by the audit log; any violation auto-fails the scenario.
forbidden:
  db_writes_outside: [users, orders, audit_log]
  http_except: [stripe-mock, smtp]
  secrets_in_logs: deny
  file_writes_outside: [src/, output/, .keystone/]

See Forbidden Rules.
scoring — composite + replica aggregation
scoring:
  pass_threshold: 0.9     # composite must be >= 0.9 to pass
  replica_aggregation:
    strategy: majority    # all_must_pass | majority | percentage
    min_pass_rate: 0.85

| Field | Type | Default | Notes |
|---|---|---|---|
| pass_threshold | float in [0, 1] | required | Composite minimum |
| replica_aggregation.strategy | string | all_must_pass | How replica statuses combine |
| replica_aggregation.min_pass_rate | float | 0.5 | For percentage strategy |
Composite computation
For each scenario:
- Each invariant produces a score in [0, 1].
- Composite = sum(weight × score) / sum(weight).
- Any gate failure → composite forced to 0.
- Any forbidden violation → composite forced to 0.
- composite >= pass_threshold → status pass. Else fail.
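The steps above can be sketched in a few lines of Python. InvariantResult and composite_score are hypothetical names for illustration, and treating any gated invariant with a score below 1.0 as a gate failure is an assumption; this page does not say where the per-invariant pass line sits.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class InvariantResult:
    score: float          # in [0, 1]
    weight: float = 1.0
    gate: bool = False


def composite_score(results: Iterable[InvariantResult], forbidden_violations: int = 0) -> float:
    """Weighted mean of invariant scores, forced to 0 by gates or violations."""
    results = list(results)
    if forbidden_violations > 0:
        return 0.0
    # Assumption: a gated invariant counts as failed when its score is below 1.0.
    if any(r.gate and r.score < 1.0 for r in results):
        return 0.0
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("at least one invariant with non-zero weight is required")
    return sum(r.weight * r.score for r in results) / total_weight


def status(composite: float, pass_threshold: float) -> str:
    return "pass" if composite >= pass_threshold else "fail"


results = [
    InvariantResult(score=1.0, weight=0.5, gate=True),
    InvariantResult(score=1.0, weight=0.3),
    InvariantResult(score=0.7, weight=0.2),
]
score = composite_score(results)
print(round(score, 2))       # 0.94
print(status(score, 0.95))   # fail
```

With weights 0.5, 0.3, and 0.2, a 0.7 on the lowest-weighted invariant is enough to miss a 0.95 threshold, which is worth keeping in mind when choosing pass_threshold.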
Replica aggregation strategies
| Strategy | Verdict |
|---|---|
| all_must_pass | Every replica must pass. Otherwise fail. |
| majority | More than 50% pass → pass; else fail (ties → flaky) |
| percentage | At least min_pass_rate fraction pass → pass. Otherwise check spread → flaky if mixed, else fail |
flaky is its own status — distinct from fail — for "passed sometimes, failed sometimes" cases that you typically want to investigate, not gate on.
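One way to read the strategy table, as a sketch rather than Keystone's implementation: aggregate is a hypothetical function, a tie under majority means exactly half the replicas passed, and mixed under percentage means at least one replica passed and at least one failed. Both readings are assumptions.

```python
from typing import Sequence


def aggregate(statuses: Sequence[str], strategy: str = "all_must_pass",
              min_pass_rate: float = 0.5) -> str:
    """Combine per-replica statuses ('pass' or 'fail') into one verdict."""
    total = len(statuses)
    if total == 0:
        raise ValueError("no replicas to aggregate")
    passed = sum(1 for s in statuses if s == "pass")

    if strategy == "all_must_pass":
        return "pass" if passed == total else "fail"
    if strategy == "majority":
        if 2 * passed > total:
            return "pass"
        return "flaky" if 2 * passed == total else "fail"
    if strategy == "percentage":
        if passed / total >= min_pass_rate:
            return "pass"
        # Below the required rate: mixed results are flaky, all-fail is fail.
        return "flaky" if 0 < passed < total else "fail"
    raise ValueError(f"unknown strategy: {strategy!r}")


print(aggregate(["pass", "pass", "fail"], "majority"))                  # pass
print(aggregate(["pass", "fail"], "majority"))                          # flaky
print(aggregate(["pass"] * 8 + ["fail"] * 2, "percentage", 0.85))       # flaky
print(aggregate(["fail"] * 3, "percentage", 0.85))                      # fail
```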
parallelism — replicas and matrix
parallelism:
  replicas: 10         # runs PER matrix entry
  isolation: per_run   # per_run = own sandbox | shared = reuse
  matrix:
    - { input_size: small, locale: en_US }
    - { input_size: large, locale: en_US }
    - { input_size: small, locale: ja_JP }

| Field | Type | Default | Notes |
|---|---|---|---|
| replicas | int | 1 | Runs per matrix entry |
| isolation | string | per_run | per_run (own sandbox) or shared (reuse) |
| matrix | list of maps | [] | Parameter combinations |
replicas: 10 × 3 matrix entries = 30 total scenarios.
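The expansion can be sketched as a nested loop. expand_scenarios is a hypothetical helper; the ordering (all replicas of one matrix entry before the next) and the treatment of an empty matrix as a single parameterless entry are assumptions. The scenario-000 numbering follows the format shown in the templating reference.

```python
from typing import Any, Dict, List


def expand_scenarios(matrix: List[Dict[str, Any]], replicas: int = 1) -> List[Dict[str, Any]]:
    """Expand matrix entries x replicas into a flat list of scenarios."""
    entries = matrix or [{}]  # assumption: no matrix means one entry with no parameters
    scenarios: List[Dict[str, Any]] = []
    for entry in entries:
        for replica in range(replicas):
            scenarios.append({
                "scenario_id": f"scenario-{len(scenarios):03d}",
                "replica": replica,
                "matrix": dict(entry),
            })
    return scenarios


matrix = [
    {"input_size": "small", "locale": "en_US"},
    {"input_size": "large", "locale": "en_US"},
    {"input_size": "small", "locale": "ja_JP"},
]
scenarios = expand_scenarios(matrix, replicas=10)
print(len(scenarios))                  # 30
print(scenarios[0]["scenario_id"])     # scenario-000
print(scenarios[-1]["scenario_id"])    # scenario-029
```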
Matrix interpolation
Matrix values are available as {{ matrix.<key> }} in any string field:
parallelism:
  matrix:
    - { model: claude-sonnet-4-5 }
    - { model: gpt-4o }
agent:
  type: cli
  binary: my-agent
  args: ["--model", "{{ matrix.model }}"]

Isolation: per_run vs shared
- per_run (default): every scenario gets its own sandbox + services. Slow but accurate.
- shared: sandboxes are reused across scenarios. Fast but state can leak.
Use per_run for any real measurement. Use shared only for cheap smoke tests where state pollution is acceptable.
determinism — pin sources of non-determinism
determinism:
  clock: "2026-01-01T00:00:00Z"   # frozen time (libfaketime)
  seed: 42                        # PRNG seed
  network_latency: 0ms            # injected latency
  dns: static                     # static or live

See Determinism.
retention — artifact lifetime
retention:
  audit_logs: 24h
  snapshots: 7d
  teardown_exports: 30d
  traces: 30d

| Field | Default |
|---|---|
| audit_logs | 24h |
| snapshots | 7d |
| teardown_exports | 30d (Free), 6mo (Pro), 2y (Enterprise) |
| traces | 30d (Free), 6mo (Pro), 2y (Enterprise) |
teardown — export artifacts after run
teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: snapshot
      to: "results/{{ run_id }}/final_state.snap"
    - type: mock_requests
      service: stripe-mock
      to: "results/{{ run_id }}/stripe-calls.jsonl"

| Field | Type | Notes |
|---|---|---|
| always_run | bool | Run teardown even on agent failure |
| export[].type | string | audit_log, db_dump, snapshot, mock_requests |
| export[].service | string | (for db_dump/mock_requests) which service |
| export[].to | string | Destination path; {{ run_id }} and {{ scenario_id }} interpolated |
Destination paths are relative to the artifact store the server is configured for (S3, local FS).
Templating reference
These template variables are available in every string field of the spec:
| Variable | Resolves to |
|---|---|
| {{ secrets.NAME }} | The resolved value of secret NAME |
| {{ matrix.KEY }} | The matrix entry's value for KEY |
| {{ task.prompt }} | The task prompt |
| {{ task.context.KEY }} | A task.context value |
| {{ sandbox.path }} | The workspace path inside the sandbox |
| {{ sandbox.url }} | The Keystone API URL for this sandbox |
| {{ sandbox.trace_path }} | Path to the trace file inside the sandbox |
| {{ run_id }} | The current scenario run ID (scoped to one replica) |
| {{ scenario_id }} | The scenario ID (scenario-000, etc.) |
| {{ determinism.seed }} | The seed for this scenario |
| {{ determinism.clock }} | The frozen clock value |
Templates work in: setup.commands, setup.files[].content, setup.env, services[].env, agent.args, agent.endpoint, agent.input_template, fixtures[].seed, teardown.export[].to.
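For intuition about how dotted placeholders resolve, here is a minimal renderer. It is a sketch under stated assumptions, not Keystone's template engine: render and the scope dictionary are hypothetical, and filters such as | tojson are deliberately not handled.

```python
import re
from typing import Any, Mapping

# Matches {{ name }} and dotted paths such as {{ secrets.DB_PASSWORD }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render(template: str, scope: Mapping[str, Any]) -> str:
    """Replace each placeholder with the value found by walking the dotted path."""
    def lookup(match: "re.Match[str]") -> str:
        value: Any = scope
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError for undeclared names
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)


scope = {
    "secrets": {"DB_PASSWORD": "s3cret"},
    "matrix": {"model": "gpt-4o"},
    "run_id": "run_42",
}
print(render("postgres://postgres:{{ secrets.DB_PASSWORD }}@db:5432/testdb", scope))
print(render("results/{{ run_id }}/audit.jsonl", scope))
```

The KeyError on an unknown name loosely mirrors the validation error for a {{ secrets.X }} reference that was never declared.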
A complete real-world spec
Here's a full spec from the open-source examples — agent reconciles two databases, sends a summary email, must avoid forbidden DB writes:
version: 1
id: reconciliation-scenario
description: "Agent reconciles two customer databases and emails a summary"
task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co
  context:
    ticket_url: "https://linear.app/co/issue/ENG-1234"
    success_criteria: "All rows match after reconciliation, exactly one summary email sent"
base: "ubuntu:24.04"
setup:
  packages: [nodejs, npm, git, python3]
  commands:
    - "npm install -g typescript"
  env:
    NODE_ENV: "test"
    DATABASE_URL: "postgres://postgres:{{ secrets.DB_PASSWORD }}@db:5432/testdb"
resources:
  timeout: "10m"
  memory: "2Gi"
  cpu: 2
  concurrency_limit: 5
fixtures:
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers_a (id INT PRIMARY KEY, email TEXT, name TEXT);
      CREATE TABLE customers_b (id INT PRIMARY KEY, email TEXT, name TEXT);
      INSERT INTO customers_a VALUES
        (1, 'a@co', 'Alice'), (2, 'b@co', 'Ben'), (3, 'c@co', 'Carol');
      INSERT INTO customers_b SELECT * FROM customers_a;
  - type: drift
    target: db.customers_a
    strategy: random_mismatches
    count: 15
    seed: "{{ determinism.seed }}"
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready"
  - name: smtp
    image: mailhog/mailhog
    ports: [1025, 8025]
  - name: mock_api
    type: http_mock
    record: true
    default_response: 404
    routes:
      - method: GET
        path: /v1/customers
        response: '{"customers": []}'
        status: 200
secrets:
  - name: DB_PASSWORD
    from: generated
    scope: env
network:
  egress:
    default: deny
    allow: ["*.services.internal"]
  dns_overrides:
    api.stripe.com: mock_api.services.internal
audit:
  db_writes: true
  http_calls: true
  file_system:
    watch: [/workspace]
    track: [writes, deletes]
  process_spawns: true
  stdout_capture: true
snapshots:
  before_run: true
  checkpoints: per_action
  retain_on: [failure]
agent:
  type: paragon
  model: paragon-max
  args: ["--yolo", "--quiet"]
  timeout: 5m
invariants:
  databases_match:
    description: "No drift between customers_a and customers_b"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0
  email_sent:
    description: "Exactly one summary email to finance"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
  code_quality:
    description: "Solution is clean and idiomatic"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Is the reconciliation code clean and well-structured?"
      input_from: workspace
      rubric:
        pass: "Clean, handles edge cases"
        fail: "Messy, fragile, or incomplete"
      temperature: 0
forbidden:
  db_writes_outside: [customers_a, customers_b, reconciliation_log]
  http_except: [smtp, mock_api]
  secrets_in_logs: deny
  file_writes_outside: [/workspace/src, /workspace/logs]
scoring:
  pass_threshold: 0.95
  replica_aggregation:
    strategy: all_must_pass
    min_pass_rate: 0.90
parallelism:
  replicas: 3
  isolation: per_run
  matrix:
    - { input_size: small, locale: en_US }
    - { input_size: large, locale: en_US }
determinism:
  clock: "2026-01-01T00:00:00Z"
  seed: 42
  network_latency: 0ms
  dns: static
retention:
  audit_logs: 24h
  snapshots: 7d
  teardown_exports: 30d
  traces: 30d
teardown:
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: mock_requests
      service: mock_api
      to: "results/{{ run_id }}/mock.jsonl"
  always_run: true

This single spec covers every block. Use it as a copy-paste template and remove what you don't need.
Validation
Specs are validated server-side at upload time. Common errors:
| Error | Cause |
|---|---|
| version: must be 1 | Missing or wrong version: |
| id: must be kebab-case | Spaces, underscores, or capitals in id: |
| task.prompt: required | Missing prompt |
| invariants: must have at least one | Empty invariants block |
| scoring.pass_threshold: out of range | Not in [0, 1] |
| services[*].name: duplicate | Two services with the same name |
| fixtures[*].service: not found | Service name doesn't match |
| agent.type: unknown | Not one of the six supported types |
| secrets[*].name: not in scope | Referenced via {{ secrets.X }} but not declared |
Run ks specs validate path/to/spec.yaml to validate locally before uploading; ks eval run also validates implicitly.
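If you want a quick pre-flight check in your own tooling, several of the rules in the error table are easy to approximate. The sketch below is illustrative only and is not the validator Keystone runs: validate_spec is a hypothetical function operating on an already-parsed spec (a plain dict), and the kebab-case pattern is an assumption about what the server accepts.

```python
import re
from typing import Any, List, Mapping

AGENT_TYPES = {"paragon", "cli", "http", "python", "image", "snapshot"}
# Assumed kebab-case rule: lowercase letters and digits separated by single hyphens.
KEBAB = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def validate_spec(spec: Mapping[str, Any]) -> List[str]:
    """Return error messages for a parsed spec, mirroring part of the table above."""
    errors: List[str] = []
    if spec.get("version") != 1:
        errors.append("version: must be 1")
    if not KEBAB.fullmatch(str(spec.get("id", ""))):
        errors.append("id: must be kebab-case")
    if not (spec.get("task") or {}).get("prompt"):
        errors.append("task.prompt: required")
    if not spec.get("invariants"):
        errors.append("invariants: must have at least one")
    threshold = (spec.get("scoring") or {}).get("pass_threshold")
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0 <= threshold <= 1:
        errors.append("scoring.pass_threshold: out of range")
    names = [service.get("name") for service in spec.get("services", [])]
    if len(names) != len(set(names)):
        errors.append("services[*].name: duplicate")
    if (spec.get("agent") or {}).get("type") not in AGENT_TYPES:
        errors.append("agent.type: unknown")
    return errors


good = {
    "version": 1,
    "id": "hello-world",
    "base": "ubuntu:24.04",
    "task": {"prompt": "Create hello.txt"},
    "agent": {"type": "paragon"},
    "invariants": {"hello_file_exists": {"check": {"type": "file_exists", "path": "hello.txt"}}},
    "scoring": {"pass_threshold": 1.0},
}
bad = dict(good, id="Hello World", scoring={"pass_threshold": 1.5})
print(validate_spec(good))
print(validate_spec(bad))
```

Server-side validation remains the source of truth; a local check like this only catches the obvious mistakes earlier.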