# Invariants
Pass/fail checks that run after the agent finishes. Eight built-in types, plus weights, gates, and composite scoring.
An invariant is a postcondition: a yes/no check that runs after your agent finishes. Each one returns a score in [0, 1] (pass = 1, fail = 0; LLM-as-judge produces continuous scores). The composite score is the weighted average of all invariants — and if any "gate" invariant fails, the scenario fails outright regardless of the rest.
## Anatomy

```yaml
invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0   # relative importance in composite
    gate: true    # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"
```

| Field | Type | Required | Default |
|---|---|---|---|
| `description` | string | yes | — |
| `weight` | float | no | 1.0 |
| `gate` | bool | no | false |
| `check` | object | yes | — |

`check.type` selects the check type; the rest of `check` is type-specific.
## The eight check types

| Type | Asks |
|---|---|
| `command_exit` | Does the command return the expected exit code? |
| `file_exists` | Does this file exist? |
| `file_absent` | Is this file missing? |
| `file_content` | Does this file contain (or not contain) a substring or regex? |
| `sql` | Does this SQL query return the expected value? |
| `http_mock_assertions` | Was this mock service called the right way? |
| `custom` | Did your Python script return `passed: true`? |
| `llm_as_judge` | Did the model rate the output above the threshold? |
### command_exit

```yaml
invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
      exit_code: 0   # optional; default 0
```

Runs the command via `sh -c` from the workspace. Passes if the exit code matches `exit_code`. Stdout and stderr are captured and shown in the failure message.
### file_exists / file_absent

```yaml
invariants:
  output_created:
    description: "Output file exists"
    check:
      type: file_exists
      path: "output.json"
  no_temp_files:
    description: "Temp file cleaned up"
    check:
      type: file_absent
      path: ".tmp/cache"
```

`path` is relative to the workspace.
### file_content

```yaml
invariants:
  output_correct:
    description: "Output contains success marker"
    check:
      type: file_content
      path: "output.json"
      contains: '"status": "success"'   # substring match
  no_debug_logs:
    description: "No console.log left in source"
    check:
      type: file_content
      path: "src/main.ts"
      not_contains: "console.log"
  matches_pattern:
    description: "Email follows expected format"
    check:
      type: file_content
      path: "draft.txt"
      pattern: "^Subject: Renewal\\b"   # regex (matches anywhere unless ^/$)
```

| Field | What it does |
|---|---|
| `contains` | Substring must appear |
| `not_contains` | Substring must NOT appear |
| `pattern` | Regex must match |
You can combine them — all conditions must be true.
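A minimal Python sketch of that and-ing behavior (illustrative only, not Keystone's implementation; treating `pattern` as a multiline `re.search` is an assumption here):

```python
import re

def file_content_passes(text: str,
                        contains: str | None = None,
                        not_contains: str | None = None,
                        pattern: str | None = None) -> bool:
    """Every condition you set must hold; unset conditions are skipped."""
    if contains is not None and contains not in text:
        return False
    if not_contains is not None and not_contains in text:
        return False
    # Assumption: the regex may match anywhere, with ^/$ anchoring per line.
    if pattern is not None and not re.search(pattern, text, re.MULTILINE):
        return False
    return True
```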
### sql

```yaml
invariants:
  reconciliation_clean:
    description: "No mismatches between A and B"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db   # name from services[]
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0
```

Runs the query against the named service (must be a Postgres or MySQL container) and expects the first column of the first row to equal `equals`. Use it for any "was this state reached?" question.
### http_mock_assertions

```yaml
invariants:
  email_sent_once:
    description: "Exactly one renewal email"
    check:
      type: http_mock_assertions
      service: smtp   # must have `record: true`
      assertions:
        - field: request_count
          filters:
            method: POST
            path: "/api/messages"
          equals: 1
        - field: last_request.body
          contains: "renewal"
```

Runs against an `http_mock` service with `record: true`. Each assertion checks one field; if any assertion fails, the invariant fails.

| `field` | What it returns |
|---|---|
| `request_count` | Number of matching requests |
| `last_request.body` | Body of the most recent matching request |
| `last_request.headers` | Headers map |
| `requests[N]` | Nth request (zero-indexed) |
| `requests[N].body` / `.headers` | Sub-fields of the Nth request |

`filters` narrows which requests count: `method`, `path`, plus header filters.
### custom

```yaml
invariants:
  summary_accuracy:
    description: "Numbers in email match log"
    weight: 0.3
    check:
      type: custom
      script: checks/verify_summary.py
      runs_in: host   # host (default) or sandbox
```

The script reads `KeystoneContext` JSON on stdin and prints `CheckResult` JSON on stdout.

Input on stdin:

```json
{
  "sandbox_id": "sb-abc",
  "workspace_path": "/var/keystone/workspaces/sb-abc",
  "services": { "db": {"host": "...", "port": 5432} },
  "audit_log_path": "/var/keystone/workspaces/sb-abc/.keystone/audit.jsonl",
  "trace_path": "/var/keystone/workspaces/sb-abc/.keystone/trace.jsonl",
  "task": { "prompt": "..." },
  "scenario_params": { "model": "claude-sonnet-4-5" }
}
```

Output on stdout:

```json
{
  "passed": true,
  "score": 1.0,
  "reason": "All numbers match",
  "details": { "matched": 5, "total": 5 }
}
```

| Exit code | Meaning |
|---|---|
| 0 | Check ran successfully — pass/fail is in the JSON |
| ≠ 0 | The check itself errored — scenario status becomes `error` |

`runs_in: host` runs the script on the Keystone server (with read-only access to the sandbox). `runs_in: sandbox` runs it inside the sandbox container — useful when the script needs Python deps the agent already installed.
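A minimal skeleton that follows this contract (the verdict logic is a placeholder; only the stdin/stdout protocol and exit-code convention come from the description above):

```python
#!/usr/bin/env python3
# Skeleton custom check: replace the placeholder verdict with real logic.
import json
import sys

def main() -> None:
    ctx = json.load(sys.stdin)       # KeystoneContext
    workspace = ctx["workspace_path"]  # files to inspect live here

    # ... inspect workspace / audit log / services here ...
    ok = True                        # placeholder verdict

    json.dump({
        "passed": ok,
        "score": 1.0 if ok else 0.0,
        "reason": "placeholder check",
    }, sys.stdout)

if __name__ == "__main__":
    try:
        main()
        sys.exit(0)   # exit 0: check ran; pass/fail is in the JSON
    except Exception as exc:
        print(f"check errored: {exc}", file=sys.stderr)
        sys.exit(1)   # non-zero: the check itself errored; scenario becomes `error`
```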
### llm_as_judge

```yaml
invariants:
  email_tone:
    description: "Email is professional and actionable"
    weight: 0.3
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Evaluate the email for:
        1. Professional tone (no casual language)
        2. Actionable content (clear next steps)
        3. Accurate numbers
      input_from: smtp.last_request.body   # what to judge
      rubric:
        pass: "Professional, accurate, actionable"
        fail: "Casual, missing numbers, or unclear"
      temperature: 0
      pass_threshold: 0.7   # optional; default 0.5
```

Runs the named model with the criteria and input, and asks for a JSON `{score, passed, reason}` response. The score becomes the invariant's score; `pass_threshold` decides whether it counts as `passed: true`.
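For illustration, a judge response has this shape (the field set comes from the description above; the values here are invented):

```json
{
  "score": 0.8,
  "passed": true,
  "reason": "Professional tone and clear next steps; one figure could not be verified."
}
```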
| Field | Notes |
|---|---|
| `model` | `paragon-fast`, `paragon-max`, or any provider model — see LLM-as-Judge |
| `criteria` | What to evaluate (free text) |
| `input_from` | Where to read input — `agent_output`, `stdout`, `workspace`, or `<service>.<field>` |
| `rubric` | Optional `{pass, fail}` pair shown to the judge |
| `temperature` | Default 0 (deterministic) |
| `pass_threshold` | Score ≥ this passes. Default 0.5 |
## Weights and gates

The composite score is:

```
composite = sum(weight * score) / sum(weight)
```

Weights are relative — they don't have to sum to 1.0. Larger weights matter more.
```yaml
invariants:
  must_pass:
    weight: 1.0
    gate: true   # if this fails, composite = 0 regardless
    check: { ... }
  nice_to_have:
    weight: 0.3
    check: { ... }
scoring:
  pass_threshold: 0.85
```

If `must_pass` fails, the composite is forced to 0. If only `nice_to_have` fails, the composite is (1.0 × 1) / (1.0 + 0.3) ≈ 0.77 — below the 0.85 threshold, so the scenario fails.
Gate invariants are how you say "this MUST be true." Use them sparingly — multiple gates effectively AND, which can make a spec brittle.
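To make the arithmetic concrete, a small Python sketch of the scoring rule (illustrative only, not Keystone's internals):

```python
def composite_score(results: list[dict]) -> float:
    """results: [{"weight": float, "score": float, "gate": bool}, ...]"""
    # A failing gate forces the composite to 0, whatever the other scores are.
    if any(r["gate"] and r["score"] < 1.0 for r in results):
        return 0.0
    return (sum(r["weight"] * r["score"] for r in results)
            / sum(r["weight"] for r in results))

# The example above: must_pass (w=1.0) passes, nice_to_have (w=0.3) fails.
print(composite_score([
    {"weight": 1.0, "score": 1.0, "gate": True},
    {"weight": 0.3, "score": 0.0, "gate": False},
]))  # 0.769..., below the 0.85 threshold, so the scenario fails
```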
## Patterns

### "All tests pass" gate + nice-to-have quality checks

```yaml
invariants:
  tests_pass:
    description: "All unit tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
  no_console_log:
    description: "No debug logs left in source"
    weight: 0.3
    check:
      type: file_content
      path: "src/main.ts"
      not_contains: "console.log"
  small_diff:
    description: "Diff is small"
    weight: 0.2
    check:
      type: command_exit
      command: "[ $(git diff --stat | tail -1 | awk '{print $4+$6}') -lt 50 ]"
scoring:
  pass_threshold: 0.85
```

If tests fail, the composite is 0 and the scenario fails. If tests pass and everything else is perfect, the composite is 1.0 — pass. If tests pass but `no_console_log` fails, the composite is 1.2 / 1.5 = 0.80 — fail; if only `small_diff` fails, it's 1.3 / 1.5 ≈ 0.87, which still clears the 0.85 threshold.
### SQL state + HTTP mock assertion

```yaml
invariants:
  reconciliation_clean:
    description: "No mismatches"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM customers_a a LEFT JOIN customers_b b USING (id) WHERE a.email != b.email"
      equals: 0
  email_sent_once:
    description: "Exactly one summary email"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
  email_quality:
    description: "Email is well-written"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Professional, accurate, actionable"
      input_from: smtp.last_request.body
      pass_threshold: 0.7
```

The classic pattern: end-state correctness (SQL), behavioral correctness (HTTP), subjective quality (LLM-as-judge).
### Custom Python invariant

```yaml
invariants:
  api_responses_valid:
    description: "Every API response is valid JSON with required fields"
    weight: 0.5
    check:
      type: custom
      script: checks/validate_api_responses.py
      runs_in: sandbox
```

```python
# checks/validate_api_responses.py
import json
import sys

ctx = json.load(sys.stdin)
audit_path = ctx["audit_log_path"]

valid = 0
total = 0
for line in open(audit_path):
    event = json.loads(line)
    if event["type"] != "http_call":
        continue
    total += 1
    response = event["details"].get("response_body", "")
    try:
        data = json.loads(response)
        if "id" in data and "status" in data:
            valid += 1
    except json.JSONDecodeError:
        pass

passed = total > 0 and valid == total
print(json.dumps({
    "passed": passed,
    "score": valid / total if total else 0,
    "reason": f"{valid}/{total} responses valid",
}))
```

## Scoring summary
For each scenario:

- Run every invariant. Each produces a score in [0, 1].
- If any `gate: true` invariant has a score < 1.0, composite = 0.
- Else, composite = sum(weight × score) / sum(weight).
- Pass if composite ≥ `scoring.pass_threshold`; else fail.
For multi-replica or matrix experiments, `scoring.replica_aggregation` decides how replica-level pass/fail combines into the scenario's verdict — see Spec Reference.