Scoring

Invariants

Pass/fail checks that run after the agent finishes. Eight built-in types, plus weights, gates, and composite scoring.

An invariant is a postcondition: a yes/no check that runs after your agent finishes. Each one returns a score in [0, 1] (pass = 1, fail = 0; LLM-as-judge produces continuous scores). The composite score is the weighted average of all invariants — and if any "gate" invariant fails, the scenario fails outright regardless of the rest.

Anatomy

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0                 # relative importance in composite
    gate: true                   # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"

Field        Type    Required  Default
description  string  yes
weight       float   no        1.0
gate         bool    no        false
check        object  yes

check.type selects the check, and the rest of check is type-specific.

The eight check types

Type                  Asks
command_exit          Does the command return the expected exit code?
file_exists           Does this file exist?
file_absent           Is this file missing?
file_content          Does this file contain (or not contain) a substring or regex?
sql                   Does this SQL query return the expected value?
http_mock_assertions  Was this mock service called the right way?
custom                Did your Python script return passed: true?
llm_as_judge          Did the model rate the output above the threshold?

command_exit

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
      exit_code: 0              # optional; default 0

Runs the command via sh -c from the workspace. Passes if the exit code matches. Stdout and stderr are captured and shown in the failure message.
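
Setting exit_code also lets you assert that a command fails. A minimal sketch; the grep invocation and path are illustrative:

invariants:
  no_todo_markers:
    description: "No TODO markers remain in source"
    check:
      type: command_exit
      command: "grep -r 'TODO' src/"
      exit_code: 1              # grep exits 1 when nothing matches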

file_exists / file_absent

invariants:
  output_created:
    description: "Output file exists"
    check:
      type: file_exists
      path: "output.json"
 
  no_temp_files:
    description: "Temp file cleaned up"
    check:
      type: file_absent
      path: ".tmp/cache"

Path is relative to the workspace.

file_content

invariants:
  output_correct:
    description: "Output contains success marker"
    check:
      type: file_content
      path: "output.json"
      contains: '"status": "success"'      # substring match
 
  no_debug_logs:
    description: "No console.log left in source"
    check:
      type: file_content
      path: "src/main.ts"
      not_contains: "console.log"
 
  matches_pattern:
    description: "Email follows expected format"
    check:
      type: file_content
      path: "draft.txt"
      pattern: "^Subject: Renewal\\b"     # regex (matches anywhere unless ^/$)

Field         What it does
contains      Substring must appear
not_contains  Substring must NOT appear
pattern       Regex must match

You can combine them — all conditions must be true.
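
For example, one check can require a marker, forbid a leftover, and enforce a format at the same time. A sketch with illustrative file name and strings:

invariants:
  report_well_formed:
    description: "Report has a title, no placeholders, and a date"
    check:
      type: file_content
      path: "report.md"
      contains: "# Quarterly Report"
      not_contains: "TBD"
      pattern: "Date: \\d{4}-\\d{2}-\\d{2}"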

sql

invariants:
  reconciliation_clean:
    description: "No mismatches between A and B"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db                       # name from services[]
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0

Runs the query against the named service (which must be a Postgres or MySQL container) and passes if the first column of the first row equals the equals value. Use it for any "was this state reached?" question.
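
The same shape covers existence checks: count the rows you expect and compare. A sketch with illustrative table and column names:

invariants:
  order_archived:
    description: "Order 42 ended up in the archive"
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM archived_orders WHERE id = 42"
      equals: 1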

http_mock_assertions

invariants:
  email_sent_once:
    description: "Exactly one renewal email"
    check:
      type: http_mock_assertions
      service: smtp                       # must have `record: true`
      assertions:
        - field: request_count
          filters:
            method: POST
            path: "/api/messages"
          equals: 1
        - field: last_request.body
          contains: "renewal"

Runs against an http_mock service with record: true. Each assertion checks one field — if any fails, the invariant fails.

Field                        What it returns
request_count                Number of matching requests
last_request.body            Body of the most recent matching request
last_request.headers         Headers map
requests[N]                  Nth request (zero-indexed)
requests[N].body / .headers  Sub-fields of the Nth request

filters narrows which requests are counted: by method, by path, and by headers.
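
A sketch of a filtered count. The method and path filters follow the example above; the exact syntax for header filters is not pinned down here, so the headers map is an assumption to verify against the Spec Reference:

invariants:
  authed_calls_only:
    description: "Every message was sent with the API key"
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters:
            method: POST
            path: "/api/messages"
            headers:                      # hypothetical header-filter syntax
              Authorization: "Bearer test-token"
          equals: 2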

custom

invariants:
  summary_accuracy:
    description: "Numbers in email match log"
    weight: 0.3
    check:
      type: custom
      script: checks/verify_summary.py
      runs_in: host                # host (default) or sandbox

The script reads KeystoneContext JSON on stdin and prints CheckResult JSON on stdout.

Input on stdin:

{
  "sandbox_id": "sb-abc",
  "workspace_path": "/var/keystone/workspaces/sb-abc",
  "services": { "db": {"host": "...", "port": 5432} },
  "audit_log_path": "/var/keystone/workspaces/sb-abc/.keystone/audit.jsonl",
  "trace_path": "/var/keystone/workspaces/sb-abc/.keystone/trace.jsonl",
  "task": { "prompt": "..." },
  "scenario_params": { "model": "claude-sonnet-4-5" }
}

Output on stdout:

{
  "passed": true,
  "score": 1.0,
  "reason": "All numbers match",
  "details": { "matched": 5, "total": 5 }
}

Exit code  Meaning
0          Check ran successfully — pass/fail in JSON
≠ 0        Check itself errored — scenario status becomes error

runs_in: host runs the script on the Keystone server (with read-only access to the sandbox). runs_in: sandbox runs it inside the sandbox container — useful when the script needs python deps the agent already installed.

llm_as_judge

invariants:
  email_tone:
    description: "Email is professional and actionable"
    weight: 0.3
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Evaluate the email for:
        1. Professional tone (no casual language)
        2. Actionable content (clear next steps)
        3. Accurate numbers
      input_from: smtp.last_request.body   # what to judge
      rubric:
        pass: "Professional, accurate, actionable"
        fail: "Casual, missing numbers, or unclear"
      temperature: 0
      pass_threshold: 0.7        # optional; default 0.5

Runs the named model with the criteria and the input, and asks for a JSON {score, passed, reason} response. The score becomes the invariant's score; pass_threshold decides whether it counts as passed: true.

Field           Notes
model           paragon-fast, paragon-max, or any provider model — see LLM-as-Judge
criteria        What to evaluate (free text)
input_from      Where to read input — agent_output, stdout, workspace, or <service>.<field>
rubric          Optional {pass, fail} pair shown to the judge
temperature     Default 0 (deterministic)
pass_threshold  Score ≥ this passes. Default 0.5
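
input_from is not limited to mock traffic: agent_output judges whatever the agent produced as its final answer. A minimal variant; the criteria text is illustrative:

invariants:
  answer_grounded:
    description: "Final answer cites its sources"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "The answer must reference the specific log entries it summarizes."
      input_from: agent_output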

Weights and gates

The composite score is:

composite = sum(weight * score) / sum(weight)

Weights are relative — they don't have to sum to 1.0. Larger weights matter more.

invariants:
  must_pass:
    weight: 1.0
    gate: true                  # if this fails, composite = 0 regardless
    check: { ... }
 
  nice_to_have:
    weight: 0.3
    check: { ... }
 
scoring:
  pass_threshold: 0.85

If must_pass fails, the composite is forced to 0. If only nice_to_have fails, the composite is (1.0 × 1) / (1.0 + 0.3) ≈ 0.77 — below the 0.85 threshold, so the scenario fails.

Gate invariants are how you say "this MUST be true." Use them sparingly — multiple gates effectively AND together, which can make a spec brittle.

Patterns

"All tests pass" gate + nice-to-have quality checks

invariants:
  tests_pass:
    description: "All unit tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
 
  no_console_log:
    description: "No debug logs left in source"
    weight: 0.3
    check:
      type: file_content
      path: "src/main.ts"
      not_contains: "console.log"
 
  small_diff:
    description: "Diff is small"
    weight: 0.2
    check:
      type: command_exit
      command: "[ $(git diff --stat | tail -1 | awk '{s=$4+$6} END {print s+0}') -lt 50 ]"
 
scoring:
  pass_threshold: 0.85

If tests fail → composite is 0, scenario fails. If tests pass + everything else perfect → composite is 1.0, pass. If tests pass but no_console_log fails → composite is (1.0 + 0.2) / 1.5 = 0.80, fail. If only small_diff fails → (1.0 + 0.3) / 1.5 ≈ 0.87, pass.

SQL state + HTTP mock assertion

invariants:
  reconciliation_clean:
    description: "No mismatches"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM customers_a a LEFT JOIN customers_b b USING (id) WHERE a.email != b.email"
      equals: 0
 
  email_sent_once:
    description: "Exactly one summary email"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  email_quality:
    description: "Email is well-written"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Professional, accurate, actionable"
      input_from: smtp.last_request.body
      pass_threshold: 0.7

The classic pattern: end-state correctness (SQL), behavior correctness (HTTP), subjective quality (LLM-judge).

Custom Python invariant

invariants:
  api_responses_valid:
    description: "Every API response is valid JSON with required fields"
    weight: 0.5
    check:
      type: custom
      script: checks/validate_api_responses.py
      runs_in: sandbox

# checks/validate_api_responses.py
import json
import sys

# KeystoneContext arrives as JSON on stdin.
ctx = json.load(sys.stdin)
audit_path = ctx["audit_log_path"]

# Count how many http_call events in the audit log carry a valid JSON
# response body with the required fields.
valid = 0
total = 0
with open(audit_path) as f:
    for line in f:
        event = json.loads(line)
        if event["type"] != "http_call":
            continue
        total += 1
        response = event["details"].get("response_body", "")
        try:
            data = json.loads(response)
            if "id" in data and "status" in data:
                valid += 1
        except json.JSONDecodeError:
            pass

# Pass only if at least one call was made and every response was valid.
passed = total > 0 and valid == total
print(json.dumps({
    "passed": passed,
    "score": valid / total if total else 0,
    "reason": f"{valid}/{total} responses valid",
}))

Scoring summary

For each scenario:

  1. Run every invariant. Each produces a score in [0, 1].
  2. If any gate: true invariant has score < 1.0, composite = 0.
  3. Else, composite = sum(weight × score) / sum(weight).
  4. Pass if composite ≥ scoring.pass_threshold. Else fail.
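
A minimal Python sketch of this loop, assuming each invariant arrives as a dict with weight, gate, and score already populated:

def composite_score(invariants, pass_threshold):
    # Step 2: any failed gate zeroes the composite outright.
    if any(inv["gate"] and inv["score"] < 1.0 for inv in invariants):
        return 0.0, False
    # Step 3: weighted average of all invariant scores.
    total = sum(inv["weight"] for inv in invariants)
    composite = sum(inv["weight"] * inv["score"] for inv in invariants) / total
    # Step 4: compare against the threshold.
    return composite, composite >= pass_threshold

# The worked example from "Weights and gates": 1.0 / 1.3 ≈ 0.77, below 0.85.
score, passed = composite_score(
    [{"weight": 1.0, "gate": True, "score": 1.0},
     {"weight": 0.3, "gate": False, "score": 0.0}],
    pass_threshold=0.85,
)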

For multi-replica or matrix experiments, scoring.replica_aggregation decides how replica-level pass/fail combines into the scenario's verdict — see Spec Reference.