Scoring

Invariants

Pass/fail checks that run after the agent finishes. Eight built-in types, plus weights, gates, and composite scoring.

An invariant is a postcondition: a yes/no check that runs after your agent finishes. Each one returns a score in [0, 1] (pass = 1, fail = 0; LLM-as-judge produces continuous scores). The composite score is the weighted average of all invariants — and if any "gate" invariant fails, the scenario fails outright regardless of the rest.

Anatomy

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0                 # relative importance in composite
    gate: true                   # hard fail if this fails
    check:
      type: command_exit
      command: "npm test"

Field        Type    Required  Default
description  string  yes
weight       float   no        1.0
gate         bool    no        false
check        object  yes

check.type selects the check, and the rest of check is type-specific.

The eight check types

Type                  Asks
command_exit          Does the command return the expected exit code?
file_exists           Does this file exist?
file_absent           Is this file missing?
file_content          Does this file contain (or not contain) a substring or regex?
sql                   Does this SQL query return the expected value?
http_mock_assertions  Was this mock service called the right way?
custom                Did your Python script return passed: true?
llm_as_judge          Did the model rate the output above the threshold?

command_exit

invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
      exit_code: 0              # optional; default 0

Runs the command via sh -c from the workspace. Passes if the exit code matches. Stdout and stderr are captured and shown in the failure message.
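
Setting exit_code also lets you assert that a command fails. A minimal sketch; the grep invocation and path are illustrative:

invariants:
  no_todo_markers:
    description: "No TODO markers remain in source"
    check:
      type: command_exit
      command: "grep -r 'TODO' src/"
      exit_code: 1              # grep exits 1 when nothing matches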

file_exists / file_absent

invariants:
  output_created:
    description: "Output file exists"
    check:
      type: file_exists
      path: "output.json"
 
  no_temp_files:
    description: "Temp file cleaned up"
    check:
      type: file_absent
      path: ".tmp/cache"

Path is relative to the workspace.

file_content

invariants:
  output_correct:
    description: "Output contains success marker"
    check:
      type: file_content
      path: "output.json"
      contains: '"status": "success"'      # substring match
 
  no_debug_logs:
    description: "No console.log left in source"
    check:
      type: file_content
      path: "src/main.ts"
      not_contains: "console.log"
 
  matches_pattern:
    description: "Email follows expected format"
    check:
      type: file_content
      path: "draft.txt"
      pattern: "^Subject: Renewal\\b"     # regex (matches anywhere unless ^/$)

Field         What it does
contains      Substring must appear
not_contains  Substring must NOT appear
pattern       Regex must match

You can combine them — all conditions must be true.
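
For example, one check can require a marker, forbid a leftover, and enforce a format at the same time. A sketch with illustrative file name and strings:

invariants:
  report_well_formed:
    description: "Report has a title, no placeholders, and a date"
    check:
      type: file_content
      path: "report.md"
      contains: "# Quarterly Report"
      not_contains: "TBD"
      pattern: "Date: \\d{4}-\\d{2}-\\d{2}"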

sql

invariants:
  reconciliation_clean:
    description: "No mismatches between A and B"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db                       # name from services[]
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0

Runs the query against the named service (which must be a Postgres or MySQL container) and passes if the first column of the first row equals the equals value. Use it for any "was this state reached?" question.
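
The same shape covers existence checks: count the rows you expect and compare. A sketch with illustrative table and column names:

invariants:
  order_archived:
    description: "Order 42 ended up in the archive"
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM archived_orders WHERE id = 42"
      equals: 1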

http_mock_assertions

invariants:
  email_sent_once:
    description: "Exactly one renewal email"
    check:
      type: http_mock_assertions
      service: smtp                       # must have `record: true`
      assertions:
        - field: request_count
          filters:
            method: POST
            path: "/api/messages"
          equals: 1
        - field: last_request.body
          contains: "renewal"

Runs against an http_mock service with record: true. Each assertion checks one field — if any fails, the invariant fails.

Field                        What it returns
request_count                Number of matching requests
last_request.body            Body of the most recent matching request
last_request.headers         Headers map
requests[N]                  Nth request (zero-indexed)
requests[N].body / .headers  Sub-fields of the Nth request

filters narrows which requests are counted: by method, by path, and by headers.
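
A sketch of a filtered count. The method and path filters follow the example above; the exact syntax for header filters is not pinned down here, so the headers map is an assumption to verify against the Spec Reference:

invariants:
  authed_calls_only:
    description: "Every message was sent with the API key"
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters:
            method: POST
            path: "/api/messages"
            headers:                      # hypothetical header-filter syntax
              Authorization: "Bearer test-token"
          equals: 2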

custom

invariants:
  summary_accuracy:
    description: "Numbers in email match log"
    weight: 0.3
    check:
      type: custom
      script: checks/verify_summary.py
      runs_in: host                # host (default) or sandbox

The script reads KeystoneContext JSON on stdin and prints CheckResult JSON on stdout.

Input on stdin:

{
  "sandbox_id": "sb-abc",
  "workspace_path": "/var/keystone/workspaces/sb-abc",
  "services": { "db": {"host": "...", "port": 5432} },
  "audit_log_path": "/var/keystone/workspaces/sb-abc/.keystone/audit.jsonl",
  "trace_path": "/var/keystone/workspaces/sb-abc/.keystone/trace.jsonl",
  "task": { "prompt": "..." },
  "scenario_params": { "model": "claude-sonnet-4-5" }
}

Output on stdout:

{
  "passed": true,
  "score": 1.0,
  "reason": "All numbers match",
  "details": { "matched": 5, "total": 5 }
}

Exit code  Meaning
0          Check ran successfully — pass/fail in JSON
≠ 0        Check itself errored — scenario status becomes error

runs_in: host runs the script on the Keystone server (with read-only access to the sandbox). runs_in: sandbox runs it inside the sandbox container — useful when the script needs python deps the agent already installed.

llm_as_judge

invariants:
  email_tone:
    description: "Email is professional and actionable"
    weight: 0.3
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Evaluate the email for:
        1. Professional tone (no casual language)
        2. Actionable content (clear next steps)
        3. Accurate numbers
      input_from: smtp.last_request.body   # what to judge
      rubric:
        pass: "Professional, accurate, actionable"
        fail: "Casual, missing numbers, or unclear"
      temperature: 0
      pass_threshold: 0.7        # optional; default 0.5

Runs the named model with the criteria and the input, and asks for a JSON {score, passed, reason} response. The score becomes the invariant's score; pass_threshold decides whether it counts as passed: true.

Field           Notes
model           paragon-fast, paragon-max, or any provider model — see LLM-as-Judge
criteria        What to evaluate (free text)
input_from      Where to read input — agent_output, stdout, workspace, or <service>.<field>
rubric          Optional {pass, fail} pair shown to the judge
temperature     Default 0 (deterministic)
pass_threshold  Score ≥ this passes. Default 0.5
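
input_from is not limited to mock traffic: agent_output judges whatever the agent produced as its final answer. A minimal variant; the criteria text is illustrative:

invariants:
  answer_grounded:
    description: "Final answer cites its sources"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "The answer must reference the specific log entries it summarizes."
      input_from: agent_output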

Weights and gates

The composite score is:

composite = sum(weight * score) / sum(weight)

Weights are relative — they don't have to sum to 1.0. Larger weights matter more.

invariants:
  must_pass:
    weight: 1.0
    gate: true                  # if this fails, composite = 0 regardless
    check: { ... }
 
  nice_to_have:
    weight: 0.3
    check: { ... }
 
scoring:
  pass_threshold: 0.85

If must_pass fails, the composite is forced to 0. If only nice_to_have fails, the composite is (1.0 × 1) / (1.0 + 0.3) ≈ 0.77 — below the 0.85 threshold, so the scenario fails.

Gate invariants are how you say "this MUST be true." Use them sparingly — multiple gates effectively AND together, which can make a spec brittle.

Patterns

"All tests pass" gate + nice-to-have quality checks

invariants:
  tests_pass:
    description: "All unit tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
 
  no_console_log:
    description: "No debug logs left in source"
    weight: 0.3
    check:
      type: file_content
      path: "src/main.ts"
      not_contains: "console.log"
 
  small_diff:
    description: "Diff is small"
    weight: 0.2
    check:
      type: command_exit
      command: "[ $(git diff --stat | tail -1 | awk '{s=$4+$6} END {print s+0}') -lt 50 ]"
 
scoring:
  pass_threshold: 0.85

If tests fail → composite is 0, scenario fails. If tests pass + everything else perfect → composite is 1.0, pass. If tests pass but no_console_log fails → composite is (1.0 + 0.2) / 1.5 = 0.80, fail. If only small_diff fails → (1.0 + 0.3) / 1.5 ≈ 0.87, pass.

SQL state + HTTP mock assertion

invariants:
  reconciliation_clean:
    description: "No mismatches"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: "SELECT count(*) FROM customers_a a LEFT JOIN customers_b b USING (id) WHERE a.email != b.email"
      equals: 0
 
  email_sent_once:
    description: "Exactly one summary email"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  email_quality:
    description: "Email is well-written"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Professional, accurate, actionable"
      input_from: smtp.last_request.body
      pass_threshold: 0.7

The classic pattern: end-state correctness (SQL), behavior correctness (HTTP), subjective quality (LLM-judge).

Custom Python invariant

invariants:
  api_responses_valid:
    description: "Every API response is valid JSON with required fields"
    weight: 0.5
    check:
      type: custom
      script: checks/validate_api_responses.py
      runs_in: sandbox

# checks/validate_api_responses.py
import json
import sys

# KeystoneContext arrives as JSON on stdin.
ctx = json.load(sys.stdin)
audit_path = ctx["audit_log_path"]

# Count how many http_call events in the audit log carry a valid JSON
# response body with the required fields.
valid = 0
total = 0
with open(audit_path) as f:
    for line in f:
        event = json.loads(line)
        if event["type"] != "http_call":
            continue
        total += 1
        response = event["details"].get("response_body", "")
        try:
            data = json.loads(response)
            if "id" in data and "status" in data:
                valid += 1
        except json.JSONDecodeError:
            pass

# Pass only if at least one call was made and every response was valid.
passed = total > 0 and valid == total
print(json.dumps({
    "passed": passed,
    "score": valid / total if total else 0,
    "reason": f"{valid}/{total} responses valid",
}))

Scoring summary

For each scenario:

  1. Run every invariant. Each produces a score in [0, 1].
  2. If any gate: true invariant has score < 1.0, composite = 0.
  3. Else, composite = sum(weight × score) / sum(weight).
  4. Pass if composite ≥ scoring.pass_threshold. Else fail.
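
A minimal Python sketch of this loop, assuming each invariant arrives as a dict with weight, gate, and score already populated:

def composite_score(invariants, pass_threshold):
    # Step 2: any failed gate zeroes the composite outright.
    if any(inv["gate"] and inv["score"] < 1.0 for inv in invariants):
        return 0.0, False
    # Step 3: weighted average of all invariant scores.
    total = sum(inv["weight"] for inv in invariants)
    composite = sum(inv["weight"] * inv["score"] for inv in invariants) / total
    # Step 4: compare against the threshold.
    return composite, composite >= pass_threshold

# The worked example from "Weights and gates": 1.0 / 1.3 ≈ 0.77, below 0.85.
score, passed = composite_score(
    [{"weight": 1.0, "gate": True, "score": 1.0},
     {"weight": 0.3, "gate": False, "score": 0.0}],
    pass_threshold=0.85,
)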

For multi-replica or matrix experiments, scoring.replica_aggregation decides how replica-level pass/fail combines into the scenario's verdict — see Spec Reference.