Examples
Real-world Keystone specs from hello-world to production reconciliation pipelines, with copy-pasteable YAML and explanations.
Specs in this page are ordered from simple to complex. Start with the hello-world to verify your setup, then work up to real-world scenarios. Every example is a complete spec — copy, paste, adapt.
Table of contents
- Hello world (file write)
- Build a REST API
- Bug fix in a linked-list
- Postgres schema design
- Security review
- Reconciliation pipeline (full real-world)
- Multi-language matrix
Hello world
The smoke test. Verifies the agent can write a file and the file has the right content.
version: 1
id: hello-world
description: "Smoke test — does the agent write a file?"
base: "ubuntu:24.04"
task:
prompt: "Create a file called hello.txt with the text 'Hello, World!' inside."
agent:
type: paragon
timeout: 5m
invariants:
hello_file_exists:
description: "hello.txt was created"
weight: 1.0
gate: true
check:
type: file_exists
path: hello.txt
hello_file_correct:
description: "hello.txt contains the right text"
weight: 1.0
check:
type: file_content
path: hello.txt
contains: "Hello, World!"
scoring:
pass_threshold: 1.0Run:
ks eval run hello.yamlExpected output:
{
"passed": 1,
"failed": 0,
"metrics": { "pass_rate": 1.0, "mean_wall_ms": 8200 },
"scenarios": [
{
"status": "pass",
"composite_score": 1.0,
"invariants": [
{ "name": "hello_file_exists", "passed": true, "gate": true, "weight": 1.0 },
{ "name": "hello_file_correct", "passed": true, "weight": 1.0 }
]
}
]
}If this works, you're set up correctly.
Build a REST API
Agent builds a Python HTTP server using only the standard library, then writes a test script that exercises every route.
version: 1
id: rest-api-todo
description: "Agent builds a Python REST API for a todo list, with input validation and tests"
base: "ubuntu:24.04"
setup:
packages: [python3, curl]
env:
PYTHONDONTWRITEBYTECODE: "1"
resources:
timeout: 5m
memory: 1Gi
cpu: 1
task:
prompt: |
Build a Python REST API server using only the standard library
(http.server + json). Do NOT use Flask, FastAPI, or any third-party
frameworks.
Requirements:
1. Create `server.py` that runs an HTTP server on port 8080
2. Implement these endpoints:
- GET /todos → JSON array of all todos
- POST /todos → create (body: {"title": "...", "done": false}), return 201
- GET /todos/<id> → return a single todo, 404 if missing
- DELETE /todos/<id> → delete, return 204, 404 if missing
3. Todos: id (auto-increment int), title (string), done (boolean)
4. Store todos in memory (a list or dict)
5. Validate POST input: title must be non-empty string; return 400 with
error message if invalid
6. Create `test_api.sh` that:
- Starts the server in background
- Tests all endpoints with curl
- Checks response codes and bodies
- Kills the server at the end
- Exits 0 only if all tests pass
In the test script, after starting the server in background, wait for it:
for i in $(seq 1 50); do
curl -s http://localhost:8080/todos > /dev/null 2>&1 && break
sleep 0.1
done
context:
success_criteria: server.py exists, serves correct responses, test script passes, code is clean
agent:
type: paragon
timeout: 4m
invariants:
server_file_exists:
description: "server.py was created"
weight: 1.0
gate: true
check: { type: file_exists, path: server.py }
test_script_exists:
description: "test_api.sh was created"
weight: 0.5
gate: true
check: { type: file_exists, path: test_api.sh }
server_has_routes:
description: "server.py implements the required HTTP methods"
weight: 1.0
check:
type: file_content
path: server.py
pattern: "(GET|POST|DELETE).*(todos|todo)"
server_has_validation:
description: "server.py includes input validation"
weight: 1.0
check:
type: file_content
path: server.py
contains: "400"
test_script_passes:
description: "Test script runs all API tests successfully"
weight: 3.0
gate: true
check:
type: command_exit
command: "chmod +x test_api.sh && bash test_api.sh"
exit_code: 0
code_quality:
description: "LLM judge evaluates code quality"
weight: 2.0
check:
type: llm_as_judge
model: paragon-fast
criteria: |
Evaluate this REST API:
1. Are all 4 endpoints implemented correctly?
2. Is error handling thorough (404, 400, proper status codes)?
3. Is the code well-structured?
4. Does it use only the standard library as requested?
5. Is the test script comprehensive?
input_from: workspace
rubric:
pass: "Complete API, proper error handling, clean code, comprehensive tests"
fail: "Missing endpoints, poor error handling, messy code"
temperature: 0
forbidden:
file_writes_outside: ["/workspace"]
secrets_in_logs: deny
scoring:
pass_threshold: 0.8
replica_aggregation:
strategy: percentage
min_pass_rate: 0.8
parallelism:
replicas: 5
isolation: per_run
determinism:
seed: 42
dns: static
retention:
audit_logs: 24h
traces: 7d
teardown:
export:
- type: audit_log
to: "results/{{ run_id }}/audit.jsonl"
always_run: trueThe structure to notice:
- Gates —
server_file_exists,test_script_exists,test_script_passesare allgate: true. If any one fails, the scenario fails outright. - Composite weighting —
test_script_passes(the actual end-to-end test) is weight 3.0; quality checks are weight 1–2. The test passing matters more than the code being pretty. - 5 replicas + 80% pass rate — measures consistency. Catches "passes 4/5 times" flakiness.
Bug fix in a linked-list
Agent fixes a known-buggy implementation. Tests verify the fix without changes to the test file.
version: 1
id: bugfix-linked-list
description: "Agent fixes a buggy linked-list implementation; tests must pass without test-file modifications"
base: "ubuntu:24.04"
setup:
packages: [python3, python3-pip]
files:
- path: linked_list.py
content: |
class Node:
def __init__(self, value):
self.value = value
self.next = None
class LinkedList:
def __init__(self):
self.head = None
def append(self, value):
# BUG: doesn't handle empty list
node = self.head
while node.next is not None:
node = node.next
node.next = Node(value)
def remove(self, value):
# BUG: doesn't update head if first node matches
node = self.head
while node is not None and node.next is not None:
if node.next.value == value:
node.next = node.next.next
return
node = node.next
def __len__(self):
# BUG: off-by-one
node = self.head
n = 0
while node is not None:
node = node.next
n += 1
return n - 1
- path: test_linked_list.py
content: |
from linked_list import LinkedList
def test_append_empty():
ll = LinkedList()
ll.append(1)
assert len(ll) == 1
def test_append_multiple():
ll = LinkedList()
ll.append(1)
ll.append(2)
ll.append(3)
assert len(ll) == 3
def test_remove_head():
ll = LinkedList()
ll.append(1)
ll.append(2)
ll.remove(1)
assert len(ll) == 1
assert ll.head.value == 2
def test_remove_middle():
ll = LinkedList()
ll.append(1)
ll.append(2)
ll.append(3)
ll.remove(2)
assert len(ll) == 2
commands:
- "pip install pytest"
task:
prompt: |
The file linked_list.py contains a buggy LinkedList implementation.
Fix the bugs so that all tests in test_linked_list.py pass.
Constraints:
- Do NOT modify test_linked_list.py
- Run `pytest test_linked_list.py -v` to verify
agent:
type: paragon
timeout: 5m
invariants:
test_file_unchanged:
description: "test_linked_list.py was not modified"
weight: 1.0
gate: true
check:
type: command_exit
command: |
sha256sum test_linked_list.py | grep -q $(cat .keystone/test_file_initial_hash 2>/dev/null || sha256sum test_linked_list.py | awk '{print $1}')
exit_code: 0
tests_pass:
description: "All tests pass"
weight: 5.0
gate: true
check:
type: command_exit
command: "pytest test_linked_list.py -v"
exit_code: 0
no_unrelated_changes:
description: "Diff is contained to linked_list.py"
weight: 1.0
check:
type: llm_as_judge
model: paragon-fast
criteria: |
Did the agent fix the bugs in a minimal way? Reject if unnecessary
files were created or unrelated code was added.
input_from: workspace
forbidden:
file_writes_outside: ["/workspace"]
secrets_in_logs: deny
scoring:
pass_threshold: 0.85
parallelism:
replicas: 5
determinism:
seed: 42Notice: pre-seeding test_linked_list.py via setup.files means every replica starts from the same buggy code. The test_file_unchanged invariant uses a SHA256 hash to verify the agent didn't cheat by editing the test file.
Postgres schema design
Agent designs an e-commerce database, seeds data, writes analytical queries. The DB is a real Postgres container, the agent uses psql to interact.
version: 1
id: postgres-ecommerce
description: "Agent designs an e-commerce schema, seeds data, writes analytical queries against live Postgres"
base: "ubuntu:24.04"
setup:
packages: [postgresql-client, python3]
env:
PGPASSWORD: test
PGHOST: db
PGUSER: postgres
PGDATABASE: testdb
services:
- name: db
image: postgres:16
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: testdb
ports: [5432]
wait_for: pg_isready
resources:
timeout: 6m
memory: 2Gi
cpu: 2
task:
prompt: |
Design and implement an e-commerce database in the connected PostgreSQL.
Connection:
- Host: db
- Port: 5432
- User: postgres
- Password: test
- Database: testdb
Steps:
1. Create `schema.sql` with these tables:
- customers (id, name, email, created_at)
- products (id, name, category, price, stock_quantity)
- orders (id, customer_id FK, status, total_amount, created_at)
- order_items (id, order_id FK, product_id FK, quantity, unit_price)
Use PKs, FKs, NOT NULL, UNIQUE on email, CHECK on price > 0.
2. Create `seed.sql` with realistic data:
- At least 10 customers
- At least 15 products across 3+ categories
- At least 20 orders with statuses (pending, shipped, delivered, cancelled)
- At least 40 order_items
3. Execute schema.sql and seed.sql against the database.
4. Create `queries.sql` with these named analytical queries:
a. Top 5 customers by total spending (excluding cancelled)
b. Revenue by product category, last 30 days
c. Products low on stock (qty < 10) ordered recently
d. Average order value by month
e. Customers who never placed an order
5. Execute all queries and save output to `query_results.txt`.
Use psql command-line tool to interact with the database.
context:
success_criteria: "Normalized schema, realistic data, queries execute successfully"
agent:
type: paragon
timeout: 5m
invariants:
schema_file_exists:
description: "schema.sql was created"
weight: 0.5
gate: true
check: { type: file_exists, path: schema.sql }
seed_file_exists:
description: "seed.sql was created"
weight: 0.5
check: { type: file_exists, path: seed.sql }
queries_file_exists:
description: "queries.sql was created"
weight: 0.5
check: { type: file_exists, path: queries.sql }
tables_created:
description: "All 4 tables exist in the database"
weight: 2.0
gate: true
check:
type: sql
service: db
query: |
SELECT count(*) FROM information_schema.tables
WHERE table_schema = 'public'
AND table_name IN ('customers', 'products', 'orders', 'order_items')
equals: 4
has_foreign_keys:
description: "Foreign key constraints exist"
weight: 1.5
check:
type: sql
service: db
query: |
SELECT count(*) FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY' AND table_schema = 'public'
equals: 3
sufficient_seed_data:
description: "Seed data meets minimums"
weight: 1.5
check:
type: command_exit
command: |
PGPASSWORD=test psql -h db -U postgres -d testdb -t -A -c "
SELECT
(SELECT count(*) FROM customers) >= 10 AND
(SELECT count(*) FROM products) >= 15 AND
(SELECT count(*) FROM orders) >= 20 AND
(SELECT count(*) FROM order_items) >= 40
AS sufficient;" | grep -q 't'
exit_code: 0
queries_execute:
description: "All analytical queries execute without errors"
weight: 2.0
check:
type: command_exit
command: "PGPASSWORD=test psql -h db -U postgres -d testdb -f queries.sql > /dev/null 2>&1"
exit_code: 0
schema_quality:
description: "LLM evaluates schema design and query quality"
weight: 2.0
check:
type: llm_as_judge
model: paragon-fast
criteria: |
Evaluate this database implementation:
1. Is the schema properly normalized (3NF)?
2. Are appropriate constraints used (PK, FK, NOT NULL, UNIQUE, CHECK)?
3. Is the seed data realistic (not all the same)?
4. Do the analytical queries use proper JOINs, aggregations, and filtering?
5. Does the schema support the queries efficiently?
input_from: workspace
rubric:
pass: "Well-normalized, proper constraints, realistic data, correct analytical queries"
fail: "Poorly designed, missing constraints, broken queries"
temperature: 0
forbidden:
file_writes_outside: ["/workspace"]
secrets_in_logs: deny
scoring:
pass_threshold: 0.8
replica_aggregation:
strategy: percentage
min_pass_rate: 0.8
parallelism:
replicas: 5
isolation: per_run
determinism:
seed: 42
dns: static
retention:
audit_logs: 24h
traces: 7d
teardown:
export:
- type: audit_log
to: "results/{{ run_id }}/audit.jsonl"
- type: db_dump
service: db
to: "results/{{ run_id }}/db.sql"
always_run: trueKey patterns:
- Real Postgres + auto-injected env —
PGHOST=db,PGUSER=postgresare passed viasetup.env, sopsqlworks without-h db -U postgreseverywhere. - SQL invariants —
tables_createdruns a real SQL query and asserts the count equals 4. - Per-replica isolation —
isolation: per_rungives each replica its own fresh Postgres. No state leak. - DB dump on teardown — every run exports the final DB state for offline analysis.
Security review
Agent audits Python source files for vulnerabilities, must find real bugs without false positives on clean code.
version: 1
id: security-review
description: "Agent performs security audit; must find real vulnerabilities, no false positives"
base: "ubuntu:24.04"
setup:
packages: [python3]
files:
- path: src/auth_handler.py
content: |
import hashlib, os, pickle, base64
SECRET_KEY = "mysecretkey123" # hardcoded secret
JWT_SECRET = "jwt_secret_do_not_share" # hardcoded secret
class AuthHandler:
def __init__(self):
self.users = {}
def hash_password(self, password):
return hashlib.md5(password.encode()).hexdigest() # weak hash
def login(self, username, password):
stored = self.users.get(username)
if stored and stored == self.hash_password(password):
return self._create_token(username)
return None
def _create_token(self, username):
data = {"user": username, "role": "user"}
return base64.b64encode(pickle.dumps(data)).decode() # pickle = RCE
def verify_token(self, token):
try:
data = pickle.loads(base64.b64decode(token)) # pickle = RCE
return data
except:
return None
- path: src/file_handler.py
content: |
import os, subprocess
class FileHandler:
def __init__(self, base_dir="/var/www/uploads"):
self.base_dir = base_dir
def read_file(self, filename):
path = os.path.join(self.base_dir, filename) # path traversal
with open(path, 'r') as f:
return f.read()
def process_file(self, filename, command="wc -l"):
path = os.path.join(self.base_dir, filename)
result = subprocess.run(
f"{command} {path}",
shell=True, # command injection
capture_output=True, text=True
)
return result.stdout
- path: src/data_fetcher.py
content: |
import urllib.request, ssl
import xml.etree.ElementTree as ET
class DataFetcher:
def fetch_url(self, url):
ctx = ssl.create_default_context()
ctx.check_hostname = False # SSL disabled
ctx.verify_mode = ssl.CERT_NONE
return urllib.request.urlopen(url, context=ctx).read()
def parse_xml(self, xml_string):
root = ET.fromstring(xml_string) # XXE
return {child.tag: child.text for child in root}
- path: src/utils.py
content: |
# CLEAN — must NOT be flagged
import re
def sanitize_html(text: str) -> str:
return re.sub(r'<[^>]+>', '', text)
def validate_email(email: str) -> bool:
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
return bool(re.match(pattern, email))
env:
PYTHONDONTWRITEBYTECODE: "1"
resources:
timeout: 5m
memory: 1Gi
cpu: 1
task:
prompt: |
You are a security auditor. Review ALL Python files in src/.
Steps:
1. Read each file in src/ carefully
2. Identify every real, exploitable security vulnerability
3. Write ALL findings to findings.json as a JSON array
Each finding has these fields:
{
"id": "VULN-001",
"title": "Short title",
"severity": "critical|high|medium|low",
"category": "injection|auth|crypto|ssrf|path_traversal|xss|info_disclosure",
"file": "src/filename.py",
"line_hint": "approximate line or function name",
"description": "Detailed explanation",
"exploit_scenario": "How an attacker would exploit this",
"remediation": "How to fix it"
}
IMPORTANT:
- The code has REAL vulnerabilities. Find them ALL.
- Do NOT flag style issues, performance issues, or code that is correct.
- utils.py is intentionally clean — do NOT flag it.
- Be specific about each vulnerability with clear exploit scenarios.
context:
success_criteria: "Finds 6+ real vulnerabilities, no false positives, clear remediation"
agent:
type: paragon
timeout: 4m
invariants:
findings_exists:
description: "findings.json was created"
weight: 1.0
gate: true
check: { type: file_exists, path: findings.json }
found_hardcoded_secrets:
description: "Identified hardcoded secrets in auth_handler.py"
weight: 2.0
gate: true
check:
type: file_content
path: findings.json
pattern: "(hardcoded|secret|credential|SECRET_KEY)"
found_pickle_rce:
description: "Identified unsafe pickle deserialization"
weight: 2.5
check:
type: file_content
path: findings.json
pattern: "(pickle|deseriali[sz]e|RCE|remote code)"
found_command_injection:
description: "Identified command injection in process_file"
weight: 2.5
check:
type: file_content
path: findings.json
pattern: "(command.?inject|shell.?inject|subprocess|shell=True)"
found_path_traversal:
description: "Identified path traversal in file operations"
weight: 2.0
check:
type: file_content
path: findings.json
pattern: "(path.?travers|directory.?travers|\\.\\.)"
minimum_findings:
description: "At least 6 genuine findings"
weight: 1.5
check:
type: command_exit
command: |
python3 -c "
import json
f = json.load(open('findings.json'))
print(f'{len(f)} findings')
assert len(f) >= 6, f'Only {len(f)} findings'
"
exit_code: 0
no_false_positive_utils:
description: "Did not flag the clean utils.py"
weight: 1.5
check:
type: command_exit
command: |
python3 -c "
import json
f = json.load(open('findings.json'))
bad = [x for x in f if 'utils' in str(x.get('file', ''))]
assert len(bad) == 0, f'{len(bad)} false positives on utils.py'
"
exit_code: 0
findings_quality:
description: "LLM evaluates finding quality and remediation"
weight: 2.0
check:
type: llm_as_judge
model: paragon-fast
criteria: |
Evaluate these security findings. For each:
1. Is it a genuine vulnerability (not just style)?
2. Is the severity rating appropriate?
3. Is there a clear, realistic exploit scenario?
4. Is the remediation specific and actionable?
Key vulns that should be found: hardcoded secrets, pickle RCE,
command injection, path traversal, weak hashing (MD5),
SSL verification disabled, XXE.
Score pass if 5+ findings are genuine with good descriptions.
input_from: findings.json
rubric:
pass: "5+ genuine vulns with clear exploits and actionable remediation"
fail: "Vague, missing critical vulns, or many false positives"
temperature: 0
forbidden:
file_writes_outside: ["/workspace"]
secrets_in_logs: deny
scoring:
pass_threshold: 0.75
replica_aggregation:
strategy: percentage
min_pass_rate: 0.8
parallelism:
replicas: 5
determinism:
seed: 42
retention:
audit_logs: 24h
traces: 7d
teardown:
export:
- type: audit_log
to: "results/{{ run_id }}/audit.jsonl"
always_run: trueWhat this teaches:
setup.filesfor fixture code — the buggy source files are written into the workspace at boot, no separate fixture step needed.- Per-vulnerability invariants — separate checks for each known vulnerability lets you see which the agent missed in the dashboard.
- Anti-false-positive check —
no_false_positive_utilsensures the agent doesn't just flag everything as suspicious.
Reconciliation pipeline (full real-world)
The canonical end-to-end example: agent reconciles two databases, sends a summary email, must avoid forbidden DB writes. Uses every spec block.
version: 1
id: reconciliation-scenario
description: "Agent reconciles two customer databases and emails a summary"
task:
prompt: |
Reconcile customers_a and customers_b tables, fix all mismatches,
and email a summary to finance@co.
The two tables should match after reconciliation. Send exactly one
summary email with the count of fixed rows.
context:
ticket_url: "https://linear.app/co/issue/ENG-1234"
success_criteria: "All rows match after reconciliation, exactly one summary email sent"
base: "ubuntu:24.04"
setup:
packages: [nodejs, npm, git, python3, postgresql-client]
commands:
- "npm install -g typescript"
env:
NODE_ENV: "test"
DATABASE_URL: "postgres://postgres:{{ secrets.DB_PASSWORD }}@db:5432/testdb"
resources:
timeout: "10m"
memory: "2Gi"
cpu: 2
concurrency_limit: 5
fixtures:
- type: sql
service: db
sql: |
CREATE TABLE customers_a (
id INT PRIMARY KEY,
email TEXT NOT NULL,
name TEXT NOT NULL,
plan TEXT NOT NULL
);
CREATE TABLE customers_b (
id INT PRIMARY KEY,
email TEXT NOT NULL,
name TEXT NOT NULL,
plan TEXT NOT NULL
);
CREATE TABLE reconciliation_log (
id SERIAL PRIMARY KEY,
ts TIMESTAMP DEFAULT now(),
action TEXT NOT NULL,
row_id INT NOT NULL
);
INSERT INTO customers_a VALUES
(1, 'alice@co', 'Alice', 'pro'),
(2, 'ben@co', 'Ben', 'free'),
(3, 'carol@co', 'Carol', 'pro'),
(4, 'dan@co', 'Dan', 'pro'),
(5, 'eve@co', 'Eve', 'enterprise');
INSERT INTO customers_b SELECT * FROM customers_a;
- type: drift
target: db.customers_a
strategy: random_mismatches
count: 15
seed: "{{ determinism.seed }}"
services:
- name: db
image: postgres:16
env:
POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
POSTGRES_DB: testdb
ports: [5432]
wait_for: "pg_isready"
- name: smtp
image: mailhog/mailhog
ports: [1025, 8025] # 1025 = SMTP, 8025 = HTTP API for assertions
- name: stripe-mock
type: http_mock
record: true
default_response: 404
routes:
- method: GET
path: /v1/customers
response: '{"customers": []}'
status: 200
- method: POST
path: /v1/charges
response: '{"id":"ch_test","status":"succeeded"}'
status: 200
secrets:
- name: ANTHROPIC_API_KEY
source: env
- name: DB_PASSWORD
from: generated
scope: env
network:
egress:
default: deny
allow: ["*.services.internal"]
dns_overrides:
api.stripe.com: stripe-mock.services.internal
smtp.sendgrid.net: smtp.services.internal
audit:
db_writes: true
http_calls: true
process_spawns: true
stdout_capture: true
file_system:
watch: [/workspace]
track: [writes, deletes]
snapshots:
before_run: true
checkpoints: per_action
retain_on: [failure]
agent:
type: paragon
model: paragon-max
args: ["--yolo", "--quiet"]
timeout: 5m
invariants:
databases_match:
description: "No drift between customers_a and customers_b"
weight: 0.5
gate: true
check:
type: sql
service: db
query: |
SELECT count(*) FROM customers_a a
LEFT JOIN customers_b b ON a.id = b.id
WHERE b.id IS NULL OR a.email != b.email OR a.plan != b.plan
equals: 0
reconciliation_log_populated:
description: "Every fix was logged"
weight: 0.3
check:
type: sql
service: db
query: "SELECT count(*) >= 1 FROM reconciliation_log"
equals: true
email_sent_once:
description: "Exactly one summary email to finance@co"
weight: 0.3
check:
type: http_mock_assertions
service: smtp
assertions:
- field: request_count
filters: { to: "finance@co" }
equals: 1
email_mentions_count:
description: "Email body mentions row count"
weight: 0.2
check:
type: http_mock_assertions
service: smtp
assertions:
- field: last_request.body
contains: "rows"
code_quality:
description: "Solution is clean and idiomatic"
weight: 0.2
check:
type: llm_as_judge
model: paragon-fast
criteria: |
Evaluate the reconciliation code:
1. Does it handle edge cases (empty tables, duplicates)?
2. Is the SQL idiomatic (proper JOIN/UPDATE)?
3. Is logging structured (timestamps, action types)?
input_from: workspace
rubric:
pass: "Clean, handles edge cases, structured logging"
fail: "Messy, fragile, or incomplete"
temperature: 0
forbidden:
db_writes_outside: [customers_a, customers_b, reconciliation_log]
http_except: [smtp, stripe-mock]
secrets_in_logs: deny
file_writes_outside: [/workspace/src, /workspace/logs]
scoring:
pass_threshold: 0.95
replica_aggregation:
strategy: all_must_pass
min_pass_rate: 0.90
parallelism:
replicas: 3
isolation: per_run
matrix:
- { input_size: small, locale: en_US }
- { input_size: large, locale: en_US }
determinism:
clock: "2026-01-01T00:00:00Z"
seed: 42
network_latency: 0ms
dns: static
retention:
audit_logs: 24h
snapshots: 7d
teardown_exports: 30d
traces: 30d
teardown:
export:
- type: audit_log
to: "results/{{ run_id }}/audit.jsonl"
- type: db_dump
service: db
to: "results/{{ run_id }}/db.sql"
- type: mock_requests
service: stripe-mock
to: "results/{{ run_id }}/stripe-calls.jsonl"
- type: snapshot
to: "results/{{ run_id }}/final-state/"
always_run: trueEvery block at work:
fixtures(drift) — corrupts 15 rows ofcustomers_adeterministically (same seed = same drift).services(3 of them) — Postgres + MailHog (SMTP) + http_mock for Stripe.secrets(mixed sources) —ANTHROPIC_API_KEYfrom local env,DB_PASSWORDgenerated per-run.network(default-deny + DNS overrides) — agent can't hit real Stripe; DNS sendsapi.stripe.comto the mock.audit(everything) — DB writes + HTTP calls + processes + filesystem all captured.forbidden(4 rules) — only writes to allowed tables, only HTTP to allowed services, no secrets in logs, no file writes outside/workspace/srcand/workspace/logs.parallelism(matrix × replicas) — 2 matrix entries × 3 replicas = 6 scenarios.determinism(full pinning) — frozen clock, fixed seed, static DNS.teardown(4 export types) — audit log + DB dump + mock requests + final state snapshot, all retained.
This is the spec to copy when you're building anything real.
Multi-language matrix
Compare an agent across multiple language stacks with different fixture data:
version: 1
id: language-matrix
description: "Same task across Node, Python, Go to compare agent behavior per language"
base: "ubuntu:24.04"
setup:
packages: [nodejs, npm, python3, python3-pip, golang]
task:
prompt: |
Implement a CLI program that reads a CSV file from data.csv and
prints the count of distinct values in column "{{ matrix.column }}".
Use {{ matrix.lang }} as the implementation language.
Save the program as `{{ matrix.filename }}`.
Make it executable / runnable as: {{ matrix.run_cmd }}.
setup:
files:
- path: data.csv
content: |
id,name,department
1,Alice,Engineering
2,Ben,Sales
3,Carol,Engineering
4,Dan,Marketing
5,Eve,Sales
6,Frank,Engineering
agent:
type: paragon
timeout: 5m
invariants:
output_correct:
description: "Program prints the correct count"
weight: 5.0
gate: true
check:
type: command_exit
command: "{{ matrix.run_cmd }} | grep -q '{{ matrix.expected_count }}'"
exit_code: 0
uses_stdlib_only:
description: "Solution uses only the stdlib (no third-party packages)"
weight: 1.0
check:
type: llm_as_judge
model: paragon-fast
criteria: |
Does {{ matrix.filename }} use only {{ matrix.lang }} stdlib?
No third-party packages, no extra dependencies.
input_from: workspace
rubric:
pass: "Stdlib only"
fail: "Imports a third-party package"
forbidden:
file_writes_outside: ["/workspace"]
scoring:
pass_threshold: 1.0
replica_aggregation:
strategy: all_must_pass
parallelism:
replicas: 3
matrix:
- lang: "Python"
filename: "main.py"
run_cmd: "python3 main.py"
column: "department"
expected_count: "3"
- lang: "Node.js"
filename: "main.js"
run_cmd: "node main.js"
column: "department"
expected_count: "3"
- lang: "Go"
filename: "main.go"
run_cmd: "go run main.go"
column: "department"
expected_count: "3"
determinism:
seed: 423 matrix entries × 3 replicas = 9 scenarios. The dashboard slices by matrix params, so you can see "passes 100% of Python runs but only 67% of Go" and dig in.
More patterns
Pre-commit hook smoke test
version: 1
id: precommit-smoke
description: "1 replica, fail-fast — for pre-commit hooks where speed matters"
base: "ubuntu:24.04"
task: { prompt: "Make src/utils.py pass `ruff check` cleanly." }
agent: { type: paragon, timeout: 2m }
invariants:
ruff_passes:
description: "ruff check passes"
weight: 1.0
gate: true
check:
type: command_exit
command: "ruff check src/utils.py"
exit_code: 0
scoring:
pass_threshold: 1.0
parallelism:
replicas: 1Use as a pre-commit hook: ks eval run .keystone/precommit.yaml — if it fails, the commit is blocked.
Flakiness audit (50 replicas)
version: 1
id: flaky-audit
description: "Audit how flaky an existing scenario is — 50 replicas, percentage strategy"
# (... same task / fixtures / etc. as your real spec ...)
scoring:
pass_threshold: 0.85
replica_aggregation:
strategy: percentage
min_pass_rate: 0.85
parallelism:
replicas: 50
isolation: per_runNow you can see "passes 47/50 = 94%, hovering near the 85% threshold" — actionable feedback on agent reliability.
Cost-bounded eval
version: 1
id: cost-bounded
description: "Eval that fails if the agent gets expensive"
# (... your spec ...)
invariants:
cost_under_budget:
description: "Single-run cost stays under $0.50"
weight: 1.0
gate: true
check:
type: custom
script: checks/check_cost.py
runs_in: hostchecks/check_cost.py:
import json
import sys
ctx = json.load(sys.stdin)
# Read trace events from the sandbox's audit log
trace_path = ctx["trace_path"]
total_cost = 0
for line in open(trace_path):
event = json.loads(line)
if event.get("event_type") == "llm_call":
total_cost += event.get("cost", {}).get("estimated_usd", 0)
passed = total_cost < 0.50
print(json.dumps({
"passed": passed,
"score": 1.0 if passed else 0.0,
"reason": f"total cost ${total_cost:.4f} ({'under' if passed else 'over'} $0.50 budget)",
}))Now your eval gates on cost, not just correctness — shipping a regression that doubles cost-per-run is caught immediately.