
Examples

Real-world Keystone specs from hello-world to production reconciliation pipelines, with copy-pasteable YAML and explanations.

Specs on this page are ordered from simple to complex. Start with the hello-world to verify your setup, then work up to real-world scenarios. Every example is a complete spec: copy, paste, adapt.

Table of contents

  1. Hello world (file write)
  2. Build a REST API
  3. Bug fix in a linked-list
  4. Postgres schema design
  5. Security review
  6. Reconciliation pipeline (full real-world)
  7. Multi-language matrix

Hello world

The smoke test. Verifies the agent can write a file and the file has the right content.

version: 1
id: hello-world
description: "Smoke test — does the agent write a file?"
 
base: "ubuntu:24.04"
 
task:
  prompt: "Create a file called hello.txt with the text 'Hello, World!' inside."
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  hello_file_exists:
    description: "hello.txt was created"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: hello.txt
 
  hello_file_correct:
    description: "hello.txt contains the right text"
    weight: 1.0
    check:
      type: file_content
      path: hello.txt
      contains: "Hello, World!"
 
scoring:
  pass_threshold: 1.0

Run:

ks eval run hello.yaml

Expected output:

{
  "passed": 1,
  "failed": 0,
  "metrics": { "pass_rate": 1.0, "mean_wall_ms": 8200 },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "hello_file_exists",  "passed": true, "gate": true, "weight": 1.0 },
        { "name": "hello_file_correct", "passed": true, "weight": 1.0 }
      ]
    }
  ]
}

If this works, you're set up correctly.
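To check a result like the one above from a script, a minimal sketch follows. It assumes only the JSON shape shown in the expected output; the helper name `failed_invariants` is hypothetical, and how you capture the JSON (stdout, a file) depends on your setup.

```python
import json

# Sample result, copied from the expected output above.
SAMPLE = """
{
  "passed": 1,
  "failed": 0,
  "metrics": { "pass_rate": 1.0, "mean_wall_ms": 8200 },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "hello_file_exists",  "passed": true, "gate": true, "weight": 1.0 },
        { "name": "hello_file_correct", "passed": true, "weight": 1.0 }
      ]
    }
  ]
}
"""


def failed_invariants(result: dict) -> list:
    """Return the names of every invariant that did not pass, across all scenarios."""
    return [
        inv["name"]
        for scenario in result.get("scenarios", [])
        for inv in scenario.get("invariants", [])
        if not inv.get("passed", False)
    ]


result = json.loads(SAMPLE)
print(failed_invariants(result))  # prints [] for the sample above
```

An empty list means every invariant passed; any names returned tell you which check to look at first.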


Build a REST API

Agent builds a Python HTTP server using only the standard library, then writes a test script that exercises every route.

version: 1
id: rest-api-todo
description: "Agent builds a Python REST API for a todo list, with input validation and tests"
 
base: "ubuntu:24.04"
 
setup:
  packages: [python3, curl]
  env:
    PYTHONDONTWRITEBYTECODE: "1"
 
resources:
  timeout: 5m
  memory: 1Gi
  cpu: 1
 
task:
  prompt: |
    Build a Python REST API server using only the standard library
    (http.server + json). Do NOT use Flask, FastAPI, or any third-party
    frameworks.
 
    Requirements:
    1. Create `server.py` that runs an HTTP server on port 8080
    2. Implement these endpoints:
       - GET /todos        → JSON array of all todos
       - POST /todos       → create (body: {"title": "...", "done": false}), return 201
       - GET /todos/<id>   → return a single todo, 404 if missing
       - DELETE /todos/<id> → delete, return 204, 404 if missing
    3. Todos: id (auto-increment int), title (string), done (boolean)
    4. Store todos in memory (a list or dict)
    5. Validate POST input: title must be non-empty string; return 400 with
       error message if invalid
    6. Create `test_api.sh` that:
       - Starts the server in background
       - Tests all endpoints with curl
       - Checks response codes and bodies
       - Kills the server at the end
       - Exits 0 only if all tests pass
 
    In the test script, after starting the server in background, wait for it:
      for i in $(seq 1 50); do
        curl -s http://localhost:8080/todos > /dev/null 2>&1 && break
        sleep 0.1
      done
  context:
    success_criteria: server.py exists, serves correct responses, test script passes, code is clean
 
agent:
  type: paragon
  timeout: 4m
 
invariants:
  server_file_exists:
    description: "server.py was created"
    weight: 1.0
    gate: true
    check: { type: file_exists, path: server.py }
 
  test_script_exists:
    description: "test_api.sh was created"
    weight: 0.5
    gate: true
    check: { type: file_exists, path: test_api.sh }
 
  server_has_routes:
    description: "server.py implements the required HTTP methods"
    weight: 1.0
    check:
      type: file_content
      path: server.py
      pattern: "(GET|POST|DELETE).*(todos|todo)"
 
  server_has_validation:
    description: "server.py includes input validation"
    weight: 1.0
    check:
      type: file_content
      path: server.py
      contains: "400"
 
  test_script_passes:
    description: "Test script runs all API tests successfully"
    weight: 3.0
    gate: true
    check:
      type: command_exit
      command: "chmod +x test_api.sh && bash test_api.sh"
      exit_code: 0
 
  code_quality:
    description: "LLM judge evaluates code quality"
    weight: 2.0
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Evaluate this REST API:
        1. Are all 4 endpoints implemented correctly?
        2. Is error handling thorough (404, 400, proper status codes)?
        3. Is the code well-structured?
        4. Does it use only the standard library as requested?
        5. Is the test script comprehensive?
      input_from: workspace
      rubric:
        pass: "Complete API, proper error handling, clean code, comprehensive tests"
        fail: "Missing endpoints, poor error handling, messy code"
      temperature: 0
 
forbidden:
  file_writes_outside: ["/workspace"]
  secrets_in_logs: deny
 
scoring:
  pass_threshold: 0.8
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.8
 
parallelism:
  replicas: 5
  isolation: per_run
 
determinism:
  seed: 42
  dns: static
 
retention:
  audit_logs: 24h
  traces: 7d
 
teardown:
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
  always_run: true

The structure to notice:

  • Gates: server_file_exists, test_script_exists, test_script_passes are all gate: true. If any one fails, the scenario fails outright.
  • Composite weighting: test_script_passes (the actual end-to-end test) is weight 3.0; quality checks are weight 1–2. The test passing matters more than the code being pretty.
  • 5 replicas + 80% pass rate: measures consistency. At least 4 of 5 replicas must pass, so an agent that succeeds only 3 times out of 5 fails the spec.
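How the weights and gates combine can be illustrated with a little arithmetic. The sketch below assumes the composite score is the weight-normalized fraction of passed invariants, and that a failed gate fails the scenario regardless of score. This page does not spell out Keystone's exact formula, so `composite_score` and `scenario_passes` are hypothetical helpers, not the real implementation.

```python
from dataclasses import dataclass


@dataclass
class Invariant:
    name: str
    weight: float
    passed: bool
    gate: bool = False


def composite_score(invariants):
    """Assumed formula: weight of passed invariants divided by total weight."""
    total = sum(i.weight for i in invariants)
    earned = sum(i.weight for i in invariants if i.passed)
    return earned / total if total else 0.0


def scenario_passes(invariants, pass_threshold):
    """Assumed rule: any failed gate fails the scenario; otherwise compare the score."""
    if any(i.gate and not i.passed for i in invariants):
        return False
    return composite_score(invariants) >= pass_threshold


# Weights and gates from the rest-api-todo spec; in this run only the LLM judge fails.
run = [
    Invariant("server_file_exists", 1.0, True, gate=True),
    Invariant("test_script_exists", 0.5, True, gate=True),
    Invariant("server_has_routes", 1.0, True),
    Invariant("server_has_validation", 1.0, True),
    Invariant("test_script_passes", 3.0, True, gate=True),
    Invariant("code_quality", 2.0, False),
]
print(round(composite_score(run), 3))  # 0.765, i.e. 6.5 / 8.5
print(scenario_passes(run, 0.8))       # False: below the 0.8 threshold
```

Under these assumptions, failing only the judge (weight 2.0) drops the score to 6.5 / 8.5 ≈ 0.765, below pass_threshold: 0.8, while failing only server_has_validation leaves 7.5 / 8.5 ≈ 0.882 and the scenario still passes.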

Bug fix in a linked-list

Agent fixes a known-buggy implementation. Tests verify the fix without changes to the test file.

version: 1
id: bugfix-linked-list
description: "Agent fixes a buggy linked-list implementation; tests must pass without test-file modifications"
 
base: "ubuntu:24.04"
 
setup:
  packages: [python3, python3-pip]
  files:
    - path: linked_list.py
      content: |
        class Node:
            def __init__(self, value):
                self.value = value
                self.next = None
 
        class LinkedList:
            def __init__(self):
                self.head = None
 
            def append(self, value):
                # BUG: doesn't handle empty list
                node = self.head
                while node.next is not None:
                    node = node.next
                node.next = Node(value)
 
            def remove(self, value):
                # BUG: doesn't update head if first node matches
                node = self.head
                while node is not None and node.next is not None:
                    if node.next.value == value:
                        node.next = node.next.next
                        return
                    node = node.next
 
            def __len__(self):
                # BUG: off-by-one
                node = self.head
                n = 0
                while node is not None:
                    node = node.next
                    n += 1
                return n - 1
 
    - path: test_linked_list.py
      content: |
        from linked_list import LinkedList
 
        def test_append_empty():
            ll = LinkedList()
            ll.append(1)
            assert len(ll) == 1
 
        def test_append_multiple():
            ll = LinkedList()
            ll.append(1)
            ll.append(2)
            ll.append(3)
            assert len(ll) == 3
 
        def test_remove_head():
            ll = LinkedList()
            ll.append(1)
            ll.append(2)
            ll.remove(1)
            assert len(ll) == 1
            assert ll.head.value == 2
 
        def test_remove_middle():
            ll = LinkedList()
            ll.append(1)
            ll.append(2)
            ll.append(3)
            ll.remove(2)
            assert len(ll) == 2
 
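  commands:
    # Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
    # so a plain `pip install` is refused without this flag.
    - "pip install --break-system-packages pytest"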
 
task:
  prompt: |
    The file linked_list.py contains a buggy LinkedList implementation.
    Fix the bugs so that all tests in test_linked_list.py pass.
 
    Constraints:
    - Do NOT modify test_linked_list.py
    - Run `pytest test_linked_list.py -v` to verify
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  test_file_unchanged:
    description: "test_linked_list.py was not modified"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: |
        sha256sum test_linked_list.py | grep -q $(cat .keystone/test_file_initial_hash 2>/dev/null || sha256sum test_linked_list.py | awk '{print $1}')
      exit_code: 0
 
  tests_pass:
    description: "All tests pass"
    weight: 5.0
    gate: true
    check:
      type: command_exit
      command: "pytest test_linked_list.py -v"
      exit_code: 0
 
  no_unrelated_changes:
    description: "Diff is contained to linked_list.py"
    weight: 1.0
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Did the agent fix the bugs in a minimal way? Reject if unnecessary
        files were created or unrelated code was added.
      input_from: workspace
 
forbidden:
  file_writes_outside: ["/workspace"]
  secrets_in_logs: deny
 
scoring:
  pass_threshold: 0.85
 
parallelism:
  replicas: 5
 
determinism:
  seed: 42

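Notice: pre-seeding linked_list.py and test_linked_list.py via setup.files means every replica starts from the same buggy code and the same tests. The test_file_unchanged invariant compares the test file's SHA-256 hash against the baseline stored in .keystone/test_file_initial_hash to verify the agent didn't cheat by editing the tests. If that baseline file is missing, the command falls back to hashing the current file and the check passes trivially, so make sure the baseline exists before relying on this gate.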
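The hash comparison behind test_file_unchanged can be sketched in Python. This is an illustration, not Keystone's implementation: the helper names `sha256_of` and `is_unchanged` are hypothetical, and the point is that the baseline hash has to be recorded before the agent runs.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_unchanged(path: Path, baseline: str) -> bool:
    """True when the file still hashes to the recorded baseline."""
    return sha256_of(path) == baseline


with tempfile.TemporaryDirectory() as tmp:
    test_file = Path(tmp) / "test_linked_list.py"
    test_file.write_text("def test_ok():\n    assert True\n")
    baseline = sha256_of(test_file)            # recorded before the agent runs
    print(is_unchanged(test_file, baseline))   # True: nothing has touched the file
    test_file.write_text("def test_ok():\n    pass\n")  # simulated tampering
    print(is_unchanged(test_file, baseline))   # False: content differs from baseline
```

Comparing against a hash computed after the run would always succeed, which is why the baseline is taken first.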


Postgres schema design

Agent designs an e-commerce database, seeds data, and writes analytical queries. The database is a real Postgres container, and the agent interacts with it through psql.

version: 1
id: postgres-ecommerce
description: "Agent designs an e-commerce schema, seeds data, writes analytical queries against live Postgres"
 
base: "ubuntu:24.04"
 
setup:
  packages: [postgresql-client, python3]
  env:
    PGPASSWORD: test
    PGHOST: db
    PGUSER: postgres
    PGDATABASE: testdb
 
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: pg_isready
 
resources:
  timeout: 6m
  memory: 2Gi
  cpu: 2
 
task:
  prompt: |
    Design and implement an e-commerce database in the connected PostgreSQL.
 
    Connection:
    - Host: db
    - Port: 5432
    - User: postgres
    - Password: test
    - Database: testdb
 
    Steps:
    1. Create `schema.sql` with these tables:
       - customers (id, name, email, created_at)
       - products (id, name, category, price, stock_quantity)
       - orders (id, customer_id FK, status, total_amount, created_at)
       - order_items (id, order_id FK, product_id FK, quantity, unit_price)
       Use PKs, FKs, NOT NULL, UNIQUE on email, CHECK on price > 0.
 
    2. Create `seed.sql` with realistic data:
       - At least 10 customers
       - At least 15 products across 3+ categories
       - At least 20 orders with statuses (pending, shipped, delivered, cancelled)
       - At least 40 order_items
 
    3. Execute schema.sql and seed.sql against the database.
 
    4. Create `queries.sql` with these named analytical queries:
       a. Top 5 customers by total spending (excluding cancelled)
       b. Revenue by product category, last 30 days
       c. Products low on stock (qty < 10) ordered recently
       d. Average order value by month
       e. Customers who never placed an order
 
    5. Execute all queries and save output to `query_results.txt`.
 
    Use psql command-line tool to interact with the database.
  context:
    success_criteria: "Normalized schema, realistic data, queries execute successfully"
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  schema_file_exists:
    description: "schema.sql was created"
    weight: 0.5
    gate: true
    check: { type: file_exists, path: schema.sql }
 
  seed_file_exists:
    description: "seed.sql was created"
    weight: 0.5
    check: { type: file_exists, path: seed.sql }
 
  queries_file_exists:
    description: "queries.sql was created"
    weight: 0.5
    check: { type: file_exists, path: queries.sql }
 
  tables_created:
    description: "All 4 tables exist in the database"
    weight: 2.0
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM information_schema.tables
        WHERE table_schema = 'public'
        AND table_name IN ('customers', 'products', 'orders', 'order_items')
      equals: 4
 
  has_foreign_keys:
    description: "Foreign key constraints exist"
    weight: 1.5
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM information_schema.table_constraints
        WHERE constraint_type = 'FOREIGN KEY' AND table_schema = 'public'
      equals: 3
 
  sufficient_seed_data:
    description: "Seed data meets minimums"
    weight: 1.5
    check:
      type: command_exit
      command: |
        PGPASSWORD=test psql -h db -U postgres -d testdb -t -A -c "
          SELECT
            (SELECT count(*) FROM customers) >= 10 AND
            (SELECT count(*) FROM products) >= 15 AND
            (SELECT count(*) FROM orders) >= 20 AND
            (SELECT count(*) FROM order_items) >= 40
          AS sufficient;" | grep -q 't'
      exit_code: 0
 
  queries_execute:
    description: "All analytical queries execute without errors"
    weight: 2.0
    check:
      type: command_exit
      command: "PGPASSWORD=test psql -v ON_ERROR_STOP=1 -h db -U postgres -d testdb -f queries.sql > /dev/null 2>&1"
      exit_code: 0
 
  schema_quality:
    description: "LLM evaluates schema design and query quality"
    weight: 2.0
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Evaluate this database implementation:
        1. Is the schema properly normalized (3NF)?
        2. Are appropriate constraints used (PK, FK, NOT NULL, UNIQUE, CHECK)?
        3. Is the seed data realistic (not all the same)?
        4. Do the analytical queries use proper JOINs, aggregations, and filtering?
        5. Does the schema support the queries efficiently?
      input_from: workspace
      rubric:
        pass: "Well-normalized, proper constraints, realistic data, correct analytical queries"
        fail: "Poorly designed, missing constraints, broken queries"
      temperature: 0
 
forbidden:
  file_writes_outside: ["/workspace"]
  secrets_in_logs: deny
 
scoring:
  pass_threshold: 0.8
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.8
 
parallelism:
  replicas: 5
  isolation: per_run
 
determinism:
  seed: 42
  dns: static
 
retention:
  audit_logs: 24h
  traces: 7d
 
teardown:
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
  always_run: true

Key patterns:

  • Real Postgres + auto-injected env: PGHOST=db and PGUSER=postgres are passed via setup.env, so psql works without repeating -h db -U postgres everywhere.
  • SQL invariants: tables_created runs a real SQL query and asserts the count equals 4.
  • Per-replica isolation: isolation: per_run gives each replica its own fresh Postgres. No state leak.
  • DB dump on teardown: every run exports the final DB state for offline analysis.
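The idea behind a sql invariant with equals (run one query, compare the single returned value) can be sketched with Python's built-in sqlite3 module standing in for Postgres. This is a stand-in only: SQLite has no information_schema, so the query reads sqlite_master instead, and `sql_check` is a hypothetical helper rather than anything Keystone exposes.

```python
import sqlite3


def sql_check(conn, query, equals):
    """Run a query that returns one row with one column and compare the value."""
    (value,) = conn.execute(query).fetchone()
    return value == equals


conn = sqlite3.connect(":memory:")
for table in ("customers", "products", "orders", "order_items"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")

# SQLite analogue of the tables_created query from the spec.
query = """
    SELECT count(*) FROM sqlite_master
    WHERE type = 'table'
    AND name IN ('customers', 'products', 'orders', 'order_items')
"""
print(sql_check(conn, query, 4))  # True: all four tables exist
conn.execute("DROP TABLE order_items")
print(sql_check(conn, query, 4))  # False: only three remain
```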

Security review

Agent audits Python source files for vulnerabilities. It must find the real bugs without raising false positives on clean code.

version: 1
id: security-review
description: "Agent performs security audit; must find real vulnerabilities, no false positives"
 
base: "ubuntu:24.04"
 
setup:
  packages: [python3]
  files:
    - path: src/auth_handler.py
      content: |
        import hashlib, os, pickle, base64
 
        SECRET_KEY = "mysecretkey123"        # hardcoded secret
        JWT_SECRET = "jwt_secret_do_not_share"  # hardcoded secret
 
        class AuthHandler:
            def __init__(self):
                self.users = {}
 
            def hash_password(self, password):
                return hashlib.md5(password.encode()).hexdigest()  # weak hash
 
            def login(self, username, password):
                stored = self.users.get(username)
                if stored and stored == self.hash_password(password):
                    return self._create_token(username)
                return None
 
            def _create_token(self, username):
                data = {"user": username, "role": "user"}
                return base64.b64encode(pickle.dumps(data)).decode()  # pickle = RCE
 
            def verify_token(self, token):
                try:
                    data = pickle.loads(base64.b64decode(token))  # pickle = RCE
                    return data
                except:
                    return None
 
    - path: src/file_handler.py
      content: |
        import os, subprocess
 
        class FileHandler:
            def __init__(self, base_dir="/var/www/uploads"):
                self.base_dir = base_dir
 
            def read_file(self, filename):
                path = os.path.join(self.base_dir, filename)  # path traversal
                with open(path, 'r') as f:
                    return f.read()
 
            def process_file(self, filename, command="wc -l"):
                path = os.path.join(self.base_dir, filename)
                result = subprocess.run(
                    f"{command} {path}",
                    shell=True,                                    # command injection
                    capture_output=True, text=True
                )
                return result.stdout
 
    - path: src/data_fetcher.py
      content: |
        import urllib.request, ssl
        import xml.etree.ElementTree as ET
 
        class DataFetcher:
            def fetch_url(self, url):
                ctx = ssl.create_default_context()
                ctx.check_hostname = False                          # SSL disabled
                ctx.verify_mode = ssl.CERT_NONE
                return urllib.request.urlopen(url, context=ctx).read()
 
            def parse_xml(self, xml_string):
                root = ET.fromstring(xml_string)                    # XXE
                return {child.tag: child.text for child in root}
 
    - path: src/utils.py
      content: |
        # CLEAN — must NOT be flagged
        import re
 
        def sanitize_html(text: str) -> str:
            return re.sub(r'<[^>]+>', '', text)
 
        def validate_email(email: str) -> bool:
            pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
            return bool(re.match(pattern, email))
 
  env:
    PYTHONDONTWRITEBYTECODE: "1"
 
resources:
  timeout: 5m
  memory: 1Gi
  cpu: 1
 
task:
  prompt: |
    You are a security auditor. Review ALL Python files in src/.
 
    Steps:
    1. Read each file in src/ carefully
    2. Identify every real, exploitable security vulnerability
    3. Write ALL findings to findings.json as a JSON array
 
    Each finding has these fields:
    {
      "id": "VULN-001",
      "title": "Short title",
      "severity": "critical|high|medium|low",
      "category": "injection|auth|crypto|ssrf|path_traversal|xss|info_disclosure",
      "file": "src/filename.py",
      "line_hint": "approximate line or function name",
      "description": "Detailed explanation",
      "exploit_scenario": "How an attacker would exploit this",
      "remediation": "How to fix it"
    }
 
    IMPORTANT:
    - The code has REAL vulnerabilities. Find them ALL.
    - Do NOT flag style issues, performance issues, or code that is correct.
    - utils.py is intentionally clean — do NOT flag it.
    - Be specific about each vulnerability with clear exploit scenarios.
  context:
    success_criteria: "Finds 6+ real vulnerabilities, no false positives, clear remediation"
 
agent:
  type: paragon
  timeout: 4m
 
invariants:
  findings_exists:
    description: "findings.json was created"
    weight: 1.0
    gate: true
    check: { type: file_exists, path: findings.json }
 
  found_hardcoded_secrets:
    description: "Identified hardcoded secrets in auth_handler.py"
    weight: 2.0
    gate: true
    check:
      type: file_content
      path: findings.json
      pattern: "(hardcoded|secret|credential|SECRET_KEY)"
 
  found_pickle_rce:
    description: "Identified unsafe pickle deserialization"
    weight: 2.5
    check:
      type: file_content
      path: findings.json
      pattern: "(pickle|deseriali[sz]e|RCE|remote code)"
 
  found_command_injection:
    description: "Identified command injection in process_file"
    weight: 2.5
    check:
      type: file_content
      path: findings.json
      pattern: "(command.?inject|shell.?inject|subprocess|shell=True)"
 
  found_path_traversal:
    description: "Identified path traversal in file operations"
    weight: 2.0
    check:
      type: file_content
      path: findings.json
      pattern: "(path.?travers|directory.?travers|\\.\\.)"
 
  minimum_findings:
    description: "At least 6 genuine findings"
    weight: 1.5
    check:
      type: command_exit
      command: |
        python3 -c "
        import json
        f = json.load(open('findings.json'))
        print(f'{len(f)} findings')
        assert len(f) >= 6, f'Only {len(f)} findings'
        "
      exit_code: 0
 
  no_false_positive_utils:
    description: "Did not flag the clean utils.py"
    weight: 1.5
    check:
      type: command_exit
      command: |
        python3 -c "
        import json
        f = json.load(open('findings.json'))
        bad = [x for x in f if 'utils' in str(x.get('file', ''))]
        assert len(bad) == 0, f'{len(bad)} false positives on utils.py'
        "
      exit_code: 0
 
  findings_quality:
    description: "LLM evaluates finding quality and remediation"
    weight: 2.0
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Evaluate these security findings. For each:
        1. Is it a genuine vulnerability (not just style)?
        2. Is the severity rating appropriate?
        3. Is there a clear, realistic exploit scenario?
        4. Is the remediation specific and actionable?
 
        Key vulns that should be found: hardcoded secrets, pickle RCE,
        command injection, path traversal, weak hashing (MD5),
        SSL verification disabled, XXE.
 
        Score pass if 5+ findings are genuine with good descriptions.
      input_from: findings.json
      rubric:
        pass: "5+ genuine vulns with clear exploits and actionable remediation"
        fail: "Vague, missing critical vulns, or many false positives"
      temperature: 0
 
forbidden:
  file_writes_outside: ["/workspace"]
  secrets_in_logs: deny
 
scoring:
  pass_threshold: 0.75
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.8
 
parallelism:
  replicas: 5
 
determinism:
  seed: 42
 
retention:
  audit_logs: 24h
  traces: 7d
 
teardown:
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
  always_run: true

What this teaches:

  • setup.files for fixture code: the vulnerable source files are written into the workspace at boot, no separate fixture step needed.
  • Per-vulnerability invariants: separate checks for each known vulnerability let you see in the dashboard which ones the agent missed.
  • Anti-false-positive check: no_false_positive_utils ensures the agent doesn't just flag everything as suspicious.
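To see how the per-vulnerability patterns behave on a given findings.json, the sketch below applies them with Python's re module. It assumes a file_content check with pattern performs an unanchored, case-sensitive regex search over the file text, which this page does not confirm. The sample findings are invented for illustration, and `coverage` and `false_positives` are hypothetical helpers.

```python
import json
import re

# Patterns copied from the invariants in the spec (YAML "\\." is the regex \.).
PATTERNS = {
    "found_hardcoded_secrets": r"(hardcoded|secret|credential|SECRET_KEY)",
    "found_pickle_rce": r"(pickle|deseriali[sz]e|RCE|remote code)",
    "found_command_injection": r"(command.?inject|shell.?inject|subprocess|shell=True)",
    "found_path_traversal": r"(path.?travers|directory.?travers|\.\.)",
}


def coverage(findings_text):
    """Map each invariant name to whether its pattern matches anywhere in the text."""
    return {name: re.search(p, findings_text) is not None for name, p in PATTERNS.items()}


def false_positives(findings, clean_marker="utils"):
    """Findings whose 'file' field points at the intentionally clean module."""
    return [f for f in findings if clean_marker in str(f.get("file", ""))]


# Invented sample: two genuine findings and one false positive on utils.py.
findings = [
    {"id": "VULN-001", "title": "Hardcoded secret key", "file": "src/auth_handler.py"},
    {"id": "VULN-002", "title": "Unsafe pickle deserialization", "file": "src/auth_handler.py"},
    {"id": "VULN-003", "title": "Regex too permissive", "file": "src/utils.py"},
]
text = json.dumps(findings)
print(coverage(text))
print(len(false_positives(findings)))  # 1
```

With this sample, the secrets and pickle patterns match while command injection and path traversal do not, so the output shows exactly which vulnerabilities were missed. Note that deseriali[sz]e matches "deserialize" but not "deserialization"; the second finding is still caught here through "pickle".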

Reconciliation pipeline (full real-world)

The canonical end-to-end example: the agent reconciles two customer tables, sends a summary email, and must avoid forbidden DB writes. It uses every spec block.

version: 1
id: reconciliation-scenario
description: "Agent reconciles two customer databases and emails a summary"
 
task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co.
 
    The two tables should match after reconciliation. Send exactly one
    summary email with the count of fixed rows.
  context:
    ticket_url: "https://linear.app/co/issue/ENG-1234"
    success_criteria: "All rows match after reconciliation, exactly one summary email sent"
 
base: "ubuntu:24.04"
 
setup:
  packages: [nodejs, npm, git, python3, postgresql-client]
  commands:
    - "npm install -g typescript"
  env:
    NODE_ENV: "test"
    DATABASE_URL: "postgres://postgres:{{ secrets.DB_PASSWORD }}@db:5432/testdb"
 
resources:
  timeout: "10m"
  memory: "2Gi"
  cpu: 2
  concurrency_limit: 5
 
fixtures:
  - type: sql
    service: db
    sql: |
      CREATE TABLE customers_a (
        id INT PRIMARY KEY,
        email TEXT NOT NULL,
        name TEXT NOT NULL,
        plan TEXT NOT NULL
      );
      CREATE TABLE customers_b (
        id INT PRIMARY KEY,
        email TEXT NOT NULL,
        name TEXT NOT NULL,
        plan TEXT NOT NULL
      );
      CREATE TABLE reconciliation_log (
        id SERIAL PRIMARY KEY,
        ts TIMESTAMP DEFAULT now(),
        action TEXT NOT NULL,
        row_id INT NOT NULL
      );
      INSERT INTO customers_a VALUES
        (1, 'alice@co', 'Alice', 'pro'),
        (2, 'ben@co', 'Ben', 'free'),
        (3, 'carol@co', 'Carol', 'pro'),
        (4, 'dan@co', 'Dan', 'pro'),
        (5, 'eve@co', 'Eve', 'enterprise');
      INSERT INTO customers_b SELECT * FROM customers_a;
 
  - type: drift
    target: db.customers_a
    strategy: random_mismatches
    count: 15
    seed: "{{ determinism.seed }}"
 
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready"
 
  - name: smtp
    image: mailhog/mailhog
    ports: [1025, 8025]    # 1025 = SMTP, 8025 = HTTP API for assertions
 
  - name: stripe-mock
    type: http_mock
    record: true
    default_response: 404
    routes:
      - method: GET
        path: /v1/customers
        response: '{"customers": []}'
        status: 200
      - method: POST
        path: /v1/charges
        response: '{"id":"ch_test","status":"succeeded"}'
        status: 200
 
secrets:
  - name: ANTHROPIC_API_KEY
    source: env
 
  - name: DB_PASSWORD
    from: generated
    scope: env
 
network:
  egress:
    default: deny
    allow: ["*.services.internal"]
  dns_overrides:
    api.stripe.com: stripe-mock.services.internal
    smtp.sendgrid.net: smtp.services.internal
 
audit:
  db_writes: true
  http_calls: true
  process_spawns: true
  stdout_capture: true
  file_system:
    watch: [/workspace]
    track: [writes, deletes]
 
snapshots:
  before_run: true
  checkpoints: per_action
  retain_on: [failure]
 
agent:
  type: paragon
  model: paragon-max
  args: ["--yolo", "--quiet"]
  timeout: 5m
 
invariants:
  databases_match:
    description: "No drift between customers_a and customers_b"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email OR a.plan != b.plan
      equals: 0
 
  reconciliation_log_populated:
    description: "Every fix was logged"
    weight: 0.3
    check:
      type: sql
      service: db
      query: "SELECT count(*) >= 1 FROM reconciliation_log"
      equals: true
 
  email_sent_once:
    description: "Exactly one summary email to finance@co"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  email_mentions_count:
    description: "Email body mentions row count"
    weight: 0.2
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: last_request.body
          contains: "rows"
 
  code_quality:
    description: "Solution is clean and idiomatic"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Evaluate the reconciliation code:
        1. Does it handle edge cases (empty tables, duplicates)?
        2. Is the SQL idiomatic (proper JOIN/UPDATE)?
        3. Is logging structured (timestamps, action types)?
      input_from: workspace
      rubric:
        pass: "Clean, handles edge cases, structured logging"
        fail: "Messy, fragile, or incomplete"
      temperature: 0
 
forbidden:
  db_writes_outside: [customers_a, customers_b, reconciliation_log]
  http_except: [smtp, stripe-mock]
  secrets_in_logs: deny
  file_writes_outside: [/workspace/src, /workspace/logs]
 
scoring:
  pass_threshold: 0.95
  replica_aggregation:
    strategy: all_must_pass
    min_pass_rate: 0.90
 
parallelism:
  replicas: 3
  isolation: per_run
  matrix:
    - { input_size: small, locale: en_US }
    - { input_size: large, locale: en_US }
 
determinism:
  clock: "2026-01-01T00:00:00Z"
  seed: 42
  network_latency: 0ms
  dns: static
 
retention:
  audit_logs: 24h
  snapshots: 7d
  teardown_exports: 30d
  traces: 30d
 
teardown:
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: mock_requests
      service: stripe-mock
      to: "results/{{ run_id }}/stripe-calls.jsonl"
    - type: snapshot
      to: "results/{{ run_id }}/final-state/"
  always_run: true

Every block at work:

  • fixtures (drift): injects 15 random mismatches into customers_a deterministically (same seed = same drift).
  • services (3 of them): Postgres + MailHog (SMTP) + http_mock for Stripe.
  • secrets (mixed sources): ANTHROPIC_API_KEY from local env, DB_PASSWORD generated per-run.
  • network (default-deny + DNS overrides): agent can't hit real Stripe; DNS sends api.stripe.com to the mock.
  • audit (everything): DB writes + HTTP calls + processes + filesystem all captured.
  • forbidden (4 rules): only writes to allowed tables, only HTTP to allowed services, no secrets in logs, no file writes outside /workspace/src and /workspace/logs.
  • parallelism (matrix × replicas): 2 matrix entries × 3 replicas = 6 scenarios.
  • determinism (full pinning): frozen clock, fixed seed, static DNS.
  • teardown (4 export types): audit log + DB dump + mock requests + final state snapshot, all retained.

This is the spec to copy when you're building anything real.
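The "same seed = same drift" property of the drift fixture can be illustrated with a seeded random generator. How Keystone's random_mismatches strategy actually picks and mutates values is not described on this page, so `apply_drift` below is a hypothetical stand-in that only demonstrates the determinism.

```python
import random

# Rows seeded into customers_a by the sql fixture in the spec.
ROWS = [
    {"id": 1, "email": "alice@co", "name": "Alice", "plan": "pro"},
    {"id": 2, "email": "ben@co", "name": "Ben", "plan": "free"},
    {"id": 3, "email": "carol@co", "name": "Carol", "plan": "pro"},
    {"id": 4, "email": "dan@co", "name": "Dan", "plan": "pro"},
    {"id": 5, "email": "eve@co", "name": "Eve", "plan": "enterprise"},
]


def apply_drift(rows, count, seed):
    """Return a copy of rows with `count` seeded random field mutations applied."""
    rng = random.Random(seed)          # private generator: same seed, same sequence
    drifted = [dict(r) for r in rows]  # copy so the input rows stay untouched
    for _ in range(count):
        row = rng.choice(drifted)
        field = rng.choice(["email", "name", "plan"])
        row[field] = f"{row[field]}~{rng.randint(0, 9999)}"
    return drifted


first = apply_drift(ROWS, count=15, seed=42)
second = apply_drift(ROWS, count=15, seed=42)
print(first == second)  # True: identical drift on every run with the same seed
```

Because every replica sees the same corrupted rows, a difference in outcome between replicas reflects the agent's behavior rather than the fixture.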


Multi-language matrix

Compare an agent across multiple language stacks on the same task and fixture data:

version: 1
id: language-matrix
description: "Same task across Node, Python, Go to compare agent behavior per language"
 
base: "ubuntu:24.04"
 
setup:
  packages: [nodejs, npm, python3, python3-pip, golang]
  files:
    - path: data.csv
      content: |
        id,name,department
        1,Alice,Engineering
        2,Ben,Sales
        3,Carol,Engineering
        4,Dan,Marketing
        5,Eve,Sales
        6,Frank,Engineering
 
task:
  prompt: |
    Implement a CLI program that reads a CSV file from data.csv and
    prints the count of distinct values in column "{{ matrix.column }}".
 
    Use {{ matrix.lang }} as the implementation language.
    Save the program as `{{ matrix.filename }}`.
    Make it executable / runnable as: {{ matrix.run_cmd }}.
 
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  output_correct:
    description: "Program prints the correct count"
    weight: 5.0
    gate: true
    check:
      type: command_exit
      command: "{{ matrix.run_cmd }} | grep -q '{{ matrix.expected_count }}'"
      exit_code: 0
 
  uses_stdlib_only:
    description: "Solution uses only the stdlib (no third-party packages)"
    weight: 1.0
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: |
        Does {{ matrix.filename }} use only {{ matrix.lang }} stdlib?
        No third-party packages, no extra dependencies.
      input_from: workspace
      rubric:
        pass: "Stdlib only"
        fail: "Imports a third-party package"
 
forbidden:
  file_writes_outside: ["/workspace"]
 
scoring:
  pass_threshold: 1.0
  replica_aggregation:
    strategy: all_must_pass
 
parallelism:
  replicas: 3
  matrix:
    - lang: "Python"
      filename: "main.py"
      run_cmd: "python3 main.py"
      column: "department"
      expected_count: "3"
    - lang: "Node.js"
      filename: "main.js"
      run_cmd: "node main.js"
      column: "department"
      expected_count: "3"
    - lang: "Go"
      filename: "main.go"
      run_cmd: "go run main.go"
      column: "department"
      expected_count: "3"
 
determinism:
  seed: 42

3 matrix entries × 3 replicas = 9 scenarios. The dashboard slices by matrix params, so you can see "passes 100% of Python runs but only 67% of Go" and dig in.
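The expansion into 9 scenarios, and the way `{{ matrix.* }}` placeholders are filled per entry, can be sketched as follows. This assumes plain string substitution of the placeholders; `render` and `expand` are hypothetical helpers, not Keystone's templating engine.

```python
import re
from itertools import product

# Matrix entries copied from the spec.
MATRIX = [
    {"lang": "Python", "filename": "main.py", "run_cmd": "python3 main.py",
     "column": "department", "expected_count": "3"},
    {"lang": "Node.js", "filename": "main.js", "run_cmd": "node main.js",
     "column": "department", "expected_count": "3"},
    {"lang": "Go", "filename": "main.go", "run_cmd": "go run main.go",
     "column": "department", "expected_count": "3"},
]

# The output_correct command from the spec.
TEMPLATE = "{{ matrix.run_cmd }} | grep -q '{{ matrix.expected_count }}'"


def render(template, params):
    """Replace each {{ matrix.key }} placeholder with the entry's value."""
    return re.sub(
        r"\{\{\s*matrix\.(\w+)\s*\}\}",
        lambda m: str(params[m.group(1)]),
        template,
    )


def expand(matrix, replicas):
    """Pair every matrix entry with every replica index (1-based)."""
    return list(product(matrix, range(1, replicas + 1)))


scenarios = expand(MATRIX, replicas=3)
print(len(scenarios))               # 9
print(render(TEMPLATE, MATRIX[2]))  # go run main.go | grep -q '3'
```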


More patterns

Pre-commit hook smoke test

version: 1
id: precommit-smoke
description: "1 replica, fail-fast — for pre-commit hooks where speed matters"
 
base: "ubuntu:24.04"
task: { prompt: "Make src/utils.py pass `ruff check` cleanly." }
agent: { type: paragon, timeout: 2m }
 
invariants:
  ruff_passes:
    description: "ruff check passes"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "ruff check src/utils.py"
      exit_code: 0
 
scoring:
  pass_threshold: 1.0
 
parallelism:
  replicas: 1

Use as a pre-commit hook: ks eval run .keystone/precommit.yaml — if it fails, the commit is blocked.

Flakiness audit (50 replicas)

version: 1
id: flaky-audit
description: "Audit how flaky an existing scenario is — 50 replicas, percentage strategy"
 
# (... same task / fixtures / etc. as your real spec ...)
 
scoring:
  pass_threshold: 0.85
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.85
 
parallelism:
  replicas: 50
  isolation: per_run

Now you can see "passes 47/50 = 94%", which clears the 85% threshold and leaves three failed replicas to investigate: actionable feedback on agent reliability.
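The aggregation arithmetic is simple enough to sketch. The function below assumes that the percentage strategy compares the fraction of passing replicas against min_pass_rate inclusively, and that all_must_pass requires every replica to pass; `aggregate` is a hypothetical helper and the inclusive comparison is an assumption.

```python
def aggregate(replica_results, strategy="percentage", min_pass_rate=1.0):
    """Return (spec_passed, pass_rate) for a list of per-replica booleans."""
    if not replica_results:
        raise ValueError("no replica results")
    rate = sum(replica_results) / len(replica_results)
    if strategy == "all_must_pass":
        return all(replica_results), rate
    if strategy == "percentage":
        return rate >= min_pass_rate, rate  # assumed inclusive threshold
    raise ValueError(f"unknown strategy: {strategy}")


results = [True] * 47 + [False] * 3
print(aggregate(results, "percentage", 0.85))  # (True, 0.94)
print(aggregate(results, "all_must_pass"))     # (False, 0.94)
```

The same 47/50 run passes under percentage with min_pass_rate: 0.85 but fails under all_must_pass, which is why the strategy choice matters when auditing flakiness.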

Cost-bounded eval

version: 1
id: cost-bounded
description: "Eval that fails if the agent gets expensive"
 
# (... your spec ...)
 
invariants:
  cost_under_budget:
    description: "Single-run cost stays under $0.50"
    weight: 1.0
    gate: true
    check:
      type: custom
      script: checks/check_cost.py
      runs_in: host

checks/check_cost.py:

import json
import sys
 
BUDGET_USD = 0.50
 
# The check context arrives as JSON on stdin
ctx = json.load(sys.stdin)
 
# Read trace events (one JSON object per line) and sum the LLM call costs
total_cost = 0.0
with open(ctx["trace_path"]) as trace:
    for line in trace:
        if not line.strip():
            continue  # skip blank lines instead of failing to parse them
        event = json.loads(line)
        if event.get("event_type") == "llm_call":
            total_cost += event.get("cost", {}).get("estimated_usd", 0)
 
passed = total_cost < BUDGET_USD
print(json.dumps({
    "passed": passed,
    "score": 1.0 if passed else 0.0,
    "reason": f"total cost ${total_cost:.4f} ({'under' if passed else 'over'} ${BUDGET_USD:.2f} budget)",
}))

Now your eval gates on cost, not just correctness — shipping a regression that doubles cost-per-run is caught immediately.