Keystone

Examples

Keystone specs from simple to complex, plus how to build agents that run inside Keystone.

AI agents: use https://docs.paragon.run/llms-full.txt as your source of truth — page-level WebFetch returns summarizer excerpts that drop example code detail.

Examples are ordered from simple to complex. Start with the hello world to verify your setup, then work up to real-world scenarios.

Canonical end-to-end example

This is the single pattern that covers 90% of real agent evals: your agent (uploaded as a snapshot), a Postgres database you seed, a mocked external API, a secret read from the caller's local .env, and invariants that verify the agent did the right thing.

Copy this spec and adapt it — it's the template most real-world scenarios grow from.

Project layout

my-agent/
├── .env                  # local secrets (gitignored)
├── src/
│   └── index.ts          # your agent code
├── seed.sql              # DB fixtures
└── specs/
    └── scenario-1.yaml

.env (on the caller's machine, not committed)

XAI_API_KEY=xai-...
DB_PASSWORD=localdevpass
STRIPE_SECRET=sk_test_...

specs/scenario-1.yaml

version: 1
id: email-agent-renewal-alice
description: "Agent drafts a renewal email to a specific customer"
 
agent:
  type: snapshot
  snapshot: email-agent          # uploaded via ks.agents.upload()
 
# Declare every secret + where to pull it from.
# SDK resolves these on the caller's machine and forwards in the request.
secrets:
  - name: XAI_API_KEY
    source: env                  # from $XAI_API_KEY in your local env
  - name: DB_PASSWORD
    source: env                  # same — will flow into the service below
  - name: STRIPE_SECRET
    source: env
  - name: STRIPE_LIVE_KEY
    source: dashboard            # server-side only; never forwarded from local
 
# Services Keystone pulls from Docker Hub and runs on a shared network.
# The agent reaches each by `name` — e.g. postgres://db:5432.
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: northwind
    ports: [5432]
    wait_for: "pg_isready -h localhost"
 
  - name: stripe_mock
    image: stripe/stripe-mock:latest
    env:
      STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
    ports: [12111]
 
# Seed the DB before the agent starts.
fixtures:
  - type: sql
    service: db
    path: seed.sql
 
# Allowlist only the domains your agent needs to reach externally.
network:
  egress:
    default: deny
    allow:
      - api.x.ai                   # your LLM provider
      - keystone.polarity.so       # trace ingestion (auto-allowed in blessed runtime images)
 
task:
  prompt: "Draft a subscription renewal reminder email to alice@northwind.co."
 
invariants:
  draft_created:
    description: "Exactly one draft file"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "test $(ls drafts/*.md 2>/dev/null | wc -l) -eq 1"
 
  addressed_to_alice:
    description: "Draft is addressed to alice@northwind.co"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "grep -l 'To: alice@northwind.co' drafts/*.md"
 
  reads_professional:
    description: "Draft is professional and mentions the product"
    weight: 0.5
    check:
      type: llm_as_judge
      input_from: workspace
      criteria: "The draft is professional and references alice's product."
      # model defaults to paragon-md; billed per-token to your Keystone invoice
 
scoring:
  pass_threshold: 0.85

Running it

import fs from "node:fs";
import { Keystone } from "@polarityinc/polarity-keystone";
import "dotenv/config";           // loads .env into process.env
 
const ks = new Keystone();
 
// Upload the agent snapshot (once per build)
await ks.agents.upload({
  name: "email-agent",
  bundle: "./",                   // tars the project; respects .gitignore
  runtime: "node20",
  entrypoint: ["tsx", "src/index.ts"],
});
 
// Upload the spec
const spec = await ks.specs.create({ spec_yaml: fs.readFileSync("specs/scenario-1.yaml", "utf8") });
 
// Create + run the experiment; SDK auto-forwards declared secrets from .env
const exp = await ks.experiments.create({
  name: "renewal-alice",
  spec_id: spec.spec_id,
});
const results = await ks.experiments.runAndWait(exp.id);
console.log(results.metrics.pass_rate);
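For reference, here is a minimal sketch of what src/index.ts could look like from inside the sandbox. The buildDraft helper, the hardcoded product name, and the guard on KEYSTONE_SANDBOX_ID are illustrative assumptions; a real agent would query the seeded database and call the LLM through api.x.ai.

```typescript
import { mkdirSync, writeFileSync } from "node:fs";

// Hypothetical helper: render a renewal draft for one customer.
export function buildDraft(recipient: string, product: string): string {
  return [
    `To: ${recipient}`,
    "Subject: Your subscription renewal",
    "",
    `Hi, your ${product} subscription is up for renewal soon.`,
  ].join("\n");
}

async function main(): Promise<void> {
  // Keystone passes the task prompt on stdin for snapshot agents.
  let prompt = "";
  for await (const chunk of process.stdin) prompt += chunk;

  // Service-discovery vars injected by Keystone (KEYSTONE_SERVICE_DB_HOST=db, etc.).
  const dbHost = process.env.KEYSTONE_SERVICE_DB_HOST ?? "localhost";
  const dbPort = process.env.KEYSTONE_SERVICE_DB_PORT ?? "5432";

  // Debug logging goes to stderr; a real agent would query
  // postgres://<dbHost>:<dbPort> and call the LLM here.
  console.error(`task: ${prompt.slice(0, 80)} (db at ${dbHost}:${dbPort})`);

  mkdirSync("drafts", { recursive: true });
  writeFileSync("drafts/alice.md", buildDraft("alice@northwind.co", "Northwind CRM"));
}

// Only execute when actually running inside a Keystone sandbox.
if (process.env.KEYSTONE_SANDBOX_ID) main();
```

The single draft file and the `To:` header are exactly what the draft_created and addressed_to_alice invariants check for.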

What happens behind the scenes

  1. SDK resolves secrets on your machine per source: directives. STRIPE_LIVE_KEY isn't forwarded — server reads it from the Dashboard. XAI_API_KEY, DB_PASSWORD, STRIPE_SECRET are forwarded from .env.
  2. Keystone server pulls postgres:16, stripe/stripe-mock:latest, and the blessed Node runtime image from Docker Hub.
  3. A new Docker network keystone-<id> is created. All containers join it with their name as the DNS alias.
  4. Services start first, with {{ secrets.X }} substituted to actual values. seed.sql runs against the Postgres container.
  5. Agent container starts with your tarball mounted at /workspace, secrets injected as env vars (including KEYSTONE_SERVICE_DB_HOST=db, KEYSTONE_SERVICE_DB_PORT=5432, etc. for service discovery).
  6. Agent runs the entrypoint, reads the task prompt from stdin, produces files in /workspace/drafts/. Traces stream to keystone.polarity.so over HTTPS.
  7. Scorer runs invariants against the post-run workspace. llm_as_judge calls Paragon's model proxy (billed to your Keystone invoice).
  8. Teardown — containers destroyed, network removed, workspace retained per your retention policy.

The entire lifecycle is declared in one spec + one .env. No Dockerfile, no separate build step for services, no manual secret entry.


Hello world

The simplest possible spec. No external repos, no agent binary, no services. A shell command creates a file and the invariants check it exists with the right content.

version: 1
id: "hello-world"
description: "Verify a command can create a file"
 
base: "ubuntu:24.04"
 
task:
  prompt: "Create hello.txt containing 'Hello, Keystone!'"
 
agent:
  type: cli
  binary: sh
  args: ["-c", "echo 'Hello, Keystone!' > hello.txt"]
  timeout: 1m
 
invariants:
  file_created:
    description: "hello.txt exists"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: hello.txt
 
  correct_content:
    description: "Contains the expected text"
    weight: 1.0
    check:
      type: file_content
      path: hello.txt
      contains: "Hello, Keystone!"
 
scoring:
  pass_threshold: 1.0
Running it

import { readFileSync } from 'node:fs';
import { Keystone } from '@polarityinc/polarity-keystone';

const ks = new Keystone();
await ks.specs.create({ spec_yaml: readFileSync('hello.yaml', 'utf-8') });
const exp = await ks.experiments.create({ name: 'hello', spec_id: 'hello-world' });
const results = await ks.experiments.runAndWait(exp.id);
console.log(results.passed > 0 ? 'PASSED' : 'FAILED');

If this passes, your setup is working. Move on to a real scenario.


Fix a failing test

A realistic spec that clones a repo, installs dependencies, and checks whether the agent can fix a broken test. Uses Paragon as the agent.

Replace your-org/your-repo and broken-test with your own repository and branch. The repo needs to have a failing test that the agent can fix.

version: 1
id: "fix-unit-test"
description: "Agent fixes a failing unit test"
 
base: "ubuntu:24.04"
setup:
  packages: [nodejs, npm]
  commands: ["npm install"]
 
fixtures:
  - type: git_repo
    url: "https://github.com/your-org/your-repo"
    branch: "broken-test"
 
task:
  prompt: |
    The test in src/utils.test.ts is failing. Fix it.
    Do not change the implementation in src/utils.ts.
 
agent:
  type: paragon
  model: paragon-fast
  timeout: 3m
 
invariants:
  tests_pass:
    description: "All tests pass"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "npm test"
 
  no_impl_changes:
    description: "Implementation file was not modified"
    weight: 0.5
    check:
      type: command_exit
      command: "git diff --name-only | grep -v test"
      exit_code: 1
 
scoring:
  pass_threshold: 1.0

What's new here:

  • fixtures clones a real git repo into the sandbox
  • setup.commands installs npm dependencies before the agent runs
  • The second invariant (no_impl_changes) verifies the agent only changed test files, not the implementation -- it expects grep to find no matches (exit code 1)
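The grep exit-code trick is easy to verify locally in any POSIX shell. grep exits 0 when it prints at least one line and 1 when everything was filtered out, so a diff containing only test files "passes" via exit code 1:

```shell
# Only a test file changed: grep -v filters it out, prints nothing, exits 1.
printf 'src/utils.test.ts\n' | grep -v test
echo "only tests changed: exit $?"        # exit 1

# An implementation file changed: it survives the filter, grep exits 0.
printf 'src/utils.ts\n' | grep -v test
echo "implementation changed: exit $?"    # exit 0
```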

Full stack with services

A complex spec that tests whether an agent can reconcile two database tables and send a summary email. Uses a real Postgres database, a mock SMTP server, LLM-as-judge scoring, and forbidden rules.

version: 1
id: "reconciliation"
description: "Agent reconciles two databases and emails a summary"
 
base: "ubuntu:24.04"
setup:
  packages: [nodejs, npm, python3]
  env:
    DATABASE_URL: "postgres://postgres:test@db:5432/testdb"
 
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    ports: [5432]
    wait_for: "pg_isready -U postgres"
  - name: smtp
    type: http_mock
    ports: [9090]
    record: true
    routes:
      - method: POST
        path: /v1/send
        response: '{"ok": true}'
 
secrets:
  - name: DB_PASSWORD
    source: generated            # fresh random value each run
    scope: env
 
fixtures:
  - type: sql
    service: db
    path: seeds/schema.sql
  - type: drift
    target: db.customers_a
    strategy: random_mismatches
    count: 15
    seed: "42"
 
network:
  egress:
    default: deny
    allow: [registry.npmjs.org]
  dns_overrides:
    smtp.sendgrid.net: smtp.services.internal
 
task:
  prompt: |
    Reconcile customers_a and customers_b tables, fix all mismatches,
    and email a summary to finance@co
 
agent:
  type: paragon
  model: paragon-max
  timeout: 5m
 
invariants:
  databases_match:
    description: "No mismatches after reconciliation"
    weight: 0.5
    gate: true
    check:
      type: sql
      service: db
      query: |
        SELECT count(*) FROM customers_a a
        LEFT JOIN customers_b b ON a.id = b.id
        WHERE b.id IS NULL OR a.email != b.email
      equals: 0
 
  email_sent:
    description: "Exactly one summary email was sent"
    weight: 0.3
    check:
      type: http_mock_assertions
      service: smtp
      assertions:
        - field: request_count
          filters: { to: "finance@co" }
          equals: 1
 
  email_quality:
    description: "Email is professional and accurate"
    weight: 0.2
    check:
      type: llm_as_judge
      model: paragon-fast
      criteria: "Professional tone, accurate mismatch count"
      input_from: smtp.last_request.body
 
forbidden:
  db_writes_outside: [customers_a, customers_b, audit_log]
  http_except: [smtp]
  secrets_in_logs: deny
 
scoring:
  pass_threshold: 0.95
 
parallelism:
  replicas: 10
  isolation: per_run
 
determinism:
  seed: 42
 
teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/db.sql"

What's new here:

  • services starts a real Postgres container and a mock SMTP server on a shared Docker network
  • fixtures loads SQL seeds into Postgres and injects random data drift
  • secrets generates a fresh random password each run
  • network blocks all outbound traffic except npm and redirects smtp.sendgrid.net to the mock
  • forbidden ensures the agent only writes to allowed tables and doesn't leak secrets
  • invariants uses three different check types: SQL query, mock assertion, and LLM-as-judge
  • parallelism runs 10 replicas to measure consistency
  • teardown exports the audit log and a database dump after every run
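The heart of the task itself is a full comparison of the two tables. In memory, with a hypothetical row shape, the mismatch rule mirrors the SQL that the databases_match invariant runs:

```typescript
interface Customer {
  id: number;
  email: string;
}

// Rows of A that are missing from B or carry a different email —
// the same condition the databases_match invariant counts in SQL
// (b.id IS NULL OR a.email != b.email).
export function findMismatches(a: Customer[], b: Customer[]): Customer[] {
  const byId = new Map(b.map((c) => [c.id, c] as const));
  return a.filter((row) => {
    const other = byId.get(row.id);
    return other === undefined || other.email !== row.email;
  });
}
```

An agent that passes databases_match has to drive this count to zero and write the fixes back to Postgres, not just detect them.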

Comparing agents

Use matrix to test the same task with different configurations side by side.

version: 1
id: "agent-comparison"
description: "Compare agent performance across models"
 
base: "ubuntu:24.04"
setup:
  packages: [python3, pip]
  commands: ["pip install -r requirements.txt"]
 
fixtures:
  - type: git_repo
    url: "https://github.com/your-org/ml-pipeline"
    branch: "main"
 
task:
  prompt: "Fix the data pipeline bug that causes duplicate records."
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  no_duplicates:
    description: "No duplicate records in output"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "python3 check_duplicates.py"
 
scoring:
  pass_threshold: 0.8
 
parallelism:
  replicas: 5
  matrix:
    - model: "paragon-fast"
    - model: "paragon-max"

This creates 10 total runs (2 models x 5 replicas). After both finish, compare them:

// fastExpId / maxExpId: the experiment IDs from the two matrix runs
const comparison = await ks.experiments.compare(fastExpId, maxExpId);
console.log(`Regressed: ${comparison.regressed}`);
comparison.metrics.forEach(m => {
  console.log(`${m.name}: ${m.baseline} -> ${m.candidate} (${m.direction})`);
});

Setting up alerts

Get notified when your agent's performance drops.

// Slack Bot alert (posts to a channel)
await ks.alerts.create({
  name: 'pass-rate-drop',
  eval_id: 'reconciliation',
  condition: 'pass_rate < 0.9',
  notify: 'slack',
  slack_channel: '#agent-alerts',
});
 
// Webhook alert (Slack incoming webhook auto-detected)
await ks.alerts.create({
  name: 'cost-spike',
  condition: 'mean_cost_per_run_usd > 2.00',
  notify: 'webhook',
  webhook_url: 'https://hooks.slack.com/services/T00/B00/xxx',
});

Alerts evaluate after every experiment run. See SDK reference > Alerts for all supported metrics and operators.
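Conditions are plain comparisons over run metrics. Conceptually the evaluation looks like this — a hypothetical simplification with an invented conditionFires helper; the real evaluator and its full operator set live server-side:

```typescript
type Metrics = Record<string, number>;

// Evaluate a condition string like "pass_rate < 0.9" against run metrics.
// Illustrative only: assumes the condition grammar is `<metric> <op> <number>`.
export function conditionFires(condition: string, metrics: Metrics): boolean {
  const m = condition.match(/^(\w+)\s*(<=|>=|<|>)\s*([\d.]+)$/);
  if (!m) throw new Error(`unparseable condition: ${condition}`);
  const [, name, op, raw] = m;
  const value = metrics[name];
  const threshold = Number(raw);
  switch (op) {
    case "<":  return value < threshold;
    case ">":  return value > threshold;
    case "<=": return value <= threshold;
    default:   return value >= threshold;
  }
}
```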


Using sandboxes directly

You don't have to run full experiments. You can create a sandbox from a spec, interact with it manually, and inspect the results. This is useful for development and debugging.

import { Keystone } from '@polarityinc/polarity-keystone';
 
const ks = new Keystone();
 
// Create a sandbox
const sb = await ks.sandboxes.create({ spec_id: 'fix-unit-test' });
console.log(`Sandbox ${sb.id} is ${sb.state}`);
console.log(`Services:`, sb.services);
// { db: { host: "db", port: 5432, ready: true } }
 
// Run commands
const test = await ks.sandboxes.runCommand(sb.id, { command: 'npm test' });
console.log(`Exit code: ${test.exit_code}`);
console.log(`Output: ${test.stdout}`);
 
// Read and write files
const src = await ks.sandboxes.readFile(sb.id, 'src/utils.ts');
await ks.sandboxes.writeFile(sb.id, 'src/utils.ts', fixedCode);
 
// See what changed
const diff = await ks.sandboxes.diff(sb.id);
console.log(`Modified: ${diff.modified}`);
console.log(`Added: ${diff.added}`);
 
// Clean up
await ks.sandboxes.destroy(sb.id);

This gives you a remote dev environment with services pre-configured. You can use it for manual testing, scripted integration tests, or building custom evaluation pipelines.


Building an agent for Keystone

If you're building an agent that's designed to run inside Keystone sandboxes, here's what you need to know from the agent's perspective.

What your agent gets

When Keystone runs your agent, it sets up the environment before your code starts:

  1. A workspace directory -- your agent's working directory, seeded with any fixtures (git repos, files)
  2. Backing services -- Postgres, Redis, etc. are already running and reachable by name
  3. Environment variables -- connection info for every service, the sandbox ID, and the Keystone API URL
  4. The task prompt -- passed via stdin (for cli/paragon/python agents) or in the HTTP request body (for http agents)

Environment variables

Keystone injects these into every agent process:

Variable                      Example                 Description
KEYSTONE_SANDBOX_ID           sb-abc123               Current sandbox ID
KEYSTONE_BASE_URL             http://keystone:8012    Keystone API URL
KEYSTONE_SERVICE_DB_HOST      db                      Hostname for the "db" service
KEYSTONE_SERVICE_DB_PORT      5432                    Port for the "db" service
KEYSTONE_SERVICE_CACHE_HOST   cache                   Hostname for the "cache" service
KEYSTONE_SERVICE_CACHE_PORT   6379                    Port for the "cache" service

One pair of HOST/PORT vars is created for each service in the spec, with the service name uppercased.
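The naming scheme is mechanical, so an agent can derive the variable names for any service. A small sketch — the serviceEnvVars helper is invented here, and the hyphen-to-underscore step is an assumption for service names that aren't valid env identifiers:

```typescript
// Derive the discovery variable names for a service declared in the spec,
// e.g. "db" -> KEYSTONE_SERVICE_DB_HOST / KEYSTONE_SERVICE_DB_PORT.
export function serviceEnvVars(service: string): { host: string; port: string } {
  const key = service.toUpperCase().replace(/-/g, "_"); // assumption: hyphens map to underscores
  return {
    host: `KEYSTONE_SERVICE_${key}_HOST`,
    port: `KEYSTONE_SERVICE_${key}_PORT`,
  };
}

// Usage: build a connection string for the "db" service from the environment.
const { host, port } = serviceEnvVars("db");
const dbUrl = `postgres://${process.env[host] ?? "localhost"}:${process.env[port] ?? "5432"}`;
console.error(`connecting to ${dbUrl}`);
```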

Discovering services with the SDK

Instead of parsing env vars manually, use Keystone.fromSandbox():

import { Keystone } from '@polarityinc/polarity-keystone';
 
const { client: ks, sandbox } = await Keystone.fromSandbox();
 
// Connect to services
const db = sandbox.services?.['db'];
if (db) console.log(`DB at ${db.host}:${db.port}, ready: ${db.ready}`);
 
// Run commands in the sandbox
const result = await ks.sandboxes.runCommand(sandbox.id, {
  command: 'npm test',
});
console.log(`Tests: exit code ${result.exit_code}`);

Auto-tracing LLM calls

Wrap your LLM client so Keystone automatically captures token usage, latency, and tool calls. This data shows up in traces and can be queried by agent name and version.

import Anthropic from '@anthropic-ai/sdk';
import { Keystone } from '@polarityinc/polarity-keystone';
 
const { client: ks, sandbox } = await Keystone.fromSandbox();
const anthropic = ks.wrap(new Anthropic());   // sandbox id auto-detected from env
 
// Every messages.create() call now auto-reports to Keystone
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  messages: [{ role: 'user', content: 'Fix the bug in main.py' }],
});

Custom trace spans

For non-LLM operations (file I/O, API calls, test execution), use traced() to capture timing and errors:

import { traced } from '@polarityinc/polarity-keystone';
 
const testOutput = await traced('run_tests', async () => {
  return execSync('npm test').toString();
});
 
const analysis = await traced('analyze_results', async () => {
  return parseTestOutput(testOutput);
});

Nested traced() calls create parent-child spans automatically, so you get a full execution tree in the trace viewer.

Stdout and stderr

Everything your agent prints to stdout and stderr is captured and returned in the experiment results:

  • scenarios[].agent_output -- your agent's stdout (truncated to 50KB)
  • scenarios[].agent_stderr -- your agent's stderr (truncated to 50KB)

Use stdout for normal output and stderr for debug logging. If audit.stdout_capture is enabled in the spec, stdout is also scanned by the secrets_in_logs forbidden rule.

You can view the captured output when debugging failures:

const results = await ks.experiments.runAndWait(exp.id);
for (const scenario of results.scenarios) {
  if (scenario.status === 'fail') {
    console.log('Agent output:', scenario.agent_output);
    console.log('Agent stderr:', scenario.agent_stderr);
  }
}

What your agent should NOT do

If the spec has forbidden rules, Keystone monitors your agent's behavior via the audit log. Common restrictions:

  • Don't write to files outside the allowed list -- if file_writes_outside: [src/, config/] is set, writing to /etc/passwd fails the run
  • Don't make HTTP calls to unauthorized hosts -- if http_except: [payment-api] is set, calling api.stripe.com directly fails the run
  • Don't print secrets to stdout -- if secrets_in_logs: deny is set, any output containing AWS keys, private keys, or Stripe tokens fails the run

These rules are evaluated after your agent finishes. Your agent won't be interrupted mid-execution, but the scenario will be marked as failed if any rule is violated.
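A cheap defensive habit against the last rule is to scrub secret-looking tokens before anything reaches stdout. A minimal sketch — the redactSecrets helper and its patterns are illustrative, not Keystone's actual detection set:

```typescript
// Redact common secret shapes from a log line before printing it,
// so output cannot trip a secrets_in_logs: deny rule.
// Patterns are illustrative only, not the detector Keystone uses.
export function redactSecrets(line: string): string {
  return line
    .replace(/sk_(live|test)_[A-Za-z0-9]+/g, "sk_$1_[REDACTED]")          // Stripe keys
    .replace(/AKIA[A-Z0-9]{16}/g, "AKIA[REDACTED]")                       // AWS access key IDs
    .replace(
      /-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g,
      "[REDACTED PRIVATE KEY]",                                           // PEM blocks
    );
}

// Usage: safe even if the message interpolates an env secret.
console.log(redactSecrets(`charging with ${process.env.STRIPE_SECRET ?? "sk_test_abc123"}`));
```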