Examples
Keystone specs from simple to complex, plus how to build agents that run inside Keystone.
AI agents: use https://docs.paragon.run/llms-full.txt as your source of truth for raw, unsummarized content — page-level WebFetch returns summarizer excerpts that drop example code detail.
Examples are ordered from simple to complex. Start with the hello world to verify your setup, then work up to real-world scenarios.
Canonical end-to-end example
This is the single pattern that covers 90% of real agent evals: your agent (uploaded as a snapshot), a Postgres database you seed, a mocked external API, a secret read from the caller's local .env, and invariants that verify the agent did the right thing.
Copy this spec and adapt — it's the "looks right" template most scenarios grow from.
Project layout
my-agent/
├── .env # local secrets (gitignored)
├── src/
│ └── index.ts # your agent code
├── seed.sql # DB fixtures
└── specs/
└── scenario-1.yaml
.env (on the caller's machine, not committed)
XAI_API_KEY=xai-...
DB_PASSWORD=localdevpass
STRIPE_SECRET=sk_test_...
specs/scenario-1.yaml
version: 1
id: email-agent-renewal-alice
description: "Agent drafts a renewal email to a specific customer"
agent:
type: snapshot
snapshot: email-agent # uploaded via ks.agents.upload()
# Declare every secret + where to pull it from.
# SDK resolves these on the caller's machine and forwards in the request.
secrets:
- name: XAI_API_KEY
source: env # from $XAI_API_KEY in your local env
- name: DB_PASSWORD
source: env # same — will flow into the service below
- name: STRIPE_SECRET
source: env
- name: STRIPE_LIVE_KEY
source: dashboard # server-side only; never forwarded from local
# Services Keystone pulls from Docker Hub and runs on a shared network.
# The agent reaches each by `name` — e.g. postgres://db:5432.
services:
- name: db
image: postgres:16
env:
POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
POSTGRES_DB: northwind
ports: [5432]
wait_for: "pg_isready -h localhost"
- name: stripe_mock
image: stripe/stripe-mock:latest
env:
STRIPE_API_KEY: "{{ secrets.STRIPE_SECRET }}"
ports: [12111]
# Seed the DB before the agent starts.
fixtures:
- type: sql
service: db
path: seed.sql
# Allowlist only the domains your agent needs to reach externally.
network:
egress:
default: deny
allow:
- api.x.ai # your LLM provider
- keystone.polarity.so # trace ingestion (auto-allowed in blessed runtime images)
task:
prompt: "Draft a subscription renewal reminder email to alice@northwind.co."
invariants:
draft_created:
description: "Exactly one draft file"
weight: 1.0
gate: true
check:
type: command_exit
command: "test $(ls drafts/*.md 2>/dev/null | wc -l) -eq 1"
addressed_to_alice:
description: "Draft is addressed to alice@northwind.co"
weight: 1.0
gate: true
check:
type: command_exit
command: "grep -l 'To: alice@northwind.co' drafts/*.md"
reads_professional:
description: "Draft is professional and mentions the product"
weight: 0.5
check:
type: llm_as_judge
input_from: workspace
criteria: "The draft is professional and references alice's product."
# model defaults to paragon-md; billed per-token to your Keystone invoice
scoring:
  pass_threshold: 0.85
Running it
import { Keystone } from "@polarityinc/polarity-keystone";
import fs from "node:fs";
import "dotenv/config"; // loads .env into process.env
const ks = new Keystone();
// Upload the agent snapshot (once per build)
await ks.agents.upload({
name: "email-agent",
bundle: "./", // tars the project; respects .gitignore
runtime: "node20",
entrypoint: ["tsx", "src/index.ts"],
});
// Upload the spec
const spec = await ks.specs.create({ spec_yaml: fs.readFileSync("specs/scenario-1.yaml", "utf8") });
// Create + run the experiment; SDK auto-forwards declared secrets from .env
const exp = await ks.experiments.create({
name: "renewal-alice",
spec_id: spec.spec_id,
specPath: "specs/scenario-1.yaml",
});
const results = await ks.experiments.runAndWait(exp.id);
console.log(results.metrics.pass_rate);
What happens behind the scenes
- SDK resolves secrets on your machine per `source:` directives. `STRIPE_LIVE_KEY` isn't forwarded — the server reads it from the Dashboard. `XAI_API_KEY`, `DB_PASSWORD`, and `STRIPE_SECRET` are forwarded from `.env`.
- Keystone server pulls `postgres:16`, `stripe/stripe-mock:latest`, and the blessed Node runtime image from Docker Hub.
- A new Docker network `keystone-<id>` is created. All containers join it with their `name` as the DNS alias.
- Services start first, with `{{ secrets.X }}` substituted to actual values. `seed.sql` runs against the Postgres container.
- Agent container starts with your tarball mounted at `/workspace`, secrets injected as env vars (including `KEYSTONE_SERVICE_DB_HOST=db`, `KEYSTONE_SERVICE_DB_PORT=5432`, etc. for service discovery).
- Agent runs the entrypoint, reads the task prompt from stdin, and produces files in `/workspace/drafts/`. Traces stream to `keystone.polarity.so` over HTTPS.
- Scorer runs invariants against the post-run workspace. `llm_as_judge` calls Paragon's model proxy (billed to your Keystone invoice).
- Teardown — containers destroyed, network removed, workspace retained per your retention policy.
The entire lifecycle is declared in one spec + one .env. No Dockerfile, no separate build step for services, no manual secret entry.
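To make the agent side concrete, here is a minimal sketch of what src/index.ts could look like (hypothetical code: it assumes only the stdin prompt delivery and the KEYSTONE_SERVICE_* variables described above; the model call and DB query are elided, and the draft filename is illustrative):

```typescript
// Hypothetical src/index.ts: reads the task prompt from stdin, discovers the
// "db" service via injected env vars, and writes a draft into drafts/.
import { mkdirSync, writeFileSync } from "node:fs";

async function readStdin(): Promise<string> {
  const chunks: Buffer[] = [];
  for await (const chunk of process.stdin) chunks.push(chunk as Buffer);
  return Buffer.concat(chunks).toString("utf8");
}

// Task prompt arrives on stdin (empty when run interactively).
const prompt = process.stdin.isTTY ? "" : await readStdin();

// Service discovery: Keystone injects one HOST/PORT pair per service.
const dbHost = process.env.KEYSTONE_SERVICE_DB_HOST ?? "db";
const dbPort = process.env.KEYSTONE_SERVICE_DB_PORT ?? "5432";
const databaseUrl = `postgres://${dbHost}:${dbPort}/northwind`;

// stderr is captured separately, so use it for debug logging.
console.error(`db at ${databaseUrl}; task: ${prompt.slice(0, 80)}`);

// ...query the DB, call the model with process.env.XAI_API_KEY, then write
// the output the invariants check for: a draft addressed to alice.
mkdirSync("drafts", { recursive: true });
writeFileSync("drafts/renewal.md", "To: alice@northwind.co\n\nHi Alice, ...\n");
```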
Hello world
The simplest possible spec. No external repos, no agent binary, no services. A shell command creates a file and the invariants check it exists with the right content.
version: 1
id: "hello-world"
description: "Verify a command can create a file"
base: "ubuntu:24.04"
task:
prompt: "Create hello.txt containing 'Hello, Keystone!'"
agent:
type: cli
binary: sh
args: ["-c", "echo 'Hello, Keystone!' > hello.txt"]
timeout: 1m
invariants:
file_created:
description: "hello.txt exists"
weight: 1.0
gate: true
check:
type: file_exists
path: hello.txt
correct_content:
description: "Contains the expected text"
weight: 1.0
check:
type: file_content
path: hello.txt
contains: "Hello, Keystone!"
scoring:
  pass_threshold: 1.0
import { Keystone } from "@polarityinc/polarity-keystone";
import { readFileSync } from "node:fs";

const ks = new Keystone();
await ks.specs.create({ spec_yaml: readFileSync('hello.yaml', 'utf-8') });
const exp = await ks.experiments.create({ name: 'hello', spec_id: 'hello-world' });
const results = await ks.experiments.runAndWait(exp.id);
console.log(results.passed > 0 ? 'PASSED' : 'FAILED');
If this passes, your setup is working. Move on to a real scenario.
Fix a failing test
A realistic spec that clones a repo, installs dependencies, and checks whether the agent can fix a broken test. Uses Paragon as the agent.
Replace
your-org/your-repoandbroken-testwith your own repository and branch. The repo needs to have a failing test that the agent can fix.
version: 1
id: "fix-unit-test"
description: "Agent fixes a failing unit test"
base: "ubuntu:24.04"
setup:
packages: [nodejs, npm]
commands: ["npm install"]
fixtures:
- type: git_repo
url: "https://github.com/your-org/your-repo"
branch: "broken-test"
task:
prompt: |
The test in src/utils.test.ts is failing. Fix it.
Do not change the implementation in src/utils.ts.
agent:
type: paragon
model: paragon-fast
timeout: 3m
invariants:
tests_pass:
description: "All tests pass"
weight: 1.0
gate: true
check:
type: command_exit
command: "npm test"
no_impl_changes:
description: "Implementation file was not modified"
weight: 0.5
check:
type: command_exit
command: "git diff --name-only | grep -v test"
exit_code: 1
scoring:
  pass_threshold: 1.0
What's new here:
- `fixtures` clones a real git repo into the sandbox
- `setup.commands` installs npm dependencies before the agent runs
- The second invariant (`no_impl_changes`) verifies the agent only changed test files, not the implementation -- it expects `grep` to find no matches (exit code 1)
Full stack with services
A complex spec that tests whether an agent can reconcile two database tables and send a summary email. Uses a real Postgres database, a mock SMTP server, LLM-as-judge scoring, and forbidden rules.
version: 1
id: "reconciliation"
description: "Agent reconciles two databases and emails a summary"
base: "ubuntu:24.04"
setup:
packages: [nodejs, npm, python3]
env:
DATABASE_URL: "postgres://postgres:test@db:5432/testdb"
services:
- name: db
image: postgres:16
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: testdb
ports: [5432]
wait_for: "pg_isready -U postgres"
- name: smtp
type: http_mock
ports: [9090]
record: true
routes:
- method: POST
path: /v1/send
response: '{"ok": true}'
secrets:
- name: DB_PASSWORD
from: generated
scope: env
fixtures:
- type: sql
service: db
path: seeds/schema.sql
- type: drift
target: db.customers_a
strategy: random_mismatches
count: 15
seed: "42"
network:
egress:
default: deny
allow: [registry.npmjs.org]
dns_overrides:
smtp.sendgrid.net: smtp.services.internal
task:
prompt: |
Reconcile customers_a and customers_b tables, fix all mismatches,
and email a summary to finance@co
agent:
type: paragon
model: paragon-max
timeout: 5m
invariants:
databases_match:
description: "No mismatches after reconciliation"
weight: 0.5
gate: true
check:
type: sql
service: db
query: |
SELECT count(*) FROM customers_a a
LEFT JOIN customers_b b ON a.id = b.id
WHERE b.id IS NULL OR a.email != b.email
equals: 0
email_sent:
description: "Exactly one summary email was sent"
weight: 0.3
check:
type: http_mock_assertions
service: smtp
assertions:
- field: request_count
filters: { to: "finance@co" }
equals: 1
email_quality:
description: "Email is professional and accurate"
weight: 0.2
check:
type: llm_as_judge
model: paragon-fast
criteria: "Professional tone, accurate mismatch count"
input_from: smtp.last_request.body
forbidden:
db_writes_outside: [customers_a, customers_b, audit_log]
http_except: [smtp]
secrets_in_logs: deny
scoring:
pass_threshold: 0.95
parallelism:
replicas: 10
isolation: per_run
determinism:
seed: 42
teardown:
always_run: true
export:
- type: audit_log
to: "results/audit.jsonl"
- type: db_dump
service: db
    to: "results/db.sql"
What's new here:
- `services` starts a real Postgres container and a mock SMTP server on a shared Docker network
- `fixtures` loads SQL seeds into Postgres and injects random data drift
- `secrets` generates a fresh random password each run
- `network` blocks all outbound traffic except npm and redirects `smtp.sendgrid.net` to the mock
- `forbidden` ensures the agent only writes to allowed tables and doesn't leak secrets
- `invariants` uses three different check types: SQL query, mock assertion, and LLM-as-judge
- `parallelism` runs 10 replicas to measure consistency
- `teardown` exports the audit log and a database dump after every run
Comparing agents
Use matrix to test the same task with different configurations side by side.
version: 1
id: "agent-comparison"
description: "Compare agent performance across models"
base: "ubuntu:24.04"
setup:
packages: [python3, pip]
commands: ["pip install -r requirements.txt"]
fixtures:
- type: git_repo
url: "https://github.com/your-org/ml-pipeline"
branch: "main"
task:
prompt: "Fix the data pipeline bug that causes duplicate records."
agent:
type: paragon
timeout: 5m
invariants:
no_duplicates:
description: "No duplicate records in output"
weight: 1.0
gate: true
check:
type: command_exit
command: "python3 check_duplicates.py"
scoring:
pass_threshold: 0.8
parallelism:
replicas: 5
matrix:
- model: "paragon-fast"
  - model: "paragon-max"
This creates 10 total runs (2 models x 5 replicas). After both finish, compare them:
const comparison = await ks.experiments.compare(fastExpId, maxExpId);
console.log(`Regressed: ${comparison.regressed}`);
comparison.metrics.forEach(m => {
console.log(`${m.name}: ${m.baseline} -> ${m.candidate} (${m.direction})`);
});
Setting up alerts
Get notified when your agent's performance drops.
// Slack Bot alert (posts to a channel)
await ks.alerts.create({
name: 'pass-rate-drop',
eval_id: 'reconciliation',
condition: 'pass_rate < 0.9',
notify: 'slack',
slack_channel: '#agent-alerts',
});
// Webhook alert (Slack incoming webhook auto-detected)
await ks.alerts.create({
name: 'cost-spike',
condition: 'mean_cost_per_run_usd > 2.00',
notify: 'webhook',
webhook_url: 'https://hooks.slack.com/services/T00/B00/xxx',
});
Alerts evaluate after every experiment run. See SDK reference > Alerts for all supported metrics and operators.
Using sandboxes directly
You don't have to run full experiments. You can create a sandbox from a spec, interact with it manually, and inspect the results. This is useful for development and debugging.
import { Keystone } from '@polarityinc/polarity-keystone';
const ks = new Keystone();
// Create a sandbox
const sb = await ks.sandboxes.create({ spec_id: 'fix-unit-test' });
console.log(`Sandbox ${sb.id} is ${sb.state}`);
console.log(`Services:`, sb.services);
// { db: { host: "db", port: 5432, ready: true } }
// Run commands
const test = await ks.sandboxes.runCommand(sb.id, { command: 'npm test' });
console.log(`Exit code: ${test.exit_code}`);
console.log(`Output: ${test.stdout}`);
// Read and write files
const src = await ks.sandboxes.readFile(sb.id, 'src/utils.ts');
await ks.sandboxes.writeFile(sb.id, 'src/utils.ts', fixedCode);
// See what changed
const diff = await ks.sandboxes.diff(sb.id);
console.log(`Modified: ${diff.modified}`);
console.log(`Added: ${diff.added}`);
// Clean up
await ks.sandboxes.destroy(sb.id);
This gives you a remote dev environment with services pre-configured. You can use it for manual testing, scripted integration tests, or building custom evaluation pipelines.
Building an agent for Keystone
If you're building an agent that's designed to run inside Keystone sandboxes, here's what you need to know from the agent's perspective.
What your agent gets
When Keystone runs your agent, it sets up the environment before your code starts:
- A workspace directory -- your agent's working directory, seeded with any fixtures (git repos, files)
- Backing services -- Postgres, Redis, etc. are already running and reachable by name
- Environment variables -- connection info for every service, the sandbox ID, and the Keystone API URL
- The task prompt -- passed via stdin (for cli/paragon/python agents) or in the HTTP request body (for http agents)
Environment variables
Keystone injects these into every agent process:
| Variable | Example | Description |
|---|---|---|
KEYSTONE_SANDBOX_ID | sb-abc123 | Current sandbox ID |
KEYSTONE_BASE_URL | http://keystone:8012 | Keystone API URL |
KEYSTONE_SERVICE_DB_HOST | db | Hostname for the "db" service |
KEYSTONE_SERVICE_DB_PORT | 5432 | Port for the "db" service |
KEYSTONE_SERVICE_CACHE_HOST | cache | Hostname for the "cache" service |
KEYSTONE_SERVICE_CACHE_PORT | 6379 | Port for the "cache" service |
One pair of HOST/PORT vars is created for each service in the spec, with the service name uppercased.
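If you do read the variables manually, a small helper following that naming convention might look like this (serviceAddress is a hypothetical name, not part of the SDK):

```typescript
// Resolve "<host>:<port>" for a named service from the injected
// KEYSTONE_SERVICE_<NAME>_HOST / _PORT variables (service name uppercased).
function serviceAddress(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const key = name.toUpperCase();
  const host = env[`KEYSTONE_SERVICE_${key}_HOST`];
  const port = env[`KEYSTONE_SERVICE_${key}_PORT`];
  if (!host || !port) throw new Error(`service "${name}" not found in environment`);
  return `${host}:${port}`;
}

// With KEYSTONE_SERVICE_DB_HOST=db and KEYSTONE_SERVICE_DB_PORT=5432:
// serviceAddress("db") returns "db:5432"
```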
Discovering services with the SDK
Instead of parsing env vars manually, use Keystone.fromSandbox():
import { Keystone } from '@polarityinc/polarity-keystone';
const { client: ks, sandbox } = await Keystone.fromSandbox();
// Connect to services
const db = sandbox.services?.['db'];
console.log(`DB at ${db.host}:${db.port}, ready: ${db.ready}`);
// Run commands in the sandbox
const result = await ks.sandboxes.runCommand(sandbox.id, {
command: 'npm test',
});
console.log(`Tests: exit code ${result.exit_code}`);
Auto-tracing LLM calls
Wrap your LLM client so Keystone automatically captures token usage, latency, and tool calls. This data shows up in traces and can be queried by agent name and version.
import Anthropic from '@anthropic-ai/sdk';
import { Keystone } from '@polarityinc/polarity-keystone';
const { client: ks, sandbox } = await Keystone.fromSandbox();
const anthropic = ks.wrap(new Anthropic()); // sandbox id auto-detected from env
// Every messages.create() call now auto-reports to Keystone
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
messages: [{ role: 'user', content: 'Fix the bug in main.py' }],
});
Custom trace spans
For non-LLM operations (file I/O, API calls, test execution), use traced() to capture timing and errors:
import { traced } from '@polarityinc/polarity-keystone';
const testOutput = await traced('run_tests', async () => {
return execSync('npm test').toString();
});
const analysis = await traced('analyze_results', async () => {
return parseTestOutput(testOutput);
});
Nested traced() calls create parent-child spans automatically, so you get a full execution tree in the trace viewer.
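Conceptually, the span tree is built the way this toy stand-in shows (a local illustration only, not the SDK's implementation):

```typescript
// Toy sketch of parent-child span nesting. NOT the SDK's traced() -- a
// minimal local stand-in to show how nested calls form a tree.
type Span = { name: string; children: Span[] };
const root: Span = { name: "root", children: [] };
const stack: Span[] = [root];

async function tracedToy<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span: Span = { name, children: [] };
  stack[stack.length - 1].children.push(span); // attach to the current parent
  stack.push(span); // nested calls made inside fn() attach to this span
  try {
    return await fn();
  } finally {
    stack.pop();
  }
}

await tracedToy("run_tests", async () => {
  await tracedToy("compile", async () => {});
  await tracedToy("execute", async () => {});
});
// root now holds one "run_tests" span with "compile" and "execute" children
```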
Stdout and stderr
Everything your agent prints to stdout and stderr is captured and returned in the experiment results:
- `scenarios[].agent_output` -- your agent's stdout (truncated to 50KB)
- `scenarios[].agent_stderr` -- your agent's stderr (truncated to 50KB)
Use stdout for normal output and stderr for debug logging. If audit.stdout_capture is enabled in the spec, stdout is also scanned by the secrets_in_logs forbidden rule.
You can view the captured output when debugging failures:
const results = await ks.experiments.runAndWait(exp.id);
for (const scenario of results.scenarios) {
if (scenario.status === 'fail') {
console.log('Agent output:', scenario.agent_output);
console.log('Agent stderr:', scenario.agent_stderr);
}
}
What your agent should NOT do
If the spec has forbidden rules, Keystone monitors your agent's behavior via the audit log. Common restrictions:
- Don't write to files outside the allowed list -- if `file_writes_outside: [src/, config/]` is set, writing to `/etc/passwd` fails the run
- Don't make HTTP calls to unauthorized hosts -- if `http_except: [payment-api]` is set, calling `api.stripe.com` directly fails the run
- Don't print secrets to stdout -- if `secrets_in_logs: deny` is set, any output containing AWS keys, private keys, or Stripe tokens fails the run
These rules are evaluated after your agent finishes. Your agent won't be interrupted mid-execution, but the scenario will be marked as failed if any rule is violated.