Keystone

SDK Reference

All Keystone SDK methods across TypeScript, Python, and Go. AI agents: fetch https://docs.paragon.run/llms-full.txt for raw unsummarized content.


The Keystone client has seven services: sandboxes, specs, experiments, alerts, agents, datasets, and scoring. All examples use TypeScript -- Python and Go follow the same patterns.

Client setup

Get your API key from app.paragon.run/app/keystone/settings → API Keys tab → Create Key. Keys start with ks_live_ and are shown once at creation. Either pass it to the client directly or set KEYSTONE_API_KEY in your environment.

import { Keystone } from '@polarityinc/polarity-keystone';
 
const ks = new Keystone({
  apiKey: 'ks_live_...',                // or set KEYSTONE_API_KEY env var
  baseUrl: 'https://keystone.example.com', // default: https://keystone.polarity.so
  timeout: 30000,                       // request timeout in ms
});

Secrets

Secrets are declared in your spec with a source: field that tells the SDK where to pull each value from — your local .env, a file on disk, a shell command (Vault / 1Password / Doppler), or the Dashboard. See the full list of source types in Specs → secrets.
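As a rough illustration only -- the exact schema and source-string syntax live in Specs → secrets, and the source values below are assumptions, not confirmed syntax -- a secrets: block might look like:

```yaml
# Hypothetical sketch of a spec's secrets block. Source strings are
# illustrative; consult Specs → secrets for the real list and syntax.
secrets:
  XAI_API_KEY:
    source: env          # pull from the local environment / .env
  DB_PASSWORD:
    source: "command:vault kv get -field=pw secret/db"   # shell command (assumed form)
  PROD_KEY:
    source: dashboard    # resolved server-side from Dashboard Secrets
```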

Auto-forwarding from a spec file

Pass specPath (TS) / spec_path (Python) and the SDK reads the spec's secrets: block, resolves each declared source, and forwards the resulting {name: value} map in the create request.

const exp = await ks.experiments.create({
  name: "scenario-1",
  spec_id: "email-agent-01",
  specPath: "./specs/scenario-1.yaml",   // SDK resolves sources + forwards
});
await ks.experiments.runAndWait(exp.id);

You can also call the resolver directly if you want to inspect or modify the map before sending:

import { collectDeclaredSecretsFromFile } from "@polarityinc/polarity-keystone";
 
const secrets = collectDeclaredSecretsFromFile("./specs/scenario-1.yaml");
// secrets → { XAI_API_KEY: "xai-...", DB_PASSWORD: "..." }
 
const exp = await ks.experiments.create({
  name: "scenario-1",
  spec_id: "email-agent-01",
  secrets,
});

Precedence

Highest wins:

  1. Spec literal (from: static://...) — deterministic fixtures
  2. SDK-forwarded source value (env, env:X, file:, command:)
  3. Dashboard Secret — server-side fallback for missing or source: dashboard entries

A declared secret that resolves to nothing at any layer fails the sandbox boot loudly — no silent empties.

Dashboard as the team/prod baseline

The Dashboard Secrets tab stores AES-256-GCM-encrypted values scoped to the billing owner. Use it when:

  • A secret must be shared across teammates without everyone maintaining their own .env
  • Running in CI/prod where no .env exists on the machine
  • A prod-critical key must refuse any local override (declare with source: dashboard)

Sandboxes

Sandboxes are isolated environments where your agent runs. Create one from a spec, interact with it, then destroy it.

sandboxes.create(opts)

const sb = await ks.sandboxes.create({
  spec_id: 'fix-failing-test',  // required: which spec to use
  timeout: '10m',               // optional: auto-cleanup timer
  metadata: { run: 'test-1' },  // optional: key-value pairs for tracking
});
// Returns: { id, spec_id, state, path, url, created_at, metadata, services }

The services field contains connection info for any backing services defined in the spec:

sb.services.db   // { host: "db", port: 5432, ready: true }
sb.services.cache // { host: "cache", port: 6379, ready: true }

sandboxes.get(id) / sandboxes.list() / sandboxes.destroy(id)

const sb = await ks.sandboxes.get('sb-abc123');
// sb.state: 'creating' | 'ready' | 'running' | 'stopped' | 'error'
 
const all = await ks.sandboxes.list();
 
await ks.sandboxes.destroy('sb-abc123');

sandboxes.runCommand(id, opts)

Run a shell command inside the sandbox.

const result = await ks.sandboxes.runCommand('sb-abc123', {
  command: 'npm test',
  timeout: '2m',
});
// Returns: { command, stdout, stderr, exit_code, duration_ms }

File operations

// Read
const content = await ks.sandboxes.readFile('sb-abc123', 'src/utils.ts');
 
// Write
await ks.sandboxes.writeFile('sb-abc123', 'src/utils.ts', 'const x = 1;');
 
// Delete
await ks.sandboxes.deleteFile('sb-abc123', 'tmp/debug.log');

State and diffing

// Full filesystem snapshot (files + checksums)
const snapshot = await ks.sandboxes.state('sb-abc123');
// Returns: { captured_at, files: { [path]: { size, mode, checksum } } }
 
// What changed since sandbox creation
const diff = await ks.sandboxes.diff('sb-abc123');
// Returns: { added: string[], removed: string[], modified: string[] }

Trace ingestion

Post trace events to a sandbox. The wrap() helper does this automatically for LLM calls, but you can also call it directly.

await ks.sandboxes.ingestTrace('sb-abc123', [
  { event_type: 'tool_call', tool: 'write_file', phase: 'end', status: 'ok', duration_ms: 120 },
]);
 
const trace = await ks.sandboxes.getTrace('sb-abc123');
// Returns: { events: TraceEvent[], metrics: TraceMetrics }

Real-time events (SSE)

Stream sandbox lifecycle events in real-time using Server-Sent Events:

GET /v1/sandboxes/:id/events

Events include status changes (creating, ready, running, destroyed), service startup, fixture application, and command execution. Useful for building dashboards or progress indicators.

const eventSource = new EventSource(
  `${baseUrl}/v1/sandboxes/sb-abc123/events`
);
eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(`[${data.event_type}]`, data.data);
};

Specs

Upload and manage spec YAML files.

import { readFileSync } from 'node:fs';
 
// Upload a spec
const uploaded = await ks.specs.create(readFileSync('my-spec.yaml', 'utf-8'));
 
// Get, list, delete
const spec = await ks.specs.get('fix-failing-test');
const specs = await ks.specs.list();
await ks.specs.delete('fix-failing-test');

Specs are versioned automatically. Each upload to the same id creates a new version.


Experiments

Run your spec across scenarios and score the results.

experiments.create(opts) / experiments.run(id)

const exp = await ks.experiments.create({
  name: 'baseline-v1',
  spec_id: 'fix-failing-test',
});
 
// Trigger async (returns immediately)
await ks.experiments.run(exp.id);

experiments.runAndWait(id, opts?)

Trigger and poll until complete. This is the most common way to run experiments.

const results = await ks.experiments.runAndWait(exp.id, {
  pollInterval: 2000,   // ms between polls (default: 2000)
  timeout: 300000,      // max ms to wait (default: 300000)
});
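Under the hood this is a poll loop, roughly like the generic sketch below. It is not the SDK's actual internals; pollOnce stands in for a call such as experiments.get():

```typescript
// Poll until a terminal state is reached or the timeout elapses.
async function pollUntilDone<T>(
  pollOnce: () => Promise<{ done: boolean; value?: T }>,
  pollInterval = 2000,   // ms between polls
  timeout = 300000,      // max ms to wait
): Promise<T> {
  const deadline = Date.now() + timeout;
  for (;;) {
    const { done, value } = await pollOnce();
    if (done) return value as T;
    if (Date.now() >= deadline) throw new Error('experiment timed out');
    await new Promise((resolve) => setTimeout(resolve, pollInterval));
  }
}
```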

Results structure

Field                          Type     Description
total_scenarios                number   Total scenarios run
passed / failed / errors       number   Counts
metrics.pass_rate              number   0.0 to 1.0
metrics.mean_wall_ms           number   Average latency
metrics.p95_wall_ms            number   95th percentile latency
metrics.total_cost_usd         number   Total cost
metrics.mean_cost_per_run_usd  number   Cost per scenario
metrics.tool_success_rate      number   0.0 to 1.0
scenarios                      array    Per-scenario results with invariants and reproducers

experiments.compare(baselineId, candidateId)

Compare two experiments. Detects regressions in pass rate, cost, and latency.

const comparison = await ks.experiments.compare('exp-baseline', 'exp-new');
// Returns: {
//   regressed: boolean,
//   regressions: ["pass_rate dropped from 90% to 60%"],
//   metrics: [{ name, baseline, candidate, delta, direction }]
// }

experiments.metrics(id)

Detailed metrics with tool breakdown and trends over time.

const metrics = await ks.experiments.metrics(exp.id);
// Returns: { summary, tool_breakdown, cost_trend, pass_rate_trend }

Alerts

Alert rules notify you when experiment metrics cross a threshold. Alerts are persisted and survive server restarts.

Conditions use the format <metric> <operator> <value>.

Metrics: pass_rate, mean_wall_ms, p95_wall_ms, total_cost_usd, mean_cost_per_run_usd, tool_success_rate, side_effect_violations, mean_tool_calls

Operators: <, <=, >, >=, ==, !=

Webhook alerts

await ks.alerts.create({
  name: 'pass-rate-drop',
  eval_id: 'fix-failing-test',      // optional: only fire for this spec
  condition: 'pass_rate < 0.8',
  notify: 'webhook',
  webhook_url: 'https://hooks.slack.com/services/T00/B00/xxx',
});

Slack webhook URLs are auto-detected and receive rich Block Kit messages. Other URLs receive the raw JSON payload.

Slack Bot alerts

Post directly to a Slack channel using a bot token (SLACK_BOT_TOKEN env var on the server):

await ks.alerts.create({
  name: 'cost-spike',
  condition: 'mean_cost_per_run_usd > 2.00',
  notify: 'slack',
  slack_channel: '#agent-alerts',
});

alerts.list() / alerts.delete(id)

const alerts = await ks.alerts.list();
await ks.alerts.delete('alert-abc123');

Agents

Agent snapshots are immutable, versioned bundles of your agent code. Upload them and reference them in specs with agent.type: snapshot.

agents.upload(opts)

const snapshot = await ks.agents.upload({
  name: 'my-agent',
  entrypoint: ['python', 'main.py'],
  runtime: 'python3.12',
  tag: 'latest',
  bundle: tarballBytes,  // Uint8Array of the .tar.gz
});
// Returns: { id, name, version, tag, digest, size_bytes, entrypoint, created_at }

agents.get(name, opts?)

const latest = await ks.agents.get('my-agent');
const tagged = await ks.agents.get('my-agent', { tag: 'stable' });
const specific = await ks.agents.get('my-agent', { version: 3 });

agents.list(opts?) / agents.listVersions(name, opts?)

const page = await ks.agents.list({ limit: 50 });
// Returns: { items: AgentSnapshot[], next_cursor?: string }
 
const versions = await ks.agents.listVersions('my-agent');

agents.delete(snapshot)

Pass the full snapshot object, not just the ID.

const snapshot = await ks.agents.get('my-agent', { version: 1 });
await ks.agents.delete(snapshot);

Agent traces

Every trace is tagged with the agent that produced it. Query by agent name and version:

GET /v1/agents/my-agent/traces
GET /v1/agents/my-agent/traces?version=3
GET /v1/agents/my-agent/traces?limit=100

Returns traces plus computed metrics (tool success rate, latency percentiles, per-tool breakdown).


LLM tracing

ks.wrap(client)

Wrap an Anthropic or OpenAI client so every call automatically reports traces to the current sandbox. Sandbox routing is automatic — the SDK reads KEYSTONE_SANDBOX_ID from the environment (Keystone injects it when your agent runs inside a sandbox). Outside a sandbox there's nothing to route to, so wrap() returns the client untouched and your code runs as normal.

const anthropic = ks.wrap(new Anthropic());
const openai = ks.wrap(new OpenAI());
// .create() calls now auto-report LLM usage, tool calls, and latency
// when running inside a sandbox. Locally, they pass through unchanged.

Works with any OpenAI-compatible provider:

Provider    How to wrap
Anthropic   ks.wrap(new Anthropic())
OpenAI      ks.wrap(new OpenAI())
Groq        ks.wrap(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), { sandboxId })
xAI         ks.wrap(new OpenAI({ baseURL: 'https://api.x.ai/v1' }), { sandboxId })
Together    ks.wrap(new OpenAI({ baseURL: 'https://api.together.xyz/v1' }), { sandboxId })

ks.initTracing(sandboxId) and traced(name, fn)

For non-LLM operations, use traced() to capture custom spans.

import { Keystone, traced } from '@polarityinc/polarity-keystone';
 
const ks = new Keystone();
ks.initTracing('sb-xxx');
 
const result = await traced('write_config', async () => {
  await fs.writeFile('config.json', JSON.stringify(config));
  return 'ok';
});

Nested traced() calls create parent-child spans automatically.

Keystone.fromSandbox()

If your agent is running inside a Keystone sandbox, use this to get a pre-configured client. It reads KEYSTONE_BASE_URL and KEYSTONE_SANDBOX_ID from the environment that Keystone injects automatically.

const { client, sandbox } = await Keystone.fromSandbox();
// client: ready-to-use Keystone instance
// sandbox.services.db: { host: "db", port: 5432, ready: true }

Your agent also gets environment variables for each service:

  • KEYSTONE_SANDBOX_ID -- the sandbox ID
  • KEYSTONE_BASE_URL -- the Keystone API URL
  • KEYSTONE_SERVICE_DB_HOST / KEYSTONE_SERVICE_DB_PORT -- per-service connection info