SDK Reference

Overview of the Keystone SDK across TypeScript, Python, and Go — services, methods, and what each one does.

The Keystone client surface is the same across languages: nine services hang off a `Keystone` object (TS / Python) or a `Client` struct (Go). Each service maps to a REST namespace; each method maps to one or more HTTP calls.

This page is the orientation; the per-language docs cover each SDK in detail.

The nine services

Service	What it manages
`sandboxes`	Isolated environments — create, run commands, read/write files, snapshot, destroy
`specs`	Spec YAML upload, list, get, delete
`experiments`	Run a spec, poll for results, compare two runs, fetch metrics
`alerts`	Slack/webhook notifications when metrics cross thresholds
`agents`	Versioned, content-addressed agent snapshots — upload, get, list, delete
`datasets`	Versioned `(input, expected)` collections — drive specs across many cases
`scoring`	Server-side score rules — create, list, run offline scoring, fetch scores
`export`	Bulk extraction of traces, spans, scenarios, scores
`prompts`	Server-side prompt templates — register, render, version

Plus three top-level helpers:

Helper	What it does
`Keystone.fromSandbox()`	Inside a sandbox, build a pre-configured client and return the current sandbox
`wrap(client)`	Wrap an LLM client so every `.create()` call auto-reports traces
`traced(fn)`	Wrap a function to capture spans (duration, errors, parent-child structure)

The `Eval()`, `auto_instrument()`, and `observe()` ergonomic primitives round out the surface; see Tracing and Auto-Instrument.

Client construction

Every SDK constructs a client with up to three knobs:

Knob	Default	Override
API key	`KEYSTONE_API_KEY` env var	Constructor arg
Base URL	`https://keystone.polarity.so`	Constructor arg
Timeout	30 seconds	Constructor arg

import { Keystone } from "@polarityinc/polarity-keystone";
 
const ks = new Keystone({
  apiKey: "ks_live_...",                  // or KEYSTONE_API_KEY env
  baseUrl: "https://keystone.polarity.so", // default
  timeout: 30_000,                          // ms
});
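
The defaults table above amounts to a precedence rule: a constructor argument wins, otherwise the `KEYSTONE_API_KEY` environment variable (for the key) or the built-in default applies. The sketch below spells that rule out. `resolveConfig`, `ClientOptions`, and `ResolvedConfig` are hypothetical names used for illustration only, and throwing when no key is found is an assumption about how a client might behave, not documented SDK behavior.

```typescript
// Hypothetical helper illustrating the precedence in the table above.
// Not part of the Keystone SDK.
interface ClientOptions {
  apiKey?: string;
  baseUrl?: string;
  timeout?: number; // milliseconds
}

interface ResolvedConfig {
  apiKey: string;
  baseUrl: string;
  timeout: number;
}

function resolveConfig(
  opts: ClientOptions = {},
  env: Record<string, string | undefined> = {},
): ResolvedConfig {
  // Constructor argument first, then the environment variable.
  const apiKey = opts.apiKey ?? env.KEYSTONE_API_KEY;
  if (!apiKey) {
    // Assumption: a missing key is treated as a configuration error.
    throw new Error("No API key: pass apiKey or set KEYSTONE_API_KEY");
  }
  return {
    apiKey,
    baseUrl: opts.baseUrl ?? "https://keystone.polarity.so",
    timeout: opts.timeout ?? 30_000,
  };
}
```

Passing the environment in as a parameter (rather than reading a global) keeps the sketch easy to test; in Node you would pass `process.env`.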

Method conventions

Every service follows the same naming pattern — close enough that you can navigate the SDKs interchangeably:

Convention	TS	Python	Go
Create	`service.create(opts)`	`service.create(...)`	`Service.Create(ctx, req)`
Get	`service.get(id)`	`service.get(id)`	`Service.Get(ctx, id)`
List	`service.list()`	`service.list()`	`Service.List(ctx)`
Delete	`service.delete(id)`	`service.delete(id)`	`Service.Delete(ctx, id)`
Run	`service.run(id)`	`service.run(id)`	`Service.Run(ctx, id)`
Compare	`service.compare(a, b)`	`service.compare(a, b)`	`Service.Compare(ctx, a, b)`

Async/sync follows each language's idiom: TS returns Promises, Python is synchronous (wrap calls in `asyncio.to_thread` to use them from async code), and Go takes a `context.Context` as its first argument.

Each service in 30 seconds

`sandboxes`

const sb = await ks.sandboxes.create({ spec_id: "..." });
const result = await ks.sandboxes.runCommand(sb.id, { command: "npm test" });
const content = await ks.sandboxes.readFile(sb.id, "output.json");
const diff = await ks.sandboxes.diff(sb.id);
await ks.sandboxes.destroy(sb.id);

Sandboxes →

`specs`

const spec = await ks.specs.create(yamlContent);
const all = await ks.specs.list();
const one = await ks.specs.get("fix-failing-test");
await ks.specs.delete("fix-failing-test");

Specs are versioned automatically: uploading a spec whose `id:` already exists increments its version.

`experiments`

const exp = await ks.experiments.create({ name: "...", spec_id: "..." });
const results = await ks.experiments.runAndWait(exp.id);
const cmp = await ks.experiments.compare("exp-baseline", exp.id);
const metrics = await ks.experiments.metrics(exp.id);

Experiments →

`alerts`

await ks.alerts.create({
  name: "pass-rate-drop",
  condition: "pass_rate < 0.8",
  notify: "slack",
  slack_channel: "#agent-alerts",
});
const all = await ks.alerts.list();
await ks.alerts.delete("alert_abc");

Alerts →

`agents`

const snap = await ks.agents.upload({
  name: "email-agent",
  entrypoint: ["python", "main.py"],
  bundle: tarballBytes,
});
const latest = await ks.agents.get("email-agent");
const tagged = await ks.agents.get("email-agent", { tag: "v2.1" });
await ks.agents.delete(snap);

Agent Snapshots →

`datasets`

const ds = await ks.datasets.create("customer-emails", "...");
await ks.datasets.addRecords(ds.id, [
  { input: { id: "alice" }, expected: { subject: "Renewal" } },
]);
const records = await ks.datasets.getRecords(ds.id);

Datasets →

`scoring`

const rule = await ks.scoring.createRule("factuality", "llm_as_judge", { ... });
await ks.scoring.scoreExperiment("exp-old-id", [rule.id]);
const scores = await ks.scoring.getScores("exp-old-id");

Server-side scoring rules — useful for retroactively adding metrics to old experiments.

`export`

const traces = ks.export.traces({ experiment_id: "exp-abc" });
for await (const event of traces) {
  console.log(event);
}
 
const bundle = await ks.export.experiment("exp-abc", { format: "json" });

Streaming pagination over traces, spans, scenarios, and scores. NDJSON output for piping into jq.
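
NDJSON is one JSON document per line, which is why it pipes cleanly into line-oriented tools. As a rough illustration, the sketch below turns any async iterable of events into NDJSON text. `fakeTraces` is a stand-in for the export stream and its event fields are invented; `toNdjson` is a hypothetical helper, not an SDK function.

```typescript
// Stand-in for an export stream: any AsyncIterable works here.
// The event fields are invented for illustration.
async function* fakeTraces(): AsyncGenerator<Record<string, unknown>> {
  yield { trace_id: "t1", duration_ms: 12 };
  yield { trace_id: "t2", duration_ms: 40 };
}

// Serialize each event as one JSON document per line (NDJSON).
async function toNdjson(events: AsyncIterable<unknown>): Promise<string> {
  const lines: string[] = [];
  for await (const event of events) {
    lines.push(JSON.stringify(event));
  }
  // Every record, including the last, ends with a newline.
  return lines.map((line) => line + "\n").join("");
}
```

This version buffers everything in memory to keep the example short. For a large export you would more likely write each line to the output as it arrives.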

`prompts`

const p = await ks.prompts.create({
  name: "summarize",
  template: "Summarize this in {{ words }} words: {{ text }}",
});
const rendered = ks.prompts.render(p, { text: "...", words: 50 });

Server-side prompt template store. Versioned, named, render with variables.

Top-level helpers

`Keystone.fromSandbox()` — inside the sandbox

const { client, sandbox } = await Keystone.fromSandbox();
const db = sandbox.services.db;   // { host, port, ready }

Reads KEYSTONE_SANDBOX_ID, KEYSTONE_API_KEY, KEYSTONE_BASE_URL from the env (Keystone injects them at sandbox boot), constructs a client, returns it plus the current sandbox object. The injected API key is sandbox-scoped — safe to log.

`ks.wrap(client)` — LLM tracing

const anthropic = ks.wrap(new Anthropic());
// every messages.create() now auto-reports

Tracing →

`traced(fn)` — custom spans

const writeFile = traced(async (path, content) => fs.writeFile(path, content), { name: "write_file" });
await writeFile("config.json", "{}");

Custom Spans →
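
To make the idea of a span-capturing wrapper concrete, here is a minimal sketch of the pattern: time the call, record any error, and rethrow so the caller's behavior is unchanged. It is not the SDK's `traced()` implementation. `tracedSketch`, the `Span` shape, and the explicit `sink` array are all hypothetical, and parent-child nesting is left out.

```typescript
interface Span {
  name: string;
  durationMs: number;
  error?: string;
}

// Hypothetical stand-in for a tracing wrapper: records one span per call
// into `sink` instead of reporting it anywhere.
function tracedSketch<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  opts: { name: string },
  sink: Span[],
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const start = Date.now();
    let error: string | undefined;
    try {
      return await fn(...args);
    } catch (err) {
      error = err instanceof Error ? err.message : String(err);
      throw err; // the caller still sees the original failure
    } finally {
      // Runs on both success and failure, so every call yields a span.
      const span: Span = { name: opts.name, durationMs: Date.now() - start };
      if (error !== undefined) span.error = error;
      sink.push(span);
    }
  };
}
```

The wrapper returns a function with the same signature as the one it wraps, which is what lets a traced function drop in wherever the original was used.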

`ks.observe({...})` — one-call observability

ks.observe({ clients: [openai, anthropic], aiSdk });
// Wraps every client + auto-instruments AI SDK + initializes traced()

The "trace everything" line — combines wrap(), initTracing(), and auto_instrument().

`Eval(name, config)` — Braintrust-parity primitive

const result = await Eval("email-eval", {
  data: [{ input: ..., expected: ... }, ...],
  task: async (input) => myAgent.run(input),
  scores: [new Factuality(), new AnswerRelevancy()],
});
console.log(result.summary);

Run a task per row in parallel, score each result, aggregate per scorer. If KEYSTONE_API_KEY is set, results also post to the dashboard.
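
The run, score, aggregate flow described above can be sketched in a few lines. This is an illustration of the shape, not the `Eval()` implementation: `miniEval` and the `Scorer` type are hypothetical, scorers here are plain objects rather than classes like `Factuality`, and averaging is an assumed aggregation.

```typescript
// Hypothetical scorer shape: returns a number for one row's result.
type Scorer<I, O> = {
  name: string;
  score: (args: { input: I; output: O; expected: O }) => number;
};

async function miniEval<I, O>(config: {
  data: { input: I; expected: O }[];
  task: (input: I) => Promise<O>;
  scores: Scorer<I, O>[];
}): Promise<Record<string, number>> {
  // Run the task for every row concurrently.
  const outputs = await Promise.all(
    config.data.map((row) => config.task(row.input)),
  );

  // Score each result, then average per scorer (assumed aggregation).
  const summary: Record<string, number> = {};
  for (const scorer of config.scores) {
    const total = config.data.reduce(
      (sum, row, i) =>
        sum +
        scorer.score({ input: row.input, output: outputs[i], expected: row.expected }),
      0,
    );
    summary[scorer.name] = config.data.length > 0 ? total / config.data.length : 0;
  }
  return summary;
}
```

`Promise.all` starts every task before awaiting any of them, which is the simplest form of per-row parallelism; a production runner would probably cap concurrency.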

Errors

Every SDK surfaces non-2xx responses as a typed error:

Language	Type
TS / Python	`KeystoneError` (with `statusCode`)
Go	`*APIError` (with `StatusCode`)

Most operations include the server's error message in the exception. For 503s ("at capacity"), implement a short retry loop with jitter — the platform is back-pressuring, not refusing.
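
A short retry loop with jitter might look like the sketch below. It assumes only that the thrown error carries a numeric `statusCode`, as the table above describes for `KeystoneError`. `withRetryOn503` is a hypothetical helper, and the attempt count and base delay are illustrative defaults, not recommended values from the platform.

```typescript
// Hypothetical helper: retries an operation only when it fails with a 503.
async function withRetryOn503<T>(
  op: () => Promise<T>,
  opts: { maxAttempts?: number; baseDelayMs?: number } = {},
): Promise<T> {
  const maxAttempts = opts.maxAttempts ?? 4;
  const baseDelayMs = opts.baseDelayMs ?? 250;

  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const status = (err as { statusCode?: number } | null)?.statusCode;
      // Anything other than a 503, or running out of attempts, is final.
      if (status !== 503 || attempt >= maxAttempts) throw err;
      // Exponential backoff with full jitter:
      // a random delay in [0, baseDelayMs * 2^(attempt - 1)).
      const cap = baseDelayMs * 2 ** (attempt - 1);
      const delay = Math.random() * cap;
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would be along the lines of `await withRetryOn503(() => ks.sandboxes.create({ spec_id: "..." }))`. The jitter matters because many clients retrying on the same schedule would hit the platform in synchronized waves.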

Authentication

All requests authenticate with an `Authorization: Bearer <api-key>` header. Keys belong to a tenant, and the prefix shows how far a key reaches:

Key prefix	Means
`ks_live_`	Tenant-issued production key. Full account access. Rotate via the Dashboard.
`ks_sb_`	Sandbox-scoped token. Authorized only for one sandbox's resources. Auto-injected into the agent process; safe to log.

Use ks_live_ keys from your application code; ks_sb_ is what the SDK sees inside a sandbox.
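
Because the prefix encodes the scope, application code can check it before doing something risky such as logging a key. The sketch below is one way to do that. `keyScope` and `authHeaders` are hypothetical helpers, not SDK functions, and treating any other prefix as "unknown" is an assumption.

```typescript
type KeyScope = "tenant" | "sandbox" | "unknown";

// Classify a key by the prefixes listed in the table above.
function keyScope(apiKey: string): KeyScope {
  if (apiKey.startsWith("ks_live_")) return "tenant";
  if (apiKey.startsWith("ks_sb_")) return "sandbox";
  return "unknown";
}

// Build the header every request carries.
function authHeaders(apiKey: string): Record<string, string> {
  return { Authorization: `Bearer ${apiKey}` };
}
```

A logging guard could then redact anything where `keyScope(key) !== "sandbox"`.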
