SDK Reference
Overview of the Keystone SDK across TypeScript, Python, and Go — services, methods, and what each one does.
The Keystone client surface is the same across languages: nine services hang off a single client object, named Keystone in TypeScript and Python and Client in Go. Each service maps to a REST namespace; each method maps to one or more HTTP calls.
This page is the orientation. Per-language docs live in:
The nine services
| Service | What it manages |
|---|---|
| sandboxes | Isolated environments — create, run commands, read/write files, snapshot, destroy |
| specs | Spec YAML upload, list, get, delete |
| experiments | Run a spec, poll for results, compare two runs, fetch metrics |
| alerts | Slack/webhook notifications when metrics cross thresholds |
| agents | Versioned, content-addressed agent snapshots — upload, get, list, delete |
| datasets | Versioned (input, expected) collections — drive specs across many cases |
| scoring | Server-side score rules — create, list, run offline scoring, fetch scores |
| export | Bulk extraction of traces, spans, scenarios, scores |
| prompts | Server-side prompt templates — register, render, version |
Plus three top-level helpers:
| Helper | What it does |
|---|---|
| Keystone.fromSandbox() | Inside a sandbox, build a pre-configured client and return the current sandbox |
| wrap(client) | Wrap an LLM client so every .create() call auto-reports traces |
| traced(fn) | Wrap a function to capture spans (duration, errors, parent-child structure) |
And the Eval() / auto_instrument() / observe() ergonomic primitives — see Tracing and Auto-Instrument.
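To make the traced(fn) row in the helpers table concrete, here is a minimal sketch of how a span-capturing wrapper could work. It is not the SDK's implementation: `tracedSketch`, the `Span` shape, and the module-level `spans` and `stack` arrays are hypothetical names chosen for illustration, and the stack-based parent tracking is only reliable for sequential (non-interleaved) calls.

```typescript
// Hypothetical span record; the real SDK's span shape may differ.
interface Span {
  name: string;
  parent: string | null;
  durationMs: number;
  error: string | null;
}

const spans: Span[] = []; // finished spans, in completion order
const stack: string[] = []; // names of spans that are currently open

// Wrap an async function so that every call records one span.
function tracedSketch<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  opts: { name: string },
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const parent: string | null = stack[stack.length - 1] ?? null;
    const start = Date.now();
    stack.push(opts.name);
    let error: string | null = null;
    try {
      return await fn(...args);
    } catch (e) {
      error = e instanceof Error ? e.message : String(e);
      throw e; // re-throw so callers still see the failure
    } finally {
      stack.pop();
      spans.push({ name: opts.name, parent, durationMs: Date.now() - start, error });
    }
  };
}

// Nested calls produce a parent-child pair of spans.
const inner = tracedSketch(async (x: number) => x * 2, { name: "inner" });
const outer = tracedSketch(async (x: number) => (await inner(x)) + 1, { name: "outer" });
```

Calling `outer(3)` here records the inner span first (with parent "outer") and then the outer span (with no parent), because spans are pushed on completion.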
Client construction
Every SDK constructs a client with up to three knobs:
| Knob | Default | Override |
|---|---|---|
| API key | KEYSTONE_API_KEY env var | Constructor arg |
| Base URL | https://keystone.polarity.so | Constructor arg |
| Timeout | 30 seconds | Constructor arg |
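The precedence in the table above (constructor argument first, then environment variable or default) could be resolved along these lines. This is a sketch only: `resolveConfig` and both interfaces are hypothetical, and throwing on a missing key is an assumption about reasonable behavior, not documented SDK behavior.

```typescript
// Hypothetical option shape mirroring the three knobs.
interface ClientOptions {
  apiKey?: string;
  baseUrl?: string;
  timeout?: number; // milliseconds
}

interface ResolvedConfig {
  apiKey: string;
  baseUrl: string;
  timeout: number;
}

// Constructor arguments win; otherwise fall back to the env var or the default.
function resolveConfig(
  opts: ClientOptions,
  env: Record<string, string | undefined>,
): ResolvedConfig {
  const apiKey = opts.apiKey ?? env["KEYSTONE_API_KEY"];
  if (!apiKey) {
    // Assumption: a missing key is treated as a configuration error.
    throw new Error("No API key: pass apiKey or set KEYSTONE_API_KEY");
  }
  return {
    apiKey,
    baseUrl: opts.baseUrl ?? "https://keystone.polarity.so",
    timeout: opts.timeout ?? 30_000,
  };
}
```

In Node, the second argument would typically be `process.env`; it is passed explicitly here so the function is easy to test.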
import { Keystone } from "@polarityinc/polarity-keystone";
const ks = new Keystone({
  apiKey: "ks_live_...", // or KEYSTONE_API_KEY env
  baseUrl: "https://keystone.polarity.so", // default
  timeout: 30_000, // ms
});
Method conventions
Every service follows the same naming pattern — close enough that you can navigate the SDKs interchangeably:
| Convention | TS | Python | Go |
|---|---|---|---|
| Create | service.create(opts) | service.create(...) | Service.Create(ctx, req) |
| Get | service.get(id) | service.get(id) | Service.Get(ctx, id) |
| List | service.list() | service.list() | Service.List(ctx) |
| Delete | service.delete(id) | service.delete(id) | Service.Delete(ctx, id) |
| Run | service.run(id) | service.run(id) | Service.Run(ctx, id) |
| Compare | service.compare(a, b) | service.compare(a, b) | Service.Compare(ctx, a, b) |
Async/sync follows each language's idiom: TS returns Promises, Python is synchronous (wrap calls in asyncio.to_thread to use them from async code), and Go takes a context.Context as the first argument.
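Because the verbs line up across services, generic helpers can be written against the shared shape. The sketch below assumes that list() resolves to an array of objects carrying an id and that delete(id) resolves to nothing; `CrudLike`, `deleteWhere`, and `fakeService` are hypothetical names, not SDK exports, and the in-memory fake stands in for a real service.

```typescript
// Hypothetical structural type capturing the list/delete convention.
interface CrudLike<T extends { id: string }> {
  list(): Promise<T[]>;
  delete(id: string): Promise<void>;
}

// Delete every resource matching a predicate; returns the deleted ids.
async function deleteWhere<T extends { id: string }>(
  service: CrudLike<T>,
  predicate: (item: T) => boolean,
): Promise<string[]> {
  const deleted: string[] = [];
  for (const item of await service.list()) {
    if (predicate(item)) {
      await service.delete(item.id);
      deleted.push(item.id);
    }
  }
  return deleted;
}

// In-memory stand-in, used here in place of a real service.
function fakeService(ids: string[]): CrudLike<{ id: string }> {
  const items = new Map<string, { id: string }>();
  for (const id of ids) items.set(id, { id });
  return {
    list: async () => Array.from(items.values()),
    delete: async (id) => {
      items.delete(id);
    },
  };
}
```

Any service that follows the table's naming would fit the same helper, provided its return types match the assumptions above.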
Each service in 30 seconds
sandboxes
const sb = await ks.sandboxes.create({ spec_id: "..." });
const result = await ks.sandboxes.runCommand(sb.id, { command: "npm test" });
const content = await ks.sandboxes.readFile(sb.id, "output.json");
const diff = await ks.sandboxes.diff(sb.id);
await ks.sandboxes.destroy(sb.id);
specs
const spec = await ks.specs.create(yamlContent);
const all = await ks.specs.list();
const one = await ks.specs.get("fix-failing-test");
await ks.specs.delete("fix-failing-test");
Specs are versioned automatically — uploading the same id: increments the version.
experiments
const exp = await ks.experiments.create({ name: "...", spec_id: "..." });
const results = await ks.experiments.runAndWait(exp.id);
const cmp = await ks.experiments.compare("exp-baseline", exp.id);
const metrics = await ks.experiments.metrics(exp.id);
alerts
await ks.alerts.create({
  name: "pass-rate-drop",
  condition: "pass_rate < 0.8",
  notify: "slack",
  slack_channel: "#agent-alerts",
});
const all = await ks.alerts.list();
await ks.alerts.delete("alert_abc");
agents
const snap = await ks.agents.upload({
  name: "email-agent",
  entrypoint: ["python", "main.py"],
  bundle: tarballBytes,
});
const latest = await ks.agents.get("email-agent");
const tagged = await ks.agents.get("email-agent", { tag: "v2.1" });
await ks.agents.delete(snap);
datasets
const ds = await ks.datasets.create("customer-emails", "...");
await ks.datasets.addRecords(ds.id, [
  { input: { id: "alice" }, expected: { subject: "Renewal" } },
]);
const records = await ks.datasets.getRecords(ds.id);
scoring
const rule = await ks.scoring.createRule("factuality", "llm_as_judge", { ... });
await ks.scoring.scoreExperiment("exp-old-id", [rule.id]);
const scores = await ks.scoring.getScores("exp-old-id");
Server-side scoring rules — useful for retroactively adding metrics to old experiments.
export
const traces = ks.export.traces({ experiment_id: "exp-abc" });
for await (const event of traces) {
  console.log(event);
}
const bundle = await ks.export.experiment("exp-abc", { format: "json" });
Streaming pagination over traces, spans, scenarios, and scores. NDJSON output for piping into jq.
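The streaming shape can be pictured as an async generator that walks pages behind the scenes. This is a sketch of the general pattern, not the SDK's implementation: the cursor-based `Page` shape, the `fetchPage` callback, and the fake two-page backend are all assumptions made for illustration.

```typescript
// Hypothetical page shape: items plus an opaque cursor, null when exhausted.
interface Page<T> {
  items: T[];
  nextCursor: string | null;
}

// Yield items one at a time, fetching the next page only when needed.
async function* paginate<T>(
  fetchPage: (cursor: string | null) => Promise<Page<T>>,
): AsyncGenerator<T> {
  let cursor: string | null = null;
  do {
    const page: Page<T> = await fetchPage(cursor);
    for (const item of page.items) yield item;
    cursor = page.nextCursor;
  } while (cursor !== null);
}

// Serialize a stream as NDJSON: one JSON document per line.
async function toNdjson<T>(stream: AsyncIterable<T>): Promise<string> {
  let out = "";
  for await (const item of stream) out += JSON.stringify(item) + "\n";
  return out;
}

// Fake two-page backend standing in for a real export endpoint.
const pages: Record<string, Page<{ span: number }>> = {
  start: { items: [{ span: 1 }, { span: 2 }], nextCursor: "p2" },
  p2: { items: [{ span: 3 }], nextCursor: null },
};
const fakeFetch = async (cursor: string | null) => pages[cursor ?? "start"];
```

The consumer never sees page boundaries: `for await` over `paginate(fakeFetch)` yields the three items in order, and `toNdjson` turns them into three lines suitable for line-oriented tools.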
prompts
const p = await ks.prompts.create({
  name: "summarize",
  template: "Summarize this in {{ words }} words: {{ text }}",
});
const rendered = ks.prompts.render(p, { text: "...", words: 50 });
Server-side prompt template store. Versioned, named, render with variables.
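Rendering happens server-side, but the substitution itself is easy to picture locally. The sketch below assumes `{{ name }}` placeholders with optional surrounding whitespace and treats a missing variable as an error; how the server actually handles missing variables is not stated here. `renderTemplate` is a hypothetical helper, not an SDK function.

```typescript
// Replace each {{ name }} placeholder with the matching variable.
function renderTemplate(
  template: string,
  vars: Record<string, string | number>,
): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
    // Own-property check so inherited names like "toString" are not matched.
    if (!Object.prototype.hasOwnProperty.call(vars, name)) {
      throw new Error(`Missing template variable: ${name}`);
    }
    return String(vars[name]);
  });
}

const preview = renderTemplate(
  "Summarize this in {{ words }} words: {{ text }}",
  { words: 50, text: "hello" },
);
// preview === "Summarize this in 50 words: hello"
```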
Top-level helpers
Keystone.fromSandbox() — inside the sandbox
const { client, sandbox } = await Keystone.fromSandbox();
const db = sandbox.services.db; // { host, port, ready }
Reads KEYSTONE_SANDBOX_ID, KEYSTONE_API_KEY, and KEYSTONE_BASE_URL from the environment (Keystone injects them at sandbox boot), constructs a client, and returns it along with the current sandbox object. The injected API key is sandbox-scoped, so it is safe to log.
ks.wrap(client) — LLM tracing
const anthropic = ks.wrap(new Anthropic());
// every messages.create() now auto-reports
traced(fn) — custom spans
const writeFile = traced(async (path, content) => fs.writeFile(path, content), { name: "write_file" });
await writeFile("config.json", "{}");
ks.observe({...}) — one-call observability
ks.observe({ clients: [openai, anthropic], aiSdk });
// Wraps every client + auto-instruments AI SDK + initializes traced()
The "trace everything" line — combines wrap(), initTracing(), and auto_instrument().
Eval(name, config) — Braintrust-parity primitive
const result = await Eval("email-eval", {
  data: [{ input: ..., expected: ... }, ...],
  task: async (input) => myAgent.run(input),
  scores: [new Factuality(), new AnswerRelevancy()],
});
console.log(result.summary);
Runs the task for each row in parallel, scores each result, and aggregates per scorer. If KEYSTONE_API_KEY is set, results are also posted to the dashboard.
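The run, score, aggregate flow can be sketched in a few lines. This illustrates the pattern only and is not the Eval() implementation: `miniEval`, the `Scorer` shape, the `exactMatch` scorer, and the choice of a plain mean for aggregation are all assumptions.

```typescript
interface Row<I, E> {
  input: I;
  expected: E;
}

// Hypothetical scorer: a name plus a function returning a score in [0, 1].
interface Scorer<O, E> {
  name: string;
  score(output: O, expected: E): number;
}

async function miniEval<I, O, E>(
  data: Row<I, E>[],
  task: (input: I) => Promise<O>,
  scorers: Scorer<O, E>[],
): Promise<Record<string, number>> {
  // Run the task for every row in parallel.
  const outputs = await Promise.all(data.map((row) => task(row.input)));

  // Score each output, then average per scorer.
  const summary: Record<string, number> = {};
  for (const scorer of scorers) {
    const total = outputs.reduce(
      (sum, out, i) => sum + scorer.score(out, data[i].expected),
      0,
    );
    summary[scorer.name] = data.length > 0 ? total / data.length : 0;
  }
  return summary;
}

// Simple scorer used for the demonstration.
const exactMatch: Scorer<string, string> = {
  name: "exact_match",
  score: (output, expected) => (output === expected ? 1 : 0),
};
```

Note that `Promise.all` rejects as soon as any task fails; a production runner would more likely record per-row errors and keep going.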
Errors
Every SDK surfaces non-2xx responses as a typed error:
| Language | Type |
|---|---|
| TS / Python | KeystoneError (with statusCode) |
| Go | *APIError (with StatusCode) |
Most operations include the server's error message in the exception. For 503s ("at capacity"), implement a short retry loop with jitter — the platform is back-pressuring, not refusing.
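A retry loop for the 503 case might look like the sketch below. It relies only on the statusCode field the error table lists for TS; the attempt count, base delay, and exponential backoff with full jitter are assumptions, not platform recommendations, and `retryOn503` is a hypothetical helper.

```typescript
// True when the error carries statusCode 503 (duck-typed, so no SDK import is needed).
function isAtCapacity(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { statusCode?: unknown }).statusCode === 503
  );
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry only on 503, with exponential backoff and full jitter.
async function retryOn503<T>(
  op: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Anything other than a 503, or running out of attempts, is re-thrown.
      if (!isAtCapacity(err) || attempt >= maxAttempts) throw err;
      const cap = baseDelayMs * Math.pow(2, attempt - 1);
      await sleep(Math.random() * cap); // full jitter: uniform in [0, cap)
    }
  }
}
```

Usage would be along the lines of `await retryOn503(() => ks.sandboxes.create({ spec_id: "..." }))`. Other error statuses propagate immediately, so client errors are not retried.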
Authentication
All requests use Authorization: Bearer <api-key>. Keys are tenant-scoped, and the prefix indicates how much of the tenant a key can reach:
| Key prefix | Means |
|---|---|
| ks_live_ | Tenant-issued production key. Full account access. Rotate via the Dashboard. |
| ks_sb_ | Sandbox-scoped token. Authorized only for one sandbox's resources. Auto-injected into the agent process; safe to log. |
Use ks_live_ keys from your application code; ks_sb_ is what the SDK sees inside a sandbox.
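The prefix rules lend themselves to a small guard in application code. The helpers below are hypothetical (none of them is an SDK export); they classify a key by the documented prefixes, build the Bearer header, and redact anything that is not a sandbox token before it reaches a log line.

```typescript
type KeyScope = "tenant" | "sandbox" | "unknown";

// Classify a key by its documented prefix.
function keyScope(apiKey: string): KeyScope {
  if (apiKey.startsWith("ks_live_")) return "tenant";
  if (apiKey.startsWith("ks_sb_")) return "sandbox";
  return "unknown";
}

// Build the Authorization header that every request carries.
function authHeaders(apiKey: string): Record<string, string> {
  return { Authorization: `Bearer ${apiKey}` };
}

// Return a form of the key that is safe to write to logs.
function loggable(apiKey: string): string {
  switch (keyScope(apiKey)) {
    case "sandbox":
      return apiKey; // documented as safe to log
    case "tenant":
      return "ks_live_[redacted]";
    default:
      return "[redacted]";
  }
}
```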