SDK Reference

Overview of the Keystone SDK across TypeScript, Python, and Go — services, methods, and what each one does.

The Keystone client surface is the same across languages: nine services hanging off a single client object, named Keystone in TS and Python and Client in Go. Each service maps to a REST namespace; each method maps to one or more HTTP calls.

This page is the orientation. Per-language details live in each SDK's own docs.

The nine services

Service       What it manages
sandboxes     Isolated environments: create, run commands, read/write files, snapshot, destroy
specs         Spec YAML upload, list, get, delete
experiments   Run a spec, poll for results, compare two runs, fetch metrics
alerts        Slack/webhook notifications when metrics cross thresholds
agents        Versioned, content-addressed agent snapshots: upload, get, list, delete
datasets      Versioned (input, expected) collections that drive specs across many cases
scoring       Server-side score rules: create, list, run offline scoring, fetch scores
export        Bulk extraction of traces, spans, scenarios, scores
prompts       Server-side prompt templates: register, render, version

Plus three top-level helpers:

Helper                   What it does
Keystone.fromSandbox()   Inside a sandbox, build a pre-configured client and return the current sandbox
wrap(client)             Wrap an LLM client so every .create() call auto-reports traces
traced(fn)               Wrap a function to capture spans (duration, errors, parent-child structure)

There are also the ergonomic primitives Eval(), auto_instrument(), and observe(); see Tracing and Auto-Instrument.

Client construction

Every SDK constructs a client with up to three knobs:

Knob       Default                        Override
API key    KEYSTONE_API_KEY env var       Constructor arg
Base URL   https://keystone.polarity.so   Constructor arg
Timeout    30 seconds                     Constructor arg

import { Keystone } from "@polarityinc/polarity-keystone";
 
const ks = new Keystone({
  apiKey: "ks_live_...",                  // or KEYSTONE_API_KEY env
  baseUrl: "https://keystone.polarity.so", // default
  timeout: 30_000,                          // ms
});
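
To make the precedence in the knob table concrete, here is a minimal sketch of how the three options could be resolved. It is not the SDK's implementation: resolveConfig, KeystoneOptions, and ResolvedConfig are hypothetical names, and the environment is passed in as a plain object so the function stays self-contained.

```typescript
// Hypothetical option shape; field names follow the constructor example above.
interface KeystoneOptions {
  apiKey?: string;
  baseUrl?: string;
  timeout?: number; // milliseconds
}

interface ResolvedConfig {
  apiKey: string;
  baseUrl: string;
  timeout: number;
}

// Defaults taken from the knob table.
const DEFAULT_BASE_URL = "https://keystone.polarity.so";
const DEFAULT_TIMEOUT_MS = 30_000;

// Illustrative only: the real SDK may resolve its options differently.
function resolveConfig(
  opts: KeystoneOptions,
  env: Record<string, string | undefined>,
): ResolvedConfig {
  // A constructor argument wins over the KEYSTONE_API_KEY environment variable.
  const apiKey = opts.apiKey ?? env["KEYSTONE_API_KEY"];
  if (!apiKey) {
    throw new Error("No API key: pass apiKey or set KEYSTONE_API_KEY");
  }
  return {
    apiKey,
    baseUrl: opts.baseUrl ?? DEFAULT_BASE_URL,
    timeout: opts.timeout ?? DEFAULT_TIMEOUT_MS,
  };
}
```

Calling resolveConfig({}, { KEYSTONE_API_KEY: "ks_live_..." }) in this sketch yields the default base URL and a 30,000 ms timeout.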

Method conventions

Every service follows the same naming pattern — close enough that you can navigate the SDKs interchangeably:

Convention   TS                      Python                  Go
Create       service.create(opts)    service.create(...)     Service.Create(ctx, req)
Get          service.get(id)         service.get(id)         Service.Get(ctx, id)
List         service.list()          service.list()          Service.List(ctx)
Delete       service.delete(id)      service.delete(id)      Service.Delete(ctx, id)
Run          service.run(id)         service.run(id)         Service.Run(ctx, id)
Compare      service.compare(a, b)   service.compare(a, b)   Service.Compare(ctx, a, b)

Async/sync follows each language's idiom: TS methods return Promises, Python calls are synchronous (wrap them in asyncio.to_thread if you need them off the event loop), and Go methods take a context.Context as their first argument.

Each service in 30 seconds

sandboxes

const sb = await ks.sandboxes.create({ spec_id: "..." });
const result = await ks.sandboxes.runCommand(sb.id, { command: "npm test" });
const content = await ks.sandboxes.readFile(sb.id, "output.json");
const diff = await ks.sandboxes.diff(sb.id);
await ks.sandboxes.destroy(sb.id);

Sandboxes →

specs

const spec = await ks.specs.create(yamlContent);
const all = await ks.specs.list();
const one = await ks.specs.get("fix-failing-test");
await ks.specs.delete("fix-failing-test");

Specs are versioned automatically — uploading the same id: increments the version.

experiments

const exp = await ks.experiments.create({ name: "...", spec_id: "..." });
const results = await ks.experiments.runAndWait(exp.id);
const cmp = await ks.experiments.compare("exp-baseline", exp.id);
const metrics = await ks.experiments.metrics(exp.id);

Experiments →

alerts

await ks.alerts.create({
  name: "pass-rate-drop",
  condition: "pass_rate < 0.8",
  notify: "slack",
  slack_channel: "#agent-alerts",
});
const all = await ks.alerts.list();
await ks.alerts.delete("alert_abc");

Alerts →

agents

const snap = await ks.agents.upload({
  name: "email-agent",
  entrypoint: ["python", "main.py"],
  bundle: tarballBytes,
});
const latest = await ks.agents.get("email-agent");
const tagged = await ks.agents.get("email-agent", { tag: "v2.1" });
await ks.agents.delete(snap);

Agent Snapshots →

datasets

const ds = await ks.datasets.create("customer-emails", "...");
await ks.datasets.addRecords(ds.id, [
  { input: { id: "alice" }, expected: { subject: "Renewal" } },
]);
const records = await ks.datasets.getRecords(ds.id);

Datasets →

scoring

const rule = await ks.scoring.createRule("factuality", "llm_as_judge", { ... });
await ks.scoring.scoreExperiment("exp-old-id", [rule.id]);
const scores = await ks.scoring.getScores("exp-old-id");

Server-side scoring rules — useful for retroactively adding metrics to old experiments.

export

const traces = ks.export.traces({ experiment_id: "exp-abc" });
for await (const event of traces) {
  console.log(event);
}
 
const bundle = await ks.export.experiment("exp-abc", { format: "json" });

Streaming pagination over traces, spans, scenarios, and scores. NDJSON output for piping into jq.
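
As a rough illustration of the NDJSON shape, the sketch below turns any async iterable of events into newline-delimited JSON. The fakeTraces generator is a stand-in for an export stream and its event fields are invented for the example; toNdjson and collect are hypothetical helpers, not SDK functions.

```typescript
// Serialize an async iterable of JSON-compatible events as NDJSON:
// one JSON document per line, each terminated by "\n".
async function* toNdjson<T>(events: AsyncIterable<T>): AsyncGenerator<string> {
  for await (const event of events) {
    yield JSON.stringify(event) + "\n";
  }
}

// Stand-in for an export stream; real events come from the server
// and will have different fields.
async function* fakeTraces(): AsyncGenerator<{ id: string; ms: number }> {
  yield { id: "t1", ms: 12 };
  yield { id: "t2", ms: 34 };
}

// Collect the lines into one string. A real script would write each
// line to stdout as it arrives instead of buffering.
async function collect(lines: AsyncIterable<string>): Promise<string> {
  let out = "";
  for await (const line of lines) out += line;
  return out;
}
```

Because each line is a complete JSON document, a consumer such as jq can process the stream record by record without waiting for the end.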

prompts

const p = await ks.prompts.create({
  name: "summarize",
  template: "Summarize this in {{ words }} words: {{ text }}",
});
const rendered = ks.prompts.render(p, { text: "...", words: 50 });

Server-side prompt template store. Versioned, named, render with variables.
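
Rendering happens on the server and its template engine is not described here, so the following is only a local sketch of what substituting {{ name }} placeholders could look like. renderTemplate is a hypothetical helper, and throwing on a missing variable is a choice made for this example.

```typescript
// Replace each {{ name }} placeholder with the matching variable.
// Assumption: placeholders are plain identifiers with optional spaces,
// as in the template shown above.
function renderTemplate(
  template: string,
  vars: Record<string, string | number>,
): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match, name: string) => {
    // Fail loudly on unknown variables so typos surface early.
    if (!Object.prototype.hasOwnProperty.call(vars, name)) {
      throw new Error(`Missing template variable: ${name}`);
    }
    return String(vars[name]);
  });
}
```

With the template above, renderTemplate(template, { text: "hello", words: 50 }) produces "Summarize this in 50 words: hello".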

Top-level helpers

Keystone.fromSandbox() — inside the sandbox

const { client, sandbox } = await Keystone.fromSandbox();
const db = sandbox.services.db;   // { host, port, ready }

Reads KEYSTONE_SANDBOX_ID, KEYSTONE_API_KEY, KEYSTONE_BASE_URL from the env (Keystone injects them at sandbox boot), constructs a client, returns it plus the current sandbox object. The injected API key is sandbox-scoped — safe to log.

ks.wrap(client) — LLM tracing

const anthropic = ks.wrap(new Anthropic());
// every messages.create() now auto-reports

Tracing →

traced(fn) — custom spans

const writeFile = traced(async (path, content) => fs.writeFile(path, content), { name: "write_file" });
await writeFile("config.json", "{}");

Custom Spans →
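
To show the idea behind a span wrapper, here is a simplified sketch that records duration and errors into an in-memory array. It is not how traced() is implemented: tracedSketch, SpanRecord, and the spans array are hypothetical, and parent-child structure is left out entirely.

```typescript
// One recorded span: name, wall-clock duration, and the error message
// if the wrapped call threw.
interface SpanRecord {
  name: string;
  durationMs: number;
  error?: string;
}

// In this sketch spans go to an in-memory array; the SDK reports them to Keystone.
const spans: SpanRecord[] = [];

function tracedSketch<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  opts: { name: string },
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const start = Date.now();
    let error: string | undefined;
    try {
      return await fn(...args);
    } catch (err) {
      error = err instanceof Error ? err.message : String(err);
      throw err; // re-throw so callers still see the failure
    } finally {
      // Record the span whether the call succeeded or failed.
      const record: SpanRecord = { name: opts.name, durationMs: Date.now() - start };
      if (error !== undefined) record.error = error;
      spans.push(record);
    }
  };
}
```

The wrapper keeps the original function's signature, so call sites do not change when a function becomes traced.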

ks.observe({...}) — one-call observability

ks.observe({ clients: [openai, anthropic], aiSdk });
// Wraps every client + auto-instruments AI SDK + initializes traced()

The "trace everything" line — combines wrap(), initTracing(), and auto_instrument().

Eval(name, config) — Braintrust-parity primitive

const result = await Eval("email-eval", {
  data: [{ input: ..., expected: ... }, ...],
  task: async (input) => myAgent.run(input),
  scores: [new Factuality(), new AnswerRelevancy()],
});
console.log(result.summary);

Run a task per row in parallel, score each result, aggregate per scorer. If KEYSTONE_API_KEY is set, results also post to the dashboard.
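
The run, score, aggregate flow can be sketched in a few lines. This is a toy model rather than the SDK's Eval(): miniEval, Row, and Scorer are hypothetical, scorers are assumed to return a number between 0 and 1, and the per-scorer aggregate is assumed to be a mean.

```typescript
// Hypothetical shapes for this sketch; the SDK's own types may differ.
interface Row<I, E> {
  input: I;
  expected: E;
}

interface Scorer<O, E> {
  name: string;
  score: (output: O, expected: E) => number; // assumed range: 0..1
}

async function miniEval<I, O, E>(
  data: Row<I, E>[],
  task: (input: I) => Promise<O>,
  scorers: Scorer<O, E>[],
): Promise<Record<string, number>> {
  // Run the task for every row in parallel, keeping each output next to its expectation.
  const results = await Promise.all(
    data.map(async (row) => ({ output: await task(row.input), expected: row.expected })),
  );

  // Score each result, then average per scorer.
  const summary: Record<string, number> = {};
  for (const scorer of scorers) {
    const total = results.reduce(
      (sum, r) => sum + scorer.score(r.output, r.expected),
      0,
    );
    summary[scorer.name] = results.length > 0 ? total / results.length : 0;
  }
  return summary;
}
```

Promise.all starts every task at once, which is fine for a handful of rows; a larger dataset would likely need a concurrency limit.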

Errors

Every SDK surfaces non-2xx responses as a typed error:

Language      Type
TS / Python   KeystoneError (with statusCode)
Go            *APIError (with StatusCode)

Most operations include the server's error message in the exception. For 503s ("at capacity"), implement a short retry loop with jitter — the platform is back-pressuring, not refusing.
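
A short retry loop for 503s might look like the sketch below. It assumes only that the thrown error carries a numeric statusCode, as the TS row of the table indicates. withRetry and isAtCapacity are hypothetical helpers, and the attempt count and delays are illustrative values, not Keystone recommendations.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Duck-typed check so the sketch does not depend on the SDK's error class.
function isAtCapacity(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { statusCode?: unknown }).statusCode === 503
  );
}

// Retry only on 503, with exponential backoff and full jitter.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Anything other than 503, or running out of attempts, is surfaced unchanged.
      if (!isAtCapacity(err) || attempt >= maxAttempts) throw err;
      // Full jitter: wait a random time in [0, base * 2^(attempt - 1)).
      const cap = baseDelayMs * 2 ** (attempt - 1);
      await sleep(Math.random() * cap);
    }
  }
}
```

Usage would be along the lines of withRetry(() => ks.sandboxes.create({ spec_id: "..." })). The jitter spreads retries out so that many clients backing off at once do not all return at the same moment.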

Authentication

All requests use Authorization: Bearer <api-key>. The key prefix tells you its scope:

Key prefix   Means
ks_live_     Tenant-issued production key. Full account access. Rotate via the Dashboard.
ks_sb_       Sandbox-scoped token. Authorized only for one sandbox's resources. Auto-injected into the agent process; safe to log.

Use ks_live_ keys from your application code; ks_sb_ is what the SDK sees inside a sandbox.
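
The prefix convention lends itself to a small guard, for example to avoid logging a tenant key by accident. The sketch below is illustrative: keyScope and authHeader are hypothetical helpers, not part of the SDK.

```typescript
type KeyScope = "tenant" | "sandbox" | "unknown";

// Classify a key by its prefix, following the two prefixes documented above.
function keyScope(apiKey: string): KeyScope {
  if (apiKey.startsWith("ks_live_")) return "tenant";
  if (apiKey.startsWith("ks_sb_")) return "sandbox";
  return "unknown";
}

// Build the Authorization header that every request carries.
function authHeader(apiKey: string): Record<string, string> {
  return { Authorization: `Bearer ${apiKey}` };
}
```

A logging helper could then redact anything whose scope is not "sandbox" before writing it out.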