SDK Reference

TypeScript SDK

Every Keystone TypeScript class, method, and option — what each does, when to use it, what it returns.

@polarityinc/polarity-keystone — the TypeScript / JavaScript SDK. Works in Node.js (≥18) and modern bundlers; some helpers (like Keystone.fromSandbox()) require Node-only APIs.

Install

npm install @polarityinc/polarity-keystone
# or
bun add @polarityinc/polarity-keystone
# or
pnpm add @polarityinc/polarity-keystone

Keystone — the client

import { Keystone } from "@polarityinc/polarity-keystone";
 
const ks = new Keystone({
  apiKey: "ks_live_...",                  // optional — falls back to KEYSTONE_API_KEY
  baseUrl: "https://keystone.polarity.so", // optional — default
  timeout: 30_000,                          // optional — default 30s
});

The constructor reads KEYSTONE_API_KEY from the environment if apiKey is omitted. After construction, the nine services are available as instance properties: ks.sandboxes, ks.specs, ks.experiments, ks.alerts, ks.agents, ks.datasets, ks.scoring, ks.export, ks.prompts.
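The fallback order can be pictured with a small stand-alone sketch. It is not the SDK's implementation: resolveConfig, KeystoneOptions, and ResolvedConfig are hypothetical names, and the defaults are copied from the comments in the constructor example.

```typescript
// Hypothetical stand-in for the option resolution described above; not the SDK's code.
interface KeystoneOptions {
  apiKey?: string;
  baseUrl?: string;
  timeout?: number;
}

interface ResolvedConfig {
  apiKey: string | undefined;
  baseUrl: string;
  timeout: number;
}

function resolveConfig(
  opts: KeystoneOptions,
  env: Record<string, string | undefined>,
): ResolvedConfig {
  return {
    // An explicit apiKey wins; otherwise fall back to KEYSTONE_API_KEY.
    apiKey: opts.apiKey ?? env["KEYSTONE_API_KEY"],
    // Defaults taken from the comments in the constructor example.
    baseUrl: opts.baseUrl ?? "https://keystone.polarity.so",
    timeout: opts.timeout ?? 30_000,
  };
}
```

In Node.js the second argument would typically be process.env.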

ks.wrap(client, opts?) — wrap an LLM client

Patches the client so every .create() call posts trace events.

const anthropic = ks.wrap(new Anthropic());
const openai = ks.wrap(new OpenAI(), {
  sandboxId: "sb-explicit",                  // override env detection
  tracing: false,                             // skip global tracing init
  aiSdk: aiSdkModule,                        // auto-instrument AI SDK alongside
  langchainCallbackManager: callbackManager, // auto-instrument LangChain alongside
});

Returns: the same client object, instrumented in place. Response shapes are unchanged.

Three things it does:

  1. Patches client.messages.create (Anthropic) or client.chat.completions.create (OpenAI).
  2. Calls initTracing() so subsequent traced() calls emit spans (skip with tracing: false).
  3. Auto-instruments any framework module passed via aiSdk: / langchainCallbackManager:.
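To make "instrumented in place" concrete, here is a minimal sketch of the patching idea in step 1. The names patchCreate, TraceRecord, and MessagesClient are hypothetical, and the recorded fields are assumptions; the real trace events and client types are defined by the SDK and the vendor libraries.

```typescript
// Hypothetical shapes for illustration only; real clients and trace events differ.
interface TraceRecord {
  method: string;
  durationMs: number;
  ok: boolean;
}

interface MessagesClient {
  messages: { create: (req: unknown) => Promise<unknown> };
}

function patchCreate<T extends MessagesClient>(
  client: T,
  sink: (event: TraceRecord) => void,
): T {
  // Keep a bound reference to the original method before replacing it.
  const original = client.messages.create.bind(client.messages);
  client.messages.create = async (req: unknown): Promise<unknown> => {
    const start = Date.now();
    try {
      const res = await original(req);
      sink({ method: "messages.create", durationMs: Date.now() - start, ok: true });
      return res; // the response passes through unchanged
    } catch (err) {
      sink({ method: "messages.create", durationMs: Date.now() - start, ok: false });
      throw err; // failures are recorded, then rethrown to the caller
    }
  };
  return client; // same object, now instrumented
}
```

Because the function returns the object it was given, existing references to the client keep working and see the patched method.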

ks.initTracing(sandboxId?) — set up traced() only

ks.initTracing();                       // picks up KEYSTONE_SANDBOX_ID
ks.initTracing("sb-explicit");          // override

A no-op unless both a sandbox id and an API key are available. Use this when you don't have an LLM client to wrap but still want traced() to emit spans.

ks.observe(opts) — one-call observability

const labels = ks.observe({
  clients: [new Anthropic(), new OpenAI()],
  tracing: true,
  aiSdk: aiSdkModule,
  langchainCallbackManager: callbackManager,
  sandboxId: "sb-...",                    // optional
});
// labels: ['anthropic-client', 'openai-client', 'tracing', 'ai-sdk.generateText', 'langchain']

Wraps every client listed in clients, initializes traced(), and auto-instruments the framework modules passed in. Returns labels describing what was instrumented, which is useful for startup logging.

Keystone.fromSandbox() — inside the sandbox

const { client, sandbox } = await Keystone.fromSandbox();
const db = sandbox.services?.db;        // { host, port, ready }

Reads KEYSTONE_SANDBOX_ID, KEYSTONE_API_KEY, KEYSTONE_BASE_URL from the env (Keystone injects them at sandbox boot). Throws KeystoneError if KEYSTONE_SANDBOX_ID isn't set.

SandboxService

ks.sandboxes.create(opts)

const sb = await ks.sandboxes.create({
  spec_id: "fix-failing-test",         // required
  timeout: "10m",                       // optional duration string
  metadata: { run: "ci-7821" },         // optional key-value
  secrets: { ANTHROPIC_API_KEY: "..." } // optional explicit secret map
});

Posts to POST /v1/sandboxes. Returns a Sandbox:

{
  id: "sb-a1b2c3...",
  spec_id: string,
  state: "creating" | "ready" | "running" | "stopped" | "error",
  path: string,
  url: string,
  created_at: string,
  metadata?: Record<string, string>,
  services?: Record<string, ServiceInfo>,
}

The full boot pipeline runs server-side before this returns — see Sandboxes.

Other sandbox methods

await ks.sandboxes.get("sb-abc");                                          // GET /v1/sandboxes/:id
await ks.sandboxes.list();                                                  // GET /v1/sandboxes
await ks.sandboxes.destroy("sb-abc");                                       // DELETE /v1/sandboxes/:id
 
await ks.sandboxes.runCommand("sb-abc", { command: "npm test", timeout: "2m" });
await ks.sandboxes.readFile("sb-abc", "src/main.ts");                       // returns string
await ks.sandboxes.writeFile("sb-abc", "config.json", '{"x": 1}');
await ks.sandboxes.deleteFile("sb-abc", "tmp/cache.bin");
 
await ks.sandboxes.state("sb-abc");                                          // StateSnapshot — files + checksums
await ks.sandboxes.diff("sb-abc");                                           // StateDiff — added/modified/removed
 
await ks.sandboxes.ingestTrace("sb-abc", events);                            // POST /v1/sandboxes/:id/trace
await ks.sandboxes.getTrace("sb-abc");                                       // GET /v1/sandboxes/:id/trace
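A common pattern with these methods is create, run, then always destroy. The sketch below is one way to write that; withSandbox and SandboxApi are hypothetical names, and the interface is a deliberately narrow structural view of the three methods used, not the SDK's own type.

```typescript
// Minimal structural view of the methods used here; the real service has more.
interface SandboxApi<R> {
  create(opts: { spec_id: string }): Promise<{ id: string }>;
  runCommand(id: string, opts: { command: string; timeout?: string }): Promise<R>;
  destroy(id: string): Promise<unknown>;
}

// Hypothetical helper: run one command in a fresh sandbox and always clean up.
async function withSandbox<R>(
  api: SandboxApi<R>,
  specId: string,
  command: string,
): Promise<R> {
  const sb = await api.create({ spec_id: specId });
  try {
    return await api.runCommand(sb.id, { command });
  } finally {
    // Runs on success and on failure, so sandboxes are not leaked.
    await api.destroy(sb.id);
  }
}
```

Assuming ks.sandboxes satisfies this shape, a call would look like withSandbox(ks.sandboxes, "fix-failing-test", "npm test").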

SpecService

await ks.specs.create(yamlContent);                                          // POST /v1/specs
await ks.specs.get("fix-failing-test");                                      // GET /v1/specs/:id
await ks.specs.list();                                                        // GET /v1/specs
await ks.specs.delete("fix-failing-test");                                   // DELETE /v1/specs/:id

create takes the raw YAML as a string. Versioning is automatic — uploading the same id: increments the version.

ExperimentService

ks.experiments.create(opts)

const exp = await ks.experiments.create({
  name: "baseline-v1",
  spec_id: "fix-failing-test",
  specPath: "./specs/fix-failing-test.yaml",   // auto-forwards declared secrets
  secrets: { ... },                             // explicit overrides
});

When specPath is passed, the SDK reads the spec's secrets: block, resolves each declared source on the caller's machine via collectDeclaredSecretsFromFile(), and merges the results into body.secrets. On a key collision, the explicit secrets: argument wins.
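The precedence rule amounts to a spread in the right order. A minimal sketch follows; mergeSecrets and SecretMap are hypothetical names, and the key values are made up for illustration.

```typescript
type SecretMap = Record<string, string>;

// Hypothetical helper mirroring the documented precedence: explicit wins on collision.
function mergeSecrets(declared: SecretMap, explicit: SecretMap = {}): SecretMap {
  // Later spreads overwrite earlier ones, so explicit values replace declared ones.
  return { ...declared, ...explicit };
}

const merged = mergeSecrets(
  { ANTHROPIC_API_KEY: "from-spec", DB_URL: "from-spec" }, // resolved from the spec file
  { ANTHROPIC_API_KEY: "explicit" },                        // passed via secrets:
);
// ANTHROPIC_API_KEY comes from the explicit map; DB_URL is kept from the spec.
```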

Other experiment methods

await ks.experiments.run(exp.id);                       // POST /v1/experiments/:id/run (async)
await ks.experiments.runAndWait(exp.id, {
  pollInterval: 2000,                                    // ms between polls
  timeout: 300_000,                                      // max ms to wait
  scores: [/* client-side scorers */],
});
await ks.experiments.get(exp.id);                       // GET /v1/experiments/:id (RunResults)
await ks.experiments.list();                             // GET /v1/experiments
 
await ks.experiments.compare("exp-baseline", "exp-new");
await ks.experiments.metrics(exp.id);

runAndWait triggers the run, polls until done == total, optionally runs client-side scorers over each completed scenario, and returns the final RunResults.
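The polling half of runAndWait can be sketched as a loop with a deadline. This is an illustration under assumptions, not the SDK's code: pollUntilDone, Progress, and WaitOptions are hypothetical, the done/total fields are taken from the sentence above, and sleep/now are injectable only to keep the sketch testable.

```typescript
// Hypothetical progress shape; the text only says polling stops when done == total.
interface Progress {
  done: number;
  total: number;
}

interface WaitOptions {
  pollInterval: number;                  // ms between polls
  timeout: number;                       // max ms to wait
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  now?: () => number;                    // injectable clock for tests
}

async function pollUntilDone<T extends Progress>(
  fetchStatus: () => Promise<T>,
  opts: WaitOptions,
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const now = opts.now ?? (() => Date.now());
  const deadline = now() + opts.timeout;
  for (;;) {
    const status = await fetchStatus();
    if (status.done === status.total) return status; // every scenario has finished
    if (now() >= deadline) throw new Error(`timed out after ${opts.timeout} ms`);
    await sleep(opts.pollInterval);
  }
}
```

With the SDK, fetchStatus would plausibly wrap ks.experiments.get(exp.id), assuming the returned RunResults exposes comparable counters.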

AlertService

await ks.alerts.create({
  name: "pass-rate-drop",
  eval_id: "fix-failing-test",
  condition: "pass_rate < 0.8",
  notify: "slack",
  slack_channel: "#agent-alerts",
});
 
await ks.alerts.list();
await ks.alerts.delete("alert_abc");

See Alerts for details.

AgentService

ks.agents.upload(opts)

const snap = await ks.agents.upload({
  name: "email-agent",
  entrypoint: ["python", "main.py"],
  runtime: "python3.12",                           // optional hint
  tag: "v2.1",                                     // optional label
  auth: {                                          // optional declared requirements
    required_env: ["ANTHROPIC_API_KEY"],
    config_files: [{ path: ".env", template: "..." }],
    egress: { "api.anthropic.com": ["443"] },
  },
  bundle: tarballBytes,                            // Uint8Array of a .tar.gz
});

Returns:

{
  id: "snap_abc...",                              // immutable content hash
  name: string,
  version: number,                                 // auto-assigned
  tag?: string,
  digest: string,                                  // sha256
  size_bytes: number,
  storage_path?: string,
  runtime?: string,
  entrypoint: string[],
  auth?: AgentAuth,
  created_at: string,
}

Other agent methods

await ks.agents.get("email-agent");                       // latest version
await ks.agents.get("email-agent", { tag: "v2.1" });
await ks.agents.get("email-agent", { version: 3 });
await ks.agents.getById("snap_abc...");                    // by content hash
 
await ks.agents.list({ limit: 50, cursor: "..." });        // paginated
await ks.agents.listVersions("email-agent", { limit: 20 });
 
await ks.agents.delete(snapshot);                          // pass full object, not just id

Pagination returns { items: AgentSnapshot[], next_cursor?: string }.
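To walk every page, follow next_cursor until it is missing. The helper below is a sketch: collectAll and Page are hypothetical names, and treating an absent (or empty) next_cursor as the last page is an assumption based on the optional field in the return shape.

```typescript
interface Page<T> {
  items: T[];
  next_cursor?: string;
}

// Collect every item by following next_cursor until it is absent.
async function collectAll<T>(
  fetchPage: (cursor?: string) => Promise<Page<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor); // first call passes undefined
    all.push(...page.items);
    cursor = page.next_cursor;
  } while (cursor); // stop when the server returns no further cursor
  return all;
}
```

With the SDK this might be called as collectAll((cursor) => ks.agents.list({ limit: 50, cursor })), assuming list accepts an undefined cursor for the first page.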

DatasetService

await ks.datasets.create("customer-emails", "Renewal scenarios");
await ks.datasets.list();
await ks.datasets.get("ds_abc");
await ks.datasets.delete("ds_abc");
 
await ks.datasets.addRecords("ds_abc", [
  { input: { id: "alice" }, expected: { subject: "Renewal" }, tags: ["pro"] },
]);
await ks.datasets.getRecords("ds_abc", { version: 3, tags: ["pro"] });

See Datasets for details.

ScoringService

Server-side scoring rules. Run scoring offline against existing experiments.

const rule = await ks.scoring.createRule("factuality", "llm_as_judge", {
  model: "paragon-fast",
  rubric: { pass: "...", fail: "..." },
});
 
await ks.scoring.listRules();
await ks.scoring.deleteRule(rule.id);
 
await ks.scoring.scoreExperiment("exp-old", [rule.id]);
const scores = await ks.scoring.getScores("exp-old");

scoreExperiment enqueues offline scoring; getScores returns the resulting per-scenario scores.

ExportService

Bulk extraction. Each method returns an AsyncIterable that pages through keyset cursors automatically:

for await (const event of ks.export.traces({ experiment_id: "exp-abc" })) {
  console.log(event);
}
 
for await (const span of ks.export.spans({ root_span_id: "span_xyz" })) {
  console.log(span);
}
 
for await (const scenario of ks.export.scenarios({ experiment_id: "exp-abc", status: "failed" })) {
  console.log(scenario);
}
 
for await (const score of ks.export.scores({ experiment_id: "exp-abc" })) {
  console.log(score);
}
 
// Or one-shot bundle:
const bundle = await ks.export.experiment("exp-abc", { format: "json" });
const ndjson = await ks.export.experiment("exp-abc", { format: "ndjson" });
 
// Or single trace:
const trace = await ks.export.trace("trace-abc");

Filters available depend on the endpoint — see the type definitions for TracesFilters, SpansFilters, ScenariosFilters, ScoresFilters.
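Because the export methods return an AsyncIterable, a consumer can stop early with break. The sketch below relies only on the standard AsyncIterable contract; take is a hypothetical helper, and the claim that stopping avoids further page requests assumes the SDK's iterables fetch pages lazily.

```typescript
// Collect at most `limit` items, then stop iterating.
async function take<T>(source: AsyncIterable<T>, limit: number): Promise<T[]> {
  const out: T[] = [];
  if (limit <= 0) return out;
  for await (const item of source) {
    out.push(item);
    // Leaving the loop calls the iterator's return(), which lets a lazy
    // producer skip any remaining work.
    if (out.length >= limit) break;
  }
  return out;
}
```

For example, take(ks.export.traces({ experiment_id: "exp-abc" }), 100) would sample the first 100 events rather than draining the whole export.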

Tracing exports

import { traced, initTracing, TracedSpan, registerOtelFlush, flushOtel } from "@polarityinc/polarity-keystone";
Export                               Purpose
traced(fn, opts?)                    Decorator/HOF: wrap a function, return the wrapped version
traced(name, fn)                     Callback form (legacy)
TracedSpan                           Manual lifecycle: new TracedSpan(name), setOutput(), fail(), end()
initTracing(http, sandboxId, opts?)  Internal, called by Keystone.initTracing()
registerOtelFlush(cb)                Register an OTel-flush callback for shutdown
flushOtel()                          Run all registered OTel-flush callbacks
currentOtelTracer()                  Get the currently installed OTel tracer (for advanced use)

Wrap exports

import {
  wrapClient,
  wrapAnthropic, wrapOpenAI, wrapMistral, wrapGoogleGenAI,
  wrapClaudeAgentSDK, wrapAISDK, wrapMastraAgent,
  autoInstrument,
} from "@polarityinc/polarity-keystone";

You will generally use ks.wrap() instead of these. If you need finer control, wrapClient(client, sandboxId, http) wraps a client directly without going through the Keystone instance.

Scorer exports

The SDK exports 28 built-in scorers across 5 families. See Scorers Library for what each one does.

import {
  // base
  BaseScorer, CustomScorer, scorer, scorersToInvariants, runClientScorers,
 
  // heuristic
  Contains, ExactMatch, Levenshtein, NumericDiff, JSONDiff, JSONValidity, SemanticListContains,
 
  // llm-judge
  JudgeScorer, Factuality, Battle, ClosedQA, Humor, Moderation, Summarization,
  SQLJudge, Translation, Security,
 
  // rag
  ContextPrecision, ContextRecall, ContextRelevancy, ContextEntityRecall,
  Faithfulness, AnswerRelevancy, AnswerSimilarity, AnswerCorrectness,
 
  // embedding
  EmbeddingSimilarity, openaiEmbedder,
 
  // sandbox invariants
  FileExists, FileContains, CommandExits, SQLEquals, LLMJudge,
 
  // presets
  presets,
} from "@polarityinc/polarity-keystone";

Eval primitive

import { Eval } from "@polarityinc/polarity-keystone";
 
const result = await Eval("email-eval", {
  data: [{ input, expected }, ...],
  task: async (input) => myAgent.run(input),
  scores: [new Factuality(), new AnswerRelevancy()],
  maxConcurrency: 4,
  keystone: ks,                    // optional, falls back to env
});
 
result.summary;       // { factuality: { mean, p50, p95, count }, ... }
result.rows;          // [{ input, expected, output, scores, durationMs }, ...]
result.experimentId;  // dashboard write — present if KEYSTONE_API_KEY is set

A Braintrust-parity primitive. It runs task once per row in parallel (bounded by maxConcurrency), scores each result, and aggregates the scores per scorer.
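The per-scorer aggregation behind result.summary can be approximated as follows. This is a sketch, not the SDK's method: summarize, percentile, and ScoreSummary are hypothetical names, and the nearest-rank percentile is one of several common definitions, so the SDK's p50 and p95 may differ slightly.

```typescript
interface ScoreSummary {
  mean: number;
  p50: number;
  p95: number;
  count: number;
}

// Nearest-rank percentile on an ascending-sorted array (an assumed definition).
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] ?? NaN; // NaN for an empty input
}

function summarize(values: number[]): ScoreSummary {
  const sorted = [...values].sort((a, b) => a - b); // copy so the input is not mutated
  const count = sorted.length;
  const mean = count === 0 ? NaN : sorted.reduce((sum, v) => sum + v, 0) / count;
  return { mean, p50: percentile(sorted, 50), p95: percentile(sorted, 95), count };
}
```

Applied to one scorer's values across result.rows, this yields an object with the same four fields shown for result.summary.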

Pricing

import { estimateCost, pricingTable } from "@polarityinc/polarity-keystone";
 
const cost = estimateCost("claude-sonnet-4-5", 4200, 1800, 1500);
// USD value
 
pricingTable.set("my-custom-model", { input: 1.0, output: 5.0 });
// Add custom pricing

Error handling

import { KeystoneError } from "@polarityinc/polarity-keystone";
 
try {
  await ks.sandboxes.create({ spec_id: "missing" });
} catch (err) {
  if (err instanceof KeystoneError) {
    console.error(err.statusCode, err.message);
  }
}
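One use of err.statusCode is deciding whether a call is worth retrying. The sketch below is hedged on two points: it assumes statusCode carries the HTTP status, and it checks the field structurally so the example runs without the SDK. statusOf, isRetryable, and withRetry are hypothetical helpers; in real code you would narrow with err instanceof KeystoneError as shown above.

```typescript
// Read a numeric statusCode from an unknown error, if it has one.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "statusCode" in err) {
    const code = (err as { statusCode: unknown }).statusCode;
    return typeof code === "number" ? code : undefined;
  }
  return undefined;
}

// Assumption: statusCode is an HTTP status, so 429 and 5xx are transient.
function isRetryable(err: unknown): boolean {
  const code = statusOf(err);
  return code !== undefined && (code === 429 || code >= 500);
}

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (!isRetryable(err)) throw err; // e.g. a 404 for a missing spec fails fast
    }
  }
  throw lastError; // all attempts failed with retryable errors
}
```

The sketch retries immediately; a production version would usually wait between attempts.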

TypeScript types

Every shape is exported. The most useful:

import type {
  Sandbox, CreateSandboxRequest, CommandResult,
  Experiment, RunResults, ScenarioResult, RunMetrics,
  AgentSnapshot, UploadSnapshotRequest, AgentPage,
  AlertRule, AlertFiring,
  TraceEvent, TraceMetrics, TraceResponse,
  Score, InvariantResult, CostInfo,
} from "@polarityinc/polarity-keystone";