TypeScript SDK
Every Keystone TypeScript class, method, and option — what each does, when to use it, what it returns.
@polarityinc/polarity-keystone — the TypeScript / JavaScript SDK. Works in Node.js (≥18) and modern bundlers; some helpers (like Keystone.fromSandbox()) require Node-only APIs.
Install
npm install @polarityinc/polarity-keystone
# or
bun add @polarityinc/polarity-keystone
# or
pnpm add @polarityinc/polarity-keystone
Keystone — the client
import { Keystone } from "@polarityinc/polarity-keystone";
const ks = new Keystone({
apiKey: "ks_live_...", // optional — falls back to KEYSTONE_API_KEY
baseUrl: "https://keystone.polarity.so", // optional — default
timeout: 30_000, // optional — default 30s
});
The constructor reads KEYSTONE_API_KEY from the environment if apiKey is omitted. After construction, the nine services are available as instance properties: ks.sandboxes, ks.specs, ks.experiments, ks.alerts, ks.agents, ks.datasets, ks.scoring, ks.export, ks.prompts.
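The fallback order can be sketched as below. This is an illustration under stated assumptions, not the SDK's source: resolveConfig, ClientOptions, and ResolvedConfig are hypothetical names, the key values are placeholders, and the environment is passed in explicitly so the snippet runs anywhere. The defaults mirror the constructor example above.

```typescript
// Hypothetical sketch of the option/env fallback; not the SDK's actual code.
interface ClientOptions {
  apiKey?: string;
  baseUrl?: string;
  timeout?: number;
}

interface ResolvedConfig {
  apiKey: string | undefined;
  baseUrl: string;
  timeout: number;
}

type EnvMap = Record<string, string | undefined>;

function resolveConfig(opts: ClientOptions, env: EnvMap): ResolvedConfig {
  return {
    // An explicit apiKey wins; otherwise fall back to KEYSTONE_API_KEY.
    apiKey: opts.apiKey ?? env["KEYSTONE_API_KEY"],
    baseUrl: opts.baseUrl ?? "https://keystone.polarity.so",
    timeout: opts.timeout ?? 30_000,
  };
}

// Placeholder keys for illustration only.
const sampleEnv: EnvMap = { KEYSTONE_API_KEY: "ks_live_from_env" };
const configFromEnv = resolveConfig({}, sampleEnv);
const configExplicit = resolveConfig({ apiKey: "ks_live_explicit", timeout: 5_000 }, sampleEnv);
console.log(configFromEnv.apiKey, configExplicit.apiKey, configExplicit.timeout);
```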
ks.wrap(client, opts?) — wrap an LLM client
Patches the client so every .create() call posts trace events.
const anthropic = ks.wrap(new Anthropic());
const openai = ks.wrap(new OpenAI(), {
sandboxId: "sb-explicit", // override env detection
tracing: false, // skip global tracing init
aiSdk: aiSdkModule, // auto-instrument AI SDK alongside
langchainCallbackManager: callbackManager, // auto-instrument LangChain alongside
});
Returns: the same client object, instrumented in place. Response shapes are unchanged.
Three things it does:
- Patches client.messages.create (Anthropic) or client.chat.completions.create (OpenAI).
- Calls initTracing() so subsequent traced() calls emit spans (skip with tracing: false).
- Auto-instruments any framework module passed via aiSdk: / langchainCallbackManager:.
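"Instrumented in place" means the method on the client object is swapped for a wrapper that records an event and passes the original result through. The sketch below shows that general pattern on a stand-in client. patchCreate, SketchTraceEvent, and fakeClient are hypothetical names, and the event shape is an assumption; the SDK's real events and transport are not shown here.

```typescript
// Hypothetical sketch of in-place method patching; not the SDK's actual code.
interface SketchTraceEvent {
  method: string;
  durationMs: number;
  ok: boolean;
}

type CreateFn = (...args: unknown[]) => Promise<unknown>;

// Replaces target[key] with a wrapper that records one event per call and
// returns (or rethrows) exactly what the original method produced.
function patchCreate(
  target: Record<string, unknown>,
  key: string,
  label: string,
  sink: SketchTraceEvent[],
): void {
  const original = target[key] as CreateFn;
  target[key] = async (...args: unknown[]): Promise<unknown> => {
    const start = Date.now();
    try {
      const result = await original.apply(target, args);
      sink.push({ method: label, durationMs: Date.now() - start, ok: true });
      return result;
    } catch (err) {
      sink.push({ method: label, durationMs: Date.now() - start, ok: false });
      throw err;
    }
  };
}

// Stand-in for an LLM client, so no real Anthropic or OpenAI client is needed.
const fakeClient = {
  messages: {
    create: async (req: { prompt: string }) => ({ text: `echo: ${req.prompt}` }),
  },
};

const recordedEvents: SketchTraceEvent[] = [];
patchCreate(
  fakeClient.messages as unknown as Record<string, unknown>,
  "create",
  "messages.create",
  recordedEvents,
);

// Callers keep using the same object and get the same response shape.
async function callPatchedClient(): Promise<{ text: string }> {
  return fakeClient.messages.create({ prompt: "hi" });
}
```

Because the object identity does not change, existing references to the client keep working after wrapping.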
ks.initTracing(sandboxId?) — set up traced() only
ks.initTracing(); // picks up KEYSTONE_SANDBOX_ID
ks.initTracing("sb-explicit"); // override
This is a no-op unless both a sandbox id and an API key are available. Use it when you don't have an LLM client to wrap but still want traced() to emit spans.
ks.observe(opts) — one-call observability
const labels = ks.observe({
clients: [new Anthropic(), new OpenAI()],
tracing: true,
aiSdk: aiSdkModule,
langchainCallbackManager: callbackManager,
sandboxId: "sb-...", // optional
});
// labels: ['anthropic-client', 'openai-client', 'tracing', 'ai-sdk.generateText', 'langchain']
Wraps every client passed in clients, initializes traced(), and auto-instruments framework modules. Returns the labels of what was instrumented, which is useful for startup logging.
Keystone.fromSandbox() — inside the sandbox
const { client, sandbox } = await Keystone.fromSandbox();
const db = sandbox.services?.db; // { host, port, ready }
Reads KEYSTONE_SANDBOX_ID, KEYSTONE_API_KEY, and KEYSTONE_BASE_URL from the environment (Keystone injects them at sandbox boot). Throws KeystoneError if KEYSTONE_SANDBOX_ID isn't set.
SandboxService
ks.sandboxes.create(opts)
const sb = await ks.sandboxes.create({
spec_id: "fix-failing-test", // required
timeout: "10m", // optional duration string
metadata: { run: "ci-7821" }, // optional key-value
secrets: { ANTHROPIC_API_KEY: "..." } // optional explicit secret map
});
Sends POST /v1/sandboxes and returns a Sandbox:
{
id: "sb-a1b2c3...",
spec_id: string,
state: "creating" | "ready" | "running" | "stopped" | "error",
path: string,
url: string,
created_at: string,
metadata?: Record<string, string>,
services?: Record<string, ServiceInfo>,
}
The full boot pipeline runs server-side before this call returns; see Sandboxes.
Other sandbox methods
await ks.sandboxes.get("sb-abc"); // GET /v1/sandboxes/:id
await ks.sandboxes.list(); // GET /v1/sandboxes
await ks.sandboxes.destroy("sb-abc"); // DELETE /v1/sandboxes/:id
await ks.sandboxes.runCommand("sb-abc", { command: "npm test", timeout: "2m" });
await ks.sandboxes.readFile("sb-abc", "src/main.ts"); // returns string
await ks.sandboxes.writeFile("sb-abc", "config.json", '{"x": 1}');
await ks.sandboxes.deleteFile("sb-abc", "tmp/cache.bin");
await ks.sandboxes.state("sb-abc"); // StateSnapshot — files + checksums
await ks.sandboxes.diff("sb-abc"); // StateDiff — added/modified/removed
await ks.sandboxes.ingestTrace("sb-abc", events); // POST /v1/sandboxes/:id/trace
await ks.sandboxes.getTrace("sb-abc"); // GET /v1/sandboxes/:id/trace
SpecService
await ks.specs.create(yamlContent); // POST /v1/specs
await ks.specs.get("fix-failing-test"); // GET /v1/specs/:id
await ks.specs.list(); // GET /v1/specs
await ks.specs.delete("fix-failing-test"); // DELETE /v1/specs/:id
create takes the raw YAML as a string. Versioning is automatic: uploading a spec with an existing id: increments its version.
ExperimentService
ks.experiments.create(opts)
const exp = await ks.experiments.create({
name: "baseline-v1",
spec_id: "fix-failing-test",
specPath: "./specs/fix-failing-test.yaml", // auto-forwards declared secrets
secrets: { ... }, // explicit overrides
});
When specPath is passed, the SDK reads the spec's secrets: block, resolves each declared source on the caller's machine via collectDeclaredSecretsFromFile(), and merges the results into body.secrets (an explicit secrets: argument wins on collision).
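The collision rule can be illustrated with a plain object merge. This is a sketch of the precedence only: mergeSecrets is a hypothetical helper, DB_PASSWORD is a made-up secret name, and the real resolution of declared sources happens inside the SDK.

```typescript
// Hypothetical sketch of the merge order; not the SDK's actual code.
type SecretMap = Record<string, string>;

// declared: values resolved from the spec's secrets: block
// explicit: values passed through the secrets: argument
function mergeSecrets(declared: SecretMap, explicit: SecretMap = {}): SecretMap {
  // Later spreads overwrite earlier ones, so explicit values win on collision.
  return { ...declared, ...explicit };
}

const mergedSecrets = mergeSecrets(
  { ANTHROPIC_API_KEY: "from-spec", DB_PASSWORD: "from-spec" },
  { ANTHROPIC_API_KEY: "from-arg" },
);
console.log(mergedSecrets);
```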
Other experiment methods
await ks.experiments.run(exp.id); // POST /v1/experiments/:id/run (async)
await ks.experiments.runAndWait(exp.id, {
pollInterval: 2000, // ms between polls
timeout: 300_000, // max ms to wait
scores: [/* client-side scorers */],
});
await ks.experiments.get(exp.id); // GET /v1/experiments/:id (RunResults)
await ks.experiments.list(); // GET /v1/experiments
await ks.experiments.compare("exp-baseline", "exp-new");
await ks.experiments.metrics(exp.id);
runAndWait triggers the run, polls until done == total, optionally runs client-side scorers over each completed scenario, and returns the final RunResults.
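A poll-until-done loop with an interval and an overall timeout might look like the sketch below. It is not the SDK's implementation: pollUntilDone, RunProgress, and fakeRun are hypothetical, and the only details taken from the text are the done == total stop condition and the pollInterval / timeout options in milliseconds.

```typescript
// Hypothetical sketch of a poll-until-done loop; not the SDK's actual code.
interface RunProgress {
  done: number;
  total: number;
}

const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollUntilDone(
  fetchProgress: () => Promise<RunProgress>,
  opts: { pollInterval: number; timeout: number },
): Promise<RunProgress> {
  const deadline = Date.now() + opts.timeout;
  for (;;) {
    const progress = await fetchProgress();
    if (progress.done === progress.total) return progress;
    if (Date.now() >= deadline) {
      throw new Error(
        `timed out after ${opts.timeout} ms (${progress.done}/${progress.total})`,
      );
    }
    await pause(opts.pollInterval);
  }
}

// Fake backend that reports one more finished scenario on each poll.
function fakeRun(total: number): () => Promise<RunProgress> {
  let done = 0;
  return async () => ({ done: Math.min(done++, total), total });
}
```

In this sketch the timeout is checked after each poll, so the total wait can exceed timeout by the duration of one request.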
AlertService
await ks.alerts.create({
name: "pass-rate-drop",
eval_id: "fix-failing-test",
condition: "pass_rate < 0.8",
notify: "slack",
slack_channel: "#agent-alerts",
});
await ks.alerts.list();
await ks.alerts.delete("alert_abc");
AgentService
ks.agents.upload(opts)
const snap = await ks.agents.upload({
name: "email-agent",
entrypoint: ["python", "main.py"],
runtime: "python3.12", // optional hint
tag: "v2.1", // optional label
auth: { // optional declared requirements
required_env: ["ANTHROPIC_API_KEY"],
config_files: [{ path: ".env", template: "..." }],
egress: { "api.anthropic.com": ["443"] },
},
bundle: tarballBytes, // Uint8Array of a .tar.gz
});
Returns:
{
id: "snap_abc...", // immutable content hash
name: string,
version: number, // auto-assigned
tag?: string,
digest: string, // sha256
size_bytes: number,
storage_path?: string,
runtime?: string,
entrypoint: string[],
auth?: AgentAuth,
created_at: string,
}
Other agent methods
await ks.agents.get("email-agent"); // latest version
await ks.agents.get("email-agent", { tag: "v2.1" });
await ks.agents.get("email-agent", { version: 3 });
await ks.agents.getById("snap_abc..."); // by content hash
await ks.agents.list({ limit: 50, cursor: "..." }); // paginated
await ks.agents.listVersions("email-agent", { limit: 20 });
await ks.agents.delete(snapshot); // pass full object, not just id
Paginated methods return { items: AgentSnapshot[], next_cursor?: string }.
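With that page shape, collecting every item means following next_cursor until it is absent. The sketch below runs against a fake lister. collectAll, SketchPage, and fakeList are hypothetical names; it assumes a missing next_cursor marks the last page, which the optional field suggests but the text does not state.

```typescript
// Hypothetical sketch of draining a cursor-paginated listing.
interface SketchPage<T> {
  items: T[];
  next_cursor?: string;
}

type PageParams = { limit: number; cursor: string | undefined };

async function collectAll<T>(
  listPage: (params: PageParams) => Promise<SketchPage<T>>,
  limit: number,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined;
  do {
    const page = await listPage({ limit, cursor });
    all.push(...page.items);
    cursor = page.next_cursor; // assumed absent on the last page
  } while (cursor);
  return all;
}

// Fake lister over five snapshot names; the cursor is just an offset here.
const snapshotNames = ["a", "b", "c", "d", "e"];
async function fakeList(params: PageParams): Promise<SketchPage<string>> {
  const start = params.cursor === undefined ? 0 : Number(params.cursor);
  const end = start + params.limit;
  const page: SketchPage<string> = { items: snapshotNames.slice(start, end) };
  if (end < snapshotNames.length) page.next_cursor = String(end);
  return page;
}
```

With the real client, the lister argument would presumably be something like (p) => ks.agents.list(p), provided list accepts that parameter shape.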
DatasetService
await ks.datasets.create("customer-emails", "Renewal scenarios");
await ks.datasets.list();
await ks.datasets.get("ds_abc");
await ks.datasets.delete("ds_abc");
await ks.datasets.addRecords("ds_abc", [
{ input: { id: "alice" }, expected: { subject: "Renewal" }, tags: ["pro"] },
]);
await ks.datasets.getRecords("ds_abc", { version: 3, tags: ["pro"] });
ScoringService
Server-side scoring rules. Run scoring offline against existing experiments.
const rule = await ks.scoring.createRule("factuality", "llm_as_judge", {
model: "paragon-fast",
rubric: { pass: "...", fail: "..." },
});
await ks.scoring.listRules();
await ks.scoring.deleteRule(rule.id);
await ks.scoring.scoreExperiment("exp-old", [rule.id]);
const scores = await ks.scoring.getScores("exp-old");
scoreExperiment enqueues offline scoring; getScores returns the resulting per-scenario scores.
ExportService
Bulk extraction. Each method returns an AsyncIterable that pages through keyset cursors automatically:
for await (const event of ks.export.traces({ experiment_id: "exp-abc" })) {
console.log(event);
}
for await (const span of ks.export.spans({ root_span_id: "span_xyz" })) {
console.log(span);
}
for await (const scenario of ks.export.scenarios({ experiment_id: "exp-abc", status: "failed" })) {
console.log(scenario);
}
for await (const score of ks.export.scores({ experiment_id: "exp-abc" })) {
console.log(score);
}
// Or one-shot bundle:
const bundle = await ks.export.experiment("exp-abc", { format: "json" });
const ndjson = await ks.export.experiment("exp-abc", { format: "ndjson" });
// Or single trace:
const trace = await ks.export.trace("trace-abc");
Filters available depend on the endpoint; see the type definitions for TracesFilters, SpansFilters, ScenariosFilters, ScoresFilters.
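An auto-paging AsyncIterable is commonly built as an async generator that fetches the next page only when the consumer asks for more items. The sketch below shows that pattern. paginate, KeysetPage, and fakeTracePages are hypothetical, and the page shape is an assumption, since the wire format of the export endpoints is not described here.

```typescript
// Hypothetical sketch of an auto-paging AsyncIterable; not the SDK's actual code.
interface KeysetPage<T> {
  items: T[];
  next_cursor?: string;
}

async function* paginate<T>(
  fetchPage: (cursor: string | undefined) => Promise<KeysetPage<T>>,
): AsyncGenerator<T, void, undefined> {
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor);
    // Items are yielded one at a time; the next fetch happens lazily.
    for (const item of page.items) yield item;
    cursor = page.next_cursor;
  } while (cursor);
}

// Fake endpoint serving the numbers 0..5 in pages of two.
let pageFetches = 0;
async function fakeTracePages(cursor: string | undefined): Promise<KeysetPage<number>> {
  pageFetches++;
  const start = cursor ? Number(cursor) : 0;
  const items = [start, start + 1];
  return start + 2 < 6 ? { items, next_cursor: String(start + 2) } : { items };
}

// Breaking out of the loop stops paging, so unused pages are never requested.
async function firstN(n: number): Promise<number[]> {
  const out: number[] = [];
  for await (const item of paginate(fakeTracePages)) {
    out.push(item);
    if (out.length === n) break;
  }
  return out;
}
```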
Tracing exports
import { traced, initTracing, TracedSpan, registerOtelFlush, flushOtel } from "@polarityinc/polarity-keystone";
| Export | Purpose |
|---|---|
| traced(fn, opts?) | Decorator/HOF: wraps a function and returns the wrapped version |
| traced(name, fn) | Callback form (legacy) |
| TracedSpan | Manual lifecycle: new TracedSpan(name), setOutput(), fail(), end() |
| initTracing(http, sandboxId, opts?) | Internal, called by Keystone.initTracing() |
| registerOtelFlush(cb) | Register an OTel-flush callback for shutdown |
| flushOtel() | Run all registered OTel-flush callbacks |
| currentOtelTracer() | Get the currently installed OTel tracer (for advanced use) |
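The relationship between a manual span lifecycle and a wrapping higher-order function can be sketched as follows. SpanSketch and tracedSketch are illustrative stand-ins, not SDK exports. Their method names echo the table, but the fields, the sink array, and the behavior of the real traced() and TracedSpan are assumptions.

```typescript
// Hypothetical sketch of a span lifecycle and a wrapper built on top of it.
class SpanSketch {
  readonly name: string;
  readonly startedAt: number = Date.now();
  endedAt: number | undefined;
  output: unknown;
  error: string | undefined;
  private readonly sink: SpanSketch[];

  constructor(name: string, sink: SpanSketch[]) {
    this.name = name;
    this.sink = sink;
  }

  setOutput(value: unknown): void {
    this.output = value;
  }

  fail(err: unknown): void {
    this.error = err instanceof Error ? err.message : String(err);
  }

  end(): void {
    if (this.endedAt !== undefined) return; // calling end() twice is harmless
    this.endedAt = Date.now();
    this.sink.push(this);
  }
}

// Wraps an async function so every call opens a span, records the output or
// the error, and always closes the span.
function tracedSketch<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  sink: SpanSketch[],
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const span = new SpanSketch(fn.name || "anonymous", sink);
    try {
      const out = await fn(...args);
      span.setOutput(out);
      return out;
    } catch (err) {
      span.fail(err);
      throw err;
    } finally {
      span.end();
    }
  };
}

const spanSink: SpanSketch[] = [];

async function summarize(text: string): Promise<string> {
  return text.slice(0, 5);
}
async function alwaysFails(): Promise<never> {
  throw new Error("boom");
}

const tracedSummarize = tracedSketch(summarize, spanSink);
const tracedFailing = tracedSketch(alwaysFails, spanSink);
```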
Wrap exports
import {
wrapClient,
wrapAnthropic, wrapOpenAI, wrapMistral, wrapGoogleGenAI,
wrapClaudeAgentSDK, wrapAISDK, wrapMastraAgent,
autoInstrument,
} from "@polarityinc/polarity-keystone";
Generally you use ks.wrap() instead of these. If you need finer control, wrapClient(client, sandboxId, http) lets you wrap a client directly without going through the Keystone instance.
Scorer exports
The 28 built-in scorers across 5 families. See Scorers Library for what each one does.
import {
// base
BaseScorer, CustomScorer, scorer, scorersToInvariants, runClientScorers,
// heuristic
Contains, ExactMatch, Levenshtein, NumericDiff, JSONDiff, JSONValidity, SemanticListContains,
// llm-judge
JudgeScorer, Factuality, Battle, ClosedQA, Humor, Moderation, Summarization,
SQLJudge, Translation, Security,
// rag
ContextPrecision, ContextRecall, ContextRelevancy, ContextEntityRecall,
Faithfulness, AnswerRelevancy, AnswerSimilarity, AnswerCorrectness,
// embedding
EmbeddingSimilarity, openaiEmbedder,
// sandbox invariants
FileExists, FileContains, CommandExits, SQLEquals, LLMJudge,
// presets
presets,
} from "@polarityinc/polarity-keystone";
Eval primitive
import { Eval } from "@polarityinc/polarity-keystone";
const result = await Eval("email-eval", {
data: [{ input, expected }, ...],
task: async (input) => myAgent.run(input),
scores: [new Factuality(), new AnswerRelevancy()],
maxConcurrency: 4,
keystone: ks, // optional, falls back to env
});
result.summary; // { factuality: { mean, p50, p95, count }, ... }
result.rows; // [{ input, expected, output, scores, durationMs }, ...]
result.experimentId; // dashboard write; present if KEYSTONE_API_KEY is set
Braintrust-parity primitive. Runs task for each row in parallel, scores each result, and aggregates per scorer.
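The run, score, and aggregate flow can be sketched with a small worker pool. None of this is the SDK's code: runEvalSketch, ScoreFn, and the nearest-rank percentile are assumptions chosen for illustration, scorers are plain functions rather than scorer classes, and the real Eval may compute p50 and p95 differently.

```typescript
// Hypothetical sketch of bounded-concurrency evaluation with per-scorer stats.
interface EvalRow<I, E> {
  input: I;
  expected: E;
}
type ScoreFn<E, O> = (args: { expected: E; output: O }) => number;
interface Stat {
  mean: number;
  p50: number;
  p95: number;
  count: number;
}

// Nearest-rank percentile on a sorted copy (an assumption, see lead-in).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

function summarizeScores(values: number[]): Stat {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return { mean, p50: percentile(values, 50), p95: percentile(values, 95), count: values.length };
}

async function runEvalSketch<I, E, O>(
  data: EvalRow<I, E>[],
  task: (input: I) => Promise<O>,
  scorers: Record<string, ScoreFn<E, O>>,
  maxConcurrency: number,
): Promise<Record<string, Stat>> {
  const perScorer: Record<string, number[]> = {};
  let nextIndex = 0;

  // Each worker claims the next unprocessed row until none are left, so at
  // most maxConcurrency tasks are in flight at once.
  async function worker(): Promise<void> {
    while (nextIndex < data.length) {
      const row = data[nextIndex++];
      if (row === undefined) break;
      const output = await task(row.input);
      for (const [scorerName, score] of Object.entries(scorers)) {
        (perScorer[scorerName] ??= []).push(score({ expected: row.expected, output }));
      }
    }
  }

  const workerCount = Math.min(maxConcurrency, data.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  const summary: Record<string, Stat> = {};
  for (const [scorerName, values] of Object.entries(perScorer)) {
    summary[scorerName] = summarizeScores(values);
  }
  return summary;
}

// Demo task: doubles its input, except that it deliberately gets 3 wrong.
let inFlight = 0;
let peakInFlight = 0;
const demoTask = async (n: number): Promise<number> => {
  inFlight++;
  peakInFlight = Math.max(peakInFlight, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  inFlight--;
  return n === 3 ? 0 : n * 2;
};
const demoData = [1, 2, 3, 4].map((n) => ({ input: n, expected: n * 2 }));
const exactScorer: ScoreFn<number, number> = ({ expected, output }) =>
  expected === output ? 1 : 0;
```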
Pricing
import { estimateCost, pricingTable } from "@polarityinc/polarity-keystone";
const cost = estimateCost("claude-sonnet-4-5", 4200, 1800, 1500); // USD value
pricingTable.set("my-custom-model", { input: 1.0, output: 5.0 }); // add custom pricing
Error handling
import { KeystoneError } from "@polarityinc/polarity-keystone";
try {
await ks.sandboxes.create({ spec_id: "missing" });
} catch (err) {
if (err instanceof KeystoneError) {
console.error(err.statusCode, err.message);
}
}
TypeScript types
Every shape is exported. The most useful:
import type {
Sandbox, CreateSandboxRequest, CommandResult,
Experiment, RunResults, ScenarioResult, RunMetrics,
AgentSnapshot, UploadSnapshotRequest, AgentPage,
AlertRule, AlertFiring,
TraceEvent, TraceMetrics, TraceResponse,
Score, InvariantResult, CostInfo,
} from "@polarityinc/polarity-keystone";