Tracing & Observability

Production Mode

Use the same SDK in production to trace real agent traffic — no sandbox required, scoped by API key.

The same SDK you use during evals works in production. Set KEYSTONE_API_KEY to a ks_live_ key and call wrap() and traced() exactly as you would inside a sandbox — events are sent to POST /v1/traces and scoped to the API key's billing owner server-side.

This is the agent mode path. No sandbox. No experiment. Just LLM + tool tracing for real traffic.

Why use agent mode

Three reasons:

  1. One SDK, two paths. The same agent code traces correctly during evals (sandbox mode) and in production (agent mode). No "production observability stack" to set up separately.
  2. Outcome correlation. When a user-reported bug happens, the trace tells you exactly what the model did — same trace UI you debug evals with.
  3. Cost & latency aggregates. Per-API-key dashboards: how many tokens did agent v2 burn last week? What's the p95 latency on the production cluster?

Agent mode runs alongside, not instead of, sandbox-based evals. Most teams do both.

Setup

One env var:

export KEYSTONE_API_KEY=ks_live_xxxxxxxxxxxxxxxxxxxxxx

That's it. Get the key from app.paragon.run/app/keystone/settings → API Keys. The same key works for evals and production — just don't share ks_live_ keys across deployments where you'd want different billing.

Use the SDK normally

import { Keystone, traced } from "@polarityinc/polarity-keystone";
import OpenAI from "openai";
 
const ks = new Keystone();           // reads KEYSTONE_API_KEY from env
ks.initTracing();                     // no sandbox id → agent mode
 
const openai = ks.wrap(new OpenAI());
 
await traced(async () => {
  const resp = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: userQuery }],
    tools: myTools,
  });
  // llm_call event + tool_use events posted to /v1/traces
}, { name: "agent.run" });

The SDK detects mode at call time:

  • KEYSTONE_SANDBOX_ID set → sandbox mode (events go under that sandbox).
  • Only KEYSTONE_API_KEY set → agent mode (events scoped by API key).
  • Neither → silent no-op.

The same code works in both contexts — when running in a Keystone sandbox during evals, KEYSTONE_SANDBOX_ID is auto-injected and traces nest under the sandbox. In prod, only KEYSTONE_API_KEY is set and traces go to /v1/traces.
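The mode-detection rule above can be sketched as a pure function over the environment. This is an illustrative reconstruction of the documented behavior, not the SDK's actual internals:

```typescript
// Sketch of the call-time mode detection described above.
type TraceMode = "sandbox" | "agent" | "disabled";

function detectMode(env: Record<string, string | undefined>): TraceMode {
  if (env.KEYSTONE_SANDBOX_ID) return "sandbox"; // events nest under that sandbox
  if (env.KEYSTONE_API_KEY) return "agent";      // events scoped by API key
  return "disabled";                             // silent no-op
}
```

Note the precedence: inside a Keystone sandbox both variables are set, and the sandbox binding wins.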

What you see in the dashboard

The Traces tab in the Keystone dashboard shows agent-mode traces — filterable by API key, time window, model, and tool. The shape:

Column                         Source
Time                           ts
API Key                        resolved from the auth header
Model                          cost.model
Input tokens / output tokens   cost.input_tokens / cost.output_tokens
Cost                           cost.estimated_usd
Tools called                   aggregated from child tool_use events
Duration                       duration_ms

Click into a trace to see the full event tree (LLM call + tool calls + custom spans).

Sandbox vs agent mode comparison

                 Sandbox mode                                    Agent mode
Env vars         KEYSTONE_SANDBOX_ID + KEYSTONE_API_KEY          KEYSTONE_API_KEY only
                 (server-injected)
Endpoint         POST /v1/sandboxes/:id/trace                    POST /v1/traces
Tenant key       sandbox_id + api_key_id                         api_key_id (sandbox_id null)
Use case         Experiments, CI, regression runs                Prod, staging, real traffic
Billed to        Sandbox's team (metadata stamped at create)     API key's billing owner
Dashboard view   Experiment detail (per-scenario)                Traces tab (filtered by API key)

The two never mix in views — experiment-scoped queries ignore agent-mode rows; the Traces tab filters out sandbox-bound rows. Prod traffic doesn't pollute experiment metrics.
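The separation falls directly out of the tenant keys in the table above. A sketch of the scoping rule, where the row shape mirrors those keys (the interface itself is an assumption, not the stored schema):

```typescript
// Trace rows carry an API key binding always, and a sandbox binding
// only when they came from sandbox mode.
interface TraceRow {
  sandbox_id: string | null;
  api_key_id: string;
}

// Traces tab: agent-mode rows only (no sandbox binding).
const inTracesTab = (row: TraceRow): boolean => row.sandbox_id === null;

// Experiment views: only rows bound to that experiment's sandbox.
const inExperimentView = (row: TraceRow, sandboxId: string): boolean =>
  row.sandbox_id === sandboxId;
```

A row satisfies exactly one of the two predicates, which is why prod traffic can never leak into experiment metrics.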

Multi-tenant scoping

For a hosted SaaS where each customer has their own API key, agent mode automatically segregates traces by tenant:

// Customer A's traffic uses key_a
const ksA = new Keystone({ apiKey: "ks_live_customerA_xxx" });
const openaiA = ksA.wrap(new OpenAI());
 
// Customer B's traffic uses key_b
const ksB = new Keystone({ apiKey: "ks_live_customerB_yyy" });
const openaiB = ksB.wrap(new OpenAI());
 
// Each customer's traces are scoped to their own dashboard.

This is the multi-tenant pattern: one Keystone account per customer, one Keystone instance per request, each with the right key.
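To avoid reconstructing clients on every request, one common shape is a per-key instance cache. This is a generic sketch of that pattern; in real code `makeClient` would be `(apiKey) => new Keystone({ apiKey })`, and that wiring is an assumption about your app, not part of the SDK:

```typescript
// Memoize one client per API key so each request reuses its
// customer's instance instead of constructing a new one.
function tenantCache<T>(makeClient: (apiKey: string) => T): (apiKey: string) => T {
  const cache = new Map<string, T>();
  return (apiKey) => {
    let client = cache.get(apiKey);
    if (client === undefined) {
      client = makeClient(apiKey);
      cache.set(apiKey, client);
    }
    return client;
  };
}
```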

CI / staging integration

Agent mode shines for environments where running full sandbox evals is too slow or too expensive:

  • CI smoke tests. Fire 1–2 real agent calls, get traces back, manually inspect.
  • Staging. Real users, real traffic, full observability — without the sandbox round-trip.
  • Production canaries. Roll out a new agent version to 1% of traffic; compare its trace metrics to the 99% baseline.

Use API keys to segment:

# Production
export KEYSTONE_API_KEY=ks_live_prod_xxx
 
# Staging
export KEYSTONE_API_KEY=ks_live_staging_xxx
 
# Canary
export KEYSTONE_API_KEY=ks_live_canary_xxx

The dashboard's API-key filter lets you compare traces across all three.

Dashboard filters

In the Traces tab, common filters:

  • API key — segment by environment / customer.
  • Time range — last hour / last 24h / custom.
  • Model — pick from cost.model.
  • Tool — only traces that called a specific tool.
  • Status — only errored traces (find production failures).
  • Cost > X — only expensive traces (find cost runaways).
  • Duration > X — only slow traces (find latency outliers).

Tying agent-mode traces back to evals

The natural workflow:

  1. In prod (agent mode). A user reports a bug. Find the trace in the Traces tab.
  2. Save the input. Copy the prompt, the tool definitions, and the expected output.
  3. Add to a regression dataset. ks.datasets.add_records(...) — see Datasets.
  4. Run as eval (sandbox mode). Build a spec that drives the dataset and asserts the right behavior. Now this prod bug has a permanent test case.

Every prod bug becomes a regression test. The two-mode design is what makes that easy.
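Steps 2–3 amount to mapping a captured trace into a dataset record. A hypothetical sketch of that mapping — the field names here are assumptions, not the Datasets schema, so adapt them to whatever ks.datasets.add_records(...) actually accepts:

```typescript
// What we copy out of the prod trace (step 2).
interface CapturedTrace {
  prompt: string;
  tools: string[];
  output: string;
}

// Shape it into a record for the regression dataset (step 3).
function toRegressionRecord(trace: CapturedTrace, bugId: string) {
  return {
    input: { prompt: trace.prompt, tools: trace.tools },
    expected: trace.output,
    tags: ["regression", bugId], // keeps the prod origin searchable later
  };
}
```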

What it doesn't do

  • Doesn't enforce invariants. Agent mode is observability-only. Use sandbox mode + invariants for "did this go right" checks.
  • Doesn't snapshot state. No filesystem snapshot, no audit log. Agent mode just collects LLM call + custom span events.
  • Doesn't gate deploys. It's not a CI tool — you'd need experiment runs (sandbox mode) for that.
  • Doesn't rate-limit your traffic. The SDK posts events fire-and-forget; if Keystone is down, your agent keeps running.
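The fire-and-forget posture in the last point can be sketched as follows. `postTrace` here stands in for the SDK's internal transport — an assumption for illustration, not the real API:

```typescript
// Start the post but never await it on the hot path, and swallow
// failures so a Keystone outage can't crash the agent.
function emitTrace(
  postTrace: (event: object) => Promise<void>,
  event: object,
): void {
  postTrace(event).catch(() => {
    // Drop the event: observability must never break the agent.
  });
}
```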

Privacy and data residency

In agent mode, the SDK sends:

  • LLM request body (truncated to ~4KB)
  • LLM response body (truncated to ~4KB)
  • Token counts and cost
  • Tool names and arguments (truncated)
  • Span timing and status

It does not send:

  • Full unredacted prompts beyond the truncation cap
  • Streaming chunks individually (only the accumulated final state)
  • Anything from outside the wrapped LLM call (no env, no filesystem)

Self-hosted Keystone deployments can pin all this to your own infrastructure. For Polarity-hosted, data is encrypted in transit and at rest; retention defaults to 30 days (Free) / 6 months (Pro) / 2 years (Enterprise).

For sensitive prompts, either use self-hosted or pre-redact in your application before the SDK sees them.
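A pre-redaction pass can be as simple as running prompts through a scrubber before they reach the wrapped client. The two patterns below (emails and US SSNs) are illustrative only, not a complete PII scrubber:

```typescript
// Scrub obvious PII from a prompt before the SDK ever sees it.
function redact(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[email]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[ssn]");
}
```

Call `redact(userQuery)` before building `messages`; since the SDK only sees the wrapped LLM call, nothing unredacted is ever sent.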