Production Mode
Use the same SDK in production to trace real agent traffic — no sandbox required, scoped by API key.
The same SDK you use during evals works in production. Set KEYSTONE_API_KEY to a ks_live_ key and call wrap() and traced() the same way you would inside a sandbox — events are posted to POST /v1/traces and scoped to the API key's billing owner server-side.
This is the agent mode path. No sandbox. No experiment. Just LLM + tool tracing for real traffic.
Why use agent mode
Three reasons:
- One SDK, two paths. The same agent code traces correctly during evals (sandbox mode) and in production (agent mode). No "production observability stack" to set up separately.
- Outcome correlation. When a user-reported bug happens, the trace tells you exactly what the model did — same trace UI you debug evals with.
- Cost & latency aggregates. Per-API-key dashboards: how many tokens did agent v2 burn last week? What's the p95 latency on the production cluster?
Agent mode runs alongside, not instead of, sandbox-based evals. Most teams do both.
Setup
One env var:
```shell
export KEYSTONE_API_KEY=ks_live_xxxxxxxxxxxxxxxxxxxxxx
```

That's it. Get the key from app.paragon.run/app/keystone/settings → API Keys. The same key works for evals and production — just don't share ks_live_ keys across deployments where you'd want different billing.
Use the SDK normally
```typescript
import { Keystone, traced } from "@polarityinc/polarity-keystone";
import OpenAI from "openai";

const ks = new Keystone(); // reads KEYSTONE_API_KEY from env
ks.initTracing(); // no sandbox id → agent mode

const openai = ks.wrap(new OpenAI());

await traced(async () => {
  const resp = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: userQuery }],
    tools: myTools,
  });
  // llm_call event + tool_use events posted to /v1/traces
}, { name: "agent.run" });
```

The SDK detects mode at call time:

- KEYSTONE_SANDBOX_ID set → sandbox mode (events go under that sandbox).
- Only KEYSTONE_API_KEY set → agent mode (events scoped by API key).
- Neither → silent no-op.
The same code works in both contexts — when running in a Keystone sandbox during evals, KEYSTONE_SANDBOX_ID is auto-injected and traces nest under the sandbox. In prod, only KEYSTONE_API_KEY is set and traces go to /v1/traces.
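The detection rule above can be sketched as a pure function over the environment. This is a simplified illustration that mirrors the documented behavior; the SDK's actual internals may differ:

```typescript
// Simplified sketch of the documented mode-detection rule (not the SDK's
// actual implementation). Checked in order: sandbox id wins over API key.
type TraceMode = "sandbox" | "agent" | "noop";

function detectMode(env: Record<string, string | undefined>): TraceMode {
  if (env.KEYSTONE_SANDBOX_ID) return "sandbox"; // events nest under the sandbox
  if (env.KEYSTONE_API_KEY) return "agent";      // events scoped by API key
  return "noop";                                 // neither set: tracing is a silent no-op
}
```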
What you see in the dashboard
The Traces tab in the Keystone dashboard shows agent-mode traces — filterable by API key, time window, model, and tool. The shape:
| Column | Source |
|---|---|
| Time | ts |
| API Key | resolved from the auth header |
| Model | cost.model |
| Input tokens / output tokens | cost.input_tokens / cost.output_tokens |
| Cost | cost.estimated_usd |
| Tools called | aggregated from child tool_use events |
| Duration | duration_ms |
Click into a trace to see the full event tree (LLM call + tool calls + custom spans).
Sandbox vs agent mode comparison
| | Sandbox mode | Agent mode |
|---|---|---|
| Env vars | KEYSTONE_SANDBOX_ID + KEYSTONE_API_KEY (server-injected) | KEYSTONE_API_KEY only |
| Endpoint | POST /v1/sandboxes/:id/trace | POST /v1/traces |
| Tenant key | sandbox_id + api_key_id | api_key_id (sandbox_id null) |
| Use case | Experiments, CI, regression runs | Prod, staging, real traffic |
| Billed to | Sandbox's team (metadata stamped at create) | API key's billing owner |
| Dashboard view | Experiment detail (per-scenario) | Traces tab (filtered by API key) |
The two never mix in views — experiment-scoped queries ignore agent-mode rows; the Traces tab filters out sandbox-bound rows. Prod traffic doesn't pollute experiment metrics.
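The separation falls directly out of the tenant key: agent-mode rows have a null sandbox_id, sandbox rows don't. As a rough illustration (the row shape here is an assumption based on the table above, not a documented schema):

```typescript
// Illustrative only: how the tenant-key split keeps the two views apart.
// The TraceRow shape is an assumption inferred from the comparison table.
interface TraceRow {
  sandbox_id: string | null;
  api_key_id: string;
}

// Traces tab: agent-mode rows only (sandbox_id is null)
const agentRows = (rows: TraceRow[]) =>
  rows.filter((r) => r.sandbox_id === null);

// Experiment detail: rows bound to one specific sandbox
const sandboxRows = (rows: TraceRow[], sandboxId: string) =>
  rows.filter((r) => r.sandbox_id === sandboxId);
```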
Multi-tenant scoping
For a hosted SaaS where each customer has their own API key, agent mode automatically segregates traces by tenant:
```typescript
// Customer A's traffic uses key_a
const ksA = new Keystone({ apiKey: "ks_live_customerA_xxx" });
const openaiA = ksA.wrap(new OpenAI());

// Customer B's traffic uses key_b
const ksB = new Keystone({ apiKey: "ks_live_customerB_yyy" });
const openaiB = ksB.wrap(new OpenAI());

// Each customer's traces are scoped to their own dashboard.
```

This is the multi-tenant pattern: one API key per customer, one Keystone instance per request, each constructed with the right key.
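In a server handling many tenants you usually don't want to construct a fresh wrapped client on every request. A small per-key cache keeps one instance per tenant; in this sketch, the makeClient factory is a stand-in for something like new Keystone({ apiKey }).wrap(new OpenAI()) — an assumption about how you'd wire it, not SDK API:

```typescript
// Sketch: memoize one client per tenant API key so request handlers reuse
// instances. `makeClient` stands in for constructing and wrapping a client
// for that key (e.g. a Keystone-wrapped OpenAI client).
function perTenantCache<T>(makeClient: (apiKey: string) => T) {
  const cache = new Map<string, T>();
  return (apiKey: string): T => {
    let client = cache.get(apiKey);
    if (client === undefined) {
      client = makeClient(apiKey);   // first request for this tenant
      cache.set(apiKey, client);
    }
    return client;                   // all later requests reuse it
  };
}
```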
CI / staging integration
Agent mode shines for environments where running full sandbox evals is too slow or too expensive:
- CI smoke tests. Fire 1–2 real agent calls, get traces back, manually inspect.
- Staging. Real users, real traffic, full observability — without the sandbox round-trip.
- Production canaries. Roll out a new agent version to 1% of traffic; compare its trace metrics to the 99% baseline.
Use API keys to segment:
```shell
# Production
export KEYSTONE_API_KEY=ks_live_prod_xxx

# Staging
export KEYSTONE_API_KEY=ks_live_staging_xxx

# Canary
export KEYSTONE_API_KEY=ks_live_canary_xxx
```

The dashboard's API-key filter lets you compare traces across all three.
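For the canary case specifically, the split can be as simple as choosing the API key per request. A minimal sketch; the key strings and the 1% threshold are placeholders you'd replace with your own:

```typescript
// Sketch: route ~1% of traffic to the canary key. `rand` is a number in
// [0, 1), e.g. Math.random(); the key strings are illustrative placeholders.
function pickApiKey(rand: number, canaryShare = 0.01): string {
  return rand < canaryShare ? "ks_live_canary_xxx" : "ks_live_prod_xxx";
}
```

The dashboard's API-key filter then gives you the canary-vs-baseline comparison for free.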
Dashboard filters
In the Traces tab, common filters:
- API key — segment by environment / customer.
- Time range — last hour / last 24h / custom.
- Model — pick from
cost.model. - Tool — only traces that called a specific tool.
- Status — only errored traces (find production failures).
- Cost > X — only expensive traces (find cost runaways).
- Duration > X — only slow traces (find latency outliers).
Tying agent-mode traces back to evals
The natural workflow:
- In prod (agent mode). A user reports a bug. Find the trace in the Traces tab.
- Save the input. Copy the prompt, the tool definitions, and the expected output.
- Add to a regression dataset. ks.datasets.add_records(...) — see Datasets.
- Run as eval (sandbox mode). Build a spec that drives the dataset and asserts the right behavior. Now this prod bug has a permanent test case.
Every prod bug becomes a regression test. The two-mode design is what makes that easy.
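The "save the input" step can be made programmatic by shaping a dataset record from the saved trace before handing it to ks.datasets.add_records. The record shape below is an assumption for illustration; match the fields to your actual dataset schema:

```typescript
// Hypothetical record shape for a prod-bug regression case. Field names
// are assumptions — align them with your dataset schema before using.
interface RegressionRecord {
  input: { prompt: string; tools: string[] };
  expected: string;
  source: string; // provenance: which prod trace this came from
}

function toRegressionRecord(
  trace: { prompt: string; tools: string[]; id: string },
  expected: string
): RegressionRecord {
  return {
    input: { prompt: trace.prompt, tools: trace.tools },
    expected,
    source: `trace:${trace.id}`, // keep the link back to the original trace
  };
}
```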
What it doesn't do
- Doesn't enforce invariants. Agent mode is observability-only. Use sandbox mode + invariants for "did this go right" checks.
- Doesn't snapshot state. No filesystem snapshot, no audit log. Agent mode just collects LLM call + custom span events.
- Doesn't gate deploys. It's not a CI tool — you'd need experiment runs (sandbox mode) for that.
- Doesn't rate-limit your traffic. The SDK posts events fire-and-forget; if Keystone is down, your agent keeps running.
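The fire-and-forget behavior in the last point amounts to a post that swallows delivery failures. A sketch of the idea, not the SDK's actual transport code:

```typescript
// Sketch of fire-and-forget event posting: a delivery failure is never
// thrown to the caller, so the agent keeps running if Keystone is down.
async function postEvent(url: string, body: unknown): Promise<void> {
  try {
    await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
    });
  } catch {
    // Keystone unreachable: drop the event silently
  }
}
```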
Privacy and data residency
In agent mode, the SDK sends:
- LLM request body (truncated to ~4KB)
- LLM response body (truncated to ~4KB)
- Token counts and cost
- Tool names and arguments (truncated)
- Span timing and status
It does not send:
- Full unredacted prompts beyond the truncation cap
- Streaming chunks individually (only the accumulated final state)
- Anything from outside the wrapped LLM call (no env, no filesystem)
Self-hosted Keystone deployments can pin all this to your own infrastructure. For Polarity-hosted, data is encrypted in transit and at rest; retention defaults to 30 days (Free) / 6 months (Pro) / 2 years (Enterprise).
For sensitive prompts, either use self-hosted or pre-redact in your application before the SDK sees them.
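A minimal pre-redaction pass might look like the following. The patterns are illustrative only and not a substitute for a real PII policy — run it on message content before constructing the request the SDK will see:

```typescript
// Illustrative redaction pass applied before the SDK (or any tracer) sees
// the text. Patterns are examples, not an exhaustive PII ruleset.
function redact(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")   // email addresses
    .replace(/\b(?:\d[ -]?){13,16}\b/g, "[CARD]")     // card-like digit runs
    .replace(/sk-[A-Za-z0-9]{10,}/g, "[API_KEY]");    // provider-style secret keys
}
```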