Tracing & Observability

Production Mode

Use the same SDK in production to trace real agent traffic — no sandbox required, scoped by API key.

The same SDK you use during evals works in production. Set KEYSTONE_API_KEY to a ks_live_ key and call wrap() and traced() exactly as you would inside a sandbox — events are sent to POST /v1/traces and scoped to the API key's billing owner server-side.

This is the agent mode path. No sandbox. No experiment. Just LLM + tool tracing for real traffic.

Why use agent mode

Three reasons:

  1. One SDK, two paths. The same agent code traces correctly during evals (sandbox mode) and in production (agent mode). No "production observability stack" to set up separately.
  2. Outcome correlation. When a user-reported bug happens, the trace tells you exactly what the model did — same trace UI you debug evals with.
  3. Cost & latency aggregates. Per-API-key dashboards: how many tokens did agent v2 burn last week? What's the p95 latency on the production cluster?

Agent mode runs alongside, not instead of, sandbox-based evals. Most teams do both.

Setup

One env var:

export KEYSTONE_API_KEY=ks_live_xxxxxxxxxxxxxxxxxxxxxx

That's it. Get the key from app.paragon.run/app/keystone/settings → API Keys. The same key works for evals and production — just don't share ks_live_ keys across deployments where you'd want different billing.

Use the SDK normally

import { Keystone, traced } from "@polarityinc/polarity-keystone";
import OpenAI from "openai";
 
const ks = new Keystone();           // reads KEYSTONE_API_KEY from env
ks.initTracing();                     // no sandbox id → agent mode
 
const openai = ks.wrap(new OpenAI());
 
await traced(async () => {
  const resp = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: userQuery }],
    tools: myTools,
  });
  // llm_call event + tool_use events posted to /v1/traces
}, { name: "agent.run" });

The SDK detects mode at call time:

  • KEYSTONE_SANDBOX_ID set → sandbox mode (events go under that sandbox).
  • Only KEYSTONE_API_KEY set → agent mode (events scoped by API key).
  • Neither → silent no-op.

The same code works in both contexts — when running in a Keystone sandbox during evals, KEYSTONE_SANDBOX_ID is auto-injected and traces nest under the sandbox. In prod, only KEYSTONE_API_KEY is set and traces go to /v1/traces.
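The mode-detection rule above can be sketched as a pure function over the environment. This is an illustrative reconstruction of the documented behavior, not the SDK's actual internals:

```typescript
// Sketch of the call-time mode detection described above.
type TraceMode = "sandbox" | "agent" | "disabled";

function detectMode(env: Record<string, string | undefined>): TraceMode {
  if (env.KEYSTONE_SANDBOX_ID) return "sandbox"; // events nest under that sandbox
  if (env.KEYSTONE_API_KEY) return "agent";      // events scoped by API key
  return "disabled";                             // silent no-op
}
```

Note the precedence: inside a Keystone sandbox both variables are set, and the sandbox binding wins.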

What you see in the dashboard

The Traces tab in the Keystone dashboard shows agent-mode traces — filterable by API key, time window, model, and tool. The shape:

Column                         Source
Time                           ts
API Key                        resolved from the auth header
Model                          cost.model
Input tokens / output tokens   cost.input_tokens / cost.output_tokens
Cost                           cost.estimated_usd
Tools called                   aggregated from child tool_use events
Duration                       duration_ms

Click into a trace to see the full event tree (LLM call + tool calls + custom spans).

Sandbox vs agent mode comparison

                 Sandbox mode                                    Agent mode
Env vars         KEYSTONE_SANDBOX_ID + KEYSTONE_API_KEY          KEYSTONE_API_KEY only
                 (server-injected)
Endpoint         POST /v1/sandboxes/:id/trace                    POST /v1/traces
Tenant key       sandbox_id + api_key_id                         api_key_id (sandbox_id null)
Use case         Experiments, CI, regression runs                Prod, staging, real traffic
Billed to        Sandbox's team (metadata stamped at create)     API key's billing owner
Dashboard view   Experiment detail (per-scenario)                Traces tab (filtered by API key)

The two never mix in views — experiment-scoped queries ignore agent-mode rows; the Traces tab filters out sandbox-bound rows. Prod traffic doesn't pollute experiment metrics.
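The separation falls directly out of the tenant keys in the table above. A sketch of the scoping rule, where the row shape mirrors those keys (the interface itself is an assumption, not the stored schema):

```typescript
// Trace rows carry an API key binding always, and a sandbox binding
// only when they came from sandbox mode.
interface TraceRow {
  sandbox_id: string | null;
  api_key_id: string;
}

// Traces tab: agent-mode rows only (no sandbox binding).
const inTracesTab = (row: TraceRow): boolean => row.sandbox_id === null;

// Experiment views: only rows bound to that experiment's sandbox.
const inExperimentView = (row: TraceRow, sandboxId: string): boolean =>
  row.sandbox_id === sandboxId;
```

A row satisfies exactly one of the two predicates, which is why prod traffic can never leak into experiment metrics.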

Multi-tenant scoping

For a hosted SaaS where each customer has their own API key, agent mode automatically segregates traces by tenant:

// Customer A's traffic uses key_a
const ksA = new Keystone({ apiKey: "ks_live_customerA_xxx" });
const openaiA = ksA.wrap(new OpenAI());
 
// Customer B's traffic uses key_b
const ksB = new Keystone({ apiKey: "ks_live_customerB_yyy" });
const openaiB = ksB.wrap(new OpenAI());
 
// Each customer's traces are scoped to their own dashboard.

This is the multi-tenant pattern: one Keystone account per customer, one Keystone instance per request, each with the right key.
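To avoid reconstructing clients on every request, one common shape is a per-key instance cache. This is a generic sketch of that pattern; in real code `makeClient` would be `(apiKey) => new Keystone({ apiKey })`, and that wiring is an assumption about your app, not part of the SDK:

```typescript
// Memoize one client per API key so each request reuses its
// customer's instance instead of constructing a new one.
function tenantCache<T>(makeClient: (apiKey: string) => T): (apiKey: string) => T {
  const cache = new Map<string, T>();
  return (apiKey) => {
    let client = cache.get(apiKey);
    if (client === undefined) {
      client = makeClient(apiKey);
      cache.set(apiKey, client);
    }
    return client;
  };
}
```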

CI / staging integration

Agent mode shines for environments where running full sandbox evals is too slow or too expensive:

  • CI smoke tests. Fire 1–2 real agent calls, get traces back, manually inspect.
  • Staging. Real users, real traffic, full observability — without the sandbox round-trip.
  • Production canaries. Roll out a new agent version to 1% of traffic; compare its trace metrics to the 99% baseline.

Use API keys to segment:

# Production
export KEYSTONE_API_KEY=ks_live_prod_xxx
 
# Staging
export KEYSTONE_API_KEY=ks_live_staging_xxx
 
# Canary
export KEYSTONE_API_KEY=ks_live_canary_xxx

The dashboard's API-key filter lets you compare traces across all three.

Dashboard filters

In the Traces tab, common filters:

  • API key — segment by environment / customer.
  • Time range — last hour / last 24h / custom.
  • Model — pick from cost.model.
  • Tool — only traces that called a specific tool.
  • Status — only errored traces (find production failures).
  • Cost > X — only expensive traces (find cost runaways).
  • Duration > X — only slow traces (find latency outliers).

Tying agent-mode traces back to evals

The natural workflow:

  1. In prod (agent mode). A user reports a bug. Find the trace in the Traces tab.
  2. Save the input. Copy the prompt, the tool definitions, and the expected output.
  3. Add to a regression dataset. ks.datasets.add_records(...) — see Datasets.
  4. Run as eval (sandbox mode). Build a spec that drives the dataset and asserts the right behavior. Now this prod bug has a permanent test case.

Every prod bug becomes a regression test. The two-mode design is what makes that easy.
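Steps 2–3 amount to mapping a captured trace into a dataset record. A hypothetical sketch of that mapping — the field names here are assumptions, not the Datasets schema, so adapt them to whatever ks.datasets.add_records(...) actually accepts:

```typescript
// What we copy out of the prod trace (step 2).
interface CapturedTrace {
  prompt: string;
  tools: string[];
  output: string;
}

// Shape it into a record for the regression dataset (step 3).
function toRegressionRecord(trace: CapturedTrace, bugId: string) {
  return {
    input: { prompt: trace.prompt, tools: trace.tools },
    expected: trace.output,
    tags: ["regression", bugId], // keeps the prod origin searchable later
  };
}
```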

What it doesn't do

  • Doesn't enforce invariants. Agent mode is observability-only. Use sandbox mode + invariants for "did this go right" checks.
  • Doesn't snapshot state. No filesystem snapshot, no audit log. Agent mode just collects LLM call + custom span events.
  • Doesn't gate deploys. It's not a CI tool — you'd need experiment runs (sandbox mode) for that.
  • Doesn't rate-limit your traffic. The SDK posts events fire-and-forget; if Keystone is down, your agent keeps running.
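The fire-and-forget posture in the last point can be sketched as follows. `postTrace` here stands in for the SDK's internal transport — an assumption for illustration, not the real API:

```typescript
// Start the post but never await it on the hot path, and swallow
// failures so a Keystone outage can't crash the agent.
function emitTrace(
  postTrace: (event: object) => Promise<void>,
  event: object,
): void {
  postTrace(event).catch(() => {
    // Drop the event: observability must never break the agent.
  });
}
```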

Privacy and data residency

In agent mode, the SDK sends:

  • LLM request body (truncated to ~4KB)
  • LLM response body (truncated to ~4KB)
  • Token counts and cost
  • Tool names and arguments (truncated)
  • Span timing and status

It does not send:

  • Full unredacted prompts beyond the truncation cap
  • Streaming chunks individually (only the accumulated final state)
  • Anything from outside the wrapped LLM call (no env, no filesystem)

Self-hosted Keystone deployments can pin all this to your own infrastructure. For Polarity-hosted, data is encrypted in transit and at rest; retention defaults to 30 days (Free) / 6 months (Pro) / 2 years (Enterprise).

For sensitive prompts, either use self-hosted or pre-redact in your application before the SDK sees them.
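A pre-redaction pass can be as simple as running prompts through a scrubber before they reach the wrapped client. The two patterns below (emails and US SSNs) are illustrative only, not a complete PII scrubber:

```typescript
// Scrub obvious PII from a prompt before the SDK ever sees it.
function redact(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[email]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[ssn]");
}
```

Call `redact(userQuery)` before building `messages`; since the SDK only sees the wrapped LLM call, nothing unredacted is ever sent.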