Setup Guide
The fastest path is `ks setup`. This page explains every step `ks setup` runs, so you can run it whole or pick phases.
The fastest path to a working Keystone install is two commands:
curl -fsSL https://ks.polarity.so/install.sh | bash # install the CLI
cd ~/your-project && ks setup # run the wizard
ks setup runs seven phases end to end: it writes coding-agent skill files, registers an MCP server, drops a starter spec, scans your code for LLM clients, installs the Keystone SDK, detects agent code, and runs a doctor check. Each phase is idempotent, so re-running is safe; you can also run any single phase by name (ks setup spec, ks setup doctor, etc.).
This page documents every phase in detail. If you're impatient, run ks setup and skip to Step 4 — that's the only step the CLI can't do for you (the actual code change to wrap your LLM clients).
Total time: under 5 minutes for the basic flow. Longer if you're packaging your own agent as a snapshot.
Step 0 — Prerequisites
You need:
- A Keystone account. Sign up at app.paragon.run.
- An API key. Go to Keystone → Settings → API Keys → Create Key. The key starts with ks_live_ and is shown once at creation, so copy it now.
- Either:
  - Node.js (≥18) for the TypeScript SDK + npx ks ... CLI, OR
  - Python (≥3.9) for the Python SDK, OR
  - Go (≥1.22) for the Go SDK + native CLI
You do not need Docker locally. Keystone runs sandboxes on Polarity's infrastructure.
Step 1 — Install the CLI
The CLI (ks) is one Go binary, no runtime deps. Install whichever way fits:
# macOS / Linux — canonical (what the CLI's auto-updater also uses)
curl -fsSL https://ks.polarity.so/install.sh | bash
# Pin a version
curl -fsSL https://ks.polarity.so/install.sh | bash -s ks-v0.1.2
# Or download a release binary directly
# → https://github.com/Polarityinc/ks/releases/latest
Verify:
ks --version
# ks version v0.1.2 (or whatever the current release is)
The CLI auto-checks for updates once per 24h. Run ks update to upgrade in-place.
ks.polarity.so vs keystone.polarity.so. ks.polarity.so is where the install script and release binaries live. keystone.polarity.so is the API server the CLI calls (the value of KEYSTONE_BASE_URL). You'll only ever set the API URL if you're running self-hosted; for the hosted offering, the default is right.
Step 2 — Configure your API key
Drop the key in your project's .env (auto-loaded by ks from the current directory):
# .env (gitignored)
KEYSTONE_API_KEY=ks_live_xxxxxxxxxxxxxxxxxxxxxx
Or export in your shell:
export KEYSTONE_API_KEY=ks_live_xxxxxxxxxxxxxxxxxxxxxx
Verify:
ks setup doctor
Expected:
✓ KEYSTONE_API_KEY — set (ks_live_xxxxxxxx…xxxx)
✓ server reachable (https://keystone.polarity.so) — OK
✓ auth — OK
✓ ks on PATH — /usr/local/bin/ks
all good — Keystone is wired up
If anything's red, run the suggested fix and re-run doctor until green.
Step 3 — Initialize your project with ks setup
This is the load-bearing step. One command wires everything in:
cd ~/my-project
ks setup
ks setup runs seven phases sequentially. Each is idempotent (safe to re-run) and independent (you can run any one by name).
What you'll see
When you run ks setup from a TTY, the wizard:
- Prints a styled banner explaining what it will do.
- Interactively prompts you for which coding agents to target (Claude Code, Cursor, Gemini CLI, OpenCode, Codex, Windsurf, VS Code, or "other / generic"). The default selection is "all."
- Runs each phase in order, printing per-phase status (✓ wrote .claude/skills/keystone/SKILL.md, ⚠ already exists, etc.).
- Ends with a copy-pasteable prompt block to hand to your coding agent, telling it exactly what's left to do (the LLM-client wrapping in Step 4 below).
If you pipe stdin (CI, scripts), it skips the prompt and targets all agents:
# Non-interactive default
ks setup < /dev/null
Pick agents explicitly
# Comma-separated keys
ks setup --agents claude,cursor
ks setup --agents claude,cursor,vscode
ks setup --agents all
Available keys: claude, cursor, gemini, opencode, codex, windsurf, vscode, other, all.
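The key list above can be read as a small parsing rule. The following is a minimal sketch of how a comma-separated --agents value might be validated and expanded; the function name parseAgents and its error message are hypothetical, and the CLI's real parser may behave differently (for example on empty input).

```typescript
// Agent keys as listed on this page ("all" is handled separately).
const AGENT_KEYS = [
  "claude", "cursor", "gemini", "opencode",
  "codex", "windsurf", "vscode", "other",
] as const;
type AgentKey = (typeof AGENT_KEYS)[number];

// Hypothetical helper: turn "claude,cursor" into a validated key list.
function parseAgents(flag: string): AgentKey[] {
  const requested = flag
    .split(",")
    .map((k) => k.trim().toLowerCase())
    .filter((k) => k.length > 0);

  // ASSUMPTION: an empty value behaves like "all", mirroring the wizard default.
  if (requested.length === 0 || requested.includes("all")) {
    return [...AGENT_KEYS];
  }

  const unknown = requested.filter(
    (k) => !(AGENT_KEYS as readonly string[]).includes(k),
  );
  if (unknown.length > 0) {
    throw new Error(`unknown agent key(s): ${unknown.join(", ")}`);
  }

  // Deduplicate and return keys in the canonical order above.
  return AGENT_KEYS.filter((k) => requested.includes(k));
}
```

Returning keys in a fixed order keeps per-phase output stable regardless of how the flag was typed.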
Skip the SDK auto-install
ks setup --no-install-sdk
ks setup detects your project's languages and runs the right package-manager command (go get, npm install, uv add, pip install, etc.). With --no-install-sdk, it just prints the command, which is useful when you want manual control over your dep tree.
Run individual phases
Each of the seven phases is a subcommand you can run in isolation:
ks setup skills # only write coding-agent skill files
ks setup mcp # only register the MCP server
ks setup spec # only drop the starter spec
ks setup instrument # only scan source for LLM clients
ks setup install # only install the SDK
ks setup snapshot # only detect agent code
ks setup doctor # only run the health check
Each is documented in detail below.
Phase 1 — ks setup skills
ks setup skills
Writes Keystone skill files into your project's coding-agent config directories. A skill file is a markdown document that teaches the coding agent (Claude Code, Cursor, Gemini, etc.) what Keystone is, what verbs the CLI exposes, and how to call them. After this, when you ask the agent "set up Keystone in this repo," it has the full playbook.
The content is the same SKILL.md body across every target; only the path and frontmatter format differ per agent (Cursor uses .mdc with a Cursor-specific YAML header; the rest use Claude Skills format).
| Agent key | Label | Skill file path |
|---|---|---|
| claude | Claude Code | .claude/skills/keystone/SKILL.md |
| cursor | Cursor | .cursor/rules/keystone.mdc |
| gemini | Gemini CLI | .gemini/skills/keystone/SKILL.md |
| opencode | OpenCode | .opencode/skills/keystone/SKILL.md |
| codex | Codex | .codex/skills/keystone/SKILL.md |
| windsurf | Windsurf | .windsurf/skills/keystone/SKILL.md |
| vscode | VS Code | (no skill convention; this agent is skipped) |
| other | Generic / other | .agents/skills/keystone/SKILL.md |
The generic other target is included by default — if you use a coding agent that follows the emerging .agents/skills/ convention, it picks Keystone up without you doing anything.
What the skill file contains:
- Spec YAML reference (every block, every type)
- CLI verbs table (ks eval run, ks logs traces, etc.)
- Go SDK examples
- LLM-client wrapping instructions per language
- A "set up Keystone in this project" task playbook with 4 numbered steps
The skill file is the source of truth the coding agent will reference whenever you ask it to do Keystone-related work.
# Write only for specific agents
ks setup skills --agents claude,cursor
Phase 2 — ks setup mcp
ks setup mcp
Registers ks mcp serve (Keystone's stdio MCP server) in each selected coding agent's MCP config file. After this, your agent can call Keystone's verbs as MCP tools, which is useful for "improve this spec until pass rate is 95%" loops where the agent invokes evals, reads traces, and iterates without leaving the chat.
Per-agent MCP config locations:
| Agent | MCP config | Top-level key |
|---|---|---|
| Claude Code | .mcp.json | mcpServers |
| Cursor | .cursor/mcp.json | mcpServers |
| Gemini CLI | .gemini/settings.json | mcpServers |
| OpenCode | .opencode/mcp.json | mcpServers |
| Windsurf | .windsurf/mcp_config.json | mcpServers |
| VS Code | .vscode/mcp.json | servers |
| Codex | (user-scope only — skipped) | — |
The merger uses map[string]any so it preserves every other top-level field in the existing JSON — for files like .gemini/settings.json that hold theme, model, and other unrelated settings, naive struct unmarshal would silently drop them.
Example written to .mcp.json:
{
  "mcpServers": {
    "keystone": {
      "command": "/usr/local/bin/ks",
      "args": ["mcp", "serve"],
      "env": {
        "KEYSTONE_API_KEY": "ks_live_..."
      }
    }
  }
}
The command resolves to whatever ks binary the wizard found (os.Executable()). KEYSTONE_API_KEY and KEYSTONE_BASE_URL are forwarded from your env if set.
After running this phase, restart your coding agent so it picks up the new server entry. Verify ks mcp serve works manually:
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05"}}' | ks mcp serve
# Should print a JSON-RPC initialize response.
The 10 MCP tools the server exposes are listed in the CLI page.
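The field-preserving merge described in this phase (the CLI does it in Go with map[string]any) can be illustrated in TypeScript. This is only a sketch of the idea: mergeMcpEntry and the sample settings object are hypothetical, not the CLI's code.

```typescript
type Json = Record<string, unknown>;

// Add or replace the "keystone" entry under `topLevelKey` while leaving every
// other top-level field (and every other server entry) untouched.
function mergeMcpEntry(existing: Json, topLevelKey: string, entry: Json): Json {
  const current = existing[topLevelKey];
  const servers: Json =
    typeof current === "object" && current !== null && !Array.isArray(current)
      ? (current as Json)
      : {};
  return { ...existing, [topLevelKey]: { ...servers, keystone: entry } };
}

// Hypothetical settings file that also holds unrelated fields.
const settings: Json = {
  theme: "dark",
  mcpServers: { other: { command: "other-server" } },
};

const merged = mergeMcpEntry(settings, "mcpServers", {
  command: "/usr/local/bin/ks",
  args: ["mcp", "serve"],
});
// merged keeps `theme` and the `other` server, and gains `keystone`.
```

Working on an untyped record rather than a fixed struct is what keeps unknown fields from being dropped on write. For VS Code the same call would use "servers" as the top-level key.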
Phase 3 — ks setup spec
ks setup spec
Writes keystone/example.yaml, a starter spec you can run immediately. Skipped if keystone/example.yaml already exists (won't clobber your work).
The starter spec exercises every common block: a Node base image, a task prompt, the paragon agent, four invariants spanning all major check types (file_exists, command_exit, llm_as_judge), and pass_threshold: 0.7:
version: 1
id: example-rest-api
description: Build a tiny Express server and verify both routes work.
base: node:20
task:
  prompt: |
    Build a tiny Express server in server.js exposing:
    GET /healthz → 200 with body "ok"
    GET /users → 200 with a JSON array
    Add a package.json so `npm test` runs (it can be a no-op).
agent:
  type: paragon
  timeout: 5m
invariants:
  server_file_exists:
    description: server.js was created
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: server.js
  package_json_exists:
    description: package.json was created
    weight: 0.5
    check:
      type: file_exists
      path: package.json
  server_loads:
    description: server.js requires cleanly
    weight: 1.5
    check:
      type: command_exit
      command: node -e "require('./server.js')"
      expect_exit_code: 0
  routes_work:
    description: server actually serves both routes (LLM-as-judge)
    weight: 2.0
    check:
      type: llm_as_judge
      rubric: |
        Does server.js actually start an Express server and handle BOTH
        /healthz and /users routes? Reject implementations that stub the
        routes with TODO comments or return 500 / undefined.
scoring:
  pass_threshold: 0.7
parallelism:
  replicas: 1
Run it:
ks eval run keystone/example.yaml
Expected: a passing scenario in 10–30 seconds.
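To make the weight, gate, and pass_threshold fields in the starter spec concrete, here is one plausible reading of how they could combine. This page does not state the scoring formula, so everything below is an assumption: the composite is taken as passed weight divided by total weight, and a failed gate forces a fail regardless of score. See the Spec Reference for the actual rules.

```typescript
interface InvariantResult {
  name: string;
  passed: boolean;
  weight: number;
  gate?: boolean;
}

// ASSUMPTION: composite = (sum of weights that passed) / (sum of all weights),
// and any failed gate invariant fails the scenario outright.
function scoreScenario(results: InvariantResult[], passThreshold: number) {
  const total = results.reduce((sum, r) => sum + r.weight, 0);
  const earned = results.reduce((sum, r) => sum + (r.passed ? r.weight : 0), 0);
  const composite = total > 0 ? earned / total : 0;
  const gateFailed = results.some((r) => r.gate === true && !r.passed);
  const status = !gateFailed && composite >= passThreshold ? "pass" : "fail";
  return { composite, status };
}

// The four invariants from the starter spec, with routes_work failing.
const outcome = scoreScenario(
  [
    { name: "server_file_exists", passed: true, weight: 1.0, gate: true },
    { name: "package_json_exists", passed: true, weight: 0.5 },
    { name: "server_loads", passed: true, weight: 1.5 },
    { name: "routes_work", passed: false, weight: 2.0 },
  ],
  0.7,
);
// Under this reading: 3.0 of 5.0 weight passed, which is below 0.7, so "fail".
```

Under the same assumption, failing only package_json_exists would score 4.5 of 5.0 and still pass, which is why the heavier invariants carry the checks that matter most.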
Phase 4 — ks setup instrument
ks setup instrument
Walks every source file in your repo (.go, .ts, .tsx, .js, .mjs, .cjs, .py) and grep-matches against a catalog of ~50 LLM SDK construction patterns, then prints each hit as path:line tagged with the SDK family.
Detected providers:
| Family | Patterns matched |
|---|---|
| openai | new OpenAI(, OpenAI(, AsyncOpenAI(, new AzureOpenAI(, AzureOpenAI( |
| anthropic | new Anthropic(, Anthropic(, AsyncAnthropic(, AnthropicVertex(, AnthropicBedrock( |
| google-genai | new GoogleGenerativeAI(, genai.NewClient(, genai.Client(, GenerativeModel( |
| google-vertex | vertexai.NewClient(, vertexai.init( |
| cohere | cohere.NewClient(, new CohereClient(, cohere.Client(, cohere.AsyncClientV2( |
| mistral | mistral.NewClient(, new MistralClient(, MistralClient( |
| groq | new Groq(, Groq(, AsyncGroq( |
| together | new Together(, Together(, AsyncTogether( |
| fireworks | new Fireworks(, Fireworks(, AsyncFireworks( |
| replicate | new Replicate(, replicate.Client(, Replicate( |
| huggingface | new HfInference(, InferenceClient(, AsyncInferenceClient( |
| bedrock | new BedrockRuntime(, bedrockruntime.NewFromConfig(, boto3.client('bedrock', BedrockRuntime( |
| ollama | new Ollama(, ollama.NewClient(, ollama.Client(, ollama.AsyncClient( |
| vercel-ai | @ai-sdk/openai, @ai-sdk/anthropic, @ai-sdk/google, generateText(, streamText(, generateObject(, streamObject( |
| langchain | new ChatOpenAI(, ChatAnthropic(, ChatGoogleGenerativeAI(, ChatCohere(, ChatMistralAI(, ChatGroq(, ChatBedrock(, ChatVertexAI( |
| langchain-go | langchaingo/llms |
| litellm | litellm.completion(, litellm.acompletion(, litellm.Router( |
| instructor | instructor.from_openai(, instructor.from_anthropic(, instructor.from_litellm(, instructor.from_gemini(, instructor.from_cohere( |
| dspy | dspy.LM(, dspy.OpenAI(, dspy.Anthropic( |
| pydantic-ai | OpenAIModel(, AnthropicModel(, GeminiModel(, GroqModel(, MistralModel( |
| mastra | new Mastra( |
| openai-agents | @openai/agents, from agents import |
| http-option | option.WithHTTPClient( (Go; generic HTTP client option) |
Skipped directories (so the scanner doesn't flag vendored SDK source):
node_modules, .git, dist, build, vendor, .next, .nuxt, .turbo, target, out, coverage, __pycache__, .pytest_cache, .mypy_cache, .idea, .vscode, .gradle, .cargo, site-packages, anything starting with .venv / venv / env- / .env-, anything under a site-packages tree.
Skipped lines: comments (//, #, /*, >>>, etc.) and class/interface declarations (so vendored SDK source code isn't matched as a "construction").
Output is grouped by language with up to 3 file:line examples per language; the summary mentions the total hit count and the SDK families. The full list goes into the skill file (so the coding agent has every site when it does the wrapping in Step 4).
$ ks setup instrument
✓ detected 5 call site(s) across 2 language(s), 3 SDK familie(s)
TypeScript / JavaScript (3 hits, 2 families)
src/agents/openai.ts:14 [openai]
src/agents/openai.ts:42 [openai]
src/lib/llm.ts:8 [anthropic]
Python (2 hits, 2 families)
pipeline/judge.py:23 [openai]
rag/retriever.py:11 [langchain]
→ ask your coding agent to wrap these (full instructions in SKILL.md → "Step 1 — Wrap LLM clients")
The wrap is the one step ks setup cannot do for you; see Step 4 below.
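The scanning rules in this phase (substring patterns, skipped directories, skipped comment and declaration lines) can be sketched as follows. This is an illustration only, not the CLI's scanner: it uses three patterns and five directories from the lists above, and the names scanSource and isSkippedPath are hypothetical.

```typescript
interface Hit {
  line: number;
  family: string;
}

// A small subset of the pattern catalog and skip list documented above.
const PATTERNS = [
  { family: "openai", needle: "new OpenAI(" },
  { family: "anthropic", needle: "new Anthropic(" },
  { family: "langchain", needle: "ChatAnthropic(" },
];
const SKIPPED_DIRS = new Set(["node_modules", ".git", "dist", "build", "vendor"]);
const COMMENT_PREFIXES = ["//", "#", "/*", ">>>"];

// True if any path segment is a skipped directory.
function isSkippedPath(path: string): boolean {
  return path.split("/").some((part) => SKIPPED_DIRS.has(part));
}

// Return 1-based line numbers of construction sites in one file's source.
function scanSource(source: string): Hit[] {
  const hits: Hit[] = [];
  source.split("\n").forEach((raw, index) => {
    const text = raw.trim();
    // Skip comment lines.
    if (COMMENT_PREFIXES.some((p) => text.startsWith(p))) return;
    // Skip class/interface declarations (likely SDK source, not a call site).
    if (/^(export\s+)?(class|interface)\s/.test(text)) return;
    for (const { family, needle } of PATTERNS) {
      if (text.includes(needle)) {
        hits.push({ line: index + 1, family });
        break; // one hit per line is enough for this sketch
      }
    }
  });
  return hits;
}
```

Plain substring matching is cheap but can over-report, which is presumably why the phase only lists sites and leaves the actual wrapping to you or your coding agent.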
Phase 5 — ks setup install
ks setup install
Detects your project's languages and runs the correct package-manager command for each. The detection is based on dependency manifests at the project root, independent of whether Phase 4 found any LLM construction sites, since most users run ks setup before writing the LLM-using code.
Language → package manager mapping:
| Manifest detected | Package manager picked | Install command |
|---|---|---|
| go.mod | go | go get github.com/Polarityinc/keystone-sdk-go |
| bun.lockb or bun.lock | bun | bun add @polarityinc/polarity-keystone |
| pnpm-lock.yaml | pnpm | pnpm add @polarityinc/polarity-keystone |
| yarn.lock | yarn | yarn add @polarityinc/polarity-keystone |
| package-lock.json (or only package.json) | npm | npm install @polarityinc/polarity-keystone |
| uv.lock | uv | uv add polarity-keystone |
| poetry.lock | poetry | poetry add polarity-keystone |
| Pipfile.lock (or Pipfile) | pipenv | pipenv install polarity-keystone |
| pyproject.toml (no lock) | uv (modern default) | uv add polarity-keystone |
| requirements.txt | pip | pip install polarity-keystone |
Skipped if the SDK is already declared in the manifest. The wizard greps the right file (go.mod, package.json's dependencies/devDependencies, pyproject.toml's [project.dependencies], etc.) before running anything.
Print without installing: ks setup install --no-install-sdk prints the commands for each detected language but doesn't execute them. Useful when you want to vet the install before letting it touch your dep tree.
If your project has multiple languages (e.g., a Go server + Python tooling), the phase installs for each one detected.
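The manifest table can be read as a per-language decision chain. The sketch below encodes it that way; it assumes rows are checked top to bottom within each language and that at most one command is emitted per language, neither of which is confirmed on this page. The function name installCommands is hypothetical.

```typescript
const GO_PKG = "github.com/Polarityinc/keystone-sdk-go";
const JS_PKG = "@polarityinc/polarity-keystone";
const PY_PKG = "polarity-keystone";

// Given the file names at the project root, return the install commands
// that would be run (or printed with --no-install-sdk).
function installCommands(rootFiles: string[]): string[] {
  const has = (name: string) => rootFiles.includes(name);
  const commands: string[] = [];

  // Go
  if (has("go.mod")) commands.push(`go get ${GO_PKG}`);

  // JavaScript / TypeScript: a lockfile decides the package manager.
  if (has("bun.lockb") || has("bun.lock")) commands.push(`bun add ${JS_PKG}`);
  else if (has("pnpm-lock.yaml")) commands.push(`pnpm add ${JS_PKG}`);
  else if (has("yarn.lock")) commands.push(`yarn add ${JS_PKG}`);
  else if (has("package-lock.json") || has("package.json")) commands.push(`npm install ${JS_PKG}`);

  // Python: lockfiles first, then bare pyproject.toml, then requirements.txt.
  if (has("uv.lock")) commands.push(`uv add ${PY_PKG}`);
  else if (has("poetry.lock")) commands.push(`poetry add ${PY_PKG}`);
  else if (has("Pipfile.lock") || has("Pipfile")) commands.push(`pipenv install ${PY_PKG}`);
  else if (has("pyproject.toml")) commands.push(`uv add ${PY_PKG}`);
  else if (has("requirements.txt")) commands.push(`pip install ${PY_PKG}`);

  return commands;
}
```

For a Go server with Python tooling, installCommands(["go.mod", "requirements.txt"]) yields one go get and one pip install, matching the multi-language behavior described above. The sketch omits the "already declared in the manifest" check.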
Phase 6 — ks setup snapshot
ks setup snapshot
Detects agent code in your repo and prints what it found. Use this only when you want to run your own agent inside Keystone (instead of the built-in paragon agent).
Detection markers (file → what it implies):
| File | Kind | Hint |
|---|---|---|
| Dockerfile | dockerfile | Build a Docker image of your agent |
| agent.Dockerfile | dockerfile | Agent-specific Dockerfile |
| package.json | node | Node/TS agent (consider snapshot for fast iteration) |
| pyproject.toml | python | Python agent (poetry/pdm/uv project) |
| requirements.txt | python | Python agent (pip) |
| agent.py | python | Python entrypoint |
| agent.ts / agent.js | node | TS/JS entrypoint |
| go.mod | go | Go agent module |
| main.go | go | Go entrypoint |
| Cargo.toml | rust | Rust agent |
Search paths: project root + agent/, agents/, src/agent/, cmd/agent/, apps/agent/. The phase doesn't recurse arbitrarily — it only looks at the immediate contents of these specific directories.
$ ks setup snapshot
✓ found 2 candidate agent location(s):
[python] agent/main.py
[node] apps/agent/package.json
→ ask your coding agent to package + upload a snapshot (full instructions in SKILL.md → "Step 3")
The phase doesn't package or upload anything. It surfaces the locations, and the skill file (Step 3a) has the full packaging walkthrough: tar the directory with tar -czf agent.tar.gz -C agent/ ., pick an entrypoint, call ks.agents.upload(...). See Agent Snapshots for the API details.
If no agent code is detected, the phase no-ops with a hint pointing at the skill file. Skip without consequence if you're using built-in Paragon.
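The non-recursive search described above (a fixed set of directories, marker file names only) can be sketched like this. The marker map and directory list come from this section; findAgentCandidates and the flat list of repo-relative paths it takes as input are hypothetical simplifications.

```typescript
// Directories whose immediate contents are inspected ("" is the project root).
const SEARCH_DIRS = ["", "agent/", "agents/", "src/agent/", "cmd/agent/", "apps/agent/"];

// Marker file name → detected kind, per the table above.
const MARKERS = new Map<string, string>([
  ["Dockerfile", "dockerfile"],
  ["agent.Dockerfile", "dockerfile"],
  ["package.json", "node"],
  ["pyproject.toml", "python"],
  ["requirements.txt", "python"],
  ["agent.py", "python"],
  ["agent.ts", "node"],
  ["agent.js", "node"],
  ["go.mod", "go"],
  ["main.go", "go"],
  ["Cargo.toml", "rust"],
]);

interface Candidate {
  kind: string;
  path: string;
}

// `repoFiles` holds repo-relative paths using "/" separators.
function findAgentCandidates(repoFiles: string[]): Candidate[] {
  const found: Candidate[] = [];
  for (const path of repoFiles) {
    const slash = path.lastIndexOf("/");
    const dir = path.slice(0, slash + 1); // "" for files at the root
    const file = path.slice(slash + 1);
    const kind = MARKERS.get(file);
    // Only exact directory matches count; nothing is searched recursively.
    if (kind !== undefined && SEARCH_DIRS.includes(dir)) {
      found.push({ kind, path });
    }
  }
  return found;
}
```

Because the directory must match exactly, a marker such as src/deep/nested/agent/main.go is ignored, which keeps the phase fast and its output short.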
Phase 7 — ks setup doctor
ks setup doctor
The "is everything wired up?" final check. Verifies each layer of the install in order, prints actionable hints for whatever's broken, and exits non-zero if any check fails (so you can use it in CI / pre-flight scripts).
Checks performed (in order):
- .env parse errors: first to surface, because a malformed .env is the most common reason a "set" key looks unset. The CLI loads .env and .env.local from the current working directory; any parse error from godotenv.Load is captured and shown here.
- KEYSTONE_API_KEY is set: picked up from --api-key flag → env → .env. Shows a redacted preview (first 12 + last 4 chars) so you know which key was picked up without leaking the value. If unset, prints the get-a-key URL plus a hint about how .env is loaded, and a special note if a .env exists at the project root but didn't yield the key (typo / case-sensitivity check).
- Server reachable: does an HTTP GET /health against KEYSTONE_BASE_URL (default https://keystone.polarity.so). Times out at 5 seconds. A 200 response is the bar.
- Auth works: calls client.Experiments.List(ctx) with a 10-second deadline. If it returns 401/403, prints a hint specifically about the ks_live_* vs local-daemon mismatch (the most common confusion when self-hosting).
- ks is on PATH: exec.LookPath("ks"). Required for the MCP config (Phase 2) to resolve the binary. If not, hints at go install or symlinking.
Sample output, all green:
✓ KEYSTONE_API_KEY — set (ks_live_a1b2c3d4…e5f6)
✓ server reachable (https://keystone.polarity.so) — 200 OK
✓ auth — OK
✓ ks on PATH — /usr/local/bin/ks
all good — Keystone is wired up
Sample output, broken:
✗ dotenv parse error — .env: line 3: invalid syntax
✗ KEYSTONE_API_KEY — get a key at https://app.paragon.run/app/keystone/settings, then put `KEYSTONE_API_KEY=<your-key>` in a `.env` at the project root (auto-loaded). Or pass --api-key / export it in your shell rc. — note: a `.env` exists at /Users/alex/proj/.env but didn't yield KEYSTONE_API_KEY (check spelling — names are case-sensitive)
✗ auth — keystone: API error 401 Unauthorized
→ if your key is `ks_live_*` (a hosted Keystone key) but you're hitting a local daemon, set KEYSTONE_BASE_URL to your hosted URL (default: https://keystone.polarity.so)
→ if your key was issued by a local daemon, set KEYSTONE_BASE_URL=http://localhost:8012
Pending tasks block. Whenever phases 4 or 6 left work behind (LLM clients to wrap, agent code to package as a snapshot), doctor ends with a styled box containing a copy-pasteable prompt for your coding agent — tells it to follow Step 1 / Step 3 of the skill file. The block always points at the right skill-file path for your selected agent (e.g. .claude/skills/keystone/SKILL.md).
Run doctor any time you suspect something's drifted — it's the canonical "what's wrong" check.
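Two details of the key check, the lookup order (--api-key flag, then environment, then .env) and the redacted display, can be sketched as below. Both functions are hypothetical illustrations rather than the CLI's code; the 12 + 4 character counts follow the description of the check, and the exact preview the CLI prints may differ.

```typescript
interface KeySources {
  flag?: string;   // value of --api-key, if passed
  env?: string;    // KEYSTONE_API_KEY from the process environment
  dotenv?: string; // KEYSTONE_API_KEY parsed from .env
}

// First non-empty source wins: flag, then env, then .env.
function resolveApiKey(sources: KeySources): string | undefined {
  return [sources.flag, sources.env, sources.dotenv].find(
    (value) => value !== undefined && value !== "",
  );
}

// Show only the first `head` and last `tail` characters of the key.
function redactKey(key: string, head = 12, tail = 4): string {
  // Too short to show any part without revealing most of it.
  if (key.length <= head + tail) return "…";
  return `${key.slice(0, head)}…${key.slice(-tail)}`;
}

// Placeholder keys for illustration only.
const picked = resolveApiKey({ env: "ks_live_a1b2c3d4e5f6g7h8i9j0", dotenv: "ks_live_from_dotenv_file_0000" });
const preview = picked === undefined ? "(unset)" : redactKey(picked);
// preview is "ks_live_a1b2…i9j0": enough to tell keys apart, not enough to use.
```

A preview like this is what lets doctor tell you which of several configured keys was actually picked up.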
Step 4 — Wrap your LLM clients
This is the one step ks can't do for you — the actual code change. Find each LLM client construction (ks setup instrument showed you where) and pass it through ks.wrap():
// BEFORE
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic();
// AFTER
import Anthropic from "@anthropic-ai/sdk";
import { Keystone } from "@polarityinc/polarity-keystone";
const ks = new Keystone();
ks.initTracing(); // no-op outside a sandbox
const anthropic = ks.wrap(new Anthropic());
// Every anthropic.messages.create() now auto-reports prompt, tokens,
// latency, tool-calls. Response is unchanged.
Repeat for every LLM client. Then wrap your tool functions with traced() so tool execution shows up alongside LLM calls in the trace tree:
import { promises as fs } from "node:fs";
import { traced } from "@polarityinc/polarity-keystone";
// BEFORE
async function readFile(path: string) {
  return await fs.readFile(path, "utf-8");
}
// AFTER (decorator form, preferred)
const readFile = traced(async (path: string) => {
  return await fs.readFile(path, "utf-8");
}, { name: "read_file" });
// OR (callback form, legacy but works)
async function readFile(path: string) {
  return await traced("read_file", async () => {
    return await fs.readFile(path, "utf-8");
  });
}
Repeat for every tool function. Nested traced() calls automatically build a parent-child span tree.
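To show what "nested calls build a parent-child span tree" means, here is a toy model of the idea. It is not the SDK's implementation: tracedSketch is a hypothetical stand-in, it is synchronous for brevity, and it tracks the current span with a plain stack, which would not be safe for concurrent async calls.

```typescript
interface Span {
  name: string;
  children: Span[];
}

const roots: Span[] = []; // top-level spans
const stack: Span[] = []; // spans currently executing, innermost last

// Hypothetical stand-in for traced(): records a span around `fn`.
function tracedSketch<T>(name: string, fn: () => T): T {
  const span: Span = { name, children: [] };
  const parent = stack[stack.length - 1];
  // Attach to the enclosing span if there is one, otherwise start a new root.
  (parent ? parent.children : roots).push(span);
  stack.push(span);
  try {
    return fn();
  } finally {
    stack.pop(); // always restore the parent, even if fn throws
  }
}

tracedSketch("agent_turn", () => {
  tracedSketch("read_file", () => "contents");
  tracedSketch("write_file", () => undefined);
});
// roots now holds one "agent_turn" span with two children.
```

The point of the sketch is that nesting falls out of call structure: the wrapper never needs to be told its parent, so wrapping each tool function once is enough to get a readable tree.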
Verify the wrap is working
After running an experiment, check that traces flowed:
ks logs traces $EXP_ID --event-type llm_call | head
If you see llm_call events with model + tokens + cost, the wrap took effect. If the output is empty, recheck Step 4: your agent's LLM client probably isn't wrapped.
Step 5 — Write a spec
Five questions, one YAML file:
- What task should the agent do? One sentence the agent receives as a prompt.
- What environment does it need? Base image, packages, repos, services (DB, mock APIs).
- How do you know it worked? Concrete yes/no checks (file content, exit code, SQL queries, mock assertions).
- What's off-limits? Forbidden DB writes / HTTP hosts / file paths.
- What API keys does the agent need at runtime? LLM provider keys, third-party creds.
Edit keystone/example.yaml to match your scenario, or copy from Examples.
Minimum required fields:
version: 1
id: my-eval
description: "What this evaluates"
base: "ubuntu:24.04"
task:
  prompt: "<the instruction the agent receives>"
agent:
  type: paragon # or cli, http, image, snapshot, python
  timeout: 5m
invariants:
  my_check:
    description: "What this verifies"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: output.json
scoring:
  pass_threshold: 1.0
The full reference is in Spec Reference.
Step 6 — Run the experiment
ks eval run keystone/example.yaml
This:
- Uploads the spec.
- Auto-forwards declared secrets from your local .env.
- Creates an experiment.
- Runs scenarios in parallel (up to your tenant's concurrency limit).
- Polls until done.
- Prints the full RunResults JSON.
- Exits non-zero if any scenario failed.
Expected output (passing):
{
  "experiment_id": "exp-a1b2c3",
  "total_scenarios": 1,
  "passed": 1,
  "failed": 0,
  "metrics": {
    "pass_rate": 1.0,
    "mean_wall_ms": 12000,
    "total_cost_usd": 0.043
  },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "my_check", "passed": true, "gate": true, "weight": 1.0 }
      ]
    }
  ]
}
Failing run includes a reproducer command:
{
  "scenarios": [
    {
      "status": "fail",
      "composite_score": 0.0,
      "invariants": [
        {
          "name": "my_check",
          "passed": false,
          "gate": true,
          "message": "file 'output.json' does not exist"
        }
      ],
      "reproducer": {
        "seed": 12345,
        "command": "ks eval run keystone/example.yaml --seed 12345 --scenario scenario-000"
      }
    }
  ]
}
The reproducer command re-runs that exact scenario with the same seed for debugging.
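If you post-process the RunResults JSON in a script, a small helper can pull out just the failures and their reproducer commands. The sketch below relies only on the fields visible in the sample output above; the interfaces are partial, and failureReport is a hypothetical name.

```typescript
// Partial shapes, limited to the fields shown in the sample output.
interface Invariant {
  name: string;
  passed: boolean;
  message?: string;
}
interface Scenario {
  status: string;
  invariants: Invariant[];
  reproducer?: { seed: number; command: string };
}
interface RunResults {
  scenarios: Scenario[];
}

// One line per failed invariant, followed by the rerun command if present.
function failureReport(results: RunResults): string[] {
  const lines: string[] = [];
  for (const scenario of results.scenarios) {
    if (scenario.status !== "fail") continue;
    for (const inv of scenario.invariants) {
      if (!inv.passed) {
        lines.push(`FAIL ${inv.name}: ${inv.message ?? "(no message)"}`);
      }
    }
    if (scenario.reproducer) {
      lines.push(`  rerun: ${scenario.reproducer.command}`);
    }
  }
  return lines;
}

// The failing sample from above, as it would look after JSON.parse.
const sample: RunResults = {
  scenarios: [
    {
      status: "fail",
      invariants: [
        { name: "my_check", passed: false, message: "file 'output.json' does not exist" },
      ],
      reproducer: {
        seed: 12345,
        command: "ks eval run keystone/example.yaml --seed 12345 --scenario scenario-000",
      },
    },
  ],
};

const report = failureReport(sample);
```

Here report contains the failed invariant's message and the exact command to rerun that scenario, which is usually all you need in a CI log.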
Step 7 — Iterate
The full inner loop:
- Inspect failures. ks eval get $EXP_ID | jq '.scenarios[] | select(.status == "fail")'
- Stream the trace. ks logs traces $EXP_ID | jq to see what the agent did.
- Reproduce locally. Use the reproducer.command.
- Fix. Adjust prompt, invariants, fixtures, or agent code.
- Re-run. ks eval run keystone/example.yaml.
When the eval consistently passes, raise the bar:
- Add parallelism.replicas: 10 to measure flakiness.
- Add a matrix to compare configurations (models, prompts, etc.).
- Move from pass_threshold: 0.7 to 0.95.
- Add forbidden: rules to enforce trajectory constraints.
- Add audit: blocks to capture activity for forbidden checks.
Step 8 — Wire into CI
Once the eval is stable, gate your CI on it:
# .github/workflows/ci.yml
name: Eval
on: [pull_request]
jobs:
  keystone-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install ks
        run: curl -fsSL https://ks.polarity.so/install.sh | bash
      - name: Run regression eval
        env:
          KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
        run: ks eval run keystone/regression.yaml
Optionally, gate on regressions vs. last green:
      - name: Compare to last green
        env:
          KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
        run: |
          # LAST_GREEN_EXP_ID must be set beforehand; this snippet does not define it.
          LATEST_EXP=$(ks eval list | jq -r '.[0].id')
          ks eval compare "$LAST_GREEN_EXP_ID" "$LATEST_EXP" --gate
--gate exits non-zero if pass rate dropped, latency rose, or cost rose past the default thresholds.
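The gate's three conditions can be pictured as a comparison of the metrics blocks of two experiments. The sketch below is illustrative only: the metric names come from the RunResults sample in Step 6, but the Thresholds shape and the values used are hypothetical, since the CLI's default thresholds are not stated on this page.

```typescript
interface Metrics {
  pass_rate: number;
  mean_wall_ms: number;
  total_cost_usd: number;
}

// HYPOTHETICAL thresholds; not the CLI's defaults.
interface Thresholds {
  maxPassRateDrop: number; // absolute drop, e.g. 0.05 = five points
  maxLatencyRise: number;  // relative rise, e.g. 0.2 = +20%
  maxCostRise: number;     // relative rise, e.g. 0.2 = +20%
}

// Names of the metrics that regressed past their threshold.
function gateViolations(base: Metrics, cand: Metrics, t: Thresholds): string[] {
  const violations: string[] = [];
  if (base.pass_rate - cand.pass_rate > t.maxPassRateDrop) {
    violations.push("pass_rate");
  }
  if (cand.mean_wall_ms > base.mean_wall_ms * (1 + t.maxLatencyRise)) {
    violations.push("mean_wall_ms");
  }
  if (cand.total_cost_usd > base.total_cost_usd * (1 + t.maxCostRise)) {
    violations.push("total_cost_usd");
  }
  return violations;
}

const lastGreen: Metrics = { pass_rate: 1.0, mean_wall_ms: 12000, total_cost_usd: 0.043 };
const latest: Metrics = { pass_rate: 0.8, mean_wall_ms: 12500, total_cost_usd: 0.09 };
const violations = gateViolations(lastGreen, latest, {
  maxPassRateDrop: 0.05,
  maxLatencyRise: 0.2,
  maxCostRise: 0.2,
});
// A CI step would fail when `violations` is non-empty.
```

With these made-up thresholds, the pass-rate drop and the cost increase trip the gate while the small latency change does not.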
Step 9 — Add alerts
Get notified on regressions:
import { Keystone } from "@polarityinc/polarity-keystone";
const ks = new Keystone();
await ks.alerts.create({
  name: "pass-rate-drop",
  eval_id: "my-eval",
  condition: "pass_rate < 0.85",
  notify: "slack",
  slack_channel: "#agent-alerts",
});
await ks.alerts.create({
  name: "cost-spike",
  condition: "mean_cost_per_run_usd > 1.00",
  notify: "webhook",
  webhook_url: "https://hooks.slack.com/services/...",
});
Slack channel uses Block Kit; webhooks get raw JSON. See Alerts.
Step 10 — Production observability (optional)
The same SDK works for tracing real production traffic. Set KEYSTONE_API_KEY in your prod env, wrap your LLM clients, and traces flow to the dashboard's Traces tab — filterable by API key.
// Production code — same wrap, no sandbox
import { Keystone } from "@polarityinc/polarity-keystone";
import OpenAI from "openai";
const ks = new Keystone(); // reads KEYSTONE_API_KEY from prod env
const openai = ks.wrap(new OpenAI());
// Every prod call now traces to /v1/traces, scoped by your API key.
See Production Mode.
Where to go from here
| If you want to | Read |
|---|---|
| Learn the spec syntax in detail | Spec Reference |
| Understand the SDK methods | SDK Reference |
| Set up backing services (DB, mocks) | Services |
| Inject credentials safely | Secrets |
| Write better invariants | Invariants |
| Block forbidden agent behavior | Forbidden Rules |
| Use 28 built-in scorers | Scorers Library |
| Trace LLM calls in detail | LLM Tracing |
| Trace custom code | Custom Spans |
| Compare two experiments | Experiments |
| Package your own agent | Agent Snapshots |
| Master the CLI | Keystone CLI |
| Debug something broken | Troubleshooting |
Common pitfalls
"My agent runs in the sandbox but I don't see traces" — make sure you called ks.wrap() and ks.initTracing() (or just ks.observe(...) which does both). The wrap silently no-ops outside a sandbox if no API key is set.
"Spec uploads fine but the experiment fails immediately" — usually a missing secret. Check the experiment's error message; it'll name the secret. Add it to your .env or the Dashboard Secrets tab.
"All my replicas pass but pass_rate is < 1.0" — replica_aggregation.strategy: percentage rounds pass-rate down. Switch to all_must_pass or accept the rounding.
"The agent works locally but fails in the sandbox" — the sandbox doesn't have your local files unless you copy them in via fixtures or pass them via setup.files. The agent has its own filesystem.
"{{ secrets.X }} isn't substituting" — interpolation only works in declared fields (service env, fixture seed, agent args). Free-form spec text isn't templated.
"The trace is empty" — ks.wrap() only intercepts messages.create() (Anthropic) or chat.completions.create() (OpenAI). If your agent uses a different method (e.g., responses.create()), wrap manually or use auto_instrument.
If something else is broken, Troubleshooting has the full list of common errors with fixes.