Keystone CLI (ks)
Every ks command, flag, and subcommand — eval, logs, setup, mcp, update — with worked examples for each.
ks is the command-line interface for Keystone. One Go binary, zero runtime dependencies, mirrors the SDKs verb-for-verb. CI scripts, dev machines, ad-hoc trace inspection — ks is the right tool when you don't want to write code.
Install
# macOS / Linux — one-liner (canonical)
curl -fsSL https://ks.polarity.so/install.sh | bash
# Pin a version
curl -fsSL https://ks.polarity.so/install.sh | bash -s ks-v0.1.2
# Or download a release binary directly
# https://github.com/Polarityinc/ks/releases/latest

The installer drops ks into ~/.local/bin/ (or /usr/local/bin/ if needed). Supports macOS arm64/amd64 and Linux arm64/amd64.
After install: ks --version prints the running version. The CLI auto-checks for updates once per 24h (silent on network failure; skipped for dev builds and the update subcommand itself).
Two URLs, not one.
https://ks.polarity.so is where the CLI installer lives and where the auto-updater pulls binaries from. https://keystone.polarity.so is the API server the CLI calls — the value of the KEYSTONE_BASE_URL env var. You shouldn't normally need to set the API URL; the SDK and CLI both default to it.
Quick verification
ks --help # top-level help
ks setup doctor # health check
ks eval list      # confirm API access

If setup doctor is green, you're ready.
The fastest path: ks setup
If you've installed ks and dropped your API key in .env, one command does the rest:
cd ~/your-project
ks setup

It runs seven phases (each independent, each idempotent), interactively asks which coding agents to target, and ends with a starter spec at keystone/example.yaml that you can run immediately:
ks eval run keystone/example.yaml

That's the entire onboarding flow. Every phase is documented separately below if you'd rather pick and choose.
Configuration
Every flag has an env-var equivalent:
| Flag | Env var | Default |
|---|---|---|
| --api-key | KEYSTONE_API_KEY | required |
| --base-url | KEYSTONE_BASE_URL | https://keystone.polarity.so |
| --timeout | — | 30s |
Resolution order: flag > env > .env / .env.local in cwd > default.
ks auto-loads .env and .env.local from the current directory (does not override existing env vars). A malformed .env is surfaced by setup doctor instead of silently ignored.
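The flag > env > .env > default chain amounts to a first-set-wins lookup. A minimal sketch (illustrative only; `resolve` and its parameter names are hypothetical, not the CLI's internals):

```python
def resolve(flag=None, env=None, dotenv=None, default=None):
    """Return the first value that is set, mirroring flag > env > .env > default."""
    for value in (flag, env, dotenv, default):
        if value is not None:
            return value
    return None

# A flag always beats the environment:
# resolve(flag="https://staging.example", env="https://keystone.polarity.so")
# → "https://staging.example"
```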
Get an API key from app.paragon.run/app/keystone/settings → API Keys → Create Key. Drop it in your project's .env as KEYSTONE_API_KEY=ks_live_....
Top-level command tree
ks
├── eval Run and inspect evals
│ ├── run <spec.yaml> Upload spec → create exp → run → wait → results
│ ├── list List experiments
│ ├── get <experiment-id> Full RunResults JSON
│ ├── metrics <experiment-id> Aggregate metrics + tool breakdown
│ └── compare <a> <b> [--gate] Side-by-side, exit non-zero on regression
│
├── logs Extract trace, span, scenario, score data
│ ├── traces <experiment-id> Stream trace events (NDJSON)
│ ├── trace <trace-id> Single trace by ID
│ ├── spans Stream spans by filter
│ ├── scenarios --experiment X Stream scenario rows
│ ├── scores --experiment X Stream offline score rows
│ └── export <experiment-id> Full bundle (spec + scenarios + traces + scores)
│
├── setup Wire Keystone into a repo (7 phases)
│ ├── skills Write agent skill files (.claude/skills, .cursor/rules, etc.)
│ ├── mcp Register MCP server in agent configs (.mcp.json, etc.)
│ ├── spec Drop a starter spec at keystone/example.yaml
│ ├── instrument Scan source for ~50 LLM SDKs and print wrapping snippets
│ ├── install Install the Keystone SDK for each detected language
│ ├── snapshot Detect agent code and explain how to package + upload
│ └── doctor Verify API key, server reachability, auth, ks on PATH
│
├── mcp Model Context Protocol server commands
│ └── serve Run as stdio MCP server (for Claude Code, Cursor, etc.)
│
└── update Upgrade to the latest release
ks eval — run and inspect evals
ks eval run <spec.yaml>
The most common verb. End-to-end: uploads the spec, creates an experiment, runs it, waits for results, prints the full RunResults JSON.
ks eval run specs/scenario-1.yaml

What it does, step by step:
- Reads specs/scenario-1.yaml from disk.
- Calls client.Specs.Create(ctx, yamlBytes) — uploads + parses + versions the spec.
- Calls keystone.CollectDeclaredSecretsFromFile(specPath) — resolves the spec's secrets: block from the local environment (env vars, files, shell commands).
- Calls client.Experiments.Create(ctx, …) with the resolved secrets.
- Calls client.Experiments.RunAndWait(ctx, expID, opts) — polls until done or timeout.
- Prints the full RunResults JSON to stdout.
- Exits non-zero if any scenario didn't pass — turns CI red on regression.
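The wait step is a poll-until-terminal loop. A minimal sketch of that behavior (`get_status` is a hypothetical stand-in for the SDK's status call; the injectable `clock` and `sleep` just make the sketch testable):

```python
import time

def run_and_wait(get_status, wait_timeout=600.0, poll=2.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until a terminal state, or raise once wait_timeout expires."""
    deadline = clock() + wait_timeout
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("experiment did not finish within wait_timeout")
        sleep(poll)  # the CLI's --poll interval
```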
Flags:
| Flag | Default | Meaning |
|---|---|---|
| --name | spec filename (without extension) | Experiment name for the dashboard |
| --wait-timeout | 10m | Max time to wait for completion |
| --poll | 2s | Poll interval while waiting |
Examples:
# Default — run, wait, print JSON, exit non-zero on failure
ks eval run keystone/example.yaml
# Custom name + longer wait window
ks eval run keystone/big-eval.yaml --name "release-candidate-v2" --wait-timeout 30m
# In CI — stream JSON to a file, exit non-zero if any scenario failed
ks eval run specs/regression.yaml > results.json
# Smaller poll for faster feedback in interactive use
ks eval run keystone/quick.yaml --poll 500ms

ks eval list
ks eval list

GET /v1/experiments — JSON array of every experiment scoped to your API key. Pipe to jq for filters:
# Last 5 experiments
ks eval list | jq '.[:5]'
# Just IDs and names
ks eval list | jq '.[] | {id, name, status, created_at}'
# Most recent experiment ID
ks eval list | jq -r '.[0].id'

ks eval get <experiment-id>
ks eval get exp-a1b2c3

Full RunResults JSON: per-scenario invariants, costs, traces, reproducer commands. Useful for ad-hoc inspection:
# Just the failing scenarios
ks eval get exp-abc | jq '.scenarios[] | select(.status == "fail")'
# Cost breakdown
ks eval get exp-abc | jq '.metrics.total_cost_usd'
# All reproducer commands (one per failure)
ks eval get exp-abc | jq -r '.scenarios[] | select(.status == "fail") | .reproducer.command'

ks eval metrics <experiment-id>
ks eval metrics exp-a1b2c3

GET /v1/metrics/experiments/<id> — aggregate metrics with per-tool breakdown and trends:
{
"experiment_id": "exp-a1b2c3",
"summary": {
"total_runs": 30,
"pass_rate": 0.93,
"total_cost_usd": 12.40,
"mean_cost_per_run_usd": 0.41,
"mean_wall_ms": 14200,
"p95_wall_ms": 23000,
"total_tool_calls": 342,
"tool_success_rate": 0.97
},
"tool_breakdown": {
"edit": { "count": 89, "mean_ms": 200, "error_rate": 0.02 },
"execute": { "count": 67, "mean_ms": 1500, "error_rate": 0.05 },
"read": { "count": 120, "mean_ms": 50, "error_rate": 0.0 }
},
"cost_trend": [
{ "run_id": "run-1", "cost_usd": 0.28, "ts": "..." },
{ "run_id": "run-2", "cost_usd": 0.31, "ts": "..." }
],
"pass_rate_trend": [...]
}

Useful for:
# Slowest tool
ks eval metrics exp-abc | jq '.tool_breakdown | to_entries | sort_by(.value.mean_ms) | reverse | .[0]'
# Cost over the run
ks eval metrics exp-abc | jq '.cost_trend[] | "\(.ts) \(.cost_usd)"'

ks eval compare <baseline-id> <candidate-id>
ks eval compare exp-baseline exp-new
ks eval compare exp-baseline exp-new --gate

Side-by-side comparison. With --gate, exits non-zero if any metric regressed — the CI gating primitive:
# CI: block merge on regression
ks eval compare $LAST_GREEN_EXP_ID $NEW_EXP_ID --gate
# Exit 1 if pass_rate dropped, p95 latency rose, etc.

Output:
{
"baseline_id": "exp-baseline",
"candidate_id": "exp-new",
"regressed": true,
"regressions": [
"pass_rate dropped from 0.95 to 0.78",
"p95_wall_ms rose from 12000 to 18000"
],
"metrics": [
{ "name": "pass_rate", "baseline": 0.95, "candidate": 0.78, "delta": -0.17, "direction": "worse" },
{ "name": "mean_wall_ms", "baseline": 14000, "candidate": 14500, "delta": 500, "direction": "same" },
{ "name": "p95_wall_ms", "baseline": 12000, "candidate": 18000, "delta": 6000, "direction": "worse" },
{ "name": "mean_cost_per_run_usd", "baseline": 0.30, "candidate": 0.42, "delta": 0.12, "direction": "worse" }
]
}

The default regression thresholds:
| Metric | Threshold |
|---|---|
| pass_rate | -2% |
| p95_wall_ms | +20% |
| mean_cost_per_run_usd | +20% |
| tool_success_rate | -5% |
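Read as relative changes (an assumption on my part; the CLI may apply them differently), the thresholds boil down to a check like this illustrative re-implementation:

```python
# (threshold, direction): relative change beyond which a metric counts as regressed.
THRESHOLDS = {
    "pass_rate": (-0.02, "down"),           # more than a 2% relative drop
    "p95_wall_ms": (0.20, "up"),            # more than a 20% relative rise
    "mean_cost_per_run_usd": (0.20, "up"),
    "tool_success_rate": (-0.05, "down"),
}

def regressions(baseline, candidate):
    """Return names of metrics whose relative change crosses its threshold."""
    out = []
    for name, (threshold, direction) in THRESHOLDS.items():
        if name not in baseline or name not in candidate or baseline[name] == 0:
            continue
        rel = (candidate[name] - baseline[name]) / baseline[name]
        if direction == "down" and rel < threshold:
            out.append(name)
        elif direction == "up" and rel > threshold:
            out.append(name)
    return out
```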
ks logs — extract trace data
Six subcommands. Each defaults to NDJSON (line-delimited JSON) output, suitable for piping into jq, wc, etc.
ks logs traces <experiment-id>
Stream trace events.
# Every event from the experiment
ks logs traces exp-abc
# Filter by tool name
ks logs traces exp-abc --tool write_file
# Filter by event type — only LLM calls
ks logs traces exp-abc --event-type llm_call
# Only events after a timestamp (RFC3339)
ks logs traces exp-abc --since 2026-04-28T22:00:00Z
# Write to a file instead of stdout
ks logs traces exp-abc -o traces.jsonl
# JSON array instead of NDJSON
ks logs traces exp-abc --format json

Flags:
| Flag | Meaning |
|---|---|
| --event-type | llm_call, tool_use, tool_call |
| --tool | Filter by tool name |
| --agent | Filter by agent name (for snapshot agents) |
| --sandbox | Filter by sandbox ID |
| --since | Only events after this RFC3339 timestamp |
| -o, --output | Write to file (default: stdout) |
| --format | ndjson (default) or json |
| --page-size | Server pagination size (default 100) |
ks logs trace <trace-id>
Single trace by ID — full span tree as JSON.
ks logs trace trace-abc123

GET /v1/traces/<id>. Returns:
{
"spans": [
{ "ts": "...", "event_type": "llm_call", "tool": "anthropic.create", "duration_ms": 2340, "cost": {...} },
{ "ts": "...", "event_type": "tool_use", "tool": "write_file", "parent_span_id": "span_xyz", "input": "..." }
]
}

ks logs spans
Stream spans matching a filter. Spans are individual events within traces.
# All spans in an experiment
ks logs spans --experiment exp-abc
# Single trace's spans
ks logs spans --trace trace-abc123
# All descendants of a root span (the full sub-tree)
ks logs spans --root-span span_abc
# Only direct children of a span
ks logs spans --parent-span span_xyz
# Only tool_use spans for a specific tool
ks logs spans --experiment exp-abc --event-type tool_use --tool write_file

ks logs scenarios
Stream per-scenario rows for an experiment. Required: --experiment.
# Every scenario (one line per scenario, with status, scores, etc.)
ks logs scenarios --experiment exp-abc
# Only failed scenarios
ks logs scenarios --experiment exp-abc --status failed
# A specific scenario
ks logs scenarios --experiment exp-abc --scenario scenario-000

ks logs scores
Stream offline scoring results (from ks.scoring.scoreExperiment). Required: --experiment.
ks logs scores --experiment exp-abc
ks logs scores --experiment exp-abc --rule rule-factuality-1

ks logs export <experiment-id>
Dump the entire experiment bundle — spec, scenarios, traces, scores — as one document.
ks logs export exp-abc -o exp-abc.json # JSON (default)
ks logs export exp-abc --format ndjson -o exp-abc.jsonl

Use this for offline analysis: pandas, notebooks, external tooling.
ks setup — wire Keystone into a repo
ks setup (with no subcommand) runs every onboarding phase end-to-end. Each phase is idempotent — re-running is safe.
ks setup

Phases run in this order:
| # | Phase | What it does |
|---|---|---|
| 1 | skills | Write agent skill files |
| 2 | mcp | Register the Keystone MCP server in agent configs |
| 3 | spec | Drop a starter spec at keystone/example.yaml |
| 4 | instrument | Scan source for LLM client construction sites and print wrapping snippets |
| 5 | install | Install the Keystone SDK for each detected language |
| 6 | snapshot | Detect agent code and explain how to package + upload as a snapshot |
| 7 | doctor | Verify KEYSTONE_API_KEY, server reachability, auth, ks on PATH |
Run individual phases:
ks setup skills
ks setup mcp
ks setup spec
ks setup instrument
ks setup install
ks setup snapshot
ks setup doctor

Each is documented below.
ks setup skills
Writes Keystone skill files into your project's coding-agent config directories. The skill file teaches the coding agent (Claude Code, Cursor, Codex, etc.) what Keystone is and how to use it — so when you ask the agent to "set up Keystone in this repo," it knows what to do.
Targets selected interactively (TTY) or via --agents:
| Agent | Skill path written |
|---|---|
| Claude Code | .claude/skills/keystone/SKILL.md |
| Cursor | .cursor/rules/keystone.mdc |
| Gemini CLI | .gemini/skills/keystone/SKILL.md |
| OpenCode | .opencode/skills/keystone/SKILL.md |
| Codex | .codex/skills/keystone/SKILL.md |
| Windsurf | .windsurf/skills/keystone/SKILL.md |
| VS Code | (no skill convention yet) |
| Generic | .agents/skills/keystone/SKILL.md |
# Interactive prompt (when run from a TTY)
ks setup skills
# Explicit list of agents
ks setup skills --agents claude,cursor
# All agents (non-interactive default in CI)
ks setup skills --agents all

Cursor's .mdc format is auto-translated; the rest write the canonical SKILL.md markdown.
ks setup mcp
Registers ks mcp serve as an MCP server in your project's coding-agent configs. After this, your agent can call Keystone's eval/logs verbs as MCP tools — useful for "improve this spec until pass rate is 95%" loops.
| Agent | MCP config written |
|---|---|
| Claude Code | .mcp.json |
| Cursor | .cursor/mcp.json |
| Gemini CLI | .gemini/settings.json |
| OpenCode | .opencode/mcp.json |
| Windsurf | .windsurf/mcp_config.json |
| VS Code | .vscode/mcp.json |
The config preserves any other top-level keys — naive struct unmarshal would drop them, so the merger uses map[string]any.
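The key-preserving merge is easy to picture in Python terms (a hedged stand-in for the Go map[string]any logic; `merge_mcp_config` is illustrative, not the CLI's actual function):

```python
def merge_mcp_config(existing: dict, server_entry: dict) -> dict:
    """Add/overwrite the keystone entry under mcpServers, leaving all other keys intact."""
    merged = dict(existing)  # shallow copy: unknown top-level keys survive
    servers = dict(merged.get("mcpServers", {}))
    servers["keystone"] = server_entry
    merged["mcpServers"] = servers
    return merged
```

A typed-struct round-trip would silently drop any key the struct doesn't declare; merging at the map level sidesteps that.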
Example written to .mcp.json:
{
"mcpServers": {
"keystone": {
"command": "/usr/local/bin/ks",
"args": ["mcp", "serve"],
"env": {
"KEYSTONE_API_KEY": "ks_live_..."
}
}
}
}

After running, restart your agent so it picks up the new server.
ks setup spec
Drops a starter spec at keystone/example.yaml:
version: 1
id: example-rest-api
description: Build a tiny Express server and verify both routes work.
base: node:20
task:
prompt: |
Build a tiny Express server in server.js exposing:
GET /healthz → 200 with body "ok"
GET /users → 200 with a JSON array
Add a package.json so `npm test` runs (it can be a no-op).
agent:
type: paragon
timeout: 5m
invariants:
server_file_exists:
description: server.js was created
weight: 1.0
gate: true
check:
type: file_exists
path: server.js
package_json_exists:
description: package.json was created
weight: 0.5
check:
type: file_exists
path: package.json
server_loads:
description: server.js requires cleanly
weight: 1.5
check:
type: command_exit
command: node -e "require('./server.js')"
expect_exit_code: 0
routes_work:
description: server actually serves both routes (LLM-as-judge)
weight: 2.0
check:
type: llm_as_judge
rubric: |
Does server.js actually start an Express server and handle BOTH
/healthz and /users routes? Reject implementations that stub the
routes with TODO comments or return 500.
scoring:
pass_threshold: 0.7
parallelism:
  replicas: 1

If keystone/example.yaml already exists, the phase is skipped (won't clobber your work). Run it with:
ks eval run keystone/example.yaml

ks setup instrument
Scans your project for LLM client construction sites — new OpenAI(), Anthropic(), litellm.completion(...), new ChatAnthropic(...), etc. — across ~50 SDKs in Go, TypeScript, and Python:
| Family | Detected patterns |
|---|---|
| OpenAI | new OpenAI(, OpenAI(, AsyncOpenAI(, new AzureOpenAI(, etc. |
| Anthropic | new Anthropic(, Anthropic(, AsyncAnthropic(, AnthropicVertex, AnthropicBedrock |
| Gemini | genai.NewClient, new GoogleGenAI(, genai.Client(, vertexai.init( |
| Cohere | cohere.NewClient(, new CohereClient(, cohere.Client(, cohere.AsyncClient( |
| Mistral | mistral.NewClient(, new MistralClient(, MistralClient( |
| Groq | new Groq(, Groq(, AsyncGroq( |
| Together | new Together(, Together(, AsyncTogether( |
| Fireworks | new Fireworks(, Fireworks(, AsyncFireworks( |
| Bedrock | bedrockruntime.NewFromConfig, boto3.client('bedrock', BedrockRuntime( |
| LangChain | new ChatOpenAI(, ChatAnthropic(, all Chat* constructors |
| Vercel AI SDK | @ai-sdk/openai, generateText(, streamText(, generateObject(, streamObject( |
| LiteLLM | litellm.completion(, litellm.acompletion(, litellm.Router( |
| Instructor | instructor.from_openai(, instructor.from_anthropic(, instructor.from_litellm( |
| DSPy | dspy.LM(, dspy.OpenAI(, dspy.Anthropic( |
| Pydantic AI | OpenAIModel(, AnthropicModel(, GeminiModel( |
| OpenAI Agents SDK | from agents import |
| Replicate, HuggingFace, Ollama, Mastra, OpenAI Agents, LangChainGo | (more) |
For each hit, prints path:line and the SDK family. Filters out vendored SDK source code (skips node_modules, .venv, vendor/, site-packages, etc.) and comments.
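Conceptually the scan is a regex pass over source files with vendored directories excluded. A heavily simplified sketch (two families instead of ~50; `scan` and the pattern set are illustrative, not the CLI's implementation):

```python
import re
from pathlib import Path

# Tiny illustrative subset of the detected construction-site patterns.
PATTERNS = {
    "openai": re.compile(r"new OpenAI\(|AsyncOpenAI\(|(?<!\w)OpenAI\("),
    "anthropic": re.compile(r"new Anthropic\(|AsyncAnthropic\(|(?<!\w)Anthropic\("),
}
SKIP_DIRS = {"node_modules", ".venv", "vendor", "site-packages"}

def scan(root):
    """Yield (path, line_number, family) for each LLM-client construction site."""
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".ts", ".js", ".go"}:
            continue
        if any(part in SKIP_DIRS for part in path.parts):
            continue  # vendored SDK source, not your code
        for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for family, pattern in PATTERNS.items():
                if pattern.search(line):
                    yield str(path), i, family
```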
$ ks setup instrument
✓ detected 5 call site(s) across 2 language(s), 3 SDK familie(s)
TypeScript / JavaScript (3 hits, 2 families)
src/agents/openai.ts:14 [openai]
src/agents/openai.ts:42 [openai]
src/lib/llm.ts:8 [anthropic]
Python (2 hits, 2 families)
pipeline/judge.py:23 [openai]
rag/retriever.py:11 [langchain]
→ ask your coding agent to wrap these (full instructions in SKILL.md → "Step 1 — Wrap LLM clients")

ks setup install
Auto-installs the Keystone SDK for each language present in the project. Detects:
| Language | Manifest detected | Package manager picked |
|---|---|---|
| Go | go.mod | go get github.com/Polarityinc/keystone-sdk-go |
| TypeScript | bun.lock / bun.lockb | bun add @polarityinc/polarity-keystone |
| TypeScript | pnpm-lock.yaml | pnpm add @polarityinc/polarity-keystone |
| TypeScript | yarn.lock | yarn add @polarityinc/polarity-keystone |
| TypeScript | package-lock.json / package.json | npm install @polarityinc/polarity-keystone |
| Python | uv.lock | uv add polarity-keystone |
| Python | poetry.lock | poetry add polarity-keystone |
| Python | Pipfile / Pipfile.lock | pipenv install polarity-keystone |
| Python | pyproject.toml (no lock) | uv add polarity-keystone (modern default) |
| Python | requirements.txt | pip install polarity-keystone |
Skipped if the SDK is already declared in the manifest. Use --no-install-sdk to print the recommended commands without running them.
ks setup snapshot
Detects agent code in your repo and explains how to package + upload as a Keystone snapshot. Looks for:
| File | Treated as |
|---|---|
| Dockerfile, agent.Dockerfile | Docker-based agent |
| package.json | Node/TS agent |
| pyproject.toml, requirements.txt, agent.py, Pipfile | Python agent |
| agent.ts, agent.js | TypeScript/JS agent |
| go.mod, main.go | Go agent |
| Cargo.toml | Rust agent |
Searches the project root plus common subdirectories (agent/, agents/, src/agent/, cmd/agent/, apps/agent/).
$ ks setup snapshot
✓ found 2 candidate agent location(s):
[python] agent/main.py
[node] apps/agent/package.json
→ ask your coding agent to package + upload a snapshot (full instructions in SKILL.md → "Step 3")

The full packaging guide is in the skill file (Step 3). It walks the agent through tarring the code, picking an entrypoint, and calling ks.agents.upload().
ks setup doctor
The "is everything wired up?" check. Verifies:
- .env files load without errors (parse failures are surfaced first).
- KEYSTONE_API_KEY is set (shows a redacted hash so you know which key was picked up).
- The Keystone server is reachable at KEYSTONE_BASE_URL (or default).
- The API key authenticates (round-trips a GET /v1/experiments call).
- The ks binary is on $PATH (so MCP configs can resolve it).
$ ks setup doctor
✓ KEYSTONE_API_KEY — set (ks_live_a1b2c3d4...e5f6)
✓ server reachable (https://keystone.polarity.so) — OK
✓ auth — OK
✓ ks on PATH — /usr/local/bin/ks
all good — Keystone is wired upIf any check fails, the command exits non-zero with actionable hints:
$ ks setup doctor
✗ KEYSTONE_API_KEY — get a key at https://app.paragon.run/app/keystone/settings, then put `KEYSTONE_API_KEY=<your-key>` in a `.env` at the project root (auto-loaded). Or pass --api-key / export it in your shell rc.
✗ auth — keystone: API error 401 Unauthorized
→ if your key is `ks_live_*` (a hosted Keystone key) but you're hitting a local daemon, set KEYSTONE_BASE_URL to your hosted URL (default: https://keystone.polarity.so)
→ if your key was issued by a local daemon, set KEYSTONE_BASE_URL=http://localhost:8012ks setup flags
| Flag | Default | Meaning |
|---|---|---|
| --agents | (interactive prompt) | Comma-separated list: claude,cursor,gemini,opencode,codex,windsurf,vscode,other,all |
| --no-install-sdk | false | Print install commands without running them |
Examples:
# Pick exactly which agents to target
ks setup --agents claude,cursor
# All agents, no SDK install (manual control)
ks setup --agents all --no-install-sdk
# Just one phase, with explicit agent list
ks setup skills --agents cursor

ks mcp serve — Model Context Protocol server
Run as a stdio MCP server so coding agents (Claude Code, Cursor, Codex) can call Keystone verbs as tools.
ks mcp serve

Exposed tools:
| Tool name | What it does |
|---|---|
| keystone_eval_run | Upload + run a spec; returns full RunResults |
| keystone_eval_list | List experiments |
| keystone_eval_get | Get one experiment's RunResults |
| keystone_eval_metrics | Aggregate metrics + tool breakdown |
| keystone_eval_compare | Compare two experiments |
| keystone_logs_traces | Stream trace events |
| keystone_logs_trace | Single trace by ID |
| keystone_logs_spans | Stream spans by filter |
| keystone_logs_scenarios | Stream scenario rows |
| keystone_logs_scores | Stream offline scores |
Each tool's input schema mirrors the equivalent CLI command's flags. The agent calls them via standard MCP — no special integration.
Register via ks setup mcp (writes the right config files for every coding agent), or by hand:
{
"mcpServers": {
"keystone": {
"command": "/usr/local/bin/ks",
"args": ["mcp", "serve"],
"env": { "KEYSTONE_API_KEY": "ks_live_..." }
}
}
}

Once running, your coding agent can do things like:
- "Run specs/regression.yaml and tell me which scenarios failed."
- "Compare exp-abc and exp-xyz; if it regressed, find the trace for the failing tool call."
- "Stream traces for exp-abc, find any LLM call over $0.10, and tell me what tool it called."
ks update — upgrade
ks update # in-place upgrade from GitHub Releases
ks update --force    # re-install even if already at latest version

The CLI also auto-checks for updates once per 24h on every invocation (cached, silent on network failure). Skipped for dev builds (-dev suffix), for the ks update command itself, and when KS_NO_AUTO_UPDATE is set in the environment.
If you installed via a method that puts ks in a non-writable location, re-run the installer:
curl -fsSL https://ks.polarity.so/install.sh | bash

Common workflows
CI: run regression evals, gate on regression
# .github/workflows/ci.yml
- name: Run regression eval
env:
KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
run: |
ks eval run specs/regression.yaml > results.json
# Exits non-zero if any scenario failed → CI fails
- name: Compare to last green build
env:
KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
run: |
NEW_EXP_ID=$(jq -r .experiment_id results.json)
    ks eval compare $LAST_GREEN_EXP_ID $NEW_EXP_ID --gate

Onboard a new repo
cd ~/work/new-project
ks setup --agents claude,cursor # one command, everything wired
ks eval run keystone/example.yaml    # smoke test

Debug a failing scenario
# 1. Find the failing scenarios
ks eval get exp-abc | jq '.scenarios[] | select(.status == "fail")'
# 2. Re-run with the failing scenario's seed for reproduction
ks eval run specs/scenario-1.yaml --seed 12345
# 3. Stream trace events to see what went wrong
ks logs traces exp-new | jq 'select(.status == "error")'
# 4. Drill into the failing span
ks logs spans --root-span span_xyz | jq

Cost analysis
# Total cost for the experiment
ks logs traces exp-abc \
| jq -s 'map(.cost.estimated_usd // 0) | add'
# Most expensive single call
ks logs traces exp-abc --event-type llm_call \
| jq -s 'sort_by(.cost.estimated_usd // 0) | reverse | .[0]'
# Cost by model
ks logs traces exp-abc --event-type llm_call \
  | jq -s 'group_by(.cost.model) | map({model: .[0].cost.model, total_usd: map(.cost.estimated_usd) | add})'

Tool latency
# Average latency per tool
ks logs traces exp-abc --event-type tool_call \
| jq -s 'group_by(.tool) | map({tool: .[0].tool, count: length, mean_ms: (map(.duration_ms) | add / length)})'
# Slowest 10 tool calls
ks logs traces exp-abc --event-type tool_call \
  | jq -s 'sort_by(.duration_ms) | reverse | .[:10]'

Find which scenarios called which tools
# Tools by frequency
ks logs traces exp-abc --event-type tool_use \
| jq -r '.tool' | sort | uniq -c | sort -rn
# Scenarios that called a specific tool
ks logs spans --experiment exp-abc --tool write_file \
  | jq -r '.scenario_id' | sort -u

Full bundle export for offline analysis
# Export everything as one JSON document
ks logs export exp-abc -o exp-abc.json
# Or NDJSON for streaming
ks logs export exp-abc --format ndjson -o exp-abc.jsonl

Load in pandas:
import pandas as pd
import json
with open("exp-abc.json") as f:
bundle = json.load(f)
scenarios = pd.DataFrame(bundle["scenarios"])
print(scenarios.groupby("status").size())

Output formats
| Format | When to use |
|---|---|
| NDJSON (default for logs) | Pipe into jq, awk, wc -l. One event per line; safe for unbounded streams. |
| JSON (default for eval) | Whole-bundle dumps — good for archiving and offline analysis. |
| File output (-o file.json) | Use when you don't want to redirect stdout. |
NDJSON output streams as the server returns it — even huge experiments can be processed without loading everything in memory.
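Because every NDJSON line is a complete JSON document, a consumer can do the same: process an export of any size one event at a time. A minimal sketch (`iter_events` is illustrative, not part of any SDK):

```python
import json

def iter_events(path, event_type=None):
    """Stream events from an NDJSON file without loading it all into memory."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            if event_type is None or event.get("event_type") == event_type:
                yield event

# e.g. summing estimated cost over a dump of `ks logs traces exp-abc -o traces.jsonl`:
# total = sum(e.get("cost", {}).get("estimated_usd", 0) for e in iter_events("traces.jsonl"))
```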
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Command-line error, network failure, or experiment had failures |
| 1 (with --gate) | eval compare reported a regression |
ks eval run defaults to exit-1-on-any-failure — what you want for CI gating.
Environment variables (full list)
| Env var | Purpose |
|---|---|
| KEYSTONE_API_KEY | API key for authentication |
| KEYSTONE_BASE_URL | Server URL |
| KEYSTONE_SANDBOX_ID | (Inside a sandbox only) — the current sandbox's ID |
| KEYSTONE_SERVICE_<NAME>_HOST | (Inside a sandbox) service host |
| KEYSTONE_SERVICE_<NAME>_PORT | (Inside a sandbox) service port |
The first two are CLI inputs; the rest are server-injected into agent processes.
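Inside a sandbox, agent code can pick up a declared service from those injected variables. A hedged sketch (`service_address` is hypothetical; it assumes service names are uppercased with dashes mapped to underscores, which the table above implies but does not spell out):

```python
import os

def service_address(name, env=os.environ):
    """Resolve KEYSTONE_SERVICE_<NAME>_HOST/PORT for a service declared in the spec."""
    key = name.upper().replace("-", "_")  # assumed naming convention
    host = env.get(f"KEYSTONE_SERVICE_{key}_HOST")
    port = env.get(f"KEYSTONE_SERVICE_{key}_PORT")
    if host is None or port is None:
        raise KeyError(f"service {name!r} not injected into this sandbox")
    return host, int(port)
```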
Troubleshooting
"unauthorized" / 401 — KEYSTONE_API_KEY not set, invalid, or the wrong server. Run ks setup doctor.
"spec not found" — you're trying to run an experiment against a spec that wasn't uploaded. ks eval run uploads first; if you're using direct API calls, ensure ks.specs.create() ran.
"sandbox rejected: at capacity" — you've hit the concurrent-sandbox limit. Wait a minute or contact support to raise the limit.
.env ignored — ks only loads .env and .env.local from the current directory, not parent directories. Run from your project root.
Auto-update silently fails — intentional. Updates shouldn't break shell scripts. Run ks update --force (which prints any error to stderr instead of swallowing it) to see what's going wrong, or set KS_NO_AUTO_UPDATE=1 to disable the background check entirely.
ks setup mcp doesn't show up in my coding agent — restart the agent so it re-reads its MCP config. Then verify ks mcp serve works manually:
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05"}}' | ks mcp serve
# Should print a JSON response (the MCP initialization handshake)

ks update says "up to date" but I'm not — auto-update is cached for 24h. ks update --force bypasses the cache.
Reference: every flag
ks
--api-key string Keystone API key (env: KEYSTONE_API_KEY)
--base-url string Server URL (env: KEYSTONE_BASE_URL)
--timeout duration HTTP timeout (default 30s)
-v, --version ks version
-h, --help help
ks eval run <spec.yaml>
--name string Experiment name (default: spec filename)
--wait-timeout dur Max wait time (default 10m)
--poll dur Poll interval (default 2s)
ks eval compare <a> <b>
--gate Exit non-zero on regression
ks logs traces <exp-id>
--event-type string llm_call | tool_use | tool_call
--tool string Filter by tool name
--agent string Filter by agent name
--sandbox string Filter by sandbox ID
--since string RFC3339 timestamp
-o, --output string Output file (default: stdout)
--format string ndjson (default) | json
--page-size int Server pagination size (default 100)
ks logs spans
--experiment string Experiment ID
--trace string Trace ID
--span string Span ID
--parent-span string Parent span ID
--root-span string Root span ID
--tool string Tool name
--event-type string Event type
-o, --output string
--format string
--page-size int
ks logs scenarios
--experiment string (required)
--status string passed | failed | error
--scenario string Specific scenario ID
-o, --output string
--format string
--page-size int
ks logs scores
--experiment string (required)
--rule string Filter by rule ID
-o, --output string
--format string
--page-size int
ks logs export <exp-id>
--format string json (default) | ndjson
-o, --output string
ks setup
--agents string claude,cursor,gemini,opencode,codex,windsurf,vscode,other,all
--no-install-sdk Print install commands without running them
ks update
--force Re-install even if already at latest