SDK Reference

Keystone CLI (ks)

Every ks command, flag, and subcommand — eval, logs, setup, mcp, update — with worked examples for each.

ks is the command-line interface for Keystone. One Go binary, zero runtime dependencies, mirrors the SDKs verb-for-verb. CI scripts, dev machines, ad-hoc trace inspection — ks is the right tool when you don't want to write code.

Install

# macOS / Linux — one-liner (canonical)
curl -fsSL https://ks.polarity.so/install.sh | bash
 
# Pin a version
curl -fsSL https://ks.polarity.so/install.sh | bash -s ks-v0.1.2
 
# Or download a release binary directly
# https://github.com/Polarityinc/ks/releases/latest

The installer drops ks into ~/.local/bin/ (or /usr/local/bin/ if needed). Supports macOS arm64/amd64 and Linux arm64/amd64.
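
If ~/.local/bin isn't already on your PATH, add it — a standard shell tweak, nothing ks-specific:

export PATH="$HOME/.local/bin:$PATH"    # add to ~/.bashrc / ~/.zshrc to persist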

After install: ks --version prints the running version. The CLI auto-checks for updates once per 24h (silent on network failure; skipped for dev builds and the update subcommand itself).

Two URLs, not one. https://ks.polarity.so is where the CLI installer lives and where the auto-updater pulls binaries from. https://keystone.polarity.so is the API server the CLI calls — the default value of the KEYSTONE_BASE_URL env var. You shouldn't normally need to set the API URL; the SDK and CLI both default to it.

Quick verification

ks --help               # top-level help
ks setup doctor         # health check
ks eval list            # confirm API access

If setup doctor is green, you're ready.

The fastest path: ks setup

If you've installed ks and dropped your API key in .env, one command does the rest:

cd ~/your-project
ks setup

It runs seven phases (each independent, each idempotent), interactively asks which coding agents to target, and ends with a starter spec at keystone/example.yaml that you can run immediately:

ks eval run keystone/example.yaml

That's the entire onboarding flow. Every phase is documented separately below if you'd rather pick and choose.

Configuration

Most global flags have an env-var equivalent:

Flag         Env var             Default
--api-key    KEYSTONE_API_KEY    required
--base-url   KEYSTONE_BASE_URL   https://keystone.polarity.so
--timeout    —                   30s

Resolution order: flag > env > .env / .env.local in cwd > default.

ks auto-loads .env and .env.local from the current directory (does not override existing env vars). A malformed .env is surfaced by setup doctor instead of silently ignored.
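
To see the precedence in action (illustrative; http://localhost:8012 stands in for a local Keystone daemon):

export KEYSTONE_BASE_URL=http://localhost:8012          # env var beats .env and the default
ks eval list                                            # talks to the local daemon
ks eval list --base-url https://keystone.polarity.so    # flag beats the env var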

Get an API key from app.paragon.run/app/keystone/settings → API Keys → Create Key. Drop it in your project's .env as KEYSTONE_API_KEY=ks_live_....

Top-level command tree

ks
├── eval                           Run and inspect evals
│   ├── run <spec.yaml>            Upload spec → create exp → run → wait → results
│   ├── list                       List experiments
│   ├── get <experiment-id>        Full RunResults JSON
│   ├── metrics <experiment-id>    Aggregate metrics + tool breakdown
│   └── compare <a> <b> [--gate]   Side-by-side, exit non-zero on regression
│
├── logs                           Extract trace, span, scenario, score data
│   ├── traces <experiment-id>     Stream trace events (NDJSON)
│   ├── trace <trace-id>           Single trace by ID
│   ├── spans                      Stream spans by filter
│   ├── scenarios --experiment X   Stream scenario rows
│   ├── scores --experiment X      Stream offline score rows
│   └── export <experiment-id>     Full bundle (spec + scenarios + traces + scores)
│
├── setup                          Wire Keystone into a repo (7 phases)
│   ├── skills                     Write agent skill files (.claude/skills, .cursor/rules, etc.)
│   ├── mcp                        Register MCP server in agent configs (.mcp.json, etc.)
│   ├── spec                       Drop a starter spec at keystone/example.yaml
│   ├── instrument                 Scan source for ~50 LLM SDKs and print wrapping snippets
│   ├── install                    Install the Keystone SDK for each detected language
│   ├── snapshot                   Detect agent code and explain how to package + upload
│   └── doctor                     Verify API key, server reachability, auth, ks on PATH
│
├── mcp                            Model Context Protocol server commands
│   └── serve                      Run as stdio MCP server (for Claude Code, Cursor, etc.)
│
└── update                         Upgrade to the latest release

ks eval — run and inspect evals

ks eval run <spec.yaml>

The most common verb. End-to-end: uploads the spec, creates an experiment, runs it, waits for results, prints the full RunResults JSON.

ks eval run specs/scenario-1.yaml

What it does, step by step:

  1. Reads specs/scenario-1.yaml from disk.
  2. Calls client.Specs.Create(ctx, yamlBytes) — uploads + parses + versions the spec.
  3. Calls keystone.CollectDeclaredSecretsFromFile(specPath) — resolves the spec's secrets: block from the local environment (env vars, files, shell commands).
  4. Calls client.Experiments.Create(ctx, …) with the resolved secrets.
  5. Calls client.Experiments.RunAndWait(ctx, expID, opts) — polls until done or timeout.
  6. Prints the full RunResults JSON to stdout.
  7. Exits non-zero if any scenario didn't pass — turns CI red on regression.

Flags:

Flag             Default                             Meaning
--name           spec filename (without extension)   Experiment name for the dashboard
--wait-timeout   10m                                 Max time to wait for completion
--poll           2s                                  Poll interval while waiting

Examples:

# Default — run, wait, print JSON, exit non-zero on failure
ks eval run keystone/example.yaml
 
# Custom name + longer wait window
ks eval run keystone/big-eval.yaml --name "release-candidate-v2" --wait-timeout 30m
 
# In CI — stream JSON to a file, exit non-zero if any scenario failed
ks eval run specs/regression.yaml > results.json
 
# Smaller poll for faster feedback in interactive use
ks eval run keystone/quick.yaml --poll 500ms

ks eval list

ks eval list

GET /v1/experiments — JSON array of every experiment scoped to your API key. Pipe to jq for filters:

# Last 5 experiments
ks eval list | jq '.[:5]'
 
# Just IDs and names
ks eval list | jq '.[] | {id, name, status, created_at}'
 
# Most recent experiment ID
ks eval list | jq -r '.[0].id'

ks eval get <experiment-id>

ks eval get exp-a1b2c3

Full RunResults JSON: per-scenario invariants, costs, traces, reproducer commands. Useful for ad-hoc inspection:

# Just the failing scenarios
ks eval get exp-abc | jq '.scenarios[] | select(.status == "fail")'
 
# Cost breakdown
ks eval get exp-abc | jq '.metrics.total_cost_usd'
 
# All reproducer commands (one per failure)
ks eval get exp-abc | jq -r '.scenarios[] | select(.status == "fail") | .reproducer.command'

ks eval metrics <experiment-id>

ks eval metrics exp-a1b2c3

GET /v1/metrics/experiments/<id> — aggregate metrics with per-tool breakdown and trends:

{
  "experiment_id": "exp-a1b2c3",
  "summary": {
    "total_runs": 30,
    "pass_rate": 0.93,
    "total_cost_usd": 12.40,
    "mean_cost_per_run_usd": 0.41,
    "mean_wall_ms": 14200,
    "p95_wall_ms": 23000,
    "total_tool_calls": 342,
    "tool_success_rate": 0.97
  },
  "tool_breakdown": {
    "edit":    { "count": 89, "mean_ms": 200, "error_rate": 0.02 },
    "execute": { "count": 67, "mean_ms": 1500, "error_rate": 0.05 },
    "read":    { "count": 120, "mean_ms": 50, "error_rate": 0.0 }
  },
  "cost_trend": [
    { "run_id": "run-1", "cost_usd": 0.28, "ts": "..." },
    { "run_id": "run-2", "cost_usd": 0.31, "ts": "..." }
  ],
  "pass_rate_trend": [...]
}

Useful for:

# Slowest tool
ks eval metrics exp-abc | jq '.tool_breakdown | to_entries | sort_by(.value.mean_ms) | reverse | .[0]'
 
# Cost over the run
ks eval metrics exp-abc | jq '.cost_trend[] | "\(.ts) \(.cost_usd)"'

ks eval compare <baseline-id> <candidate-id>

ks eval compare exp-baseline exp-new
ks eval compare exp-baseline exp-new --gate

Side-by-side comparison. With --gate, exits non-zero if any metric regressed — the CI gating primitive:

# CI: block merge on regression
ks eval compare $LAST_GREEN_EXP_ID $NEW_EXP_ID --gate
# Exit 1 if pass_rate dropped, p95 latency rose, etc.

Output:

{
  "baseline_id": "exp-baseline",
  "candidate_id": "exp-new",
  "regressed": true,
  "regressions": [
    "pass_rate dropped from 0.95 to 0.78",
    "p95_wall_ms rose from 12000 to 18000"
  ],
  "metrics": [
    { "name": "pass_rate", "baseline": 0.95, "candidate": 0.78, "delta": -0.17, "direction": "worse" },
    { "name": "mean_wall_ms", "baseline": 14000, "candidate": 14500, "delta": 500, "direction": "same" },
    { "name": "p95_wall_ms", "baseline": 12000, "candidate": 18000, "delta": 6000, "direction": "worse" },
    { "name": "mean_cost_per_run_usd", "baseline": 0.30, "candidate": 0.42, "delta": 0.12, "direction": "worse" }
  ]
}

The default regression thresholds:

Metric                  Threshold
pass_rate               -2%
p95_wall_ms             +20%
mean_cost_per_run_usd   +20%
tool_success_rate       -5%
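
The --gate flag applies these thresholds for you. If you need a custom gate, the regressed field in the compare output (shown above) is enough for a hand-rolled check:

ks eval compare exp-baseline exp-new | jq -e '.regressed | not' > /dev/null
# jq -e exits non-zero when the expression is false — i.e. when regressed is true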

ks logs — extract trace data

Six subcommands. Each defaults to NDJSON (line-delimited JSON) output, suitable for piping into jq, wc, etc.

ks logs traces <experiment-id>

Stream trace events.

# Every event from the experiment
ks logs traces exp-abc
 
# Filter by tool name
ks logs traces exp-abc --tool write_file
 
# Filter by event type — only LLM calls
ks logs traces exp-abc --event-type llm_call
 
# Only events after a timestamp (RFC3339)
ks logs traces exp-abc --since 2026-04-28T22:00:00Z
 
# Write to a file instead of stdout
ks logs traces exp-abc -o traces.jsonl
 
# JSON array instead of NDJSON
ks logs traces exp-abc --format json

Flags:

Flag           Meaning
--event-type   llm_call, tool_use, tool_call
--tool         Filter by tool name
--agent        Filter by agent name (for snapshot agents)
--sandbox      Filter by sandbox ID
--since        Only events after this RFC3339 timestamp
-o, --output   Write to file (default: stdout)
--format       ndjson (default) or json
--page-size    Server pagination size (default 100)

ks logs trace <trace-id>

Single trace by ID — full span tree as JSON.

ks logs trace trace-abc123

GET /v1/traces/<id>. Returns:

{
  "spans": [
    { "ts": "...", "event_type": "llm_call", "tool": "anthropic.create", "duration_ms": 2340, "cost": {...} },
    { "ts": "...", "event_type": "tool_use", "tool": "write_file", "parent_span_id": "span_xyz", "input": "..." }
  ]
}

ks logs spans

Stream spans matching a filter. Spans are individual events within traces.

# All spans in an experiment
ks logs spans --experiment exp-abc
 
# Single trace's spans
ks logs spans --trace trace-abc123
 
# All descendants of a root span (the full sub-tree)
ks logs spans --root-span span_abc
 
# Only direct children of a span
ks logs spans --parent-span span_xyz
 
# Only tool_use spans for a specific tool
ks logs spans --experiment exp-abc --event-type tool_use --tool write_file

ks logs scenarios

Stream per-scenario rows for an experiment. Required: --experiment.

# Every scenario (one line per scenario, with status, scores, etc.)
ks logs scenarios --experiment exp-abc
 
# Only failed scenarios
ks logs scenarios --experiment exp-abc --status failed
 
# A specific scenario
ks logs scenarios --experiment exp-abc --scenario scenario-000

ks logs scores

Stream offline scoring results (from ks.scoring.scoreExperiment). Required: --experiment.

ks logs scores --experiment exp-abc
ks logs scores --experiment exp-abc --rule rule-factuality-1

ks logs export <experiment-id>

Dump the entire experiment bundle — spec, scenarios, traces, scores — as one document.

ks logs export exp-abc -o exp-abc.json                # JSON (default)
ks logs export exp-abc --format ndjson -o exp-abc.jsonl

Use this for offline analysis: pandas, notebooks, external tooling.
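
For a quick look before reaching for a notebook, jq over the exported bundle works too. A sketch, assuming the bundle's scenarios carry a status field (as in the pandas example under Common workflows):

# Scenario pass/fail counts straight from the bundle
jq '.scenarios | group_by(.status) | map({status: .[0].status, n: length})' exp-abc.json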

ks setup — wire Keystone into a repo

ks setup (with no subcommand) runs every onboarding phase end-to-end. Each phase is idempotent — re-running is safe.

ks setup

Phases run in this order:

#   Phase        What it does
1   skills       Write agent skill files
2   mcp          Register the Keystone MCP server in agent configs
3   spec         Drop a starter spec at keystone/example.yaml
4   instrument   Scan source for LLM client construction sites and print wrapping snippets
5   install      Install the Keystone SDK for each detected language
6   snapshot     Detect agent code and explain how to package + upload as a snapshot
7   doctor       Verify KEYSTONE_API_KEY, server reachability, auth, ks on PATH

Run individual phases:

ks setup skills
ks setup mcp
ks setup spec
ks setup instrument
ks setup install
ks setup snapshot
ks setup doctor

Each is documented below.

ks setup skills

Writes Keystone skill files into your project's coding-agent config directories. The skill file teaches the coding agent (Claude Code, Cursor, Codex, etc.) what Keystone is and how to use it — so when you ask the agent to "set up Keystone in this repo," it knows what to do.

Targets selected interactively (TTY) or via --agents:

Agent         Skill path written
Claude Code   .claude/skills/keystone/SKILL.md
Cursor        .cursor/rules/keystone.mdc
Gemini CLI    .gemini/skills/keystone/SKILL.md
OpenCode      .opencode/skills/keystone/SKILL.md
Codex         .codex/skills/keystone/SKILL.md
Windsurf      .windsurf/skills/keystone/SKILL.md
VS Code       (no skill convention yet)
Generic       .agents/skills/keystone/SKILL.md

# Interactive prompt (when run from a TTY)
ks setup skills
 
# Explicit list of agents
ks setup skills --agents claude,cursor
 
# All agents (non-interactive default in CI)
ks setup skills --agents all

Cursor's .mdc format is auto-translated; the rest write the canonical SKILL.md markdown.

ks setup mcp

Registers ks mcp serve as an MCP server in your project's coding-agent configs. After this, your agent can call Keystone's eval/logs verbs as MCP tools — useful for "improve this spec until pass rate is 95%" loops.

Agent         MCP config written
Claude Code   .mcp.json
Cursor        .cursor/mcp.json
Gemini CLI    .gemini/settings.json
OpenCode      .opencode/mcp.json
Windsurf      .windsurf/mcp_config.json
VS Code       .vscode/mcp.json

The config preserves any other top-level keys — naive struct unmarshal would drop them, so the merger uses map[string]any.
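
The effect is the same as a targeted jq assignment that prints the merged document: only the keystone entry is set, everything else passes through (illustrative only; ks performs this merge internally in Go):

jq '.mcpServers.keystone = {"command": "ks", "args": ["mcp", "serve"]}' .mcp.json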

Example written to .mcp.json:

{
  "mcpServers": {
    "keystone": {
      "command": "/usr/local/bin/ks",
      "args": ["mcp", "serve"],
      "env": {
        "KEYSTONE_API_KEY": "ks_live_..."
      }
    }
  }
}

After running, restart your agent so it picks up the new server.

ks setup spec

Drops a starter spec at keystone/example.yaml:

version: 1
id: example-rest-api
description: Build a tiny Express server and verify both routes work.
 
base: node:20
 
task:
  prompt: |
    Build a tiny Express server in server.js exposing:
      GET  /healthz  →  200 with body "ok"
      GET  /users    →  200 with a JSON array
    Add a package.json so `npm test` runs (it can be a no-op).
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  server_file_exists:
    description: server.js was created
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: server.js
 
  package_json_exists:
    description: package.json was created
    weight: 0.5
    check:
      type: file_exists
      path: package.json
 
  server_loads:
    description: server.js requires cleanly
    weight: 1.5
    check:
      type: command_exit
      command: node -e "require('./server.js')"
      expect_exit_code: 0
 
  routes_work:
    description: server actually serves both routes (LLM-as-judge)
    weight: 2.0
    check:
      type: llm_as_judge
      rubric: |
        Does server.js actually start an Express server and handle BOTH
        /healthz and /users routes? Reject implementations that stub the
        routes with TODO comments or return 500.
 
scoring:
  pass_threshold: 0.7
 
parallelism:
  replicas: 1

If keystone/example.yaml already exists, the phase is skipped (won't clobber your work). Run it with:

ks eval run keystone/example.yaml

ks setup instrument

Scans your project for LLM client construction sites — new OpenAI(), Anthropic(), litellm.completion(...), new ChatAnthropic(...), etc. — across ~50 SDKs in Go, TypeScript, and Python:

Family              Detected patterns
OpenAI              new OpenAI(, OpenAI(, AsyncOpenAI(, new AzureOpenAI(, etc.
Anthropic           new Anthropic(, Anthropic(, AsyncAnthropic(, AnthropicVertex, AnthropicBedrock
Google              genai.NewClient, new GoogleGenAI(, genai.Client(, vertexai.init(
Cohere              cohere.NewClient(, new CohereClient(, cohere.Client(, cohere.AsyncClient(
Mistral             mistral.NewClient(, new MistralClient(, MistralClient(
Groq                new Groq(, Groq(, AsyncGroq(
Together            new Together(, Together(, AsyncTogether(
Fireworks           new Fireworks(, Fireworks(, AsyncFireworks(
Bedrock             bedrockruntime.NewFromConfig, boto3.client('bedrock', BedrockRuntime(
LangChain           new ChatOpenAI(, ChatAnthropic(, all Chat* constructors
Vercel AI SDK       @ai-sdk/openai, generateText(, streamText(, generateObject(, streamObject(
LiteLLM             litellm.completion(, litellm.acompletion(, litellm.Router(
Instructor          instructor.from_openai(, instructor.from_anthropic(, instructor.from_litellm(
DSPy                dspy.LM(, dspy.OpenAI(, dspy.Anthropic(
Pydantic AI         OpenAIModel(, AnthropicModel(, GeminiModel(
OpenAI Agents SDK   from agents import
(more)              Replicate, HuggingFace, Ollama, Mastra, OpenAI Agents, LangChainGo

For each hit, prints path:line and the SDK family. Filters out vendored SDK source code (skips node_modules, .venv, vendor/, site-packages, etc.) and comments.

$ ks setup instrument
 detected 5 call site(s) across 2 language(s), 3 SDK familie(s)
 
  TypeScript / JavaScript (3 hits, 2 families)
    src/agents/openai.ts:14   [openai]
    src/agents/openai.ts:42   [openai]
    src/lib/llm.ts:8           [anthropic]
 
  Python (2 hits, 2 families)
    pipeline/judge.py:23       [openai]
    rag/retriever.py:11        [langchain]
 
 ask your coding agent to wrap these (full instructions in SKILL.md "Step 1 — Wrap LLM clients")

ks setup install

Auto-installs the Keystone SDK for each language present in the project. Detects:

Language     Manifest detected                  Package manager picked
Go           go.mod                             go get github.com/Polarityinc/keystone-sdk-go
TypeScript   bun.lock / bun.lockb               bun add @polarityinc/polarity-keystone
TypeScript   pnpm-lock.yaml                     pnpm add @polarityinc/polarity-keystone
TypeScript   yarn.lock                          yarn add @polarityinc/polarity-keystone
TypeScript   package-lock.json / package.json   npm install @polarityinc/polarity-keystone
Python       uv.lock                            uv add polarity-keystone
Python       poetry.lock                        poetry add polarity-keystone
Python       Pipfile / Pipfile.lock             pipenv install polarity-keystone
Python       pyproject.toml (no lock)           uv add polarity-keystone (modern default)
Python       requirements.txt                   pip install polarity-keystone

Skipped if the SDK is already declared in the manifest. Use --no-install-sdk to print the recommended commands without running them.

ks setup snapshot

Detects agent code in your repo and explains how to package + upload as a Keystone snapshot. Looks for:

File                                                  Treated as
Dockerfile, agent.Dockerfile                          Docker-based agent
package.json                                          Node/TS agent
pyproject.toml, requirements.txt, agent.py, Pipfile   Python agent
agent.ts, agent.js                                    TypeScript/JS agent
go.mod, main.go                                       Go agent
Cargo.toml                                            Rust agent

Searches the project root plus common subdirectories (agent/, agents/, src/agent/, cmd/agent/, apps/agent/).

$ ks setup snapshot
 found 2 candidate agent location(s):
    [python]  agent/main.py
    [node]    apps/agent/package.json
 
 ask your coding agent to package + upload a snapshot (full instructions in SKILL.md "Step 3")

The full packaging guide is in the skill file (Step 3). It walks the agent through tarring the code, picking an entrypoint, calling ks.agents.upload().

ks setup doctor

The "is everything wired up?" check. Verifies:

  1. .env files load without errors (parse failures are surfaced first).
  2. KEYSTONE_API_KEY is set (shows a redacted hash so you know which key was picked up).
  3. The Keystone server is reachable at KEYSTONE_BASE_URL (or default).
  4. The API key authenticates (round-trips a GET /v1/experiments call).
  5. The ks binary is on $PATH (so MCP configs can resolve it).

$ ks setup doctor
 KEYSTONE_API_KEY set (ks_live_a1b2c3d4...e5f6)
 server reachable (https://keystone.polarity.so) — OK
 auth OK
 ks on PATH /usr/local/bin/ks
 
 all good — Keystone is wired up

If any check fails, the command exits non-zero with actionable hints:

$ ks setup doctor
 KEYSTONE_API_KEY get a key at https://app.paragon.run/app/keystone/settings, then put `KEYSTONE_API_KEY=<your-key>` in a `.env` at the project root (auto-loaded). Or pass --api-key / export it in your shell rc.
 auth keystone: API error 401 Unauthorized
 if your key is `ks_live_*` (a hosted Keystone key) but you're hitting a local daemon, set KEYSTONE_BASE_URL to your hosted URL (default: https://keystone.polarity.so)
      → if your key was issued by a local daemon, set KEYSTONE_BASE_URL=http://localhost:8012

ks setup flags

Flag               Default                Meaning
--agents           (interactive prompt)   Comma-separated list: claude,cursor,gemini,opencode,codex,windsurf,vscode,other,all
--no-install-sdk   false                  Print install commands without running them

Examples:

# Pick exactly which agents to target
ks setup --agents claude,cursor
 
# All agents, no SDK install (manual control)
ks setup --agents all --no-install-sdk
 
# Just one phase, with explicit agent list
ks setup skills --agents cursor

ks mcp serve — Model Context Protocol server

Run as a stdio MCP server so coding agents (Claude Code, Cursor, Codex) can call Keystone verbs as tools.

ks mcp serve

Exposed tools:

Tool name                 What it does
keystone_eval_run         Upload + run a spec; returns full RunResults
keystone_eval_list        List experiments
keystone_eval_get         Get one experiment's RunResults
keystone_eval_metrics     Aggregate metrics + tool breakdown
keystone_eval_compare     Compare two experiments
keystone_logs_traces      Stream trace events
keystone_logs_trace       Single trace by ID
keystone_logs_spans       Stream spans by filter
keystone_logs_scenarios   Stream scenario rows
keystone_logs_scores      Stream offline scores

Each tool's input schema mirrors the equivalent CLI command's flags. The agent calls them via standard MCP — no special integration.
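
You can also smoke-test a tool call over stdio without any agent in the loop. A rough sketch, assuming the server accepts a scripted handshake piped in one go:

printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"probe","version":"0"}}}' \
  '{"jsonrpc":"2.0","method":"notifications/initialized"}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"keystone_eval_list","arguments":{}}}' \
  | ks mcp serve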

Register via ks setup mcp (writes the right config files for every coding agent), or by hand:

{
  "mcpServers": {
    "keystone": {
      "command": "/usr/local/bin/ks",
      "args": ["mcp", "serve"],
      "env": { "KEYSTONE_API_KEY": "ks_live_..." }
    }
  }
}

Once running, your coding agent can do things like:

  • "Run specs/regression.yaml and tell me which scenarios failed."
  • "Compare exp-abc and exp-xyz; if it regressed, find the trace for the failing tool call."
  • "Stream traces for exp-abc, find any LLM call over $0.10, and tell me what tool it called."

ks update — upgrade

ks update              # in-place upgrade from GitHub Releases
ks update --force      # re-install even if already at latest version

The CLI also auto-checks for updates once per 24h on every invocation (cached, silent on network failure). Skipped for dev builds (-dev suffix), for the ks update command itself, and when KS_NO_AUTO_UPDATE is set in the environment.
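
To silence the background check for a single invocation (or export it once in CI):

KS_NO_AUTO_UPDATE=1 ks eval list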

If you installed via a method that puts ks in a non-writable location, re-run the installer:

curl -fsSL https://ks.polarity.so/install.sh | bash

Common workflows

CI: run regression evals, gate on regression

# .github/workflows/ci.yml
- name: Run regression eval
  env:
    KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
  run: |
    ks eval run specs/regression.yaml > results.json
    # Exits non-zero if any scenario failed → CI fails
 
- name: Compare to last green build
  env:
    KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
  run: |
    NEW_EXP_ID=$(jq -r .experiment_id results.json)
    ks eval compare $LAST_GREEN_EXP_ID $NEW_EXP_ID --gate

Onboard a new repo

cd ~/work/new-project
ks setup --agents claude,cursor    # one command, everything wired
ks eval run keystone/example.yaml  # smoke test

Debug a failing scenario

# 1. Find the failing scenarios
ks eval get exp-abc | jq '.scenarios[] | select(.status == "fail")'
 
# 2. Re-run with the failing scenario's seed for reproduction
ks eval run specs/scenario-1.yaml --seed 12345
 
# 3. Stream trace events to see what went wrong
ks logs traces exp-new | jq 'select(.status == "error")'
 
# 4. Drill into the failing span
ks logs spans --root-span span_xyz | jq

Cost analysis

# Total cost for the experiment
ks logs traces exp-abc \
  | jq -s 'map(.cost.estimated_usd // 0) | add'
 
# Most expensive single call
ks logs traces exp-abc --event-type llm_call \
  | jq -s 'sort_by(.cost.estimated_usd // 0) | reverse | .[0]'
 
# Cost by model
ks logs traces exp-abc --event-type llm_call \
  | jq -s 'group_by(.cost.model) | map({model: .[0].cost.model, total_usd: map(.cost.estimated_usd) | add})'

Tool latency

# Average latency per tool
ks logs traces exp-abc --event-type tool_call \
  | jq -s 'group_by(.tool) | map({tool: .[0].tool, count: length, mean_ms: (map(.duration_ms) | add / length)})'
 
# Slowest 10 tool calls
ks logs traces exp-abc --event-type tool_call \
  | jq -s 'sort_by(.duration_ms) | reverse | .[:10]'

Find which scenarios called which tools

# Tools by frequency
ks logs traces exp-abc --event-type tool_use \
  | jq -r '.tool' | sort | uniq -c | sort -rn
 
# Scenarios that called a specific tool
ks logs spans --experiment exp-abc --tool write_file \
  | jq -r '.scenario_id' | sort -u

Full bundle export for offline analysis

# Export everything as one JSON document
ks logs export exp-abc -o exp-abc.json
 
# Or NDJSON for streaming
ks logs export exp-abc --format ndjson -o exp-abc.jsonl

Load in pandas:

import pandas as pd
import json
 
with open("exp-abc.json") as f:
    bundle = json.load(f)
 
scenarios = pd.DataFrame(bundle["scenarios"])
print(scenarios.groupby("status").size())

Output formats

Format                       When to use
NDJSON (default for logs)    Pipe into jq, awk, wc -l. One event per line; safe for unbounded streams.
JSON (default for eval)      Whole-bundle dumps — good for archiving and offline analysis.
File output (-o file.json)   Use when you don't want to redirect stdout.

NDJSON output streams as the server returns it — even huge experiments can be processed without loading everything in memory.
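
That also means downstream tools can bail out early without waiting for the whole stream:

# Peek at the first 20 events of a huge experiment
ks logs traces exp-abc | head -n 20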

Exit codes

Code              Meaning
0                 Success
1                 Command-line error, network failure, or experiment had failures
1 (with --gate)   eval compare reported a regression

ks eval run defaults to exit-1-on-any-failure — what you want for CI gating.
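
The exit code composes with ordinary shell control flow:

if ks eval run specs/regression.yaml > results.json; then
  echo "all scenarios passed"
else
  echo "at least one scenario failed; see results.json" >&2
  exit 1
fi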

Environment variables (full list)

Env var                        Purpose
KEYSTONE_API_KEY               API key for authentication
KEYSTONE_BASE_URL              Server URL
KEYSTONE_SANDBOX_ID            (Inside a sandbox only) the current sandbox's ID
KEYSTONE_SERVICE_<NAME>_HOST   (Inside a sandbox) service host
KEYSTONE_SERVICE_<NAME>_PORT   (Inside a sandbox) service port

The first two are CLI inputs; the rest are server-injected into agent processes.
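
Inside a sandbox, agent code just reads the injected vars. An illustrative sketch — the service name db and its /healthz endpoint are hypothetical, standing in for whatever the spec declares:

curl "http://${KEYSTONE_SERVICE_DB_HOST}:${KEYSTONE_SERVICE_DB_PORT}/healthz"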

Troubleshooting

"unauthorized" / 401KEYSTONE_API_KEY not set, invalid, or the wrong server. Run ks setup doctor.

"spec not found" — you're trying to run an experiment against a spec that wasn't uploaded. ks eval run uploads first; if you're using direct API calls, ensure ks.specs.create() ran.

"sandbox rejected: at capacity" — you've hit the concurrent-sandbox limit. Wait a minute or contact support to raise the limit.

.env ignored — ks only loads .env and .env.local from the current directory, not parent directories. Run from your project root.

Auto-update silently fails — intentional. Updates shouldn't break shell scripts. Run ks update --force (which prints any error to stderr instead of swallowing it) to see what's going wrong, or set KS_NO_AUTO_UPDATE=1 to disable the background check entirely.

ks setup mcp doesn't show up in my coding agent — restart the agent so it re-reads its MCP config. Then verify ks mcp serve works manually:

echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05"}}' | ks mcp serve
# Should print a JSON response (the MCP initialization handshake)

ks update says "up to date" but I'm not — auto-update is cached for 24h. ks update --force bypasses the cache.

Reference: every flag

ks
  --api-key string       Keystone API key (env: KEYSTONE_API_KEY)
  --base-url string      Server URL (env: KEYSTONE_BASE_URL)
  --timeout duration     HTTP timeout (default 30s)
  -v, --version          ks version
  -h, --help             help

ks eval run <spec.yaml>
  --name string          Experiment name (default: spec filename)
  --wait-timeout dur     Max wait time (default 10m)
  --poll dur             Poll interval (default 2s)

ks eval compare <a> <b>
  --gate                 Exit non-zero on regression

ks logs traces <exp-id>
  --event-type string    llm_call | tool_use | tool_call
  --tool string          Filter by tool name
  --agent string         Filter by agent name
  --sandbox string       Filter by sandbox ID
  --since string         RFC3339 timestamp
  -o, --output string    Output file (default: stdout)
  --format string        ndjson (default) | json
  --page-size int        Server pagination size (default 100)

ks logs spans
  --experiment string    Experiment ID
  --trace string         Trace ID
  --span string          Span ID
  --parent-span string   Parent span ID
  --root-span string     Root span ID
  --tool string          Tool name
  --event-type string    Event type
  -o, --output string
  --format string
  --page-size int

ks logs scenarios
  --experiment string    (required)
  --status string        passed | failed | error
  --scenario string      Specific scenario ID
  -o, --output string
  --format string
  --page-size int

ks logs scores
  --experiment string    (required)
  --rule string          Filter by rule ID
  -o, --output string
  --format string
  --page-size int

ks logs export <exp-id>
  --format string        json (default) | ndjson
  -o, --output string

ks setup
  --agents string        claude,cursor,gemini,opencode,codex,windsurf,vscode,other,all
  --no-install-sdk       Print install commands without running them

ks update
  --force                Re-install even if already at latest