SDK Reference

Keystone CLI (ks)

Every ks command, flag, and subcommand — eval, logs, setup, mcp, update — with worked examples for each.

ks is the command-line interface for Keystone. One Go binary, zero runtime dependencies, mirrors the SDKs verb-for-verb. CI scripts, dev machines, ad-hoc trace inspection — ks is the right tool when you don't want to write code.

Install

# macOS / Linux — one-liner (canonical)
curl -fsSL https://ks.polarity.so/install.sh | bash
 
# Pin a version
curl -fsSL https://ks.polarity.so/install.sh | bash -s ks-v0.1.2
 
# Or download a release binary directly
# https://github.com/Polarityinc/ks/releases/latest

The installer drops ks into ~/.local/bin/ (or /usr/local/bin/ if needed). Supports macOS arm64/amd64 and Linux arm64/amd64.
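
If ~/.local/bin isn't already on your PATH, add it — a standard shell tweak, nothing ks-specific:

export PATH="$HOME/.local/bin:$PATH"    # add to ~/.bashrc / ~/.zshrc to persist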

After install: ks --version prints the running version. The CLI auto-checks for updates once per 24h (silent on network failure; skipped for dev builds and the update subcommand itself).

Two URLs, not one. https://ks.polarity.so is where the CLI installer lives and where the auto-updater pulls binaries from. https://keystone.polarity.so is the API server the CLI calls — the default value of the KEYSTONE_BASE_URL env var. You shouldn't normally need to set the API URL; the SDK and CLI both default to it.

Quick verification

ks --help               # top-level help
ks setup doctor         # health check
ks eval list            # confirm API access

If setup doctor is green, you're ready.

The fastest path: ks setup

If you've installed ks and dropped your API key in .env, one command does the rest:

cd ~/your-project
ks setup

It runs seven phases (each independent, each idempotent), interactively asks which coding agents to target, and ends with a starter spec at keystone/example.yaml that you can run immediately:

ks eval run keystone/example.yaml

That's the entire onboarding flow. Every phase is documented separately below if you'd rather pick and choose.

Configuration

Most global flags have an env-var equivalent:

Flag         Env var             Default
--api-key    KEYSTONE_API_KEY    required
--base-url   KEYSTONE_BASE_URL   https://keystone.polarity.so
--timeout    —                   30s

Resolution order: flag > env > .env / .env.local in cwd > default.

ks auto-loads .env and .env.local from the current directory (does not override existing env vars). A malformed .env is surfaced by setup doctor instead of silently ignored.
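
To see the precedence in action (illustrative; http://localhost:8012 stands in for a local Keystone daemon):

export KEYSTONE_BASE_URL=http://localhost:8012          # env var beats .env and the default
ks eval list                                            # talks to the local daemon
ks eval list --base-url https://keystone.polarity.so    # flag beats the env var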

Get an API key from app.paragon.run/app/keystone/settings → API Keys → Create Key. Drop it in your project's .env as KEYSTONE_API_KEY=ks_live_....

Top-level command tree

ks
├── eval                           Run and inspect evals
│   ├── run <spec.yaml>            Upload spec → create exp → run → wait → results
│   ├── list                       List experiments
│   ├── get <experiment-id>        Full RunResults JSON
│   ├── metrics <experiment-id>    Aggregate metrics + tool breakdown
│   └── compare <a> <b> [--gate]   Side-by-side, exit non-zero on regression
│
├── logs                           Extract trace, span, scenario, score data
│   ├── traces <experiment-id>     Stream trace events (NDJSON)
│   ├── trace <trace-id>           Single trace by ID
│   ├── spans                      Stream spans by filter
│   ├── scenarios --experiment X   Stream scenario rows
│   ├── scores --experiment X      Stream offline score rows
│   └── export <experiment-id>     Full bundle (spec + scenarios + traces + scores)
│
├── setup                          Wire Keystone into a repo (7 phases)
│   ├── skills                     Write agent skill files (.claude/skills, .cursor/rules, etc.)
│   ├── mcp                        Register MCP server in agent configs (.mcp.json, etc.)
│   ├── spec                       Drop a starter spec at keystone/example.yaml
│   ├── instrument                 Scan source for ~50 LLM SDKs and print wrapping snippets
│   ├── install                    Install the Keystone SDK for each detected language
│   ├── snapshot                   Detect agent code and explain how to package + upload
│   └── doctor                     Verify API key, server reachability, auth, ks on PATH
│
├── mcp                            Model Context Protocol server commands
│   └── serve                      Run as stdio MCP server (for Claude Code, Cursor, etc.)
│
└── update                         Upgrade to the latest release

ks eval — run and inspect evals

ks eval run <spec.yaml>

The most common verb. End-to-end: uploads the spec, creates an experiment, runs it, waits for results, prints the full RunResults JSON.

ks eval run specs/scenario-1.yaml

What it does, step by step:

  1. Reads specs/scenario-1.yaml from disk.
  2. Calls client.Specs.Create(ctx, yamlBytes) — uploads + parses + versions the spec.
  3. Calls keystone.CollectDeclaredSecretsFromFile(specPath) — resolves the spec's secrets: block from the local environment (env vars, files, shell commands).
  4. Calls client.Experiments.Create(ctx, …) with the resolved secrets.
  5. Calls client.Experiments.RunAndWait(ctx, expID, opts) — polls until done or timeout.
  6. Prints the full RunResults JSON to stdout.
  7. Exits non-zero if any scenario didn't pass — turns CI red on regression.

Flags:

Flag             Default                             Meaning
--name           spec filename (without extension)   Experiment name for the dashboard
--wait-timeout   10m                                 Max time to wait for completion
--poll           2s                                  Poll interval while waiting

Examples:

# Default — run, wait, print JSON, exit non-zero on failure
ks eval run keystone/example.yaml
 
# Custom name + longer wait window
ks eval run keystone/big-eval.yaml --name "release-candidate-v2" --wait-timeout 30m
 
# In CI — stream JSON to a file, exit non-zero if any scenario failed
ks eval run specs/regression.yaml > results.json
 
# Smaller poll for faster feedback in interactive use
ks eval run keystone/quick.yaml --poll 500ms

ks eval list

ks eval list

GET /v1/experiments — JSON array of every experiment scoped to your API key. Pipe to jq for filters:

# Last 5 experiments
ks eval list | jq '.[:5]'
 
# Just IDs and names
ks eval list | jq '.[] | {id, name, status, created_at}'
 
# Most recent experiment ID
ks eval list | jq -r '.[0].id'

ks eval get <experiment-id>

ks eval get exp-a1b2c3

Full RunResults JSON: per-scenario invariants, costs, traces, reproducer commands. Useful for ad-hoc inspection:

# Just the failing scenarios
ks eval get exp-abc | jq '.scenarios[] | select(.status == "fail")'
 
# Cost breakdown
ks eval get exp-abc | jq '.metrics.total_cost_usd'
 
# All reproducer commands (one per failure)
ks eval get exp-abc | jq -r '.scenarios[] | select(.status == "fail") | .reproducer.command'

ks eval metrics <experiment-id>

ks eval metrics exp-a1b2c3

GET /v1/metrics/experiments/<id> — aggregate metrics with per-tool breakdown and trends:

{
  "experiment_id": "exp-a1b2c3",
  "summary": {
    "total_runs": 30,
    "pass_rate": 0.93,
    "total_cost_usd": 12.40,
    "mean_cost_per_run_usd": 0.41,
    "mean_wall_ms": 14200,
    "p95_wall_ms": 23000,
    "total_tool_calls": 342,
    "tool_success_rate": 0.97
  },
  "tool_breakdown": {
    "edit":    { "count": 89, "mean_ms": 200, "error_rate": 0.02 },
    "execute": { "count": 67, "mean_ms": 1500, "error_rate": 0.05 },
    "read":    { "count": 120, "mean_ms": 50, "error_rate": 0.0 }
  },
  "cost_trend": [
    { "run_id": "run-1", "cost_usd": 0.28, "ts": "..." },
    { "run_id": "run-2", "cost_usd": 0.31, "ts": "..." }
  ],
  "pass_rate_trend": [...]
}

Useful for:

# Slowest tool
ks eval metrics exp-abc | jq '.tool_breakdown | to_entries | sort_by(.value.mean_ms) | reverse | .[0]'
 
# Cost over the run
ks eval metrics exp-abc | jq '.cost_trend[] | "\(.ts) \(.cost_usd)"'

ks eval compare <baseline-id> <candidate-id>

ks eval compare exp-baseline exp-new
ks eval compare exp-baseline exp-new --gate

Side-by-side comparison. With --gate, exits non-zero if any metric regressed — the CI gating primitive:

# CI: block merge on regression
ks eval compare $LAST_GREEN_EXP_ID $NEW_EXP_ID --gate
# Exit 1 if pass_rate dropped, p95 latency rose, etc.

Output:

{
  "baseline_id": "exp-baseline",
  "candidate_id": "exp-new",
  "regressed": true,
  "regressions": [
    "pass_rate dropped from 0.95 to 0.78",
    "p95_wall_ms rose from 12000 to 18000"
  ],
  "metrics": [
    { "name": "pass_rate", "baseline": 0.95, "candidate": 0.78, "delta": -0.17, "direction": "worse" },
    { "name": "mean_wall_ms", "baseline": 14000, "candidate": 14500, "delta": 500, "direction": "same" },
    { "name": "p95_wall_ms", "baseline": 12000, "candidate": 18000, "delta": 6000, "direction": "worse" },
    { "name": "mean_cost_per_run_usd", "baseline": 0.30, "candidate": 0.42, "delta": 0.12, "direction": "worse" }
  ]
}

The default regression thresholds:

Metric                  Threshold
pass_rate               -2%
p95_wall_ms             +20%
mean_cost_per_run_usd   +20%
tool_success_rate       -5%
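
The --gate flag applies these thresholds for you. If you need a custom gate, the regressed field in the compare output (shown above) is enough for a hand-rolled check:

ks eval compare exp-baseline exp-new | jq -e '.regressed | not' > /dev/null
# jq -e exits non-zero when the expression is false — i.e. when regressed is true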

ks logs — extract trace data

Six subcommands. Each defaults to NDJSON (line-delimited JSON) output, suitable for piping into jq, wc, etc.

ks logs traces <experiment-id>

Stream trace events.

# Every event from the experiment
ks logs traces exp-abc
 
# Filter by tool name
ks logs traces exp-abc --tool write_file
 
# Filter by event type — only LLM calls
ks logs traces exp-abc --event-type llm_call
 
# Only events after a timestamp (RFC3339)
ks logs traces exp-abc --since 2026-04-28T22:00:00Z
 
# Write to a file instead of stdout
ks logs traces exp-abc -o traces.jsonl
 
# JSON array instead of NDJSON
ks logs traces exp-abc --format json

Flags:

Flag           Meaning
--event-type   llm_call, tool_use, tool_call
--tool         Filter by tool name
--agent        Filter by agent name (for snapshot agents)
--sandbox      Filter by sandbox ID
--since        Only events after this RFC3339 timestamp
-o, --output   Write to file (default: stdout)
--format       ndjson (default) or json
--page-size    Server pagination size (default 100)

ks logs trace <trace-id>

Single trace by ID — full span tree as JSON.

ks logs trace trace-abc123

GET /v1/traces/<id>. Returns:

{
  "spans": [
    { "ts": "...", "event_type": "llm_call", "tool": "anthropic.create", "duration_ms": 2340, "cost": {...} },
    { "ts": "...", "event_type": "tool_use", "tool": "write_file", "parent_span_id": "span_xyz", "input": "..." }
  ]
}

ks logs spans

Stream spans matching a filter. Spans are individual events within traces.

# All spans in an experiment
ks logs spans --experiment exp-abc
 
# Single trace's spans
ks logs spans --trace trace-abc123
 
# All descendants of a root span (the full sub-tree)
ks logs spans --root-span span_abc
 
# Only direct children of a span
ks logs spans --parent-span span_xyz
 
# Only tool_use spans for a specific tool
ks logs spans --experiment exp-abc --event-type tool_use --tool write_file

ks logs scenarios

Stream per-scenario rows for an experiment. Required: --experiment.

# Every scenario (one line per scenario, with status, scores, etc.)
ks logs scenarios --experiment exp-abc
 
# Only failed scenarios
ks logs scenarios --experiment exp-abc --status failed
 
# A specific scenario
ks logs scenarios --experiment exp-abc --scenario scenario-000

ks logs scores

Stream offline scoring results (from ks.scoring.scoreExperiment). Required: --experiment.

ks logs scores --experiment exp-abc
ks logs scores --experiment exp-abc --rule rule-factuality-1

ks logs export <experiment-id>

Dump the entire experiment bundle — spec, scenarios, traces, scores — as one document.

ks logs export exp-abc -o exp-abc.json                # JSON (default)
ks logs export exp-abc --format ndjson -o exp-abc.jsonl

Use this for offline analysis: pandas, notebooks, external tooling.
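
For a quick look before reaching for a notebook, jq over the exported bundle works too. A sketch, assuming the bundle's scenarios carry a status field (as in the pandas example under Common workflows):

# Scenario pass/fail counts straight from the bundle
jq '.scenarios | group_by(.status) | map({status: .[0].status, n: length})' exp-abc.json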

ks setup — wire Keystone into a repo

ks setup (with no subcommand) runs every onboarding phase end-to-end. Each phase is idempotent — re-running is safe.

ks setup

Phases run in this order:

#   Phase        What it does
1   skills       Write agent skill files
2   mcp          Register the Keystone MCP server in agent configs
3   spec         Drop a starter spec at keystone/example.yaml
4   instrument   Scan source for LLM client construction sites and print wrapping snippets
5   install      Install the Keystone SDK for each detected language
6   snapshot     Detect agent code and explain how to package + upload as a snapshot
7   doctor       Verify KEYSTONE_API_KEY, server reachability, auth, ks on PATH

Run individual phases:

ks setup skills
ks setup mcp
ks setup spec
ks setup instrument
ks setup install
ks setup snapshot
ks setup doctor

Each is documented below.

ks setup skills

Writes Keystone skill files into your project's coding-agent config directories. The skill file teaches the coding agent (Claude Code, Cursor, Codex, etc.) what Keystone is and how to use it — so when you ask the agent to "set up Keystone in this repo," it knows what to do.

Targets selected interactively (TTY) or via --agents:

Agent         Skill path written
Claude Code   .claude/skills/keystone/SKILL.md
Cursor        .cursor/rules/keystone.mdc
Gemini CLI    .gemini/skills/keystone/SKILL.md
OpenCode      .opencode/skills/keystone/SKILL.md
Codex         .codex/skills/keystone/SKILL.md
Windsurf      .windsurf/skills/keystone/SKILL.md
VS Code       (no skill convention yet)
Generic       .agents/skills/keystone/SKILL.md

# Interactive prompt (when run from a TTY)
ks setup skills
 
# Explicit list of agents
ks setup skills --agents claude,cursor
 
# All agents (non-interactive default in CI)
ks setup skills --agents all

Cursor's .mdc format is auto-translated; the rest write the canonical SKILL.md markdown.

ks setup mcp

Registers ks mcp serve as an MCP server in your project's coding-agent configs. After this, your agent can call Keystone's eval/logs verbs as MCP tools — useful for "improve this spec until pass rate is 95%" loops.

Agent         MCP config written
Claude Code   .mcp.json
Cursor        .cursor/mcp.json
Gemini CLI    .gemini/settings.json
OpenCode      .opencode/mcp.json
Windsurf      .windsurf/mcp_config.json
VS Code       .vscode/mcp.json

The config preserves any other top-level keys — naive struct unmarshal would drop them, so the merger uses map[string]any.
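
The effect is the same as a targeted jq assignment that prints the merged document: only the keystone entry is set, everything else passes through (illustrative only; ks performs this merge internally in Go):

jq '.mcpServers.keystone = {"command": "ks", "args": ["mcp", "serve"]}' .mcp.json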

Example written to .mcp.json:

{
  "mcpServers": {
    "keystone": {
      "command": "/usr/local/bin/ks",
      "args": ["mcp", "serve"],
      "env": {
        "KEYSTONE_API_KEY": "ks_live_..."
      }
    }
  }
}

After running, restart your agent so it picks up the new server.

ks setup spec

Drops a starter spec at keystone/example.yaml:

version: 1
id: example-rest-api
description: Build a tiny Express server and verify both routes work.
 
base: node:20
 
task:
  prompt: |
    Build a tiny Express server in server.js exposing:
      GET  /healthz  →  200 with body "ok"
      GET  /users    →  200 with a JSON array
    Add a package.json so `npm test` runs (it can be a no-op).
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  server_file_exists:
    description: server.js was created
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: server.js
 
  package_json_exists:
    description: package.json was created
    weight: 0.5
    check:
      type: file_exists
      path: package.json
 
  server_loads:
    description: server.js requires cleanly
    weight: 1.5
    check:
      type: command_exit
      command: node -e "require('./server.js')"
      expect_exit_code: 0
 
  routes_work:
    description: server actually serves both routes (LLM-as-judge)
    weight: 2.0
    check:
      type: llm_as_judge
      rubric: |
        Does server.js actually start an Express server and handle BOTH
        /healthz and /users routes? Reject implementations that stub the
        routes with TODO comments or return 500.
 
scoring:
  pass_threshold: 0.7
 
parallelism:
  replicas: 1

If keystone/example.yaml already exists, the phase is skipped (won't clobber your work). Run it with:

ks eval run keystone/example.yaml

ks setup instrument

Scans your project for LLM client construction sites — new OpenAI(), Anthropic(), litellm.completion(...), new ChatAnthropic(...), etc. — across ~50 SDKs in Go, TypeScript, and Python:

Family              Detected patterns
OpenAI              new OpenAI(, OpenAI(, AsyncOpenAI(, new AzureOpenAI(, etc.
Anthropic           new Anthropic(, Anthropic(, AsyncAnthropic(, AnthropicVertex, AnthropicBedrock
Google              genai.NewClient, new GoogleGenAI(, genai.Client(, vertexai.init(
Cohere              cohere.NewClient(, new CohereClient(, cohere.Client(, cohere.AsyncClient(
Mistral             mistral.NewClient(, new MistralClient(, MistralClient(
Groq                new Groq(, Groq(, AsyncGroq(
Together            new Together(, Together(, AsyncTogether(
Fireworks           new Fireworks(, Fireworks(, AsyncFireworks(
Bedrock             bedrockruntime.NewFromConfig, boto3.client('bedrock', BedrockRuntime(
LangChain           new ChatOpenAI(, ChatAnthropic(, all Chat* constructors
Vercel AI SDK       @ai-sdk/openai, generateText(, streamText(, generateObject(, streamObject(
LiteLLM             litellm.completion(, litellm.acompletion(, litellm.Router(
Instructor          instructor.from_openai(, instructor.from_anthropic(, instructor.from_litellm(
DSPy                dspy.LM(, dspy.OpenAI(, dspy.Anthropic(
Pydantic AI         OpenAIModel(, AnthropicModel(, GeminiModel(
OpenAI Agents SDK   from agents import
(more)              Replicate, HuggingFace, Ollama, Mastra, OpenAI Agents, LangChainGo

For each hit, prints path:line and the SDK family. Filters out vendored SDK source code (skips node_modules, .venv, vendor/, site-packages, etc.) and comments.

$ ks setup instrument
 detected 5 call site(s) across 2 language(s), 3 SDK familie(s)
 
  TypeScript / JavaScript (3 hits, 2 families)
    src/agents/openai.ts:14   [openai]
    src/agents/openai.ts:42   [openai]
    src/lib/llm.ts:8           [anthropic]
 
  Python (2 hits, 2 families)
    pipeline/judge.py:23       [openai]
    rag/retriever.py:11        [langchain]
 
 ask your coding agent to wrap these (full instructions in SKILL.md "Step 1 — Wrap LLM clients")

ks setup install

Auto-installs the Keystone SDK for each language present in the project. Detects:

Language     Manifest detected                  Package manager picked
Go           go.mod                             go get github.com/Polarityinc/keystone-sdk-go
TypeScript   bun.lock / bun.lockb               bun add @polarityinc/polarity-keystone
TypeScript   pnpm-lock.yaml                     pnpm add @polarityinc/polarity-keystone
TypeScript   yarn.lock                          yarn add @polarityinc/polarity-keystone
TypeScript   package-lock.json / package.json   npm install @polarityinc/polarity-keystone
Python       uv.lock                            uv add polarity-keystone
Python       poetry.lock                        poetry add polarity-keystone
Python       Pipfile / Pipfile.lock             pipenv install polarity-keystone
Python       pyproject.toml (no lock)           uv add polarity-keystone (modern default)
Python       requirements.txt                   pip install polarity-keystone

Skipped if the SDK is already declared in the manifest. Use --no-install-sdk to print the recommended commands without running them.

ks setup snapshot

Detects agent code in your repo and explains how to package + upload as a Keystone snapshot. Looks for:

File                                                  Treated as
Dockerfile, agent.Dockerfile                          Docker-based agent
package.json                                          Node/TS agent
pyproject.toml, requirements.txt, agent.py, Pipfile   Python agent
agent.ts, agent.js                                    TypeScript/JS agent
go.mod, main.go                                       Go agent
Cargo.toml                                            Rust agent

Searches the project root plus common subdirectories (agent/, agents/, src/agent/, cmd/agent/, apps/agent/).

$ ks setup snapshot
 found 2 candidate agent location(s):
    [python]  agent/main.py
    [node]    apps/agent/package.json
 
 ask your coding agent to package + upload a snapshot (full instructions in SKILL.md "Step 3")

The full packaging guide is in the skill file (Step 3). It walks the agent through tarring the code, picking an entrypoint, calling ks.agents.upload().

ks setup doctor

The "is everything wired up?" check. Verifies:

  1. .env files load without errors (parse failures are surfaced first).
  2. KEYSTONE_API_KEY is set (shows a redacted hash so you know which key was picked up).
  3. The Keystone server is reachable at KEYSTONE_BASE_URL (or default).
  4. The API key authenticates (round-trips a GET /v1/experiments call).
  5. The ks binary is on $PATH (so MCP configs can resolve it).

$ ks setup doctor
 KEYSTONE_API_KEY set (ks_live_a1b2c3d4...e5f6)
 server reachable (https://keystone.polarity.so) — OK
 auth OK
 ks on PATH /usr/local/bin/ks
 
 all good — Keystone is wired up

If any check fails, the command exits non-zero with actionable hints:

$ ks setup doctor
 KEYSTONE_API_KEY get a key at https://app.paragon.run/app/keystone/settings, then put `KEYSTONE_API_KEY=<your-key>` in a `.env` at the project root (auto-loaded). Or pass --api-key / export it in your shell rc.
 auth keystone: API error 401 Unauthorized
 if your key is `ks_live_*` (a hosted Keystone key) but you're hitting a local daemon, set KEYSTONE_BASE_URL to your hosted URL (default: https://keystone.polarity.so)
      → if your key was issued by a local daemon, set KEYSTONE_BASE_URL=http://localhost:8012

ks setup flags

Flag               Default                Meaning
--agents           (interactive prompt)   Comma-separated list: claude,cursor,gemini,opencode,codex,windsurf,vscode,other,all
--no-install-sdk   false                  Print install commands without running them

Examples:

# Pick exactly which agents to target
ks setup --agents claude,cursor
 
# All agents, no SDK install (manual control)
ks setup --agents all --no-install-sdk
 
# Just one phase, with explicit agent list
ks setup skills --agents cursor

ks mcp serve — Model Context Protocol server

Run as a stdio MCP server so coding agents (Claude Code, Cursor, Codex) can call Keystone verbs as tools.

ks mcp serve

Exposed tools:

Tool name                 What it does
keystone_eval_run         Upload + run a spec; returns full RunResults
keystone_eval_list        List experiments
keystone_eval_get         Get one experiment's RunResults
keystone_eval_metrics     Aggregate metrics + tool breakdown
keystone_eval_compare     Compare two experiments
keystone_logs_traces      Stream trace events
keystone_logs_trace       Single trace by ID
keystone_logs_spans       Stream spans by filter
keystone_logs_scenarios   Stream scenario rows
keystone_logs_scores      Stream offline scores

Each tool's input schema mirrors the equivalent CLI command's flags. The agent calls them via standard MCP — no special integration.
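
You can also smoke-test a tool call over stdio without any agent in the loop. A rough sketch, assuming the server accepts a scripted handshake piped in one go:

printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"probe","version":"0"}}}' \
  '{"jsonrpc":"2.0","method":"notifications/initialized"}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"keystone_eval_list","arguments":{}}}' \
  | ks mcp serve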

Register via ks setup mcp (writes the right config files for every coding agent), or by hand:

{
  "mcpServers": {
    "keystone": {
      "command": "/usr/local/bin/ks",
      "args": ["mcp", "serve"],
      "env": { "KEYSTONE_API_KEY": "ks_live_..." }
    }
  }
}

Once running, your coding agent can do things like:

  • "Run specs/regression.yaml and tell me which scenarios failed."
  • "Compare exp-abc and exp-xyz; if it regressed, find the trace for the failing tool call."
  • "Stream traces for exp-abc, find any LLM call over $0.10, and tell me what tool it called."

ks update — upgrade

ks update              # in-place upgrade from GitHub Releases
ks update --force      # re-install even if already at latest version

The CLI also auto-checks for updates once per 24h on every invocation (cached, silent on network failure). Skipped for dev builds (-dev suffix), for the ks update command itself, and when KS_NO_AUTO_UPDATE is set in the environment.
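
To silence the background check for a single invocation (or export it once in CI):

KS_NO_AUTO_UPDATE=1 ks eval list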

If you installed via a method that puts ks in a non-writable location, re-run the installer:

curl -fsSL https://ks.polarity.so/install.sh | bash

Common workflows

CI: run regression evals, gate on regression

# .github/workflows/ci.yml
- name: Run regression eval
  env:
    KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
  run: |
    ks eval run specs/regression.yaml > results.json
    # Exits non-zero if any scenario failed → CI fails
 
- name: Compare to last green build
  env:
    KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
  run: |
    NEW_EXP_ID=$(jq -r .experiment_id results.json)
    ks eval compare $LAST_GREEN_EXP_ID $NEW_EXP_ID --gate

Onboard a new repo

cd ~/work/new-project
ks setup --agents claude,cursor    # one command, everything wired
ks eval run keystone/example.yaml  # smoke test

Debug a failing scenario

# 1. Find the failing scenarios
ks eval get exp-abc | jq '.scenarios[] | select(.status == "fail")'
 
# 2. Re-run with the failing scenario's seed for reproduction
ks eval run specs/scenario-1.yaml --seed 12345
 
# 3. Stream trace events to see what went wrong
ks logs traces exp-new | jq 'select(.status == "error")'
 
# 4. Drill into the failing span
ks logs spans --root-span span_xyz | jq

Cost analysis

# Total cost for the experiment
ks logs traces exp-abc \
  | jq -s 'map(.cost.estimated_usd // 0) | add'
 
# Most expensive single call
ks logs traces exp-abc --event-type llm_call \
  | jq -s 'sort_by(.cost.estimated_usd // 0) | reverse | .[0]'
 
# Cost by model
ks logs traces exp-abc --event-type llm_call \
  | jq -s 'group_by(.cost.model) | map({model: .[0].cost.model, total_usd: map(.cost.estimated_usd) | add})'

Tool latency

# Average latency per tool
ks logs traces exp-abc --event-type tool_call \
  | jq -s 'group_by(.tool) | map({tool: .[0].tool, count: length, mean_ms: (map(.duration_ms) | add / length)})'
 
# Slowest 10 tool calls
ks logs traces exp-abc --event-type tool_call \
  | jq -s 'sort_by(.duration_ms) | reverse | .[:10]'

Find which scenarios called which tools

# Tools by frequency
ks logs traces exp-abc --event-type tool_use \
  | jq -r '.tool' | sort | uniq -c | sort -rn
 
# Scenarios that called a specific tool
ks logs spans --experiment exp-abc --tool write_file \
  | jq -r '.scenario_id' | sort -u

Full bundle export for offline analysis

# Export everything as one JSON document
ks logs export exp-abc -o exp-abc.json
 
# Or NDJSON for streaming
ks logs export exp-abc --format ndjson -o exp-abc.jsonl

Load in pandas:

import pandas as pd
import json
 
with open("exp-abc.json") as f:
    bundle = json.load(f)
 
scenarios = pd.DataFrame(bundle["scenarios"])
print(scenarios.groupby("status").size())

Output formats

Format                       When to use
NDJSON (default for logs)    Pipe into jq, awk, wc -l. One event per line; safe for unbounded streams.
JSON (default for eval)      Whole-bundle dumps — good for archiving and offline analysis.
File output (-o file.json)   Use when you don't want to redirect stdout.

NDJSON output streams as the server returns it — even huge experiments can be processed without loading everything in memory.
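
That also means downstream tools can bail out early without waiting for the whole stream:

# Peek at the first 20 events of a huge experiment
ks logs traces exp-abc | head -n 20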

Exit codes

Code              Meaning
0                 Success
1                 Command-line error, network failure, or experiment had failures
1 (with --gate)   eval compare reported a regression

ks eval run defaults to exit-1-on-any-failure — what you want for CI gating.
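
The exit code composes with ordinary shell control flow:

if ks eval run specs/regression.yaml > results.json; then
  echo "all scenarios passed"
else
  echo "at least one scenario failed; see results.json" >&2
  exit 1
fi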

Environment variables (full list)

Env var                        Purpose
KEYSTONE_API_KEY               API key for authentication
KEYSTONE_BASE_URL              Server URL
KEYSTONE_SANDBOX_ID            (Inside a sandbox only) the current sandbox's ID
KEYSTONE_SERVICE_<NAME>_HOST   (Inside a sandbox) service host
KEYSTONE_SERVICE_<NAME>_PORT   (Inside a sandbox) service port

The first two are CLI inputs; the rest are server-injected into agent processes.
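
Inside a sandbox, agent code just reads the injected vars. An illustrative sketch — the service name db and its /healthz endpoint are hypothetical, standing in for whatever the spec declares:

curl "http://${KEYSTONE_SERVICE_DB_HOST}:${KEYSTONE_SERVICE_DB_PORT}/healthz"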

Troubleshooting

"unauthorized" / 401KEYSTONE_API_KEY not set, invalid, or the wrong server. Run ks setup doctor.

"spec not found" — you're trying to run an experiment against a spec that wasn't uploaded. ks eval run uploads first; if you're using direct API calls, ensure ks.specs.create() ran.

"sandbox rejected: at capacity" — you've hit the concurrent-sandbox limit. Wait a minute or contact support to raise the limit.

.env ignored — ks only loads .env and .env.local from the current directory, not parent directories. Run from your project root.

Auto-update silently fails — intentional. Updates shouldn't break shell scripts. Run ks update --force (which prints any error to stderr instead of swallowing it) to see what's going wrong, or set KS_NO_AUTO_UPDATE=1 to disable the background check entirely.

ks setup mcp doesn't show up in my coding agent — restart the agent so it re-reads its MCP config. Then verify ks mcp serve works manually:

echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05"}}' | ks mcp serve
# Should print a JSON response (the MCP initialization handshake)

ks update says "up to date" but I'm not — auto-update is cached for 24h. ks update --force bypasses the cache.

Reference: every flag

ks
  --api-key string       Keystone API key (env: KEYSTONE_API_KEY)
  --base-url string      Server URL (env: KEYSTONE_BASE_URL)
  --timeout duration     HTTP timeout (default 30s)
  -v, --version          ks version
  -h, --help             help

ks eval run <spec.yaml>
  --name string          Experiment name (default: spec filename)
  --wait-timeout dur     Max wait time (default 10m)
  --poll dur             Poll interval (default 2s)

ks eval compare <a> <b>
  --gate                 Exit non-zero on regression

ks logs traces <exp-id>
  --event-type string    llm_call | tool_use | tool_call
  --tool string          Filter by tool name
  --agent string         Filter by agent name
  --sandbox string       Filter by sandbox ID
  --since string         RFC3339 timestamp
  -o, --output string    Output file (default: stdout)
  --format string        ndjson (default) | json
  --page-size int        Server pagination size (default 100)

ks logs spans
  --experiment string    Experiment ID
  --trace string         Trace ID
  --span string          Span ID
  --parent-span string   Parent span ID
  --root-span string     Root span ID
  --tool string          Tool name
  --event-type string    Event type
  -o, --output string
  --format string
  --page-size int

ks logs scenarios
  --experiment string    (required)
  --status string        passed | failed | error
  --scenario string      Specific scenario ID
  -o, --output string
  --format string
  --page-size int

ks logs scores
  --experiment string    (required)
  --rule string          Filter by rule ID
  -o, --output string
  --format string
  --page-size int

ks logs export <exp-id>
  --format string        json (default) | ndjson
  -o, --output string

ks setup
  --agents string        claude,cursor,gemini,opencode,codex,windsurf,vscode,other,all
  --no-install-sdk       Print install commands without running them

ks update
  --force                Re-install even if already at latest