Setup Guide

This page explains every step that `ks setup` runs, so you can run the whole wizard or pick individual phases.

The fastest path to a working Keystone install is two commands:

curl -fsSL https://ks.polarity.so/install.sh | bash    # install the CLI
cd ~/your-project && ks setup                          # run the wizard

ks setup runs seven phases end to end: it writes coding-agent skill files, registers an MCP server, drops a starter spec, scans your code for LLM clients, installs the Keystone SDK, detects agent code, and runs a doctor check. Each phase is idempotent, so re-running is safe; you can also run any single phase by name (ks setup spec, ks setup doctor, etc.).

This page documents every phase in detail. If you're impatient, run ks setup and skip to Step 4 — that's the only step the CLI can't do for you (the actual code change to wrap your LLM clients).

Total time: under 5 minutes for the basic flow. Longer if you're packaging your own agent as a snapshot.

Step 0 — Prerequisites

You need:

A Keystone account. Sign up at app.paragon.run.
An API key. Go to Keystone → Settings → API Keys → Create Key. The key starts with ks_live_ and is shown once at creation — copy it now.
Either:
- Node.js (≥18) for the TypeScript SDK + npx ks ... CLI, OR
- Python (≥3.9) for the Python SDK, OR
- Go (≥1.22) for the Go SDK + native CLI

You do not need Docker locally. Keystone runs sandboxes on Polarity's infrastructure.

Step 1 — Install the CLI

The CLI (ks) is a single Go binary with no runtime dependencies. Install whichever way fits:

# macOS / Linux — canonical (what the CLI's auto-updater also uses)
curl -fsSL https://ks.polarity.so/install.sh | bash
 
# Pin a version
curl -fsSL https://ks.polarity.so/install.sh | bash -s ks-v0.1.2
 
# Or download a release binary directly
# → https://github.com/Polarityinc/ks/releases/latest

Verify:

ks --version
# ks version v0.1.2 (or whatever the current release is)

The CLI checks for updates automatically once every 24 hours. Run ks update to upgrade in place.

ks.polarity.so vs keystone.polarity.so. ks.polarity.so is where the install script and release binaries live. keystone.polarity.so is the API server the CLI calls (the value of KEYSTONE_BASE_URL). You'll only ever set the API URL if you're running self-hosted; for the hosted offering, the default is right.

Step 2 — Configure your API key

Drop the key in your project's .env (auto-loaded by ks from the current directory):

# .env (gitignored)
KEYSTONE_API_KEY=ks_live_xxxxxxxxxxxxxxxxxxxxxx

Or export in your shell:

export KEYSTONE_API_KEY=ks_live_xxxxxxxxxxxxxxxxxxxxxx

Verify:

ks setup doctor

Expected:

✓ KEYSTONE_API_KEY — set (ks_live_xxxxxxxx…xxxx)
✓ server reachable (https://keystone.polarity.so) — OK
✓ auth — OK
✓ ks on PATH — /usr/local/bin/ks

 all good — Keystone is wired up

If anything's red, run the suggested fix and re-run doctor until green.

Step 3 — Initialize your project with `ks setup`

This is the load-bearing step. One command wires everything in:

cd ~/my-project
ks setup

ks setup runs seven phases sequentially. Each is idempotent (safe to re-run) and independent (you can run any one by name).

What you'll see

When you run ks setup from a TTY, the wizard:

Prints a styled banner explaining what it will do.
Interactively prompts you for which coding agents to target (Claude Code, Cursor, Gemini CLI, OpenCode, Codex, Windsurf, VS Code, or "other / generic"). The default selection is "all."
Runs each phase in order, printing per-phase status (✓ wrote .claude/skills/keystone/SKILL.md, ⚠ already exists, etc.).
Ends with a copy-pasteable prompt block to hand to your coding agent — telling it exactly what's left to do (the LLM-client wrapping in Step 4 below).

If you pipe stdin (CI, scripts), it skips the prompt and targets all agents:

# Non-interactive default
ks setup < /dev/null

Pick agents explicitly

# Comma-separated keys
ks setup --agents claude,cursor
ks setup --agents claude,cursor,vscode
ks setup --agents all

Available keys: claude, cursor, gemini, opencode, codex, windsurf, vscode, other, all.

Skip the SDK auto-install

ks setup --no-install-sdk

ks setup detects your project's languages and runs the right package-manager command (go get, npm install, uv add, pip install, etc.). With --no-install-sdk, it just prints the command — useful when you want manual control over your dep tree.

Run individual phases

Each of the seven phases is a subcommand you can run in isolation:

ks setup skills        # only write coding-agent skill files
ks setup mcp           # only register the MCP server
ks setup spec          # only drop the starter spec
ks setup instrument    # only scan source for LLM clients
ks setup install       # only install the SDK
ks setup snapshot      # only detect agent code
ks setup doctor        # only run the health check

Each is documented in detail below.

Phase 1 — `ks setup skills`

ks setup skills

Writes Keystone skill files into your project's coding-agent config directories. A skill file is a markdown document that teaches the coding agent (Claude Code, Cursor, Gemini, etc.) what Keystone is, what verbs the CLI exposes, and how to call them. After this, when you ask the agent "set up Keystone in this repo," it has the full playbook.

The SKILL.md body is the same across every target; only the path and frontmatter format differ per agent. Cursor uses .mdc with a Cursor-specific YAML header, and the rest use the Claude Skills format.

Agent key	Label	Skill file path
`claude`	Claude Code	`.claude/skills/keystone/SKILL.md`
`cursor`	Cursor	`.cursor/rules/keystone.mdc`
`gemini`	Gemini CLI	`.gemini/skills/keystone/SKILL.md`
`opencode`	OpenCode	`.opencode/skills/keystone/SKILL.md`
`codex`	Codex	`.codex/skills/keystone/SKILL.md`
`windsurf`	Windsurf	`.windsurf/skills/keystone/SKILL.md`
`vscode`	VS Code	(no skill convention — this agent is skipped)
`other`	Generic / other	`.agents/skills/keystone/SKILL.md`

The generic other target is included by default — if you use a coding agent that follows the emerging .agents/skills/ convention, it picks Keystone up without you doing anything.

What the skill file contains:

Spec YAML reference (every block, every type)
CLI verbs table (ks eval run, ks logs traces, etc.)
Go SDK examples
LLM-client wrapping instructions per language
A "set up Keystone in this project" task playbook with 4 numbered steps

The skill file is the source of truth the coding agent will reference whenever you ask it to do Keystone-related work.

# Write only for specific agents
ks setup skills --agents claude,cursor

Phase 2 — `ks setup mcp`

ks setup mcp

Registers ks mcp serve (Keystone's stdio MCP server) in each selected coding agent's MCP config file. After this, your agent can call Keystone's verbs as MCP tools — useful for "improve this spec until pass rate is 95%" loops where the agent invokes evals, reads traces, and iterates without leaving the chat.

Per-agent MCP config locations:

Agent	MCP config	Top-level key
Claude Code	`.mcp.json`	`mcpServers`
Cursor	`.cursor/mcp.json`	`mcpServers`
Gemini CLI	`.gemini/settings.json`	`mcpServers`
OpenCode	`.opencode/mcp.json`	`mcpServers`
Windsurf	`.windsurf/mcp_config.json`	`mcpServers`
VS Code	`.vscode/mcp.json`	`servers`
Codex	(user-scope only — skipped)	—

The merger decodes each config file into a generic map (map[string]any), so it preserves every other top-level field in the existing JSON. This matters for files like .gemini/settings.json that also hold theme, model, and other unrelated settings, which a naive struct unmarshal would silently drop.

Example written to .mcp.json:

{
  "mcpServers": {
    "keystone": {
      "command": "/usr/local/bin/ks",
      "args": ["mcp", "serve"],
      "env": {
        "KEYSTONE_API_KEY": "ks_live_..."
      }
    }
  }
}

The command resolves to whatever ks binary the wizard found (os.Executable()). KEYSTONE_API_KEY and KEYSTONE_BASE_URL are forwarded from your env if set.
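The field-preserving merge described above can be sketched roughly as follows. This is a TypeScript illustration only (the CLI itself is a Go binary); `mergeMcpServer` and the temporary settings file are hypothetical names invented for this example and are not part of ks.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type Json = Record<string, unknown>;

// Hypothetical helper: add (or overwrite) one server entry under `topKey`
// while leaving every other field in the config untouched.
function mergeMcpServer(existing: Json, topKey: string, name: string, entry: Json): Json {
  const servers = (existing[topKey] ?? {}) as Json;
  return { ...existing, [topKey]: { ...servers, [name]: entry } };
}

// Demo against a throwaway file that mimics a settings file holding
// unrelated fields (theme) and an unrelated MCP server.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "ks-mcp-"));
const file = path.join(dir, "settings.json");
fs.writeFileSync(
  file,
  JSON.stringify({ theme: "dark", mcpServers: { other: { command: "x" } } }),
);

const before = JSON.parse(fs.readFileSync(file, "utf-8")) as Json;
const after = mergeMcpServer(before, "mcpServers", "keystone", {
  command: "/usr/local/bin/ks",
  args: ["mcp", "serve"],
});
fs.writeFileSync(file, JSON.stringify(after, null, 2) + "\n");
```

Because the file is read into a generic object and spread back out, `theme` and the `other` server survive the write; only the `keystone` entry is added or replaced.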

After running this phase, restart your coding agent so it picks up the new server entry. Verify ks mcp serve works manually:

echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05"}}' | ks mcp serve
# Should print a JSON-RPC initialize response.

The 10 MCP tools the server exposes are listed in the CLI page.

Phase 3 — `ks setup spec`

ks setup spec

Writes keystone/example.yaml — a starter spec you can run immediately. Skipped if keystone/example.yaml already exists (won't clobber your work).

The starter spec exercises every common block: a Node base image, a task prompt, the paragon agent, four invariants covering three check types (file_exists, command_exit, llm_as_judge), and pass_threshold: 0.7. The full file:

version: 1
id: example-rest-api
description: Build a tiny Express server and verify both routes work.
 
base: node:20
 
task:
  prompt: |
    Build a tiny Express server in server.js exposing:
      GET  /healthz  →  200 with body "ok"
      GET  /users    →  200 with a JSON array
    Add a package.json so `npm test` runs (it can be a no-op).
 
agent:
  type: paragon
  timeout: 5m
 
invariants:
  server_file_exists:
    description: server.js was created
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: server.js
 
  package_json_exists:
    description: package.json was created
    weight: 0.5
    check:
      type: file_exists
      path: package.json
 
  server_loads:
    description: server.js requires cleanly
    weight: 1.5
    check:
      type: command_exit
      command: node -e "require('./server.js')"
      expect_exit_code: 0
 
  routes_work:
    description: server actually serves both routes (LLM-as-judge)
    weight: 2.0
    check:
      type: llm_as_judge
      rubric: |
        Does server.js actually start an Express server and handle BOTH
        /healthz and /users routes? Reject implementations that stub the
        routes with TODO comments or return 500 / undefined.
 
scoring:
  pass_threshold: 0.7
 
parallelism:
  replicas: 1

Run it:

ks eval run keystone/example.yaml

Expected: a passing scenario in 10–30 seconds.
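To see how the weights and pass_threshold in the starter spec might interact, here is a small worked example in TypeScript. It rests on an assumption this page does not confirm: that the composite score is the passed weight divided by the total weight, and that a failed gate invariant forces the score to 0. `compositeScore` is a hypothetical helper, not an SDK function.

```typescript
interface InvariantResult {
  name: string;
  weight: number;
  passed: boolean;
  gate?: boolean;
}

// ASSUMPTION (not confirmed by the docs): score = passed weight / total weight,
// and any failed gate invariant zeroes the score.
function compositeScore(results: InvariantResult[]): number {
  if (results.some((r) => r.gate === true && !r.passed)) return 0;
  const total = results.reduce((sum, r) => sum + r.weight, 0);
  const earned = results.reduce((sum, r) => sum + (r.passed ? r.weight : 0), 0);
  return total === 0 ? 0 : earned / total;
}

// The four invariants from keystone/example.yaml, with one non-gate failure.
const run: InvariantResult[] = [
  { name: "server_file_exists", weight: 1.0, passed: true, gate: true },
  { name: "package_json_exists", weight: 0.5, passed: false },
  { name: "server_loads", weight: 1.5, passed: true },
  { name: "routes_work", weight: 2.0, passed: true },
];

const score = compositeScore(run); // 4.5 / 5.0 = 0.9
const passes = score >= 0.7;       // compared against pass_threshold: 0.7
```

Under the same assumption, failing only routes_work would give 3.0 / 5.0 = 0.6, which falls below the 0.7 threshold.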

Phase 4 — `ks setup instrument`

ks setup instrument

Walks every source file in your repo (.go, .ts, .tsx, .js, .mjs, .cjs, .py), matches each line against a catalog of LLM SDK construction patterns (listed by family in the table below), and prints each hit as path:line tagged with the SDK family.

Detected providers:

Family	Patterns matched
`openai`	`new OpenAI(`, `OpenAI(`, `AsyncOpenAI(`, `new AzureOpenAI(`, `AzureOpenAI(`
`anthropic`	`new Anthropic(`, `Anthropic(`, `AsyncAnthropic(`, `AnthropicVertex(`, `AnthropicBedrock(`
`google-genai`	`new GoogleGenerativeAI(`, `genai.NewClient(`, `genai.Client(`, `GenerativeModel(`
`google-vertex`	`vertexai.NewClient(`, `vertexai.init(`
`cohere`	`cohere.NewClient(`, `new CohereClient(`, `cohere.Client(`, `cohere.AsyncClientV2(`
`mistral`	`mistral.NewClient(`, `new MistralClient(`, `MistralClient(`
`groq`	`new Groq(`, `Groq(`, `AsyncGroq(`
`together`	`new Together(`, `Together(`, `AsyncTogether(`
`fireworks`	`new Fireworks(`, `Fireworks(`, `AsyncFireworks(`
`replicate`	`new Replicate(`, `replicate.Client(`, `Replicate(`
`huggingface`	`new HfInference(`, `InferenceClient(`, `AsyncInferenceClient(`
`bedrock`	`new BedrockRuntime(`, `bedrockruntime.NewFromConfig(`, `boto3.client('bedrock'`, `BedrockRuntime(`
`ollama`	`new Ollama(`, `ollama.NewClient(`, `ollama.Client(`, `ollama.AsyncClient(`
`vercel-ai`	`@ai-sdk/openai`, `@ai-sdk/anthropic`, `@ai-sdk/google`, `generateText(`, `streamText(`, `generateObject(`, `streamObject(`
`langchain`	`new ChatOpenAI(`, `ChatAnthropic(`, `ChatGoogleGenerativeAI(`, `ChatCohere(`, `ChatMistralAI(`, `ChatGroq(`, `ChatBedrock(`, `ChatVertexAI(`
`langchain-go`	`langchaingo/llms`
`litellm`	`litellm.completion(`, `litellm.acompletion(`, `litellm.Router(`
`instructor`	`instructor.from_openai(`, `instructor.from_anthropic(`, `instructor.from_litellm(`, `instructor.from_gemini(`, `instructor.from_cohere(`
`dspy`	`dspy.LM(`, `dspy.OpenAI(`, `dspy.Anthropic(`
`pydantic-ai`	`OpenAIModel(`, `AnthropicModel(`, `GeminiModel(`, `GroqModel(`, `MistralModel(`
`mastra`	`new Mastra(`
`openai-agents`	`@openai/agents`, `from agents import`
`http-option`	`option.WithHTTPClient(` (Go — generic HTTP client option)

Skipped directories (so the scanner doesn't flag vendored SDK source):

node_modules, .git, dist, build, vendor, .next, .nuxt, .turbo, target, out, coverage, __pycache__, .pytest_cache, .mypy_cache, .idea, .vscode, .gradle, .cargo, anything starting with .venv / venv / env- / .env-, and anything under a site-packages tree.

Skipped lines: comments (//, #, /*, >>>, etc.) and class/interface declarations (so vendored SDK source code isn't matched as a "construction").

Output is grouped by language with up to 3 file:line examples per language; the summary mentions the total hit count and the SDK families. The full list goes into the skill file (so the coding agent has every site when it does the wrapping in Step 4).

$ ks setup instrument
✓ detected 5 call site(s) across 2 language(s), 3 SDK familie(s)
 
  TypeScript / JavaScript (3 hits, 2 families)
    src/agents/openai.ts:14   [openai]
    src/agents/openai.ts:42   [openai]
    src/lib/llm.ts:8           [anthropic]
 
  Python (2 hits, 2 families)
    pipeline/judge.py:23       [openai]
    rag/retriever.py:11        [langchain]
 
→ ask your coding agent to wrap these (full instructions in SKILL.md → "Step 1 — Wrap LLM clients")

The wrap is the one step ks setup cannot do for you — see Step 4 below.
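The scan described in this phase (walk source files, skip vendored directories, skip comment and class/interface lines, match construction patterns) can be sketched in a few lines of TypeScript. This is a simplified illustration, not the CLI's scanner: `PATTERNS` holds only three of the catalog entries, the skip lists are abbreviated, and `scanLine` / `scan` are hypothetical names.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface Hit {
  file: string;
  line: number;
  family: string;
}

// A tiny subset of the catalog above: [family, substring to look for].
const PATTERNS: Array<[string, string]> = [
  ["openai", "OpenAI("],
  ["anthropic", "Anthropic("],
  ["litellm", "litellm.completion("],
];
const SOURCE_EXT = new Set([".go", ".ts", ".tsx", ".js", ".mjs", ".cjs", ".py"]);
const SKIP_DIRS = new Set(["node_modules", ".git", "dist", "build", "vendor", "site-packages"]);
const COMMENT_PREFIXES = ["//", "#", "/*", ">>>"];

// Returns the SDK family for a line, or undefined if the line is skipped or has no match.
function scanLine(text: string): string | undefined {
  const trimmed = text.trim();
  if (COMMENT_PREFIXES.some((p) => trimmed.startsWith(p))) return undefined;
  if (/^(export\s+)?(class|interface)\s/.test(trimmed)) return undefined;
  return PATTERNS.find(([, needle]) => trimmed.includes(needle))?.[0];
}

// Recursively walks `dir`, collecting one Hit per matching line.
function scan(dir: string, hits: Hit[] = []): Hit[] {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      if (!SKIP_DIRS.has(entry.name)) scan(full, hits);
    } else if (SOURCE_EXT.has(path.extname(entry.name))) {
      fs.readFileSync(full, "utf-8").split("\n").forEach((text, i) => {
        const family = scanLine(text);
        if (family !== undefined) hits.push({ file: full, line: i + 1, family });
      });
    }
  }
  return hits;
}
```

Calling `scan(".")` from a project root would return one entry per matching line, which could then be grouped by language for display.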

Phase 5 — `ks setup install`

ks setup install

Detects your project's languages and runs the correct package-manager command for each. Detection is based on dependency manifests at the project root and is independent of whether Phase 4 (ks setup instrument) found any actual LLM construction sites, since most users run ks setup before writing the LLM-using code.

Language → package manager mapping:

Manifest detected	Package manager picked	Install command
`go.mod`	`go`	`go get github.com/Polarityinc/keystone-sdk-go`
`bun.lockb` or `bun.lock`	`bun`	`bun add @polarityinc/polarity-keystone`
`pnpm-lock.yaml`	`pnpm`	`pnpm add @polarityinc/polarity-keystone`
`yarn.lock`	`yarn`	`yarn add @polarityinc/polarity-keystone`
`package-lock.json` (or only `package.json`)	`npm`	`npm install @polarityinc/polarity-keystone`
`uv.lock`	`uv`	`uv add polarity-keystone`
`poetry.lock`	`poetry`	`poetry add polarity-keystone`
`Pipfile.lock` (or `Pipfile`)	`pipenv`	`pipenv install polarity-keystone`
`pyproject.toml` (no lock)	`uv` (modern default)	`uv add polarity-keystone`
`requirements.txt`	`pip`	`pip install polarity-keystone`

Skipped if the SDK is already declared in the manifest. The wizard greps the right file (go.mod, package.json's dependencies/devDependencies, the dependencies list under [project] in pyproject.toml, etc.) before running anything.

Print without installing: ks setup install --no-install-sdk prints the commands for each detected language but doesn't execute them. Useful when you want to vet the install before letting it touch your dep tree.

If your project has multiple languages (e.g., a Go server + Python tooling), the phase installs for each one detected.
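The manifest-to-command mapping in the table above can be read as a first-match lookup per language. The TypeScript sketch below illustrates that reading; treating the table order as precedence (lockfiles before bare manifests) is an assumption, and `installCommands` is a hypothetical name rather than anything the CLI exposes.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type Rule = [string, string]; // [manifest file, install command]

const NPM_PKG = "@polarityinc/polarity-keystone";
const PY_PKG = "polarity-keystone";

const GO_RULES: Rule[] = [["go.mod", "go get github.com/Polarityinc/keystone-sdk-go"]];
// ASSUMPTION: earlier rows win when several manifests exist side by side.
const JS_RULES: Rule[] = [
  ["bun.lockb", `bun add ${NPM_PKG}`],
  ["bun.lock", `bun add ${NPM_PKG}`],
  ["pnpm-lock.yaml", `pnpm add ${NPM_PKG}`],
  ["yarn.lock", `yarn add ${NPM_PKG}`],
  ["package-lock.json", `npm install ${NPM_PKG}`],
  ["package.json", `npm install ${NPM_PKG}`],
];
const PY_RULES: Rule[] = [
  ["uv.lock", `uv add ${PY_PKG}`],
  ["poetry.lock", `poetry add ${PY_PKG}`],
  ["Pipfile.lock", `pipenv install ${PY_PKG}`],
  ["Pipfile", `pipenv install ${PY_PKG}`],
  ["pyproject.toml", `uv add ${PY_PKG}`],
  ["requirements.txt", `pip install ${PY_PKG}`],
];

// Returns the command of the first rule whose manifest exists at the project root.
function firstMatch(root: string, rules: Rule[]): string | undefined {
  return rules.find(([manifest]) => fs.existsSync(path.join(root, manifest)))?.[1];
}

// One command per detected language, mirroring the multi-language behavior.
function installCommands(root: string): string[] {
  return [GO_RULES, JS_RULES, PY_RULES]
    .map((rules) => firstMatch(root, rules))
    .filter((cmd): cmd is string => cmd !== undefined);
}
```

Printing the result of `installCommands(process.cwd())` without executing anything is roughly what the --no-install-sdk mode amounts to.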

Phase 6 — `ks setup snapshot`

ks setup snapshot

Detects agent code in your repo and prints what it found. Use this only when you want to run your own agent inside Keystone (instead of the built-in paragon agent).

Detection markers (file → what it implies):

File	Kind	Hint
`Dockerfile`	`dockerfile`	Build a Docker image of your agent
`agent.Dockerfile`	`dockerfile`	Agent-specific Dockerfile
`package.json`	`node`	Node/TS agent (consider snapshot for fast iteration)
`pyproject.toml`	`python`	Python agent (poetry/pdm/uv project)
`requirements.txt`	`python`	Python agent (pip)
`agent.py`	`python`	Python entrypoint
`agent.ts` / `agent.js`	`node`	TS/JS entrypoint
`go.mod`	`go`	Go agent module
`main.go`	`go`	Go entrypoint
`Cargo.toml`	`rust`	Rust agent

Search paths: project root + agent/, agents/, src/agent/, cmd/agent/, apps/agent/. The phase doesn't recurse arbitrarily — it only looks at the immediate contents of these specific directories.

$ ks setup snapshot
✓ found 2 candidate agent location(s):
    [python]  agent/agent.py
    [node]    apps/agent/package.json
 
→ ask your coding agent to package + upload a snapshot (full instructions in SKILL.md → "Step 3")

The phase doesn't package or upload anything. It surfaces the locations, and the skill file (Step 3a) has the full packaging walkthrough: tar the directory with tar -czf agent.tar.gz -C agent/ ., pick an entrypoint, and call ks.agents.upload(...). See Agent Snapshots for the API details.

If no agent code is detected, the phase no-ops with a hint pointing at the skill file. Skip without consequence if you're using built-in Paragon.
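The detection rule for this phase (a fixed list of directories, inspected one level deep, matched against marker filenames) can be sketched as below. The marker table and search paths come from this page; `findAgentCandidates` is a hypothetical name, and the real phase may differ in details such as ordering.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface Candidate {
  kind: string;
  file: string;
}

const SEARCH_DIRS = [".", "agent", "agents", "src/agent", "cmd/agent", "apps/agent"];

// Marker filename -> kind, as listed in the table above.
const MARKERS = new Map<string, string>([
  ["Dockerfile", "dockerfile"],
  ["agent.Dockerfile", "dockerfile"],
  ["package.json", "node"],
  ["pyproject.toml", "python"],
  ["requirements.txt", "python"],
  ["agent.py", "python"],
  ["agent.ts", "node"],
  ["agent.js", "node"],
  ["go.mod", "go"],
  ["main.go", "go"],
  ["Cargo.toml", "rust"],
]);

// Looks only at the immediate contents of each search directory (no recursion).
function findAgentCandidates(root: string): Candidate[] {
  const found: Candidate[] = [];
  for (const dir of SEARCH_DIRS) {
    const abs = path.join(root, dir);
    if (!fs.existsSync(abs) || !fs.statSync(abs).isDirectory()) continue;
    for (const name of fs.readdirSync(abs)) {
      const kind = MARKERS.get(name);
      if (kind !== undefined && fs.statSync(path.join(abs, name)).isFile()) {
        found.push({ kind, file: path.join(dir, name) });
      }
    }
  }
  return found;
}
```

A marker file buried in some other nested directory is not reported, which matches the "doesn't recurse arbitrarily" behavior.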

Phase 7 — `ks setup doctor`

ks setup doctor

The "is everything wired up?" final check. Verifies each layer of the install in order, prints actionable hints for whatever's broken, and exits non-zero if any check fails (so you can use it in CI / pre-flight scripts).

Checks performed (in order):

.env parse errors: surfaced first, because a malformed .env is the most common reason a "set" key looks unset. The CLI loads .env and .env.local from the current working directory; any parse error from godotenv.Load is captured and shown here.
KEYSTONE_API_KEY is set: picked up from --api-key flag → env → .env. Shows a redacted preview (first 12 + last 4 chars) so you know which key was picked up without leaking the value. If unset, prints the get-a-key URL plus a hint about how .env is loaded, and a special note if a .env exists at the project root but didn't yield the key (typo / case-sensitivity check).
Server reachable: sends an HTTP GET /health to KEYSTONE_BASE_URL (default https://keystone.polarity.so) with a 5-second timeout. A 200 response is the bar.
Auth works: calls client.Experiments.List(ctx) with a 10-second deadline. If it returns 401/403, prints a hint specifically about the ks_live_* vs local-daemon mismatch (the most common confusion when self-hosting).
ks is on PATH: checked with exec.LookPath("ks"). Required for the MCP config (Phase 2) to resolve the binary. If it isn't found, hints at go install or symlinking.
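The key lookup order in the second check (flag, then environment, then .env) and the case-sensitivity note can be illustrated with a short TypeScript sketch. `parseDotenv` is a deliberately simplified KEY=VALUE parser and `resolveApiKey` is a hypothetical helper; neither reflects the CLI's actual implementation.

```typescript
// Simplified parser: handles KEY=VALUE lines, blank lines, # comments, and
// matching surrounding quotes. Real dotenv parsing supports more syntax.
function parseDotenv(text: string): Map<string, string> {
  const vars = new Map<string, string>();
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
    vars.set(key, value);
  }
  return vars;
}

interface ResolvedKey {
  value: string;
  source: "flag" | "env" | ".env";
}

// Precedence as described above: --api-key flag, then environment, then .env.
function resolveApiKey(
  flag: string | undefined,
  env: Record<string, string | undefined>,
  dotenvText: string,
): ResolvedKey | undefined {
  const name = "KEYSTONE_API_KEY"; // names are case-sensitive
  if (flag) return { value: flag, source: "flag" };
  const fromEnv = env[name];
  if (fromEnv) return { value: fromEnv, source: "env" };
  const fromFile = parseDotenv(dotenvText).get(name);
  if (fromFile) return { value: fromFile, source: ".env" };
  return undefined;
}
```

With this sketch, a .env containing `keystone_api_key=...` yields nothing, which is the situation the "check spelling" note in the doctor output is meant to catch.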

Sample output, all green:

✓ KEYSTONE_API_KEY — set (ks_live_a1b2c3d4…e5f6)
✓ server reachable (https://keystone.polarity.so) — 200 OK
✓ auth — OK
✓ ks on PATH — /usr/local/bin/ks

 all good — Keystone is wired up

Sample output, broken:

✗ dotenv parse error — .env: line 3: invalid syntax
✗ KEYSTONE_API_KEY — get a key at https://app.paragon.run/app/keystone/settings, then put `KEYSTONE_API_KEY=<your-key>` in a `.env` at the project root (auto-loaded). Or pass --api-key / export it in your shell rc. — note: a `.env` exists at /Users/alex/proj/.env but didn't yield KEYSTONE_API_KEY (check spelling — names are case-sensitive)
✗ auth — keystone: API error 401 Unauthorized
      → if your key is `ks_live_*` (a hosted Keystone key) but you're hitting a local daemon, set KEYSTONE_BASE_URL to your hosted URL (default: https://keystone.polarity.so)
      → if your key was issued by a local daemon, set KEYSTONE_BASE_URL=http://localhost:8012

Pending tasks block. Whenever Phase 4 or Phase 6 left work behind (LLM clients to wrap, agent code to package as a snapshot), doctor ends with a styled box containing a copy-pasteable prompt for your coding agent. The prompt tells the agent to follow Step 1 / Step 3 of the skill file, and the block always points at the right skill-file path for your selected agent (e.g. .claude/skills/keystone/SKILL.md).

Run doctor any time you suspect something's drifted — it's the canonical "what's wrong" check.

Step 4 — Wrap your LLM clients

This is the one step ks can't do for you — the actual code change. Find each LLM client construction (ks setup instrument showed you where) and pass it through ks.wrap():

// BEFORE
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic();
 
// AFTER
import Anthropic from "@anthropic-ai/sdk";
import { Keystone } from "@polarityinc/polarity-keystone";
 
const ks = new Keystone();
ks.initTracing();   // no-op outside a sandbox
 
const anthropic = ks.wrap(new Anthropic());
// Every anthropic.messages.create() now auto-reports prompt, tokens,
// latency, tool-calls. Response is unchanged.

Repeat for every LLM client. Then wrap your tool functions with traced() so tool execution shows up alongside LLM calls in the trace tree:

import { promises as fs } from "node:fs";
import { traced } from "@polarityinc/polarity-keystone";
 
// BEFORE
async function readFile(path: string) {
  return await fs.readFile(path, "utf-8");
}
 
// AFTER (decorator form — preferred)
const readFile = traced(async (path: string) => {
  return await fs.readFile(path, "utf-8");
}, { name: "read_file" });
 
// OR (callback form — legacy but works)
async function readFile(path: string) {
  return await traced("read_file", async () => {
    return await fs.readFile(path, "utf-8");
  });
}

Repeat for every tool function. Nested traced() calls automatically build a parent-child span tree.
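To give a feel for how nested wrapped calls can end up as a parent-child tree, here is a self-contained TypeScript sketch built on Node's AsyncLocalStorage. It is not the SDK's traced() and says nothing about how Keystone implements spans internally; `tracedSketch`, `Span`, and `roots` are hypothetical names used only for illustration.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Span {
  name: string;
  children: Span[];
}

// Tracks which span is "current" for the running call chain.
const current = new AsyncLocalStorage<Span>();
const roots: Span[] = [];

// Illustrative stand-in with the same "wrap a function, give it a name" shape.
function tracedSketch<A extends unknown[], R>(
  fn: (...args: A) => R,
  opts: { name: string },
): (...args: A) => R {
  return (...args: A): R => {
    const span: Span = { name: opts.name, children: [] };
    const parent = current.getStore();
    // Attach to the enclosing span if there is one, otherwise start a new root.
    (parent ? parent.children : roots).push(span);
    // Everything fn calls (including across awaits) sees `span` as its parent.
    return current.run(span, () => fn(...args));
  };
}

const readFile = tracedSketch((p: string) => `contents of ${p}`, { name: "read_file" });
const summarize = tracedSketch((p: string) => readFile(p).length, { name: "summarize" });

summarize("notes.txt");
// roots now holds one "summarize" span with a single "read_file" child.
```

The caller never passes a parent explicitly; the nesting falls out of which wrapped function invoked which.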

Verify the wrap is working

After running an experiment, check that traces flowed:

ks logs traces $EXP_ID --event-type llm_call | head

If you see llm_call events with model + tokens + cost, the wrap took effect. If the output is empty, recheck Step 4: your agent's LLM client probably isn't wrapped.

Step 5 — Write a spec

Five questions, one YAML file:

What task should the agent do? One sentence the agent receives as a prompt.
What environment does it need? Base image, packages, repos, services (DB, mock APIs).
How do you know it worked? Concrete yes/no checks (file content, exit code, SQL queries, mock assertions).
What's off-limits? Forbidden DB writes / HTTP hosts / file paths.
What API keys does the agent need at runtime? LLM provider keys, third-party creds.

Edit keystone/example.yaml to match your scenario, or copy from Examples.

Minimum required fields:

version: 1
id: my-eval
description: "What this evaluates"
 
base: "ubuntu:24.04"
 
task:
  prompt: "<the instruction the agent receives>"
 
agent:
  type: paragon            # or cli, http, image, snapshot, python
  timeout: 5m
 
invariants:
  my_check:
    description: "What this verifies"
    weight: 1.0
    gate: true
    check:
      type: file_exists
      path: output.json
 
scoring:
  pass_threshold: 1.0

The full reference is in Spec Reference.

Step 6 — Run the experiment

ks eval run keystone/example.yaml

This:

Uploads the spec.
Auto-forwards declared secrets from your local .env.
Creates an experiment.
Runs scenarios in parallel (up to your tenant's concurrency limit).
Polls until done.
Prints the full RunResults JSON.
Exits non-zero if any scenario failed.

Expected output (passing):

{
  "experiment_id": "exp-a1b2c3",
  "total_scenarios": 1,
  "passed": 1,
  "failed": 0,
  "metrics": {
    "pass_rate": 1.0,
    "mean_wall_ms": 12000,
    "total_cost_usd": 0.043
  },
  "scenarios": [
    {
      "status": "pass",
      "composite_score": 1.0,
      "invariants": [
        { "name": "my_check", "passed": true, "gate": true, "weight": 1.0 }
      ]
    }
  ]
}

A failing run includes a reproducer command:

{
  "scenarios": [
    {
      "status": "fail",
      "composite_score": 0.0,
      "invariants": [
        {
          "name": "my_check",
          "passed": false,
          "gate": true,
          "message": "file 'output.json' does not exist"
        }
      ],
      "reproducer": {
        "seed": 12345,
        "command": "ks eval run keystone/example.yaml --seed 12345 --scenario scenario-000"
      }
    }
  ]
}

The reproducer command re-runs that exact scenario with the same seed for debugging.
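If you capture the RunResults JSON in a script, pulling out the failure reasons and reproducer commands is straightforward. The TypeScript sketch below uses only the fields shown in the sample output above; the interfaces are inferred from those samples rather than taken from a published schema, and `failureReport` is a hypothetical helper.

```typescript
// Shapes inferred from the sample output on this page (not an official schema).
interface Invariant {
  name: string;
  passed: boolean;
  gate?: boolean;
  weight?: number;
  message?: string;
}
interface Scenario {
  status: string;
  composite_score: number;
  invariants: Invariant[];
  reproducer?: { seed: number; command: string };
}
interface RunResults {
  scenarios: Scenario[];
}

// One line per failed scenario: the failed invariants, then the reproducer command.
function failureReport(results: RunResults): string[] {
  return results.scenarios
    .filter((s) => s.status === "fail")
    .map((s) => {
      const reasons = s.invariants
        .filter((i) => !i.passed)
        .map((i) => `${i.name}: ${i.message ?? "no message"}`);
      return `${reasons.join("; ")} -> ${s.reproducer?.command ?? "(no reproducer)"}`;
    });
}

// The failing sample from above, parsed as it would be from the CLI's stdout.
const sample: RunResults = JSON.parse(`{
  "scenarios": [{
    "status": "fail",
    "composite_score": 0.0,
    "invariants": [
      { "name": "my_check", "passed": false, "gate": true,
        "message": "file 'output.json' does not exist" }
    ],
    "reproducer": {
      "seed": 12345,
      "command": "ks eval run keystone/example.yaml --seed 12345 --scenario scenario-000"
    }
  }]
}`);

const report = failureReport(sample);
```

Each entry in `report` pairs the reason a scenario failed with the command that replays it.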

Step 7 — Iterate

The full inner loop:

Inspect failures. ks eval get $EXP_ID | jq '.scenarios[] | select(.status == "fail")'
Stream the trace. ks logs traces $EXP_ID | jq to see what the agent did.
Reproduce locally. Use the reproducer.command.
Fix. Adjust prompt, invariants, fixtures, or agent code.
Re-run. ks eval run keystone/example.yaml.

When the eval consistently passes, raise the bar:

Add parallelism.replicas: 10 to measure flakiness.
Add a matrix to compare configurations (models, prompts, etc.).
Move from pass_threshold: 0.7 to 0.95.
Add forbidden: rules to enforce trajectory constraints.
Add audit: blocks to capture activity for forbidden checks.

Step 8 — Wire into CI

Once the eval is stable, gate your CI on it:

# .github/workflows/ci.yml
name: Eval
on: [pull_request]
 
jobs:
  keystone-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install ks
        run: curl -fsSL https://ks.polarity.so/install.sh | bash
      - name: Run regression eval
        env:
          KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
        run: ks eval run keystone/regression.yaml

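Optionally, gate on regressions vs. the last green run. LAST_GREEN_EXP_ID is not defined anywhere in this snippet, so supply it yourself (for example, as a repository variable):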

- name: Compare to last green
  env:
    KEYSTONE_API_KEY: ${{ secrets.KEYSTONE_API_KEY }}
  run: |
    LATEST_EXP=$(ks eval list | jq -r '.[0].id')
    ks eval compare $LAST_GREEN_EXP_ID $LATEST_EXP --gate

--gate exits non-zero if pass rate dropped, latency rose, or cost rose past the default thresholds.
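The kind of comparison --gate performs can be pictured with the sketch below, written in TypeScript against the metrics fields from the RunResults sample. The threshold values are placeholders chosen for illustration (this page does not state the CLI's defaults), and `gateViolations` is a hypothetical function, not the CLI's logic.

```typescript
interface Metrics {
  pass_rate: number;
  mean_wall_ms: number;
  total_cost_usd: number;
}

// HYPOTHETICAL thresholds: the real defaults are not documented on this page.
const THRESHOLDS = { maxPassRateDrop: 0.05, maxLatencyRise: 0.2, maxCostRise: 0.2 };

// Relative increase from base to candidate; a zero baseline is handled explicitly.
function relativeRise(base: number, candidate: number): number {
  if (base === 0) return candidate > 0 ? Infinity : 0;
  return (candidate - base) / base;
}

// Returns a list of violated conditions; an empty list means the gate passes.
function gateViolations(base: Metrics, candidate: Metrics, t = THRESHOLDS): string[] {
  const violations: string[] = [];
  if (base.pass_rate - candidate.pass_rate > t.maxPassRateDrop) {
    violations.push("pass_rate dropped");
  }
  if (relativeRise(base.mean_wall_ms, candidate.mean_wall_ms) > t.maxLatencyRise) {
    violations.push("latency rose");
  }
  if (relativeRise(base.total_cost_usd, candidate.total_cost_usd) > t.maxCostRise) {
    violations.push("cost rose");
  }
  return violations;
}

const lastGreen: Metrics = { pass_rate: 1.0, mean_wall_ms: 12000, total_cost_usd: 0.043 };
const latest: Metrics = { pass_rate: 0.9, mean_wall_ms: 12500, total_cost_usd: 0.06 };
const violations = gateViolations(lastGreen, latest);
```

A CI wrapper would exit non-zero whenever `violations` is non-empty, which mirrors the exit-code behavior described above.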

Step 9 — Add alerts

Get notified on regressions:

import { Keystone } from "@polarityinc/polarity-keystone";
const ks = new Keystone();
 
await ks.alerts.create({
  name: "pass-rate-drop",
  eval_id: "my-eval",
  condition: "pass_rate < 0.85",
  notify: "slack",
  slack_channel: "#agent-alerts",
});
 
await ks.alerts.create({
  name: "cost-spike",
  condition: "mean_cost_per_run_usd > 1.00",
  notify: "webhook",
  webhook_url: "https://hooks.slack.com/services/...",
});

Slack notifications use Block Kit; webhooks receive raw JSON. See Alerts.

Step 10 — Production observability (optional)

The same SDK works for tracing real production traffic. Set KEYSTONE_API_KEY in your prod env, wrap your LLM clients, and traces flow to the dashboard's Traces tab — filterable by API key.

// Production code — same wrap, no sandbox
import { Keystone } from "@polarityinc/polarity-keystone";
import OpenAI from "openai";
 
const ks = new Keystone();      // reads KEYSTONE_API_KEY from prod env
const openai = ks.wrap(new OpenAI());
// Every prod call now traces to /v1/traces, scoped by your API key.

See Production Mode.

Where to go from here

If you want to	Read
Learn the spec syntax in detail	Spec Reference
Understand the SDK methods	SDK Reference
Set up backing services (DB, mocks)	Services
Inject credentials safely	Secrets
Write better invariants	Invariants
Block forbidden agent behavior	Forbidden Rules
Use the 28 built-in scorers	Scorers Library
Trace LLM calls in detail	LLM Tracing
Trace custom code	Custom Spans
Compare two experiments	Experiments
Package your own agent	Agent Snapshots
Master the CLI	Keystone CLI
Debug something broken	Troubleshooting

Common pitfalls

"My agent runs in the sandbox but I don't see traces" — make sure you called ks.wrap() and ks.initTracing() (or just ks.observe(...) which does both). The wrap silently no-ops outside a sandbox if no API key is set.

"Spec uploads fine but the experiment fails immediately" — usually a missing secret. Check the experiment's error message; it'll name the secret. Add it to your .env or the Dashboard Secrets tab.

"All my replicas pass but pass_rate is < 1.0" — replica_aggregation.strategy: percentage rounds pass-rate down. Switch to all_must_pass or accept the rounding.

"The agent works locally but fails in the sandbox" — the sandbox doesn't have your local files unless you copy them in via fixtures or pass them via setup.files. The agent has its own filesystem.

"{{ secrets.X }} isn't substituting" — interpolation only works in declared fields (service env, fixture seed, agent args). Free-form spec text isn't templated.

"The trace is empty" — ks.wrap() only intercepts messages.create() (Anthropic) or chat.completions.create() (OpenAI). If your agent uses a different method (e.g., responses.create()), wrap manually or use auto_instrument.

If something else is broken, Troubleshooting has the full list of common errors with fixes.
