Experiments

Run scenarios at scale — replicas, matrices, comparisons, alerts. The full experiment SDK across TS, Python, and Go.

An experiment is a named batch of scenarios run against a spec. One experiment = one run; the same spec can have many experiments over its lifetime. Replicas measure flakiness; matrices compare configurations; comparison flags regressions.

This page is the full surface of ks.experiments.*.

experiments.create(opts)

Creates an experiment record. Doesn't run it yet.

const exp = await ks.experiments.create({
  name: "baseline-v1",
  spec_id: "fix-failing-test",
  specPath: "./specs/fix-failing-test.yaml",   // optional — auto-forwards .env secrets
  secrets: { ANTHROPIC_API_KEY: "..." },        // optional — explicit overrides
});
// Returns: { id, name, spec_id, status: "created", created_at }

What this does: sends POST /v1/experiments with the body shown. The server creates an experiment row and assigns an ID; no scenarios run yet. Pass specPath (TS/Python) or call CollectDeclaredSecretsFromFile (Go) to auto-forward declared secrets from your local environment — see Secrets.

experiments.run(id)

Triggers an asynchronous run. Returns immediately.

await ks.experiments.run(exp.id);
// Server enqueues scenarios; the call returns when the queue accepts them.

What this does: POST /v1/experiments/:id/run. The server expands the spec's parallelism block (replicas × matrix entries) into individual scenario jobs, enqueues them on the priority lanes, and returns. The actual scenarios execute asynchronously on worker nodes.

experiments.runAndWait(id, opts?)

The most common entry point. Triggers the run, polls until completion, returns final results.

const results = await ks.experiments.runAndWait(exp.id, {
  pollInterval: 2000,            // ms between polls (default 2000)
  timeout: 300_000,               // max ms to wait (default 5 min)
  scores: [/* client-side scorers */],   // optional
});

What this does: calls run(), then polls GET /v1/experiments/:id every pollInterval until done == total or timeout is reached. If scores are provided, runs each client-side scorer over each completed scenario and appends the results to the scenario's invariants list before returning.

Server 5xx responses are treated as transient and retried; 4xx errors abort.
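
This page doesn't pin down the shape of a client-side scorer, so treat the following as a sketch: it assumes a scorer is an object with a name and a score function that receives a completed scenario and returns a 0–1 score, which runAndWait then records alongside the scenario's invariants. Confirm the exact signature against the Scoring reference.

// Hypothetical client-side scorer; the { name, score } shape is an assumption.
const fewToolCalls = {
  name: "few_tool_calls",
  score: (scenario: { tool_calls: number }) => (scenario.tool_calls <= 10 ? 1 : 0),
};

const results = await ks.experiments.runAndWait(exp.id, {
  scores: [fewToolCalls],
});
// Each completed scenario's invariants list would gain an entry such as
// { name: "few_tool_calls", passed: true, score: 1.0 }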

experiments.get(id)

Reads the current results object. Includes partial results if the experiment is still running.

const results = await ks.experiments.get(exp.id);
console.log(`${results.passed}/${results.total_scenarios} passed`);

What this does: GET /v1/experiments/:id. Returns a RunResults object — see Results.
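
runAndWait hides the polling, but you can poll get() yourself when you want progress reporting instead of a blocking wait. A minimal sketch; treating "passed + failed + flaky + errors equals total_scenarios" as the completion condition is an assumption based on the counters in the Results shape below.

// Poll partial results and log progress until every scenario is accounted for.
async function watchProgress(experimentId: string) {
  while (true) {
    const r = await ks.experiments.get(experimentId);
    const done = r.passed + r.failed + r.flaky + r.errors;   // completion criterion is an assumption
    console.log(`progress: ${done}/${r.total_scenarios} (${r.passed} passed)`);
    if (done >= r.total_scenarios) return r;
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}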

experiments.list()

const all = await ks.experiments.list();   // Experiment[]

GET /v1/experiments. Returns every experiment scoped to the API key. Use it to find experiment IDs for comparison or a metrics fetch.
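
For example, to pick the newest experiment for a spec (say, to use as a comparison baseline), using only the fields shown in the create() return value:

// Pick the most recent experiment for a spec_id.
// Assumes created_at is an ISO timestamp, as in the create() response.
const experiments = await ks.experiments.list();
const baseline = experiments
  .filter((e) => e.spec_id === "fix-failing-test")
  .sort((a, b) => Date.parse(b.created_at) - Date.parse(a.created_at))[0];
console.log("baseline:", baseline?.id);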

experiments.compare(baselineId, candidateId)

Side-by-side diff of two experiments. Detects regressions in pass rate, cost, latency, and tool success rate.

const cmp = await ks.experiments.compare("exp-baseline", "exp-new");
// {
//   baseline_id: "exp-baseline",
//   candidate_id: "exp-new",
//   regressed: true,
//   regressions: ["pass_rate dropped from 0.95 to 0.78"],
//   metrics: [
//     { name: "pass_rate",            baseline: 0.95, candidate: 0.78, delta: -0.17, direction: "worse" },
//     { name: "p95_wall_ms",          baseline: 12000, candidate: 18000, delta: 6000, direction: "worse" },
//     { name: "mean_cost_per_run_usd", baseline: 0.30, candidate: 0.42, delta: 0.12, direction: "worse" },
//   ],
// }

What this does: POST /v1/experiments/compare. Server computes per-metric deltas. regressed: true if any metric got significantly worse (default thresholds: pass_rate -2%, p95_wall_ms +20%, cost +20%).

Use this in CI to gate merges:

ks eval compare exp-baseline exp-new --gate
# Exits non-zero if regressed

experiments.metrics(id)

Detailed metrics: per-tool breakdown, cost trend, pass-rate trend.

const metrics = await ks.experiments.metrics(exp.id);
// {
//   experiment_id: "exp-...",
//   summary: { total_runs, pass_rate, total_cost_usd, p95_wall_ms, ... },
//   tool_breakdown: { write_file: { count, mean_ms, error_rate } },
//   cost_trend: [{ run_id, cost_usd, ts }, ...],
//   pass_rate_trend: [{ run_id, pass_rate, ts }, ...],
// }

What this does: GET /v1/metrics/experiments/:id. Useful for dashboards.
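
One use beyond dashboards is flagging tools whose error rate looks high. A small sketch over the tool_breakdown map shown above; the 5% cutoff is arbitrary.

// Surface tools with elevated error rates from the per-tool breakdown.
const m = await ks.experiments.metrics(exp.id);
for (const [tool, stats] of Object.entries(m.tool_breakdown)) {
  if (stats.error_rate > 0.05) {
    console.warn(`${tool}: error_rate ${stats.error_rate} over ${stats.count} calls (mean ${stats.mean_ms} ms)`);
  }
}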

Results shape

runAndWait and get both return RunResults:

{
  ran_at: "2026-04-28T22:00:00Z",
  spec_id: "fix-failing-test",
  experiment_id: "exp-a1b2c3",
  seed: 12345,
  total_scenarios: 30,                   // 10 replicas × 3 matrix entries
  passed: 27,
  failed: 2,
  flaky: 1,                              // some replicas passed, some failed
  errors: 0,                             // sandbox boot or agent timeout
  metrics: {
    pass_rate: 0.9,
    mean_wall_ms: 14200,
    p95_wall_ms: 23000,
    mean_tool_calls: 8.3,
    mean_tokens: 12000,
    total_cost_usd: 5.40,
    mean_cost_per_run_usd: 0.18,
    tool_success_rate: 0.97,
    side_effect_violations: 0,
  },
  scenarios: [
    {
      scenario_id: "scenario-000",
      sandbox_id: "sb-abc",
      status: "pass" | "fail" | "flaky" | "error",
      parameters: { model: "claude-sonnet-4-5", locale: "en_US" },
      wall_ms: 12000,
      exit_code: 0,
      tool_calls: 7,
      composite_score: 1.0,
      agent_output: "...",
      agent_stderr: "",
      invariants: [
        { name: "tests_pass", passed: true, gate: true, weight: 1.0, score: 1.0 },
      ],
      forbidden_checks: [
        { rule: "secrets_in_logs", violated: false },
      ],
      trace_file: "/v1/traces/trace-id-...",
      reproducer: {
        spec_file: "specs/fix-failing-test.yaml",
        seed: 12345,
        scenario_id: "scenario-000",
        command: "ks eval run specs/fix-failing-test.yaml --seed 12345 --scenario scenario-000",
      },
      cost: { input_tokens: 4200, output_tokens: 1800, model: "claude-sonnet-4-5", estimated_usd: 0.043 },
    },
    // ...one entry per scenario
  ],
}

Verdicts

  • pass — composite ≥ pass_threshold, no gates failed, no forbidden violations.
  • fail — composite < threshold, OR a gate failed, OR a forbidden rule fired.
  • flaky — multi-replica scenario where replica_aggregation couldn't decide a clean pass/fail: some replicas passed, some failed, and the overall rate didn't meet min_pass_rate.
  • error — sandbox boot failed, the agent exceeded its timeout, or some other infrastructure issue occurred. Distinct from fail.
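
A common triage step after a run is to pull out the scenarios that didn't cleanly pass and print their reproducer commands for local reruns. This sketch uses only fields from the RunResults shape above.

// List reproducer commands for every failed or flaky scenario.
const results = await ks.experiments.get(exp.id);
for (const s of results.scenarios) {
  if (s.status === "fail" || s.status === "flaky") {
    console.log(`${s.scenario_id} [${s.status}]: ${s.reproducer.command}`);
  }
}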

Replicas & matrix

Defined in your spec:

parallelism:
  replicas: 10
  isolation: per_run                   # per_run = own sandbox + services | shared
  matrix:
    - { model: claude-sonnet-4-5, locale: en_US }
    - { model: claude-opus-4-5,   locale: en_US }
    - { model: gpt-4o,            locale: ja_JP }

10 replicas × 3 matrix entries = 30 scenarios. Each scenario runs in its own sandbox (because isolation: per_run).

shared reuses sandboxes across scenarios — faster, but state can leak between runs. Use per_run for any real measurement.

Matrix interpolation

Matrix values are available in your spec via {{ matrix.<key> }}:

parallelism:
  matrix:
    - { model: claude-sonnet-4-5 }
    - { model: gpt-4o }
 
agent:
  type: cli
  binary: my-agent
  args: ["--model", "{{ matrix.model }}"]

Each scenario runs with matrix.model substituted accordingly.

Run lifecycle

created → queued → running → completed
                           ↘ error (on infra failure)

The server enqueues scenario jobs to one of three priority lanes:

  • critical — gate invariant checks (must run fast)
  • normal — standard scenario execution
  • background — teardown, exports, cleanup
Workers pull from the queue, create sandboxes, run agents, score, report. Default worker pool size is 10; configurable on self-hosted.

Concurrency

Per-tenant concurrency limits apply: typically 5–10 sandboxes simultaneously on the free tier, more on Pro/Enterprise. If you hit the limit, scenarios queue and run as slots free up.

To control concurrency from your spec:

resources:
  concurrency_limit: 3                 # max 3 sandboxes for this spec at once

Useful when you don't want a particular spec's scenarios contending with everything else.

Cancelling

Cancel an experiment in flight:

POST /v1/experiments/:id/cancel

Server stops enqueueing new scenarios; running ones complete or hit their timeouts. Already-completed scenarios stay in the results.

(There's no SDK helper today; use direct HTTP.)
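
A minimal sketch using fetch; the KS_API_URL base URL and the bearer-token Authorization header are assumptions, so substitute whatever your deployment and API key setup use.

// No SDK helper yet, so call the cancel endpoint directly.
// KS_API_URL and the auth scheme below are assumptions; adjust for your setup.
await fetch(`${process.env.KS_API_URL}/v1/experiments/${exp.id}/cancel`, {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.KS_API_KEY}` },
});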

CI integration

The standard CI pattern:

# 1. Run the experiment
EXP_ID=$(ks eval run specs/regression.yaml --json | jq -r .experiment_id)
 
# 2. Compare to last green build
ks eval compare $LAST_GREEN_EXP_ID $EXP_ID --gate
# Exits non-zero if regressed → CI fails
 
# 3. If green, save EXP_ID as the new baseline
echo "$EXP_ID" > .keystone/last-green

Or in TypeScript:

const exp = await ks.experiments.create({ name: `ci-${process.env.GITHUB_SHA}`, spec_id: "regression" });
const results = await ks.experiments.runAndWait(exp.id);
if (results.passed < results.total_scenarios) {
  console.error("Experiment failed");
  process.exit(1);
}
 
if (process.env.LAST_GREEN_EXP_ID) {
  const cmp = await ks.experiments.compare(process.env.LAST_GREEN_EXP_ID, exp.id);
  if (cmp.regressed) {
    console.error("Regressed:", cmp.regressions);
    process.exit(1);
  }
}

Patterns

Smoke test (1 replica, fail-fast)

parallelism:
  replicas: 1
 
scoring:
  pass_threshold: 1.0

For "did this even work" CI checks. One run, all-or-nothing.

Flakiness measurement (50 replicas, percentile pass)

parallelism:
  replicas: 50
 
scoring:
  pass_threshold: 0.9
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.85         # 85% of replicas must pass

Captures noise. Use sparingly — 50× the cost.

Model bake-off (matrix without replicas)

parallelism:
  replicas: 1
  matrix:
    - { model: claude-sonnet-4-5 }
    - { model: claude-opus-4-5 }
    - { model: gpt-4o }
    - { model: gemini-2.5-pro }

One scenario per model. Compare in the dashboard or programmatically.
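
Programmatically, you can slice the results by the matrix parameter. A sketch that groups scenarios by parameters.model and computes a pass rate per model, using only fields from the RunResults shape:

// Per-model pass rates from a bake-off run.
const results = await ks.experiments.runAndWait(exp.id);
const byModel: Record<string, { passed: number; total: number }> = {};
for (const s of results.scenarios) {
  const model = s.parameters.model;
  byModel[model] ??= { passed: 0, total: 0 };
  byModel[model].total += 1;
  if (s.status === "pass") byModel[model].passed += 1;
}
for (const [model, { passed, total }] of Object.entries(byModel)) {
  console.log(`${model}: ${passed}/${total} passed`);
}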

Comprehensive (matrix + replicas)

parallelism:
  replicas: 10
  matrix:
    - { model: claude-sonnet-4-5, prompt_style: terse }
    - { model: claude-sonnet-4-5, prompt_style: verbose }
    - { model: gpt-4o,            prompt_style: terse }
    - { model: gpt-4o,            prompt_style: verbose }

40 scenarios, four prompt × model combos with 10 replicas each. The dashboard slices by parameter so you can see "verbose prompts pass more on Sonnet but cost 30% more."