Experiments

Run scenarios at scale — replicas, matrices, comparisons, alerts. The full experiment SDK across TS, Python, and Go.

An experiment is a named batch of scenarios run against a spec. One experiment = one run; the same spec can have many experiments over its lifetime. Replicas measure flakiness; matrices compare configurations; comparison flags regressions.

This page is the full surface of ks.experiments.*.

experiments.create(opts)

Creates an experiment record. Doesn't run it yet.

const exp = await ks.experiments.create({
  name: "baseline-v1",
  spec_id: "fix-failing-test",
  specPath: "./specs/fix-failing-test.yaml",   // optional — auto-forwards .env secrets
  secrets: { ANTHROPIC_API_KEY: "..." },        // optional — explicit overrides
});
// Returns: { id, name, spec_id, status: "created", created_at }

What this does: sends POST /v1/experiments with the body shown. The server creates an experiment row and assigns an ID; no scenarios run yet. Pass specPath (TS/Python) or call CollectDeclaredSecretsFromFile (Go) to auto-forward declared secrets from your local environment — see Secrets.

experiments.run(id)

Triggers an asynchronous run. Returns immediately.

await ks.experiments.run(exp.id);
// Server enqueues scenarios; the call returns when the queue accepts them.

What this does: POST /v1/experiments/:id/run. The server expands the spec's parallelism block (replicas × matrix entries) into individual scenario jobs, enqueues them on the priority lanes, and returns. The actual scenarios execute asynchronously on worker nodes.

experiments.runAndWait(id, opts?)

The most common entry point. Triggers the run, polls until completion, returns final results.

const results = await ks.experiments.runAndWait(exp.id, {
  pollInterval: 2000,            // ms between polls (default 2000)
  timeout: 300_000,               // max ms to wait (default 5 min)
  scores: [/* client-side scorers */],   // optional
});

What this does: calls run(), then polls GET /v1/experiments/:id every pollInterval until done == total or timeout is reached. If scores are provided, runs each client-side scorer over each completed scenario and appends the results to the scenario's invariants list before returning.

Server 5xx responses are treated as transient and retried; 4xx errors abort.
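
This page doesn't pin down the shape of a client-side scorer, so treat the following as a sketch: it assumes a scorer is an object with a name and a score function that receives a completed scenario and returns a 0–1 score, which runAndWait then records alongside the scenario's invariants. Confirm the exact signature against the Scoring reference.

// Hypothetical client-side scorer; the { name, score } shape is an assumption.
const fewToolCalls = {
  name: "few_tool_calls",
  score: (scenario: { tool_calls: number }) => (scenario.tool_calls <= 10 ? 1 : 0),
};

const results = await ks.experiments.runAndWait(exp.id, {
  scores: [fewToolCalls],
});
// Each completed scenario's invariants list would gain an entry such as
// { name: "few_tool_calls", passed: true, score: 1.0 }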

experiments.get(id)

Reads the current results object. Includes partial results if the experiment is still running.

const results = await ks.experiments.get(exp.id);
console.log(`${results.passed}/${results.total_scenarios} passed`);

What this does: GET /v1/experiments/:id. Returns a RunResults object — see Results.
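
runAndWait hides the polling, but you can poll get() yourself when you want progress reporting instead of a blocking wait. A minimal sketch; treating "passed + failed + flaky + errors equals total_scenarios" as the completion condition is an assumption based on the counters in the Results shape below.

// Poll partial results and log progress until every scenario is accounted for.
async function watchProgress(experimentId: string) {
  while (true) {
    const r = await ks.experiments.get(experimentId);
    const done = r.passed + r.failed + r.flaky + r.errors;   // completion criterion is an assumption
    console.log(`progress: ${done}/${r.total_scenarios} (${r.passed} passed)`);
    if (done >= r.total_scenarios) return r;
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}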

experiments.list()

const all = await ks.experiments.list();   // Experiment[]

GET /v1/experiments. Returns every experiment scoped to the API key. Use it to find experiment IDs for comparison or a metrics fetch.
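
For example, to pick the newest experiment for a spec (say, to use as a comparison baseline), using only the fields shown in the create() return value:

// Pick the most recent experiment for a spec_id.
// Assumes created_at is an ISO timestamp, as in the create() response.
const experiments = await ks.experiments.list();
const baseline = experiments
  .filter((e) => e.spec_id === "fix-failing-test")
  .sort((a, b) => Date.parse(b.created_at) - Date.parse(a.created_at))[0];
console.log("baseline:", baseline?.id);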

experiments.compare(baselineId, candidateId)

Side-by-side diff of two experiments. Detects regressions in pass rate, cost, latency, and tool success rate.

const cmp = await ks.experiments.compare("exp-baseline", "exp-new");
// {
//   baseline_id: "exp-baseline",
//   candidate_id: "exp-new",
//   regressed: true,
//   regressions: ["pass_rate dropped from 0.95 to 0.78"],
//   metrics: [
//     { name: "pass_rate",            baseline: 0.95, candidate: 0.78, delta: -0.17, direction: "worse" },
//     { name: "p95_wall_ms",          baseline: 12000, candidate: 18000, delta: 6000, direction: "worse" },
//     { name: "mean_cost_per_run_usd", baseline: 0.30, candidate: 0.42, delta: 0.12, direction: "worse" },
//   ],
// }

What this does: POST /v1/experiments/compare. Server computes per-metric deltas. regressed: true if any metric got significantly worse (default thresholds: pass_rate -2%, p95_wall_ms +20%, cost +20%).

Use this in CI to gate merges:

ks eval compare exp-baseline exp-new --gate
# Exits non-zero if regressed

experiments.metrics(id)

Detailed metrics: per-tool breakdown, cost trend, pass-rate trend.

const metrics = await ks.experiments.metrics(exp.id);
// {
//   experiment_id: "exp-...",
//   summary: { total_runs, pass_rate, total_cost_usd, p95_wall_ms, ... },
//   tool_breakdown: { write_file: { count, mean_ms, error_rate } },
//   cost_trend: [{ run_id, cost_usd, ts }, ...],
//   pass_rate_trend: [{ run_id, pass_rate, ts }, ...],
// }

What this does: GET /v1/metrics/experiments/:id. Useful for dashboards.
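
One use beyond dashboards is flagging tools whose error rate looks high. A small sketch over the tool_breakdown map shown above; the 5% cutoff is arbitrary.

// Surface tools with elevated error rates from the per-tool breakdown.
const m = await ks.experiments.metrics(exp.id);
for (const [tool, stats] of Object.entries(m.tool_breakdown)) {
  if (stats.error_rate > 0.05) {
    console.warn(`${tool}: error_rate ${stats.error_rate} over ${stats.count} calls (mean ${stats.mean_ms} ms)`);
  }
}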

Results shape

runAndWait and get both return RunResults:

{
  ran_at: "2026-04-28T22:00:00Z",
  spec_id: "fix-failing-test",
  experiment_id: "exp-a1b2c3",
  seed: 12345,
  total_scenarios: 30,                   // 10 replicas × 3 matrix entries
  passed: 27,
  failed: 2,
  flaky: 1,                              // some replicas passed, some failed
  errors: 0,                             // sandbox boot or agent timeout
  metrics: {
    pass_rate: 0.9,
    mean_wall_ms: 14200,
    p95_wall_ms: 23000,
    mean_tool_calls: 8.3,
    mean_tokens: 12000,
    total_cost_usd: 5.40,
    mean_cost_per_run_usd: 0.18,
    tool_success_rate: 0.97,
    side_effect_violations: 0,
  },
  scenarios: [
    {
      scenario_id: "scenario-000",
      sandbox_id: "sb-abc",
      status: "pass" | "fail" | "flaky" | "error",
      parameters: { model: "claude-sonnet-4-5", locale: "en_US" },
      wall_ms: 12000,
      exit_code: 0,
      tool_calls: 7,
      composite_score: 1.0,
      agent_output: "...",
      agent_stderr: "",
      invariants: [
        { name: "tests_pass", passed: true, gate: true, weight: 1.0, score: 1.0 },
      ],
      forbidden_checks: [
        { rule: "secrets_in_logs", violated: false },
      ],
      trace_file: "/v1/traces/trace-id-...",
      reproducer: {
        spec_file: "specs/fix-failing-test.yaml",
        seed: 12345,
        scenario_id: "scenario-000",
        command: "ks eval run specs/fix-failing-test.yaml --seed 12345 --scenario scenario-000",
      },
      cost: { input_tokens: 4200, output_tokens: 1800, model: "claude-sonnet-4-5", estimated_usd: 0.043 },
    },
    // ...one entry per scenario
  ],
}

Verdicts

  • pass — composite ≥ pass_threshold, no gates failed, no forbidden violations.
  • fail — composite < threshold, OR a gate failed, OR a forbidden rule fired.
  • flaky — multi-replica scenario where replica_aggregation couldn't decide a clean pass/fail: some replicas passed, some failed, and the overall rate didn't meet min_pass_rate.
  • error — sandbox boot failed, the agent exceeded its timeout, or some other infrastructure issue occurred. Distinct from fail.
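
A common triage step after a run is to pull out the scenarios that didn't cleanly pass and print their reproducer commands for local reruns. This sketch uses only fields from the RunResults shape above.

// List reproducer commands for every failed or flaky scenario.
const results = await ks.experiments.get(exp.id);
for (const s of results.scenarios) {
  if (s.status === "fail" || s.status === "flaky") {
    console.log(`${s.scenario_id} [${s.status}]: ${s.reproducer.command}`);
  }
}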

Replicas & matrix

Defined in your spec:

parallelism:
  replicas: 10
  isolation: per_run                   # per_run = own sandbox + services | shared
  matrix:
    - { model: claude-sonnet-4-5, locale: en_US }
    - { model: claude-opus-4-5,   locale: en_US }
    - { model: gpt-4o,            locale: ja_JP }

10 replicas × 3 matrix entries = 30 scenarios. Each scenario runs in its own sandbox (because isolation: per_run).

shared reuses sandboxes across scenarios — faster, but state can leak between runs. Use per_run for any real measurement.

Matrix interpolation

Matrix values are available in your spec via {{ matrix.<key> }}:

parallelism:
  matrix:
    - { model: claude-sonnet-4-5 }
    - { model: gpt-4o }
 
agent:
  type: cli
  binary: my-agent
  args: ["--model", "{{ matrix.model }}"]

Each scenario runs with matrix.model substituted accordingly.

Run lifecycle

created → queued → running → completed
                           ↘ error (on infra failure)

The server enqueues scenario jobs to one of three priority lanes:

  • critical — gate invariant checks (must run fast)
  • normal — standard scenario execution
  • background — teardown, exports, cleanup
Workers pull from the queue, create sandboxes, run agents, score, report. Default worker pool size is 10; configurable on self-hosted.

Concurrency

Per-tenant concurrency limits apply: typically 5–10 sandboxes simultaneously on the free tier, more on Pro/Enterprise. If you hit the limit, scenarios queue and run as slots free up.

To control concurrency from your spec:

resources:
  concurrency_limit: 3                 # max 3 sandboxes for this spec at once

Useful when you don't want a particular spec's scenarios contending with everything else.

Cancelling

Cancel an experiment in flight:

POST /v1/experiments/:id/cancel

Server stops enqueueing new scenarios; running ones complete or hit their timeouts. Already-completed scenarios stay in the results.

(There's no SDK helper today; use direct HTTP.)
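
A minimal sketch using fetch; the KS_API_URL base URL and the bearer-token Authorization header are assumptions, so substitute whatever your deployment and API key setup use.

// No SDK helper yet, so call the cancel endpoint directly.
// KS_API_URL and the auth scheme below are assumptions; adjust for your setup.
await fetch(`${process.env.KS_API_URL}/v1/experiments/${exp.id}/cancel`, {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.KS_API_KEY}` },
});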

CI integration

The standard CI pattern:

# 1. Run the experiment
EXP_ID=$(ks eval run specs/regression.yaml --json | jq -r .experiment_id)
 
# 2. Compare to last green build
ks eval compare $LAST_GREEN_EXP_ID $EXP_ID --gate
# Exits non-zero if regressed → CI fails
 
# 3. If green, save EXP_ID as the new baseline
echo "$EXP_ID" > .keystone/last-green

Or in TypeScript:

const exp = await ks.experiments.create({ name: `ci-${process.env.GITHUB_SHA}`, spec_id: "regression" });
const results = await ks.experiments.runAndWait(exp.id);
if (results.passed < results.total_scenarios) {
  console.error("Experiment failed");
  process.exit(1);
}
 
if (process.env.LAST_GREEN_EXP_ID) {
  const cmp = await ks.experiments.compare(process.env.LAST_GREEN_EXP_ID, exp.id);
  if (cmp.regressed) {
    console.error("Regressed:", cmp.regressions);
    process.exit(1);
  }
}

Patterns

Smoke test (1 replica, fail-fast)

parallelism:
  replicas: 1
 
scoring:
  pass_threshold: 1.0

For "did this even work" CI checks. One run, all-or-nothing.

Flakiness measurement (50 replicas, percentile pass)

parallelism:
  replicas: 50
 
scoring:
  pass_threshold: 0.9
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.85         # 85% of replicas must pass

Captures noise. Use sparingly — 50× the cost.

Model bake-off (matrix without replicas)

parallelism:
  replicas: 1
  matrix:
    - { model: claude-sonnet-4-5 }
    - { model: claude-opus-4-5 }
    - { model: gpt-4o }
    - { model: gemini-2.5-pro }

One scenario per model. Compare in the dashboard or programmatically.
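
Programmatically, you can slice the results by the matrix parameter. A sketch that groups scenarios by parameters.model and computes a pass rate per model, using only fields from the RunResults shape:

// Per-model pass rates from a bake-off run.
const results = await ks.experiments.runAndWait(exp.id);
const byModel: Record<string, { passed: number; total: number }> = {};
for (const s of results.scenarios) {
  const model = s.parameters.model;
  byModel[model] ??= { passed: 0, total: 0 };
  byModel[model].total += 1;
  if (s.status === "pass") byModel[model].passed += 1;
}
for (const [model, { passed, total }] of Object.entries(byModel)) {
  console.log(`${model}: ${passed}/${total} passed`);
}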

Comprehensive (matrix + replicas)

parallelism:
  replicas: 10
  matrix:
    - { model: claude-sonnet-4-5, prompt_style: terse }
    - { model: claude-sonnet-4-5, prompt_style: verbose }
    - { model: gpt-4o,            prompt_style: terse }
    - { model: gpt-4o,            prompt_style: verbose }

40 scenarios, four prompt × model combos with 10 replicas each. The dashboard slices by parameter so you can see "verbose prompts pass more on Sonnet but cost 30% more."