Spec

Determinism

Pin the clock, seed RNG, freeze DNS — the controls that make Keystone scenarios reproducible across runs.

Agent evaluations are noisy by default. The same prompt produces different output on different runs because: the model is non-deterministic, the wall clock is moving, randomness is sampled from /dev/urandom, and DNS results change over time. Keystone's determinism: block clamps those inputs so a flaky failure can be reproduced, debugged, and fixed.

determinism:
  clock: "2026-01-01T00:00:00Z"     # fixed timestamp visible to the sandbox
  seed: 42                          # PRNG seed for sampling, drift, agent
  network_latency: 0ms              # if non-zero, simulated extra latency
  dns: static                       # static | live

Why determinism matters

Two scenarios:

Without determinism. Your agent fails one in ten runs. You re-run it five times — passes every time. You look at the trace from the failure: it called the wrong API at 3:47:21am UTC. You can't reproduce because now it's 4:12:03am and the agent's code reads Date.now(). The bug stays, ships to prod, breaks for users in the relevant time zone.

With determinism. The failed scenario has a reproducer block: seed: 12345, clock: 2026-01-01T03:47:21Z. You re-run with that seed and clock, the same failure happens deterministically, you find the bug.

The reproducer is the unlock. Determinism lets every flaky failure become a test case.

clock

determinism:
  clock: "2026-01-01T00:00:00Z"

Pins the system clock to the specified ISO 8601 timestamp. Keystone uses libfaketime, LD_PRELOAD-injected into the agent process, to intercept every gettimeofday/clock_gettime call. Every language's "what time is it" function (Date.now(), time.time(), time.Now()) returns the frozen time.

Time advances during the run — it's not literally frozen, just offset. If the spec says 2026-01-01T00:00:00Z and the run takes 10 seconds, the agent sees 2026-01-01T00:00:10Z at the end.
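To see the offset behavior (a minimal sketch, assuming a Python runtime inside a sandbox pinned to the clock: value above):

import datetime
import time

# With clock: "2026-01-01T00:00:00Z", the first read lands at the pinned instant.
print(datetime.datetime.now(datetime.timezone.utc).isoformat())  # ~2026-01-01T00:00:00+00:00

time.sleep(10)

# Ten real seconds later, the clock has advanced ten seconds: offset, not frozen.
print(datetime.datetime.now(datetime.timezone.utc).isoformat())  # ~2026-01-01T00:00:10+00:00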

Why a fixed clock matters

  • Time-based logic. Date math, expirations, "is this within the last 24 hours" checks all become deterministic.
  • Cache keys. If your agent uses Date.now() as part of a cache key, the key is stable across runs.
  • Replay. Logs and traces match between runs.
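For example, a freshness check like the one in the first bullet stops flapping once the clock is pinned (an illustrative Python sketch; is_recent is a hypothetical helper):

import datetime

def is_recent(ts: datetime.datetime) -> bool:
    # "Within the last 24 hours" relative to the (pinned) current time.
    now = datetime.datetime.now(datetime.timezone.utc)
    return now - ts < datetime.timedelta(hours=24)

# Under clock: "2026-01-01T00:00:00Z" this prints True on every run; with a
# live clock the answer flips once real time passes 2026-01-01T12:00Z.
print(is_recent(datetime.datetime(2025, 12, 31, 12, 0, tzinfo=datetime.timezone.utc)))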

seed

determinism:
  seed: 42

Provides a deterministic RNG seed exposed in two places:

  1. KEYSTONE_SEED env var: the agent reads it and seeds its own PRNG (Python's random.seed(), Go's rand.Seed, or a seedable PRNG library in JavaScript, since Math.random() itself can't be reseeded).
  2. Fixture interpolation: {{ determinism.seed }} substitutes into fixture parameters such as the drift fixture's seed:.

The agent isn't forced to use the seed — it's the agent's job to read KEYSTONE_SEED and reseed. But once it does, sampling, shuffling, and any RNG-dependent behavior becomes reproducible.
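A minimal sketch of that agent-side step, assuming a Python agent:

import os
import random

# Keystone exposes the spec's seed as KEYSTONE_SEED; reseed our own PRNG from it.
seed = os.environ.get("KEYSTONE_SEED")
if seed is not None:
    random.seed(int(seed))

# From here on, sampling and shuffling repeat exactly for a given seed.
print(random.random())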

Reproducer commands

When a scenario fails, the result includes:

{
  "reproducer": {
    "spec_file": "specs/scenario-1.yaml",
    "seed": 12345,
    "scenario_id": "scenario-000",
    "command": "ks eval run specs/scenario-1.yaml --seed 12345 --scenario scenario-000"
  }
}

Re-run with that exact command and you get the same failure.

network_latency

determinism:
  network_latency: 100ms

Simulates extra network latency by injecting a sleep into outbound HTTP requests. Useful for testing whether your agent handles slow APIs correctly (timeouts, retries, exponential backoff).

Use it with the chaos-testing network.shape: block for jitter and packet loss, as in the sketch below.
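A combined sketch (the network.shape: field names are assumptions here; see the Network & Audit reference for the actual schema):

determinism:
  network_latency: 100ms    # deterministic base delay on outbound HTTP

network:
  shape:                    # chaos-testing block
    jitter: 20ms            # assumed field name
    packet_loss: 0.01       # assumed field name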

dns

determinism:
  dns: static            # use a frozen DNS snapshot
  • live (default): resolve hostnames live at request time.
  • static: use a snapshot taken at sandbox boot, so the same IPs hold for the whole run.

static matters when:

  • The same hostname could resolve to different IPs (load balancer rotation).
  • The agent caches IPs and you want that cache to be stable.
  • You need bit-exact reproduction across runs.

Combined with network.dns_overrides:, you get a fully predictable resolution layer.
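For example (a sketch; the dns_overrides: mapping shape, hostname, and address are illustrative assumptions):

determinism:
  dns: static                   # freeze everything else at sandbox boot

network:
  dns_overrides:                # pin specific names explicitly
    api.example.com: 10.0.0.7   # illustrative hostname and address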

Scoring & reproducer

Determinism feeds into scoring three ways:

  1. Replicas. With parallelism.replicas: 10 and seed: 42, each replica runs with its own derived seed (seed = base_seed + replica_index), so replicas don't all behave identically; cross-replica variance is exactly what replicas is for (see the sketch after this list).
  2. Matrix. Each matrix entry uses seed = base_seed + matrix_index. Different parameter combos with reproducible RNG.
  3. Reproducer. Every failed scenario records the exact seed it ran with so you can re-run that exact scenario.
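Concretely, with seed: 42 the per-run seeds derive like this (a worked illustration of the two rules above, in Python):

base_seed = 42

# Replica i runs with seed = base_seed + replica_index.
replica_seeds = [base_seed + i for i in range(10)]   # 42 through 51

# Matrix entry j runs with seed = base_seed + matrix_index.
matrix_seeds = [base_seed + j for j in range(4)]     # 42 through 45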

Patterns

Tight time/seed pinning for regression tests

determinism:
  clock: "2026-01-01T00:00:00Z"
  seed: 42
  dns: static
 
parallelism:
  replicas: 5
 
scoring:
  pass_threshold: 1.0
  replica_aggregation:
    strategy: all_must_pass     # if any replica fails, scenario fails

Use this when you've already debugged the test and want a hard bar.

Light pinning for "measure consistency"

determinism:
  clock: "2026-01-01T00:00:00Z"
  # No seed — let RNG vary, measure how often it passes.
 
parallelism:
  replicas: 50
 
scoring:
  pass_threshold: 0.9
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.85          # 85% of replicas must pass

Use this to measure agent flakiness — "passes 9 out of 10 times" is a real success criterion.

No determinism

For exploratory runs or when realism matters more than reproducibility:

# determinism: omitted entirely

The clock is real, RNG is from /dev/urandom, DNS is live. Failures lack a useful reproducer command but the run is closer to production.

What determinism does NOT control

  • The LLM. Setting temperature: 0 in your model parameters reduces non-determinism, but providers still have small drift. Keystone can't make claude-sonnet-4-5 byte-identical across runs.
  • Wall-clock duration. A scenario that takes 12 seconds today might take 14 tomorrow. Use metrics.p95_wall_ms for stability checks, not equality.
  • External APIs. If you let the agent hit api.openai.com, every run sees a fresh server-side state. Mock external APIs (see Network & Audit) for true determinism.