# Determinism
Pin the clock, seed RNG, freeze DNS — the controls that make Keystone scenarios reproducible across runs.
Agent evaluations are noisy by default. The same prompt produces different output on different runs because the model is non-deterministic, the wall clock is moving, randomness is sampled from `/dev/urandom`, and DNS results change over time. Keystone's `determinism:` block clamps those inputs so a flaky failure can be reproduced, debugged, and fixed.
```yaml
determinism:
  clock: "2026-01-01T00:00:00Z"  # fixed timestamp visible to the sandbox
  seed: 42                       # PRNG seed for sampling, drift, agent
  network_latency: 0ms           # if non-zero, simulated extra latency
  dns: static                    # static | live
```

## Why determinism matters
Two scenarios:
**Without determinism.** Your agent fails one in ten runs. You re-run it five times — it passes every time. You look at the trace from the failure: it called the wrong API at 3:47:21am UTC. You can't reproduce it, because now it's 4:12:03am and the agent's code reads `Date.now()`. The bug stays, ships to prod, and breaks for users in the relevant time zone.

**With determinism.** The failed scenario has a reproducer block: `seed: 12345`, `clock: 2026-01-01T03:47:21Z`. You re-run with that seed and clock, the same failure happens deterministically, and you find the bug.
The reproducer is the unlock. Determinism lets every flaky failure become a test case.
## clock
```yaml
determinism:
  clock: "2026-01-01T00:00:00Z"
```

Pins the system clock to the specified ISO 8601 timestamp. Keystone uses libfaketime — LD_PRELOAD-injected into the agent process — to intercept every `gettimeofday`/`clock_gettime` call. Every language's "what time is it" function (`Date.now()`, `time.time()`, `time.Now()`) returns the frozen time.
Time advances during the run — it's not literally frozen, just offset. If the spec says 2026-01-01T00:00:00Z and the run takes 10 seconds, the agent sees 2026-01-01T00:00:10Z at the end.
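The offset behavior can be sketched in a few lines. This is an illustration of the semantics, not Keystone's implementation; the class name and the boot time are hypothetical:

```python
from datetime import datetime, timedelta, timezone

class OffsetClock:
    """Offset-based pinning: time still advances, but from a pinned start."""
    def __init__(self, pinned: datetime, real_boot: datetime):
        # constant offset, computed once at sandbox boot
        self.offset = pinned - real_boot

    def now(self, real_now: datetime) -> datetime:
        return real_now + self.offset

boot = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)  # hypothetical real boot time
clock = OffsetClock(datetime(2026, 1, 1, tzinfo=timezone.utc), boot)

# 10 seconds of real time later, the agent sees the pinned time plus 10s
print(clock.now(boot + timedelta(seconds=10)).isoformat())  # 2026-01-01T00:00:10+00:00
```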
### Why a fixed clock matters
- **Time-based logic.** Date math, expirations, "is this within the last 24 hours" checks all become deterministic.
- **Cache keys.** If your agent uses `Date.now()` as part of a cache key, the key is stable across runs.
- **Replay.** Logs and traces match between runs.
## seed
```yaml
determinism:
  seed: 42
```

Provides a deterministic RNG seed, exposed in two places:

- `KEYSTONE_SEED` env var — the agent reads it and seeds its own PRNG (`Math.random()`, Python's `random.seed()`, Go's `rand.Seed`).
- Fixture interpolation — `{{ determinism.seed }}` substitutes into fixture parameters like `drift`'s `seed:`.
The agent isn't forced to use the seed — it's the agent's job to read `KEYSTONE_SEED` and reseed. But once it does, sampling, shuffling, and any RNG-dependent behavior become reproducible.
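In Python, the reseeding side of this contract might look like the following. The env var name comes from the doc; the helper itself is a sketch:

```python
import os
import random

def seed_from_env() -> None:
    """Reseed this process's PRNG from KEYSTONE_SEED, if present."""
    seed = os.environ.get("KEYSTONE_SEED")
    if seed is not None:
        random.seed(int(seed))

os.environ["KEYSTONE_SEED"] = "42"  # normally set by the harness, not the agent
seed_from_env()
first = [random.random() for _ in range(3)]
seed_from_env()                     # reseeding replays the same sequence
second = [random.random() for _ in range(3)]
print(first == second)  # True
```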
## Reproducer commands
When a scenario fails, the result includes:
```json
{
  "reproducer": {
    "spec_file": "specs/scenario-1.yaml",
    "seed": 12345,
    "scenario_id": "scenario-000",
    "command": "ks eval run specs/scenario-1.yaml --seed 12345 --scenario scenario-000"
  }
}
```

Re-run with that exact command and you get the same failure.
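If you consume results programmatically, the `command` field appears to be assembled from the other reproducer fields. A sketch, with the values copied from the example above:

```python
# Reproducer fields from a failed-scenario result (values from the doc)
rep = {
    "spec_file": "specs/scenario-1.yaml",
    "seed": 12345,
    "scenario_id": "scenario-000",
    "command": "ks eval run specs/scenario-1.yaml --seed 12345 --scenario scenario-000",
}

# Rebuild the command from the individual fields and check it matches
cmd = f"ks eval run {rep['spec_file']} --seed {rep['seed']} --scenario {rep['scenario_id']}"
print(cmd == rep["command"])  # True
```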
## network_latency
```yaml
determinism:
  network_latency: 100ms
```

Simulates extra network latency by injecting a sleep into outbound HTTP requests. Useful for testing whether your agent handles slow APIs correctly (timeouts, retries, exponential backoff).
Pair it with the chaos-testing `network.shape:` block for jitter and packet loss.
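The injection amounts to a fixed sleep wrapped around each outbound request. A minimal sketch — the wrapper and the stand-in request function are hypothetical, not Keystone API:

```python
import time

def with_latency(request_fn, latency_s: float):
    """Wrap an outbound request with a fixed extra delay,
    mimicking what network_latency injection does."""
    def wrapped(*args, **kwargs):
        time.sleep(latency_s)  # simulated network latency
        return request_fn(*args, **kwargs)
    return wrapped

# stand-in for a real HTTP call
fetch = with_latency(lambda url: f"200 OK from {url}", 0.05)

t0 = time.monotonic()
body = fetch("https://example.test/api")
elapsed = time.monotonic() - t0
print(body)
```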
## dns
```yaml
determinism:
  dns: static  # use a frozen DNS snapshot
```

| Value | Effect |
|---|---|
| `live` (default) | Resolve hostnames live at request time |
| `static` | Use a snapshot taken at sandbox boot — same IPs for the whole run |

`static` matters when:
- The same hostname could resolve to different IPs (load balancer rotation).
- The agent caches IPs and you want that cache to be stable.
- You need bit-exact reproduction across runs.
Combined with `network.dns_overrides:`, you get a fully predictable resolution layer.
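The snapshot behavior can be illustrated in a few lines. The resolver class and the rotating lookup are hypothetical stand-ins for what `dns: static` does:

```python
import itertools

class StaticResolver:
    """Sketch of dns: static — snapshot answers at boot, serve them all run."""
    def __init__(self, live_lookup, hostnames):
        # one live lookup per hostname, performed once at sandbox boot
        self.snapshot = {h: live_lookup(h) for h in hostnames}

    def resolve(self, host: str) -> str:
        return self.snapshot[host]

# hypothetical load balancer that rotates IPs on every live lookup
rotation = itertools.cycle(["10.0.0.1", "10.0.0.2"])
resolver = StaticResolver(lambda host: next(rotation), ["api.internal"])

# same IP on every resolve, even though live lookups would rotate
print(resolver.resolve("api.internal") == resolver.resolve("api.internal"))  # True
```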
## Scoring & reproducer
Determinism feeds into scoring three ways:

- **Replicas.** With `parallelism.replicas: 10` and `seed: 42`, each replica gets `seed = base_seed + replica_index`, so replicas don't all run identically — variance is what `replicas` is for.
- **Matrix.** Each matrix entry uses `seed = base_seed + matrix_index` — different parameter combos with reproducible RNG.
- **Reproducer.** Every failed scenario records the exact seed it ran with, so you can re-run that exact scenario.
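The derivation rule above is simple enough to state directly. A sketch (the function name is mine; the formula is from the doc):

```python
def derived_seed(base_seed: int, index: int) -> int:
    # per the doc: seed = base_seed + replica_index (or matrix_index)
    return base_seed + index

# 10 replicas with base seed 42 each get a distinct, reproducible seed
replica_seeds = [derived_seed(42, i) for i in range(10)]
print(replica_seeds)  # [42, 43, 44, 45, 46, 47, 48, 49, 50, 51]
```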
## Patterns
### Tight time/seed pinning for regression tests
```yaml
determinism:
  clock: "2026-01-01T00:00:00Z"
  seed: 42
  dns: static
parallelism:
  replicas: 5
scoring:
  pass_threshold: 1.0
  replica_aggregation:
    strategy: all_must_pass  # if any replica fails, the scenario fails
```

Use this when you've already debugged the test and want a hard bar.
### Light pinning for "measure consistency"
```yaml
determinism:
  clock: "2026-01-01T00:00:00Z"
  # No seed — let RNG vary, measure how often it passes.
parallelism:
  replicas: 50
scoring:
  pass_threshold: 0.9
  replica_aggregation:
    strategy: percentage
    min_pass_rate: 0.85  # 85% of replicas must pass
```

Use this to measure agent flakiness — "passes 9 out of 10 times" is a real success criterion.
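The `percentage` strategy reduces to a pass-rate threshold over replica results. A sketch of the assumed semantics, not Keystone's implementation:

```python
def percentage_pass(replica_results: list[bool], min_pass_rate: float) -> bool:
    """Assumed semantics of the percentage aggregation strategy:
    the scenario passes if enough replicas passed."""
    rate = sum(replica_results) / len(replica_results)
    return rate >= min_pass_rate

results = [True] * 44 + [False] * 6   # 44 of 50 replicas passed -> rate 0.88
print(percentage_pass(results, 0.85))  # True: 0.88 >= 0.85
```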
### No determinism
For exploratory runs or when realism matters more than reproducibility:

```yaml
# determinism: omitted entirely
```

The clock is real, RNG comes from `/dev/urandom`, DNS is live. Failures lack a useful reproducer command, but the run is closer to production.
## What determinism does NOT control
- **The LLM.** Setting `temperature: 0` in your model parameters reduces non-determinism, but providers still have small drift. Keystone can't make `claude-sonnet-4-5` byte-identical across runs.
- **Wall-clock duration.** A scenario that takes 12 seconds today might take 14 tomorrow. Use `metrics.p95_wall_ms` for stability checks, not equality.
- **External APIs.** If you let the agent hit `api.openai.com`, every run sees fresh server-side state. Mock external APIs (see Network & Audit) for true determinism.