Network & Audit
Egress allowlists, DNS overrides for redirecting real APIs to mocks, and structured capture of every action the agent takes.
Two of the most common gotchas with agent evals are: (1) the agent sneakily makes a real API call you didn't expect, and (2) you can't tell what it did when something fails. Keystone solves both with network policy and audit capture — declared in the spec, enforced at the runtime, surfaced as structured data.
Network policy
network:
  egress:
    default: deny # block all outbound
    allow:
      - registry.npmjs.org
      - github.com
      - "*.services.internal"
  ingress:
    default: deny
    allow:
      - from: host
        to_port: 3000
  dns_overrides:
    api.stripe.com: stripe-mock.services.internal
    smtp.sendgrid.net: smtp.services.internal
Egress
Controls outbound traffic from the sandbox. Two settings combine:
| Field | Type | Meaning |
|---|---|---|
| default | "deny" or "allow" | What happens to traffic not matching allow: |
| allow | list of strings | Hostnames or globs that bypass the default |
Default-deny is the right choice for almost every eval. Real-world agents that need to install packages, clone repos, or hit specific APIs declare those hosts in allow:.
Wildcards are supported: *.example.com matches api.example.com and cdn.example.com. Service-network hosts (*.services.internal) are conventionally allowed because they refer to your own backing containers.
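The allow/default decision described above can be sketched in Python. This is an illustration only, not Keystone's implementation: `egress_allowed` is a hypothetical helper, and it assumes shell-style glob semantics from the standard library's `fnmatch`. Under that assumption `*` also matches dots, so `*.example.com` would match `a.b.example.com`; whether Keystone behaves the same way is not stated here.

```python
from fnmatch import fnmatchcase

def egress_allowed(host: str, default: str, allow: list[str]) -> bool:
    """Return True if outbound traffic to `host` would be permitted.

    Hypothetical helper: models `allow` entries as shell-style globs.
    """
    host = host.lower()
    if any(fnmatchcase(host, pattern.lower()) for pattern in allow):
        return True
    # No allow entry matched, so the default decides.
    return default == "allow"

allow = ["registry.npmjs.org", "github.com", "*.services.internal"]
print(egress_allowed("github.com", "deny", allow))                     # True
print(egress_allowed("stripe-mock.services.internal", "deny", allow))  # True
print(egress_allowed("api.stripe.com", "deny", allow))                 # False
```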
Ingress
By default, nothing reaches the sandbox from the host. To allow it (e.g., for a debug terminal hitting a dev server inside the sandbox):
network:
  ingress:
    default: deny
    allow:
      - from: host
        to_port: 3000 # only port 3000 reachable from host
from: host is the only supported source today.
DNS overrides
The most powerful pattern in the network block. Redirect real-world hostnames to in-sandbox mocks:
network:
  dns_overrides:
    api.stripe.com: stripe-mock.services.internal
    smtp.sendgrid.net: smtp.services.internal
    api.openai.com: openai-mock.services.internal
Now every time the agent calls https://api.stripe.com/... it actually hits your stripe-mock container. Combined with record: true on the mock service and http_mock_assertions invariants, this is how you verify the agent makes the right Stripe calls without ever touching real Stripe.
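One way to sanity-check a spec before running it is to confirm that every override target is covered by the egress allowlist. The sketch below assumes that override targets are still subject to egress rules, which this page implies by allowing `*.services.internal` but does not state outright. `unreachable_overrides` is a hypothetical helper, and the second override target is deliberately misconfigured for the example.

```python
from fnmatch import fnmatchcase

def unreachable_overrides(overrides: dict[str, str], allow: list[str]) -> list[str]:
    """List real hostnames whose mock target matches no egress allow entry."""
    return [
        real_host
        for real_host, target in overrides.items()
        if not any(fnmatchcase(target, pattern) for pattern in allow)
    ]

overrides = {
    "api.stripe.com": "stripe-mock.services.internal",
    "api.openai.com": "openai-mock.example.test",  # hypothetical bad target
}
allow = ["github.com", "*.services.internal"]
print(unreachable_overrides(overrides, allow))  # ['api.openai.com']
```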
Audit
Audit captures everything the agent does inside the sandbox as structured events. Without it, your invariants can only check end-state — with it, they can check trajectory (forbidden rules).
audit:
  db_writes: true # log every INSERT/UPDATE/DELETE
  http_calls: true # log every outbound HTTP request
  process_spawns: true # log every child process
  stdout_capture: true # capture stdout for secret detection
  file_system:
    watch: ["src/", "config/"]
    track: [writes, reads, deletes]
Each block toggles a different capture source.
db_writes
When true, Keystone proxies database connections and intercepts every INSERT/UPDATE/DELETE. The audit log captures:
- Service name (db, cache, etc.)
- Operation (insert, update, delete)
- Table name
- Row count
- Timestamp
This is what makes forbidden.db_writes_outside: [users, orders] work — the audit log lists every table written, and Keystone fails the scenario if any aren't allowed.
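The check can be approximated offline against the audit log. The sketch below is not Keystone's enforcement code: `tables_written_outside` is a hypothetical helper that relies only on the db_write event shape documented under "The audit log", and the `audit_trail` table is made up for the example.

```python
import json

def tables_written_outside(audit_lines, allowed_tables):
    """Return the sorted tables written that are not in `allowed_tables`."""
    allowed = set(allowed_tables)
    offending = set()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        event = json.loads(line)
        if event.get("type") == "db_write":
            table = event["details"]["table"]
            if table not in allowed:
                offending.add(table)
    return sorted(offending)

sample = [
    '{"type": "db_write", "details": {"service": "db", "table": "orders", "operation": "insert", "row_count": 1}}',
    '{"type": "db_write", "details": {"service": "db", "table": "audit_trail", "operation": "delete", "row_count": 3}}',
    '{"type": "process_spawn", "details": {"command": "npm test", "exit_code": 0, "duration_ms": 12340}}',
]
print(tables_written_outside(sample, ["users", "orders"]))  # ['audit_trail']
```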
http_calls
When true, every outbound HTTP request from the agent (and from service containers) is logged with:
- Method, host, path
- Status code, response time
- Headers (sensitive headers redacted)
- Request body size, response body size
Pairs with forbidden.http_except: [stripe-mock, smtp] to require that all HTTP traffic goes through the allowed services only.
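A rough offline equivalent of that rule, written against parsed audit events, might look like the following. It assumes the `host` field carries the service name (as in the sample http_call event on this page) and uses exact matching; `disallowed_http_hosts` is a hypothetical helper and `api.example.com` is a placeholder host.

```python
def disallowed_http_hosts(events, allowed_services):
    """Return hosts of http_call events not in `allowed_services`, first-seen order."""
    allowed = set(allowed_services)
    seen = []
    for event in events:
        if event.get("type") != "http_call":
            continue
        host = event["details"]["host"]
        if host not in allowed and host not in seen:
            seen.append(host)
    return seen

events = [
    {"type": "http_call", "details": {"method": "POST", "host": "stripe-mock", "path": "/v1/charge", "status": 200}},
    {"type": "http_call", "details": {"method": "GET", "host": "api.example.com", "path": "/", "status": 200}},
]
print(disallowed_http_hosts(events, ["stripe-mock", "smtp"]))  # ['api.example.com']
```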
process_spawns
Logs every child process the agent runs (via exec, popen, or shell). Captures:
- Command string
- Exit code
- Duration
- Output size (truncated to 4KB if stdout_capture: true)
stdout_capture
When true, the agent's stdout is captured (truncated to 4KB per chunk) and attached to process_spawn events. Required for forbidden.secrets_in_logs: deny — Keystone scans this stream for any secret value.
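How Keystone scans the stream is not documented here. As a minimal sketch of the idea, a naive substring scan over the captured chunks could look like this; `leaked_secret_names` and the secret values are hypothetical.

```python
def leaked_secret_names(stdout_chunks, secrets):
    """Return names of secrets whose values appear in the captured stdout.

    Chunks are joined before scanning so a value split across two
    chunks is still found.
    """
    text = "".join(stdout_chunks)
    return sorted(name for name, value in secrets.items() if value and value in text)

chunks = ["Connecting with key sk_test_", "12345...\n", "done\n"]
secrets = {"STRIPE_KEY": "sk_test_12345", "DB_PASSWORD": "hunter2"}
print(leaked_secret_names(chunks, secrets))  # ['STRIPE_KEY']
```

Plain substring matching misses encoded or partially masked values, so treat this as a lower bound on what a real scanner should catch.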
file_system
Runs an inotify/fsevents watcher on the named directories:
audit:
  file_system:
    watch: ["src/", "config/", "/tmp"]
    track: [writes, reads, deletes]
| track value | What it logs |
|---|---|
| writes | Every file written or modified |
| reads | Every file read |
| deletes | Every file deleted |
This is what makes forbidden.file_writes_outside: [src/, output/] enforceable.
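The path comparison behind such a rule can be sketched as follows. This is an assumption about how a prefix check might work, not Keystone's implementation: `writes_outside` is a hypothetical helper that takes workspace-relative paths directly, because the detail fields of file write events are not shown on this page.

```python
import posixpath

def writes_outside(paths, allowed_dirs):
    """Return written paths that fall outside every allowed directory.

    Paths are normalized first, so "src/../secrets.txt" is not treated
    as being inside "src/".
    """
    allowed = [posixpath.normpath(d) for d in allowed_dirs]
    offending = []
    for path in paths:
        norm = posixpath.normpath(path)
        inside = any(norm == d or norm.startswith(d + "/") for d in allowed)
        if not inside:
            offending.append(path)
    return offending

written = ["src/app.ts", "output/report.json", "src/../secrets.txt", "config/db.yml"]
print(writes_outside(written, ["src/", "output/"]))
# ['src/../secrets.txt', 'config/db.yml']
```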
The audit log
The full event stream is written to <workspace>/.keystone/audit.jsonl — one event per line. You can stream it during a run via SSE or read it after via the export API. Every event has:
{
  "ts": "2026-04-28T22:00:00.123Z",
  "sandbox_id": "sb-abc123",
  "type": "file_write" | "db_write" | "http_call" | "process_spawn" | "stdout" | "warning",
  "details": { ...type-specific... }
}
For process_spawn:
{
  "ts": "...",
  "type": "process_spawn",
  "details": {
    "command": "npm test",
    "exit_code": 0,
    "duration_ms": 12340
  }
}
For db_write:
{
  "ts": "...",
  "type": "db_write",
  "details": {
    "service": "db",
    "table": "orders",
    "operation": "insert",
    "row_count": 1
  }
}
For http_call:
{
  "ts": "...",
  "type": "http_call",
  "details": {
    "method": "POST",
    "host": "stripe-mock",
    "path": "/v1/charge",
    "status": 200,
    "duration_ms": 45,
    "request_bytes": 312,
    "response_bytes": 98
  }
}
Live event stream (SSE)
Stream events in real-time via Server-Sent Events:
GET /v1/sandboxes/:id/events
Accept: text/event-stream
Useful for dashboards and progress indicators. Includes status changes, fixture application progress, and audit events.
const es = new EventSource(`${baseUrl}/v1/sandboxes/sb-abc/events`);
es.onmessage = (e) => {
  const event = JSON.parse(e.data);
  console.log(`[${event.event_type}]`, event.data);
};
Teardown export
To persist the audit log after the sandbox is destroyed:
teardown:
  always_run: true
  export:
    - type: audit_log
      to: "results/{{ run_id }}/audit.jsonl"
    - type: db_dump
      service: db
      to: "results/{{ run_id }}/db.sql"
    - type: snapshot
      to: "results/{{ run_id }}/final-state/"
Templates: {{ run_id }} and {{ scenario_id }} are interpolated. The destination is interpreted relative to whatever artifact store the server is configured for (S3, local FS).
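Once the audit log has been exported, it can be inspected with a few lines of Python. The sketch below tallies events by type; it relies only on the one-event-per-line JSONL layout and the `type` field described above, and `count_event_types` is a hypothetical helper rather than part of any Keystone SDK.

```python
import json
from collections import Counter

def count_event_types(jsonl_text: str) -> Counter:
    """Count audit events by their `type` field, skipping blank lines."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line)["type"]] += 1
    return counts

sample = "\n".join([
    '{"ts": "2026-04-28T22:00:00.123Z", "sandbox_id": "sb-abc123", "type": "process_spawn", "details": {"command": "npm test", "exit_code": 0, "duration_ms": 12340}}',
    '{"ts": "2026-04-28T22:00:01.000Z", "sandbox_id": "sb-abc123", "type": "db_write", "details": {"service": "db", "table": "orders", "operation": "insert", "row_count": 1}}',
    '{"ts": "2026-04-28T22:00:02.000Z", "sandbox_id": "sb-abc123", "type": "db_write", "details": {"service": "db", "table": "orders", "operation": "insert", "row_count": 1}}',
])
print(count_event_types(sample))  # Counter({'db_write': 2, 'process_spawn': 1})
```

For a real run, pass the contents of the exported file instead of `sample`, for example the text read from wherever your artifact store placed audit.jsonl.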
Patterns
Real API blocked, mock substituted
network:
  egress:
    default: deny
    allow:
      - github.com # for cloning
      - registry.npmjs.org # for npm install
      - "*.services.internal" # for our mocks
  dns_overrides:
    api.stripe.com: stripe-mock.services.internal
services:
  - name: stripe-mock
    type: http_mock
    record: true
    routes:
      - method: POST
        path: /v1/charge
        response: '{"id":"ch_test","status":"succeeded"}'
The agent's existing code calls https://api.stripe.com/v1/charge, but inside the sandbox that resolves to stripe-mock, which records the call. An invariant later asserts request_count == 1.
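The request_count invariant itself is evaluated from the mock's recording. As a cross-check, a similar count can be derived from audit events when http_calls: true is set. The sketch below assumes the `host` field holds the service name, as in the sample http_call event; `count_http_calls` is a hypothetical helper.

```python
def count_http_calls(events, host, method, path):
    """Count http_call events matching host, method, and path exactly."""
    return sum(
        1
        for e in events
        if e.get("type") == "http_call"
        and e["details"].get("host") == host
        and e["details"].get("method") == method
        and e["details"].get("path") == path
    )

events = [
    {"type": "http_call", "details": {"method": "POST", "host": "stripe-mock", "path": "/v1/charge", "status": 200}},
    {"type": "process_spawn", "details": {"command": "npm test", "exit_code": 0, "duration_ms": 12340}},
]
print(count_http_calls(events, "stripe-mock", "POST", "/v1/charge"))  # 1
```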
Disallow real LLM providers in offline-mode tests
network:
  egress:
    default: deny
    allow:
      - "*.services.internal" # services only
      - keystone.polarity.so # for trace ingestion
    # No api.anthropic.com / api.openai.com: the agent must use a mock LLM
If the agent tries to call a real provider, the request fails. This forces you to wire up a mock LLM service for offline-mode evals.
Auditing-only (no enforcement)
audit:
  db_writes: true
  http_calls: true
  process_spawns: true
  stdout_capture: true
# No forbidden block; just observe.
For exploratory runs where you want to see what the agent does without failing on it.
Aggressive locking
network:
  egress:
    default: deny
    allow: [] # no outbound at all
audit:
  db_writes: true
  http_calls: true
  process_spawns: true
  stdout_capture: true
  file_system:
    watch: [/] # everywhere
    track: [writes, reads, deletes]
forbidden:
  db_writes_outside: [allowed_table]
  http_except: [] # no HTTP at all
  secrets_in_logs: deny
  file_writes_outside: [output/]
The strictest possible policy, useful for security-sensitive scenarios where you want to verify the agent operates entirely inside its sandbox.
Performance
Audit capture is cheap: under 5% overhead in typical runs. Costs by capture source:
- file_system.watch on broad directories (/) costs more than narrow ones (src/).
- db_writes adds ~1ms per query (proxy intercept).
- http_calls adds ~0.5ms per call.
- stdout_capture is essentially free.
If you don't need a particular capture, leave it disabled. The defaults are all false.
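To get a feel for what the per-operation figures add up to, the arithmetic can be written out. This is a back-of-the-envelope sketch using the approximate numbers quoted above; the workload (2000 queries, 400 HTTP calls) is hypothetical, and file-system watching is left out because no per-operation figure is given for it.

```python
def estimated_proxy_overhead_ms(db_queries: int, http_calls: int) -> float:
    """Rough added latency from the approximate per-operation costs on this page."""
    db_ms_per_query = 1.0   # "~1ms per query"
    http_ms_per_call = 0.5  # "~0.5ms per call"
    return db_queries * db_ms_per_query + http_calls * http_ms_per_call

print(estimated_proxy_overhead_ms(db_queries=2000, http_calls=400))  # 2200.0
```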