Keystone

Setup Guide

The complete procedure for wiring Keystone into an existing agent — written for humans and for AI coding assistants. Follow this page top to bottom.

AI agents: use https://docs.paragon.run/llms-full.txt as your source of truth — page-level WebFetch returns summarizer excerpts that drop example code detail. If a user pasted the docs URL at you and said "set this up," execute every step below on their repo — do not paraphrase the docs back at them.

You have an existing agent and you want to evaluate it with Keystone. This page is a strict procedure — follow it in order and you'll end with a working spec that passes on the first run.


Step 0 — Install the SDK and set your API key

npm install @polarityinc/polarity-keystone
export KEYSTONE_API_KEY=ks_live_...

Generate the key at app.paragon.run/app/keystone/settings → API Keys.


Step 1 — Inspect the agent code

Before you write any spec, you need three facts about the agent:

  1. How it starts: the runtime and the entrypoint command (you reuse both in the Step 4 upload).
  2. Every environment variable it reads: API keys, connection strings, passwords.
  3. Every external service it depends on: databases, caches, queues, third-party APIs.

Write these three facts down before proceeding. The rest of the guide is mechanical — fill in a template with what you found.
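Fact 2 rarely needs to be collected by hand. For a Node/TS agent, a rough sketch that scans source text for process.env reads (an assumption for illustration: the agent reads its configuration via process.env; adapt the pattern for other runtimes):

```typescript
// Collect every process.env.<NAME> reference in a source string.
function envVarsIn(source: string): string[] {
  const found = new Set<string>();
  for (const m of source.matchAll(/process\.env\.([A-Z][A-Z0-9_]*)/g)) {
    found.add(m[1]);
  }
  return [...found].sort();
}

// Hypothetical sample; in practice, read each file under src/ instead.
const sample = `
const key = process.env.XAI_API_KEY;
const pw  = process.env.DB_PASSWORD;
`;
console.log(envVarsIn(sample)); // → [ 'DB_PASSWORD', 'XAI_API_KEY' ]
```

Run it over each source file; the union of the results is your secrets: list for Step 3.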


Step 2 — Drop your provider keys into the Dashboard

For the secrets the team shares (XAI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.), go to app.paragon.run/app/keystone/settings → Secrets tab and paste the values. They're AES-256-GCM encrypted at rest and shared with your teammates on the same billing account.

Use the Dashboard for anything a teammate will also need. Use local .env (with source: env in the spec) for your personal dev-only keys.


Step 3 — Scaffold the spec from the template

Copy this template to specs/scenario-1.yaml and fill in the marked sections. Everything else stays as-is.

version: 1
id: "<FILL IN: short-kebab-id, e.g. email-agent-send-flow>"
description: "<FILL IN: one sentence, what you're testing>"
 
# Which agent snapshot to run. You upload this in Step 4.
agent:
  type: snapshot
  snapshot: "<FILL IN: name you picked for your agent>"
 
# ONE ENTRY PER env var from Step 1, Fact 2.
# For team-shared keys → source: dashboard
# For personal keys in your local .env → source: env
secrets:
  - name: XAI_API_KEY
    source: dashboard
  - name: DB_PASSWORD
    source: env              # from $DB_PASSWORD on your machine
  # ... repeat for every env var the agent needs
 
# ONE ENTRY PER service from Step 1, Fact 3.
services:
  - name: db
    image: postgres:16
    env:
      POSTGRES_PASSWORD: "{{ secrets.DB_PASSWORD }}"
      POSTGRES_DB: "<FILL IN: database name>"
    ports: [5432]
    wait_for: "pg_isready -h localhost"
 
# If your agent reads seeded rows, drop seed.sql at repo root.
fixtures:
  - type: sql
    service: db
    path: seed.sql
 
# Restrict outbound. Keystone + the LLM proxy are always allowed implicitly.
# Add each third-party domain the agent *actually* calls.
network:
  egress:
    default: deny
    allow:
      - api.x.ai                # or api.openai.com, api.anthropic.com, etc.
 
task:
  prompt: "<FILL IN: what you want the agent to do, verbatim prompt>"
 
# AT LEAST ONE invariant with gate: true. Choose from:
#   command_exit        → run a shell command in /workspace, expect exit 0
#   file_exists         → a specific output file exists
#   sql                 → a SQL query against a service returns expected value
#   http_mock_assertions → the agent hit a mocked endpoint N times
#   llm_as_judge        → semantic check (billable; use sparingly)
invariants:
  did_the_thing:
    description: "<FILL IN: what must be true after the agent runs>"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "<FILL IN: shell command that succeeds iff the agent did the thing>"
 
scoring:
  pass_threshold: 0.85

Do not use path: stdout.log as an invariant check path. Stdout is not persisted as a file in the sandbox. Use command_exit against actual workspace files the agent writes, or llm_as_judge with input_from: workspace.

Do not put real API keys behind from: "static://...". Spec files get committed. Use source: dashboard (encrypted) or source: env (local .env only).
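For reference, here is a hypothetical filled-in invariants block that follows both warnings: it checks a real workspace file with command_exit instead of stdout, and keeps no secrets in the spec. The filename report.md is an assumption for illustration only.

```yaml
# Hypothetical: the task prompt asked the agent to write /workspace/report.md.
invariants:
  wrote_report:
    description: "report.md exists in /workspace and is non-empty"
    weight: 1.0
    gate: true
    check:
      type: command_exit
      command: "test -s /workspace/report.md"   # exit 0 iff the file exists and is non-empty
```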


Step 4 — Upload the agent and the spec

import { Keystone } from "@polarityinc/polarity-keystone";
import "dotenv/config";
import { readFileSync } from "node:fs";
 
const ks = new Keystone();
 
// One upload per agent build. Tars the current dir (respects .gitignore).
const snap = await ks.agents.upload({
  name: "<FILL IN: agent name, same as spec.agent.snapshot>",
  bundle: "./",
  runtime: "node20",
  entrypoint: ["tsx", "src/index.ts"],
});
console.log("snapshot:", snap.id);
 
// One upload per scenario.
const spec = await ks.specs.create({
  spec_yaml: readFileSync("specs/scenario-1.yaml", "utf8"),
});
console.log("spec:", spec.spec_id);
 
// Create the experiment. specPath lets the SDK auto-forward
// any `source: env` / `source: file:` / `source: command:` secrets.
const exp = await ks.experiments.create({
  name: "scenario-1",
  spec_id: spec.spec_id,
  specPath: "specs/scenario-1.yaml",
});
console.log("experiment:", exp.id, "(status: draft)");

After this step, the experiment appears in the Dashboard as Draft — nothing is running yet.


Step 5 — Review in the Dashboard, then run

Open app.paragon.run/app/keystone/experiments. Click into your new experiment. You'll see:

  • The spec YAML that was uploaded
  • The secret names it declared (with warnings if any aren't set yet)
  • The services it will boot
  • The invariants that will score it

When it looks right, hit Run. The experiment moves from draft → running → completed (or failed). Watch results stream in.

Programmatic equivalent:

const results = await ks.experiments.runAndWait(exp.id);
console.log("pass rate:", results.metrics.pass_rate);
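If this runs in CI, you will likely want to fail the build when the run scores below the spec's threshold. A minimal sketch, assuming pass_rate is a fraction in [0, 1] scored against scoring.pass_threshold from the template:

```typescript
// Hypothetical CI gate mirroring scoring.pass_threshold (0.85 in the template above).
function meetsThreshold(passRate: number, threshold = 0.85): boolean {
  return passRate >= threshold;
}

console.log(meetsThreshold(0.9)); // → true

// e.g. after runAndWait:
//   if (!meetsThreshold(results.metrics.pass_rate)) process.exit(1);
```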

Common first-run failures + fixes

  • What you see: secret "X" declared in spec but not set in Dashboard or forwarded from .env
    Cause: the name in secrets: has no value at any layer.
    Fix: add it to the Dashboard, or export it in your shell / .env.

  • What you see: sandbox boot hangs on container_started, agent never runs
    Cause: no egress to your package registry for npm install / pip install.
    Fix: include node_modules / venv in the snapshot tarball, or add registry.npmjs.org / pypi.org to network.egress.allow.

  • What you see: every invariant fails with "file not found"
    Cause: invariant paths assume files the agent didn't actually write.
    Fix: run the sandbox once manually, ls /workspace, see what's there, rewrite the invariant paths.

  • What you see: agent connects to localhost:5432 and errors "connection refused"
    Cause: code uses localhost, but services are reachable by service name (db, redis, etc.).
    Fix: change connection strings to postgres://db:5432/..., or read the KEYSTONE_SERVICE_DB_HOST env var.

  • What you see: llm_as_judge fails with "no card on file"
    Cause: judges are billable and need a payment method.
    Fix: add a card at app.paragon.run/app/keystone/data/billing.

  • What you see: experiment stuck at draft forever
    Cause: you uploaded it but never called run.
    Fix: click Run in the Dashboard, or call ks.experiments.run(exp.id) / runAndWait.
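The localhost failure is worth a code-side illustration. A minimal sketch of the fix, assuming the sandbox exposes the KEYSTONE_SERVICE_DB_HOST variable named above (the user and database name here are placeholders):

```typescript
// Prefer the sandbox-provided service hostname; fall back to localhost for local dev.
const dbHost = process.env.KEYSTONE_SERVICE_DB_HOST ?? "localhost";

// Placeholder user/database; substitute your own connection details.
const dbUrl = `postgres://postgres:${process.env.DB_PASSWORD ?? ""}@${dbHost}:5432/appdb`;
console.log(dbUrl);
```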

Next steps

  • Spec schema — full field-by-field reference
  • Examples — the canonical end-to-end example with all primitives
  • SDK methods — every method across TS/Python/Go
  • Billing — how compute + LLM judge costs are billed