# Datasets

Versioned collections of input/expected pairs that drive specs across many cases.

A dataset is a versioned collection of (input, expected) pairs. Use them when you have one spec that should run against many cases — e.g., 50 customer emails, each with its own expected subject line. Instead of writing 50 specs, you upload the dataset once and reference its records from a single spec.

## When to use a dataset

| Pattern | Use a dataset? |
| --- | --- |
| One scenario, run many times for flakiness | No — use `parallelism.replicas` |
| One spec across N parameter combinations | No — use `parallelism.matrix` |
| Many test cases, each with its own expected output | Yes |
| Building a regression corpus that grows over time | Yes |
| Storing reference Q&A pairs for RAG eval | Yes |

## `datasets.create(name, description?)`

```ts
const ds = await ks.datasets.create(
  "customer-emails",
  "Renewal email scenarios across customer tiers",
);
// {
//   id: "ds_abc123...",
//   user_id: "...",
//   name: "customer-emails",
//   description: "Renewal email scenarios across customer tiers",
//   version: 1,
//   created_at: "2026-04-28T...",
// }
```

What this does: `POST /v1/datasets`. Creates an empty dataset at version 1.

## `datasets.addRecords(datasetId, records)`

```ts
await ks.datasets.addRecords(ds.id, [
  {
    input:    { customer_id: "alice@co", plan: "pro", days_left: 7 },
    expected: { subject_contains: "Renewal", body_mentions: ["alice", "pro plan"] },
    metadata: { tier: "pro", region: "us" },
    tags:     ["pro", "us"],
  },
  {
    input:    { customer_id: "ben@co", plan: "free", days_left: 0 },
    expected: { subject_contains: "Trial expired" },
    tags:     ["free"],
  },
]);
// Returns: { added: 2 }
```

What this does: `POST /v1/datasets/<id>/records`. Auto-increments the dataset version. Each record gets a unique ID and is stamped with the new version.

| Field | Required | Notes |
| --- | --- | --- |
| `input` | yes | Inputs to the agent. Free-form JSON. |
| `expected` | no | Expected outputs. Used by scorers like `ExactMatch(expectedKey: "...")`. |
| `metadata` | no | Free-form annotations. |
| `tags` | no | Strings used for filtering on read. |
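As a sanity check before uploading, the field rules above can be enforced client-side. This is an illustrative sketch, not part of the SDK; the 64 KB cap comes from the Limits table below, and server-side enforcement is assumed.

```ts
// Illustrative client-side check of a record against the field rules above.
// Not SDK code; the server is assumed to enforce the same rules.
function validateRecord(rec: Record<string, unknown>): string[] {
  const errors: string[] = [];
  if (rec.input === undefined) errors.push("input is required");
  if (rec.tags !== undefined && !Array.isArray(rec.tags)) {
    errors.push("tags must be an array of strings");
  }
  if (JSON.stringify(rec).length > 64 * 1024) {
    errors.push("record exceeds the 64 KB size limit");
  }
  return errors;
}
```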

## `datasets.getRecords(datasetId, opts?)`

```ts
// Latest version
const all = await ks.datasets.getRecords(ds.id);

// Specific version
const v3 = await ks.datasets.getRecords(ds.id, { version: 3 });

// Filtered by tags (any-of match)
const proRows = await ks.datasets.getRecords(ds.id, { tags: ["pro"] });
```

What this does: `GET /v1/datasets/<id>/records?version=<n>&tags=<csv>`. Returns the records as `DatasetRecord[]`.
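Based on the fields documented above, a returned record plausibly looks like the sketch below. The `id` and `version` property names are assumptions; the any-of tag semantics mirror the comment in the example above.

```ts
// Sketch of the record shape, inferred from the documented fields.
// The `id` and `version` property names are assumptions.
interface DatasetRecord {
  id: string;                         // unique ID assigned on addRecords()
  version: number;                    // dataset version the record was stamped with
  input: Record<string, unknown>;     // free-form JSON inputs to the agent
  expected?: Record<string, unknown>; // expected outputs, read by scorers
  metadata?: Record<string, unknown>; // free-form annotations
  tags?: string[];                    // strings used for tag filtering
}

// "Any-of" tag filtering: a record matches if it carries at least one
// of the requested tags; an empty filter matches everything.
function matchesTags(record: DatasetRecord, filter: string[]): boolean {
  if (filter.length === 0) return true;
  return (record.tags ?? []).some((t) => filter.includes(t));
}
```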

## `datasets.list()` / `datasets.get(id)` / `datasets.delete(id)`

```ts
const all = await ks.datasets.list();
const one = await ks.datasets.get("ds_abc123");
await ks.datasets.delete("ds_abc123"); // deletes dataset + all records
```

Standard CRUD on the dataset itself.

## Versioning

Every addRecords() call increments the version. The previous version's records remain queryable forever — old experiments always see the rows they ran against.

```ts
// Initial: 10 records at v1
await ks.datasets.addRecords(ds.id, initialRecords);

// Later: add 5 more — now v2 has 15 records, v1 still has 10
await ks.datasets.addRecords(ds.id, moreRecords);

// Query v1 → 10 records
const v1 = await ks.datasets.getRecords(ds.id, { version: 1 });

// Query v2 (or no version, defaults to latest) → 15 records
const v2 = await ks.datasets.getRecords(ds.id);
```

Pin a specific version in your spec/code if you want experiments to be reproducible against a frozen dataset.
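Conceptually, because each record is stamped with the version it was added in, "the dataset at version n" is just the records stamped with n or lower. A minimal sketch of that model:

```ts
// Toy model of version-stamped records: not SDK code, just the idea.
type StampedRecord = { version: number; input: unknown };

// "Dataset at version n" = every record stamped with version <= n.
function recordsAtVersion(all: StampedRecord[], version: number): StampedRecord[] {
  return all.filter((r) => r.version <= version);
}
```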

## Driving a spec from a dataset

A spec doesn't natively iterate over a dataset — you do that in your experiment code. Pattern:

```ts
const records = await ks.datasets.getRecords("ds_emails", { tags: ["pro"] });

for (const row of records) {
  // Spec is shared; experiment name varies per row
  const exp = await ks.experiments.create({
    name: `email-${row.input.customer_id}`,
    spec_id: "email-agent-spec",
  });

  // Pass the row's input as task parameters via the spec's matrix
  // (or via a custom orchestrator that templates the input into the spec)
  await ks.experiments.runAndWait(exp.id);
}
```

For tighter integration, the `Eval()` helper accepts a list of records directly via its `data:` option:

```ts
import { Eval, Factuality, AnswerRelevancy } from "@polarityinc/polarity-keystone";

const records = await ks.datasets.getRecords("ds_emails");

const result = await Eval("email-quality-eval", {
  data: records.map((r) => ({ input: r.input, expected: r.expected })),
  task: async (input) => myAgent.run(input),
  scores: [new Factuality(), new AnswerRelevancy()],
});

console.log(result.summary);
// { factuality: { mean, p50, p95, count }, answer_relevancy: { ... } }
```

`Eval()` runs the task per row in parallel, scores each row with every scorer, and returns aggregate stats. If `KEYSTONE_API_KEY` is set, the result is also reported to the Keystone dashboard as a single trace event, so it shows up in the experiment list.
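For intuition, the fan-out-and-aggregate behavior described above can be sketched in a few lines. This is illustrative only; the real `Eval()` ships with the SDK and also computes percentiles like `p50`/`p95` and reports traces.

```ts
// Minimal sketch of Eval()'s run-in-parallel-then-aggregate shape.
// Illustrative only; the real helper also computes p50/p95 and reports traces.
type EvalRow = { input: unknown; expected?: unknown };
type EvalScorer = { name: string; score: (output: unknown, row: EvalRow) => number };

async function miniEval(
  rows: EvalRow[],
  task: (input: unknown) => Promise<unknown>,
  scorers: EvalScorer[],
): Promise<Record<string, { mean: number; count: number }>> {
  // Run the task once per row, in parallel.
  const outputs = await Promise.all(rows.map((r) => task(r.input)));

  // Aggregate each scorer's per-row scores into summary stats.
  const summary: Record<string, { mean: number; count: number }> = {};
  for (const s of scorers) {
    const scores = rows.map((row, i) => s.score(outputs[i], row));
    summary[s.name] = {
      mean: scores.reduce((a, b) => a + b, 0) / scores.length,
      count: scores.length,
    };
  }
  return summary;
}
```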

## Patterns

### Regression corpus

Add new records when bugs are reported — every customer-found bug becomes a permanent test case:

```ts
await ks.datasets.addRecords("ds_regressions", [
  {
    input: { customer_id: "alice@co", plan: "pro" },
    expected: { subject_contains: "Renewal" },
    metadata: { reported_in: "ZENDESK-1234", date: "2026-04-28" },
    tags: ["bug-fix"],
  },
]);
```

Run the regression dataset on every release. The corpus only grows; once a failure lands in the test set, it stays there.

### RAG eval set

```ts
const records = [
  {
    input: { question: "What's our refund policy?", context: ["..."] },
    expected: { answer_contains: "30 days" },
    tags: ["policy"],
  },
  // ... 100 more Q&A pairs
];

await ks.datasets.addRecords("ds_rag_eval", records);

// Score with the RAG preset
await Eval("rag-quality", {
  data: records,
  task: async (input) => ragAgent.answer(input.question, input.context),
  scores: presets.rag(),
});
```

### A/B dataset versions

Use tags to bucket records:

```ts
await ks.datasets.addRecords("ds_emails", [
  { input: { ... }, tags: ["a"] },
  { input: { ... }, tags: ["b"] },
]);

const aRecords = await ks.datasets.getRecords("ds_emails", { tags: ["a"] });
const bRecords = await ks.datasets.getRecords("ds_emails", { tags: ["b"] });

// Run two experiments — one per bucket
```

## Limits

| Limit | Default |
| --- | --- |
| Records per dataset | 10,000 |
| Versions per dataset | 100 |
| Record size | 64 KB |
| Datasets per tenant | 1,000 |

Larger datasets? Split by tag and query with `tags: [...]` to keep response payloads sane.