Datasets
Versioned collections of input/expected pairs that drive specs across many cases.
A dataset is a versioned collection of (input, expected) pairs. Use them when you have one spec that should run against many cases — e.g., 50 customer emails, each with its own expected subject line. Instead of writing 50 specs, you upload the dataset once and reference its records from a single spec.
When to use a dataset
| Pattern | Use a dataset? |
|---|---|
| One scenario, run many times for flakiness | No — use parallelism.replicas |
| One spec across N parameter combinations | No — use parallelism.matrix |
| Many test cases each with their own expected output | Yes |
| Building a regression corpus that grows over time | Yes |
| Storing reference Q&A pairs for RAG eval | Yes |
datasets.create(name, description?)
const ds = await ks.datasets.create(
"customer-emails",
"Renewal email scenarios across customer tiers",
);
// {
// id: "ds_abc123...",
// user_id: "...",
// name: "customer-emails",
// description: "Renewal email scenarios across customer tiers",
// version: 1,
// created_at: "2026-04-28T...",
// }

What this does: POST /v1/datasets. Creates an empty dataset with version 1.
datasets.addRecords(dataset_id, records)
await ks.datasets.addRecords(ds.id, [
{
input: { customer_id: "alice@co", plan: "pro", days_left: 7 },
expected: { subject_contains: "Renewal", body_mentions: ["alice", "pro plan"] },
metadata: { tier: "pro", region: "us" },
tags: ["pro", "us"],
},
{
input: { customer_id: "ben@co", plan: "free", days_left: 0 },
expected: { subject_contains: "Trial expired" },
tags: ["free"],
},
]);
// Returns: { added: 2 }

What this does: POST /v1/datasets/<id>/records. Auto-increments the dataset version. Each record gets a unique ID and is stamped with the new version.
| Field | Required | Notes |
|---|---|---|
| input | yes | Inputs to the agent. Free-form JSON. |
| expected | no | Expected outputs. Used by scorers like ExactMatch(expectedKey: "..."). |
| metadata | no | Free-form annotations. |
| tags | no | Strings used for filtering on read. |
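As a rough TypeScript shape, the writable fields in the table above look like this (illustrative only; the SDK's own exported types may differ):

```typescript
// Illustrative shape of a record as passed to addRecords().
// Field names come from the table above; this is not the SDK's exported type.
interface NewDatasetRecord {
  input: Record<string, unknown>;     // required: free-form JSON inputs to the agent
  expected?: Record<string, unknown>; // optional: expected outputs, read by scorers
  metadata?: Record<string, unknown>; // optional: free-form annotations
  tags?: string[];                    // optional: strings used for filtering on read
}

// A small guard you might run client-side before uploading a batch.
function isValidRecord(r: NewDatasetRecord): boolean {
  return r.input !== undefined && Object.keys(r.input).length > 0;
}
```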
datasets.getRecords(dataset_id, opts?)
// Latest version
const all = await ks.datasets.getRecords(ds.id);
// Specific version
const v3 = await ks.datasets.getRecords(ds.id, { version: 3 });
// Filtered by tags (any-of match)
const proRows = await ks.datasets.getRecords(ds.id, { tags: ["pro"] });

What this does: GET /v1/datasets/<id>/records?version=<n>&tags=<csv>. Returns the records as DatasetRecord[].
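The tags filter is an any-of match: a record is returned if it carries at least one of the requested tags. The real filtering happens server-side; a local sketch of that semantics:

```typescript
// Mirrors the documented any-of tag matching, client-side.
// A record with no tags never matches a tag filter.
function matchesAnyTag(recordTags: string[] | undefined, wanted: string[]): boolean {
  if (!recordTags) return false;
  return wanted.some((t) => recordTags.includes(t));
}
```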
datasets.list() / datasets.get(id) / datasets.delete(id)
const all = await ks.datasets.list();
const one = await ks.datasets.get("ds_abc123");
await ks.datasets.delete("ds_abc123"); // deletes dataset + all records

Standard CRUD on the dataset itself.
Versioning
Every addRecords() call increments the version. The previous version's records remain queryable forever — old experiments always see the rows they ran against.
// Initial: 10 records at v1
await ks.datasets.addRecords(ds.id, initialRecords);
// Later: add 5 more — now v2 has 15 records, v1 still has 10
await ks.datasets.addRecords(ds.id, moreRecords);
// Query v1 → 10 records
const v1 = await ks.datasets.getRecords(ds.id, { version: 1 });
// Query v2 (or no version, defaults to latest) → 15 records
const v2 = await ks.datasets.getRecords(ds.id);

Pin a specific version in your spec/code if you want experiments to be reproducible against a frozen dataset.
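The versioning model can be mirrored with a tiny in-memory sketch (an illustration of the semantics, not SDK code):

```typescript
// Minimal model of dataset versioning: every add bumps the version,
// and a query at version N sees exactly the records stamped <= N.
class VersionedDataset<T> {
  private version = 0;
  private rows: { version: number; record: T }[] = [];

  add(records: T[]): number {
    this.version += 1;
    for (const r of records) this.rows.push({ version: this.version, record: r });
    return this.version;
  }

  get(version?: number): T[] {
    const v = version ?? this.version; // no version = latest
    return this.rows.filter((row) => row.version <= v).map((row) => row.record);
  }
}
```

Querying v1 after a later add still returns only the first batch, which is what makes a pinned version reproducible.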
Driving a spec from a dataset
A spec doesn't natively iterate over a dataset — you do that in your experiment code. Pattern:
const records = await ks.datasets.getRecords("ds_emails", { tags: ["pro"] });
for (const row of records) {
// Spec is shared; experiment name varies per row
const exp = await ks.experiments.create({
name: `email-${row.input.customer_id}`,
spec_id: "email-agent-spec",
});
// Pass the row's input as task parameters via the spec's matrix
// (or via a custom orchestrator that templates the input into the spec)
await ks.experiments.runAndWait(exp.id);
}

For tighter integration, the Eval() helper takes a list of records as data: directly:
import { Eval, Factuality, AnswerRelevancy } from "@polarityinc/polarity-keystone";
const records = await ks.datasets.getRecords("ds_emails");
const result = await Eval("email-quality-eval", {
data: records.map((r) => ({ input: r.input, expected: r.expected })),
task: async (input) => myAgent.run(input),
scores: [new Factuality(), new AnswerRelevancy()],
});
console.log(result.summary);
// { factuality: { mean, p50, p95, count }, answer_relevancy: { ... } }

Eval() runs the task once per row in parallel, scores each output with every scorer, and returns aggregate stats. If KEYSTONE_API_KEY is set, the result is also reported to the Keystone dashboard as a single trace event, so it shows up in the experiment list.
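A common use of the aggregate summary is gating a release in CI. A hedged sketch, assuming the { mean, p50, p95, count } per-scorer shape shown above (the threshold values here are invented):

```typescript
type ScoreStats = { mean: number; p50: number; p95: number; count: number };

// Returns the names of scorers whose mean fell below its threshold,
// so CI can fail the build when any list entry comes back non-empty.
function failingScores(
  summary: Record<string, ScoreStats>,
  thresholds: Record<string, number>, // e.g. { factuality: 0.8 }
): string[] {
  return Object.entries(thresholds)
    .filter(([name, min]) => (summary[name]?.mean ?? 0) < min)
    .map(([name]) => name);
}
```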
Patterns
Regression corpus
Add new records when bugs are reported — every customer-found-bug becomes a permanent test case:
await ks.datasets.addRecords("ds_regressions", [
{
input: { customer_id: "alice@co", plan: "pro" },
expected: { subject_contains: "Renewal" },
metadata: { reported_in: "ZENDESK-1234", date: "2026-04-28" },
tags: ["bug-fix"],
},
]);

Run the regression dataset on every release. The corpus only grows: every past failure stays in the test set.
RAG eval set
const records = [
{
input: { question: "What's our refund policy?", context: ["..."] },
expected: { answer_contains: "30 days" },
tags: ["policy"],
},
// ... 100 more Q&A pairs
];
await ks.datasets.addRecords("ds_rag_eval", records);
// Score with the RAG preset
await Eval("rag-quality", {
data: records,
task: async (input) => ragAgent.answer(input.question, input.context),
scores: presets.rag(),
});

A/B dataset versions
Use tags to bucket records:
await ks.datasets.addRecords("ds_emails", [
{ input: { ... }, tags: ["a"] },
{ input: { ... }, tags: ["b"] },
]);
const aRecords = await ks.datasets.getRecords("ds_emails", { tags: ["a"] });
const bRecords = await ks.datasets.getRecords("ds_emails", { tags: ["b"] });
// Run two experiments — one per bucket

Limits
| Limit | Default |
|---|---|
| Records per dataset | 10,000 |
| Versions per dataset | 100 |
| Record size | 64 KB |
| Datasets per tenant | 1,000 |
Larger datasets? Split by tag and query with tags: [...] to keep response payloads sane.
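One way to split by tag is to assign each record a deterministic shard tag at write time, then page through shards on read with tags: [...]. A hypothetical helper (shard tags are an application convention, not an SDK concept):

```typescript
// Deterministically assigns one of `shards` tags per record index, so a
// corpus larger than one response payload can be read back shard by shard.
function shardTag(index: number, shards: number): string {
  return `shard-${index % shards}`;
}

// Attach the shard tag when building a batch for addRecords().
function withShardTags<T extends { tags?: string[] }>(records: T[], shards: number): T[] {
  return records.map((r, i) => ({ ...r, tags: [...(r.tags ?? []), shardTag(i, shards)] }));
}
```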