Datasets
Versioned collections of input/expected pairs that drive specs across many cases.
A dataset is a versioned collection of (input, expected) pairs. Use them when you have one spec that should run against many cases — e.g., 50 customer emails, each with its own expected subject line. Instead of writing 50 specs, you upload the dataset once and reference its records from a single spec.
When to use a dataset
| Pattern | Use a dataset? |
|---|---|
| One scenario, run many times for flakiness | No — use parallelism.replicas |
| One spec across N parameter combinations | No — use parallelism.matrix |
| Many test cases each with their own expected output | Yes |
| Building a regression corpus that grows over time | Yes |
| Storing reference Q&A pairs for RAG eval | Yes |
datasets.create(name, description?)
const ds = await ks.datasets.create(
"customer-emails",
"Renewal email scenarios across customer tiers",
);
// {
// id: "ds_abc123...",
// user_id: "...",
// name: "customer-emails",
// description: "Renewal email scenarios across customer tiers",
// version: 1,
// created_at: "2026-04-28T...",
// }

What this does: POST /v1/datasets. Creates an empty dataset with version 1.
datasets.addRecords(dataset_id, records)
await ks.datasets.addRecords(ds.id, [
{
input: { customer_id: "alice@co", plan: "pro", days_left: 7 },
expected: { subject_contains: "Renewal", body_mentions: ["alice", "pro plan"] },
metadata: { tier: "pro", region: "us" },
tags: ["pro", "us"],
},
{
input: { customer_id: "ben@co", plan: "free", days_left: 0 },
expected: { subject_contains: "Trial expired" },
tags: ["free"],
},
]);
// Returns: { added: 2 }

What this does: POST /v1/datasets/<id>/records. Auto-increments the dataset version. Each record gets a unique ID and is stamped with the new version.
| Field | Required | Notes |
|---|---|---|
| input | yes | Inputs to the agent. Free-form JSON. |
| expected | no | Expected outputs. Used by scorers like ExactMatch(expectedKey: "..."). |
| metadata | no | Free-form annotations. |
| tags | no | Strings used for filtering on read. |
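As a rough TypeScript shape, the writable fields in the table above look like this (illustrative only; the SDK's own exported types may differ):

```typescript
// Illustrative shape of a record as passed to addRecords().
// Field names come from the table above; this is not the SDK's exported type.
interface NewDatasetRecord {
  input: Record<string, unknown>;     // required: free-form JSON inputs to the agent
  expected?: Record<string, unknown>; // optional: expected outputs, read by scorers
  metadata?: Record<string, unknown>; // optional: free-form annotations
  tags?: string[];                    // optional: strings used for filtering on read
}

// A small guard you might run client-side before uploading a batch.
function isValidRecord(r: NewDatasetRecord): boolean {
  return r.input !== undefined && Object.keys(r.input).length > 0;
}
```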
datasets.getRecords(dataset_id, opts?)
// Latest version
const all = await ks.datasets.getRecords(ds.id);
// Specific version
const v3 = await ks.datasets.getRecords(ds.id, { version: 3 });
// Filtered by tags (any-of match)
const proRows = await ks.datasets.getRecords(ds.id, { tags: ["pro"] });

What this does: GET /v1/datasets/<id>/records?version=<n>&tags=<csv>. Returns the records as DatasetRecord[].
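The tags filter is an any-of match: a record is returned if it carries at least one of the requested tags. The real filtering happens server-side; a local sketch of that semantics:

```typescript
// Mirrors the documented any-of tag matching, client-side.
// A record with no tags never matches a tag filter.
function matchesAnyTag(recordTags: string[] | undefined, wanted: string[]): boolean {
  if (!recordTags) return false;
  return wanted.some((t) => recordTags.includes(t));
}
```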
datasets.list() / datasets.get(id) / datasets.delete(id)
const all = await ks.datasets.list();
const one = await ks.datasets.get("ds_abc123");
await ks.datasets.delete("ds_abc123"); // deletes dataset + all records

Standard CRUD on the dataset itself.
Versioning
Every addRecords() call increments the version. The previous version's records remain queryable forever — old experiments always see the rows they ran against.
// Initial: 10 records at v1
await ks.datasets.addRecords(ds.id, initialRecords);
// Later: add 5 more — now v2 has 15 records, v1 still has 10
await ks.datasets.addRecords(ds.id, moreRecords);
// Query v1 → 10 records
const v1 = await ks.datasets.getRecords(ds.id, { version: 1 });
// Query v2 (or no version, defaults to latest) → 15 records
const v2 = await ks.datasets.getRecords(ds.id);

Pin a specific version in your spec/code if you want experiments to be reproducible against a frozen dataset.
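The versioning model can be mirrored with a tiny in-memory sketch (an illustration of the semantics, not SDK code):

```typescript
// Minimal model of dataset versioning: every add bumps the version,
// and a query at version N sees exactly the records stamped <= N.
class VersionedDataset<T> {
  private version = 0;
  private rows: { version: number; record: T }[] = [];

  add(records: T[]): number {
    this.version += 1;
    for (const r of records) this.rows.push({ version: this.version, record: r });
    return this.version;
  }

  get(version?: number): T[] {
    const v = version ?? this.version; // no version = latest
    return this.rows.filter((row) => row.version <= v).map((row) => row.record);
  }
}
```

Querying v1 after a later add still returns only the first batch, which is what makes a pinned version reproducible.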
Driving a spec from a dataset
A spec doesn't natively iterate over a dataset — you do that in your experiment code. Pattern:
const records = await ks.datasets.getRecords("ds_emails", { tags: ["pro"] });
for (const row of records) {
// Spec is shared; experiment name varies per row
const exp = await ks.experiments.create({
name: `email-${row.input.customer_id}`,
spec_id: "email-agent-spec",
});
// Pass the row's input as task parameters via the spec's matrix
// (or via a custom orchestrator that templates the input into the spec)
await ks.experiments.runAndWait(exp.id);
}

For tighter integration, the Eval() helper takes a list of records as data: directly:
import { Eval, Factuality, AnswerRelevancy } from "@polarityinc/polarity-keystone";
const records = await ks.datasets.getRecords("ds_emails");
const result = await Eval("email-quality-eval", {
data: records.map((r) => ({ input: r.input, expected: r.expected })),
task: async (input) => myAgent.run(input),
scores: [new Factuality(), new AnswerRelevancy()],
});
console.log(result.summary);
// { factuality: { mean, p50, p95, count }, answer_relevancy: { ... } }

Eval() runs the task once per row in parallel, scores each output with every scorer, and returns aggregate stats. If KEYSTONE_API_KEY is set, the result is also reported to the Keystone dashboard as a single trace event, so it shows up in the experiment list.
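A common use of the aggregate summary is gating a release in CI. A hedged sketch, assuming the { mean, p50, p95, count } per-scorer shape shown above (the threshold values here are invented):

```typescript
type ScoreStats = { mean: number; p50: number; p95: number; count: number };

// Returns the names of scorers whose mean fell below its threshold,
// so CI can fail the build when any list entry comes back non-empty.
function failingScores(
  summary: Record<string, ScoreStats>,
  thresholds: Record<string, number>, // e.g. { factuality: 0.8 }
): string[] {
  return Object.entries(thresholds)
    .filter(([name, min]) => (summary[name]?.mean ?? 0) < min)
    .map(([name]) => name);
}
```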
Patterns
Regression corpus
Add new records when bugs are reported — every customer-found-bug becomes a permanent test case:
await ks.datasets.addRecords("ds_regressions", [
{
input: { customer_id: "alice@co", plan: "pro" },
expected: { subject_contains: "Renewal" },
metadata: { reported_in: "ZENDESK-1234", date: "2026-04-28" },
tags: ["bug-fix"],
},
]);

Run the regression dataset on every release. The corpus only grows: every past failure stays in the test set.
RAG eval set
const records = [
{
input: { question: "What's our refund policy?", context: ["..."] },
expected: { answer_contains: "30 days" },
tags: ["policy"],
},
// ... 100 more Q&A pairs
];
await ks.datasets.addRecords("ds_rag_eval", records);
// Score with the RAG preset
await Eval("rag-quality", {
data: records,
task: async (input) => ragAgent.answer(input.question, input.context),
scores: presets.rag(),
});

A/B dataset versions
Use tags to bucket records:
await ks.datasets.addRecords("ds_emails", [
{ input: { ... }, tags: ["a"] },
{ input: { ... }, tags: ["b"] },
]);
const aRecords = await ks.datasets.getRecords("ds_emails", { tags: ["a"] });
const bRecords = await ks.datasets.getRecords("ds_emails", { tags: ["b"] });
// Run two experiments — one per bucket

Limits
| Limit | Default |
|---|---|
| Records per dataset | 10,000 |
| Versions per dataset | 100 |
| Record size | 64 KB |
| Datasets per tenant | 1,000 |
Larger datasets? Split by tag and query with tags: [...] to keep response payloads sane.
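One way to split by tag is to assign each record a deterministic shard tag at write time, then page through shards on read with tags: [...]. A hypothetical helper (shard tags are an application convention, not an SDK concept):

```typescript
// Deterministically assigns one of `shards` tags per record index, so a
// corpus larger than one response payload can be read back shard by shard.
function shardTag(index: number, shards: number): string {
  return `shard-${index % shards}`;
}

// Attach the shard tag when building a batch for addRecords().
function withShardTags<T extends { tags?: string[] }>(records: T[], shards: number): T[] {
  return records.map((r, i) => ({ ...r, tags: [...(r.tags ?? []), shardTag(i, shards)] }));
}
```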