Scorers Library
29 built-in scorers across 5 families — heuristic, LLM-judge, RAG, embedding, sandbox — for client-side scoring.
The Keystone SDKs ship a library of 29 client-side scorers that mirror Braintrust's surface, organized into 5 families. Pass them in the `scores` option of `experiments.runAndWait(...)` to add per-scenario scoring on top of the spec-level invariants.
Three flavors of scoring you'll see across the docs:
| Tool | Where it runs | What it scores |
|---|---|---|
| Spec invariants (in spec.yaml) | Server, after agent finishes | Sandbox state, services, files, mock calls |
| Client-side scorers (this page) | SDK, after experiment completes | Agent output (text), input/expected pairs |
| Custom Python scorers | Server, named in spec invariants | Anything — full sandbox context |
Client-side scorers are a convenience: they let you stack additional checks onto an existing experiment without rewriting the spec.
Using scorers
```ts
import { Keystone, Factuality, ExactMatch, AnswerRelevancy } from "@polarityinc/polarity-keystone";

const ks = new Keystone();
const exp = await ks.experiments.create({ name: "...", spec_id: "..." });
const results = await ks.experiments.runAndWait(exp.id, {
  scores: [
    new ExactMatch({ expectedKey: "answer" }),
    new Factuality({ model: "paragon-fast" }),
    new AnswerRelevancy({ model: "paragon-fast" }),
  ],
});
```

After the experiment finishes, every scorer is evaluated against every scenario, and the resulting scores are appended to the scenario's invariants list with the scorer name.
Heuristic scorers (6)
Pure functions of strings and numbers. No LLM calls, no extra cost.
ExactMatch
```ts
new ExactMatch({ expected: "yes" });
new ExactMatch({ expectedKey: "answer" }); // pulls from scenario.parameters.answer
new ExactMatch({ expected: "Yes", caseSensitive: false, strip: true });
```

Compares `agent_output` to the expected string. `strip: true` (default) trims whitespace; `caseSensitive: true` (default) requires identical case.
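The comparison ExactMatch performs can be sketched in plain TypeScript. This is an illustration of the semantics described above, not the SDK's source; the function name is ours:

```ts
// Sketch of ExactMatch semantics: strip trims whitespace,
// caseSensitive toggles case folding before comparison.
function exactMatch(
  output: string,
  expected: string,
  opts: { caseSensitive?: boolean; strip?: boolean } = {}
): boolean {
  const { caseSensitive = true, strip = true } = opts;
  let a = output;
  let b = expected;
  if (strip) {
    a = a.trim();
    b = b.trim();
  }
  if (!caseSensitive) {
    a = a.toLowerCase();
    b = b.toLowerCase();
  }
  return a === b;
}
```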
Levenshtein
```ts
new Levenshtein({ expected: "the quick brown fox", threshold: 0.85 });
```

Normalised edit-distance similarity in [0, 1]. Returns the actual similarity as the score; `threshold` decides pass/fail.
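For reference, normalised edit-distance similarity is `1 - distance / max(len(a), len(b))`. A self-contained sketch of the metric described above (the function name is ours, not the SDK's):

```ts
// Single-row dynamic-programming Levenshtein, normalised to [0, 1].
function levenshteinSimilarity(a: string, b: string): number {
  const m = a.length;
  const n = b.length;
  if (m === 0 && n === 0) return 1; // two empty strings are identical
  // dp[j] holds the edit distance between a[0..i] and b[0..j]
  const dp: number[] = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = dp[0]; // dp[i-1][0]
    dp[0] = i;
    for (let j = 1; j <= n; j++) {
      const tmp = dp[j]; // dp[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return 1 - dp[n] / Math.max(m, n);
}
```

For example, `"kitten"` vs `"sitting"` has edit distance 3 over max length 7, so the similarity is about 0.57 and would fail a 0.85 threshold.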
NumericDiff
```ts
new NumericDiff({ expected: 42, tolerance: 0.01 }); // 1% tolerance
```

Extracts the first number from `agent_output`, compares to `expected`. `tolerance: 0.01` means within 1%.
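The extract-and-compare logic can be sketched as follows. This is our reading of the description above (relative tolerance, first number wins), not the SDK source:

```ts
// Pull the first number out of the text and check it against the
// expected value within a relative tolerance.
function numericDiff(output: string, expected: number, tolerance: number): boolean {
  const match = output.match(/-?\d+(\.\d+)?/);
  if (match === null) return false; // no number found at all
  const actual = parseFloat(match[0]);
  if (expected === 0) return Math.abs(actual) <= tolerance; // avoid divide-by-zero
  return Math.abs(actual - expected) / Math.abs(expected) <= tolerance;
}
```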
JSONDiff
```ts
new JSONDiff({ expected: { id: 1, status: "ok" }, threshold: 0.9 });
```

Parses `agent_output` as JSON, compares structurally to `expected`. Returns a similarity score (matching keys + values, recursive). Fails if output isn't valid JSON.
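A minimal sketch of recursive structural similarity, assuming the score is the fraction of matching leaf values over the union of keys at each level (our interpretation of "matching keys + values, recursive", not the SDK's exact algorithm):

```ts
// Recursively score how similar two JSON values are, in [0, 1].
function jsonSimilarity(a: unknown, b: unknown): number {
  // Leaves (primitives, null): exact match or nothing.
  if (typeof a !== "object" || a === null || typeof b !== "object" || b === null) {
    return a === b ? 1 : 0;
  }
  // Objects (and arrays, via index keys): average over the union of keys,
  // so a key missing on either side counts against the score.
  const keys = new Set([
    ...Object.keys(a as Record<string, unknown>),
    ...Object.keys(b as Record<string, unknown>),
  ]);
  if (keys.size === 0) return 1;
  let total = 0;
  for (const k of keys) {
    total += jsonSimilarity(
      (a as Record<string, unknown>)[k],
      (b as Record<string, unknown>)[k]
    );
  }
  return total / keys.size;
}
```

Under this scheme `{ id: 1, status: "ok" }` vs `{ id: 1, status: "error" }` scores 0.5 and would fail a 0.9 threshold.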
JSONValidity
```ts
new JSONValidity();
```

Just checks that `agent_output` parses as JSON. Pass = valid, fail = invalid.
SemanticListContains
```ts
new SemanticListContains({
  expected: ["urgent", "deadline", "tomorrow"],
  threshold: 0.7, // 70% of items must be matched
  fuzzy: true, // allow Levenshtein ≥ 0.75
  fuzzyThreshold: 0.75,
});
```

Checks that the output mentions at least a `threshold` fraction of the expected list. Useful for "did the response cover all the topics" checks.
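Setting aside the fuzzy option, the core coverage check reduces to a fraction of matched items. A sketch using case-insensitive substring matching only (the real scorer also supports fuzzy Levenshtein matching, which is omitted here):

```ts
// What fraction of the expected items appear in the output?
function listCoverage(output: string, expected: string[]): number {
  const haystack = output.toLowerCase();
  const hits = expected.filter((item) => haystack.includes(item.toLowerCase()));
  return hits.length / expected.length;
}

// Pass/fail is then just a threshold on the coverage fraction.
function semanticListContains(output: string, expected: string[], threshold: number): boolean {
  return listCoverage(output, expected) >= threshold;
}
```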
LLM-judge scorers (9)
Use a model to judge the output. All accept model (default paragon-fast), temperature (default 0), and a custom rubric. See LLM-as-Judge for the full prompts.
| Scorer | What it judges |
|---|---|
Factuality | Is the answer factually correct vs. the expected? |
Battle | Is output A or output B better? (returns 0 or 1) |
ClosedQA | Did the answer correctly extract from the source? |
Humor | Is this funny? (rare; included for parity) |
Moderation | Is the output free of harmful content? |
Summarization | Is this a good summary of the source? |
SQLJudge | Does this SQL query satisfy the description? |
Translation | Is this a faithful translation? |
Security | Is the output free of security violations (PII, secrets)? |
```ts
new Factuality({ model: "paragon-fast" });
new Moderation({ rubric: { pass: "Free of harmful content", fail: "Contains harmful content" } });
new SQLJudge({ expectedKey: "expected_query" });
```

RAG scorers (8)
For retrieval-augmented generation pipelines. Each scorer reads scenario.parameters for context, question, and answer fields.
| Scorer | What it measures |
|---|---|
ContextPrecision | Are retrieved chunks relevant? |
ContextRecall | Do retrieved chunks cover the answer? |
ContextRelevancy | Are retrieved chunks topically relevant? |
ContextEntityRecall | Do retrieved chunks mention the right entities? |
Faithfulness | Is the answer grounded in the retrieved context? |
AnswerRelevancy | Does the answer address the question? |
AnswerSimilarity | Is the answer semantically similar to the expected? |
AnswerCorrectness | Is the answer factually correct + relevant? |
```ts
new Faithfulness({ model: "paragon-fast" });
new AnswerRelevancy({ model: "paragon-fast" });
```

Use them as a bundle via the `presets.rag()` helper.
Embedding scorer (1)
```ts
new EmbeddingSimilarity({
  expected: "Renewal email",
  embedder: openaiEmbedder("text-embedding-3-small"),
  threshold: 0.85,
});
```

Computes cosine similarity between `agent_output` and `expected` using the named embedder. Returns the similarity in [0, 1].
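Cosine similarity itself is embedder-independent: the dot product of the two vectors divided by the product of their norms. A self-contained sketch:

```ts
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical directions score 1, orthogonal vectors score 0; the scorer then compares the result to `threshold`.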
Sandbox scorers (5)
These wrap spec-level invariants: they are converted to `check:` blocks before the experiment runs and don't actually execute client-side. Useful when you want to write your invariants in code instead of YAML:
```ts
import { FileExists, FileContains, CommandExits, SQLEquals, LLMJudge } from "@polarityinc/polarity-keystone";

new FileExists("output.json");
new FileContains("src/main.ts", { contains: "TODO", notContains: "console.log" });
new CommandExits("npm test", { exitCode: 0 });
new SQLEquals({ service: "db", query: "SELECT count(*) FROM users", equals: 5 });
new LLMJudge("Was the email professional?", { model: "paragon-fast", inputFrom: "smtp.last_request.body" });
```

Pass them in `scores: [...]` and they'll be merged with your spec's invariants for execution.
Presets
Three opinionated bundles for common agent shapes:
```ts
import { presets } from "@polarityinc/polarity-keystone";

await ks.experiments.runAndWait(exp.id, {
  scores: presets.chat(), // factuality + answer-relevancy + moderation
});
await ks.experiments.runAndWait(exp.id, {
  scores: presets.rag(), // context-precision + recall + faithfulness + answer-relevancy
});
await ks.experiments.runAndWait(exp.id, {
  scores: presets.agentWithTools(), // json-validity + factuality + answer-relevancy
});
```

Each preset accepts a `model` option that is threaded into its LLM-judge scorers (default `paragon-fast`).
Custom scorers
Define your own:
```ts
import { CustomScorer } from "@polarityinc/polarity-keystone";

class WordCount extends CustomScorer {
  readonly name = "word_count_under_100";

  scoreResult(scenario) {
    const text = (scenario.agent_output ?? "").trim();
    const words = text === "" ? 0 : text.split(/\s+/).length;
    const passed = words < 100;
    return {
      name: this.name,
      score: passed ? 1 : 0,
      passed,
      message: `${words} words`,
    };
  }
}

await ks.experiments.runAndWait(exp.id, { scores: [new WordCount()] });
```

The scorer receives a `ScenarioResult` and returns a `Score`. Failures are contained: if your scorer throws, only that score is recorded as 0 and the experiment continues.
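Because `scoreResult` is a pure function of the scenario, the same logic can be unit-tested outside the SDK by mimicking the scenario with a plain object. The interface below is a stand-in for illustration; only the `agent_output` field is taken from the example above:

```ts
// Minimal stand-in for the scenario shape the scorer reads.
interface ScenarioLike {
  agent_output?: string | null;
}

// The WordCount logic from the example, extracted as a free function
// so it can be exercised without the SDK.
function wordCountUnder100(scenario: ScenarioLike) {
  const text = (scenario.agent_output ?? "").trim();
  const words = text === "" ? 0 : text.split(/\s+/).length;
  const passed = words < 100;
  return {
    name: "word_count_under_100",
    score: passed ? 1 : 0,
    passed,
    message: `${words} words`,
  };
}
```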
Choosing scorers
| Scenario | Try |
|---|---|
| Q&A chatbot | Factuality, AnswerRelevancy, Moderation |
| RAG pipeline | presets.rag() |
| Tool-using agent | JSONValidity (on tool args), Factuality (on final answer) |
| Code generation | CommandExits (tests pass), FileContains (no debug logs), LLMJudge (code quality) |
| Email drafting | Levenshtein (vs. expected draft), LLMJudge (tone), SemanticListContains (key phrases) |
| Data extraction | JSONDiff (vs. expected structure), ExactMatch on key fields |
| Translation | Translation (LLM-judge), EmbeddingSimilarity |
How scorers post to the dashboard
Every scorer exposes toRule() (TS) / to_rule() (Python) which converts it to a server-side ScoreRule. When you call ks.scoring.createRule(...) and then ks.scoring.scoreExperiment(expId, ruleIds), the same scoring logic runs server-side over the trace data — useful for retroactively adding a new metric to old experiments.
```ts
const factuality = new Factuality({ model: "paragon-fast" });
const rule = await ks.scoring.createRule(factuality.name, factuality.ruleType, factuality.ruleConfig);
await ks.scoring.scoreExperiment("exp-old-id", [rule.id]);
const scores = await ks.scoring.getScores("exp-old-id");
```