Scoring

Scorers Library

29 built-in scorers across 5 families — heuristic, LLM-judge, RAG, embedding, sandbox — for client-side scoring.

The Keystone SDKs ship a library of 29 client-side scorers that mirror Braintrust's surface, organized into 5 families. Use them with experiments.runAndWait(expId, { scores: [...] }) to add per-scenario scoring on top of the spec-level invariants.

Three flavors of scoring you'll see across the docs:

| Tool | Where it runs | What it scores |
| --- | --- | --- |
| Spec invariants (in spec.yaml) | Server, after agent finishes | Sandbox state, services, files, mock calls |
| Client-side scorers (this page) | SDK, after experiment completes | Agent output (text), input/expected pairs |
| Custom Python scorers | Server, named in spec invariants | Anything — full sandbox context |

Client-side scorers are convenience: they let you stack additional checks onto an existing experiment without rewriting the spec.

Using scorers

import { Keystone, Factuality, ExactMatch, AnswerRelevancy } from "@polarityinc/polarity-keystone";
 
const ks = new Keystone();
const exp = await ks.experiments.create({ name: "...", spec_id: "..." });
 
const results = await ks.experiments.runAndWait(exp.id, {
  scores: [
    new ExactMatch({ expectedKey: "answer" }),
    new Factuality({ model: "paragon-fast" }),
    new AnswerRelevancy({ model: "paragon-fast" }),
  ],
});

After the experiment finishes, every scorer is evaluated against every scenario, and the resulting scores are appended to each scenario's invariants list under the scorer's name.
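
For example, to print every appended score after a run (a minimal sketch; the exact result shape may differ between SDK versions, so treat the field names as assumptions):

// Hypothetical result fields: scenarios, invariants, passed, score.
for (const scenario of results.scenarios ?? []) {
  for (const inv of scenario.invariants ?? []) {
    console.log(`${inv.name}: ${inv.passed ? "pass" : "fail"} (score ${inv.score})`);
  }
}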

Heuristic scorers (6)

Pure functions of strings and numbers. No LLM calls, no extra cost.

ExactMatch

new ExactMatch({ expected: "yes" });
new ExactMatch({ expectedKey: "answer" });    // pulls from scenario.parameters.answer
new ExactMatch({ expected: "Yes", caseSensitive: false, strip: true });

Compares agent_output to the expected string. strip: true (default) trims whitespace; caseSensitive: true (default) requires identical case.

Levenshtein

new Levenshtein({ expected: "the quick brown fox", threshold: 0.85 });

Normalized edit-distance similarity in [0, 1]. Returns the actual similarity as the score; threshold decides pass/fail.
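
A reference sketch of the usual normalization (assuming similarity is one minus edit distance divided by the longer string's length; the SDK's exact formula may differ):

// Classic dynamic-programming edit distance, normalized to a similarity.
function levenshteinSimilarity(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - dp[a.length][b.length] / maxLen;
}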

NumericDiff

new NumericDiff({ expected: 42, tolerance: 0.01 });   // 1% tolerance

Extracts the first number from agent_output, compares to expected. tolerance: 0.01 means within 1%.
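
A sketch of the likely check, assuming tolerance is relative to the expected value:

// Assumed semantics: pass if |actual - expected| <= tolerance * |expected|.
// With expected: 42 and tolerance: 0.01, outputs in [41.58, 42.42] pass.
const withinTolerance = (actual: number, expected: number, tolerance: number) =>
  Math.abs(actual - expected) <= tolerance * Math.abs(expected);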

JSONDiff

new JSONDiff({ expected: { id: 1, status: "ok" }, threshold: 0.9 });

Parses agent_output as JSON, compares structurally to expected. Returns a similarity score (matching keys + values, recursive). Fails if output isn't valid JSON.
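
Roughly how a structural similarity like this is computed (a sketch, not the SDK's exact algorithm):

// Fraction of matching keys/values over the union of keys, recursing into
// nested objects; primitives compare by strict equality.
function jsonSimilarity(a: unknown, b: unknown): number {
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return a === b ? 1 : 0;
  }
  const keys = [...new Set([...Object.keys(a), ...Object.keys(b)])];
  if (keys.length === 0) return 1;
  let total = 0;
  for (const k of keys) total += jsonSimilarity((a as any)[k], (b as any)[k]);
  return total / keys.length;
}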

JSONValidity

new JSONValidity();

Just checks that agent_output parses as JSON. Pass = valid, fail = invalid.

SemanticListContains

new SemanticListContains({
  expected: ["urgent", "deadline", "tomorrow"],
  threshold: 0.7,                    // 70% of items must be matched
  fuzzy: true,                        // allow Levenshtein ≥ 0.75
  fuzzyThreshold: 0.75,
});

Checks that the output mentions ≥ threshold fraction of the expected list. Useful for "did the response cover all the topics" checks.
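
Under the hood the pass rule is presumably a coverage fraction; a sketch with plain substring matching (fuzzy matching would additionally compare tokens by edit distance):

// Matched fraction of expected items; the scorer passes when this is >= threshold.
function listCoverage(output: string, expected: string[]): number {
  const haystack = output.toLowerCase();
  const matched = expected.filter((item) => haystack.includes(item.toLowerCase()));
  return matched.length / expected.length;
}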

LLM-judge scorers (9)

Use a model to judge the output. All accept model (default paragon-fast), temperature (default 0), and a custom rubric. See LLM-as-Judge for the full prompts.

| Scorer | What it judges |
| --- | --- |
| Factuality | Is the answer factually correct vs. the expected? |
| Battle | Is output A or output B better? (returns 0 or 1) |
| ClosedQA | Did the answer correctly extract from the source? |
| Humor | Is this funny? (rare; included for parity) |
| Moderation | Is the output free of harmful content? |
| Summarization | Is this a good summary of the source? |
| SQLJudge | Does this SQL query satisfy the description? |
| Translation | Is this a faithful translation? |
| Security | Is the output free of security violations (PII, secrets)? |

new Factuality({ model: "paragon-fast" });
new Moderation({ rubric: { pass: "Free of harmful content", fail: "Contains harmful content" } });
new SQLJudge({ expectedKey: "expected_query" });

RAG scorers (8)

For retrieval-augmented generation pipelines. Each scorer reads scenario.parameters for context, question, and answer fields.

| Scorer | What it measures |
| --- | --- |
| ContextPrecision | Are retrieved chunks relevant? |
| ContextRecall | Do retrieved chunks cover the answer? |
| ContextRelevancy | Are retrieved chunks topically relevant? |
| ContextEntityRecall | Do retrieved chunks mention the right entities? |
| Faithfulness | Is the answer grounded in the retrieved context? |
| AnswerRelevancy | Does the answer address the question? |
| AnswerSimilarity | Is the answer semantically similar to the expected? |
| AnswerCorrectness | Is the answer factually correct + relevant? |

new Faithfulness({ model: "paragon-fast" });
new AnswerRelevancy({ model: "paragon-fast" });
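
These scorers only have something to read if your spec's scenarios carry the right parameters. A hypothetical RAG scenario's parameters might look like this (field names follow the description above; the exact shape depends on your spec):

// Hypothetical scenario.parameters for a RAG scenario.
const parameters = {
  question: "When does the Acme contract renew?",
  context: [
    "The Acme contract renews annually on March 1.",
    "Renewal notices go out 30 days in advance.",
  ],
  answer: "It renews on March 1 each year.", // expected answer for the Answer* scorers
};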

Use them as a bundle via the presets.rag() helper.

Embedding scorer (1)

new EmbeddingSimilarity({
  expected: "Renewal email",
  embedder: openaiEmbedder("text-embedding-3-small"),
  threshold: 0.85,
});

Computes cosine similarity between agent_output and expected using the named embedder. Returns the similarity in [0, 1].
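
Cosine similarity itself is just the normalized dot product; a reference sketch (raw cosine lies in [-1, 1], so the scorer presumably clamps or rescales it into [0, 1]):

// Normalized dot product of two equal-length vectors.
function cosine(u: number[], v: number[]): number {
  let dot = 0, nu = 0, nv = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    nu += u[i] * u[i];
    nv += v[i] * v[i];
  }
  return dot / (Math.sqrt(nu) * Math.sqrt(nv));
}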

Sandbox scorers (5)

These wrap spec-level invariants — they convert to a check: block before the experiment runs and don't actually execute client-side. Useful when you want to write your invariants in code instead of YAML:

import { FileExists, FileContains, CommandExits, SQLEquals, LLMJudge } from "@polarityinc/polarity-keystone";
 
new FileExists("output.json");
new FileContains("src/main.ts", { contains: "TODO", notContains: "console.log" });
new CommandExits("npm test", { exitCode: 0 });
new SQLEquals({ service: "db", query: "SELECT count(*) FROM users", equals: 5 });
new LLMJudge("Was the email professional?", { model: "paragon-fast", inputFrom: "smtp.last_request.body" });

Pass them in scores: [...] and they'll be merged with your spec's invariants for execution.
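
For example, one run can mix both kinds (scorer names as used earlier on this page):

await ks.experiments.runAndWait(exp.id, {
  scores: [
    new Factuality({ model: "paragon-fast" }),     // runs client-side after the experiment
    new FileExists("output.json"),                 // merged into the spec's invariants
    new CommandExits("npm test", { exitCode: 0 }), // merged into the spec's invariants
  ],
});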

Presets

Three opinionated bundles for common agent shapes:

import { presets } from "@polarityinc/polarity-keystone";
 
await ks.experiments.runAndWait(exp.id, {
  scores: presets.chat(),               // factuality + answer-relevancy + moderation
});
 
await ks.experiments.runAndWait(exp.id, {
  scores: presets.rag(),                // context-precision + recall + faithfulness + answer-relevancy
});
 
await ks.experiments.runAndWait(exp.id, {
  scores: presets.agentWithTools(),     // json-validity + factuality + answer-relevancy
});

Each accepts a model option that is threaded into its LLM-judge scorers (default paragon-fast).
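
For example (assuming the option is passed as an object, matching the scorer constructors above):

await ks.experiments.runAndWait(exp.id, {
  scores: presets.rag({ model: "paragon-fast" }),
});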

Custom scorers

Define your own:

import { CustomScorer, ScenarioResult, Score } from "@polarityinc/polarity-keystone";
 
class WordCount extends CustomScorer {
  readonly name = "word_count_under_100";
 
  scoreResult(scenario: ScenarioResult): Score {
    // filter(Boolean) drops the empty strings that split produces on
    // leading/trailing whitespace or empty output.
    const words = (scenario.agent_output ?? "").split(/\s+/).filter(Boolean).length;
    const passed = words < 100;
    return {
      name: this.name,
      score: passed ? 1 : 0,
      passed,
      message: `${words} words`,
    };
  }
}
 
await ks.experiments.runAndWait(exp.id, { scores: [new WordCount()] });

The scorer receives a ScenarioResult and returns a Score. Failures are caught: if your scorer throws, that score is recorded as 0 and the experiment continues.
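
That means a scorer can simply throw on malformed output; a sketch (assuming ScenarioResult and Score are exported alongside CustomScorer, as above):

class StrictJSON extends CustomScorer {
  readonly name = "strict_json";
 
  scoreResult(scenario: ScenarioResult): Score {
    // JSON.parse throws on invalid output; the SDK catches the error and
    // records this score as 0 while the rest of the run continues.
    const parsed = JSON.parse(scenario.agent_output ?? "");
    return {
      name: this.name,
      score: 1,
      passed: true,
      message: `${Object.keys(parsed).length} top-level keys`,
    };
  }
}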

Choosing scorers

| Scenario | Try |
| --- | --- |
| Q&A chatbot | Factuality, AnswerRelevancy, Moderation |
| RAG pipeline | presets.rag() |
| Tool-using agent | JSONValidity (on tool args), Factuality (on final answer) |
| Code generation | CommandExits (tests pass), FileContains (no debug logs), LLMJudge (code quality) |
| Email drafting | Levenshtein (vs. expected draft), LLMJudge (tone), SemanticListContains (key phrases) |
| Data extraction | JSONDiff (vs. expected structure), ExactMatch on key fields |
| Translation | Translation (LLM-judge), EmbeddingSimilarity |
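
For example, a code-generation run might stack all three scorers from that row (a sketch; the judge prompt is illustrative):

await ks.experiments.runAndWait(exp.id, {
  scores: [
    new CommandExits("npm test", { exitCode: 0 }),
    new FileContains("src/main.ts", { notContains: "console.log" }),
    new LLMJudge("Is the code idiomatic and readable?", { model: "paragon-fast" }),
  ],
});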

How scorers post to the dashboard

Every scorer exposes toRule() (TS) / to_rule() (Python) which converts it to a server-side ScoreRule. When you call ks.scoring.createRule(...) and then ks.scoring.scoreExperiment(expId, ruleIds), the same scoring logic runs server-side over the trace data — useful for retroactively adding a new metric to old experiments.

const factuality = new Factuality({ model: "paragon-fast" });
const rule = await ks.scoring.createRule(factuality.name, factuality.ruleType, factuality.ruleConfig);
await ks.scoring.scoreExperiment("exp-old-id", [rule.id]);
const scores = await ks.scoring.getScores("exp-old-id");