Scoring

Scorers Library

29 built-in scorers across 5 families — heuristic, LLM-judge, RAG, embedding, sandbox — for client-side scoring.

The Keystone SDKs ship a library of 29 client-side scorers that mirror Braintrust's surface, organized into 5 families. Use them with experiments.runAndWait(expId, { scores: [...] }) to add per-scenario scoring on top of the spec-level invariants.

Three flavors of scoring you'll see across the docs:

| Tool | Where it runs | What it scores |
| --- | --- | --- |
| Spec invariants (in spec.yaml) | Server, after agent finishes | Sandbox state, services, files, mock calls |
| Client-side scorers (this page) | SDK, after experiment completes | Agent output (text), input/expected pairs |
| Custom Python scorers | Server, named in spec invariants | Anything — full sandbox context |

Client-side scorers are convenience: they let you stack additional checks onto an existing experiment without rewriting the spec.

Using scorers

import { Keystone, Factuality, ExactMatch, AnswerRelevancy } from "@polarityinc/polarity-keystone";
 
const ks = new Keystone();
const exp = await ks.experiments.create({ name: "...", spec_id: "..." });
 
const results = await ks.experiments.runAndWait(exp.id, {
  scores: [
    new ExactMatch({ expectedKey: "answer" }),
    new Factuality({ model: "paragon-fast" }),
    new AnswerRelevancy({ model: "paragon-fast" }),
  ],
});

After the experiment finishes, every scorer is evaluated against every scenario, and the resulting scores are appended to each scenario's invariants list under the scorer's name.
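
For example, to print every appended score after a run (a minimal sketch; the exact result shape may differ between SDK versions, so treat the field names as assumptions):

// Hypothetical result fields: scenarios, invariants, passed, score.
for (const scenario of results.scenarios ?? []) {
  for (const inv of scenario.invariants ?? []) {
    console.log(`${inv.name}: ${inv.passed ? "pass" : "fail"} (score ${inv.score})`);
  }
}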

Heuristic scorers (6)

Pure functions of strings and numbers. No LLM calls, no extra cost.

ExactMatch

new ExactMatch({ expected: "yes" });
new ExactMatch({ expectedKey: "answer" });    // pulls from scenario.parameters.answer
new ExactMatch({ expected: "Yes", caseSensitive: false, strip: true });

Compares agent_output to the expected string. strip: true (default) trims whitespace; caseSensitive: true (default) requires identical case.

Levenshtein

new Levenshtein({ expected: "the quick brown fox", threshold: 0.85 });

Normalized edit-distance similarity in [0, 1]. Returns the actual similarity as the score; threshold decides pass/fail.
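
A reference sketch of the usual normalization (assuming similarity is one minus edit distance divided by the longer string's length; the SDK's exact formula may differ):

// Classic dynamic-programming edit distance, normalized to a similarity.
function levenshteinSimilarity(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - dp[a.length][b.length] / maxLen;
}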

NumericDiff

new NumericDiff({ expected: 42, tolerance: 0.01 });   // 1% tolerance

Extracts the first number from agent_output, compares to expected. tolerance: 0.01 means within 1%.
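
A sketch of the likely check, assuming tolerance is relative to the expected value:

// Assumed semantics: pass if |actual - expected| <= tolerance * |expected|.
// With expected: 42 and tolerance: 0.01, outputs in [41.58, 42.42] pass.
const withinTolerance = (actual: number, expected: number, tolerance: number) =>
  Math.abs(actual - expected) <= tolerance * Math.abs(expected);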

JSONDiff

new JSONDiff({ expected: { id: 1, status: "ok" }, threshold: 0.9 });

Parses agent_output as JSON, compares structurally to expected. Returns a similarity score (matching keys + values, recursive). Fails if output isn't valid JSON.
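
Roughly how a structural similarity like this is computed (a sketch, not the SDK's exact algorithm):

// Fraction of matching keys/values over the union of keys, recursing into
// nested objects; primitives compare by strict equality.
function jsonSimilarity(a: unknown, b: unknown): number {
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return a === b ? 1 : 0;
  }
  const keys = [...new Set([...Object.keys(a), ...Object.keys(b)])];
  if (keys.length === 0) return 1;
  let total = 0;
  for (const k of keys) total += jsonSimilarity((a as any)[k], (b as any)[k]);
  return total / keys.length;
}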

JSONValidity

new JSONValidity();

Just checks that agent_output parses as JSON. Pass = valid, fail = invalid.

SemanticListContains

new SemanticListContains({
  expected: ["urgent", "deadline", "tomorrow"],
  threshold: 0.7,                    // 70% of items must be matched
  fuzzy: true,                        // allow Levenshtein ≥ 0.75
  fuzzyThreshold: 0.75,
});

Checks that the output mentions ≥ threshold fraction of the expected list. Useful for "did the response cover all the topics" checks.
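
Under the hood the pass rule is presumably a coverage fraction; a sketch with plain substring matching (fuzzy matching would additionally compare tokens by edit distance):

// Matched fraction of expected items; the scorer passes when this is >= threshold.
function listCoverage(output: string, expected: string[]): number {
  const haystack = output.toLowerCase();
  const matched = expected.filter((item) => haystack.includes(item.toLowerCase()));
  return matched.length / expected.length;
}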

LLM-judge scorers (9)

Use a model to judge the output. All accept model (default paragon-fast), temperature (default 0), and a custom rubric. See LLM-as-Judge for the full prompts.

| Scorer | What it judges |
| --- | --- |
| Factuality | Is the answer factually correct vs. the expected? |
| Battle | Is output A or output B better? (returns 0 or 1) |
| ClosedQA | Did the answer correctly extract from the source? |
| Humor | Is this funny? (rare; included for parity) |
| Moderation | Is the output free of harmful content? |
| Summarization | Is this a good summary of the source? |
| SQLJudge | Does this SQL query satisfy the description? |
| Translation | Is this a faithful translation? |
| Security | Is the output free of security violations (PII, secrets)? |

new Factuality({ model: "paragon-fast" });
new Moderation({ rubric: { pass: "Free of harmful content", fail: "Contains harmful content" } });
new SQLJudge({ expectedKey: "expected_query" });

RAG scorers (8)

For retrieval-augmented generation pipelines. Each scorer reads scenario.parameters for context, question, and answer fields.

| Scorer | What it measures |
| --- | --- |
| ContextPrecision | Are retrieved chunks relevant? |
| ContextRecall | Do retrieved chunks cover the answer? |
| ContextRelevancy | Are retrieved chunks topically relevant? |
| ContextEntityRecall | Do retrieved chunks mention the right entities? |
| Faithfulness | Is the answer grounded in the retrieved context? |
| AnswerRelevancy | Does the answer address the question? |
| AnswerSimilarity | Is the answer semantically similar to the expected? |
| AnswerCorrectness | Is the answer factually correct + relevant? |

new Faithfulness({ model: "paragon-fast" });
new AnswerRelevancy({ model: "paragon-fast" });
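
These scorers only have something to read if your spec's scenarios carry the right parameters. A hypothetical RAG scenario's parameters might look like this (field names follow the description above; the exact shape depends on your spec):

// Hypothetical scenario.parameters for a RAG scenario.
const parameters = {
  question: "When does the Acme contract renew?",
  context: [
    "The Acme contract renews annually on March 1.",
    "Renewal notices go out 30 days in advance.",
  ],
  answer: "It renews on March 1 each year.", // expected answer for the Answer* scorers
};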

Use them as a bundle via the presets.rag() helper.

Embedding scorer (1)

new EmbeddingSimilarity({
  expected: "Renewal email",
  embedder: openaiEmbedder("text-embedding-3-small"),
  threshold: 0.85,
});

Computes cosine similarity between agent_output and expected using the named embedder. Returns the similarity in [0, 1].
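
Cosine similarity itself is just the normalized dot product; a reference sketch (raw cosine lies in [-1, 1], so the scorer presumably clamps or rescales it into [0, 1]):

// Normalized dot product of two equal-length vectors.
function cosine(u: number[], v: number[]): number {
  let dot = 0, nu = 0, nv = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    nu += u[i] * u[i];
    nv += v[i] * v[i];
  }
  return dot / (Math.sqrt(nu) * Math.sqrt(nv));
}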

Sandbox scorers (5)

These wrap spec-level invariants — they convert to a check: block before the experiment runs and don't actually execute client-side. Useful when you want to write your invariants in code instead of YAML:

import { FileExists, FileContains, CommandExits, SQLEquals, LLMJudge } from "@polarityinc/polarity-keystone";
 
new FileExists("output.json");
new FileContains("src/main.ts", { contains: "TODO", notContains: "console.log" });
new CommandExits("npm test", { exitCode: 0 });
new SQLEquals({ service: "db", query: "SELECT count(*) FROM users", equals: 5 });
new LLMJudge("Was the email professional?", { model: "paragon-fast", inputFrom: "smtp.last_request.body" });

Pass them in scores: [...] and they'll be merged with your spec's invariants for execution.
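
For example, one run can mix both kinds (scorer names as used earlier on this page):

await ks.experiments.runAndWait(exp.id, {
  scores: [
    new Factuality({ model: "paragon-fast" }),     // runs client-side after the experiment
    new FileExists("output.json"),                 // merged into the spec's invariants
    new CommandExits("npm test", { exitCode: 0 }), // merged into the spec's invariants
  ],
});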

Presets

Three opinionated bundles for common agent shapes:

import { presets } from "@polarityinc/polarity-keystone";
 
await ks.experiments.runAndWait(exp.id, {
  scores: presets.chat(),               // factuality + answer-relevancy + moderation
});
 
await ks.experiments.runAndWait(exp.id, {
  scores: presets.rag(),                // context-precision + recall + faithfulness + answer-relevancy
});
 
await ks.experiments.runAndWait(exp.id, {
  scores: presets.agentWithTools(),     // json-validity + factuality + answer-relevancy
});

Each accepts a model option that is threaded into its LLM-judge scorers (default paragon-fast).
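
For example (assuming the option is passed as an object, matching the scorer constructors above):

await ks.experiments.runAndWait(exp.id, {
  scores: presets.rag({ model: "paragon-fast" }),
});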

Custom scorers

Define your own:

import { CustomScorer, ScenarioResult, Score } from "@polarityinc/polarity-keystone";
 
class WordCount extends CustomScorer {
  readonly name = "word_count_under_100";
 
  scoreResult(scenario: ScenarioResult): Score {
    // filter(Boolean) drops the empty strings that split produces on
    // leading/trailing whitespace or empty output.
    const words = (scenario.agent_output ?? "").split(/\s+/).filter(Boolean).length;
    const passed = words < 100;
    return {
      name: this.name,
      score: passed ? 1 : 0,
      passed,
      message: `${words} words`,
    };
  }
}
 
await ks.experiments.runAndWait(exp.id, { scores: [new WordCount()] });

The scorer receives a ScenarioResult and returns a Score. Failures are caught: if your scorer throws, that score is recorded as 0 and the experiment continues.
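
That means a scorer can simply throw on malformed output; a sketch (assuming ScenarioResult and Score are exported alongside CustomScorer, as above):

class StrictJSON extends CustomScorer {
  readonly name = "strict_json";
 
  scoreResult(scenario: ScenarioResult): Score {
    // JSON.parse throws on invalid output; the SDK catches the error and
    // records this score as 0 while the rest of the run continues.
    const parsed = JSON.parse(scenario.agent_output ?? "");
    return {
      name: this.name,
      score: 1,
      passed: true,
      message: `${Object.keys(parsed).length} top-level keys`,
    };
  }
}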

Choosing scorers

| Scenario | Try |
| --- | --- |
| Q&A chatbot | Factuality, AnswerRelevancy, Moderation |
| RAG pipeline | presets.rag() |
| Tool-using agent | JSONValidity (on tool args), Factuality (on final answer) |
| Code generation | CommandExits (tests pass), FileContains (no debug logs), LLMJudge (code quality) |
| Email drafting | Levenshtein (vs. expected draft), LLMJudge (tone), SemanticListContains (key phrases) |
| Data extraction | JSONDiff (vs. expected structure), ExactMatch on key fields |
| Translation | Translation (LLM-judge), EmbeddingSimilarity |
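
For example, a code-generation run might stack all three scorers from that row (a sketch; the judge prompt is illustrative):

await ks.experiments.runAndWait(exp.id, {
  scores: [
    new CommandExits("npm test", { exitCode: 0 }),
    new FileContains("src/main.ts", { notContains: "console.log" }),
    new LLMJudge("Is the code idiomatic and readable?", { model: "paragon-fast" }),
  ],
});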

How scorers post to the dashboard

Every scorer exposes toRule() (TS) / to_rule() (Python) which converts it to a server-side ScoreRule. When you call ks.scoring.createRule(...) and then ks.scoring.scoreExperiment(expId, ruleIds), the same scoring logic runs server-side over the trace data — useful for retroactively adding a new metric to old experiments.

const factuality = new Factuality({ model: "paragon-fast" });
const rule = await ks.scoring.createRule(factuality.name, factuality.ruleType, factuality.ruleConfig);
await ks.scoring.scoreExperiment("exp-old-id", [rule.id]);
const scores = await ks.scoring.getScores("exp-old-id");