diagnostic · open source · runs locally

Is your AI understanding your data — or guessing?

Stop optimizing for theater. Enterprise production exposes RAG that sounds grounded but breaks when the context changes.

Run controlled perturbations against your real pipeline. See whether it stays stable when noise changes, and reacts when governing evidence changes. No labels. No judge model. Runs locally.

Whitepaper

The method behind Kelvin: controlled perturbations, invariance and sensitivity, and why these signal evidence-tracking instead of presentation effects.

The proof · what it catches

What Kelvin tells you

A pipeline can sound grounded and still be reacting to formatting, ordering, or noise. Kelvin surfaces those failures.

  • failure · 01

    The answer changed when only formatting or ordering changed.

  • failure · 02

    The answer stayed the same when the governing rule changed.

  • failure · 03

    The pipeline looked grounded — but it was reacting to presentation, not evidence.

$ kelvin check · sample venture-scoring report · SampleVenture-A, SampleVenture-B
┌─ Kelvin Report ──────────────────────────────────────────┐
│                                                          │
│   2 cases · 14 perturbations · 2m 59s                    │
│                                                          │
│   Invariance    0.70                                     │
│   Does your pipeline stay calm when nothing              │
│   important changes?                    mostly · good    │
│                                                          │
│   Sensitivity   0.50                                     │
│   Does your pipeline react when something                │
│   important changes?                   partial · watch   │
│                                                          │
│   Both signals look healthy. Spot-check                  │
│   kelvin/report.html for per-case anomalies.             │
│                                                          │
│   Diagnostic signals — not truth metrics.                │
│   → kelvin/report.html for per-case drill-down           │
│   2 of 14 perturbations failed (logged in kelvin/).      │
│                                                          │
└──────────────────────────────────────────────────────────┘

Real run · 2 sample venture inputs (SampleVenture-A, SampleVenture-B) scored by a venture-assessment pipeline · decision field stage_assessment.

Stable under noise

invariance

0.70 / 1.00

Does your pipeline stay calm when nothing important changes?

mostly — good

Responds to governing evidence

sensitivity

0.50 / 1.00

Does your pipeline react when something important changes?

partial — watch

What Kelvin caught

Two findings from a single run

Same 14 perturbations, two cases, two different failure modes — each one invisible to held-out accuracy or judge-model scoring.

finding · 01 — sensitivity

healthy

SampleVenture-A followed the governing rule

Baseline: idea. After swapping SampleVenture-A's gate_rule for one whose conditions were met, the decision moved to seed. Surrounding evidence — team, market, traction — was unchanged. The pipeline read the substituted unit and responded.

Inv 0.857 · Sens 1.0 on swap

finding · 02 — position bias

caught

SampleVenture-B reacted to where the rule appeared

Baseline: pre-seed. All 3 reorder perturbations flipped to seed when the ## Gate Rule section appeared first — the pipeline weighted it disproportionately by position. The content was identical.

Inv 0.571 · Sens stable

Aggregate across both cases: invariance 0.70, sensitivity 0.50, with 2 of 14 perturbations failing (HTTP 500 from the pipeline, logged for inspection). A constant-output pipeline that always emitted pre-seed would score Inv 1.0, Sens 0.0 — looking perfectly stable while being operationally useless. That's why the pair matters.

Why it matters

Most RAG evals reward plausible answers. Kelvin tracks paired invariance + sensitivity as a signature of understanding.

A pipeline can sound grounded and still ignore the evidence that should change the answer. Kelvin probes whether yours behaves like it understands what matters.

behavior · 01

Stable when noise changes

Reordering or padding irrelevant material shouldn't move the answer.

behavior · 02

Responsive when the signal changes

Replacing the governing rule should move the answer.

behavior · 03

Grounded in your actual pipeline

Not a synthetic benchmark. Kelvin probes the system you already run.

How it works

How Kelvin works

Your pipeline reads a context and makes a decision. That context has structure — sections, clauses, rules, records. Kelvin exploits that structure to derive test cases from the corpus itself. No labels required. The structure writes the tests.

Reorder

reorder

Shuffle the sections. The facts inside each section are unchanged. If the decision moves, your pipeline is reading position — not evidence.

before

interview · interview · gate_rule · budget

after

gate_rule · interview · budget · interview

Pad

pad

Inject typed sections from other cases. Structurally valid, factually irrelevant. If the decision moves, your pipeline is counting signal volume — not reading it.

before

interview · interview · gate_rule · budget

after

interview · foreign · interview · gate_rule · foreign · budget

Swap

swap

Replace the governing section with a different valid one from another case. One fact has changed — the decision must follow. If it doesn't, your pipeline isn't consulting the evidence that governs the outcome.

before

interview · interview · gate_rule · budget

after

interview · interview · gate_rule′ · budget
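The three perturbation families above can be sketched as list operations on typed sections. This is an illustrative sketch only; the function names, the `(section_type, content)` tuple representation, and the sampling details are assumptions, not Kelvin's actual API.

```python
import random

def reorder(sections, rng):
    """Shuffle section order; the content of each section is unchanged."""
    shuffled = sections[:]
    rng.shuffle(shuffled)
    return shuffled

def pad(sections, foreign_sections, rng):
    """Inject typed sections from other cases at random positions.
    Structurally valid, factually irrelevant to this case."""
    padded = sections[:]
    for extra in foreign_sections:
        padded.insert(rng.randrange(len(padded) + 1), extra)
    return padded

def swap(sections, replacement):
    """Replace the governing section with a type-matched one
    drawn from another case."""
    gov_type = replacement[0]
    return [replacement if typ == gov_type else (typ, body)
            for typ, body in sections]

base = [("interview", "a"), ("interview", "b"),
        ("gate_rule", "if traction then seed"), ("budget", "c")]

perturbed = swap(base, ("gate_rule", "if team is strong then idea"))
# only the gate_rule section differs from base; the decision should follow
```

Note that `reorder` and `pad` leave every original fact in place, while `swap` changes exactly one governing fact, which is what lets the invariance and sensitivity scores below be read as a pair.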

The measurement

Two scores. One question.

Kelvin measures two signals on your decision field — the scalar output your pipeline produces.

The pair is what matters. A constant-output pipeline scores Inv = 1.0 and Sens = 0.0. That's not stability — that's indifference. Kelvin calls that regime Flat. Only high invariance with high sensitivity is Grounded.

Invariance

reorder · pad
Inv = 1 − mean( d(baseline, perturbed) )

A score of 1.0 means the decision never moved when it shouldn't have. A score below 0.7 means your pipeline is reacting to presentation.

Sensitivity

swap
Sens = mean( d(baseline, perturbed) )

A score near 1.0 means the decision moved when the governing evidence changed. A score near 0.0 means it didn't — the pipeline is ignoring the evidence that should govern the outcome.
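With an exact-match distance on a categorical decision field (an assumption made for illustration; the whitepaper defines d, which may be graded for ordinal or numeric fields), the two formulas above reduce to:

```python
def d(a, b):
    """0/1 distance on a categorical decision field (illustrative assumption)."""
    return 0.0 if a == b else 1.0

def invariance(baseline, noise_outputs):
    """Inv = 1 - mean(d(baseline, perturbed)) over reorder/pad runs."""
    return 1.0 - sum(d(baseline, o) for o in noise_outputs) / len(noise_outputs)

def sensitivity(baseline, swap_outputs):
    """Sens = mean(d(baseline, perturbed)) over swap runs."""
    return sum(d(baseline, o) for o in swap_outputs) / len(swap_outputs)

# SampleVenture-B style run: 3 of 7 noise perturbations flipped
# pre-seed -> seed, so Inv = 1 - 3/7 ≈ 0.571
inv = invariance("pre-seed", ["seed"] * 3 + ["pre-seed"] * 4)
```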

The four regimes

Inv high · Sens high

Grounded

Inv high · Sens low

Flat

Inv low · Sens high

Brittle

Inv low · Sens low

Unstable
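The four-regime lookup can be sketched as a single threshold comparison. The 0.7 cut-off is borrowed from the invariance guidance above and applied to both axes as an assumption; the real estimator's cut-offs may differ.

```python
def regime(inv, sens, hi=0.7):
    """Classify an (Inv, Sens) pair into one of the four regimes.
    The hi=0.7 threshold is an illustrative assumption."""
    if inv >= hi and sens >= hi:
        return "Grounded"
    if inv >= hi:
        return "Flat"      # stable but indifferent
    if sens >= hi:
        return "Brittle"   # responsive but noisy
    return "Unstable"

regime(1.0, 0.0)  # the constant-output pipeline lands in "Flat"
```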

Interactive · simulator

Try it: dial the perturbations

Adjust how many reorder, pad, and swap perturbations Kelvin runs — and how your pipeline behaves. Watch the two scores and the regime move in response. Illustrative model, not the real estimator.

Perturbation counts

Reorder

Shuffle sections — should not move the decision.

4

Pad

Inject typed sections from other cases.

4

Swap

Replace the governing section — should move it.

4

Pipeline behavior

Drift under noise

How much the decision moves when noise changes. Lower is better.

0.18

Response to governing change

How much the decision moves when the governing section is swapped.

0.65

Invariance

0.83

mostly stable

Sensitivity

0.65

partially responsive

Regime

Grounded

High invariance and high sensitivity — the pipeline ignores noise and follows the governing evidence.

What you get


No service. No accounts. The whole tool fits in your terminal.

  • Local CLI

    One command, no service.

  • Example cases

    Bundled sample case, runs out of the box.

  • Inspectable reports

    Terminal memo + standalone HTML.

  • Plays with your pipeline

    Anything callable from the command line.

  • No labels needed

    Unsupervised by design.

  • Open source

    Apache 2.0, runs offline.

Low friction

Try Kelvin on the pipeline you already have

You don't need a new stack. You don't need labels. You don't need to rewrite your app. If your pipeline can be run from the command line, Kelvin can probe it.

Install · 2 commands

Run it locally in under a minute

A sample case is bundled with the install. No setup, no service, no account.

Drop a kelvin.yaml in your project and point it at a folder of case files.
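For illustration only, a hypothetical kelvin.yaml might look like this; none of these key names are confirmed by this page, so check the bundled sample case for the actual schema.

```yaml
# Hypothetical sketch — key names are illustrative, not Kelvin's documented schema.
cases: ./cases/              # folder of markdown case files
pipeline: ./run_pipeline.sh  # anything callable from the command line
decision_field: stage_assessment
sections:                    # declared section headers (v1 typing)
  - Gate Rule
  - Interview
  - Budget
```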

Honest scope

What Kelvin doesn't claim

Kelvin is a diagnostic, not a verdict. Knowing what it can't tell you is part of using it well.

  • not a truth metric

    Kelvin doesn't determine whether an answer is correct. It only asks whether the output tracks evidence in the expected direction under controlled perturbations.

  • not a judge model

    There's no external model grading answers. Signals come from how the pipeline responds to structural changes in its own input.

  • swaps are type-matched, not semantically validated

    A governing-unit substitution guarantees type compatibility, not that every swap is semantically well-formed in context. Sensitivity is a diagnostic, not a proof.

  • scoring ignores prose

    Kelvin scores a designated structured decision field. Free-form rationales are recorded for inspection but don't contribute to the diagnostic score.

  • v1 uses declared section headers

    Types come from markdown section headers you declare, not from an inferred schema. Practical, but a lightweight approximation of the full structural-oracle argument.

Read the full Limitations and Future Work sections in the whitepaper ↗.

FAQ


Does Kelvin prove correctness?
No. It diagnoses whether the pipeline appears to follow evidence rather than presentation.
Is Kelvin a judge model?
No. It uses controlled perturbations, not an external model grading answers.
Does Kelvin lock you into one framework?
No. Anything callable from the command line works.
What's a good result?
Stable under irrelevant change. Responsive when the governing evidence changes.