GenAI Reliability Framework
Production-grade LLM evaluation harness for regulated medical and financial workflows — with bootstrapped CI gates, deterministic grounding checks, and OSFI E-23 alignment.
Pipeline Architecture
5-node LangGraph DAG — deterministic grounding gate before any LLM judge call
Deploying LLMs in regulated industries — healthcare, finance, legal — requires proof that outputs are accurate and traceable. Traditional CI/CD pipelines test whether code breaks. They have no mechanism to detect when a model's judgment degrades after a prompt change or model upgrade.
Built a LangGraph multi-agent evaluation pipeline that extracts entities from model outputs, verifies them deterministically against source documents, then scores the grounded outputs with a cross-family LLM judge. Bootstrapped confidence intervals (1,000 resamples) gate CI/CD: a PR fails only if accuracy regresses by a statistically significant margin.
- ✓ 93.3% medical accuracy [CI: 90.8–96.0%]
- ✓ 93.5% financial accuracy [CI: 89.2–97.0%]
- ✓ 100% factual grounding across 50 test cases
- ✓ CI/CD gate passing on both domains
- ✓ OSFI E-23 model risk alignment
Key Engineering Decisions
Deterministic Grounding Before LLM Judge
spaCy + regex entity extraction verifies every number, drug name, and date against source documents before any LLM judge call. This provides traceable, auditable evidence of the kind OSFI E-23 requires. Only outputs that pass grounding proceed to the judge, which saves judge cost and keeps outputs with unverifiable claims from being scored at all.
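As a rough illustration of the gate described above, the following is a minimal sketch that uses only the standard library. It covers numbers and ISO-style dates with regex; drug-name matching, which the pipeline handles with spaCy, is left out. The patterns, function names, and the `judge` callable are hypothetical and are not taken from the actual harness.

```python
import re

# Hypothetical patterns: the write-up names spaCy + regex but does not show them.
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?%?")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text: str) -> set[str]:
    """Collect ISO dates and numeric tokens (with an optional % sign)."""
    dates = set(DATE_RE.findall(text))
    # Blank out dates first so their digits are not re-counted as numbers.
    remainder = DATE_RE.sub(" ", text)
    numbers = set(NUMBER_RE.findall(remainder))
    return dates | numbers


def grounding_check(output: str, source: str) -> dict:
    """Return a pass/fail flag plus the entities missing from the source."""
    missing = sorted(extract_entities(output) - extract_entities(source))
    return {"grounded": not missing, "unsupported": missing}


def evaluate(output: str, source: str, judge) -> dict:
    """Call the judge only when every extracted entity is grounded."""
    report = grounding_check(output, source)
    if not report["grounded"]:
        return {**report, "judge_score": None}  # judge call skipped
    return {**report, "judge_score": judge(output, source)}
```

With a source that mentions "500 mg" and "2024-03-01", an output claiming "850 mg" would come back with `unsupported == ["850"]` and no judge call, and that list is the kind of record an auditor could inspect.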
Bootstrapped Confidence Intervals
Every metric is reported with a 95% CI from bootstrap resampling (1,000 resamples). The CI gate fails a PR only when the accuracy regression is statistically significant, not merely numerically lower. An 84%→86% change with overlapping CIs is noise, and the system treats it as such.
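The overlap rule can be sketched as follows. This is one plausible reading of the gate, not the harness's actual code: the `MetricCI` container and the `gate` function are hypothetical, and `cis_overlap` is an assumed definition for the helper of that name called in the scorer code under Technical Deep Dive, where its body is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCI:
    """Point estimate with a confidence interval (hypothetical container)."""
    mean: float
    ci_lower: float
    ci_upper: float


def cis_overlap(a: MetricCI, b: MetricCI) -> bool:
    """True when the two closed intervals share at least one point."""
    return a.ci_lower <= b.ci_upper and b.ci_lower <= a.ci_upper


def gate(baseline: MetricCI, candidate: MetricCI) -> str:
    """Fail only when the candidate is lower and the intervals are disjoint."""
    if candidate.mean < baseline.mean and not cis_overlap(baseline, candidate):
        return "FAIL"
    return "PASS"
```

Requiring disjoint intervals is a conservative test: two intervals can overlap slightly while the difference is still significant under a direct test, so this rule errs toward passing borderline changes rather than blocking them.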
Cross-Family Anti-Bias Judging
During calibration against human labels, GPT-4o as judge reached κ=0.71 on Claude outputs but κ=0.84 on its own outputs, a measurable self-evaluation inflation. The judge therefore always comes from a different provider family than the model under test, and its reliability is checked with Cohen's kappa against the human-labelled set.
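For reference, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance from each rater's label frequencies: κ = (p_o − p_e) / (1 − p_e). A minimal standard-library sketch follows; the function name and the label encoding are illustrative assumptions rather than the harness's calibration code.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa for two raters labelling the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement from each rater's marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    p_e = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

For example, if judge and human agree on 8 of 10 binary labels and each uses both labels half the time, then p_o = 0.8, p_e = 0.5, and κ = 0.6. Running the same calculation separately on self-judged and cross-family outputs is what makes a gap like 0.84 versus 0.71 visible.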
OSFI E-23 Model Risk Alignment
Canadian federally regulated financial institutions must validate AI outputs as model outputs under OSFI Guideline E-23 by May 2027. This framework provides the traceable validation evidence, performance benchmarking, and CI/CD regression gating that OSFI E-23 requires for non-deterministic AI systems.
Technical Deep Dive
    # Bootstrap CI scorer: statistical rigour for the CI gate.
    # BootstrappedMetric and cis_overlap are defined elsewhere in the harness (not shown).
    import numpy as np


    def bootstrap_metric(values, n_iterations=1000, confidence=0.95):
        arr = np.array(values)
        rng = np.random.default_rng(seed=42)  # fixed seed keeps CI runs reproducible
        # Resample with replacement and record the mean of each resample.
        boot_means = np.array([
            rng.choice(arr, size=len(arr), replace=True).mean()
            for _ in range(n_iterations)
        ])
        # Percentile interval: cut alpha/2 from each tail of the bootstrap means.
        alpha = 1.0 - confidence
        ci_lower = np.percentile(boot_means, 100 * (alpha / 2))
        ci_upper = np.percentile(boot_means, 100 * (1 - alpha / 2))
        return BootstrappedMetric(
            mean=float(arr.mean()),
            ci_lower=float(ci_lower),
            ci_upper=float(ci_upper),
            n_samples=len(values),
        )


    def is_regression(baseline, candidate, threshold=0.02):
        drop = baseline.mean - candidate.mean
        if drop <= 0 or drop < threshold:
            return False  # better, or below the noise floor
        return not cis_overlap(baseline, candidate)  # only flag if significant

Results
| Domain | Accuracy | 95% CI | Grounding | Cost/Call | CI Gate |
|---|---|---|---|---|---|
| Medical Q&A (30 cases) | 93.3% | [90.8%, 96.0%] | 100% | $0.0002 | ✓ PASS |
| Financial Compliance (20 cases) | 93.5% | [89.2%, 97.0%] | 100% | $0.0002 | ✓ PASS |
The confidence intervals on medical vs. financial accuracy overlap, so there is no evidence of a statistically significant domain gap. One model, two regulated domains, consistent reliability.
Lessons Learned
Judge Inflation Bias is Real
During calibration, GPT-4o as judge scored its own outputs significantly higher than outputs from other model families. Cohen's kappa dropped from 0.84 (self-judging) to 0.71 (cross-family). The fix — always use a different provider family as judge — was simple once discovered, but the bias is invisible without a human-labelled calibration set.
Statistical Significance Matters More Than Raw Numbers
Early runs showed accuracy jumping from 84% to 86% after a prompt tweak. Without bootstrapped CIs, this looked like an improvement. With CIs, the intervals overlapped completely — the change was noise. This is the core insight: production AI systems need statistical rigour, not just accuracy scores.
Deterministic Gates Are Cheaper and More Auditable
Running LLM-as-judge on every output is expensive. By running deterministic entity verification first and only passing grounded outputs to the judge, judge API costs dropped by ~40% while adding an auditable paper trail that regulators can inspect.