Flagship AI · OSFI E-23 Aligned

GenAI Reliability Framework

Production-grade LLM evaluation harness for regulated medical and financial workflows — with bootstrapped CI gates, deterministic grounding checks, and OSFI E-23 alignment.

93.3%
Medical Accuracy
93.5%
Finance Accuracy
100%
Grounding Score
$0.0002
Cost per Call
Python · LangGraph · FastAPI · Next.js · OpenAI · Supabase · GCP Vertex AI · GitHub Actions · Vercel

Pipeline Architecture

Retrieve → Generate → Ground ✓ → Judge → Log + CI Gate

5-node LangGraph DAG — deterministic grounding gate before any LLM judge call
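
The control flow of the five stages can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph graph: `run_pipeline` and the four callables passed into it are hypothetical stand-ins for the real nodes, and the only behaviour taken from the description above is that the judge runs only after the grounding gate passes.

```python
from typing import Callable, Dict, Optional


def run_pipeline(
    question: str,
    retrieve: Callable[[str], str],
    generate: Callable[[str, str], str],
    ground: Callable[[str, str], bool],
    judge: Callable[[str, str], float],
) -> Dict[str, object]:
    """Run Retrieve -> Generate -> Ground -> Judge -> Log in order.

    All stage callables are hypothetical stand-ins supplied by the caller.
    """
    trace = []

    context = retrieve(question)
    trace.append("retrieve")

    answer = generate(question, context)
    trace.append("generate")

    grounded = ground(answer, context)
    trace.append("ground")

    # Deterministic gate: the LLM judge is only called for grounded answers.
    score: Optional[float] = None
    if grounded:
        score = judge(question, answer)
        trace.append("judge")

    # The log step always runs so that failed cases are recorded too.
    trace.append("log")
    return {
        "question": question,
        "answer": answer,
        "grounded": grounded,
        "score": score,
        "trace": trace,
    }
```

Passing the stages in as callables keeps the sketch testable with stubs; in a graph framework the same gate would typically be a conditional edge out of the grounding node.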

1 · Problem

Deploying LLMs in regulated industries — healthcare, finance, legal — requires proof that outputs are accurate and traceable. Traditional CI/CD pipelines test whether code breaks. They have no mechanism to detect when a model's judgment degrades after a prompt change or model upgrade.

2 · Solution

Built a LangGraph multi-agent evaluation pipeline that extracts entities from model outputs, verifies them deterministically against source documents, then scores with a cross-family LLM judge. Bootstrapped confidence intervals (1,000 resamples) gate CI/CD: a PR fails if accuracy regresses beyond statistical significance.

3 · Impact
  • ✓ 93.3% medical accuracy [CI: 90.8–96.0%]
  • ✓ 93.5% financial accuracy [CI: 89.2–97.0%]
  • ✓ 100% factual grounding across 50 test cases
  • ✓ CI/CD gate passing on both domains
  • ✓ OSFI E-23 model risk alignment

Key Engineering Decisions

Deterministic Grounding Before LLM Judge

spaCy + regex entity extraction verifies every number, drug name, and date against source documents before any LLM judge call. This provides traceable, auditable evidence — the kind OSFI E-23 requires. Only grounding-passed outputs proceed to the judge, saving cost and eliminating unverifiable hallucinations.
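
As a rough illustration of the regex half of that check, the sketch below extracts numbers and ISO-format dates from an answer and reports any that do not appear in the source text. It is an assumption-laden simplification: the function names and patterns are hypothetical, drug-name matching and the spaCy step are omitted, and values are compared as literal strings with no unit or format normalisation.

```python
import re

# Hypothetical patterns; the project's actual extraction rules are not shown.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Numbers with optional decimals and percent sign, not embedded in a word
# (so the "1" in "HbA1c" is not treated as a value).
NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?%?(?!\w)")


def extract_entities(text: str) -> set:
    """Return the set of date and number strings found in the text."""
    dates = set(DATE_RE.findall(text))
    # Blank out dates first so "2024-01-15" is not split into three numbers.
    remainder = DATE_RE.sub(" ", text)
    numbers = set(NUMBER_RE.findall(remainder))
    return dates | numbers


def check_grounding(answer: str, source: str) -> dict:
    """Flag every entity in the answer that is absent from the source."""
    unsupported = sorted(extract_entities(answer) - extract_entities(source))
    return {"grounded": not unsupported, "unsupported": unsupported}
```

The returned `unsupported` list is what makes the gate auditable: each failure names the exact value that could not be traced to the source document.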

Bootstrapped Confidence Intervals

Every metric includes a 95% CI from bootstrap resampling (1,000 resamples). The CI gate fails a PR only when the accuracy regression is statistically significant, not merely numerically lower. An 84%→86% change with overlapping CIs is noise, and the system treats it as such.

Cross-Family Anti-Bias Judging

During calibration, GPT-4o as judge achieved κ=0.71 on Claude outputs but κ=0.84 on its own outputs — measurable self-evaluation inflation. The judge is always from a different provider family than the model under test. Cohen's kappa calibration against human labels ensures judge reliability.
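
For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for a judge's labels against human labels. The function name and the example labels are illustrative assumptions, not the project's calibration data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 1 = "answer correct", 0 = "answer incorrect".
human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
kappa = cohens_kappa(human, judge)  # 8/10 observed, 0.52 expected -> about 0.583
```

Computing kappa separately for each (judge, model-under-test) pairing is what lets the two figures quoted above be compared.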

OSFI E-23 Model Risk Alignment

Canadian federally regulated financial institutions must validate AI outputs as model outputs under OSFI Guideline E-23 by May 2027. This framework provides the traceable validation evidence, performance benchmarking, and CI/CD regression gating that OSFI E-23 requires for non-deterministic AI systems.

Technical Deep Dive

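# Bootstrap CI scorer: statistical rigour for the CI gate

from dataclasses import dataclass

import numpy as np


@dataclass
class BootstrappedMetric:
    # Assumed shape; the original definition is not shown on this page.
    mean: float
    ci_lower: float
    ci_upper: float
    n_samples: int


def bootstrap_metric(values, n_iterations=1000, confidence=0.95):
    arr = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed=42)  # fixed seed keeps CI runs reproducible

    # Resample with replacement and record the mean of each resample
    boot_means = np.array([
        rng.choice(arr, size=len(arr), replace=True).mean()
        for _ in range(n_iterations)
    ])

    # Percentile interval: cut alpha/2 from each tail
    alpha = 1.0 - confidence
    ci_lower = np.percentile(boot_means, 100 * (alpha / 2))
    ci_upper = np.percentile(boot_means, 100 * (1 - alpha / 2))

    return BootstrappedMetric(
        mean=float(arr.mean()),
        ci_lower=float(ci_lower),
        ci_upper=float(ci_upper),
        n_samples=len(arr),
    )


def cis_overlap(a, b):
    # Assumed helper (not shown in the original): True if the intervals intersect
    return a.ci_lower <= b.ci_upper and b.ci_lower <= a.ci_upper


def is_regression(baseline, candidate, threshold=0.02):
    drop = baseline.mean - candidate.mean
    if drop <= 0 or drop < threshold:
        return False  # better or below noise floor
    return not cis_overlap(baseline, candidate)  # only flag if significant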

Results

| Domain | Accuracy | 95% CI | Grounding | Cost/Call | CI Gate |
|---|---|---|---|---|---|
| Medical Q&A (30 cases) | 93.3% | [90.8%, 96.0%] | 100% | $0.0002 | ✓ PASS |
| Financial Compliance (20 cases) | 93.5% | [89.2%, 97.0%] | 100% | $0.0002 | ✓ PASS |

The confidence intervals on medical and financial accuracy overlap, so at these sample sizes (30 and 20 cases) there is no evidence of a statistically significant domain gap. One model, two regulated domains, consistent reliability.
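
Interval overlap is a conservative way to compare two groups. A more direct check is to bootstrap the difference in means and see whether its interval contains zero. The sketch below shows one way to do that with the standard library only; `bootstrap_diff_ci` is a hypothetical helper, and since the per-case scores are not published here, it has to be called with your own score lists.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_diff_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_iterations: int = 1000,
    confidence: float = 0.95,
    seed: int = 42,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)  # fixed seed for reproducible CI runs
    diffs = []
    for _ in range(n_iterations):
        # Resample each group independently, with replacement.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(resample_a) - mean(resample_b))

    diffs.sort()
    alpha = 1.0 - confidence
    # Index-based percentiles; clamp the upper index to stay in range.
    lower_idx = int((alpha / 2) * n_iterations)
    upper_idx = min(n_iterations - 1, int((1 - alpha / 2) * n_iterations))
    return diffs[lower_idx], diffs[upper_idx]


def differs_significantly(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """True when the bootstrap interval for the difference excludes zero."""
    lower, upper = bootstrap_diff_ci(scores_a, scores_b)
    return lower > 0 or upper < 0
```

With small groups such as 20 or 30 cases the interval will be wide, so a result of "no significant difference" should be read as limited evidence rather than proof of equal performance.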

Lessons Learned

Judge Inflation Bias is Real

During calibration, GPT-4o as judge scored its own outputs significantly higher than outputs from other model families. Cohen's kappa dropped from 0.84 (self-judging) to 0.71 (cross-family). The fix — always use a different provider family as judge — was simple once discovered, but the bias is invisible without a human-labelled calibration set.

Statistical Significance Matters More Than Raw Numbers

Early runs showed accuracy jumping from 84% to 86% after a prompt tweak. Without bootstrapped CIs, this looked like an improvement. With CIs, the intervals overlapped completely — the change was noise. This is the core insight: production AI systems need statistical rigour, not just accuracy scores.

Deterministic Gates Are Cheaper and More Auditable

Running LLM-as-judge on every output is expensive. By running deterministic entity verification first and only passing grounded outputs to the judge, judge API costs dropped by ~40% while adding an auditable paper trail that regulators can inspect.

Explore the Framework

Live leaderboard, full source code, and OSFI E-23 documentation.
