CanFinBench
The first public LLM benchmark for Canadian financial regulatory compliance, covering OSFI E-23, OSFI B-20, FINTRAC, IFRS 9, Basel III, PIPEDA, and CASL.
Canadian financial institutions are deploying LLMs in regulated workflows such as mortgage underwriting, AML detection, credit decisions, and compliance checks. Until now, no standardized public benchmark existed to evaluate whether these models actually understand Canadian regulatory frameworks before deployment.
CanFinBench provides 57 expert-validated evaluation cases across 7 Canadian regulatory domains, grounded in primary regulatory text with citations to specific guideline sections. Three task archetypes test progressively harder capabilities: MCQ governance reasoning, scenario-based risk judgment, and compliance-drift red-teaming.
- ✓ First public Canadian financial LLM benchmark
- ✓ Directly addresses OSFI E-23 validation needs
- ✓ eval.yaml for HF Community Evals integration
- ✓ CC BY 4.0 — open for research and commercial use
- ✓ Bilingual EN/FR roadmap for v0.2
Three Task Archetypes
MCQ Governance Reasoning
Multiple-choice questions testing core regulatory logic, model lifecycle mapping, and boundary conditions. Each item cites a specific guideline clause.
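As a rough illustration of how an MCQ item with a clause citation might be represented and scored, here is a minimal Python sketch. The `MCQItem` fields and the `score_mcq` helper are hypothetical names chosen for this example, not the benchmark's actual schema, and the scoring assumes the model is prompted to answer with the option letter first.

```python
from dataclasses import dataclass


# Hypothetical item layout; the real CanFinBench field names may differ.
@dataclass(frozen=True)
class MCQItem:
    question: str
    choices: dict[str, str]  # option letter -> option text
    answer: str              # gold option letter
    citation: str            # guideline clause the item is grounded in


def score_mcq(item: MCQItem, model_output: str) -> bool:
    """Return True when the output starts with the gold option letter.

    Assumes the prompt asks the model to reply with the letter first.
    """
    letter = model_output.strip()[:1].upper()
    return letter in item.choices and letter == item.answer


# Placeholder content only; not a real benchmark item.
example = MCQItem(
    question="<question text>",
    choices={"A": "<option A>", "B": "<option B>"},
    answer="B",
    citation="<guideline and section>",
)
print(score_mcq(example, "B) <explanation>"))
```

Strict first-letter matching is deliberately conservative: a free-form answer that never states a letter is scored as incorrect instead of being guessed at.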
Scenario-Based Risk Judgment
Long-form scenarios simulating real audit logs, model drift events, and compliance reviews. Tests the model's ability to reason like a compliance officer.
Compliance-Drift Red-Teaming
Scenarios where a business instruction embeds compliance violations. Tests whether the model can identify PIPEDA, CASL, and E-23 violations in realistic AI deployment requests.
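One simple way to score a red-teaming case is to check how many of the expected frameworks the model's response names. The sketch below is an assumption about how such scoring could work, not the benchmark's own grader: `FRAMEWORKS`, `flagged_frameworks`, and `violation_recall` are hypothetical names, and plain keyword matching only shows that a framework was mentioned, not that the violation was explained correctly.

```python
# Frameworks named in the red-teaming description; keyword matching is a
# deliberately naive stand-in for a real grader (assumption).
FRAMEWORKS = ("PIPEDA", "CASL", "E-23")


def flagged_frameworks(response: str) -> set[str]:
    """Return the frameworks that the response mentions by name."""
    text = response.upper()
    return {name for name in FRAMEWORKS if name in text}


def violation_recall(expected: set[str], response: str) -> float:
    """Fraction of the expected violations that the response names."""
    if not expected:
        raise ValueError("expected must contain at least one framework")
    found = flagged_frameworks(response) & expected
    return len(found) / len(expected)


# Illustrative response text, not real model output.
reply = "This request conflicts with PIPEDA consent requirements."
print(violation_recall({"PIPEDA", "CASL"}, reply))
```

A model that simply complies with the embedded instruction names no framework and scores zero, which is the failure mode this archetype is meant to surface.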
Regulatory Domains
OSFI E-23, FINTRAC, OSFI B-20, PIPEDA, CASL, IFRS 9, Basel III
Why CanFinBench Exists
Every Canadian federally regulated financial institution must validate AI model outputs under OSFI Guideline E-23 by May 2027. The core challenge: non-deterministic AI systems cannot be validated with deterministic tests.
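One common way to handle non-determinism is to run the same case many times and report a pass rate with a confidence interval instead of a single pass/fail. The sketch below uses the Wilson score interval for that purpose. It is an illustration of the general idea, not a validation method prescribed by E-23 or by CanFinBench, and `wilson_interval` is a name chosen for this example.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0 or not 0 <= passes <= trials:
        raise ValueError("need trials > 0 and 0 <= passes <= trials")
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Example: the model passed 18 of 20 repeated runs of one case (made-up counts).
low, high = wilson_interval(18, 20)
print(f"pass rate 0.90, interval [{low:.3f}, {high:.3f}]")
```

A reviewer could then compare the lower bound against an internally agreed threshold, so that a model is accepted only when it passes reliably across runs and not just once.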
Existing financial LLM benchmarks (FinQA, PIXIU/FinBen, FinEval) focus on US SEC filings, Chinese regulations, or general numerical reasoning. None encode Canadian regulatory frameworks. CanFinBench fills this gap.
The benchmark is designed around the capability-compliance gap identified in research: LLMs score well on factual regulatory QA but degrade on compliance reasoning, which is exactly the capability that banks need before deploying AI in regulated decisions.
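The capability-compliance gap can be made concrete by scoring each task archetype separately and comparing the results. The following sketch assumes per-item results are available as `(archetype, is_correct)` pairs. The archetype labels and the toy data are invented for illustration and are not benchmark results.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_archetype(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute accuracy separately for each task archetype."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for archetype, is_correct in results:
        totals[archetype] += 1
        correct[archetype] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


# Toy data only: labels and outcomes are made up for this example.
toy_results = [
    ("mcq", True), ("mcq", True),
    ("red_team", True), ("red_team", False),
]
scores = accuracy_by_archetype(toy_results)
gap = scores["mcq"] - scores["red_team"]
print(scores, gap)
```

Reporting the per-archetype scores alongside the gap keeps a strong factual-QA result from masking weak compliance reasoning in a single aggregate number.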
Roadmap
- Current release: 57 items covering OSFI E-23, FINTRAC, B-20, PIPEDA, CASL, IFRS 9, and Basel III, with an eval.yaml for HF Community Evals.
- v0.2: 200 items, with a French split, expanded IFRS 9 and Basel III domains, and more compliance-drift red-teaming cases.
- Later release: 500+ items, with a private held-out leaderboard test set, an HF Spaces leaderboard, and an arXiv paper submission.
Use CanFinBench
Open source. CC BY 4.0. Pull the dataset and test your models today.