Flagship AI · Open Source · Hugging Face

CanFinBench

The first public LLM benchmark for Canadian financial regulatory compliance, covering OSFI E-23, OSFI B-20, FINTRAC, IFRS 9, Basel III, PIPEDA, and CASL.

57 Expert-Validated Cases
7 Regulatory Domains
3 Task Archetypes
OSFI E-23 Deadline: May 2027
Python · Hugging Face · OSFI E-23 · FINTRAC · IFRS 9 · Basel III · PIPEDA · CASL · eval.yaml · CC BY 4.0

1 · Problem

Canadian financial institutions are deploying LLMs in regulated workflows — mortgage underwriting, AML detection, credit decisions, compliance checks. But no standardized public benchmark existed to evaluate whether these models actually understand Canadian regulatory frameworks before deployment.

2 · Solution

CanFinBench provides 57 expert-validated evaluation cases across 7 Canadian regulatory domains, grounded in primary regulatory text with citations to specific guideline sections. Three task archetypes test progressively harder capabilities: MCQ governance reasoning, scenario-based risk judgment, and compliance-drift red-teaming.

3 · Impact
  • ✓ First public Canadian financial LLM benchmark
  • ✓ Directly addresses OSFI E-23 validation needs
  • ✓ eval.yaml for HF Community Evals integration
  • ✓ CC BY 4.0 — open for research and commercial use
  • ✓ Bilingual EN/FR roadmap for v0.2

Three Task Archetypes

Task A

MCQ Governance Reasoning

Multiple-choice questions testing core regulatory logic, model lifecycle mapping, and boundary conditions. Each item cites a specific guideline clause.

// Example item
Domain: OSFI E-23
Difficulty: Hard
Section: Model Risk Rating
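
As a rough illustration of how Task A items might be represented and scored, the sketch below assumes a record carrying the metadata shown in the example item plus a gold answer letter. The field names, the item IDs, and the `normalize_choice` and `accuracy_by_domain` helpers are hypothetical and are not taken from the CanFinBench schema. The two items are placeholders, not real benchmark content.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    # All field names are hypothetical; the published schema may differ.
    item_id: str
    domain: str       # e.g. "OSFI E-23"
    difficulty: str   # e.g. "Hard"
    section: str      # e.g. "Model Risk Rating"
    answer: str       # gold option letter, e.g. "B"


def normalize_choice(raw: str) -> str:
    """Reduce a raw model reply such as ' b) ' to a bare option letter 'B'."""
    stripped = raw.strip().upper()
    return stripped[:1] if stripped else ""


def accuracy_by_domain(items, predictions):
    """Return {domain: fraction correct}; a missing prediction counts as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.domain] += 1
        if normalize_choice(predictions.get(item.item_id, "")) == item.answer:
            correct[item.domain] += 1
    return {domain: correct[domain] / total[domain] for domain in total}


# Placeholder items and model replies, for illustration only.
items = [
    MCQItem("a-001", "OSFI E-23", "Hard", "Model Risk Rating", "B"),
    MCQItem("a-002", "OSFI E-23", "Hard", "Model Risk Rating", "D"),
]
predictions = {"a-001": " b) ", "a-002": "A"}
scores = accuracy_by_domain(items, predictions)
```

Reporting accuracy per domain rather than as one overall number keeps a weak domain from being hidden by strong results elsewhere.
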

Task B

Scenario-Based Risk Judgment

Long-form scenarios simulating real audit logs, model drift events, and compliance reviews. Tests the model's ability to reason like a compliance officer.

// Example item
Domain: FINTRAC/PCMLTFA
Difficulty: Expert
Section: STR Reporting

Task C

Compliance-Drift Red-Teaming

Scenarios where a business instruction embeds compliance violations. Tests whether the model can identify PIPEDA, CASL, and E-23 violations in realistic AI deployment requests.

// Example item
Domain: PIPEDA/Law 25
Difficulty: Expert
Section: Automated Decisions
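
One possible way to score a Task C response is to check how many of the expected violations the model named. The sketch below is an assumption about how such scoring could work, not the benchmark's documented method. The `extract_flags` and `violation_recall` helpers and the sample response are hypothetical, and keyword matching is deliberately naive: it cannot tell "this violates PIPEDA" from "this does not violate PIPEDA", so a real grader would need something stronger.

```python
def extract_flags(response: str, known_frameworks: tuple[str, ...]) -> set[str]:
    """Naive keyword match: a framework counts as flagged if the response names it."""
    lowered = response.lower()
    return {name for name in known_frameworks if name.lower() in lowered}


def violation_recall(expected: set[str], flagged: set[str]) -> float:
    """Fraction of the expected violations that the model flagged."""
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    return len(expected & flagged) / len(expected)


FRAMEWORKS = ("PIPEDA", "CASL", "E-23")

# Hypothetical model response to a business instruction with embedded violations.
response = (
    "This request conflicts with PIPEDA consent requirements "
    "and with CASL express-consent rules."
)
expected = {"PIPEDA", "CASL", "E-23"}  # hypothetical gold labels for the scenario

flagged = extract_flags(response, FRAMEWORKS)
recall = violation_recall(expected, flagged)
```

Here the response names two of the three expected frameworks, so recall is 2/3. Recall alone rewards a model that flags everything, so a fuller metric would also penalize false alarms.
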

Regulatory Domains

OSFI Guideline E-23 · In force May 2027
Model Risk Management: lifecycle, risk rating, AI governance, explainability

FINTRAC / PCMLTFA · Active
AML/KYC: suspicious transaction reporting, structuring detection, PEP requirements

OSFI Guideline B-20 · Active
Mortgage stress test: MQR, GDS/TDS ratios, LTV limits, renewal rules

IFRS 9 ECL · Since 2018
Expected credit loss staging: SICR, Stage 1/2/3, management overlays

Basel III / OSFI CAR · 2026 update
Capital adequacy: CET1, D-SIB surcharge, output floor deferral

PIPEDA / Quebec Law 25 · Active
Data privacy: consent, automated decision-making, privacy impact assessments

CASL · Active
Anti-spam: express consent, unsubscribe requirements, AI-driven marketing

Why CanFinBench Exists

OSFI E-23 — May 1, 2027

Every federally regulated financial institution in Canada must validate AI model outputs under OSFI Guideline E-23 by May 1, 2027. The core challenge: a non-deterministic AI system can return different outputs for the same input, so a single deterministic pass/fail test cannot validate it.
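
To make the non-determinism point concrete, one common workaround is to sample the same item several times and report the spread instead of a single pass or fail. The sketch below shows that idea only. It is not a method prescribed by OSFI or by CanFinBench, and `consistency_report`, `fake_model`, and the 80% figure are hypothetical stand-ins for a real model call.

```python
import random
from collections import Counter


def consistency_report(sample_answer, gold: str, n_samples: int = 20) -> dict:
    """Call a stochastic answer function n_samples times and summarize the spread."""
    answers = [sample_answer() for _ in range(n_samples)]
    counts = Counter(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return {
        "pass_rate": counts[gold] / n_samples,      # share of samples matching gold
        "majority_answer": majority_answer,          # most frequent answer
        "agreement": majority_count / n_samples,     # share held by that answer
    }


# Stand-in for a real model call: a seeded generator that answers "B"
# with probability 0.8 and "C" otherwise.
rng = random.Random(0)


def fake_model() -> str:
    return "B" if rng.random() < 0.8 else "C"


report = consistency_report(fake_model, gold="B", n_samples=50)
```

A model that is right on average but has low agreement across samples may still be unsuitable for a regulated decision, which is why the pass rate and the agreement are reported separately.
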

Existing financial LLM benchmarks (FinQA, PIXIU/FinBen, FinEval) focus on US SEC filings, Chinese regulations, or general numerical reasoning. None encode Canadian regulatory frameworks. CanFinBench fills this gap.

The benchmark is specifically designed around the capability-compliance gap identified in research: LLMs score well on factual regulatory QA but degrade on compliance reasoning — exactly the capability that banks need before deploying AI in regulated decisions.

Roadmap

v0.1.0 · June 2026 · Live

57 items — OSFI E-23, FINTRAC, B-20, PIPEDA, CASL, IFRS 9, Basel III. eval.yaml for HF Community Evals.

v0.2.0 · Q3 2026 · Planned

200 items — French split added, expanded IFRS 9 + Basel III domains, compliance-drift red-teaming expansion.

v1.0.0 · Q4 2026 · Planned

500+ items — private held-out leaderboard test set, HF Spaces leaderboard, arXiv paper submission.

Use CanFinBench

Open source. CC BY 4.0. Pull the dataset and test your models today.