Data Engineering

PDF-to-SQL Pipeline

AI-powered document extraction API that converts unstructured financial PDFs into structured JSON: local-first, at $0.0005 per document.

The Problem

Financial institutions that process bank statements, invoices, and clinical notes manually face high error rates and slow turnaround. Sending those sensitive documents to cloud OCR services instead creates data sovereignty risks: existing solutions cost $0.01 to $0.05 per document and require data to leave the organisation's infrastructure.

The Solution

Built a 3-layer pipeline: Docling OCR runs locally on CPU (no cloud, no GPU, original PDF never leaves the machine), Gemini Flash-Lite maps extracted text to typed structured JSON via few-shot prompting, and a deterministic validation engine enforces business rules with zero LLM involvement in the trust layer.
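As an illustration of the third layer, here is a minimal sketch of what a deterministic validation step could look like for a bank statement. The field names (`opening_balance`, `closing_balance`, `transactions`, `amount`, `date`) and the balance-reconciliation rule are assumptions made for this example, not the project's actual schema or rule set. The point is only that every check is plain arithmetic or parsing, with no model call involved.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation


@dataclass
class ValidationResult:
    valid: bool
    errors: list[str]


def _to_decimal(value, field, errors):
    """Parse a monetary value exactly; record an error instead of raising."""
    try:
        return Decimal(str(value))
    except (InvalidOperation, ValueError):
        errors.append(f"{field}: not a valid amount: {value!r}")
        return None


def validate_bank_statement(doc: dict) -> ValidationResult:
    """Apply hypothetical business rules to extracted JSON (assumed schema)."""
    errors: list[str] = []

    # Rule 1: required top-level fields must be present.
    for field in ("opening_balance", "closing_balance", "transactions"):
        if field not in doc:
            errors.append(f"missing field: {field}")
    if errors:
        return ValidationResult(False, errors)

    opening = _to_decimal(doc["opening_balance"], "opening_balance", errors)
    closing = _to_decimal(doc["closing_balance"], "closing_balance", errors)

    # Rule 2: every transaction needs a parseable amount and an ISO date.
    total = Decimal("0")
    for i, txn in enumerate(doc["transactions"]):
        amount = _to_decimal(txn.get("amount"), f"transactions[{i}].amount", errors)
        if amount is not None:
            total += amount
        try:
            date.fromisoformat(str(txn.get("date")))
        except ValueError:
            errors.append(f"transactions[{i}].date: not an ISO date")

    # Rule 3: opening balance plus all transactions must equal closing balance.
    # Only checked when every amount parsed, so the sum is meaningful.
    if not errors and opening + total != closing:
        errors.append(
            f"balance mismatch: {opening} + {total} != {closing}"
        )

    return ValidationResult(not errors, errors)


# Example input shaped like the JSON the mapping layer might return.
sample = {
    "opening_balance": "1000.00",
    "closing_balance": "849.75",
    "transactions": [
        {"date": "2024-01-05", "amount": "-250.50"},
        {"date": "2024-01-09", "amount": "100.25"},
    ],
}
result = validate_bank_statement(sample)
```

Using `Decimal` rather than `float` keeps the reconciliation check exact, which matters when a single cent of drift should fail the document.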

Key Impact

  • Achieved 95.4% accuracy across bank statements, invoices, and clinical notes
  • Cost of $0.0005 per document, 20 to 100x cheaper than existing cloud OCR solutions
  • Data sovereignty by design: original PDFs never leave the local machine
  • Supports 7 document types with domain-specific extraction models
  • FastAPI REST endpoints with Swagger UI for enterprise integration
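
The 20 to 100x figure in the list above follows directly from the per-document prices quoted on this page. The short sketch below reproduces that arithmetic. The batch size of 100,000 documents is a hypothetical volume chosen for illustration, not a figure from the project.

```python
# Prices are held in micro-dollars (millionths of a dollar) so the
# arithmetic stays in exact integers.
PIPELINE_COST = 500        # $0.0005 per document
CLOUD_COST_LOW = 10_000    # $0.01 per document
CLOUD_COST_HIGH = 50_000   # $0.05 per document


def cost_ratio(cloud_cost: int, pipeline_cost: int = PIPELINE_COST) -> int:
    """How many times more expensive the cloud price is per document."""
    return cloud_cost // pipeline_cost


def batch_cost_dollars(per_doc_micro: int, n_docs: int) -> float:
    """Total cost in dollars for a batch of documents."""
    return per_doc_micro * n_docs / 1_000_000


low_ratio = cost_ratio(CLOUD_COST_LOW)    # 20
high_ratio = cost_ratio(CLOUD_COST_HIGH)  # 100

# Hypothetical batch of 100,000 documents.
N_DOCS = 100_000
pipeline_total = batch_cost_dollars(PIPELINE_COST, N_DOCS)      # 50.0
cloud_total_low = batch_cost_dollars(CLOUD_COST_LOW, N_DOCS)    # 1000.0
cloud_total_high = batch_cost_dollars(CLOUD_COST_HIGH, N_DOCS)  # 5000.0
```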

Tech Stack

Python, Docling OCR, Gemini Flash-Lite, FastAPI, PostgreSQL, Docker