Document Intelligence Engine for Audit-Ready Fintech Archives

An on-prem document intelligence engine that transforms legacy financial records into a searchable, audit-grade archive with language-adaptive OCR and instant full-text discovery.

About the client

The client is a Luxembourg fintech bank that runs cross-border services for EU regulators and payment networks. Its on-premises DMS holds roughly two million historical pages in English, German, and French and absorbs eight to ten thousand new contracts, statements, and compliance certificates each month. Data-residency rules forbid any external processing, while discovery teams demand sub-second text search during audits and AML reviews.

About the Product and Introduction:

What started as a narrow request to “make archived scans searchable” evolved into an on-prem Searchable Archive AI Engine that now anchors the bank’s legal and compliance workflow. Built under strict EU AI risk controls, the system pairs explainable OCR with human-in-the-loop review and traceable logs, making it ready for AML audits and KYC reporting from day one. Before the rollout, image-only PDFs and legacy TIFF files were excluded from the search index, analysts had to copy passages by hand, and multilingual pages slowed investigations.

Devox Software rebuilt the process around a high-accuracy on-prem OCR SDK, wrapping it in a Dockerised FastAPI service that plugs directly into the bank’s storage and Elasticsearch stack. Each file now passes through image enhancement, automatic language detection, OCR, confidence scoring, PDF/A packaging, and instant indexing, all within the firewall. Text that once took hours to retrieve now appears in under a second, turning a static archive into a governed, discoverable asset.
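The stage order described above can be sketched as a chain of plain functions. This is a minimal illustration under stated assumptions, not the production code: every name here (`Page`, `enhance`, `detect_language`, `run_ocr`, `score_confidence`, `package_pdfa`, `index_page`, `process`) is hypothetical, and each stage body is a stub standing in for the OCR SDK, the PDF/A packager, or the Elasticsearch client.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """One scanned page moving through the pipeline (hypothetical model)."""
    source: str
    language: Optional[str] = None
    text: str = ""
    confidence: float = 0.0
    needs_review: bool = False
    steps: List[str] = field(default_factory=list)


def enhance(page: Page) -> Page:
    # Stub for deskewing, denoising, and sharpening the image.
    page.steps.append("enhance")
    return page


def detect_language(page: Page) -> Page:
    # Stub: a real detector would inspect the page; "de" is a fixed sample value.
    page.language = "de"
    page.steps.append("detect_language")
    return page


def run_ocr(page: Page) -> Page:
    # Stub for the call into the on-prem OCR SDK; values are sample data.
    page.text = "Beispieltext"
    page.confidence = 0.97
    page.steps.append("ocr")
    return page


def score_confidence(page: Page, threshold: float = 0.95) -> Page:
    # Flag the page for human review when confidence falls below the threshold.
    page.needs_review = page.confidence < threshold
    page.steps.append("score_confidence")
    return page


def package_pdfa(page: Page) -> Page:
    # Stub for merging the image and text layer into a PDF/A file.
    page.steps.append("package_pdfa")
    return page


def index_page(page: Page) -> Page:
    # Stub for pushing text and metadata to the search index.
    page.steps.append("index")
    return page


# Stages run in the order the case study lists them.
STAGES = (enhance, detect_language, run_ocr, score_confidence, package_pdfa, index_page)


def process(source: str) -> Page:
    page = Page(source=source)
    for stage in STAGES:
        page = stage(page)
    return page
```

Keeping each stage as a function with the same signature makes it easy to run stages as separate queue tasks later, which fits the Celery-based orchestration named in the stack.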

Project Team Composition:

  • Solution Architect (document intelligence & fintech compliance)
  • 2 Backend Engineers (Python / FastAPI, .NET SDK integration)
  • QA Automation Engineer
  • DevOps Engineer (Docker Swarm, CI/CD, observability)
  • Data Engineer (Elasticsearch & SQL tuning)
  • Project Manager

Challenges:

  • Image-only archive. Roughly two million legacy contracts, statements, and certificates existed as TIFF or image-PDF files with no text layer, locking critical clauses and KYC data outside the bank’s search index.
  • Manual clause retrieval. Compliance analysts spent more than 600 hours a month retyping passages for AML reviews, which extended routine investigations from minutes to days and inflated staffing costs.
  • Multilingual corpus. Documents arrived in English, German, and French without reliable language tags, so any OCR pass had to auto-detect the language and apply the correct hyphenation and dictionary rules to avoid false positives in sanctions screening.
  • Variable scan quality. Source material ranged from crisp 300 dpi exports to 150 dpi fax copies with skew, shadow, and stamp bleed-through, pushing baseline OCR accuracy below banking thresholds for evidentiary documents.
  • Strict on-prem residency. Luxembourg regulations and internal policy prohibited all external API calls or cloud storage, demanding a self-contained AI pipeline that could run, scale, and update entirely behind the firewall.
  • Sub-second discovery SLA. Legal and audit teams expected full-text hits across the entire corpus in under one second, which required a solution that balances high-volume OCR throughput with real-time indexing on existing hardware.
  • No confidence metrics. Earlier pilots returned text without certainty scores or source-page anchors, leaving reviewers unsure which values to trust and weakening the bank’s audit defensibility.

Tech Stack:

  • Frontend: React 18, i18next (multi-language UI), React Router, Ant Design (internal QA console)
  • Backend & Orchestration: FastAPI (Python 3.11), Pydantic v2, Celery task queue
  • OCR & AI: Enterprise-grade on-prem OCR SDK via .NET wrapper; deep-learning image preprocessing, spaCy NER for clause tagging
  • Search & Storage: Elasticsearch 8.x cluster, PostgreSQL 14 (metadata), on-prem NFS for originals and PDF/A outputs.
  • DevOps & Deployment: Docker images, six-node Docker Swarm, GitLab CI/CD, Ansible IaC
  • Authentication & Security: SAML SSO with Microsoft AD, JWT service tokens, HashiCorp Vault secrets, end-to-end TLS 1.3
  • Monitoring & Logging: Prometheus exporters, Grafana dashboards, Loki log aggregation, Alertmanager notifications
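To illustrate how the Elasticsearch layer could serve three languages from one index, here is a hedged sketch of an index mapping and a phrase query built as plain Python dictionaries. The field names (`doc_id`, `page`, `language`, `sha256`, `text`) and both helper functions are assumptions made for illustration; only the built-in `english`, `german`, and `french` analyzers and the `match_phrase` query are standard Elasticsearch features. The case study does not describe the actual mapping.

```python
SUPPORTED_LANGUAGES = {"en": "english", "de": "german", "fr": "french"}


def build_index_mapping() -> dict:
    """Return a mapping with one analysed sub-field per language (hypothetical schema)."""
    language_fields = {
        code: {"type": "text", "analyzer": analyzer}
        for code, analyzer in SUPPORTED_LANGUAGES.items()
    }
    return {
        "mappings": {
            "properties": {
                "doc_id": {"type": "keyword"},
                "page": {"type": "integer"},
                "language": {"type": "keyword"},
                "sha256": {"type": "keyword"},
                # The root field uses the default analyzer; sub-fields add
                # language-specific stemming and stop words.
                "text": {"type": "text", "fields": language_fields},
            }
        }
    }


def build_phrase_query(phrase: str, language: str, size: int = 10) -> dict:
    """Build a phrase query against the sub-field matching the detected language."""
    field = f"text.{language}" if language in SUPPORTED_LANGUAGES else "text"
    return {"query": {"match_phrase": {field: phrase}}, "size": size}
```

Both dictionaries can be passed to whichever Elasticsearch client the service uses; no client call is shown here because the connection details are not part of the source.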

Solution:

Devox Software built an on-prem Searchable Archive AI Engine: a Dockerised FastAPI service layer that folds the OCR engine into the bank’s existing DMS and Elasticsearch stack.

  • Single-pass ingest pipeline. Every newly scanned contract or statement is placed in a watch folder; the orchestrator sharpens the image, deskews it, detects the dominant language, and routes the pages through language-specific OCR profiles (English, German, French).
  • Confidence-aware validation. Each token carries a certainty score; anything below 95 % surfaces instantly in a React QA console, where reviewers correct text once and push edits back to lightweight language profiles.
  • Instant PDF/A sealing. The engine merges the cleaned image with an invisible text layer, signs a PDF/A-3b wrapper, and attaches a SHA-256 hash for evidentiary integrity.
  • Real-time search index. OCR output and enriched metadata flow to Elasticsearch within seconds; queries now span the whole corpus in < 0.5 s, satisfying audit and AML teams.
  • Traceable lineage. Every clause links to its source page, OCR engine version, language model, and confidence metrics — one click shows the original scan, the extracted string, and the reviewer’s name.
  • Air-gapped deployment. Six Docker Swarm nodes run entirely behind the firewall. A secure internal GitLab CI/CD instance ships signed images; Prometheus and Grafana monitor latency; HashiCorp Vault manages secrets. The system operates with zero external traffic and is designed to meet strict data-residency and audit requirements.
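Two of the mechanisms in the list above, the 95 % review threshold and the SHA-256 lineage record, can be sketched with the Python standard library alone. This is an illustrative sketch, not the engine's actual data model: `Token`, `tokens_for_review`, `lineage_record`, and all field names are hypothetical, and the byte string stands in for a real PDF/A file.

```python
import hashlib
from dataclasses import dataclass
from typing import List

REVIEW_THRESHOLD = 0.95  # the 95 % cut-off mentioned in the case study


@dataclass(frozen=True)
class Token:
    text: str
    confidence: float


def tokens_for_review(tokens: List[Token], threshold: float = REVIEW_THRESHOLD) -> List[Token]:
    """Return the tokens whose confidence falls below the review threshold."""
    return [t for t in tokens if t.confidence < threshold]


def lineage_record(pdf_bytes: bytes, source_page: int, ocr_version: str,
                   language: str, tokens: List[Token]) -> dict:
    """Bundle the hash of the sealed file with the facts needed to trace a hit."""
    return {
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "source_page": source_page,
        "ocr_version": ocr_version,
        "language": language,
        "min_confidence": min((t.confidence for t in tokens), default=None),
        "needs_review": bool(tokens_for_review(tokens)),
    }


# Sample data only; the version string and page number are made up.
sample_tokens = [Token("Vertrag", 0.99), Token("Nr.", 0.91)]
record = lineage_record(b"%PDF-sample", 3, "ocr-x.y", "de", sample_tokens)
```

Recomputing the hash over the stored file and comparing it with `record["sha256"]` is what lets a reviewer confirm the document has not changed since it was sealed.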

Results:

BUSINESS OUTCOMES

  • 70% fewer manual hours. Compliance teams reclaimed over 400 hours every month once retyping was eliminated.
  • Discovery in 0.4s. Keyword hits that once took minutes to load now appear almost instantly, speeding up AML reviews and audits.
  • Evidentiary review passed. All documents submitted since launch have passed evidentiary review without issue, and query time averages 0.4s across the full archive on existing hardware.
  • Regulatory-grade AI compliance. The platform meets internal audit standards and aligns with the EU AI Act’s high-risk classification for document intelligence, combining explainability, human review, and tamper-proof lineage by design.
  • 96% of the legacy archive became searchable within twelve weeks, unlocking clauses and dates that had been buried for a decade.
  • Paid back in under a year. Savings from reduced overtime and faster regulatory filings covered the project cost within nine months.
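The hours figures above can be cross-checked against the baseline given in the Challenges section. The short calculation below treats the "more than 600 hours a month" baseline as a lower bound of exactly 600, which is an assumption; the constant and function names are illustrative.

```python
BASELINE_HOURS_PER_MONTH = 600  # lower bound taken from "more than 600 hours a month"
REDUCTION_PERCENT = 70          # "70% fewer manual hours"


def hours_reclaimed(baseline_hours: int, reduction_percent: int) -> int:
    """Hours saved per month, using integer arithmetic to avoid float rounding."""
    return baseline_hours * reduction_percent // 100


reclaimed = hours_reclaimed(BASELINE_HOURS_PER_MONTH, REDUCTION_PERCENT)
remaining = BASELINE_HOURS_PER_MONTH - reclaimed

print(reclaimed)  # 420, consistent with "over 400 hours every month"
print(remaining)  # 180 hours of manual work left at the lower-bound baseline
```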

TECHNICAL OUTCOMES

  • Ten thousand pages per day were processed on the existing six-node Swarm, peaking at 15 pages per second while CPU utilisation stayed below 60%.
  • Decoupled microservices ship twice a week with zero downtime; React console, OCR workers, and indexers update independently.
  • Confidence-scored tokens feed an overnight auto-tune loop, lifting accuracy four points without dedicated ML cycles.
  • Complete lineage on every hit — SHA-hashed PDF/A, OCR version, language model, and reviewer log bundle into one immutable record.
  • The air-gapped stack undergoes quarterly penetration testing; no data ever leaves the firewall, ensuring compliance with Luxembourg’s data-residency rules.

Sum Up:

Every scanned contract now emerges as a signed PDF/A, searchable in under a second and traced back to its pixel-level source. Compliance teams work from instant hits, regulators receive audit-ready evidence, and the archive scales on the bank’s hardware with zero data egress.

Looking to unlock your document backlog with the same on-prem precision? Let’s outline a proof-of-concept and put intelligent automation to work for your records.
