Searchable Archive IA Engine for a European Legal Tech Provider

An on-prem intelligent automation engine that transforms legacy legal scans into a searchable, audit-grade archive with sub-second full-text discovery and language-adaptive OCR.


About the client

European legal-tech provider supporting banks, insurers, and utilities. Their on-prem DMS holds ≈ 2 million legacy pages in three working languages (English, German, and French) and ingests 8,000–10,000 new documents per month (scanned contracts, legal notices, compliance certificates). All data must remain on premises, and the service-level target for full-text search is < 1 second.

About the Product

Introduction:

Devox’s partnership with the European legal-tech provider began after a board-level referral: earlier document-automation results convinced the client to invite us in to unlock its decade-old scan archive. The mandate was clear — deliver full-text search and audit-grade traceability without altering the company’s heavily customised on-prem DMS.

We designed a sidecar Document Intelligence Layer: a Dockerised FastAPI service that tails the DMS export queue, cleans images, auto-detects English, German, or French, and feeds a self-trained Tesseract OCR. A 48-hour pilot on 5,000 pages hit 94% accuracy and sub-second Elasticsearch search, securing go-live approval.

Now every new scan runs a single pass — enhancement, language detection, confidence-scored OCR, live QA on low-certainty tokens, PDF/A sealing, and instant indexing — and writes straight back through the ERP’s HeadersRepository/RowsRepository. The once-static archive is fully searchable, audit-ready, and remains entirely on-prem.
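The single-pass flow described above can be sketched as an ordered chain of stage functions. This is an illustrative outline only — the stage names, the document dict, and the `ingest` entry point are stand-ins, not the actual service's API:

```python
# Minimal sketch of a single-pass ingest pipeline: each stage takes the
# document state and returns it enriched. Stage internals are stubbed out.

def enhance(doc):          # deskew / denoise the page image
    doc["history"].append("enhance")
    return doc

def detect_language(doc):  # pick the en / de / fr OCR profile
    doc["history"].append("detect_language")
    doc["lang"] = "en"     # placeholder decision
    return doc

def ocr(doc):              # confidence-scored OCR with the chosen profile
    doc["history"].append("ocr")
    return doc

def qa_review(doc):        # route low-certainty tokens to the QA console
    doc["history"].append("qa_review")
    return doc

def seal_pdfa(doc):        # wrap in signed PDF/A with an invisible text layer
    doc["history"].append("seal_pdfa")
    return doc

def index(doc):            # push text + metadata into the search index
    doc["history"].append("index")
    return doc

PIPELINE = [enhance, detect_language, ocr, qa_review, seal_pdfa, index]

def ingest(path):
    doc = {"path": path, "history": []}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

Keeping the stages as independent functions is what lets each one be deployed and hot-fixed separately, as the results section notes.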

Project Team

Composition:

  • Solution Architect (document intelligence & legal DMS integration)
  • 2 Backend Engineers — Python/FastAPI, .NET SDK wrapper
  • QA Automation Engineer
  • DevOps Engineer (on-prem Docker Swarm, CI/CD, observability)
  • Business Analyst (legal workflows & data-privacy compliance)
  • Project Manager

Challenges:

  • Image-only archive. Roughly 2 million legacy contracts were stored in TIFF/PDF format without a text layer, forcing the search engine to index only filenames and metadata, while the actual clauses remained invisible.
  • Manual clause retrieval. Paralegals spent 800+ hours every month retyping passages for discovery requests — one page at a time — creating a 7-day backlog for even routine keyword searches.
  • Multi-language corpus. Documents arrived in English (55%), German (30%), and French (15%), but language indicators were missing, so every OCR pass had to auto-detect the script before applying the correct dictionary and hyphenation rules.
  • Variable scan quality. Source material ranged from 300 dpi PDF/A to 150 dpi fax scans with skew and shadowing, which dropped baseline OCR accuracy below 85% and triggered costly manual clean-ups.
  • Strict on-prem residency. Client policy and GDPR constraints prohibited any external API calls; the entire AI stack, including model updates, had to run behind the firewall with verifiable data isolation.
  • Sub-second search SLA. Legal teams demanded a < 1s response across the whole corpus, meaning the pipeline had to deliver both OCR throughput (10,000 pages/day) and near-real-time indexing without adding hardware.
  • Confidence blind spots. Early pilots returned plain text without certainty scores or source-page anchors, leaving reviewers unsure which values were accurate and undermining audit defensibility in court.

Tech

Stack:

  • Frontend: React 18 with Vite, i18next for multilingual UI, React Router for routing logic, and Ant Design for internal QA console components.
  • Backend & Orchestration: FastAPI (Python 3.11), .NET 6 wrapper for enterprise-grade OCR SDK, Pydantic v2 for schema validation, and Celery for background job orchestration.
  • Pre-processing & AI: NeoML filters for deskewing and denoising, spaCy for named-entity recognition and clause tagging, with Tesseract fallback for low-resolution or degraded scans.
  • Search & Storage: Elasticsearch 8.x for real-time full-text indexing, PostgreSQL 14 for metadata and audit logs, NFS-attached storage for both originals and standards-compliant PDF/A outputs.
  • DevOps & Deployment: Dockerized microservices deployed to an on-prem Docker Swarm cluster (3 manager nodes + 6 workers), GitLab CI/CD for pipeline automation, and Ansible for infrastructure as code (IaC).
  • Authentication & Security: SAML SSO integrated with Microsoft Active Directory, JWT service tokens for API access, on-premises HashiCorp Vault for secrets management, and TLS 1.3 encryption across all internal services.
  • Monitoring & Logging: Prometheus exporters and Grafana dashboards for system health, Loki for centralized log collection, and Alertmanager for real-time incident notifications.

Solution:

Devox Software delivered a fully on-prem Searchable Archive IA Engine that integrates high-precision OCR into the client’s document management system, combining microservice orchestration with intelligent automation.

  • Single-pass ingest pipeline. Every new TIFF or image-PDF dropped into the watch directory triggers a FastAPI-based orchestrator that enhances image quality (deskew, denoise), detects the dominant language, and routes the page to an OCR engine via a .NET interface.
  • Language-adaptive OCR engine. For each page, the service dynamically selects an English, German, or French processing profile, applies legal-domain dictionaries, and assigns a confidence score to each extracted token.
  • Confidence-aware validation layer. Low-certainty fields (under 95%) are flagged for review in a React QA console. Corrections are logged once and instantly reinforce lightweight language profiles.
  • Instant PDF/A packaging. Cleaned documents are merged with an invisible text layer, wrapped in a digitally signed PDF/A-3b container, and archived with SHA-256 hashes for evidentiary integrity.
  • Real-time search indexing. Extracted text and enriched metadata are piped directly into Elasticsearch, enabling sub-second full-text search across the entire archive and meeting discovery SLAs for legal and compliance teams.
  • Full traceability. Each search result traces back to its source document, OCR version, confidence metrics, and reviewer corrections, delivering audit-ready lineage across every clause.
  • Secure on-prem deployment. The engine runs in an air-gapped environment within a six-node Docker Swarm cluster. GitLab CI/CD deploys verified containers; Prometheus and Grafana provide live observability, while all data stays within the firm’s firewall, satisfying strict residency policies.
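The confidence-aware validation step can be illustrated with the per-token dictionaries an OCR engine such as Tesseract typically emits (word text plus a 0–100 confidence, with -1 marking non-word layout boxes). The 95% threshold mirrors the description above, but the helper itself is a sketch, not the production code:

```python
# Sketch: split OCR tokens into auto-accepted and QA-flagged sets.
# Token dicts mimic the shape of pytesseract.image_to_data output:
# word text plus a 0-100 confidence, with -1 for non-word layout boxes.
def triage_tokens(tokens, threshold=95.0):
    accepted, flagged = [], []
    for tok in tokens:
        if tok["conf"] < 0:  # layout box, not a word: skip entirely
            continue
        (accepted if tok["conf"] >= threshold else flagged).append(tok)
    return accepted, flagged
```

In the real pipeline, the flagged set is what surfaces in the QA console, and each human correction is logged against the token's source-page anchor.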

Results:

BUSINESS OUTCOMES

  • 72% reduction in manual effort. Paralegals now review nine documents per hour instead of just two, freeing up two full-time roles for higher-value legal work.
  • Sub-second discovery. Keyword searches now resolve in ~0.4 s across the full archive, even at peak load.
  • 98% archive coverage in 8 weeks. Nearly all legacy scans were converted into PDF/A documents with full-text indexing and evidentiary traceability.
  • 99.2% text-level accuracy. Confidence scoring and human-in-the-loop corrections reduced OCR misreads by 63%, meeting defensibility thresholds for court and audit review.
  • One-year ROI. Reduced retyping hours and faster case prep delivered full payback in 11 months.

TECHNICAL OUTCOMES

  • 10,000-page daily throughput on existing hardware. The pipeline peaks at 14 pages per second with CPU usage below 65%, meeting the client’s growth forecasts.
  • Modular micro-architecture. React QA console, FastAPI orchestrator, OCR workers, and indexing service deploy independently, enabling same-day hot-fixes without downtime.
  • Live confidence tracking. Every token stores a score and a source-page anchor; low-certainty zones are surfaced in the UI and feed into nightly model fine-tuning.
  • On-prem, zero-egress security. All containers run inside a six-node Docker Swarm; no external APIs, complete data isolation, and audit-ready PDF/A-3b output.
  • End-to-end observability. Prometheus & Grafana dashboards monitor ingest latency, OCR accuracy, and index freshness; SLA breaches trigger instant Slack alerts.
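The evidentiary hashing behind the audit-ready PDF/A output can be sketched with the Python standard library; the manifest format here is illustrative, not the project's actual audit-log schema:

```python
import hashlib
import json

# Sketch: hash a sealed PDF/A byte stream and record it in an audit
# manifest, so later re-verification can prove the file is unchanged.
def seal_record(doc_id, pdf_bytes):
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return json.dumps({"doc_id": doc_id, "sha256": digest})

def verify(record_json, pdf_bytes):
    record = json.loads(record_json)
    return hashlib.sha256(pdf_bytes).hexdigest() == record["sha256"]
```

Because SHA-256 changes for any single-byte modification, a stored manifest lets reviewers demonstrate in court or audit that an archived document is bit-identical to the one sealed at ingest.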

Sum Up:

Static scans are now a living, searchable knowledge base — AI-OCR applies language-adaptive models on premises, confidence scores guide instant QA, and every page appears in full-text results in under a second. The engine scales on existing hardware, meets the strictest data-residency rules, and pays for itself in months.

Need the same on-prem accuracy, speed, and audit-proof traceability for your legal or compliance archive? Let’s map the first pilot and put intelligent automation to work for you.
