Real-world applications

What teams build with DocTroves data

From fine-tuning frontier LLMs to training niche OCR models, here's how our datasets are put to work.

LLM Fine-tuning

Fine-tune LLMs to understand document structure

General-purpose LLMs struggle with document-specific reasoning such as field extraction, table parsing, and multi-page context. Fine-tuning on paired (document, JSON) datasets from DocTroves teaches models the structured output patterns these tasks require.

Instruction-tuning format: document image or text → structured JSON target
Works with vision-language models (Claude, GPT-4V, Qwen-VL) and text-only models
Domain-specific fine-tunes outperform general models by 40–70% on extraction tasks
Invoice extraction · Contract review · Medical NER
Example training pair
{
  // Input: invoice scan → text
  "input": "INVOICE #4821\nAcme Corp\nDate: 12 Mar 2024\nItem: Cloud Storage x3 $450\nItem: Support Plan $200\nTotal: $650",

  // Target: structured JSON
  "target": {
    "invoice_number": "4821",
    "vendor": "Acme Corp",
    "date": "2024-03-12",
    "total": 650.00,
    "line_items": [...]
  }
}
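A pair like the one above can be written out as one JSONL line per document. The following is a minimal sketch, assuming a plain two-key record; the function name make_training_record and the choice to store the target as a JSON string are illustrative assumptions, not the DocTroves delivery format.

```python
import json


def make_training_record(document_text: str, target: dict) -> str:
    """Serialize one (document, JSON) pair as a single JSONL line.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    record = {
        "input": document_text,
        # The target is stored as a JSON string so the model is trained
        # to emit valid, consistently ordered JSON.
        "target": json.dumps(target, ensure_ascii=False, sort_keys=True),
    }
    return json.dumps(record, ensure_ascii=False)


invoice_text = (
    "INVOICE #4821\nAcme Corp\nDate: 12 Mar 2024\n"
    "Item: Cloud Storage x3 $450\nItem: Support Plan $200\nTotal: $650"
)

# line_items is elided in the example pair, so it is left out here.
target = {
    "invoice_number": "4821",
    "vendor": "Acme Corp",
    "date": "2024-03-12",
    "total": 650.00,
}

line = make_training_record(invoice_text, target)

# Round-trip check: the line parses back into the original pair.
round_trip = json.loads(line)
parsed_target = json.loads(round_trip["target"])
```

Keeping each record on a single line means a training set can be streamed and sharded without parsing the whole file; most fine-tuning toolchains expect some variant of this layout, though key names differ between them.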

OCR model benchmark — invoice dataset
Model                      Field F1   Table F1
Baseline (no fine-tune)    61.2%      44.8%
+ 10k DocTroves pairs      84.7%      71.3%
+ 100k DocTroves pairs     93.1%      88.6%
Representative results — actual gains vary by base model and domain.
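The page does not state how Field F1 is computed. One common definition is exact-match F1 over (field, value) pairs, sketched below under that assumption; the function name field_f1 and the sample predictions are hypothetical.

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Exact-match F1 over (field, value) pairs for a single document.

    Assumes flat dictionaries with hashable values (strings, numbers).
    """
    pred_items = set(predicted.items())
    gold_items = set(gold.items())
    true_pos = len(pred_items & gold_items)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_items)
    recall = true_pos / len(gold_items)
    return 2 * precision * recall / (precision + recall)


gold = {
    "invoice_number": "4821",
    "vendor": "Acme Corp",
    "date": "2024-03-12",
    "total": 650.0,
}

# Hypothetical model output: day and month swapped in the date field.
one_wrong = dict(gold, date="2024-12-03")

# Hypothetical model output: the total field is missing entirely.
one_missing = {k: v for k, v in gold.items() if k != "total"}

score_wrong = field_f1(one_wrong, gold)      # 3 of 4 pairs match on both sides
score_missing = field_f1(one_missing, gold)  # perfect precision, 3/4 recall
```

A wrong value costs both precision and recall, while a missing field costs only recall, so the two failure modes score differently under this definition.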
OCR Model Training

Train OCR engines that understand context, not just characters

Modern document OCR goes beyond raw character recognition. DocTroves datasets give your model the signal to understand layout, field relationships, and noisy scan conditions.

Bounding-box annotated datasets for layout-aware training
Noisy scan collections for robustness testing and training
Multilingual OCR corpora for 40+ scripts and languages
Layout detection · Table extraction · Handwriting
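Layout-aware training on bounding-box annotations is usually scored by how well predicted regions overlap the annotated ones. The sketch below computes intersection over union (IoU) for two axis-aligned boxes; the Box class and its pixel-coordinate fields are a hypothetical annotation format, not the DocTroves schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical annotation format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Clamp to zero so an empty or inverted box has no area.
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    intersection = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    ).area()
    union = a.area() + b.area() - intersection
    return intersection / union if union > 0 else 0.0


annotated = Box(0, 0, 10, 10)
predicted = Box(5, 0, 15, 10)   # shifted right by half the box width
far_away = Box(20, 20, 30, 30)  # no overlap with the annotation

overlap_score = iou(annotated, predicted)
disjoint_score = iou(annotated, far_away)
```

A prediction is typically counted as correct when its IoU with an annotated box exceeds a chosen threshold; the threshold itself is a project decision.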

Document Intelligence Pipelines

Build production extraction pipelines faster

Enterprise teams building document automation (AP processing, contract review, onboarding) need training and evaluation data for every document class. DocTroves gives you that data without months of internal annotation effort.

Evaluation sets to benchmark your extraction model in production
Edge-case and low-quality scan sets to harden pipelines
Industry-specific schemas and formats (HIPAA-regulated health records, UBL invoicing, ISO trade documents)
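Evaluating every document class separately can be sketched as a simple per-class accuracy report that flags classes falling below a target. The class names, the sample results, and the 0.9 threshold below are all illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_class(results):
    """Return {document_class: accuracy} from (document_class, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for doc_class, is_correct in results:
        totals[doc_class] += 1
        correct[doc_class] += int(is_correct)
    return {doc_class: correct[doc_class] / totals[doc_class] for doc_class in totals}


def failing_classes(results, threshold=0.9):
    """List document classes whose accuracy is below the threshold (assumed value)."""
    report = accuracy_by_class(results)
    return sorted(doc_class for doc_class, acc in report.items() if acc < threshold)


# Hypothetical evaluation results: one entry per evaluated document.
results = [
    ("invoice", True),
    ("invoice", True),
    ("contract", True),
    ("contract", False),
    ("onboarding_form", True),
]

report = accuracy_by_class(results)
needs_more_data = failing_classes(results)
```

Reporting per class rather than as a single aggregate keeps a weak document class from being hidden behind strong results on the high-volume ones.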
Talk to our team →

Accounts Payable Automation

500k invoice pairs across 120 vendor templates. Used to train field-extraction models for ERP integration.

Invoices · Finance

Contract Intelligence Platform

300k contracts with obligation tags and party extraction. Used to cut manual review time by 65% for a legal-tech client.

Legal · NLP

Insurance Claims Processing

Medical records and claims forms. Used to train triage-classification and damage-estimation models.

Medical · Insurance

Find the right dataset for your use case

Browse the catalog or contact us to scope a custom dataset for your exact pipeline.