From fine-tuning frontier LLMs to training niche OCR models, here's how our datasets are put to work.
General-purpose LLMs struggle with document-specific reasoning such as field extraction, table parsing, and multi-page context. Fine-tuning on paired (document, JSON) datasets from DocTroves teaches models the structured output patterns these tasks require.
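To make the idea of a paired (document, JSON) example concrete, here is a minimal Python sketch of turning such pairs into prompt/completion records in a JSONL file. The record layout, the prompt wording, and the function names `to_training_record` and `write_jsonl` are illustrative assumptions, not the DocTroves delivery format or any particular fine-tuning API.

```python
import json


def to_training_record(document_text: str, target: dict) -> dict:
    """Turn one (document, JSON) pair into a prompt/completion record.

    The prompt wording and the record keys are assumptions for illustration.
    """
    prompt = (
        "Extract the fields from the document below and answer with JSON only.\n\n"
        f"{document_text}"
    )
    # sort_keys keeps the serialized target stable from record to record
    completion = json.dumps(target, sort_keys=True, ensure_ascii=False)
    return {"prompt": prompt, "completion": completion}


def write_jsonl(pairs, path: str) -> int:
    """Write one JSON record per line and return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for document_text, target in pairs:
            record = to_training_record(document_text, target)
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Serializing the target with sorted keys is a design choice in this sketch: it gives the model one consistent output layout to learn, so key order does not vary between training examples.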
| Model | Field F1 | Table F1 |
|---|---|---|
| Baseline (no fine-tune) | 61.2% | 44.8% |
| + 10k DocTroves pairs | 84.7% | 71.3% |
| + 100k DocTroves pairs | 93.1% | 88.6% |
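The page does not say how Field F1 is computed. As a rough guide to reading the numbers, the sketch below shows one common definition: micro-averaged F1 over exact-match field values. The function name `field_f1` and the exact-match rule are assumptions made for illustration; the scoring behind the table may differ.

```python
def field_f1(predictions: list[dict], references: list[dict]) -> float:
    """Micro-averaged F1 over (field, value) pairs using exact match.

    Assumption: a predicted field counts as correct only if the same key
    exists in the reference with an identical value.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for key, value in pred.items():
            if key in ref and ref[key] == value:
                tp += 1  # correct field
            else:
                fp += 1  # wrong value or field not in the reference
        for key, value in ref.items():
            if key not in pred or pred[key] != value:
                fn += 1  # reference field missed or wrong
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = [{"invoice_number": "42", "total": "10.00", "currency": "USD"}]
expected = [{"invoice_number": "42", "total": "10.50", "vendor": "Acme"}]
score = field_f1(predicted, expected)  # 1 correct, 2 false positives, 2 misses
```

In this example only `invoice_number` matches, so precision and recall are both 1/3 and the score is 1/3. Real evaluations often normalize values first (dates, currency formats, whitespace), which exact match does not do.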
Modern document OCR goes beyond raw character recognition. DocTroves datasets give your model the signal to understand layout, field relationships, and noisy scan conditions.
Enterprise teams building document automation (AP processing, contract review, onboarding) need training and evaluation data for every document class. DocTroves gives you that data without months of internal annotation effort.
500k invoice pairs across 120 vendor templates. Used to train field-extraction models for ERP integration.
300k contracts with obligation tags and party extraction labels. Reduced manual review time by 65% for a legal-tech client.
Medical records and claims forms. Used to train triage classification and damage estimation models.
Browse the catalog or contact us to scope a custom dataset for your exact pipeline.