🔥 New datasets added weekly

The Marketplace for Document Training Data

Buy and sell curated document datasets — raw files, structured JSON extractions, and annotated corpora — purpose-built for AI, LLM, and OCR model training.

12,400+
Datasets Available
340M+
Documents Indexed
600+
AI Teams Served
98%
Quality Pass Rate
Why DocTroves

Everything your model needs to learn from documents

We handle sourcing, cleaning, formatting, and licensing — so your team can focus on training.

🗂️

Paired Datasets

Every dataset ships with both the original document (PDF, image, scan) and its structured JSON extraction — ground truth included.

⚖️

Clear Licensing

Commercial-use licenses on every dataset. No ambiguous provenance — we audit sources before listing.

Quality Guaranteed

Human-reviewed extractions with accuracy scores. Datasets that fall below our 95% accuracy threshold are delisted for remediation.

🌍

Multilingual

Documents in 40+ languages. Ideal for training multilingual OCR and extraction models at scale.

Instant Delivery

Download via secure link immediately after purchase. Large datasets available via S3-compatible bucket handoff.

🔄

Custom Orders

Need a specific document type or domain? Submit a request and our sourcing team will respond within 48 hours.
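
For teams wondering what a paired dataset might look like on disk, here is a minimal Python sketch. It assumes a hypothetical layout in which each source document sits beside a JSON extraction with the same file stem (for example `invoice_001.pdf` and `invoice_001.json`). That layout, the suffix list, and the `pair_documents` helper are illustrative assumptions, not a documented DocTroves format.

```python
from pathlib import Path

# Assumed document types; adjust to whatever the dataset actually contains.
DOC_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def pair_documents(root):
    """Return (document_path, extraction_path) pairs found under root.

    Hypothetical convention: the extraction shares the document's stem
    and carries a .json suffix. Documents without a match are skipped.
    """
    pairs = []
    for doc in sorted(Path(root).rglob("*")):
        if doc.suffix.lower() not in DOC_SUFFIXES:
            continue
        extraction = doc.with_suffix(".json")
        if extraction.exists():
            pairs.append((doc, extraction))
    return pairs
```

Calling `pair_documents("my_dataset/")` on an unpacked download would list every document that has a matching ground-truth file, which is a quick way to spot orphans before training.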

Data Categories

Browse by document type

Spanning financial, legal, medical, logistics, and more — in dozens of formats.

Simple Process

From search to training in minutes

1

Search & filter

Use catalog filters to find datasets by document type, language, format, annotation depth, and file size.

2

Preview a sample

Every dataset includes a free 50-document sample with full extraction JSON so you can validate quality before buying.

3

License & download

Purchase a commercial license and get immediate access via download link, or via S3-compatible bucket transfer for large datasets.

4

Train & ship

Data arrives in your preferred format (JSONL, Parquet, or Hugging Face Datasets), ready to plug straight into your training pipeline.
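
As a rough illustration of the last step, the sketch below reads a JSONL export and keeps only records at or above an accuracy cutoff. The `accuracy` field name and the `load_jsonl` helper are assumptions made for this example; check the actual schema in a dataset's free sample before relying on them.

```python
import json


def load_jsonl(path, min_accuracy=0.95):
    """Yield records from a JSONL file whose accuracy meets the cutoff.

    Assumes one JSON object per line with a numeric "accuracy" field
    (a hypothetical name). Records missing the field are treated as 0.0.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("accuracy", 0.0) >= min_accuracy:
                yield record
```

Because the function is a generator, it streams records one at a time, so a large export never needs to fit in memory; wrap the call in `list(...)` only for small samples.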

Ready to start?

Find the dataset your model needs today

Browse 12,400+ curated datasets — or tell us what you need and we'll source it.