Buy and sell curated document datasets — raw files, structured JSON extractions, and annotated corpora — purpose-built for AI, LLM, and OCR model training.
We handle sourcing, cleaning, formatting, and licensing — so your team can focus on training.
Every dataset ships with both the original document (PDF, image, scan) and its structured JSON extraction — ground truth included.
Commercial-use licenses on every dataset. No ambiguous provenance — we audit sources before listing.
Human-reviewed extractions with accuracy scores. Datasets that fall below our 95% accuracy threshold are delisted for remediation.
Documents in 40+ languages. Ideal for training multilingual OCR and extraction models at scale.
Download via secure link immediately after purchase. Large datasets available via S3-compatible bucket handoff.
Need a specific document type or domain? Submit a request and our sourcing team will respond within 48 hours.
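Since every dataset pairs each original document with a JSON extraction, a quick completeness check on a delivered file listing can be sketched as below. This is a minimal illustration only: it assumes a document and its extraction share a file stem (for example `0001.pdf` and `0001.json`), which is a hypothetical naming convention, and `unpaired` and `DOCUMENT_SUFFIXES` are names invented for the example.

```python
from pathlib import PurePosixPath

# Assumed set of original-document extensions (hypothetical; adjust per dataset).
DOCUMENT_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def unpaired(file_names):
    """Return (documents lacking a JSON file, JSON files lacking a document).

    Files are matched by path without extension, an assumed convention.
    """
    documents, extractions = set(), set()
    for name in file_names:
        path = PurePosixPath(name)
        stem = path.with_suffix("").as_posix()
        suffix = path.suffix.lower()
        if suffix == ".json":
            extractions.add(stem)
        elif suffix in DOCUMENT_SUFFIXES:
            documents.add(stem)
    return sorted(documents - extractions), sorted(extractions - documents)


# Example listing with one unmatched document and one unmatched extraction.
listing = ["a/0001.pdf", "a/0001.json", "a/0002.png", "a/0003.json"]
missing_json, missing_doc = unpaired(listing)
```

Here `missing_json` lists documents with no extraction and `missing_doc` lists extractions with no source document; both should be empty for a complete delivery.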
Spanning financial, legal, medical, logistics, and more — in dozens of formats.
Line items, totals, vendor fields — perfect for AP automation training.
NDAs, MSAs, employment agreements with clause-level annotation.
De-identified clinical notes, discharge summaries, and lab reports.
Multi-format bank and brokerage statements with transaction parsing.
Bills of lading, customs forms, delivery notes across major carriers.
Tax filings, permits, licenses, and public-sector documents.
Research articles with figure captions, tables, and citations extracted.
Technical manuals, spec sheets, and CAD-adjacent documentation.
Use catalog filters to find datasets by document type, language, format, annotation depth, and file size.
Every dataset includes a free 50-document sample with full extraction JSON so you can validate quality before buying.
Purchase a commercial license and get immediate access via download link, or via S3-compatible bucket transfer for large sets.
Data arrives in your preferred format (JSONL, Parquet, or Hugging Face Datasets), ready to plug straight into your training pipeline.
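As a rough illustration of the last two steps, the sketch below reads a JSONL sample and flags records missing expected fields before they reach a training pipeline. It assumes one JSON object per line; the field names `document_path` and `extraction` are hypothetical placeholders, not a documented schema, and `iter_records` and `missing_fields` are helper names made up for this example.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, failing loudly on bad JSON."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


def missing_fields(record: dict, required=("document_path", "extraction")) -> list:
    """Return the required field names absent from a record (assumed schema)."""
    return [name for name in required if name not in record]


# Inline stand-in for a sample file; with a real file, pass the open file handle.
sample = [
    '{"document_path": "invoices/0001.pdf", "extraction": {"total": "128.50"}}',
    "",
    '{"document_path": "invoices/0002.pdf"}',
]
records = list(iter_records(sample))
problems = {
    record["document_path"]: missing_fields(record)
    for record in records
    if missing_fields(record)
}
```

With a downloaded sample, the same functions can be applied to `open(path, encoding="utf-8")` in place of the inline list; an empty `problems` dict means every record carries the fields the check expects.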
Browse 12,400+ curated datasets — or tell us what you need and we'll source it.