12,400+ datasets

Data Catalog

Browse curated document datasets ready for AI, LLM, and OCR training pipelines.

Showing 2,100 invoice & receipt datasets
US Retail Invoices — 500k docs
$1,200

Scanned and digital retail invoices from US merchants, 2019–2024. Paired with JSONL extractions covering vendor, date, line items, totals, and tax.

Invoices English JSONL Commercial
📄 500,000 docs 💾 48 GB ⭐ 4.9
EU VAT Invoices Multilingual — 200k
$890

Invoice dataset spanning Germany, France, Spain, and Italy. Includes both scanned and native-digital PDFs with field-level bounding box annotations.

Invoices Multilingual Parquet Annotated
📄 200,000 docs 💾 22 GB ⭐ 4.8
Restaurant Receipts — 1.2M docs
$2,400

Thermal-print and digital receipts from food and beverage merchants, including noisy scans that are hard to OCR. Suited to training extractors that stay robust on low-quality scans.

Receipts English JSONL Noisy
📄 1,200,000 docs 💾 91 GB ⭐ 4.7
B2B Purchase Orders — 80k docs
$640

Enterprise purchase orders with supplier, SKU, quantity, unit price, and delivery fields. Clean, high-accuracy JSON ground truth included.

Purchase Orders English JSON High Quality
📄 80,000 docs 💾 9.4 GB ⭐ 4.9
Medical Lab Reports — 120k docs
$1,800

De-identified clinical lab reports across pathology, radiology, and blood work. HIPAA-compliant provenance with entity-level annotations.

Medical English JSONL De-identified
📄 120,000 docs 💾 14 GB ⭐ 4.8
NDA & Contract Corpus — 300k docs
$3,100

Non-disclosure and general commercial contracts with clause segmentation, party extraction, and obligation tagging. Multi-jurisdiction.

Legal English Parquet Clause-tagged
📄 300,000 docs 💾 31 GB ⭐ 4.9