Browse curated document datasets ready for AI, LLM, and OCR training pipelines.
Scanned and digital retail invoices from US merchants, 2019–2024. Paired with JSONL extractions covering vendor, date, line items, totals, and tax.
Invoice dataset spanning Germany, France, Spain, and Italy. Includes both scanned and native-digital PDFs with field-level bounding box annotations.
Thermal-print and digital receipts from the food & beverage sector, including noisy scans that challenge OCR. Ideal for training extractors that stay robust on low-quality scans.
Enterprise purchase orders with supplier, SKU, quantity, unit price, and delivery fields. Clean, high-accuracy JSON ground truth included.
De-identified clinical lab reports across pathology, radiology, and blood work. HIPAA-compliant provenance with entity-level annotations.
Non-disclosure and general commercial contracts with clause segmentation, party extraction, and obligation tagging. Multi-jurisdiction.