DocTroves exists because great document AI starts with great training data — and that data has been too hard to find.
Most of the world's information is locked inside documents — invoices, contracts, medical records, filings. Unlocking that information at scale requires AI models that can truly understand documents, not just read characters off a page.
Training those models demands massive quantities of paired data: original document images alongside structured, verified extractions. Collecting, cleaning, and licensing that data is expensive and slow. DocTroves was built to solve that problem.
We operate a focused marketplace: only document datasets, only commercially licensed, only with verified quality scores. No general-purpose data dumps. No provenance guessing.
Documents and their extractions only — we go deep, not wide.
Every dataset passes human review before it goes live.
Clear commercial-use rights on everything in the catalog.
DocTroves was founded after our team spent two years building document AI systems and watching every client struggle with the same problem: not enough labeled training data.
Launched the public marketplace with 800 datasets. Grew to 200 active buyers within six months, primarily AI labs and enterprise ML teams.
Crossed 10,000 datasets. Added a revenue-share seller program, bringing in hundreds of new data contributors across verticals.
12,400+ datasets, 340M+ indexed documents, and 600+ AI teams served. Expanding into multilingual and low-resource language collections.
We pay honest rates. Data contributors are partners, not an afterthought.
No dataset goes live without a PII audit. We reject anything that can't be de-identified cleanly.
We'd rather have fewer, better datasets than a massive catalog of junk.
AI shouldn't only work for English documents. We actively invest in multilingual collections.
Got a question, a dataset idea, or just want to introduce yourself? We read every message.