Our story

Built for the people training the next wave of AI

DocTroves exists because great document AI starts with great training data — and that data has been too hard to find.

Our mission

Most of the world's information is locked inside documents — invoices, contracts, medical records, filings. Unlocking that information at scale requires AI models that can truly understand documents, not just read characters off a page.

Training those models demands massive quantities of paired data: original document images alongside structured, verified extractions. Collecting, cleaning, and licensing that data is expensive and slow. DocTroves was built to solve that problem.

We operate a focused marketplace: only document datasets, only commercially licensed, only with verified quality scores. No general-purpose data dumps. No provenance guessing.

🎯

Focused

Documents and their extractions only — we go deep, not wide.

🔍

Verified

Every dataset passes human review before it goes live.

⚖️

Licensed

Clear commercial-use rights on everything in the catalog.

2022

DocTroves was founded after our team spent two years building document AI systems and watching every client struggle with the same problem: not enough labeled training data.

2023

Launched the public marketplace with 800 datasets. Grew to 200 active buyers within six months, primarily AI labs and enterprise ML teams.

2024

Crossed 10,000 datasets. Added a revenue-share seller program, bringing in hundreds of new data contributors across industry verticals.

2025

Reached 12,400+ datasets, 340M+ indexed documents, and 600+ AI teams served. Now expanding into multilingual and low-resource language collections.

Our values

🤝

Fair to sellers

We pay honest rates. Data contributors are partners, not an afterthought.

🔒

Privacy-first

No dataset goes live without a PII audit. We reject anything that can't be de-identified cleanly.

📐

Quality over quantity

We'd rather have fewer, better datasets than a massive catalog of junk.

🌐

Global perspective

AI shouldn't only work for English documents. We actively invest in multilingual collections.

We're a small team that cares a lot about this problem.

Got a question, a dataset idea, or just want to introduce yourself? We read every message.