Document Corpus

PDFs, legal filings, patents, research papers, and text documents — document data trains legal AI, research assistants, and document understanding models.

PDF, DOCX, TXT, Markdown, HTML, LaTeX

Overview

The written world, structured for machines.

Document corpus data encompasses PDFs, legal filings, academic papers, medical records, financial reports, contracts, manuals, and any other written material that AI systems must read, understand, and reason about. While web-scraped text has trained the base capabilities of large language models, high-value document data is what enables specialized AI applications in law, medicine, finance, and enterprise knowledge management.

The document AI market has matured rapidly. Harvey AI has raised more than $1B in total funding to build legal AI trained on law firm document corpora. Thomson Reuters and LexisNexis license their massive legal databases to AI companies. In healthcare, clinical notes and discharge summaries train medical AI systems like those from Tempus, PathAI, and Babylon Health. Each of these verticals requires not just the raw text but structured extraction (entities, relationships, section classifications, and cross-references) that transforms a PDF into machine-readable training data.

OCR quality is a gating factor for document data value. A scanned contract with 99.5% character-level OCR accuracy is usable; at 95% accuracy, it introduces training noise that degrades model performance. The shift from traditional OCR (Tesseract) to AI-powered document understanding (Azure Document Intelligence, Google Document AI, AWS Textract) has made high-fidelity extraction feasible at scale, but the extraction cost, from $1.50 per 1,000 pages for basic text up to $50 per 1,000 pages for structured extraction, adds to the total cost of document training data.

The legal landscape for document data is the most complex in AI training. Copyright, fair use, licensing, and consent requirements vary by jurisdiction and content type. Academic publishers like Elsevier and Springer Nature have begun licensing their archives to AI companies. News organizations negotiate data licensing deals worth tens of millions of dollars. The era of freely scraping document corpora for AI training is effectively over for commercial applications.
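To make the extraction rates quoted above concrete, here is a minimal sketch of the arithmetic in Python. The corpus size is a hypothetical figure chosen only for illustration, the function name is ours, and the $10 low end for structured extraction is taken from the Market Intelligence figures on this page rather than from any vendor price list.

```python
# Rates quoted on this page, in USD per 1,000 pages.
BASIC_RATE_PER_1K = 1.50
STRUCTURED_RATE_PER_1K = (10.0, 50.0)  # low and high end of the quoted range


def extraction_cost(pages: int, rate_per_1k: float) -> float:
    """Return the extraction cost in USD for `pages` pages at a per-1,000-page rate."""
    return pages / 1000 * rate_per_1k


# Hypothetical corpus size (assumption, not a figure from this page).
corpus_pages = 2_000_000

basic = extraction_cost(corpus_pages, BASIC_RATE_PER_1K)
structured_low = extraction_cost(corpus_pages, STRUCTURED_RATE_PER_1K[0])
structured_high = extraction_cost(corpus_pages, STRUCTURED_RATE_PER_1K[1])

print(f"Basic text: ${basic:,.0f}")
print(f"Structured: ${structured_low:,.0f} to ${structured_high:,.0f}")
```

Under these assumptions, structured extraction costs roughly 7 to 33 times as much as basic text extraction for the same corpus, which is why extraction depth is worth deciding before a corpus is processed.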

Market Intelligence

$1.50/1K pages

OCR cost (basic text extraction)

Source: Azure/AWS/Google Cloud pricing 2025

$10-50/1K pages

OCR cost (structured extraction)

Source: Azure Document Intelligence 2025

$17.06B

Optical Character Recognition market (2025)

Source: Mordor Intelligence 2025

$38.32B

OCR market projected (2030)

Source: Mordor Intelligence (17.57% CAGR)

~$5,000/book

Book licensing rate for AI training

Source: Economics of AI Training Data, arXiv 2025

$0.30-2.00/track

Music track licensing for AI

Source: EU licensing benchmarks 2025

$1-4/minute

Video licensing for AI training

Source: Economics of AI Training Data, arXiv 2025

$1B+ total

Harvey AI total funding raised

Source: Bloomberg / Crunchbase 2026
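The two OCR market figures above can be cross-checked against the quoted growth rate. The sketch below assumes the 17.57% CAGR compounds annually over the five years from 2025 to 2030; the function names are illustrative and not taken from any source.

```python
def project(value: float, cagr: float, years: int) -> float:
    """Compound `value` forward by `cagr` (as a fraction) for `years` years."""
    return value * (1 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Return the annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Figures quoted above, in billions of USD.
market_2025 = 17.06
market_2030 = 38.32

projected = project(market_2025, 0.1757, 5)
rate = implied_cagr(market_2025, market_2030, 5)

print(f"Projected 2030 market: ${projected:.2f}B")
print(f"Implied CAGR: {rate:.2%}")
```

Under that compounding assumption, the projection lands close to the quoted $38.32B, so the 2025 figure, the 2030 figure, and the growth rate are mutually consistent.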

Accepted Formats

We handle the format.

Regardless of how your document corpus is stored, we convert, clean, and structure it for AI model ingestion. Buyers get exactly what their pipelines need.

PDF, DOCX, TXT, Markdown, HTML, LaTeX

Applications

What AI models do with it.

01

Legal Document Analysis

Contracts, case law, and regulatory filings train AI that reviews documents, extracts clauses, and flags risks. Harvey AI, Casetext, and Ironclad use massive legal corpora for their products.

02

Medical Records Processing

Clinical notes, discharge summaries, and pathology reports train extraction models for EHR systems. Epic, Cerner, and Tempus license annotated medical document datasets.

03

Financial Report Analysis

10-K filings, earnings transcripts, and analyst reports train models that extract financial metrics, sentiment, and forward-looking statements for quantitative trading and risk analysis.

04

Academic Research AI

Scientific papers with citation graphs, methodology extraction, and finding summaries train research assistants like Semantic Scholar, Elicit, and Consensus.

05

Document Classification

Labeled document corpora train models that automatically route, categorize, and prioritize incoming documents. Insurance claims, government forms, and corporate correspondence are key verticals.

06

Contract Intelligence

Annotated contracts with clause-level labels train extraction models that identify obligations, deadlines, parties, and termination conditions across thousands of agreements simultaneously.

07

Regulatory Compliance Monitoring

Regulatory text databases train models that track rule changes and flag compliance gaps. Banking, pharmaceutical, and energy companies are primary buyers.

08

Knowledge Base Construction

Technical manuals, product documentation, and internal wikis train enterprise RAG systems that answer employee questions from organizational knowledge.

09

Patent Analysis

Patent filings with claim annotations and prior art labels train models for novelty assessment, infringement detection, and technology landscape mapping.

10

Historical Document Digitization

Handwritten and historical documents with expert transcriptions train OCR models that unlock archives for digital access. Libraries, museums, and genealogy companies are active buyers.

Pricing Guide

What it's worth.

Document data pricing reflects the cost of extraction, annotation, and legal clearance. Raw scans are cheap. Expert-annotated, legally licensed document corpora are among the most expensive data assets available.

Raw Scanned Documents

$0.001-0.01/page

Unprocessed scans. No OCR, no structure. Bulk archives and digitization projects.

OCR-Extracted Text

$0.01-0.05/page

Machine-extracted text with basic formatting. 95-99% character accuracy depending on document quality.

Structured Document Extraction

$0.10-0.50/page

Tables, forms, key-value pairs extracted with field labels. Standard for financial and insurance documents.

Expert-Annotated Documents

$2-15/page

Domain expert annotations — legal clause types, medical entity extraction, financial metric labeling. Requires credentialed annotators.

Licensed Publisher Content

$5,000/book equivalent

Formal licensing from publishers like Elsevier, Wiley, Springer Nature. Author royalty splits typically 50/50.

Custom Legal/Medical Corpora

$100K-2M+

Purpose-built datasets with specific document types, annotation schemas, and compliance guarantees. Multi-year licensing common.

Quality Standards

What makes it valuable.

Document data quality depends on extraction accuracy, annotation precision, and legal clearance. A beautiful PDF is worthless if the extracted text is garbled.

01

OCR Accuracy >99%

Character-level accuracy must exceed 99% for training use. Measured against human-verified ground truth on representative samples. Accuracy below 98% introduces harmful noise.

02

Layout Preservation

Table structures, headers, footnotes, and multi-column layouts must be accurately extracted. Flattened text that loses document structure loses most of its training value.

03

Entity Annotation Consistency

Named entities (people, organizations, dates, monetary amounts) must be labeled consistently using a standardized schema like OntoNotes or custom domain taxonomy.

04

Section Classification

Documents must have section-level labels — abstract, methodology, findings, clause type, diagnosis. Section structure enables targeted retrieval and fine-grained training.

05

Copyright Clearance

Every document must have documented rights for AI training use. Scraping without permission creates legal liability. Licensed content with clear terms is mandatory for commercial products.

06

De-Identification (regulated domains)

Medical and financial documents must be de-identified per HIPAA Safe Harbor or equivalent standard. Re-identification risk assessment documentation required.

07

Cross-Reference Integrity

Citations, footnotes, and internal references must be resolvable. Broken cross-references indicate extraction errors that propagate into training data.
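The OCR accuracy standard above depends on comparing extracted text with a human-verified transcription. One common way to do that is to count character edits (Levenshtein distance) and divide by the length of the reference. The sketch below assumes that definition of character-level accuracy; this page does not specify the exact metric, and the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


def passes_training_bar(reference: str, hypothesis: str, threshold: float = 0.99) -> bool:
    """Apply the 'must exceed 99%' rule from the standard above."""
    return character_accuracy(reference, hypothesis) > threshold


# A classic OCR confusion: "m" read as "rn".
truth = "termination"
ocr_output = "terrnination"
print(character_accuracy(truth, ocr_output))
print(passes_training_bar(truth, ocr_output))
```

In practice the score would be computed over a representative sample of pages rather than a single word, since one short string can swing the ratio sharply.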

Active Buyers

Who's buying.

Harvey AI

Legal AI platform. Licenses law firm document archives, case law databases, and regulatory filings to train contract analysis and legal research models.

Thomson Reuters (Westlaw)

Legal and tax AI products. One of the largest legal document corpora globally, also licenses to third-party AI companies for training.

LexisNexis (RELX)

Legal research AI. Massive case law and regulatory archive. Increasingly licensing datasets to AI companies through formal data partnerships.

Anthropic

Claude long-context training. Acquires diverse document corpora — academic, legal, technical — to improve document understanding and summarization capabilities.

OpenAI

GPT document comprehension. Negotiates licensing deals with publishers and news organizations worth tens of millions annually for training data access.

Tempus AI

Clinical document AI for oncology. Licenses de-identified medical records, pathology reports, and clinical notes for cancer treatment prediction models.

Bloomberg

BloombergGPT and terminal AI features. Proprietary financial document corpus spanning decades of filings, earnings calls, and analyst reports.

Semantic Scholar (AI2)

Academic research AI. Indexes 200M+ scientific papers. Licenses publisher archives for enhanced metadata extraction and citation analysis.

Ironclad

Contract lifecycle management AI. Acquires annotated contract datasets with clause-level labels for automated review and risk detection models.

Sample Data

What this looks like.

Legal briefs, patent filings, research papers, corporate filings, policy documents

Sell your document corpus data.

If your company generates document corpus data, AI companies are actively looking for it. We handle pricing, compliance, and buyer matching.

Request Valuation