Documents
Buy and sell document data — legal filings, contracts, patents, medical charts, inspection reports, and corporate filings. NLP companies need millions of real documents to train extraction, classification, and summarization models.
Available Now · 6 listings
Enterprise Codebase Migration Artifacts — 2,400 Java-to-Kotlin Conversions with Test Suites
Paired Java and Kotlin source files from 2,400 real enterprise migration projects, each with corresponding unit test suites and migration notes. Includes build configs, dependency changes, and API compatibility annotations. Powers code translation AI, automated refactoring tools, and migration planning assistants.
Federal Court Docket Filings — 3.2M Cases, PACER-Sourced, Structured + Full Text
Complete federal court docket entries from all 94 district courts and 13 circuit courts of appeals. Includes case metadata (parties, judges, case type, disposition), full-text filings, and motion outcomes. Built for litigation analytics, judicial prediction models, and legal research AI.
Clinical Radiology Reports — 8.4M Structured Reports with Matched DICOM Studies
Radiology dictation reports from a 12-hospital network paired with their source imaging studies (CT, MRI, X-ray). Reports are NLP-parsed into structured findings, impressions, and follow-up recommendations. Powers radiology AI copilots and automated report generation.
Commercial Real Estate Lease Agreements — 47K Contracts, 2015-2026, OCR-Processed, Entity-Tagged
Full-text commercial lease agreements from office, retail, and industrial properties across 38 US states. Each contract is OCR-processed, clause-segmented, and entity-tagged (landlord, tenant, guarantor, square footage, escalation terms, CAM provisions). Powers legal AI contract review and lease abstraction tools.
News Article Archive — 18M Articles, 4,200 Sources, Political Bias Scored
Full-text news articles from 4,200 English-language sources (national papers, local outlets, digital-native publications) with political bias ratings, topic tags, and named entity extraction. Each article is scored on a 7-point bias scale validated against AllSides and Media Bias/Fact Check. Built for misinformation detection, media monitoring AI, and balanced content curation.
Open Source Vulnerability Patches — 47K CVEs with Before/After Code Diffs
Curated dataset of 47,000 CVE-linked vulnerability patches across Python, JavaScript, Java, Go, and C/C++ open source projects. Each entry includes the vulnerable code, the patch diff, CVSS severity score, CWE classification, and exploit proof-of-concept where publicly available. Essential for training AI-powered code security scanners and automated patching systems.
Groups
Browse by group.
All Subtypes