Data Catalog

Documents

Buy and sell document data — legal filings, contracts, patents, medical charts, inspection reports, and corporate filings. NLP companies need millions of real documents to train extraction, classification, and summarization models.

100 subtypes · 11 groups

Available Now · 6 listings

Enterprise Codebase Migration Artifacts — 2,400 Java-to-Kotlin Conversions with Test Suites

Paired Java and Kotlin source files from 2,400 real enterprise migration projects, each with corresponding unit test suites and migration notes. Includes build configs, dependency changes, and API compatibility annotations. Powers code translation AI, automated refactoring tools, and migration planning assistants.

2,400 projects, 18M lines of code · listed

Federal Court Docket Filings — 3.2M Cases, PACER-Sourced, Structured + Full Text

Complete federal court docket entries from all 94 district courts and 13 circuit courts of appeals. Includes case metadata (parties, judges, case type, disposition), full-text filings, and motion outcomes. Built for litigation analytics, judicial prediction models, and legal research AI.

3.2M cases, 28M individual docket entries · listed

Clinical Radiology Reports — 8.4M Structured Reports with Matched DICOM Studies

Radiology dictation reports from a 12-hospital network paired with their source imaging studies (CT, MRI, X-ray). Reports are NLP-parsed into structured findings, impressions, and follow-up recommendations. Powers radiology AI copilots and automated report generation.

8.4M reports + matched imaging · contact

Commercial Real Estate Lease Agreements — 47K Contracts, 2015-2026, OCR-Processed, Entity-Tagged

Full-text commercial lease agreements from office, retail, and industrial properties across 38 US states. Each contract is OCR-processed, clause-segmented, and entity-tagged (landlord, tenant, guarantor, square footage, escalation terms, CAM provisions). Powers legal AI contract review and lease abstraction tools.

47,000 contracts (~1.2M pages) · listed

News Article Archive — 18M Articles, 4,200 Sources, Political Bias Scored

Full-text news articles from 4,200 English-language sources (national papers, local outlets, digital-native publications) with political bias ratings, topic tags, and named entity extraction. Each article scored on a 7-point bias scale validated against AllSides and Media Bias/Fact Check. Built for misinformation detection, media monitoring AI, and balanced content curation.

18M articles, 4,200 sources · listed

Open Source Vulnerability Patches — 47K CVEs with Before/After Code Diffs

Curated dataset of 47,000 CVE-linked vulnerability patches across Python, JavaScript, Java, Go, and C/C++ open source projects. Each entry includes the vulnerable code, the patch diff, CVE severity score, CWE classification, and exploit proof-of-concept where publicly available. Essential for training AI-powered code security scanners and automated patching systems.

47K CVEs, 128K affected files · listed

Groups

Browse by group.

All Subtypes

Every data type.