55,000 Documents, 1.1 Million Pages: The Infrastructure Behind the Epstein Files Search
- Patrick Duggan
- Feb 7
title: "55,000 Documents, 1.1 Million Pages: The Infrastructure Behind the Epstein Files Search"
date: 2026-02-07
author: Patrick Duggan
tags: [epstein-files, infrastructure, engineering, open-source]
Today we rebuilt the entire Epstein Files search infrastructure from the ground up. Here's why, and how you can use it.
The Problem
When the DOJ released 3.5 million pages of Epstein documents, journalists needed a way to search them. CFinke's [EpsteIN tool](https://github.com/cfinke/EpsteIn) broke the story wide open by extracting text from thousands of PDFs that were otherwise unsearchable.
We indexed everything. 109,000+ documents initially. Then traffic spiked. Hard.
The original infrastructure, a container app sharing resources with our threat intelligence platform, couldn't keep up. Search queries were timing out. The index was getting corrupted during peak loads.
Something had to change.
What We Built Today
**A dedicated VM with mounted Azure Storage.**
Instead of copying 55,000 PDFs around, we mount the Azure File Share and Blob Storage directly to the search server. The data stays in one place. The compute happens where Meilisearch lives.
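As a rough illustration of indexing in place, the sketch below walks a mounted share and builds one record per PDF without copying any files. The `<dataset dir>/<file>.pdf` layout, the record fields, and the batch size are assumptions made for the example, not a description of the real indexer. Text extraction and the actual push to Meilisearch are left out.

```python
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def iter_pdf_records(mount_root: Path) -> Iterator[Dict]:
    """Yield one index-ready record per PDF found under the mount.

    Assumes a hypothetical layout of <mount_root>/<dataset dir>/<file>.pdf.
    Files are read where they live; nothing is copied.
    """
    for pdf in sorted(mount_root.rglob("*.pdf")):
        rel = pdf.relative_to(mount_root)
        yield {
            "id": pdf.stem,  # file name without extension
            "dataset": rel.parts[0] if len(rel.parts) > 1 else "",
            "path": str(pdf),
            "size_bytes": pdf.stat().st_size,
        }


def batched(records: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Group records into lists of at most `size` for bulk indexing."""
    batch: List[Dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch could then be handed to whatever bulk-add call the search engine client provides. Batching keeps memory flat even with tens of thousands of files.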
The numbers:
- **55,507 PDFs** across 12 DOJ datasets
- **~1.1 million pages** of searchable content
- **11 Excel files** with financial records
- **73 MB of OCR'd text** from scanned documents
- **4-hour automatic re-indexing** to catch new uploads
All running on a $15/month VM with direct storage mounts. No egress costs. No data transfer delays.
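One way a scheduled re-index can catch new uploads is to compare file modification times against the last successful run. The sketch below assumes that approach; the real indexer may detect changes differently, and the function names are hypothetical. Only the 4-hour interval comes from the numbers above.

```python
from pathlib import Path
from typing import List

# The 4-hour cadence mentioned above, in seconds.
REINDEX_INTERVAL_SECONDS = 4 * 60 * 60


def is_due(last_run: float, now: float,
           interval: int = REINDEX_INTERVAL_SECONDS) -> bool:
    """Return True once at least `interval` seconds have passed since last_run."""
    return now - last_run >= interval


def changed_since(root: Path, last_run: float) -> List[Path]:
    """Return PDFs under root modified after last_run (a Unix timestamp).

    Assumption: uploads to the mounted storage update the file mtime.
    """
    return sorted(
        p for p in root.rglob("*.pdf") if p.stat().st_mtime > last_run
    )
```

A cron job or systemd timer could call `is_due` and then index only what `changed_since` returns, so a run with no new uploads does almost no work.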
The New Search: epstein.dugganusa.com
**URL:** [https://epstein.dugganusa.com](https://epstein.dugganusa.com)
How to Use It
**Web Interface:**
1. Go to epstein.dugganusa.com
2. Type any name, EFTA ID, or keyword
3. Click Search or hit Enter
4. Results show document ID, dataset, page count, and text preview
5. Click "more" to expand full document text
**Quick Search Buttons:**
Prince Andrew, Gates, Leon Black, Trump, Maxwell, Wexner, Deutsche Bank, Victims, Bannon, Lutnick
**API Access (for developers):**
No authentication is required for searches. Each result in the response includes:
- `efta_id` - Document identifier
- `content` - Full extracted text
- `dataset` - Which DOJ dataset (1-12)
- `pages` - Page count
- `people` - Extracted names
- `locations` - Extracted places
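For developers consuming the API, a minimal sketch of parsing a response into typed objects might look like this. The field names come from the list above. The top-level `hits` array, the field types, and the sample payload are assumptions for illustration; check a real response before relying on them.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchHit:
    efta_id: str                 # document identifier
    content: str                 # full extracted text
    dataset: int                 # DOJ dataset number (1-12)
    pages: int                   # page count
    people: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)


def parse_hits(body: str) -> List[SearchHit]:
    """Parse a JSON response body into SearchHit objects.

    Assumes results arrive in a top-level "hits" array; a missing
    array or missing optional fields yield empty values, not errors.
    """
    payload = json.loads(body)
    return [
        SearchHit(
            efta_id=str(hit.get("efta_id", "")),
            content=str(hit.get("content", "")),
            dataset=int(hit.get("dataset", 0)),
            pages=int(hit.get("pages", 0)),
            people=list(hit.get("people") or []),
            locations=list(hit.get("locations") or []),
        )
        for hit in payload.get("hits", [])
    ]


# Made-up sample data, not a real document.
sample = json.dumps({"hits": [{
    "efta_id": "EFTA-EXAMPLE", "content": "sample text",
    "dataset": 3, "pages": 12, "people": ["Example Person"],
}]})
hits = parse_hits(sample)
```

Defaulting missing fields to empty values keeps a client from crashing on documents where entity extraction found nothing.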
What's Different
**Before:** Shared container, 35K docs, frequent timeouts, limited to datasets 1-4
**After:**
- Dedicated infrastructure
- 55,507 documents (all 12 datasets)
- Sub-second search response
- Direct links to network graphs, financial flows, timeline views
- Developer API with no rate limits
- Auto-updating every 4 hours
Thank You, CFinke
This index exists because Christopher Finke built [EpsteIN](https://github.com/cfinke/EpsteIn). When the DOJ dropped millions of pages of scanned PDFs, his tool made them searchable. The 404 Media coverage brought attention. The community did the rest.
We're just the infrastructure. The extraction tool did the hard work.
What's Next
The indexer now runs on a schedule. As the DOJ releases more documents, they get indexed automatically. We're also working on:
- Entity extraction improvements (catching more names, dates, financial figures)
- Cross-reference with flight logs and financial records
- Network visualization updates
- Export functionality for researchers
The Philosophy
We built this because completeness matters more than polish. Three hours of engineering today means journalists and researchers can search 1.1 million pages instead of 300,000.
The data is public record. We just made it searchable.
**Search the files:** [epstein.dugganusa.com](https://epstein.dugganusa.com)
**Questions?** [email protected] | [@hakksaww on Bluesky](https://bsky.app/profile/hakksaww.bsky.social)
*Her name was Renee Nicole Good.*
*His name was Alex Jeffery Pretti.*



