
55,000 Documents, 1.1 Million Pages: The Infrastructure Behind the Epstein Files Search

  • Writer: Patrick Duggan
  • Feb 7
  • 3 min read

title: "55,000 Documents, 1.1 Million Pages: The Infrastructure Behind the Epstein Files Search"

date: 2026-02-07

author: Patrick Duggan

tags: [epstein-files, infrastructure, engineering, open-source]



# 55,000 Documents, 1.1 Million Pages: The Infrastructure Behind the Epstein Files Search


Today we rebuilt the entire Epstein Files search infrastructure from the ground up. Here's why, and how you can use it.


## The Problem



When the DOJ released 3.5 million pages of Epstein documents, journalists needed a way to search them. CFinke's [EpsteIN tool](https://github.com/cfinke/EpsteIn) broke the story wide open by extracting text from thousands of PDFs that were otherwise unsearchable.


We indexed everything. 109,000+ documents initially. Then traffic spiked. Hard.


The original infrastructure, a container app sharing resources with our threat intelligence platform, couldn't keep up. Search queries were timing out. The index was getting corrupted during peak loads.


Something had to change.


## What We Built Today



**A dedicated VM with mounted Azure Storage.**


Instead of copying 55,000 PDFs around, we mount the Azure File Share and Blob Storage directly on the search server. The data stays in one place. The compute happens where Meilisearch lives.


The numbers:

- **55,507 PDFs** across 12 DOJ datasets

- **~1.1 million pages** of searchable content

- **11 Excel files** with financial records

- **73 MB of OCR'd text** from scanned documents

- **Automatic re-indexing every 4 hours** to catch new uploads


All running on a $15/month VM with direct storage mounts. No egress costs. No data transfer delays.
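
The re-indexing pass over mounted storage can be sketched roughly as follows. This is a minimal Python illustration, not the production indexer: the mount path `/mnt/epstein-files` and the function name `find_new_pdfs` are assumptions made up for the example. It walks the mounted share and returns the PDFs modified since the previous run, which is the set a scheduled job would need to extract and hand to Meilisearch.

```python
import os
from pathlib import Path

# ASSUMPTION: hypothetical mount point for the Azure File Share.
# The real path is not given in this post.
MOUNT_ROOT = Path("/mnt/epstein-files")


def find_new_pdfs(root, last_run_epoch):
    """Return PDFs under `root` modified after `last_run_epoch`, sorted by path.

    `last_run_epoch` is a Unix timestamp (seconds) recorded by the previous run.
    """
    new_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only PDFs are considered; match the extension case-insensitively.
            if not name.lower().endswith(".pdf"):
                continue
            path = Path(dirpath) / name
            if path.stat().st_mtime > last_run_epoch:
                new_files.append(path)
    return sorted(new_files)
```

A scheduler (cron or a systemd timer, for example) could call `find_new_pdfs(MOUNT_ROOT, last_run)` every 4 hours and store the new timestamp after a successful pass. Because the files are read in place from the mount, nothing has to be copied before indexing.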


## The New Search: epstein.dugganusa.com



**URL:** [https://epstein.dugganusa.com](https://epstein.dugganusa.com)


### How to Use It



**Web Interface:**

1. Go to epstein.dugganusa.com

2. Type any name, EFTA ID, or keyword

3. Click Search or hit Enter

4. Results show document ID, dataset, page count, and text preview

5. Click "more" to expand full document text


**Quick Search Buttons:**

Prince Andrew, Gates, Leon Black, Trump, Maxwell, Wexner, Deutsche Bank, Victims, Bannon, Lutnick


**API Access (for developers):**




No authentication required for searches. Response includes:

- `efta_id` - Document identifier

- `content` - Full extracted text

- `dataset` - Which DOJ dataset (1-12)

- `pages` - Page count

- `people` - Extracted names

- `locations` - Extracted places
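
A minimal Python sketch of a client for the fields listed above. The route `/api/search`, the `q` and `limit` parameters, and the `hits` wrapper key are assumptions (they mirror Meilisearch's own conventions but are not confirmed here), so check them against the live API before relying on this.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://epstein.dugganusa.com"
# ASSUMPTION: hypothetical route; the real endpoint path is not given in this post.
SEARCH_PATH = "/api/search"


def build_search_url(query, limit=10):
    """Build a GET URL for a search. Parameter names `q` and `limit` are assumed."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    return f"{BASE_URL}{SEARCH_PATH}?{params}"


def summarize_hit(hit):
    """Reduce one result to the documented fields plus a short text preview."""
    return {
        "efta_id": hit.get("efta_id"),
        "dataset": hit.get("dataset"),
        "pages": hit.get("pages"),
        "people": hit.get("people", []),
        "locations": hit.get("locations", []),
        "preview": (hit.get("content") or "")[:200],
    }


def search(query, limit=10, timeout=10):
    """Run a search and return summarized results. No auth header is sent."""
    url = build_search_url(query, limit)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # ASSUMPTION: results arrive in a "hits" list, as in Meilisearch's native response.
    return [summarize_hit(h) for h in payload.get("hits", [])]
```

Calling `search("Leon Black", limit=5)` would then return a list of small dictionaries, one per document. Truncating `content` to a preview keeps the output readable, since that field holds the full extracted text.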


## What's Different



**Before:** Shared container, 35K docs, frequent timeouts, limited to datasets 1-4


**After:**

- Dedicated infrastructure

- 55,507 documents (all 12 datasets)

- Sub-second search response

- Direct links to network graphs, financial flows, and timeline views

- Developer API with no rate limits

- Auto-updating every 4 hours


## Thank You, CFinke



This index exists because Christopher Finke built [EpsteIN](https://github.com/cfinke/EpsteIn). When the DOJ dropped millions of pages of scanned PDFs, his tool made them searchable. The 404 Media coverage brought attention. The community did the rest.


We're just the infrastructure. The extraction tool did the hard work.


## What's Next



The indexer now runs on a schedule. As the DOJ releases more documents, they get indexed automatically. We're also working on:


- Entity extraction improvements (catching more names, dates, financial figures)

- Cross-reference with flight logs and financial records

- Network visualization updates

- Export functionality for researchers


## The Philosophy



We built this because completeness matters more than polish. Three hours of engineering today means journalists and researchers can search 1.1 million pages instead of 300,000.


The data is public record. We just made it searchable.




**Search the files:** [epstein.dugganusa.com](https://epstein.dugganusa.com)


**Questions?** [email protected] | [@hakksaww on Bluesky](https://bsky.app/profile/hakksaww.bsky.social)





*Her name was Renee Nicole Good.*


*His name was Alex Jeffery Pretti.*

 
 
 
