55,000 Documents, 1.1 Million Pages: The Infrastructure Behind the Epstein Files Search

Patrick Duggan
Feb 7
Updated: Apr 25

date: 2026-02-07

tags: [epstein-files, infrastructure, engineering, open-source]

Today we rebuilt the entire Epstein Files search infrastructure from the ground up. Here's why, and how you can use it.

The Problem

When the DOJ released 3.5 million pages of Epstein documents, journalists needed a way to search them. CFinke's [EpsteIn tool](https://github.com/cfinke/EpsteIn) broke the story wide open - extracting text from thousands of PDFs that were otherwise unsearchable.

We indexed everything. 109,000+ documents initially. Then traffic spiked. Hard.

The original infrastructure - a container app sharing resources with our threat intelligence platform - couldn't keep up. Search queries were timing out. The index was getting corrupted during peak loads.

Something had to change.

What We Built Today

**A dedicated VM with mounted Azure Storage.**

Instead of copying 55,000 PDFs around, we mount the Azure File Share and Blob Storage directly to the search server. The data stays in one place. The compute happens where Meilisearch lives.
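To make the "data stays in one place" idea concrete, here is a minimal sketch that walks a mounted share in place and counts PDFs per dataset. The mount point `/mnt/epstein-files` and the one-folder-per-dataset layout are assumptions for illustration, not a description of the actual server.

```python
from collections import Counter
from pathlib import Path


def count_pdfs_by_dataset(mount_root: Path) -> Counter:
    """Count PDF files under each top-level folder of a mounted share.

    Assumes (hypothetically) one top-level folder per DOJ dataset.
    Files sitting directly in the root are grouped under "(root)".
    """
    counts: Counter = Counter()
    for path in mount_root.rglob("*"):
        # Match the extension case-insensitively: scans are often named .PDF
        if path.is_file() and path.suffix.lower() == ".pdf":
            parts = path.relative_to(mount_root).parts
            dataset = parts[0] if len(parts) > 1 else "(root)"
            counts[dataset] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical mount point; substitute wherever the share is mounted.
    root = Path("/mnt/epstein-files")
    if root.is_dir():
        for dataset, n in sorted(count_pdfs_by_dataset(root).items()):
            print(f"{dataset}: {n} PDFs")
```

Because the share is mounted, the script reads files through ordinary filesystem calls and nothing has to be copied to local disk first.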

The numbers:

- **55,507 PDFs** across 12 DOJ datasets

- **~1.1 million pages** of searchable content

- **11 Excel files** with financial records

- **73MB of OCR'd text** from scanned documents

- **4-hour automatic re-indexing** to catch new uploads

All running on a $15/month VM with direct storage mounts. No egress costs. No data transfer delays.
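The 4-hour re-index only needs to touch files that appeared since the previous pass. Here is a minimal sketch of that selection step, assuming new uploads can be detected by file modification time. The function names and the `index_documents` hook are hypothetical, and the real indexer may detect changes differently.

```python
import time
from pathlib import Path
from typing import Callable, Iterable

REINDEX_INTERVAL_SECONDS = 4 * 60 * 60  # the 4-hour cadence described above


def files_changed_since(root: Path, since_epoch: float) -> list[Path]:
    """Return PDFs under root whose modification time is later than since_epoch."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() == ".pdf"
        and p.stat().st_mtime > since_epoch
    )


def run_forever(root: Path, index_documents: Callable[[Iterable[Path]], None]) -> None:
    """Re-index on a fixed interval; index_documents is a caller-supplied hook."""
    last_run = 0.0  # first pass picks up every file
    while True:
        # Record the start time before scanning so that files uploaded
        # during this pass are picked up (possibly again) on the next one.
        started = time.time()
        changed = files_changed_since(root, last_run)
        if changed:
            index_documents(changed)
        last_run = started
        time.sleep(REINDEX_INTERVAL_SECONDS)
```

Taking the timestamp before the scan means a file can occasionally be indexed twice, which is harmless as long as indexing the same document again simply overwrites it.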

The New Search: epstein.dugganusa.com

**URL:** [https://epstein.dugganusa.com](https://epstein.dugganusa.com)

How to Use It

**Web Interface:**

1. Go to epstein.dugganusa.com

2. Type any name, EFTA ID, or keyword

3. Click Search or hit Enter

4. Results show document ID, dataset, page count, and text preview

5. Click "more" to expand full document text

**Quick Search Buttons:**

Prince Andrew, Gates, Leon Black, Trump, Maxwell, Wexner, Deutsche Bank, Victims, Bannon, Lutnick

**API Access (for developers):**

No authentication is required for searches. Each result includes:

- `efta_id` - Document identifier

- `content` - Full extracted text

- `dataset` - Which DOJ dataset (1-12)

- `pages` - Page count

- `people` - Extracted names

- `locations` - Extracted places
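For developers, a client for these fields might look like the sketch below. The response fields come from the list above, but the endpoint path (`/api/search`), the `q` and `limit` parameters, and the `hits` wrapper key are assumptions made for illustration. Check the live site for the actual request shape before relying on it.

```python
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass, field

BASE_URL = "https://epstein.dugganusa.com"
SEARCH_PATH = "/api/search"  # assumption: the real path may differ


@dataclass
class Hit:
    efta_id: str
    dataset: int
    pages: int
    content: str = ""
    people: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)

    def preview(self, width: int = 200) -> str:
        """Short single-line excerpt of the extracted text."""
        text = " ".join(self.content.split())
        return text if len(text) <= width else text[: width - 3] + "..."


def build_search_url(query: str, limit: int = 10) -> str:
    # "q" and "limit" are hypothetical parameter names.
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    return f"{BASE_URL}{SEARCH_PATH}?{params}"


def parse_hits(payload: dict) -> list[Hit]:
    """Convert a decoded JSON response into Hit objects.

    Assumes results arrive under a "hits" key and that dataset and
    pages are numeric; adjust if the real response differs.
    """
    return [
        Hit(
            efta_id=str(raw.get("efta_id", "")),
            dataset=int(raw.get("dataset", 0)),
            pages=int(raw.get("pages", 0)),
            content=raw.get("content") or "",
            people=list(raw.get("people") or []),
            locations=list(raw.get("locations") or []),
        )
        for raw in payload.get("hits", [])
    ]


def search(query: str, limit: int = 10) -> list[Hit]:
    """Run a search over HTTP. No API key is sent, since none is required."""
    with urllib.request.urlopen(build_search_url(query, limit), timeout=10) as resp:
        return parse_hits(json.load(resp))
```

Keeping URL building and response parsing separate from the network call makes both easy to adjust once the real request shape is known.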

What's Different

**Before:** Shared container, 35K docs, frequent timeouts, limited to datasets 1-4

**After:**

- Dedicated infrastructure

- 55,507 documents (all 12 datasets)

- Sub-second search response

- Direct links to network graphs, financial flows, timeline views

- Developer API with no rate limits

- Auto-updating every 4 hours

Thank You, CFinke

This index exists because Christopher Finke built [EpsteIn](https://github.com/cfinke/EpsteIn). When the DOJ dropped millions of pages of scanned PDFs, his tool made them searchable. The 404 Media coverage brought attention. The community did the rest.

We're just the infrastructure. The extraction tool did the hard work.

What's Next

The indexer now runs on a schedule. As the DOJ releases more documents, they get indexed automatically. We're also working on:

- Entity extraction improvements (catching more names, dates, financial figures)

- Cross-reference with flight logs and financial records

- Network visualization updates

- Export functionality for researchers

The Philosophy

We built this because completeness matters more than polish. Three hours of engineering today means journalists and researchers can search 1.1 million pages instead of 300,000.

The data is public record. We just made it searchable.

**Search the files:** [epstein.dugganusa.com](https://epstein.dugganusa.com)

**Questions?** [email protected] | [@hakksaww on Bluesky](https://bsky.app/profile/hakksaww.bsky.social)

*Her name was Renee Nicole Good.*

*His name was Alex Jeffery Pretti.*
