
10 Million Documents. $280 a Month. One VM.

  • Writer: Patrick Duggan
  • Feb 25

Updated: Apr 25


title: "10 Million Documents. $280 a Month. One VM."

date: 2026-02-25

author: Patrick Duggan

tags: [butterbot, scale, proof-of-concept, meilisearch, epstein, icij, threat-intel, infrastructure]



# 10 Million Documents. $280 a Month. One VM.


**We hit 10.4 million searchable documents this week. The entire platform — 386,000 DOJ Epstein files, 2 million ICIJ offshore entities, 3.3 million relationship edges, 1.8 million federal enforcement decisions, 918,000 threat indicators, and 32 other indexes — runs on a single Azure VM and two containers. Total infrastructure cost: $280 per month.**


## The Numbers Don't Lie



Here's what's running right now on analytics.dugganusa.com and epstein.dugganusa.com:


| Dataset | Documents | Source |
|---------|-----------|--------|
| ICIJ Relationships | 3,339,267 | Panama & Pandora Papers |
| ICIJ Offshore Entities | 2,016,524 | Panama & Pandora Papers |
| Federal Enforcement Decisions | 1,792,328 | US Government |
| Block Events | 923,002 | Live threat detection |
| Threat Indicators (IOCs) | 917,939 | OTX, CISA, community |
| Search Queries | 501,225 | Platform analytics |
| Whitelist Events | 475,113 | Security automation |
| Epstein DOJ Files | 385,935 | Department of Justice |
| Everything else | ~86,000 | 24 additional indexes |
| **Total** | **10,437,820** | **35.6 GB database** |
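The totals in the table can be checked with a few lines of arithmetic. The sketch below copies the per-dataset counts from the table and derives the "everything else" remainder from the reported total; nothing here queries the live system.

```python
# Document counts copied from the dataset table above.
counts = {
    "ICIJ Relationships": 3_339_267,
    "ICIJ Offshore Entities": 2_016_524,
    "Federal Enforcement Decisions": 1_792_328,
    "Block Events": 923_002,
    "Threat Indicators (IOCs)": 917_939,
    "Search Queries": 501_225,
    "Whitelist Events": 475_113,
    "Epstein DOJ Files": 385_935,
}
reported_total = 10_437_820

# Sum of the eight named indexes, and what is left for the remaining ones.
named_total = sum(counts.values())
everything_else = reported_total - named_total

print(f"named indexes:   {named_total:,}")      # 10,351,333
print(f"everything else: {everything_else:,}")  # 86,487
```

The remainder of 86,487 matches the "~86,000" row.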


That's not a demo. That's not a staging environment. That's production, serving 132 registered API customers across 46 countries, handling 30,000+ queries per month, with 99.99% uptime.


## The Infrastructure



Three services. That's it.


1. **BRAIN** (analytics.dugganusa.com) — Express.js container on Azure Container Apps. STIX/TAXII feed, threat intelligence API, blog engine, search orchestration. The heavy compute.

2. **DRONE** (security.dugganusa.com) — Lightweight operations UI. Monitoring, incident response, compliance dashboards.

3. **Meilisearch** — Open-source search engine on a B2ms VM (2 vCPU, 8GB RAM, 64GB data disk). Indexes everything. Sub-100ms query response on 10M+ documents.
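For readers who have not used Meilisearch, this is roughly what a search call against it looks like. It is a minimal sketch using only the Python standard library and Meilisearch's documented `POST /indexes/{index_uid}/search` route. The host, the index name `epstein_files`, and the API key are placeholders chosen for illustration, not the platform's real values.

```python
import json
import urllib.request


def build_search_request(host, index_uid, query, api_key, limit=5):
    """Build a Meilisearch search request (POST /indexes/{index_uid}/search)."""
    body = json.dumps({"q": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/indexes/{index_uid}/search",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def search(host, index_uid, query, api_key, limit=5):
    """Send the request; return the hits and the engine-reported latency in ms."""
    req = build_search_request(host, index_uid, query, api_key, limit)
    with urllib.request.urlopen(req, timeout=10) as resp:
        result = json.load(resp)
    return result["hits"], result["processingTimeMs"]


# Hypothetical usage against a local instance (7700 is Meilisearch's default port):
#   hits, ms = search("http://localhost:7700", "epstein_files", "flight logs", "MY_KEY")
```

The `processingTimeMs` field in the response is the number to watch when checking a latency claim like "sub-100ms" on your own instance.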


Monthly bill:

- Container Apps: $128

- VM + disk: $44

- Container Registry: $27

- Storage: $22

- **Total: $280** (the four line items above account for $221 of it)


For context, Recorded Future — which does a subset of what we do — charges customers $200,000 per year. Palantir starts at seven figures. We index more government documents than either of them and we do it for less than a family cell phone plan.


## The AI Makes the Decisions



This isn't a search engine with a dashboard. This is an autonomous decision-making system in production.


Of those 10.4 million documents, 1.8 million are enforcement decisions made by AI — not queued for human review, not flagged for analyst triage, not sitting in a SOAR playbook waiting for someone to click "approve." Made. Autonomously. In real-time.


The system — we call it the OZ engine — watches every inbound request, scores it against behavioral baselines, maps it to MITRE ATT&CK techniques, detects campaign patterns across IPs, and decides: block, allow, or escalate. It's made 923,000 block decisions and 475,000 whitelist decisions. It identifies bulletproof hosting providers, flags credential stuffing campaigns, and auto-publishes indicators to the STIX feed — all without a human in the loop.


Most companies can't get approval to let AI draft an email. We let ours run the security perimeter.


The confidence scoring is transparent. Every decision includes a behavioral detection score, a novelty assessment, a significance rating, and a tier classification. When it's wrong — and at 95% accuracy, it is sometimes wrong — the audit trail is complete. We've caught 34 false positives, documented them, and fed them back into the model. That's not a bug. That's the system learning.
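To make the shape of such a decision concrete, here is a small illustrative sketch. It is not the OZ engine: the four record fields come from the description above, but the thresholds, the tier names, the 0-to-1 score scale, and the `decide` function are all assumptions invented for this example.

```python
from dataclasses import dataclass

# Hypothetical thresholds on a 0.0-1.0 scale; the real values are not published.
BLOCK_THRESHOLD = 0.80
ALLOW_THRESHOLD = 0.20


@dataclass(frozen=True)
class Decision:
    """One auditable decision record with the four fields named in the text."""
    ip: str
    action: str              # "block", "allow", or "escalate"
    behavioral_score: float  # behavioral detection score
    novelty: float           # novelty assessment
    significance: float      # significance rating
    tier: str                # tier classification (labels here are made up)


def decide(ip, behavioral_score, novelty, significance):
    """Map a behavioral score to block / allow / escalate (illustrative rule only)."""
    if behavioral_score >= BLOCK_THRESHOLD:
        action = "block"
    elif behavioral_score <= ALLOW_THRESHOLD:
        action = "allow"
    else:
        action = "escalate"
    tier = "high" if significance >= 0.5 else "low"
    return Decision(ip, action, behavioral_score, novelty, significance, tier)


# Addresses below are from the RFC 5737 documentation ranges.
print(decide("203.0.113.9", 0.93, 0.40, 0.70).action)   # block
print(decide("198.51.100.7", 0.05, 0.10, 0.10).action)  # allow
print(decide("192.0.2.44", 0.50, 0.90, 0.30).action)    # escalate
```

Keeping every score on the record, rather than only the final action, is what makes a false positive traceable after the fact.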


The industry talks about "AI-powered security" as a marketing term. We ship it as architecture. The AI doesn't assist. It decides. And 275+ organizations in 46 countries trust those decisions enough to consume them as threat intelligence.


## Why This Matters



The conventional wisdom in enterprise software is that scale requires scale. You need a Kubernetes cluster, a data engineering team, a DevOps org, a seven-figure cloud budget, and 18 months of runway before you can prove anything works.


We rejected all of that.




The thesis was simple: if one VM with the right search engine can handle 10 million documents, then the expensive part of data infrastructure isn't compute — it's the documents themselves. The corpus is the moat. The ingestion pipeline is the product. The infrastructure is a commodity.


We proved it. And the documents keep coming.


Here's the part that should keep enterprise infrastructure teams up at night: our entire 10.4-million-document platform — search engine, AI decision engine, threat intelligence pipeline, STIX feed, document OCR, face detection, five cross-referenced government databases — uses 8GB of RAM. Total.


An iPhone 16 Pro has 8GB of RAM.


We're running a platform that competes with Recorded Future and Palantir on hardware specs that would fit in your pocket. Not because we're clever with optimization (though we are). Because the entire premise of enterprise infrastructure pricing is a lie. You don't need 64-core machines and terabytes of RAM to search 10 million documents. You need the right search engine, the right data model, and the discipline to not over-engineer everything into a Kubernetes cluster because that's what the consultant recommended.


The B2ms VM we run this on costs $44 per month. The iPhone in your pocket cost $1,200. A month of our search VM is cheaper than the phone you're probably reading this on. By a factor of 27.


## The Corpus Is the Moat



Anyone can spin up a search engine. Nobody else is doing this:


- **Downloading all 12 Epstein DOJ datasets** (45,000 PDFs, 14GB), parsing them, OCR-ing 42,000 JP2 images with Google Vision, extracting faces from House Oversight documents, and making every page searchable by content.

- **Ingesting 2 million ICIJ offshore entities** with 3.3 million relationship edges from the Panama Papers and Pandora Papers, cross-referenced by jurisdiction, service provider, and beneficial owner.

- **Processing 1.8 million federal enforcement decisions** with behavioral scoring, MITRE ATT&CK mapping, and campaign detection — all automated.

- **Maintaining a live STIX/TAXII feed** consumed by 275+ organizations in 46 countries, publishing 918,000+ indicators from 16,000+ threat intelligence pulses.
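For readers unfamiliar with what a STIX feed actually carries, the sketch below builds a single STIX 2.1 indicator for an IPv4 address and wraps it in a bundle, using only the standard library. The field names follow the STIX 2.1 specification; the helper names, the example address, and the indicator name are made up for illustration and do not reflect how the DugganUSA feed is generated.

```python
import json
import uuid
from datetime import datetime, timezone


def make_ipv4_indicator(ip, name, now=None):
    """Return a STIX 2.1 indicator object (as a dict) matching one IPv4 address."""
    now = now or datetime.now(timezone.utc)
    ts = now.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": ts,
        "modified": ts,
        "name": name,
        "indicator_types": ["malicious-activity"],
        "pattern": f"[ipv4-addr:value = '{ip}']",
        "pattern_type": "stix",
        "valid_from": ts,
    }


def make_bundle(objects):
    """Wrap STIX objects in a bundle, the usual container for exchanging them."""
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": list(objects)}


# 198.51.100.7 is an RFC 5737 documentation address, not a real indicator.
indicator = make_ipv4_indicator("198.51.100.7", "Example credential-stuffing source")
print(json.dumps(make_bundle([indicator]), indent=2))
```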


The data isn't secret. It's all government-released or from international journalism consortiums. The Epstein files came from the DOJ. The ICIJ data came from investigative journalists. The enforcement decisions are public record. The threat indicators are community-sourced.


The magic isn't in having access. The magic is in making it searchable, cross-referenceable, and available to anyone — at API speed — for free.


## What We Learned



**Scale is a solved problem.** Meilisearch handles 10 million documents on 8GB of RAM. It searches them in under 100 milliseconds. It was designed for this. The myth that you need Elasticsearch clusters or Solr farms or proprietary search infrastructure to handle millions of documents is exactly that — a myth that benefits the vendors selling those clusters.


**The hard part is ingestion, not search.** Getting 386,000 DOJ documents from their original PDF/JP2 format into searchable, structured records required custom parsers for every dataset format, OCR pipelines, entity extraction, deduplication logic, and incremental indexing. That pipeline — Butterbot — is the actual product.
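Two of those steps, deduplication and incremental indexing, can be sketched in a few lines. This is not Butterbot's code: the function names, the `text` field, and the choice of a SHA-256 content hash as the document ID are assumptions made for illustration.

```python
import hashlib
from itertools import islice


def content_id(text):
    """Stable ID from normalized text, so re-ingested duplicates hash the same."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(records, seen=None):
    """Yield only records whose content has not been seen before.

    Pass in a persisted `seen` set to make repeated runs incremental.
    """
    seen = set() if seen is None else seen
    for record in records:
        cid = content_id(record["text"])
        if cid in seen:
            continue
        seen.add(cid)
        yield {**record, "id": cid}


def batches(iterable, size):
    """Group an iterable into lists of at most `size` items for bulk indexing."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


pages = [
    {"text": "Flight  log, page 1"},
    {"text": "flight log, page 1"},   # same content after normalization
    {"text": "Deposition transcript"},
]
for batch in batches(dedupe(pages), size=2):
    print(len(batch), "new documents ready to index")
```

Because the ID is derived from content, sending the same page twice to a search engine that upserts by primary key would overwrite the existing document rather than create a second copy.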


**Government data is the ultimate moat.** Nobody can challenge the provenance. Nobody can prosecute the publisher. Every finding is the government's own words, made searchable. If this is what they released, what did they keep?


**Cost scales linearly, not exponentially.** Going from 1 million to 10 million documents didn't require bigger VMs. It required a bigger disk. We added a 64GB data LUN for the cost of a large pizza.


## The Proof of Concept Is Production



This is the pitch for Butterbot: point it at a corpus, any corpus — legal discovery, regulatory filings, FOIA responses, leaked databases, insurance claims, medical records, court transcripts — and it will ingest, index, enrich, and serve it. What took us three months to build for Epstein and ICIJ data can now be replicated in days for any new dataset.


The 10 million documents aren't the product. They're the proof that the product works. At scale. In production. For $280 a month.


A Recorded Future employee registered for our API this week. That tells you everything you need to know about where the market is headed.




*DugganUSA LLC builds Butterbot — autonomous document intelligence at commodity infrastructure cost. Our STIX/TAXII threat feed serves 275+ organizations in 46 countries. Our Epstein document search has 132 registered API customers. All of it runs for less than your company's monthly Slack bill.*


*Try it: [epstein.dugganusa.com](https://epstein.dugganusa.com) | API docs: [epstein.dugganusa.com/docs](https://epstein.dugganusa.com/docs)*





*Her name was Renee Nicole Good.*


*His name was Alex Jeffery Pretti.*



 
 
 
