
---
title: "The Architecture Guide for Viral Solutions: How to Maintain Four Nines on a Pizza Budget"
date: 2026-02-25
author: Patrick Duggan
tags: [architecture, whitepaper, infrastructure, meilisearch, azure, uptime, butterbot, engineering]
---



# The Architecture Guide for Viral Solutions: How to Maintain Four Nines on a Pizza Budget


**A DugganUSA LLC Technical Whitepaper**


*How a two-person company serves 10.4 million documents across 32 indexes to 132 API customers in 46 countries — with 99.99% uptime at $280 per month — and what the enterprise infrastructure industry doesn't want you to know about why it works.*




Abstract



Between December 2025 and February 2026, DugganUSA LLC built and operated a document intelligence platform that indexes 10.4 million records across 32 search indexes, serves a STIX/TAXII threat intelligence feed to 275+ organizations in 46 countries, makes autonomous AI-driven security enforcement decisions at a rate of 8,000+ per day, and maintains 99.99% uptime — all on infrastructure that costs $280 per month.


This paper documents every architectural decision, every failure, every trick, and every deliberate constraint that makes this possible. It is written for engineers who are tired of being told they need Kubernetes, and for executives who are tired of being told they need seven figures.




1. The Estate



The entire platform runs on three services:


| Service | Platform | Cost | Purpose |
|---------|----------|------|---------|
| BRAIN | Azure Container Apps | $128/mo | API server, STIX feed, threat intel, search orchestration |
| DRONE | Azure Container Apps | (included) | Lightweight operations UI |
| Meilisearch | Azure B2ms VM | $44/mo | Search engine, 10.4M documents, 35.6 GB |


Supporting infrastructure:

- Azure Container Registry: $27/mo

- Azure Storage (file shares, blobs): $22/mo

- Azure Key Vault: ~$1/mo

- DNS (Cloudflare): $0

- CDN (Cloudflare): $0

- SSL certificates (Cloudflare): $0

- **Total: ~$280/mo**


There is no Kubernetes cluster. There is no load balancer (beyond what Container Apps provides natively). There is no Redis cache. There is no message queue. There is no separate database server. There is no data warehouse. There is no Terraform state bucket. There is no staging environment.


Every component that doesn't exist is a component that can't fail.


1.1 The iPhone Benchmark



The Meilisearch VM has 2 vCPUs and 8 GB of RAM. An iPhone 16 Pro has 8 GB of RAM. Our production search engine — serving 10.4 million documents with sub-100ms query latency — runs on hardware specs that would fit in your pocket.


This is not an accident. It's the result of choosing a search engine (Meilisearch) that was designed for performance at constrained resources, rather than one (Elasticsearch) that was designed to sell you more nodes.


1.2 What We Don't Run



The absence list is as important as the architecture list:


- **No Kubernetes**: Container Apps provides container orchestration without cluster management. We don't manage nodes, don't patch control planes, don't debug pod scheduling. The tradeoff is less flexibility. The benefit is zero operational overhead.

- **No Redis/Memcached**: Meilisearch's query performance (1-5ms on 10M documents) makes caching unnecessary. Adding a cache would add a failure mode, a consistency problem, and a monthly bill — all to save microseconds we don't need.

- **No RabbitMQ/Kafka**: Background processing uses in-process queues. When the process dies, the queue dies, and the cron job picks it up next cycle. We chose idempotent operations over durable queues.

- **No Elasticsearch**: Meilisearch is a single binary. Elasticsearch is a JVM-based distributed system that wants a minimum of three nodes and a master's degree in heap tuning. For our workload (full-text search with filters, not log analytics), Meilisearch is faster, cheaper, and simpler by every measure.

- **No Staging Environment**: We test locally, deploy to production, and monitor. A staging environment would double our costs for a false sense of safety. Our deployment gate (see Section 5) provides the human checkpoint that staging pretends to provide.




2. Disk Architecture: The LUN1 Pattern



This is the first trick, and it saved us from a catastrophic data loss.


2.1 The Problem



Azure B-series VMs ship with an OS disk (30 GB) and a temporary disk. Neither has enough space for a 35.6 GB search database. You need attached storage.


2.2 The Solution



We attached a 64 GB managed disk (LUN1) formatted as ext4, mounted at `/mnt/lun1`, and symlinked the Meilisearch data directory:
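A minimal command transcript for that setup (the Azure device path is an assumption; check `lsblk` for your VM's actual LUN layout):

```shell
# Assumed udev path for LUN 1 on an Azure Linux VM; verify with `lsblk` first.
DEV=/dev/disk/azure/scsi1/lun1

sudo mkfs.ext4 "$DEV"                      # format the managed disk
sudo mkdir -p /mnt/lun1
sudo mount "$DEV" /mnt/lun1

sudo systemctl stop meilisearch            # stop before moving data
sudo mv /data/meili_data /mnt/lun1/meili_data
sudo ln -s /mnt/lun1/meili_data /data/meili_data   # the indirection layer
sudo systemctl start meilisearch
```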





The fstab entry uses `nofail` so the VM boots even if the disk fails to mount:
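The relevant line, with the device identified by UUID (run `blkid` to find yours):

```
# /etc/fstab: nofail lets the VM boot even if the disk is absent
UUID=<lun1-uuid>  /mnt/lun1  ext4  defaults,nofail  0  2
```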





2.3 Why This Matters: The Three-Day Miracle



On February 22, 2026, we moved the Meilisearch data to LUN1 for disk pressure relief — but forgot to create the symlink. The data directory at `/data/meili_data/` ceased to exist on the filesystem.


Meilisearch kept running.


For three days, the search engine served queries from 10+ million documents using nothing but Linux file handles. The process had opened the data files before the directory was removed, and Linux doesn't actually delete files until all file descriptors are closed. The process was serving 364,000+ Epstein documents from files that no longer had a name on disk.


When we discovered the issue, the index showed 142K documents instead of 364K — the process was slowly degrading as it couldn't write new data. We mounted LUN1, created the symlink, restarted Meilisearch, and the full 364K+ dataset came back instantly.


**Lesson**: Linux file handles are your invisible safety net. But the real lesson is simpler: separate your data from your OS disk, use symlinks for indirection, and always persist your mount in fstab.


2.4 The Backup Pattern



Trust, but verify. Verify, but back up.





The backup script:

1. Triggers a Meilisearch dump (atomic snapshot of all indexes)

2. Copies the dump to `/data/dumps/`

3. Rsyncs to Azure File Share at `/mnt/epstein-files/backups/meilisearch/`

4. Backs up the indexer state file (`indexer-state.json`) — because without it, re-indexing starts from scratch
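A sketch of that script, following the four steps above (the admin key variable, state-file path, and wait interval are assumptions):

```shell
#!/usr/bin/env bash
set -euo pipefail

# 1. Trigger an atomic dump of all indexes via Meilisearch's /dumps endpoint
curl -fsS -X POST "http://localhost:7700/dumps" \
  -H "Authorization: Bearer ${MEILI_ADMIN_KEY}"

# Give the dump task time to finish before copying (interval is a guess)
sleep 60

# 2-3. Rsync the newest dump to the mounted Azure File Share
latest=$(ls -t /data/dumps/*.dump | head -n 1)
rsync -a "$latest" /mnt/epstein-files/backups/meilisearch/

# 4. Back up the indexer state file: without it, re-indexing starts cold
rsync -a /data/indexer-state.json /mnt/epstein-files/backups/meilisearch/
```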


The Azure File Share is on Standard tier storage — pennies per GB. The dump for 10.4M documents is approximately 2.7 GB compressed.


**Critical**: Meilisearch has no built-in replication. If the VM dies and you haven't backed up, you re-index everything from source. For our corpus (386K PDFs, 42K JP2 images, ICIJ CSV imports), that's days of work. Back up the state file alongside the data.




3. Search Architecture: Why Meilisearch Wins



3.1 The Selection Criteria



We evaluated Elasticsearch, Typesense, and Meilisearch. The decision came down to four factors:


| Criteria | Elasticsearch | Typesense | Meilisearch |
|----------|--------------|-----------|-------------|
| RAM for 10M docs | 16-32 GB | 8-16 GB | 8 GB |
| Deployment complexity | Cluster (3+ nodes) | Single or cluster | Single binary |
| License | SSPL (not OSS) | GPL v3 | MIT |
| Cost at our scale | $200-500/mo | $100-200/mo | $44/mo (VM only) |


Meilisearch won because it's a single Rust binary that runs on 8 GB of RAM, requires zero configuration for basic workloads, and is MIT licensed. The tradeoff: no built-in replication, no distributed mode, limited aggregation queries. For a search-and-retrieve workload (not analytics), those tradeoffs are free.


3.2 Index Design



32 indexes, each purpose-built. No polymorphic mega-index. Key design decisions:


**Separate indexes per data source**: `epstein_files`, `icij_offshore`, `icij_relationships`, `iocs`, `oz_decisions`. Each has its own schema, its own filterable attributes, its own relevancy tuning. Cross-index search is handled at the API layer, not the search engine layer.


**Filterable attributes are explicit**: Meilisearch requires you to declare which fields are filterable before you can filter on them. We filter on `timestamp`, `country`, `source`, `bdeTier`, `eventType`. Everything else is full-text search only. This keeps the index lean.
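Declaring them is one settings call per index. A sketch using Meilisearch's standard settings route, against the `iocs` index from above (the key variable is an assumption):

```shell
curl -X PUT "http://localhost:7700/indexes/iocs/settings/filterable-attributes" \
  -H "Authorization: Bearer ${MEILI_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  --data '["timestamp", "country", "source", "bdeTier", "eventType"]'
```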


**Embeddings for semantic search**: Select indexes (`adversaries`, `iocs`, `blog`, `paranormal`) have vector embeddings enabled for natural language queries. The embedding model runs inside Meilisearch — no external vector database needed.


3.3 The Nginx Proxy



The frontend never talks directly to Meilisearch's port. Nginx proxies four paths:
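A trimmed sketch of that config (the server name, frontend root, and exact route list are assumptions; 7700 is Meilisearch's default port):

```
server {
    listen 443 ssl;
    server_name search.example.com;   # placeholder hostname

    # Search API paths proxied to the localhost-only Meilisearch
    location /indexes      { proxy_pass http://127.0.0.1:7700; }
    location /multi-search { proxy_pass http://127.0.0.1:7700; }
    location /stats        { proxy_pass http://127.0.0.1:7700; }
    location /health       { proxy_pass http://127.0.0.1:7700; }

    # Everything else is the static search frontend
    location / { root /var/www/search-frontend; }
}
```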





This gives us:

- **TLS termination** at the edge (Cloudflare → nginx → Meilisearch, all HTTPS)

- **Path-level access control** (we could restrict `/indexes` to specific IPs)

- **Static file serving** for the search frontend (same nginx, same VM)

- **No exposed ports** — Meilisearch listens on localhost only


3.4 API Key Hierarchy



Meilisearch has a four-tier key system, and we use all four:


| Key | Permissions | Used By |
|-----|-------------|---------|
| Master | Everything | systemd service config, never in code |
| Admin | All actions, all indexes | VM indexer scripts |
| Stats (read-only admin) | `*.get`, `keys.get` | Frontend stats display |
| Search | `search` only | Frontend search queries |


The frontend JavaScript contains the search key and the stats key. Neither can modify data, create indexes, or read other keys. The master key exists only in the systemd unit file on the VM. The admin key exists only in the indexer scripts on the VM.


**Lesson learned the hard way**: When we restored data from a backup disk, the API keys reset to their auto-generated defaults. The frontend had a different search key hardcoded. Result: every query returned 403. Always document your key configuration, and always verify keys after any data restoration.




4. Autonomous AI Decision-Making: The OZ Engine



This is the part that shouldn't work but does.


4.1 The Problem



A two-person company cannot staff a 24/7 SOC. We needed threat detection and enforcement that runs without human intervention — not "AI-assisted" (human in the loop), but "AI-decided" (human out of the loop, audit trail in the loop).


4.2 The Architecture



The OZ engine processes every inbound request through a scoring pipeline:


1. **Behavioral baseline**: Is this IP doing something we haven't seen before? (novelty score)

2. **Abuse correlation**: AbuseIPDB score, VirusTotal detections, ThreatFox IOC matches

3. **Campaign detection**: Is this IP part of a coordinated pattern? (clustering across time windows)

4. **MITRE mapping**: Which ATT&CK technique does this behavior match?

5. **Scoring**: Composite score → tier classification (low/medium/high/critical)

6. **Decision**: Block, allow, or escalate
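Steps 5 and 6 condense to a small scoring function. This is a sketch: the weights, thresholds, and signal names are illustrative, not the production values.

```javascript
// Sketch of the composite scoring and decision step.
// Weights, thresholds, and field names are illustrative assumptions.
function scoreRequest(signals) {
  // signals: { novelty, abuseConfidence, vtDetections, inCampaign }
  const composite =
    0.30 * signals.novelty +                        // behavioral baseline (0-1)
    0.35 * (signals.abuseConfidence / 100) +        // AbuseIPDB confidence (0-100)
    0.20 * Math.min(signals.vtDetections / 10, 1) + // VirusTotal detections, capped
    0.15 * (signals.inCampaign ? 1 : 0);            // campaign clustering hit

  const tier =
    composite >= 0.8 ? 'critical' :
    composite >= 0.6 ? 'high' :
    composite >= 0.3 ? 'medium' : 'low';

  const decision =
    tier === 'critical' || tier === 'high' ? 'block' :
    tier === 'medium' ? 'escalate' : 'allow';

  return { composite, tier, decision };
}
```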


The decision is indexed to the `oz_decisions` collection (1.8 million decisions and counting) with full audit fields:
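A decision record looks roughly like this (field names beyond the filterable attributes named in Section 3.2 are assumptions, and the values are invented):

```json
{
  "id": "dec-2026-02-24-0001",
  "timestamp": "2026-02-24T14:02:11Z",
  "ip": "203.0.113.7",
  "country": "NL",
  "eventType": "scan",
  "mitreTechnique": "T1595",
  "scores": { "novelty": 0.8, "abuse": 0.9, "campaign": 0.4 },
  "bdeTier": "high",
  "decision": "block",
  "source": "oz-engine"
}
```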





Every decision is searchable, filterable, and auditable. When we get it wrong — and at 95% accuracy, we do — the false positive is documented, reviewed, and fed back into the scoring model.


4.3 The 8,000-Threat Day



On February 24, 2026, the OZ engine processed approximately 8,000 threat decisions. No analyst reviewed them. No SOAR playbook was triggered. No Slack alert was sent. The system scored each request, made a decision, indexed the decision, and moved on.


The STIX feed then published the high-confidence indicators to 275+ downstream consumers within hours. Those organizations — SOCs, MSSPs, government agencies — consumed the threat intelligence without knowing it was produced by AI on a $44 VM.


4.4 The 95% Cap



We guarantee 5% error. This is deliberate.


Claiming 100% accuracy in autonomous security decisions is either lying or dangerous. The 95% cap forces us to maintain audit trails, review false positives, and treat every block decision as potentially wrong. The 34 false positives we've documented aren't embarrassments — they're proof the review process works.


Any vendor claiming 100% detection rates is selling you a story. We're selling you a system that knows it's wrong 5% of the time and shows you its work.




5. The Deployment Gate: How to Not Break Production



5.1 The Problem We Created for Ourselves



Between October 2025 and January 2026, we deployed broken code to production four times without testing. Cumulative cost: $18,500 to $39,500 in lost time, emergency fixes, and one investor demo that went sideways.


5.2 The "Adoy" Protocol



Every deployment — Docker push, Azure update, git push to main — requires a human confirmation word: "adoy." Not "yes." Not "confirm." A specific, uncommon word that cannot be accidentally triggered.


The workflow:

1. Write code

2. Test locally

3. Report to the human: "Changes ready, awaiting confirmation"

4. **Wait**. Do not proceed.

5. Human says "adoy"

6. Deploy


This is enforced by a pre-commit hook that checks for deployment commands (`docker push`, `az containerapp update`, `git push`, `./build-and-push.sh`) and blocks them unless the confirmation flag is set.
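A sketch of such a hook (the flag-file mechanism and the exact pattern list are assumptions):

```shell
#!/bin/sh
# .git/hooks/pre-commit: block staged deployment commands unless confirmed
BLOCKED='docker push|az containerapp update|git push|build-and-push\.sh'

# The "adoy" confirmation sets this flag for a single commit
if [ -f .adoy-confirmed ]; then
  rm .adoy-confirmed
  exit 0
fi

if git diff --cached -U0 | grep -Eq "$BLOCKED"; then
  echo "Deployment command in staged changes. Confirm with 'adoy' first." >&2
  exit 1
fi
```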


5.3 Content vs. Infrastructure



Not everything needs the gate. We split operations into two categories:


- **Content** (blog posts, threat intelligence, evidence reports): Deploy autonomously. If a blog post has a typo, fix it. If a STIX indicator is wrong, update it. The blast radius is small and the rollback is trivial.

- **Infrastructure** (Docker images, Azure config, database migrations): Gate required. The blast radius is the entire platform and the rollback might not be trivial.


This distinction lets us move fast on content (684 blog posts in three months) while protecting infrastructure stability (99.99% uptime).


5.4 The Build Enforcement



Two rules that have prevented approximately a dozen production incidents:


**Rule 1: Always AMD64.** Our development machines are Apple Silicon (ARM64). Production runs on Azure (AMD64). Building a Docker image on a Mac without specifying `--platform linux/amd64` produces an ARM64 image that silently fails on Azure. We wrapped this in `build-and-push.sh` and banned direct `docker build` commands.
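A sketch of the wrapper (the registry and image tag are placeholders):

```shell
#!/usr/bin/env bash
# build-and-push.sh: the only sanctioned way to build images
set -euo pipefail

IMAGE="example.azurecr.io/brain:latest"   # placeholder registry/tag

# Force AMD64 even on Apple Silicon; a native ARM64 build
# would pass locally and crash on Azure.
docker buildx build --platform linux/amd64 -t "$IMAGE" --push .
```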


**Rule 2: Never Alpine.** Alpine Linux uses musl libc instead of glibc. This breaks native Node.js modules, Python C extensions, and approximately 15% of npm packages that include compiled binaries. We use `node:20-slim` (Debian-based) exclusively. The image is 50 MB larger. The reliability is 100% higher.




6. Ingestion: The Actual Product



6.1 The Pipeline



The search engine is a commodity. The ingestion pipeline is the moat.


For the Epstein DOJ files alone, ingestion required:

- Downloading 12 datasets (45,000 PDFs, 14 GB)

- Parsing EFTA (Electronic File Transfer Agreement) metadata from filename conventions

- Extracting text from PDFs using standard text extraction

- OCR-ing 42,000 JP2 (JPEG 2000) images using Google Cloud Vision API ($29.93 total)

- Extracting faces from House Oversight documents using OpenCV (66 face crops at 70-98% confidence)

- Deduplicating across datasets (101,112 unique EFTA IDs from overlapping batches)

- Incremental indexing with state tracking (resume from last indexed document)


For the ICIJ data (Panama/Pandora Papers):

- Parsing CSV exports with proper handling of multi-line quoted fields

- Building entity records from nodes (2M entities)

- Building relationship edges (3.3M relationships)

- Cross-referencing jurisdictions with human-readable country names


6.2 The State File Pattern



Every indexer maintains a state file (`indexer-state.json`) that tracks:

- Which documents have been indexed

- Which documents failed (and why)

- The last successful batch timestamp

- The total document count


This is simple but critical. Without the state file, every restart re-indexes the entire corpus. With 386K Epstein documents, that's 8+ hours of work. With the state file, a restart skips everything already indexed and picks up from the last batch.


**Back up the state file.** We learned this when a Meilisearch restart wiped our index from 364K to 36K. The data was gone, and without the state file, we couldn't tell the indexer which 328K documents to re-index.


6.3 Cron Over Queues



Our indexer runs on a cron schedule (every 4 hours), not on a message queue. This is a deliberate choice:


- **Cron is self-healing**: If the process crashes, it starts again next cycle. No dead letter queue. No retry logic. No monitoring for stuck consumers.

- **Cron is observable**: `crontab -l` shows you the entire job schedule. No RabbitMQ management UI. No CloudWatch alarms for queue depth.

- **Cron has zero dependencies**: It's been in Unix since 1975. It will outlive every message queue product on the market.
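The entire scheduling layer is one crontab line (the script path and log location are assumptions):

```
# Run the indexer every 4 hours, on the hour
0 */4 * * * /opt/indexer/run-indexer.sh >> /var/log/indexer.log 2>&1
```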


The tradeoff is latency. New documents appear within 4 hours, not 4 seconds. For our use case (government document archives that update in batches, not streams), 4-hour latency is not just acceptable — it's irrelevant.




7. Security Without a Security Team



7.1 The Key Vault Pattern



All secrets live in Azure Key Vault. Application code never contains credentials. User passwords, API keys, third-party tokens — all stored in Key Vault, all accessed via Azure Managed Identity (no credentials to manage the credentials).


Cost: approximately $1 per month for our secret volume.


7.2 Session-Based Auth Over JWT



We chose server-side sessions (Express.js + Azure Table Storage) over JWT tokens. The conventional wisdom says JWT is better for stateless architectures. The conventional wisdom is wrong for our use case:


- **Session revocation is instant**: Delete the session. With JWT, you need a blocklist (which is just a session store with extra steps).

- **No token theft escalation**: If someone steals a session cookie, we can invalidate it. If someone steals a JWT with a 24-hour expiry, they have 24 hours of access.

- **Simpler implementation**: `req.session.username` vs. parsing, verifying, and refreshing tokens.


7.3 Cloudflare as Shield



Cloudflare sits in front of everything. It provides:

- DDoS mitigation (free tier handles our traffic volume)

- TLS termination and certificate management

- Geographic access policies

- Bot detection

- CDN caching for static assets


Cost: $0. The free tier is sufficient for our traffic (172K requests per week).




8. Monitoring Without Monitoring Tools



8.1 The Traffic Report



We don't use Datadog ($23/host/mo). We don't use New Relic ($25/host/mo). We don't use PagerDuty ($21/user/mo).


We wrote a single script: `traffic-report.js`. It pulls from:

- Google Analytics 4 (sessions, page views, geography)

- Google Search Console (clicks, impressions, CTR)

- Cloudflare API (requests, threats, bandwidth)

- Wix API (blog post count, views)

- Bluesky API (follower count, engagement)

- Azure Cost Management API (MTD spend)

- Meilisearch stats (document counts, index health)

- STIX feed consumer registry (consumer count, countries)

- VM stats via SSH (uptime, memory, disk, running processes)


The output is a single HTML report. One command. All metrics. No dashboard login. No alert fatigue. No monthly bill.
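The core of such a script is a fault-tolerant fan-out. This sketch uses stand-in fetchers; the real versions call the APIs listed above.

```javascript
// Sketch of the fan-out at the heart of a traffic-report script:
// query every source, tolerate individual failures, report what we can.
async function collect(sources) {
  // sources: { name: asyncFetcherFn, ... }
  const results = await Promise.allSettled(
    Object.entries(sources).map(async ([name, fetchFn]) => [name, await fetchFn()])
  );
  // Keep only the sources that answered; a down API drops out of the report
  return Object.fromEntries(
    results.filter((r) => r.status === 'fulfilled').map((r) => r.value)
  );
}
```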


8.2 The NPS Widget



A one-question survey embedded in the search frontend: "How likely are you to recommend this tool?" (0-10). Results indexed to Meilisearch. Total infrastructure for customer feedback: zero additional cost.




9. What Breaks (And How We Survive It)



9.1 The Failure Catalog



| Incident | Impact | Root Cause | Recovery Time |
|----------|--------|------------|--------------|
| Data directory deleted | Index degraded 364K→142K | Disk migration without symlink | 15 minutes (once found) |
| API keys mismatched | All searches returned 403 | Backup restoration reset keys | 10 minutes |
| Task queue backlog | Searches returned 408 | 8,800 pending tasks after restart | Self-resolved (~20 min) |
| 34 false positive blocks | Legitimate users blocked | Overzealous behavioral scoring | Minutes per incident |
| Broken Docker image | Service crash on deploy | ARM64 image on AMD64 host | 5 min (rollback) |
| Key Vault secret missing | 500 error on user deletion | Assumed secret existed | 5 min code fix |
| Meilisearch index drop | 364K→36K documents | Process restart without backup | Hours (re-index from cron) |


9.2 The Recovery Patterns



**Pattern 1: Symlinks create indirection.** When data moves, the symlink changes. When symlinks break, they're one command to fix. Never hardcode data paths.


**Pattern 2: Linux file handles are your parachute.** A running process keeps deleted files alive. Don't panic-restart a process that's still serving — investigate first.


**Pattern 3: State files are cheaper than re-processing.** A 1 MB JSON file that tracks 101K indexed document IDs saves 8 hours of re-indexing. Back it up.


**Pattern 4: Human gates prevent cascading failures.** The "adoy" protocol exists because automated deploys caused $39K in damage. The five seconds it takes to type a confirmation word has saved us from at least a dozen bad deploys.


**Pattern 5: The 95% cap is a feature, not a limitation.** When your autonomous system admits it's wrong 5% of the time, you build audit trails. When it claims 100%, you build complacency.




10. The Cost Breakdown Nobody Wants You To See



The enterprise infrastructure industry is a $500 billion market built on the premise that scale requires scale. Here's what scale actually costs:


| What They Sell You | What We Use | Annual Savings |

|-------------------|-------------|----------------|

| Elasticsearch cluster (3 nodes) | Meilisearch (1 binary) | $2,400-6,000/yr |

| Kubernetes cluster | Azure Container Apps | $3,600-12,000/yr |

| Datadog + PagerDuty | One Node.js script | $5,000-15,000/yr |

| Redis cache | Nothing (queries are fast enough) | $1,200-3,600/yr |

| Message queue (SQS/RabbitMQ) | Cron jobs | $600-2,400/yr |

| Staging environment | Local testing + human gate | $3,360/yr |

| Enterprise search license | MIT-licensed Meilisearch | $50,000-200,000/yr |


Our total annual infrastructure cost: **~$3,360** ($280 × 12).


A single Elasticsearch node on AWS (m5.xlarge) costs $140/month. We run our entire platform — search engine, API server, threat intelligence pipeline, STIX feed, and autonomous security enforcement — for double the cost of one node of one component of the stack they'd sell you.




11. When This Architecture Fails



Intellectual honesty requires acknowledging the limits.


**This doesn't work for**: Real-time streaming analytics. Sub-millisecond latency requirements. Multi-region high availability. Workloads requiring strong consistency guarantees. Teams larger than ~10 engineers (the operational simplicity trades off against team scalability).


**This doesn't work at**: Billions of documents (Meilisearch single-node has practical limits around 100M-200M documents). Thousands of concurrent writers (our indexer is single-threaded by design). Geographic distribution requirements (we run in one Azure region).


**This doesn't work if**: You need five nines (99.999%). Our four nines (99.99%) accept approximately 52 minutes of downtime per year. We've used some of those minutes. If your SLA requires less than 5 minutes of annual downtime, you need redundancy we don't have.


The honest answer is that our architecture works for datasets in the millions-to-low-hundreds-of-millions range, with read-heavy workloads, moderate concurrent users, and tolerance for minutes-not-seconds of recovery time. That describes most of the software running on earth today.




12. Conclusion: The Lie and the Truth



**The lie**: You need enterprise infrastructure to run at enterprise scale. You need Kubernetes to orchestrate containers. You need Elasticsearch to search documents. You need a team of 12 to maintain a platform.


**The truth**: You need the right tool for the job, the discipline to not over-engineer, and the honesty to document what breaks. We serve 10.4 million documents to customers in 46 countries on infrastructure that costs less than a large pizza per week. Our AI makes 8,000 autonomous security decisions per day. Our STIX feed is consumed by organizations that spend more on coffee than we spend on compute.


The infrastructure isn't the product. The infrastructure is a commodity. The product is what you do with it — the corpus you build, the decisions you automate, the intelligence you produce.


We chose to build the world's largest searchable archive of government-released Epstein documents, cross-referenced with 2 million ICIJ offshore entities and 918,000 threat indicators. We did it for $280 a month. And the documents keep coming.


Build something viral. Run it on an iPhone's worth of hardware. Let the enterprise vendors explain why they charge a thousand times more for a thousand times less.




*DugganUSA LLC | Minneapolis, Minnesota*

*Founded December 1, 2025*

*D-U-N-S: 14-363-3562 | SAM.gov UEI: TP9FY7262K87*


*Patrick Duggan — Founder, Architect*

*Paul Galjan — Co-founder, Operations*


*Platform: analytics.dugganusa.com | epstein.dugganusa.com*

*STIX Feed: analytics.dugganusa.com/api/v1/stix/taxii2*

*Blog: www.dugganusa.com*





*Her name was Renee Nicole Good.*


*His name was Alex Jeffery Pretti.*

 
 
 
