- Patrick Duggan
- 7 min read
# We Audited Our Own Platform This Week. Here Are 10 Bugs We Found.
The defensive-security industry has a discipline it rarely practices on itself. Vendors audit their customers. Auditors audit the vendors. Compliance frameworks audit the auditors. The thing nobody audits is the platform you actually run on, with the assumptions you actually made when you built it. So twice a year we run what we call a self-examination week. Over the last 48 hours we ran one against our own stack. Ten findings. Six shipped fixes. Three documented deferrals with rollback paths. One honest "we won't touch that until we have a maintenance window we control." The receipts are below.
## The Ten Findings
### 1. The BDE engine wasn't actually scoring per-IOC
Our Bayesian Decision Engine writes a record into oz_decisions for every threat-intelligence event we ingest. Every record carries a bdeNovelty, bdeSignificance, and bdeConfidence field. We've been telling the world the engine produces graded per-IOC novelty using a Bloom filter against six external sources. We checked. The records were carrying constant values per epoch — every queued event from a given month landed at exactly nov=65, sig=20, conf=0 regardless of the actual IOC. The Bloom check was firing; the score wasn't being assembled from the per-event metadata. The fix landed in lib/bde-publisher.js — significance now reads from kev_listed, cvss, threat_type, country, and isp fields on the event object; confidence reads from abuseScore, confidence_level, vtDetections, and references.length. Synthetic test scores now spread across the full 0 to 100 range instead of clustering on three discrete values.
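For readers who want the shape of the fix rather than the diff, here is a minimal sketch of the per-event assembly. The input field names are the ones listed above; the weights, caps, and the two lookup sets are illustrative assumptions, not the values that shipped in lib/bde-publisher.js.

```js
// Illustrative per-event scoring. Field names match the fix; weights,
// caps, and the two lookup sets are assumptions for this sketch.
const HIGH_RISK_COUNTRIES = new Set(['KP', 'IR']);         // illustrative
const BULLETPROOF_ISPS = new Set(['example-bulletproof']); // illustrative

function scoreSignificance(evt) {
  let sig = 0;
  if (evt.kev_listed) sig += 40;                                     // KEV listing dominates
  if (typeof evt.cvss === 'number') sig += Math.round(evt.cvss * 3); // 0..30
  if (['c2', 'exploit'].includes(evt.threat_type)) sig += 15;
  if (HIGH_RISK_COUNTRIES.has(evt.country)) sig += 10;
  if (BULLETPROOF_ISPS.has(evt.isp)) sig += 5;
  return Math.min(sig, 100);
}

function scoreConfidence(evt) {
  let conf = 0;
  if (typeof evt.abuseScore === 'number') conf += Math.round(evt.abuseScore * 0.4); // 0..40
  if (typeof evt.confidence_level === 'number') conf += Math.round(evt.confidence_level * 0.3);
  if (evt.vtDetections > 0) conf += Math.min(evt.vtDetections * 2, 20);
  if (Array.isArray(evt.references)) conf += Math.min(evt.references.length * 2, 10);
  return Math.min(conf, 100);
}
```

The property the synthetic tests now assert is exactly the one that was missing: two different events should almost never land on the same score triple.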
### 2. The GitHub hunt cron was authenticating with a dead PAT
We built a daily hunter that searches GitHub for fresh malware-staging repos using eighteen high-signal queries. Scheduler entry was registered. Cron was firing at 08:15 UTC. Index was zero docs. The cause: the container app's GITHUB_TOKEN env var was bound to the secret github-pat-new, which returned HTTP 401 to every API call we tested it against — the token had been revoked or expired upstream. Three other tokens in the same vault (github-api-token, github-pat-butterbot, the docker-image one) all returned 401 too. The only working one was github-harvester-token. Bound that one as a new env var, redeployed, manually triggered the cron via the scheduler-admin API. First catch within 30 seconds: three GitHub repos, two of them exploit PoCs for the cPanel auth-bypass we had indexed yesterday morning, uploaded seven and fifteen hours after the CVE disclosure.
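If you keep several PATs in one vault, the probe we ran by hand is worth automating. A sketch, assuming Node 18+ (global fetch) and GitHub's /user endpoint as the liveness check:

```js
// Probe a PAT: HTTP 200 means live, 401 means revoked or expired.
async function probeToken(name, token) {
  const res = await fetch('https://api.github.com/user', {
    headers: {
      Authorization: `Bearer ${token}`,
      'X-GitHub-Api-Version': '2022-11-28',
    },
  });
  console.log(`${name}: HTTP ${res.status}${res.status === 401 ? ' (dead)' : ''}`);
  return res.ok;
}

// Run it at cron startup so a dead token fails loudly instead of
// silently indexing zero docs:
//   if (!(await probeToken('GITHUB_TOKEN', process.env.GITHUB_TOKEN))) process.exit(1);
```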
### 3. The cool-shit-notifier had been silent for weeks because of one wrong require path
The notifier is the system that emails Patrick when interesting things happen — nation-state actors landing on our infra, paying customers signing up, zero-days affecting our stack. Sixteen check functions across three tiers, each scheduled. We grepped the notification dedup index. Zero records. Ever. The notifier had never fired a single email since deploy. Cause: line 39 of lib/cool-shit-notifier.js did require('./meilisearch-client'). That filename doesn't exist in the codebase. The actual module is ./meilisearch. The try-catch around the require swallowed the error silently, leaving the meilisearch reference undefined, leaving every check returning null, leaving the inbox empty. Changed one string. The next 5-min tick produced telemetry; tomorrow's morning-coffee digest will produce the first real email.
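The pattern deserves a sketch because it generalizes far beyond this notifier. The first block is the bug, reduced to its essentials; the second is the shape we would suggest, not necessarily the exact code that shipped:

```js
// The bug, reduced: a try-catch meant to tolerate an optional module
// also tolerates a typo'd path, so every downstream check returns null.
let meilisearch;
try {
  meilisearch = require('./meilisearch-client'); // typo: the file is ./meilisearch
} catch (err) {
  meilisearch = null;
}

// Safer shape: swallow only MODULE_NOT_FOUND, and say so out loud.
let meili;
try {
  meili = require('./meilisearch');
} catch (err) {
  if (err.code !== 'MODULE_NOT_FOUND') throw err;
  console.error('notifier: meilisearch module unavailable, checks disabled:', err.message);
  meili = null;
}
```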
### 4. The Tor correlation engine was 500ing on three independent field-name typos
We have eight hundred forty thousand hourly Tor relay snapshots, growing by roughly ten thousand per hour. The correlation engine cross-references these against our IOC index and our block-events index. The endpoint had been returning HTTP 500 to every consumer for longer than we can pin down. Cause: three independent typos. snapshot_date instead of snapshotDate in the sort field — Meilisearch returns 500 because that attribute isn't sortable; the camelCase one is. or_address instead of address in the IP-extraction helper — there is no or_address field on a Tor relay record, so the function silently returned null. first_seen instead of firstSeen in the suspicion-hunt filter — same issue, the underscore version doesn't exist in the index. All three fixed in a single commit. The endpoint now returns 1,096 IOC matches across 922 unique relay IPs.
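For reference, the corrected identifiers in one place, written against the meilisearch JS client. The index name and the thresholds are illustrative; the three field names are the ones the fix landed on:

```js
const { MeiliSearch } = require('meilisearch');

const client = new MeiliSearch({
  host: 'http://127.0.0.1:7700',
  apiKey: process.env.MEILI_SEARCH_KEY,
});

// Pull recent relay IPs for correlation (index name illustrative).
async function recentRelayIps(sinceEpoch) {
  const res = await client.index('tor_relays').search('', {
    sort: ['snapshotDate:desc'],          // was snapshot_date: not sortable, HTTP 500
    filter: `firstSeen >= ${sinceEpoch}`, // was first_seen: no such field
    limit: 1000,
  });
  return res.hits.map((h) => h.address).filter(Boolean); // was or_address: undefined
}
```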
### 5. The STIX feed and OPNsense feed were silently dropping our highest-value IOCs
We E2E-tested the funnel. Started with the IOCs we'd manually ingested earlier in the day from vendor research — the LiteLLM CVE-2026-42208 attacker IPs, the Forest Blizzard router-DNS-hijack tracking entry, the CISA KEV adds for ConnectWise and Windows. None of them appeared in the STIX bundle. None of them appeared in the OPNsense IP blocklist. The cause was a type-string convention mismatch. Our ingest scripts have been writing type: "ipv4" for years (matching the OTX/STIX 2.1 convention). Our consumer endpoints have been filtering on type = "ip" for the same years. The two conventions never agreed, and nobody noticed because the feed's volume was dominated by docs that did match. Fifteen thousand manually curated IPv4 IOCs from vendor research were getting dropped on the floor every day. Fixed both consumer endpoints to accept either type-string. The STIX bundle grew from 10,003 objects to 16,919 (+69%). The OPNsense feed, with the budget-split tuning in place, now carries every confirmed attacker IP from this week's vendor disclosures.
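The consumer-side fix is one filter expression. A sketch against the meilisearch JS client; the index name and the extra confidence clause are illustrative:

```js
// Accept either historical type-string. IN is Meilisearch filter syntax;
// (type = "ip" OR type = "ipv4") is the equivalent long form.
async function ipIocs(client) {
  const res = await client.index('iocs').search('', {
    filter: 'type IN ["ip", "ipv4"] AND confidence >= 50', // confidence clause illustrative
    limit: 10000,
  });
  return res.hits;
}
```

The durable half of the fix is upstream: normalize the type-string at ingest so the next consumer never has to know the history.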
### 6. epstein.dugganusa.com was missing GA4, accessibility landmarks, and data-driven nav order
The property hosts our most-visited content — four hundred thousand DOJ Epstein documents, two million ICIJ offshore entities, the network-graph and financial-flow visualizations. We had stripped the navigation down at some point and never put it back to standard. The visualizations were getting thirty-seven to fifty-one page views per month each because nobody could find them — they weren't in the nav. We pulled the actual page-view data for the property, ranked the visualizations by traffic, and rebuilt the nav order to match the data. Augmented shared-nav.js with idempotent injection of GA4 (G-BSWVJNBXDL), Microsoft Clarity (project wikbwl94d2), and a skip-to-content accessibility link. Twelve pages already including shared-nav.js got the upgrade for free. Eight more pages added the include. Twenty pages now ship the same telemetry and accessibility baseline.
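The injection is keyed by element id, which is what makes it idempotent: a page that includes shared-nav.js twice, or already carries a tag, ends up with exactly one copy. An abbreviated sketch; the real file also bootstraps the gtag dataLayer and the Clarity loader:

```js
// Insert an element once, keyed by id. No-op if it already exists.
function injectOnce(id, build, parent = document.head) {
  if (document.getElementById(id)) return;
  const el = build();
  el.id = id;
  parent.appendChild(el);
}

injectOnce('ga4-loader', () => {
  const s = document.createElement('script');
  s.async = true;
  s.src = 'https://www.googletagmanager.com/gtag/js?id=G-BSWVJNBXDL';
  return s;
});

// The skip link must be first in tab order, so it goes at the top of <body>.
if (!document.getElementById('skip-link')) {
  const a = document.createElement('a');
  a.id = 'skip-link';
  a.href = '#main-content'; // assumes pages expose a #main-content anchor
  a.textContent = 'Skip to content';
  document.body.prepend(a);
}
```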
### 7. The Meilisearch master key was visible in `ps aux` to any user on the VM
Audit finding A3. The systemd unit launched Meilisearch with --master-key 6c9a9ee37b...d56 as a command-line argument. Anyone with shell on the VM could read it via ps aux | grep meili or /proc/<pid>/cmdline. Same key was hardcoded as a fallback in three of our ingest scripts in the public repo. Moved the key to a systemd Environment=MEILI_MASTER_KEY=... line. Process listing now only shows non-secret CLI flags. Same value, different surface.
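The unit-file change looks roughly like this (binary and data paths illustrative). An EnvironmentFile= readable only by root is one notch stricter again, since Environment= lines stay visible to anyone who can read the unit file:

```ini
# Before: key visible to every local user via ps aux and /proc/<pid>/cmdline
#ExecStart=/usr/local/bin/meilisearch --db-path /data/meili --master-key 6c9a...

# After: key lives in the process environment; the CLI shows only non-secret flags
[Service]
Environment=MEILI_MASTER_KEY=<value-from-vault>
ExecStart=/usr/local/bin/meilisearch --db-path /data/meili
```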
### 8. The Meilisearch dump cadence was thirty-two days stale
Audit finding A7. The cron at 30 2 * * * only deleted old dumps; it never created new ones. Built-in snapshot feature was not enabled. Last dump on disk was March 29. Twenty-six million documents of work would have been lost if /dev/sdd1 had failed in the intervening month. Added --schedule-snapshot 86400 --snapshot-dir /data/snapshots to the systemd unit. First daily auto-snapshot fires within twenty-four hours.
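Both flags sit on the same ExecStart line (paths illustrative). Worth pairing with a restore rehearsal, since a snapshot nobody has ever imported is still an assumption:

```ini
[Service]
ExecStart=/usr/local/bin/meilisearch --db-path /data/meili \
    --schedule-snapshot=86400 --snapshot-dir /data/snapshots

# Restore rehearsal on a scratch box:
#   meilisearch --db-path /tmp/restore-test \
#       --import-snapshot /data/snapshots/data.ms.snapshot
```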
### 9. Port 7700 was open to the entire internet
Audit finding A11. The Network Security Group rule on meilisearch-prodNSG had Source: * for the inbound port-7700 allow. Confirmed externally — nc -z -v 20.29.105.97 7700 succeeded from a coffee-shop Wi-Fi. Anyone on the internet could fingerprint the Meilisearch version and probe the index list. Tightened to Source: AzureCloud service tag. Post-change nc -z times out; analytics-dashboard's internal calls still resolve.
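Expressed as a CLI call, the change looks like this (resource group and rule name illustrative). One caveat worth stating: the AzureCloud tag still admits traffic from any Azure-hosted machine in anyone's tenant, so this narrows the surface rather than closing it, which is part of why the key-scoping work below matters.

```sh
# Tighten the inbound 7700 rule from * to the AzureCloud service tag.
az network nsg rule update \
  --resource-group meili-prod-rg \
  --nsg-name meilisearch-prodNSG \
  --name Allow-7700 \
  --source-address-prefixes AzureCloud

# Re-test from outside Azure; this should now time out.
nc -z -v -w 5 20.29.105.97 7700
```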
### 10. The Epstein search key was hardcoded in client HTML, with full-index scope
The frontend at microservices/meilisearch-vm/index.html had var MEILI_KEY = 'a177860bd5...' baked into the JavaScript source. Anyone viewing source could lift the key and use it to query every one of our forty-eight indexes — search-only, no writes, but every customer record, every threat-intel index, every behavioral session. Replaced with server-side injection via nginx proxy_set_header Authorization "Bearer __MEILI_SEARCH_KEY__", where the placeholder gets substituted at deploy time from an Azure Key Vault secret that holds a new scoped key. New scope: epstein_files, icij_offshore, icij_relationships — the three indexes the public Epstein search actually needs. The old key was revoked in Meilisearch. Browser source no longer carries any auth value.
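Minting the replacement is one call against the /keys route (description and expiry here are illustrative). The master key appears only in this server-side call, never in anything a browser loads:

```sh
curl -s -X POST 'http://127.0.0.1:7700/keys' \
  -H "Authorization: Bearer $MEILI_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  --data '{
    "description": "public epstein search, search-only, three indexes",
    "actions": ["search"],
    "indexes": ["epstein_files", "icij_offshore", "icij_relationships"],
    "expiresAt": null
  }'
```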
## What We Deferred Honestly
Three things we did not ship in this window. Each has a documented rollback path and a sized maintenance estimate.
The Meilisearch 1.35 → 1.42 engine upgrade. We tried. The 1.42 binary refuses to start against a 1.35 data directory because the on-disk format changed across the seven minor versions we'd missed. The engine emits an explicit format-mismatch error and exits non-zero. Rolled back to the 1.35 binary in thirty seconds via the .bak we'd kept. The proper upgrade is a two-to-three-hour maintenance window with a full dump-and-restore: dump while the old engine is still running (probably forty-five minutes on the current twenty-million-doc corpus), stop Meili, wipe the data dir, restart with --import-dump, wait for restore to complete (could be hours), verify the forty-eight indexes load cleanly. Scheduled for next maintenance window.
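The window itself, sketched out. Binary names, paths, and the dump uid are placeholders to be confirmed on the day:

```sh
# 1. Dump while 1.35 is still serving; poll the returned task until it succeeds.
curl -s -X POST 'http://127.0.0.1:7700/dumps' \
  -H "Authorization: Bearer $MEILI_MASTER_KEY"

# 2. Stop the engine; set the old data dir aside as the rollback path.
systemctl stop meilisearch
mv /data/meili /data/meili.1.35.bak

# 3. Start the 1.42 binary against the dump; the import runs before it serves.
/usr/local/bin/meilisearch-1.42 --db-path /data/meili \
  --import-dump /data/dumps/<dump-uid>.dump

# 4. Verify every index answers before re-enabling the unit.
curl -s 'http://127.0.0.1:7700/indexes?limit=50' \
  -H "Authorization: Bearer $MEILI_MASTER_KEY"
```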
Scoped per-operation API keys for the analytics dashboard. Today the dashboard uses the Meili master key for every operation — search, ingest, settings. The right shape is one key per operation class with index-scoped permissions. The work is bounded but it touches every script that talks to Meili. Estimated at one focused afternoon. Deferred until the dashboard service has its next planned restart.
Vector-search pilot on the IOC index. Audit finding A5 noted zero embedders configured across all forty-eight indexes, zero documents embedded across twenty million indexed. Nobody else with our IOC volume is running hybrid semantic search over a corpus like this. The pilot would cost two to five thousand dollars to backfill OpenAI embeddings on the IOC index alone, then more to keep it warm. Strategic, not urgent.
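If the pilot gets funded, switching it on is a single settings call shaped like the one below. The index uid, model choice, and document template are all assumptions at this point:

```sh
curl -s -X PATCH 'http://127.0.0.1:7700/indexes/iocs/settings' \
  -H "Authorization: Bearer $MEILI_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  --data '{
    "embedders": {
      "default": {
        "source": "openAi",
        "model": "text-embedding-3-small",
        "apiKey": "<openai-key-from-vault>",
        "documentTemplate": "IOC {{doc.value}} ({{doc.type}}), threat {{doc.threat_type}}"
      }
    }
  }'
```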
## Why Self-Examination Weeks Matter for a One-Person Engineering Team
The platform is twenty-something microservices, four production VMs, a Container Apps deployment, six Azure managed identities, fourteen scheduled crons, eight ingest pipelines feeding forty-eight Meilisearch indexes, and roughly four million dollars' worth of accumulated platform value. Every one of those surfaces accumulates entropy at a rate proportional to how often the deployer touches it. Bugs of this shape do not announce themselves. The BDE engine had been writing constant scores for weeks, possibly months, and the dashboard kept showing high-confidence numbers because the numbers themselves looked plausible. The cool-shit-notifier had a working systemd schedule, a working Graph API account, sixteen check functions across three tiers — and one wrong import path made all of it produce zero output for the entire deployment lifetime.
The discipline is reading your own code as if you didn't write it. Auditing your own infrastructure as if a hostile reviewer were going to publish their findings. Writing the receipts down even when nobody asked for them. Six fixes shipped and three deferrals documented in forty-eight hours is what you can produce when the audit is something you do to yourself instead of something a customer pays for and gets one slide.
The platform is more honest tonight than it was yesterday morning. Tomorrow we look at three more things.
Her name was Renee Nicole Good.
His name was Alex Jeffery Pretti.
