How We Built a Threat Feed That's Faster and More Accurate Than the Billion-Dollar Vendors. The Short Version.
- Patrick Duggan
Download the PDF: How We Built a Threat Feed That's Faster and More Accurate Than the Billion-Dollar Vendors — The Short Version (4 pages, bone cardstock)
Today we ship threat intelligence to 275+ organizations in 46 countries, running on about $500 a month of Azure compute, with an internal site-level false-positive rate under 0.004%. CrowdStrike's cheapest Falcon Intelligence tier is around $100,000 per year. Recorded Future's enterprise plan is $50,000+ per seat. Mandiant Advantage starts at $75,000. We charge zero for the free tier and $45 per month for Starter. Our ingestion-to-publication latency is under ten minutes. Theirs is 24 to 72 hours. The difference is not scale. It is architecture.
This is the short version. There is a full methodology document that runs about 8,500 words with flow diagrams for every detection, classification, enrichment, exemption, and alert path in the platform. We share it under NDA with customers, investors, and security teams evaluating us for partnership. The email is at the bottom of this post. What follows is the compressed version you can read in three minutes and walk away with the five principles that do the work.
The whole pipeline in one diagram
```mermaid
flowchart LR
    A[Detect] --> B[Classify]
    B --> C[Enrich]
    C --> D[Exempt]
    D --> E[Emit]
    E --> F[Customer block]
    G[False positive<br/>reported] --> D
    H[Accountability<br/>loop] --> B
```

Five stages, each running on its own cadence, each correctable in place, each with its own measurable accuracy number. Most commercial feeds conflate two or three of these stages into a single monolithic pipeline, which is why correcting a false positive in a commercial feed takes weeks. When detection, classification, enrichment, exemption, and emission are separate layers, a bad signal can be corrected at any one of them without tearing down the others.
Five principles that do the actual work
1. Separate the layers. Detection, classification, enrichment, exemption, and emission are five distinct stages. Each one has its own inputs, its own outputs, its own cadence, and its own error budget. When you conflate them, every correction becomes a rebuild. When you separate them, every correction is a config change.
2. Parallelize everything. Every ingestion source, every classifier, every enrichment call, every AI model query, every cron job runs concurrently. The 15-vendor AIPM audit we published this morning completed in 33.2 seconds end-to-end because five models queried four dimensions across fifteen domains in parallel. Sequential code in a threat pipeline is a performance bug, not a design choice.
3. Use AI only where AI is best. Large language models are good at language, ambiguity, and research. They are bad at deterministic scoring, edge-case rules, and fast lookups. A threat pipeline that uses an LLM for every decision is guaranteed to be both slower and less accurate than a pipeline that uses LLMs only at the decision points where nothing else works. Discipline about where you do not use AI is as important as cleverness about where you do.
4. The exemption layer is where the accuracy lives. Detection and classification always produce false positives. The difference between a feed with a 5% false positive rate and a feed with a sub-0.01% false positive rate is the quality and maintenance of the exemption layer. We put as much engineering into the "definitely not a threat" path as we put into the "definitely a threat" path, and that is the single biggest reason our customers trust the feed in production.
5. Write down every mistake. Every confirmed false positive, every missed detection, every deploy failure, every customer complaint becomes a structured incident record, an automated compliance pattern, and a standing lesson-learned that loads into every future engineering session. Accountability is not a marketing virtue; it is an architectural feature that compounds over time.
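Principle 2 can be sketched in a few lines. This is a stand-in, not the platform's code: `query_model` simulates a network call, and the model and vendor names are placeholders, but the shape is the same as the audit described above — fan every (model, vendor) pair out concurrently so wall time tracks a few latency quanta instead of the sum of all calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def query_model(model: str, vendor: str) -> dict:
    # Stand-in for a real API call with network latency.
    time.sleep(0.05)
    return {"model": model, "vendor": vendor}

models = [f"model{i}" for i in range(5)]
vendors = [f"vendor{i}" for i in range(15)]
pairs = [(m, v) for m in models for v in vendors]  # 75 audits

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(lambda mv: query_model(*mv), pairs))
elapsed = time.perf_counter() - start

print(f"{len(results)} audits in {elapsed:.2f}s")
```

Run sequentially, 75 calls at 50 ms each would take 3.75 seconds; with 32 workers the same batch finishes in roughly three latency quanta, which is the difference between an audit you run quarterly and one you run on a whim.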
Two math tricks nobody else in the category is doing
Bloom filter novelty check. A Bloom filter is a probabilistic data structure that answers one question fast: "have I ever seen this thing before?" A negative answer is definitive; a positive answer is correct to within a tunable false-positive rate. It costs roughly a byte per item — about a megabyte per million indicators — and returns the answer in a microsecond. We keep a continuously updated Bloom filter over the full IOC space — over a million indicators as of this writing — and every new candidate indicator gets a novelty check before classification. Known-bad returning after a dormancy period routes one way. First-ever-seen routes another way. The distinction is load-bearing because the biggest single source of false positives in any feed is re-scoring known-bad indicators as known-good after a tenant reassignment. The Bloom filter is what prevents that class of error, and it runs in O(1).
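A minimal Bloom filter fits in a page of stdlib Python. This sketch is illustrative, not the production implementation: it derives k hash positions from salted `blake2b` digests over an m-bit array, and the example IPs are documentation addresses, not real indicators.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 1 << 20, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # k independent hash positions via k salts of blake2b.
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_seen(self, item: str) -> bool:
        # False means "definitely never seen"; True means "probably seen".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

seen = BloomFilter()
seen.add("203.0.113.7")
print(seen.maybe_seen("203.0.113.7"))   # True: route as returning known-bad
print(seen.maybe_seen("198.51.100.9"))  # False: route as first-ever-seen
```

Both `add` and `maybe_seen` touch exactly k bits regardless of how many indicators the filter holds, which is what makes the novelty check O(1) at a million-plus indicators.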
Cross-index correlation in a single Meilisearch query. The Butterbot platform uses Meilisearch as the primary data store across 42 separate indexes covering IOCs, block events, threat-intel pulses, offshore entity relationships, Epstein files, blog content, behavioral sessions, adversary profiles, and a dozen others. When a single indicator — an IP address, a domain, a company name — appears in multiple otherwise-unrelated indexes within a short time window, that is itself a signal. Cross-index correlation is one of the eight inputs to our Markov-based precursor detection system, and it catches attack campaigns in their earliest phases because the attacker's infrastructure shows up across multiple data surfaces before it shows up as a confirmed threat in any single one of them. Most commercial threat platforms store their data in one index per product line (threats here, research there, customer data elsewhere) and physically cannot correlate across them without multi-minute batch jobs. We correlate in milliseconds because we built the whole thing in one search engine on purpose.
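The scoring idea behind cross-index correlation can be shown in plain Python. This is a sketch of the signal only — the platform computes it inside a single Meilisearch query rather than in application code, and the index names, threshold, and window below are illustrative assumptions: flag any indicator that surfaces in several distinct indexes within a short time window.

```python
from collections import defaultdict

WINDOW = 3600  # seconds; illustrative correlation window

def correlate(hits: list[tuple[str, str, int]], threshold: int = 3) -> list[str]:
    """hits: (indicator, index_name, unix_ts). Return indicators that appear
    in >= threshold distinct indexes within WINDOW seconds of each other."""
    by_indicator = defaultdict(list)
    for indicator, index_name, ts in hits:
        by_indicator[indicator].append((ts, index_name))
    flagged = []
    for indicator, events in by_indicator.items():
        events.sort()
        for t0, _ in events:
            indexes = {name for t, name in events if t0 <= t < t0 + WINDOW}
            if len(indexes) >= threshold:
                flagged.append(indicator)
                break
    return flagged

hits = [
    ("198.51.100.9", "iocs", 1000),
    ("198.51.100.9", "block_events", 1400),
    ("198.51.100.9", "adversary_profiles", 2100),
    ("203.0.113.7", "iocs", 1000),
]
print(correlate(hits))  # ['198.51.100.9']
```

An indicator seen once in one index is noise; the same indicator hitting three unrelated indexes inside an hour is a precursor signal, and that set-of-distinct-indexes count is the input the text describes feeding into Markov-based precursor detection.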
The receipts
This week we published our first quarterly report, the Q2 2026 State of AI Brand Perception in Cybersecurity. Fifteen named cybersecurity vendors. Five AI models. Four verbatim fabrications we caught in 33 seconds — OpenAI insisting CrowdStrike is headquartered in Sunnyvale (it's Austin since 2022), Gemini inventing a Rapid7 founder named "Alan Chhabra," Gemini mutating Snyk's Danny Grander into a completely different security researcher named Danny Gruss, DeepSeek confusing Wiz's Roy Reznik with monday.com's Roy Mann. Every named error is reproducible. Every audit is re-runnable. The full report is at aipmsec.com and the PDF is linked on the DugganUSA blog.
That quarterly report exists because the architecture this post describes makes it cheap to generate. Fifteen vendors, five models, seventy-five audits, in 33 seconds flat. Nobody else in the category is doing that on a weekend, and nobody else is publishing the receipts with the vendor names attached.
The long version
The full whitepaper covers nine parallel ingestion sources, eight independent classifiers, eight enrichment cross-references, ten false-positive prevention mechanisms, the AI integration philosophy in detail, the accountability loop, a comparison table against the three biggest commercial vendors, and flowcharts for each. We share it under NDA with customers considering a paid tier, investors running due diligence, analysts validating our claims, and security teams evaluating us for partnership or procurement.
If you want it, email `[email protected]` with the subject line "methodology" and a sentence about who you are and why. We read every one and respond within 24 hours. The NDA is one page and your legal team will not hate it.
Read the Q2 report: State of AI Brand Perception in Cybersecurity — Q2 2026
Audit your own brand: aipmsec.com
— Patrick
