
Meta's AI Is Training on Our Threat-Intel Site — We Watched It Happen

  • Writer: Patrick Duggan
  • 2 minutes ago
  • 5 min read

Tonight we ran our end-of-day net sweep and something jumped out of Microsoft Clarity's session feed: 127 "Unknown browser / Unknown device / Desktop" sessions, all from ASN 32934 — Facebook.


That didn't smell like a person.


We cross-checked against Cloudflare's firewall logs and got the answer in under sixty seconds: the 127 sessions weren't sessions at all. They were hits from `meta-externalagent/1.1` — Meta's AI-training web crawler — pulling 200 requests in the last 23 hours from our public content, from 65 different IPv6 addresses in Meta's San Jose / Menlo Park data center block (2a03:2880:f802::/48).
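That kind of cross-check is easy to reproduce. A minimal sketch, assuming Cloudflare Logpush-style JSON-lines logs with `ClientRequestUserAgent`, `ClientIP`, and `EdgeResponseStatus` fields (those are Logpush defaults; your export shape may differ):

```python
import json
from collections import Counter

def summarize_crawler(log_lines, ua_substring="meta-externalagent"):
    """Count requests, unique source IPs, and status codes for one crawler UA."""
    requests = 0
    ips = set()
    statuses = Counter()
    for line in log_lines:
        rec = json.loads(line)
        if ua_substring in rec.get("ClientRequestUserAgent", ""):
            requests += 1
            ips.add(rec.get("ClientIP"))
            statuses[rec.get("EdgeResponseStatus")] += 1
    return {"requests": requests, "unique_ips": len(ips), "statuses": dict(statuses)}
```

Run it over a day's worth of log lines and the "sessions" collapse into a single crawler fingerprint: one user agent, many IPs, a telltale mix of 200s and 404s.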



What Meta was trying to take


Meta's crawler hit three categories of our content.


Category 1 — our Hall of Shame blog posts. Old series, factual-looking structured content: IP address + country + ASN + attack pattern, one post per attacker. Meta's crawler made **500 attempts at `/post/hall-of-shame-` URLs in 23 hours.** 494 of those got a 404 from our site because Meta was generating URL variants that don't exist — literally guessing which IP-octet combinations might be a post we'd written. Five of them succeeded and got ingested with a 200.


Hall of Shame is "ancient history" as a series — we stopped adding to it months ago. But Meta is hammering it because it's stable, SEO-indexed, structured attack data. Perfect LLM training corpus: "an IP in this country with this ASN tried this attack." That's exactly the kind of factual pattern a language model wants to learn cybersecurity reasoning from.


Category 2 — our STIX feed files. The crawler also tried to pull domains.csv and ips.csv directly from analytics.dugganusa.com/api/v1/stix-feed/ — our paid threat intelligence product. 9 attempts on `/domains.csv`, 8 on `/ips.csv`, all 403'd. Our edge rules blocked every single one. Good.


Category 3 — our regular blog. 22 other blog posts across the day returned 200 to Meta's crawler and got pulled into whatever training pipeline it feeds. Pattern 49.5 analysis, investigative posts on Russian organized crime, our Handala wiper analysis — ingested.



Why this matters more than it looks


Three reasons.


First: we didn't consent. No API, no subscription, no attribution, no compensation. Meta's crawler is operating under robots.txt politeness theater, and our site — like most sites that don't actively block them — hadn't yet told Meta's crawler specifically to stay out. That silence was being read as permission by a trillion-dollar company's training-data pipeline.


Second: the targeting is informative. Meta is specifically pulling our threat intelligence content. Not our product marketing. Not our landing pages. The IOC-bearing structured investigative writing. That tells you what Meta's models are being trained on — adversary intel, attack patterns, attribution reasoning. They want to be competent at cybersecurity Q&A, and they're acquiring the competence from sites like ours without asking.


Third: the commercial product was protected; the blog was not. Our STIX feed files are paid. They're behind auth. Meta's crawler got 403 on every attempt. But the blog — which is the public face of the same threat intelligence practice — was wide open. The line between "free marketing" and "free training corpus for Meta AI" turned out to be the line between HTTP 403 and HTTP 200.



What we did about it, in real time


Within fifteen minutes of spotting the pattern, we deployed a Cloudflare firewall rule blocking meta-externalagent, FacebookBot, and Meta-ExternalAgent across the dugganusa.com zone. Next Meta crawl attempt: 403.


We kept ClaudeBot, Claude-SearchBot, and anthropic-ai on the allow list, because Anthropic is a partnership we've chosen. That's the whole thesis of what we call Selective AI Visibility: not a binary "block all AI" or "allow all AI," but a per-model choice. Which frontier models do you want trained on your content? Which do you want to appear as a citable source to? Which do you want to compensate you? Which do you want to stay the hell out?


For us: Anthropic in. OpenAI conditional. Google conditional. Meta out. Grok permanently excluded — that was a decision we made independently long before this week.



How to do this on your own site


The block rule shape is simple. In Cloudflare, add a custom firewall rule with the expression:


`(http.user_agent contains "meta-externalagent") or (http.user_agent contains "FacebookBot") or (http.user_agent contains "Meta-ExternalAgent")`


Action: Block. Deploy.
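If you want to sanity-check the rule logic before shipping it, the expression's matching behavior can be mirrored offline in a few lines. Note that Cloudflare's `contains` operator is case-sensitive, which is why both `meta-externalagent` and `Meta-ExternalAgent` appear in the rule:

```python
# The three user-agent substrings from the Cloudflare expression above.
BLOCKED_SUBSTRINGS = ("meta-externalagent", "FacebookBot", "Meta-ExternalAgent")

def should_block(user_agent: str) -> bool:
    """Mirror the firewall rule: case-sensitive substring match, like `contains`."""
    return any(s in user_agent for s in BLOCKED_SUBSTRINGS)
```

Feed it the exact UA strings from your logs and confirm the crawler matches while ordinary browser UAs fall through.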


Test from your own machine with `curl -A "meta-externalagent/1.1" https://yoursite.com` — you should get a 403 response. Repeat without the `-A` flag to confirm normal browsers still work.


You can extend the pattern to other AI training crawlers you don't want to feed:


  • GPTBot and OAI-SearchBot — OpenAI's crawlers

  • Google-Extended — Google's Bard / Gemini training path

  • CCBot — Common Crawl (feeds most models)

  • ByteSpider / Bytedance — TikTok's ByteDance / Doubao

  • PetalBot — Huawei Petal Search

  • amazonbot — Amazon's AI training (distinct from the Alexa voice infrastructure)

  • meta-externalagent and FacebookBot — Meta

  • Applebot-Extended — Apple Intelligence training (distinct from regular Applebot which crawls for Siri search)

  • cohere-ai — Cohere

  • ClaudeBot — Anthropic (keep if you want Claude to know about you)

If you want a more nuanced posture — "block Meta but allow Google because Google sends us traffic" — that's the whole point of AIPM Defense. The old robots.txt was a binary. The new reality is a menu, and every frontier model lab is now a separate line item on it.
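One way to express that menu is a per-model policy table that compiles down to robots.txt stanzas. A hypothetical sketch — the bot tokens come from the list above, the allow/block choices are illustrative, and robots.txt is advisory only, so the firewall rule remains the enforcement layer:

```python
# Hypothetical per-model policy; adjust to your own posture.
POLICY = {
    "ClaudeBot": "allow",
    "GPTBot": "block",
    "Google-Extended": "allow",
    "meta-externalagent": "block",
    "FacebookBot": "block",
    "CCBot": "block",
}

def robots_txt(policy):
    """One stanza per bot: `Disallow: /` blocks everything, empty `Disallow:` allows all."""
    stanzas = []
    for bot, decision in policy.items():
        rule = "Disallow: /" if decision == "block" else "Disallow:"
        stanzas.append(f"User-agent: {bot}\n{rule}\n")
    return "\n".join(stanzas)
```

Compliant crawlers will honor the stanzas; the ones that don't are exactly why the Cloudflare rule exists.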



The receipts, in numbers


  • 200 requests from meta-externalagent/1.1 hit our site in 23 hours

  • 65 unique Meta IPv6 addresses, all in 2a03:2880:f802::/48 (San Jose / Menlo Park data center)

  • 494 of 500 attempts on /post/hall-of-shame-* were Meta's crawler guessing URL variants

  • 22 blog posts successfully ingested with a 200 response before we blocked

  • 17 of 17 attempts on our paid STIX feed returned 403 (commercial product was protected)

  • 0 attribution, compensation, or consent requested

  • 15 minutes from discovery to block deployed

  • 1 AI training pipeline now short one unauthorized data source


The quiet part said out loud


The AI training corpora being built right now, by labs with market caps larger than most countries' GDPs, are being assembled from content they did not pay for, did not ask permission for, and cannot be compelled to credit. The legal framework has not caught up. The economic framework has not caught up. The cultural framework has not caught up.


The only framework that has caught up is the technical one — you can say no, right now, today, on your own infrastructure, in a rule that takes fifteen minutes to ship.


That's what AIPM Defense is. That's why we built it. And tonight, we got to use it on ourselves, watching live as Meta's crawler hit a brand-new 403 on a path it had been freely pulling from for who-knows-how-long.


Our blog posts are not training data. They are our writing, and whoever is going to learn from it should earn the right to do so.




This post itself will probably get scraped by every major AI crawler within hours. The irony is not lost on us. Consider this your notice: `dugganusa.com` has now joined the growing list of sites that require explicit per-model consent before training ingestion. If you are from a lab and would like your crawler added to our allow list, or if you would like to propose a commercial arrangement, `[email protected]`.


If you are from `meta-externalagent`: welcome. The 403 is specifically for you. We are happy to talk.

