
Microsoft Clarity Is Not an Analytics Tool. It's a Behavioral Training Corpus.

  • Writer: Patrick Duggan

I installed Microsoft Clarity on our infrastructure yesterday. Three subdomains, four product templates, an aipmsec.com landing page, the Epstein search tool, the Ops dashboard. All of it. I did the same thing a million other developers have done — added two lines of JavaScript to get heatmaps and session recordings, for free.


Within twenty-four hours I realized what I had agreed to. This essay is what I think every developer who installs Clarity should know before they finish the integration.





The Stated Story


Microsoft Clarity is described as a free behavioral analytics tool. Heatmaps. Session recordings. Rage clicks. Dead clicks. Quick backs. Scroll depth. A generous, fast, well-engineered alternative to Hotjar for teams that don't want to pay $1,000-$3,000 a month for the same capability.


That story is not wrong. It's just incomplete in a way that matters enormously.


The official numbers (per Microsoft's own marketing material as of early 2026): roughly 4 million websites have installed the tag. Adoption velocity is accelerating. The tool genuinely works. Teams that install it solve real UX problems with it.


That much is documented. Now the part that isn't.





The Aggregator Play in Plain Sight


The real value of Clarity isn't what it shows users. It's what it shows Microsoft.


Every installed instance is a data collection node. Every session replay, every heatmap, every scroll pattern and hesitation and exit and copy-paste event flows upstream. Across roughly four million websites. Across what is almost certainly tens of billions of user interactions per month. At negative marginal cost to Microsoft, because the website operators pay for it themselves — by installing the tag voluntarily to solve a legitimate business problem.


This is the aggregator play. It's hiding in plain sight precisely because the tool is genuinely useful.


Hotjar, FullStory, and Contentsquare are optimizing their pricing tiers and feature sets. They're competing on the wrong dimension entirely. They're in a tools race. Microsoft is in a data accumulation race. Different sport. Different finish line.


The asymmetry is structural and unfixable by any of those competitors:


  • Hotjar's marginal cost to acquire your behavioral data: positive. You pay them.

  • Microsoft's marginal cost to acquire your behavioral data: negative. They give you tooling worth a $20K/year subscription, for free, because the data is worth more to them than the subscription would be.

A negative-cost acquisition channel against a positive-cost competitor is checkmate by default.





Data Gravity


In physics, gravity increases with mass. In data, the same principle applies — large datasets attract applications, integrations, and infrastructure. The bigger the dataset, the more things orbit it.


Microsoft Clarity is building mass quietly. Clarity launched October 2020, so the dataset is currently about 5.5 years deep. By 2030 it will be a decade deep. Unlike compute — which can be rented from any cloud provider — proprietary behavioral data at web scale is not for sale. It cannot be replicated by a competitor who wakes up late to the opportunity. It can only be accumulated, slowly, over time, by being present everywhere.


By the time the market fully internalizes what's been built, the dataset will be out of reach. The gravity well will already be deep and pulling.





The Training Data Triad


What an AI agent needs to actually use the web — not just read it, but use it the way a human uses it — comes in three layers:


  1. Semantic layer. What the page says. Common Crawl. The open web.

  2. Conversational layer. How humans talk about the page and to each other. Reddit (Google reportedly pays about $60M a year for training-data access under a deal announced in 2024). Stack Exchange. Discord servers.

  3. Behavioral layer. What humans actually do on the page. Where they hesitate. What they click. Where they abandon. How they move a cursor across a checkout flow versus how the designer assumed they would.

Google has the first two via Search and the Reddit deal. Meta has the conversational layer at scale via WhatsApp and Messenger. Anthropic and OpenAI have neither at comparable scale, and certainly not the third.


Microsoft, alone, has the behavioral layer at meaningful scale.


That's the asymmetric piece nobody has named yet. Behavioral data is the missing training-data category — the one that converts a model that describes the web into an agent that navigates it.





How Behavioral Data Becomes Training Data


The bridge is mechanical, not magical. Here's the simplified version:


An agent watching ten thousand humans hesitate before clicking a "Cancel Subscription" button, with half of them then abandoning the flow, learns dark patterns better than a model trained on screenshots ever could. The screenshot-trained model sees a button that says Cancel Subscription. The model trained on Clarity-grade behavioral data learns which buttons humans actually click, which they hesitate over, which they rage-click after the page fails to respond, and which they abandon.


That's the difference between an agent that can describe a checkout page and an agent that can complete one without getting stuck on the dark-patterned bits. It's the difference between an LLM that summarizes a SaaS dashboard and an agent that successfully cancels your unwanted SaaS subscription on your behalf.


The semantic layer alone never gets you there. The conversational layer doesn't get you there either. Only the behavioral layer does.
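To make the "mechanical" part concrete, here is a minimal sketch of how a raw interaction stream could be turned into per-session behavioral labels for a single element. Everything in it is an assumption for illustration: the `Event` schema, the `label_sessions` helper, the label names, and the thresholds are hypothetical, and none of it reflects Clarity's actual data format or anything Microsoft has described.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One recorded interaction. Hypothetical schema, not Clarity's export format."""
    session: str
    t: float      # seconds since session start
    kind: str     # "hover", "click", or "exit"
    target: str   # element label, e.g. "cancel-subscription"


def label_sessions(events, target, hesitate_after=3.0, rage_clicks=3, rage_window=1.0):
    """Return one behavioral label per session that touched `target`.

    Thresholds are illustrative guesses, not values taken from any real product.
    """
    by_session = {}
    for ev in sorted(events, key=lambda e: (e.session, e.t)):
        by_session.setdefault(ev.session, []).append(ev)

    labels = {}
    for session, evs in by_session.items():
        hovers = [e.t for e in evs if e.kind == "hover" and e.target == target]
        clicks = [e.t for e in evs if e.kind == "click" and e.target == target]
        if not hovers and not clicks:
            continue  # this session never reached the element
        if not clicks:
            labels[session] = "abandoned"
        elif any(clicks[i + rage_clicks - 1] - clicks[i] <= rage_window
                 for i in range(len(clicks) - rage_clicks + 1)):
            # rage_clicks or more clicks landed inside one rage_window
            labels[session] = "rage_clicked"
        elif hovers and clicks[0] - hovers[0] >= hesitate_after:
            labels[session] = "hesitated_then_clicked"
        else:
            labels[session] = "clicked"
    return labels


events = [
    Event("a", 1.0, "hover", "cancel-subscription"),
    Event("a", 6.5, "click", "cancel-subscription"),
    Event("b", 2.0, "hover", "cancel-subscription"),
    Event("b", 9.0, "exit", "page"),
    Event("c", 1.0, "click", "cancel-subscription"),
    Event("c", 1.3, "click", "cancel-subscription"),
    Event("c", 1.6, "click", "cancel-subscription"),
    Event("d", 0.5, "hover", "cancel-subscription"),
    Event("d", 1.0, "click", "cancel-subscription"),
]
labels = label_sessions(events, "cancel-subscription")
print(Counter(labels.values()))
```

Aggregated over thousands of sessions, label counts like these are the kind of signal a screenshot alone cannot carry: the same button, annotated with what people actually did when they met it.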





The Play Behind the Play: Recall


Web behavior is half of human-computer interaction. The other half is what humans do on the desktop — across applications, file managers, screen captures, copy-paste actions, IDEs, design tools, spreadsheets.


Microsoft Recall, the controversial Windows 11 feature that takes periodic snapshots of what is on your screen, is the desktop-side complement to Clarity's web side. Together they form a complete behavioral corpus from keystroke to scroll, from terminal command to checkout button.


When Recall's announcement drew a privacy backlash in 2024, the public-facing framing was "users want their own searchable activity history." The strategic framing, the one that explains why Microsoft pushed forward despite the backlash and relaunched it as an opt-in feature in 2025, is the same one that explains Clarity. Both are behavioral training corpora. One captures how humans use the web. The other captures how humans use everything else.


Microsoft is not acquiring behavioral data via Recall and Clarity to enable analytics products. It is acquiring that data because no AI agent that wants to act on a human's behalf can do so credibly without it.





Why It Works


The strategy succeeds because it demands almost nothing from Microsoft internally. There's no monetization deadline. No product roadmap dependency. No need for a champion to defend it in a quarterly business review.


The data accumulates passively while the tool does its job. Microsoft's other ambitious platform plays have historically required execution, and Microsoft has a long history of fumbling exactly that.


Clarity requires only that they not kill it.


That's a much lower bar. And the asset compounds either way.





The Counter-Arguments, Honestly Considered


A few objections worth pre-empting:


"Apple's iOS privacy stance must blind Clarity on mobile." Partly true. Safari's Intelligent Tracking Prevention does limit some Clarity functionality. But Clarity still captures behavioral signals from the page side (DOM mutations, scroll velocity, click positions) that don't require cross-site tracking. iOS reduces the corpus quality for mobile Safari users; it doesn't eliminate the harvest. And Android plus desktop is plenty.


"GDPR and the EU AI Act prohibit this kind of behavioral profiling." GDPR requires a lawful basis, which most Clarity-enabled sites obtain as consent via cookie banners. The EU AI Act puts its heaviest obligations on high-risk AI systems, a category that behavioral training corpora do not fall into (yet). The legal frame focuses on output discrimination, not input collection. The rules will likely tighten, but the corpus accumulated under today's rules is not erasable retroactively.


"Microsoft has all the world's behavioral data and still ships Bing." Two things can be true. Bing is a product execution problem. The behavioral corpus is a strategic asset that becomes more valuable as agents become the primary interface. They're independent bets. Microsoft has been historically bad at the first and historically patient with the second.





What This Means for You


If you operate a website, you should know the following:


  • The choice to install Clarity is not just a UX-tools decision. It's a data-supply-chain decision.

  • You are voluntarily contributing to a behavioral training corpus you cannot query, audit, opt out of cleanly, or export.

  • The corpus you're contributing to will likely become load-bearing for the agentic web. That includes agents from competitors, regulators, and threat actors who eventually access it through Microsoft's developer platform.

  • The only behavioral-analytics options without this property are paid (Hotjar, FullStory, Contentsquare), self-hosted (Plausible Analytics with custom event tracking), or going without behavioral analytics entirely.

This isn't a recommendation to remove Clarity. We didn't remove ours. It's a recommendation to install it consciously, to treat the behavioral data flowing out of your domain as a real cost paid in real currency, and to assume that cost will rise as the corpus's strategic value rises.


The Sligo rule applies here as much as anywhere: you can't credibly warn people about something you're not also doing without owning the contradiction. I'm doing it. I'm telling you why I'm doing it. I'm also telling you what the price is in language that the person who sold it to me wouldn't use.


Make your own call.





The Insight


The companies most threatened by this are reading Clarity's market share in the analytics category. They should be reading it in the AI training data and web navigation intelligence category.


By the time that reframe becomes consensus — and it will — it will already be too late to close the gap.


The gravity well is forming. Most people just haven't felt the pull yet.




I'm collecting this and the related patterns under a working name: Pattern 50, "Aggregator Camouflage" — the pattern of free tools whose stated user-facing purpose differs strategically from their actual data-aggregation purpose. AIPM (our AI presence audit tool at [aipmsec.com](https://aipmsec.com)) is being extended this week to detect and score behavioral-telemetry-stack tools as a separate signal from general analytics. Because if you're going to make the trade, you should at least know you're making it.
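As a rough illustration of what "a separate signal from general analytics" could look like, the sketch below scans a page's script tags and sorts recognized tags into two buckets. It is not AIPM's implementation. The `SIGNATURES` table, the category names, and the `detect_telemetry` helper are hypothetical, the host fragments are assumptions that should be checked against each vendor's current tag, and the inline snippet in `PAGE` is an abbreviated stand-in rather than any vendor's official install code.

```python
from html.parser import HTMLParser

# Hypothetical signature table: host fragment -> (tool, category).
# Host fragments are assumptions; verify against each vendor's current snippet.
SIGNATURES = {
    "clarity.ms": ("Microsoft Clarity", "behavioral"),
    "hotjar.com": ("Hotjar", "behavioral"),
    "fullstory.com": ("FullStory", "behavioral"),
    "contentsquare.net": ("Contentsquare", "behavioral"),
    "plausible.io": ("Plausible", "general"),
    "googletagmanager.com": ("Google tag", "general"),
}


class ScriptCollector(HTMLParser):
    """Collects script src attributes and inline script bodies, nothing else."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            src = dict(attrs).get("src")
            if src:
                self.chunks.append(src)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


def detect_telemetry(html):
    """Return recognized tools grouped by category, based on script content only."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    haystack = " ".join(parser.chunks).lower()
    found = {"behavioral": [], "general": []}
    for fragment, (tool, category) in SIGNATURES.items():
        if fragment in haystack:
            found[category].append(tool)
    return found


# Abbreviated stand-in page: one inline loader, one external script, one plain link.
PAGE = """<html><head>
<script>(function(w,d,i){var t=d.createElement("script");
t.src="https://www.clarity.ms/tag/"+i;d.head.appendChild(t);})(window,document,"PROJECT_ID");</script>
<script defer src="https://plausible.io/js/script.js"></script>
</head><body><a href="https://hotjar.com">a link, not a tag</a></body></html>"""

print(detect_telemetry(PAGE))
```

Matching only inside script tags is a deliberate choice here: a page that merely links to a vendor should not be scored as running that vendor's tag. A real detector would also need to handle tags injected through a tag manager, which static HTML inspection cannot see.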


— Patrick Duggan, DugganUSA LLC

