Tracking brand visibility in AI search has become one of the loudest categories in martech. Tools launch every month, each promising to tell you whether ChatGPT recommends you, whether Perplexity cites you, whether Gemini knows you exist. Most of them are giving you bad data, and the reason is almost embarrassingly simple: they're tracking AI the way they used to track Google.
AI search doesn't work that way.
Google is deterministic. The same query at 9:00 and 9:01 returns the same SERP, give or take ad rotation. AI assistants aren't. The same prompt to ChatGPT in two consecutive runs can name a different list of brands, in a different order, with different reasoning attached. Track once, you get noise. Track three times, you start to see signal. Track five and the picture is mostly stable. That's not a quirk to design around — it's the whole problem your tracking has to acknowledge.
This article walks through what AI visibility tracking actually has to measure, why single-shot results mislead, why ChatGPT-only tools miss most of the story, and how to read AI brand mentions in a way that produces decisions rather than vanity metrics. It maps onto the AI Tracker tool we built at Algorithm — six engines, multi-pass stability scoring, position-weighted share of voice — but the principles apply regardless of which tracker you use.
Why AI search is non-deterministic, and why it matters
The mechanics are straightforward once you've seen them. AI assistants don't pull from a fixed index. They generate answers token by token, sampling from a probability distribution at every step. Even with temperature near zero, retrieval order, context window state, and tool-use timing introduce variance. The same query produces statistically similar but not identical answers — and "statistically similar" can absolutely mean five different brands recommended across five runs.
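A toy simulation makes the variance easier to picture. The sketch below is not how any assistant works internally. It only assumes that each brand has some relative likelihood of being named (the weights are invented for illustration) and draws three distinct brands per "run" by weighted sampling without replacement.

```python
import random

# Hypothetical relative weights for how likely a model is to name each brand.
# These numbers are invented for illustration, not measured from any engine.
BRAND_WEIGHTS = {
    "HubSpot": 0.9,
    "Salesforce": 0.6,
    "Zoho": 0.3,
    "Pipedrive": 0.3,
    "Monday.com": 0.2,
    "Freshsales": 0.2,
}

def simulate_run(rng, k=3):
    """Draw k distinct brands, weighted by BRAND_WEIGHTS, to mimic one answer."""
    remaining = dict(BRAND_WEIGHTS)
    picked = []
    for _ in range(k):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        choice = rng.choices(names, weights=weights, k=1)[0]
        picked.append(choice)
        del remaining[choice]  # a brand is named at most once per answer
    return picked

rng = random.Random(42)  # fixed seed so the simulation is repeatable
runs = [simulate_run(rng) for _ in range(5)]
for i, run in enumerate(runs, start=1):
    print(f"run {i}: {run}")
```

Even in this simplified model, the high-weight brand tends to recur while the low-weight brands rotate in and out, which is the pattern a single run cannot show.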
I tested this on a single query for "best CRM for small business" through ChatGPT three times in a row, fresh chats each time. First run named HubSpot, Salesforce, Zoho. Second run named HubSpot, Pipedrive, Monday.com. Third named Salesforce, HubSpot, Freshsales. HubSpot showed up all three times. Salesforce showed up twice. Pipedrive, Zoho, Monday.com, and Freshsales each appeared exactly once. Same query, same time window, same account.
If a tracker had run only the first query, it would have told me Pipedrive isn't visible in ChatGPT. If it had run only the second, it would have said Salesforce isn't visible. Both conclusions would be wrong. Both conclusions would be in someone's quarterly report.
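The tally from those three runs can be reproduced in a few lines. This is a minimal sketch using the brand lists reported above; the function name `mention_counts` is just an illustrative choice.

```python
from collections import Counter

# The three ChatGPT runs described above, in the order the brands were named.
runs = [
    ["HubSpot", "Salesforce", "Zoho"],
    ["HubSpot", "Pipedrive", "Monday.com"],
    ["Salesforce", "HubSpot", "Freshsales"],
]

def mention_counts(runs):
    """Count in how many runs each brand appears (at most once per run)."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))
    return counts

counts = mention_counts(runs)
# Sort by count (descending), then by name, so the printout is stable.
for brand, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(f"{brand}: {n}/{len(runs)}")
```

Any single row of `runs` taken alone would report three brands as visible and the other three as absent, which is exactly the single-shot error.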
The single-shot fallacy
Most AI tracking tools query an engine once and report the result. They sell themselves on speed and price: a check that takes ten seconds and one credit. The trade-off they don't disclose is that the data point they produce is about as reliable as a single weather observation. Useful as one input. Meaningless on its own.
The result is a category of tools producing reports nobody can act on. Brand A "ranked #3 in ChatGPT this week, dropped to #7 next week" — except the brand didn't change anything. The model just sampled differently. The marketing team spends a meeting interpreting noise as signal, and the next week it inverts again.
What actually matters isn't whether AI mentioned your brand in a given run. It's how often AI mentions your brand across runs of the same query. That ratio is a stability score, and it's the closest analog to "ranking position" that AI search has.
What stability scores actually tell you
A few patterns I see often when running multi-pass checks on real queries.
5/5 mentions. AI structurally recommends this brand. Whatever combination of training data, retrieval signals, and response patterns drives the model's output, this brand consistently makes the cut. That's the AI-search equivalent of holding a top-3 Google position — real, durable, defensible. When you find a competitor at 5/5 in your category, that's the threat you build against.
3/5 mentions. Conditional. The brand makes it in when phrasing or context lines up, drops out when it doesn't. Worth understanding which two runs missed it — usually there's a pattern in the prompt language, the geography, or the framing that explains it. 3/5 brands are often one content investment away from being 5/5, or one careless brand decision away from being 1/5.
1/5 mentions. Statistical noise. The model surfaced this brand once, probably because of a particular sampling path. Not a real recommendation. Tools that report only single-shot results would call this brand "ranked" — and they'd be wrong. It got lucky once, and a different sample would have produced a different answer.
0/5 mentions. Structurally invisible. The model never names this brand across multiple runs for the dominant phrasing of the query. Could be a brand that's too new, too small, too SEO-thin, or simply outside the model's training distribution. The fix is signals over time, not a quick prompt tweak.
The stability layer is what separates AI tracking that produces noise from AI tracking that produces decisions. Without it, you're reporting on coin flips and presenting them as rankings.
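The four patterns above can be sketched as a score plus a tier label. The article only describes 5/5, 3/5, 1/5 and 0/5, so the cut-offs used here for in-between values such as 4/5 and 2/5 are an assumption, as are the tier names and the placeholder brand names.

```python
def stability_score(runs, brand):
    """Fraction of runs in which `brand` is named at least once."""
    if not runs:
        raise ValueError("need at least one run")
    hits = sum(1 for run in runs if brand in run)
    return hits / len(runs)

def stability_tier(score):
    """Map a score to a tier. Thresholds are illustrative assumptions."""
    if score == 1.0:
        return "structural"   # e.g. 5/5
    if score >= 0.5:
        return "conditional"  # e.g. 3/5 or 4/5
    if score > 0.0:
        return "noise"        # e.g. 1/5 or 2/5
    return "invisible"        # 0/5

# Hypothetical results of five runs of the same query.
runs = [
    ["Alpha", "Beta", "Gamma"],
    ["Alpha", "Gamma", "Delta"],
    ["Alpha", "Beta", "Epsilon"],
    ["Alpha", "Gamma", "Beta"],
    ["Alpha", "Zeta", "Gamma"],
]

for brand in ["Alpha", "Beta", "Delta", "Omega"]:
    score = stability_score(runs, brand)
    print(f"{brand}: {score:.0%} -> {stability_tier(score)}")
```

Where exactly to draw the line between "conditional" and "noise" is a judgment call. The point is that the label comes from the ratio across runs, never from a single answer.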
ChatGPT-only tools miss most of AI search
The other half of getting this wrong is treating ChatGPT as a stand-in for AI search overall. ChatGPT has the largest user base and the loudest brand recognition, so most tracking tools start there and don't move. The actual AI search landscape is more fragmented, and the engines behave differently enough that tracking only one gives you a distorted picture.
Perplexity is structurally different from ChatGPT. It's built around citations — every claim links to a specific source, with positions you can track precisely. For brands where AI-driven referral traffic matters (B2B, research-heavy, comparison-driven categories), Perplexity is often the more important platform, because Perplexity actually sends clicks to cited URLs. ChatGPT mostly doesn't, even when it cites.
Claude reflects a different signal entirely. Routed through the Anthropic API, it surfaces brands from training-data recognition rather than active web search. What Claude "knows" about a category at its training cutoff is what it recommends. Brands well-documented across the web when the model was trained show up reliably. Brands that emerged after, or have thin online presence, don't — regardless of current SEO performance. Tracking Claude tells you about brand awareness in the model's memory, which is orthogonal to web visibility and surprisingly stable over time.
Gemini is Google's AI, and it uses Google's own search infrastructure for retrieval. That makes it partly a sanity check on whether your AI presence aligns with your Google ranking. It's also the most prone to timeouts on complex queries, which is its own data point about how Google treats it internally.
Grok pulls from X (Twitter). Its ranking signals are shaped by what gets discussed on that platform, which makes it useful for consumer brands, lifestyle products, or anything tied to internet culture. For pure B2B SaaS with no social presence, less useful — but worth running occasionally just to confirm you're not invisible there.
Google AI Mode is the most situational of the six. It only triggers on informational queries — "how to", "what is", "guide to" — and won't surface for commercial intent at all. That makes it a pure content marketing visibility signal, not a bottom-of-funnel one.
Across the six engines, the same brand can have a 5/5 stability score on ChatGPT and 0/5 on Perplexity. That's a real, actionable finding — and ChatGPT-only tracking would never reveal it. The brand looks healthy on the dashboard while being invisible to half its potential AI traffic.
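A cross-engine comparison of this kind can be sketched as follows. The engine keys, brand names and run data are all hypothetical, and the rule for what counts as a "gap" (fully stable on at least one engine, never named on another) is one possible definition rather than a standard one.

```python
def per_engine_stability(results, brand):
    """Return {engine: fraction of that engine's runs that name `brand`}."""
    return {
        engine: sum(brand in run for run in runs) / len(runs)
        for engine, runs in results.items()
    }

def visibility_gaps(scores, strong=1.0):
    """Engines where the brand never appears, given it is fully stable elsewhere."""
    if not any(score >= strong for score in scores.values()):
        return []
    return sorted(engine for engine, score in scores.items() if score == 0.0)

# Hypothetical multi-pass results: five runs per engine for one query.
results = {
    "chatgpt": [["Acme", "Globex"]] * 5,
    "perplexity": [["Globex", "Initech"]] * 5,
    "claude": [["Acme", "Initech"]] * 3 + [["Globex", "Initech"]] * 2,
}

scores = per_engine_stability(results, "Acme")
print(scores)
print("gaps:", visibility_gaps(scores))
```

A ChatGPT-only view of this data would stop at the first row and report a healthy brand.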
Position-weighted share of voice, not mention counts
Counting mentions is necessary but not enough. Whether your brand was named first or seventh in an AI's response affects how much of that response is actually about you. AI assistants typically present recommendations as ordered lists, and the first few names get most of the user's attention. Position-weighted share of voice scales each mention by where it appeared, with the weight decaying at each lower position, and divides by the total weight of all mentions. The result is a percentage that reflects how much of an AI's answer is genuinely centered on your brand versus mentioning it in passing.
Two brands both get mentioned in 5 of 5 runs across 10 queries. Brand A is consistently first or second. Brand B is consistently sixth or seventh. By naive count, both have 100% mention rate. By position-weighted share, Brand A might come out at 35%, Brand B at 8%. The second number is the more honest one. Brand B is technically present in the answers — but most readers will never reach the line where its name appears.
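Here is a minimal sketch of that comparison. The article does not specify the decay function, so the 1/position weighting below is an assumption chosen for simplicity, and the brand lists are hypothetical. With a different decay curve the percentages change, so this will not reproduce the 35% and 8% figures exactly; it only shows the same direction of effect.

```python
from collections import defaultdict

def position_weight(position):
    """Assumed decay: first place weighs 1, second 1/2, third 1/3, and so on."""
    return 1.0 / position

def share_of_voice(runs):
    """Position-weighted share per brand across all runs; shares sum to 1."""
    totals = defaultdict(float)
    for run in runs:
        for position, brand in enumerate(run, start=1):
            totals[brand] += position_weight(position)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {brand: weight / grand_total for brand, weight in totals.items()}

# Hypothetical data: five runs, each naming seven brands in the same order.
# BrandA is always first and BrandB always seventh, so both have a 100% mention rate.
one_run = ["BrandA", "C1", "C2", "C3", "C4", "C5", "BrandB"]
runs = [list(one_run) for _ in range(5)]

shares = share_of_voice(runs)
print(f"BrandA: {shares['BrandA']:.1%}")
print(f"BrandB: {shares['BrandB']:.1%}")
```

The naive mention rate cannot tell these two brands apart, while the weighted share separates them clearly.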
One Claude run for one query in a real estate category surfaced 13 different domains. Even at that depth, the top three names (raywhite.co at 15%, century21bali.com at 13%, coldwellbanker.co at 12%) captured 40% of the weighted attention. The long tail was technically visible, but its share was a rounding error from the customer's perspective.
What this tool deliberately doesn't do
A few things we left out, on purpose.
No query quotas. Tracking AI visibility is project-stage work for most teams — heavy when benchmarking competitors or auditing a campaign, light the rest of the time. Charging by monthly query cap penalizes the actual usage pattern. We charge per use, which lines up with how teams run these checks in practice.
No single-shot default. The 1x option exists for quick exploration, but the recommended default is 3x, and 5x is the right setting for any finding you're going to brief into a strategy doc. Anyone who runs only 1x and reports the result as "your AI ranking" is reporting noise. Stability score has to be in every output, not buried in an upgrade tier.
No query sharing with AI vendors. When your team queries ChatGPT or Claude directly to see how the brand shows up, those queries route through personal accounts. Over time, the tracking targets — the brands you're auditing, the categories you're researching, the competitive intelligence you're building — accumulate in someone's chat history at OpenAI or Anthropic. Our queries route through dedicated infrastructure instead. Your competitive research stays your competitive research.
No ChatGPT-only plans. All six engines on every plan. The pricing pattern of "ChatGPT included, others on Pro tier" exists because tracking ChatGPT is cheaper for the vendor — not because ChatGPT is more valuable than the others. Splitting access is a margin choice. We made the opposite one, because the engines tell different stories, and getting only one of them isn't really tracking AI visibility. It's tracking ChatGPT visibility while pretending.
What to do in the next hour
Pick three buyer-intent queries from your category. Not keywords — actual questions a customer would type into ChatGPT. "Best [your category] for [your customer profile]". "How to choose between [you] and [main competitor]". "What's the leading [your category] tool for [use case]".
Open ChatGPT in a fresh chat. Run the first query. Note which brands are mentioned and in what positions. Open another fresh chat (this matters, because reusing the same chat lets the earlier answer bias the next one) and run the same query again. Note again. Do it a third time.
If your brand showed up the same way in all three runs, you have stable AI presence in this query. If it appeared once and disappeared twice, you have noise that other tools have probably been reporting as a ranking. If it didn't appear at all, you have a content and signal gap to work on, and no amount of prompt rephrasing will fix it.
Now do the same exercise on Perplexity. Note how the cited URLs differ — Perplexity will show you the exact pages it's referencing. That's where AI traffic is actually coming from in your category, and the gap between Perplexity's citations and your own ranking pages is a content brief waiting to be written.
The hard part of AI visibility isn't running the queries. The hard part is accepting that your brand's representation in AI search is a distribution, not a position — and that any tool reporting it as a single number is hiding more than it's showing.