Measurement guide

How to Track Your Brand's AI Search Visibility

There is no rank tracker for ChatGPT the way there was for Google. That does not mean AI visibility is unmeasurable; it means you need a different method: a repeatable manual check, a read on your analytics, and a way to catch mentions across the open web. Here is how to do it.

10 min read · Updated 2026

In this guide

Why there is no single AI rank tracker
A manual spot-check routine you can actually keep up
Reading your analytics for AI-driven traffic
Brand-mention monitoring as a leading indicator
Turning what you learn into action

Why there is no single AI rank tracker

With Google, you could pay for a tool, type in a keyword, and get a position number back. Rank three today, rank five tomorrow. That number was stable enough, and the algorithm consistent enough across users, that tracking it meant something. People built entire businesses on that stability.

AI answers do not work that way, and it is worth being clear-eyed about why, because it changes what a reasonable measurement practice looks like. First, there is no single model. ChatGPT, Claude, Perplexity, and Gemini are different systems built by different companies, trained on different data, updated on different schedules. Being mentioned favorably in one tells you very little about the others. Second, there is no stable "position" inside an answer. A model might name you first, might name you third, might not name you at all on a rerun of the identical prompt five minutes later, because these systems are non-deterministic by design — the same input can produce a different output depending on sampling, live retrieval results at that moment, and model updates you were never notified about. Third, most of these products do not expose a ranking to check in the first place. There is no results page with ten slots. There is one paragraph of prose, and you are either woven into it or you are not.

The honest conclusion is that anyone selling you a single "AI rank" number is selling you false precision. What you can measure is directional: how often you show up across a consistent set of realistic questions, how accurately you are described when you do, and whether that is trending up or down over time. That is a different kind of measurement than a keyword rank — closer to a survey than a scoreboard — but it is still measurement, and it is still useful if you run it consistently.

A manual spot-check routine you can actually keep up

Because there is no automated ranking feed to check, the baseline method is manual, and it works fine as long as you do it the same way every time. The point is not cleverness. The point is consistency, so that changes you observe are actually changes in the answers and not changes in how you asked.

Write a fixed list of prompts and do not improvise on the day. Aim for eight to fifteen prompts that mirror how a real buyer would ask, not how you would ask as the founder. Mix categories: broad discovery ("what's the best tool for X"), comparison ("X vs Y vs Z"), problem-first ("how do I solve Y without doing Z manually"), and direct ("what do you know about [your product]"). Save the exact wording in a doc — you will reuse it verbatim every cycle.
Run the same prompts across ChatGPT, Claude, Perplexity, and Gemini. Use fresh or logged-out sessions where the product allows it, since personalization and chat history can skew what gets surfaced. If a model asks a clarifying question before answering, answer it the same way each time so the comparison stays apples to apples.
Log the raw answer, not just your impression of it. Copy the full response text into a spreadsheet or doc. You want the actual wording later, not a memory of "I think it went okay."
Record whether you were mentioned at all. A binary yes or no, per model, per prompt. This is the coarsest and most important signal — everything else is detail on top of it.
Record where in the answer you appeared. Named first, named partway through a list, mentioned only in passing, or buried in a "some other options include" afterthought. Position within the answer is the closest proxy you have to the old idea of ranking, even though it is far looser.
Record accuracy. Is what the model says about you true? Wrong pricing, a discontinued feature, and category mismatches are all common, and they are worth flagging separately from whether you were mentioned at all. Being mentioned inaccurately is sometimes worse than not being mentioned.
Record who else got named. The competitor set a model reaches for tells you how it currently categorizes you, and it will shift over time as the market and the model's sources shift.
Repeat on a fixed schedule. Monthly is enough for most early-stage teams; weekly if you are actively pushing content out to influence this and want faster feedback. What matters is that the interval is regular, not tight.

Keep every run in the same spreadsheet, one row per prompt per model per date, so you can look back after two or three cycles and see actual movement instead of guessing from memory. This is tedious to do by hand every cycle, which is one of the reasons a tool like Wally is useful here — it can run a structured set of prompts across models on a schedule, keep the log consistent, and hand you the trend instead of you copy-pasting into a spreadsheet every month.
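If you would rather keep the log in a script than in a spreadsheet, the row layout described above can be sketched roughly as follows. This is a minimal illustration, not a prescribed schema: the `SpotCheck` field names, the position labels, and the `mention_rate_by_model` helper are all hypothetical and only mirror the things this guide suggests recording.

```python
import csv
import io
from collections import defaultdict
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class SpotCheck:
    """One row per prompt, per model, per run date (hypothetical field names)."""
    run_date: str              # ISO date of the run, e.g. "2026-01-05"
    model: str                 # e.g. "ChatGPT", "Claude"
    prompt_id: str             # key into your fixed prompt list
    mentioned: bool            # the coarse yes/no signal
    position: str              # "first", "list", "passing", "afterthought", or ""
    accurate: Optional[bool]   # None when you were not mentioned
    competitors: str           # semicolon-separated names that appeared
    raw_answer: str            # full response text, copied verbatim


def mention_rate_by_model(rows):
    """Return {(run_date, model): share of prompts where you were mentioned}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        key = (row.run_date, row.model)
        totals[key] += 1
        hits[key] += int(row.mentioned)
    return {key: hits[key] / totals[key] for key in totals}


def to_csv(rows):
    """Serialize rows to CSV text so the log still opens in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(SpotCheck)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()
```

Comparing the output of `mention_rate_by_model` between two run dates gives you the directional trend per model. It is a rate across your prompt list, not a rank, which matches the survey-style measurement described earlier.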

Reading your analytics for AI-driven traffic

Spot checks tell you what the models say. Your own analytics tell you whether any of that is translating into people actually landing on your site — a signal spot checks cannot give you on their own.

Start with referrer traffic. Most AI products that link out will show up as a referrer domain in your analytics tool, distinct from organic search and direct traffic. Look for referrers such as chatgpt.com (chat.openai.com in older data), perplexity.ai, gemini.google.com, and similar domains for the assistants your buyers are likely to use. This traffic is usually small relative to organic search today, but it is worth isolating as its own segment so you can watch the trend instead of dismissing it as noise because the absolute numbers are low.
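If your analytics tool lets you export raw referrer URLs, isolating the AI segment can be sketched like this. The domain map is an assumption rather than an official list; confirm every entry, claude.ai in particular, against your own referrer report. The `classify_referrer` function and its labels are hypothetical names used for illustration.

```python
from urllib.parse import urlparse

# Assumed mapping of referrer domains to assistants; adjust to what you actually see.
AI_REFERRER_DOMAINS = {
    "chat.openai.com": "ChatGPT",
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "claude.ai": "Claude",
}


def classify_referrer(referrer_url):
    """Return "direct", "ai:<assistant>", or "other" for one referrer value."""
    if not referrer_url:
        return "direct"  # no referrer recorded at all
    if "//" not in referrer_url:
        referrer_url = "//" + referrer_url  # let urlparse treat a bare host as a host
    host = (urlparse(referrer_url).hostname or "").lower()
    for domain, assistant in AI_REFERRER_DOMAINS.items():
        # Match the domain itself and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return f"ai:{assistant}"
    return "other"
```

Counting sessions per label over time gives you the separate AI segment, however small it is in absolute terms.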

A few practical notes on getting this right:

Segment AI referrers separately from generic "referral" traffic — most analytics tools lump all non-search, non-direct traffic together by default, which buries the signal you actually want.
Watch direct traffic too, not just referrals. Someone who reads a recommendation inside a chat window will often open a new tab and type your domain from memory rather than clicking a link, which analytics tools will record as direct traffic with no attribution to the AI answer that actually drove it. A subtle uptick in direct traffic alongside better spot-check results is a real, if imprecise, signal.
Apply UTM parameters to any links you control that might get pulled into an AI answer: your own comparison pages, your changelog, guest posts, anywhere a model might quote a URL back to a user. This will not catch everything, since you cannot control how models format or preserve links, but it attributes the clicks where the link survives intact, and cheap attribution beats none.
Check landing pages, not just referrer source. If AI-driven visitors consistently land on a specific comparison page or pricing page, that tells you which piece of content the models are actually pulling from or pointing to, which is useful for deciding what to reinforce.
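For the UTM note above, a small helper keeps tagging consistent across the links you control. This is a sketch under assumptions: the `add_utm` name and the default `medium` and `campaign` values are placeholders, so substitute whatever naming your analytics setup already uses.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def add_utm(url, source, medium="referral", campaign="ai-visibility"):
    """Append UTM parameters to a URL without overwriting ones already present."""
    parts = urlparse(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    existing = {key for key, _ in pairs}
    for key, value in (
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ):
        if key not in existing:
            pairs.append((key, value))
    # Rebuild the URL; path and fragment are left untouched.
    return urlunparse(parts._replace(query=urlencode(pairs)))
```

For example, `add_utm("https://example.com/compare", "guest-post")` tags a link placed in a guest post so that any click that preserves the query string is attributed to that placement.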

None of this gives you a clean, complete picture, because a large share of AI-influenced decisions never produce a trackable click at all — someone reads an answer, forms an impression, and later just signs up with no referrer information attached. Treat analytics as a lower bound on AI influence, not the whole story.

Brand-mention monitoring as a leading indicator

The spot-check method tells you what models say right now. Mention monitoring across the open web tells you what is likely to shape what they say next, since these models draw heavily on the same public conversations you can search yourself: forum threads, comparison posts, review sites, Q&A pages, roundups.

Set up recurring searches — through a mention-monitoring tool if you have one, or manually through search and site-specific queries if you do not — for your product name, close variants and common misspellings, and your core category terms paired with "alternative," "vs," or "best." Run these on a regular cadence and note where you show up, where competitors show up instead of you, and where a relevant thread exists with no mention of you at all. That last category is often the most actionable: a live "best tools for X" thread that never named you is a concrete, fixable gap, not an abstract visibility problem.
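If you run these searches by hand, generating the query list once and reusing it keeps each pass comparable, in the same way the fixed prompt list does for spot checks. The sketch below is illustrative only: `build_mention_queries` is a hypothetical helper, and the `site:` prefix assumes a search engine that supports that operator.

```python
from itertools import product


def build_mention_queries(names, category_terms,
                          modifiers=("alternative", "vs", "best"),
                          sites=()):
    """Build a fixed list of search queries for recurring mention checks."""
    # Exact-match queries for the product name, variants, and misspellings.
    queries = [f'"{name}"' for name in names]
    # Category terms paired with each modifier; "best" reads naturally in front.
    for term, modifier in product(category_terms, modifiers):
        if modifier == "best":
            queries.append(f"best {term}")
        else:
            queries.append(f"{term} {modifier}")
    # Optional site-specific versions of every query.
    scoped = [f"site:{site} {query}" for site in sites for query in queries]
    return queries + scoped
```

Calling `build_mention_queries(["Acme", "Acmee"], ["invoicing tool"], sites=["reddit.com"])` with placeholder names yields ten queries: five general ones and the same five restricted to one site.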

Treat this as a leading indicator rather than a trailing one. A wave of new, accurate, third-party mentions this month is unlikely to shift a model's answer tomorrow, but it is the kind of signal that tends to show up in your spot checks a few cycles later, once that content gets indexed and, for retrieval-augmented systems, actually surfaced during live lookups. Watching both together — mentions now, model answers later — is how you build a sense of the lag between publishing and being cited.

Turning what you learn into action

Measurement only pays off if it changes what you do next week. The most common failure mode is treating this as a dashboard to admire rather than a punch list to work through, so once you have a couple of cycles of data, sort what you found into a short priority order.

Start with the prompts where you were not mentioned at all but a close competitor was. That is the clearest gap: the model has formed an opinion about who belongs in that answer, and it is not you yet. Look at what the competitor has that you likely do not — comparison content, forum presence, review volume — and treat that as the work item, not a vague resolution to "improve AI visibility." Next, fix inaccuracies you found during spot checks. If a model is confidently telling people you do not have a feature you actually shipped, or describes your pricing wrong, that is often traceable to outdated or missing content that gives the model nothing current to correct itself with. Finally, work the mention gaps you found in the open-web pass — the live threads and comparison pages where you have a legitimate case to be included but currently are not.
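The priority order above can be sketched as a simple sort over what you logged. Everything here is an assumption made for illustration: the dictionary keys, the `build_punch_list` name, and the idea that you maintain your own list of close competitors and open-web gap URLs.

```python
def build_punch_list(checks, open_web_gaps, close_competitors):
    """Order findings: competitor-only answers, then inaccuracies, then web gaps."""
    close = set(close_competitors)
    items = []
    for check in checks:
        rivals = sorted(close & set(check["competitors"]))
        if not check["mentioned"] and rivals:
            # Priority 1: a close competitor was named and you were not.
            items.append((1, f"Missing on {check['model']} / {check['prompt_id']}; "
                             f"named instead: {', '.join(rivals)}"))
        elif check["mentioned"] and check["accurate"] is False:
            # Priority 2: you were named, but described incorrectly.
            items.append((2, f"Inaccurate on {check['model']} / {check['prompt_id']}"))
    for url in open_web_gaps:
        # Priority 3: relevant threads or pages that never mention you.
        items.append((3, f"No mention in thread: {url}"))
    # sorted() is stable, so items keep their logged order within each priority.
    return [text for _, text in sorted(items, key=lambda item: item[0])]
```

The output is a flat, ordered list of work items, which is easier to act on than a dashboard of rates.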

This is also where it is worth being honest about capacity. Running spot checks, reading analytics, monitoring mentions, and then actually producing the replies, comparison pages, and outreach that close the gaps is a lot of recurring work for a small team to sustain by hand every month. This is the loop Wally is built to help carry: it can help identify where the gaps are, draft the replies and posts for the specific threads and channels where you are missing, and queue everything for your approval, so the measurement actually turns into shipped work instead of a spreadsheet nobody revisits.