Technical guide

Technical SEO for AI Crawlers

Being mentioned in an AI answer depends on a model being able to read your site in the first place, and that is a lower-level problem than most GEO advice admits. This guide covers the actual mechanics: which crawlers matter, how to configure robots.txt for them, why rendering choices can make your content invisible, and how to check the logs to see who is really showing up.

12 min read · Updated 2026

In this guide

Which AI crawlers actually matter
Configuring robots.txt for AI crawlers
Rendering and JavaScript: why it matters more than you think
Crawl budget, sitemaps, and internal linking
How to verify crawlers are actually reaching your site

Which AI crawlers actually matter

Before you can configure anything, you need to know which bots are worth caring about. There is a small set of AI-related crawlers that show up repeatedly in server logs and robots.txt discussions, each tied to a different company and a different purpose. Treat the specifics below as a snapshot, not gospel — user-agent strings, IP ranges, and stated behaviors change as providers update their crawling infrastructure, so verify anything load-bearing against the provider's current published documentation before you act on it.

GPTBot

OpenAI's crawler used to gather content for training future models. It is distinct from the crawler that fetches pages in response to a live ChatGPT query. If you block GPTBot, you are opting out of having your content used in training data, not necessarily out of being cited in a live answer.

ChatGPT-User

A separate OpenAI user agent that fires when ChatGPT actively browses the web or fetches a page to answer a specific user question, including through plugins and browsing features. This is the one that matters most for showing up in a real-time answer, since it reflects on-demand retrieval rather than bulk training collection. Blocking it can mean the model simply cannot read your page when someone asks about you in the moment.

ClaudeBot

Anthropic's crawler, used for gathering training data and, depending on current product behavior, for retrieval tied to Claude's web search and browsing capabilities. As with OpenAI, it is worth checking whether training-data collection and live retrieval are governed by the same or different user agents at any given time, since that distinction affects what blocking actually accomplishes.

PerplexityBot

Perplexity is built around live retrieval by design — its whole product is answering questions by reading the current web — so this crawler tends to be more directly tied to visibility in actual answers than a pure training crawler would be. Perplexity has also published guidance distinguishing its training-oriented crawling from its answer-time fetching, so check their current documentation for the exact user agents in play.

Google-Extended

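A control token in Google's robots.txt system that lets you opt out of having your content used to train and ground Google's Gemini models, separate from ordinary Googlebot indexing for search. It is not a crawler with its own user-agent string: Google fetches pages with its existing crawlers and treats the token purely as a usage control. Per Google's documentation at the time of writing, the token does not affect inclusion or ranking in Search, and Search features such as AI Overviews fall under the regular Search controls rather than this token, so check the current documentation before relying on it for either purpose. This matters because you may want your pages indexed and ranked normally while still deciding separately whether they feed Google's generative products.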

CCBot / Common Crawl

Common Crawl is a nonprofit that runs a general-purpose web crawl and publishes the resulting archive publicly. It is not an AI company itself, but its dataset is a widely used ingredient in training large language models across the industry, which makes CCBot relevant even though it predates the current wave of AI products.

There are other, smaller crawlers tied to specific AI products, and the list keeps growing as more companies ship AI search features. The pattern to internalize, not the exact roster, is what matters: some crawlers exist to build training data, some exist to answer a live question right now, and a single company may run more than one with different rules for each.
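One way to keep that pattern in view is to record each crawler's purpose next to its token. The sketch below simply encodes the roster as this guide describes it; the `Purpose` categories and the `tokens_for` helper are illustrative names, and the classifications are a snapshot that should be checked against each provider's current documentation.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    TRAINING = "training data collection"
    LIVE_RETRIEVAL = "on-demand fetch for a live answer"
    MIXED = "training and possibly retrieval"
    CONTROL_TOKEN = "robots.txt opt-out token"
    OPEN_ARCHIVE = "public crawl archive reused by others"


@dataclass(frozen=True)
class Crawler:
    token: str
    operator: str
    purpose: Purpose


# Snapshot of the roster as described in this guide, not an authoritative list.
CRAWLERS = [
    Crawler("GPTBot", "OpenAI", Purpose.TRAINING),
    Crawler("ChatGPT-User", "OpenAI", Purpose.LIVE_RETRIEVAL),
    Crawler("ClaudeBot", "Anthropic", Purpose.MIXED),
    Crawler("PerplexityBot", "Perplexity", Purpose.LIVE_RETRIEVAL),
    Crawler("Google-Extended", "Google", Purpose.CONTROL_TOKEN),
    Crawler("CCBot", "Common Crawl", Purpose.OPEN_ARCHIVE),
]


def tokens_for(purpose: Purpose) -> list[str]:
    """Return the robots.txt tokens recorded under a given purpose."""
    return [c.token for c in CRAWLERS if c.purpose is purpose]
```

Keeping the purpose explicit makes later decisions easier to review: a rule that blocks everything under `Purpose.TRAINING` reads very differently from one that blocks `Purpose.LIVE_RETRIEVAL`.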

Configuring robots.txt for AI crawlers

robots.txt is a plain text file at the root of your domain, and AI crawlers generally check it before requesting pages, the same way traditional search engine crawlers do. It is a request, not an enforcement mechanism — a well-behaved crawler honors it, but nothing stops a scraper that ignores it entirely, so treat robots.txt as a policy statement to good actors, not a security control.

The syntax is simple. Each block starts with a User-agent line naming the crawler it applies to, followed by Disallow or Allow lines pointing at paths. A block that disallows the whole site for a specific bot looks like this as plain text: User-agent: GPTBot on one line, then Disallow: / on the next. To block several bots the same way, you either repeat the block for each user agent or stack several User-agent lines above one shared set of rules; robots.txt does not support comma-separated agents on a single line. To allow everything for a given crawler, you can either omit it entirely, since the common convention is that a crawler with no matching block and no catch-all User-agent: * block faces no restriction, or write an explicit block with Disallow left blank, meaning User-agent: PerplexityBot followed by Disallow: with nothing after it.
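The two blocks described above can be written out and checked with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the `is_allowed` helper and the `example.com` URLs are placeholders, and Python's parser is only one interpretation of the file, so a given crawler may handle edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The two blocks from the text: block GPTBot everywhere,
# and explicitly allow PerplexityBot with a blank Disallow.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "PerplexityBot", "ChatGPT-User"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/pricing"))
```

ChatGPT-User has no block in this file and there is no catch-all block, so the parser treats it as unrestricted, which matches the "no matching block means no restriction" convention.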

The real decision is not syntax, it is strategy, and it is a genuine tradeoff rather than an obvious call. Blocking AI crawlers protects your content from being used in training data or reproduced without attribution, which is a legitimate concern for publishers who rely on original content as their asset. But blocking also removes you from consideration in the exact channel this whole site is about — if a crawler cannot read your pages, a model has nothing of yours to cite, recommend, or corroborate when someone asks about your category. For a startup trying to build visibility in AI answers, blocking the crawlers that matter for live retrieval is usually self-defeating, even if blocking pure training crawlers is a reasonable position to take separately.

A common middle-ground approach: allow crawlers tied to live retrieval and citation, such as ChatGPT-User and PerplexityBot, since those directly affect whether you can be mentioned in an answer someone is reading right now, while making a deliberate, informed choice about GPTBot, ClaudeBot, and CCBot based on your own view of training-data use. There is no universally correct answer here, and providers periodically change how these crawlers are scoped, so revisit this file more often than you would a typical piece of technical configuration.
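If you revisit the file often, it can help to generate it from an explicit policy rather than hand-editing. The sketch below assumes a hypothetical `build_robots_txt` helper; which agents go in which list is your decision, and the split shown in the example call is only one possible choice, not a recommendation.

```python
def build_robots_txt(allow: list[str], block: list[str]) -> str:
    """Render one robots.txt group per agent.

    Allowed agents get a blank Disallow (no restriction);
    blocked agents get 'Disallow: /' (whole site).
    """
    groups = [f"User-agent: {agent}\nDisallow:" for agent in allow]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in block]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Example policy only: allow live-retrieval agents, block training-oriented ones.
    print(build_robots_txt(
        allow=["ChatGPT-User", "PerplexityBot"],
        block=["GPTBot", "ClaudeBot", "CCBot"],
    ))
```

Keeping the policy as two short lists makes the tradeoff visible in code review, and changing your position on a crawler becomes a one-line change.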

One more thing worth checking: some sites accidentally block AI crawlers through overly broad rules meant for something else, such as a catch-all Disallow under User-agent: * that was written to stop aggressive scrapers years ago and has been silently blocking legitimate crawlers ever since. Read your existing robots.txt literally, line by line, rather than assuming it does what you remember writing.
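A quick way to catch that kind of accidental block is to run your existing file through a parser and ask, agent by agent, what it actually permits. The sketch below uses the standard-library `urllib.robotparser`; the `LEGACY_ROBOTS_TXT` content and the `blocked_agents` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot"]

# Hypothetical legacy file: a catch-all rule written to stop scrapers,
# with an exception carved out only for Googlebot.
LEGACY_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(LEGACY_ROBOTS_TXT, "https://example.com/"))
```

With this file, every agent in the list falls through to the catch-all block and is disallowed, even though none of them is named anywhere in it.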

Rendering and JavaScript: why it matters more than you think

A browser and a crawler do not necessarily see the same page. When a human visits a JavaScript-heavy site, the browser downloads an initial HTML shell, then runs JavaScript that fetches data and builds the actual content in the DOM. Modern search engines like Google have invested heavily in rendering JavaScript before indexing, so this gap matters less than it used to for traditional search. AI crawlers are a different story, and a less forgiving one: many of them are built as lightweight fetchers that request a URL and read the raw HTML response, without executing JavaScript the way a browser or a full rendering-capable indexer would.

The practical consequence is that a page built with client-side rendering — where the initial HTML is mostly an empty shell and a JavaScript framework fills in the content after load — can look nearly blank to a crawler that does not execute scripts. The crawler fetches the page, gets a skeleton with no meaningful text, and moves on with nothing to show a model at answer time. This is not a hypothetical edge case; it is a direct, mechanical consequence of how single-page application frameworks work by default, and it affects any AI crawler that does not run a full headless browser as part of its fetch pipeline.

Server-side rendering or static site generation addresses this directly by producing complete HTML on the server, before the response ever reaches the crawler. Whether you render on each request or pre-build static pages at deploy time, the content exists in the HTML document itself rather than depending on script execution in the requesting client. This is not a new idea — it is the same rendering discipline good SEO has recommended for over a decade — but it has renewed importance now that a wider and less predictable set of crawlers is reading your site, several of which you cannot easily test against the way you can test against Googlebot.

If a full rewrite to server-side rendering is not realistic right now, the more targeted fix is to make sure the specific content you want cited — product descriptions, pricing, documentation, comparison pages — is present in the initial server response rather than injected client-side after the fact. You do not need to solve rendering for the entire site at once; you need the pages doing the work of representing you to AI systems to render fully without JavaScript.
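To approximate what a non-rendering fetcher sees, you can extract the visible text from the raw HTML response and check whether the phrases you care about are present. This is a rough sketch using only the standard library: the `VisibleText` and `missing_phrases` names are illustrative, fetching the page is left out (pass in the response body you get without running JavaScript), and real crawlers may extract text differently.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring content inside script/style/template."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def missing_phrases(html: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the raw HTML's visible text."""
    text = visible_text(html).lower()
    return [p for p in phrases if p.lower() not in text]


# Two made-up responses: a client-rendered shell and a server-rendered page.
CSR_SHELL = '<html><body><div id="root"></div><script>render("Pricing starts at $10")</script></body></html>'
SSR_PAGE = "<html><body><h1>Pricing</h1><p>Pricing starts at $10</p></body></html>"
```

Run against the shell, the check reports the pricing phrase as missing because it only exists inside a script; run against the server-rendered page, nothing is missing. That difference is the gap a non-rendering crawler experiences.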

Crawl budget, sitemaps, and internal linking

Crawl budget is the practical limit on how much of your site a given crawler will fetch in a given window, based on factors like your site's size, server response times, and the crawler's own resourcing decisions about how much of the web it can afford to visit. Most small sites never bump into this limit in a meaningful way — a startup with forty pages does not need a crawl-budget strategy. It becomes relevant once a site has thousands of pages, frequent content changes, or a lot of low-value machine-generated pages diluting the ones that matter.

An XML sitemap is the most direct lever you have. It is a structured list of URLs you consider worth crawling, and it gives a crawler a map instead of forcing it to discover everything purely by following links. This does not guarantee any particular bot fetches every URL in the sitemap, particularly for AI crawlers whose respect for and use of sitemaps is less standardized and less documented than it is for Google, but it costs little to maintain and removes any excuse for your important pages being undiscoverable.

Internal linking matters for a related but distinct reason: crawlers, like human visitors, tend to weight pages by how reachable and how referenced they are within the site's own structure. A page that is only reachable through a search box or buried four clicks deep with no incoming links from anywhere else on the site is effectively invisible to a crawler working through link discovery, sitemap or no sitemap. Keep the pages that carry your actual claims about what you do — your homepage, your key comparison or category pages, your documentation — well linked from other pages on the site, not just from the navigation bar. This is ordinary information architecture discipline, and it happens to be exactly what helps both traditional and AI crawlers find and prioritize the content that represents you.
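Reachability can be measured directly if you have a map of which pages link to which. The sketch below runs a breadth-first search from the homepage to get each page's click depth and lists pages that link discovery would never reach; the link graph is a made-up example, and building the real one (from a crawl of your own site or your CMS) is assumed to happen elsewhere.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], start: str) -> dict[str, int]:
    """Breadth-first search: minimum number of clicks from start to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def unreachable(links: dict[str, list[str]], start: str) -> list[str]:
    """Pages that exist in the graph but cannot be reached by following links."""
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(all_pages - set(click_depths(links, start)))


# Hypothetical site: /compare links out, but nothing links to it.
EXAMPLE_LINKS = {
    "/": ["/pricing", "/docs"],
    "/docs": ["/docs/api"],
    "/pricing": [],
    "/compare": ["/pricing"],
}
```

In this example `/compare` has outgoing links but no incoming ones, so it never appears in the depth map. Pages like that depend entirely on the sitemap or external links to be found.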

How to verify crawlers are actually reaching your site

Configuration is a guess until you check the evidence. Server logs are the ground truth for whether AI crawlers are actually visiting, and checking them is more accessible than it sounds even if you are not primarily a backend person.

Get access to your raw server logs or access logs. Depending on your hosting setup this might mean a log file on the server itself, a logging dashboard in your hosting provider's console, or a CDN's request-log export. If you are on a platform that abstracts this away entirely, look for its analytics or request-log feature before assuming logs are unavailable.
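Search the logs for known AI crawler user-agent strings: GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, and CCBot are the ones worth searching for first. Google-Extended will not appear, because it is a robots.txt control token rather than a user agent that makes requests; Google's fetches show up under its regular crawler user agents such as Googlebot. A simple text search or grep for each string across a recent window of logs is enough to start.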
Note which URLs each crawler is actually requesting, not just whether it showed up at all. A crawler that visits your homepage once a month but never touches your pricing or documentation pages is not doing much for you, regardless of whether it is technically "reaching your site."
Check response codes alongside the requests. A crawler hitting a page and getting a 404, a 500, or a redirect chain is not successfully reading your content even though it did make contact, and this is a common silent failure mode worth ruling out.
Cross-reference against your robots.txt rules to confirm the crawlers you intended to allow are the ones showing up, and the ones you intended to block are actually absent. A mismatch here usually means either a robots.txt mistake or a crawler that is not respecting the file, both of which are worth knowing about.
Repeat this check periodically rather than once. Crawler behavior, naming, and frequency change over time as providers update their systems, and a site redesign or CDN change can quietly break something that used to work.
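The search, URL, and status-code steps above can be sketched as a small script. It assumes your logs are in the common "combined" access-log format; other formats need a different pattern. The sample lines are synthetic, with documentation-range IPs and placeholder user-agent strings rather than the real ones. Google-Extended is left out of the token list on the assumption that it is a robots.txt token, not a request user agent. A user-agent match alone can be spoofed, so treat the result as a first pass rather than proof of identity.

```python
import re
from collections import Counter, defaultdict

# Combined log format: ip ident user [time] "method path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

AI_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot"]


def summarize(lines):
    """Count (path, status) pairs per AI crawler token found in the user agent."""
    summary = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        agent = match["agent"].lower()
        for token in AI_TOKENS:
            if token.lower() in agent:
                summary[token][(match["path"], int(match["status"]))] += 1
                break
    return {token: dict(counts) for token, counts in summary.items()}


# Synthetic sample lines for illustration only.
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Jan/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "ExampleFetcher (compatible; GPTBot)"',
    '192.0.2.11 - - [01/Jan/2026:10:05:00 +0000] "GET /docs HTTP/1.1" 404 312 "-" "ExampleFetcher (compatible; PerplexityBot)"',
    '192.0.2.12 - - [01/Jan/2026:10:06:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (ordinary browser)"',
]
```

Grouping by path and status in one pass answers two questions at once: which pages each crawler asks for, and whether it is actually getting content back or hitting errors.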

If manually parsing logs is more than you want to take on regularly, this is also the kind of ongoing monitoring and research work that a tool like Wally can help carry as part of a broader GEO effort. Checking whether your site is actually reachable is only useful as a repeated habit, not a one-time audit, and it easily falls off a founder's plate once the initial setup is done.