How ChatGPT, Claude, and Perplexity actually crawl your site
A technical breakdown of how each major AI search engine fetches, parses, and selects sources to cite, from request headers to retrieval pipelines.
If you understand how an AI search engine’s retrieval pipeline works, the AEO checklist stops feeling like cargo-cult tactics and starts feeling like obvious consequences of how the system is built.
This post walks through the actual fetch path for each of the three major AI search engines — ChatGPT (OpenAI), Claude (Anthropic), and Perplexity — and traces what they do with the response.
The four stages every AI engine has
Every modern generative search engine has the same skeleton, even though the implementations differ:
- Query understanding. The user’s question is transformed into one or more search queries.
- Retrieval. Some combination of (a) a pre-built search index, (b) live HTTP fetches, and (c) the model’s internal training data is consulted to generate candidate sources.
- Reading & ranking. The candidate pages are read, summarized, and scored for relevance to the original question.
- Synthesis & citation. The model writes the answer, picking 1–3 sources to cite explicitly.
Where AEO matters most is stages 2 and 3. Stage 2 determines whether your site appears in the candidate set; stage 3 determines whether it gets selected for citation.
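The four stages can be sketched as a toy pipeline. Everything here is a hypothetical stand-in (`search_index`, `fetch`, and the callables in `model` are illustrative, not any vendor's API), but the shape matches the skeleton described above:

```python
def run_pipeline(question, search_index, fetch, model):
    """Toy skeleton of the four stages. All callables are hypothetical
    stand-ins, not any vendor's actual API."""
    # 1. Query understanding: the question becomes one or more search queries
    queries = model["rewrite"](question)
    # 2. Retrieval: consult the index for candidate URLs
    candidates = [url for q in queries for url in search_index(q)]
    # 3. Reading & ranking: fetch each candidate and score it for relevance
    docs = [(url, fetch(url)) for url in candidates[:5]]
    ranked = sorted(docs, key=lambda d: model["score"](question, d[1]),
                    reverse=True)
    # 4. Synthesis & citation: write the answer, citing the top sources
    return model["synthesize"](question, ranked[:3])
```

Stage 2 is where your page either enters `candidates` or doesn't; stage 3 is where it either survives `ranked[:3]` or doesn't.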
ChatGPT (OpenAI)
OpenAI runs three distinct crawlers that intersect with AEO:
| Bot | When it runs | What it does |
|---|---|---|
| GPTBot | Periodic (training crawl) | Fetches pages to use as model training data. Cached and reused across model versions. |
| OAI-SearchBot | Continuous (search index) | Builds and refreshes the ChatGPT Search index. Equivalent to Googlebot for ChatGPT’s “search” surface. |
| ChatGPT-User | On-demand (per query) | Fetches a specific URL when ChatGPT decides to “browse” during an answer. |
The path during a real ChatGPT query that triggers browsing:
- User asks “What’s the weather in Charlotte right now?”
- The model determines this is a query that benefits from live data.
- ChatGPT issues a search query against its internal index (`OAI-SearchBot`’s output).
- The model picks 1–5 candidate URLs.
- `ChatGPT-User` fetches each URL with a normal HTTP GET.
- Each response is converted to clean text (HTML stripped, navigation removed, similar to Mozilla Readability).
- The text is fed to the model as context, along with the URL.
- The model writes an answer and emits inline citations linking to the URLs it used.
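A rough sketch of the cleanup step in that pipeline, under the assumption that it behaves like a Readability-style tag stripper (the real extractor is not public, so treat this as illustrative only):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style/nav chrome — a crude
    stand-in for the Readability-style cleanup described above."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str) -> str:
    """Return the cleaned-text view of a fetched response body."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)
```

Run `clean_html` on your own page source to see roughly what survives extraction: a JS-shell page like `<div id="root"></div>` comes back as an empty string.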
Three implications for AEO:
- `ChatGPT-User` is the live-fetch bot. If your `robots.txt` blocks `ChatGPT-User`, you can never appear in ChatGPT’s browse results, regardless of `GPTBot`.
- The index is `OAI-SearchBot`. If your `robots.txt` blocks `OAI-SearchBot`, you don’t appear in the candidate set at step 3.
- JS rendering is not part of the fetch path. The response body is processed as text. `<div id="root">` becomes the empty string after extraction.
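For reference, a minimal `robots.txt` that allowlists all three OpenAI crawlers (standard robots syntax; adjust the paths to your own site):

```
# OpenAI training crawl
User-agent: GPTBot
Allow: /

# ChatGPT Search index
User-agent: OAI-SearchBot
Allow: /

# Live fetch during browsing
User-agent: ChatGPT-User
Allow: /
```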
Claude (Anthropic)
Anthropic uses three bots in a similar pattern:
| Bot | Purpose |
|---|---|
| ClaudeBot | Training crawl. |
| Claude-SearchBot | Search index for Claude’s web tools. |
| Claude-User | Live fetch when Claude uses its web_search or web_fetch tools during a response. |
Claude’s pipeline differs from ChatGPT’s in one notable way: the web_fetch tool returns structured content with separate fields for title, headings, body, and structured data. The model receives:
- The URL
- HTTP status
- Page title
- Meta description
- Heading outline
- Cleaned body text
- Any extracted JSON-LD blocks
- Outbound links
This means JSON-LD is more directly useful in Claude than in ChatGPT — Claude’s tool surface gives the model first-class access to structured data, which it then uses for both citation choice and answer fidelity.
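To see what a fetch tool can pull out of your markup, here is a sketch of JSON-LD extraction. This is not Anthropic's implementation, just a plausible stand-in using the standard `<script type="application/ld+json">` convention:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Pulls JSON-LD blocks out of <script type="application/ld+json">
    tags — roughly the structured-data field a fetch tool could return."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is silently dropped

def extract_jsonld(html: str) -> list:
    p = JsonLdExtractor()
    p.feed(html)
    return p.blocks
```

If this returns an empty list for your key pages, there is no structured data for any fetch tool to surface.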
Perplexity
Perplexity is the most aggressive of the three when it comes to live fetching. Almost every query triggers a real-time search → fetch → cite pipeline.
| Bot | Purpose |
|---|---|
| PerplexityBot | Search index crawl. |
| Perplexity-User | Live fetch on user query. |
Perplexity’s `Perplexity-User` does some JavaScript rendering: it runs a partial render for some pages. But that render is heavily rate-limited (a budget of a few seconds per page) and unreliable. Designing for the JS-rendering path is risky; designing for the no-JS path always works.
Perplexity also publicly shows the sources it considered in its answer card, so its retrieval and ranking are observable in a way that ChatGPT’s aren’t. Useful for AEO debugging — run your site through a Perplexity query relevant to your content and see if it appears.
Google AI Overviews
Google AI Overviews is the AI-summary box that appears at the top of some Google search results. It uses:
- `Googlebot` (the same crawler as classic Google search)
- `Google-Extended`: a separate User-Agent that publishers can use to opt out of training and AI Overview synthesis without leaving Google search.
If you allow Googlebot and disallow Google-Extended, you appear in classic search but not in AI Overviews. This is the precise trade-off Google built for publishers who wanted to maintain SEO traffic while opting out of AI synthesis.
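That trade-off is expressed directly in `robots.txt`. A sketch of the opt-out configuration (classic search stays crawlable; AI synthesis is declined):

```
# Keep classic Google search
User-agent: Googlebot
Allow: /

# Opt out of training and AI Overview synthesis
User-agent: Google-Extended
Disallow: /
```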
The AI Overview retrieval path is different from ChatGPT/Claude/Perplexity in that it reuses the existing Google search index rather than fetching live. This means:
- AEO score for Google AI Overviews is largely your existing SEO score.
- The biggest AEO-specific lever for Overviews is not blocking `Google-Extended`.
- Beyond that, the same content patterns that boost Google rankings boost Overview citations.
What this all means for AEO
A few takeaways that fall out of the architecture:
- There is no universal “AI bot.” Each engine has its own crawlers. Allowlist all of them.
- Live-fetch bots matter more than training bots. The `*-User` and `*-SearchBot` agents are the ones that decide whether your site gets cited today.
- Raw HTML quality is the upper bound. Once your page is fetched, the cleaned-text view of your raw HTML is what the model sees. Make sure that view is good.
- JSON-LD bridges the gap. When the model gets typed structured data, it has higher confidence in your page’s content type and is more likely to cite. JSON-LD is your way of telling the model “here’s what this page is, in machine-readable form.”
- Front-loaded content wins because the context budget is small. None of these engines feed your entire page to the model. They feed the first ~1000–2000 tokens of cleaned text. If your direct answer isn’t in that window, you don’t get cited.
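The context-budget point is easy to simulate. The helper below approximates tokens as whitespace-split words (real tokenizers differ; the ~1500-token default is an illustrative assumption, not a published figure):

```python
def context_window(cleaned_text: str, budget_tokens: int = 1500) -> str:
    """Crude simulation of the per-page context budget: keep only the
    first N 'tokens', approximated here as whitespace-split words."""
    words = cleaned_text.split()
    return " ".join(words[:budget_tokens])
```

Feed your page's cleaned text through this and check: does the truncated window still contain the direct answer your page exists to give? If not, front-load it.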
A debugging recipe
If your site isn’t getting cited where you’d expect:
- Check `robots.txt`. Use the AEO Site Checker to verify all 12 AI bots are allowed.
- Check the WAF. `curl` your site with each bot’s User-Agent and look for `cf-mitigated` or `403`.
- Check the raw HTML. View source on your most important pages. The text content should be visible without JS.
- Search Perplexity. It’s the most observable of the AI engines. Search for the kind of question your page answers and see if you appear in its sources.
- Check ChatGPT Search. Use the search-engine surface inside ChatGPT to look for your content directly.
- Re-audit. Run our checker every few weeks; the bots and their behaviors evolve.
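The WAF check from the recipe can be scripted. A sketch: probe your own site while sending each bot's name in the User-Agent header (the tokens below are the names from the tables above; real bot UA strings are longer, but these substrings are what WAF rules typically match on):

```python
import urllib.request
import urllib.error

# Bot name tokens to impersonate when probing your own site.
BOT_UAS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
           "ClaudeBot", "Claude-SearchBot", "Claude-User",
           "PerplexityBot", "Perplexity-User"]

def probe(url: str, ua: str) -> int:
    """Fetch `url` pretending to be `ua`; return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. 403 from a WAF block

def blocked(statuses: dict) -> list:
    """Given {bot_name: status_code}, return the bots that were refused."""
    return [ua for ua, code in statuses.items() if code != 200]

# Usage: blocked({ua: probe("https://example.com/", ua) for ua in BOT_UAS})
```

Any bot appearing in the `blocked` list is a WAF or firewall problem to fix before worrying about content.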
Further reading
- robots.txt for AI crawlers
- Why AI crawlers don’t run JavaScript
- JSON-LD for AI search
- Cloudflare bot protection and AEO
Ready to score your site? Run an audit →