How ChatGPT, Claude, and Perplexity actually crawl your site
A technical breakdown of how each major AI search engine fetches, parses, and selects sources to cite, from request headers to retrieval pipelines.
If you understand how an AI search engine’s retrieval pipeline works, the AEO checklist stops feeling like cargo-cult tactics and starts feeling like obvious consequences of how the system is built.
This post walks through the actual fetch path for each of the three major AI search engines — ChatGPT (OpenAI), Claude (Anthropic), and Perplexity — and traces what they do with the response.
The four stages every AI engine has
Every modern generative search engine has the same skeleton, even though the implementations differ:
- Query understanding. The user’s question is transformed into one or more search queries.
- Retrieval. Some combination of (a) a pre-built search index, (b) live HTTP fetches, and (c) the model’s internal training data is consulted to generate candidate sources.
- Reading & ranking. The candidate pages are read, summarized, and scored for relevance to the original question.
- Synthesis & citation. The model writes the answer, picking 1–3 sources to cite explicitly.
Where AEO matters most is stages 2 and 3. Stage 2 determines whether your site appears in the candidate set; stage 3 determines whether it gets selected for citation.
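The four stages can be sketched as a toy pipeline. Everything here is a hypothetical stand-in (`search_index`, `fetch`, and the callables in `model` are illustrative, not any vendor's API), but the shape matches the skeleton described above:

```python
def run_pipeline(question, search_index, fetch, model):
    """Toy skeleton of the four stages. All callables are hypothetical
    stand-ins, not any vendor's actual API."""
    # 1. Query understanding: the question becomes one or more search queries
    queries = model["rewrite"](question)
    # 2. Retrieval: consult the index for candidate URLs
    candidates = [url for q in queries for url in search_index(q)]
    # 3. Reading & ranking: fetch each candidate and score it for relevance
    docs = [(url, fetch(url)) for url in candidates[:5]]
    ranked = sorted(docs, key=lambda d: model["score"](question, d[1]),
                    reverse=True)
    # 4. Synthesis & citation: write the answer, citing the top sources
    return model["synthesize"](question, ranked[:3])
```

Stage 2 is where your page either enters `candidates` or doesn't; stage 3 is where it either survives `ranked[:3]` or doesn't.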
ChatGPT (OpenAI)
OpenAI runs three distinct crawlers that intersect with AEO:
| Bot | When it runs | What it does |
|---|---|---|
| GPTBot | Periodic (training crawl) | Fetches pages to use as model training data. Cached and reused across model versions. |
| OAI-SearchBot | Continuous (search index) | Builds and refreshes the ChatGPT Search index. Equivalent to Googlebot for ChatGPT’s “search” surface. |
| ChatGPT-User | On-demand (per query) | Fetches a specific URL when ChatGPT decides to “browse” during an answer. |
The path during a real ChatGPT query that triggers browsing:
- User asks “What’s the weather in Charlotte right now?”
- The model determines this is a query that benefits from live data.
- ChatGPT issues a search query against its internal index (`OAI-SearchBot`’s output).
- The model picks 1–5 candidate URLs.
- `ChatGPT-User` fetches each URL with a normal HTTP GET.
- Each response is converted to clean text (HTML stripped, navigation removed, similar to Mozilla Readability).
- The text is fed to the model as context, along with the URL.
- The model writes an answer and emits inline citations linking to the URLs it used.
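A rough sketch of the cleanup step in that pipeline, under the assumption that it behaves like a Readability-style tag stripper (the real extractor is not public, so treat this as illustrative only):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style/nav chrome — a crude
    stand-in for the Readability-style cleanup described above."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str) -> str:
    """Return the cleaned-text view of a fetched response body."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)
```

Run `clean_html` on your own page source to see roughly what survives extraction: a JS-shell page like `<div id="root"></div>` comes back as an empty string.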
Three implications for AEO:
- `ChatGPT-User` is the live-fetch bot. If your `robots.txt` blocks `ChatGPT-User`, you can never appear in ChatGPT’s browse results, regardless of `GPTBot`.
- The index is `OAI-SearchBot`. If your `robots.txt` blocks `OAI-SearchBot`, you don’t appear in the candidate set at step 3.
- JS rendering is not part of the fetch path. The response body is processed as text. `<div id="root">` becomes the empty string after extraction.
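For reference, a minimal `robots.txt` that allowlists all three OpenAI crawlers (standard robots syntax; adjust the paths to your own site):

```
# OpenAI training crawl
User-agent: GPTBot
Allow: /

# ChatGPT Search index
User-agent: OAI-SearchBot
Allow: /

# Live fetch during browsing
User-agent: ChatGPT-User
Allow: /
```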
Claude (Anthropic)
Anthropic uses three bots in a similar pattern:
| Bot | Purpose |
|---|---|
| ClaudeBot | Training crawl. |
| Claude-SearchBot | Search index for Claude’s web tools. |
| Claude-User | Live fetch when Claude uses its web_search or web_fetch tools during a response. |
Claude’s pipeline differs from ChatGPT’s in one notable way: the web_fetch tool returns structured content with separate fields for title, headings, body, and structured data. The model receives:
- The URL
- HTTP status
- Page title
- Meta description
- Heading outline
- Cleaned body text
- Any extracted JSON-LD blocks
- Outbound links
This means JSON-LD is more directly useful in Claude than in ChatGPT — Claude’s tool surface gives the model first-class access to structured data, which it then uses for both citation choice and answer fidelity.
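To see what a fetch tool can pull out of your markup, here is a sketch of JSON-LD extraction. This is not Anthropic's implementation, just a plausible stand-in using the standard `<script type="application/ld+json">` convention:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Pulls JSON-LD blocks out of <script type="application/ld+json">
    tags — roughly the structured-data field a fetch tool could return."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is silently dropped

def extract_jsonld(html: str) -> list:
    p = JsonLdExtractor()
    p.feed(html)
    return p.blocks
```

If this returns an empty list for your key pages, there is no structured data for any fetch tool to surface.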
Perplexity
Perplexity is the most aggressive of the three when it comes to live fetching. Almost every query triggers a real-time search → fetch → cite pipeline.
| Bot | Purpose |
|---|---|
| PerplexityBot | Search index crawl. |
| Perplexity-User | Live fetch on user query. |
Perplexity’s `Perplexity-User` does some JavaScript rendering: it runs a partial render for some pages. But that render is heavily rate-limited (a budget of a few seconds per page) and unreliable. Designing for the JS-rendering path is risky; designing for the no-JS path always works.
Perplexity also publicly shows the sources it considered in its answer card, so its retrieval and ranking are observable in a way that ChatGPT’s aren’t. Useful for AEO debugging — run your site through a Perplexity query relevant to your content and see if it appears.
Google AI Overviews
Google AI Overviews is the AI-summary box that appears at the top of some Google search results. It uses:
- `Googlebot` (the same crawler as classic Google search)
- `Google-Extended`: a separate User-Agent that publishers can use to opt out of training and AI Overview synthesis without leaving Google search.
If you allow Googlebot and disallow Google-Extended, you appear in classic search but not in AI Overviews. This is the precise trade-off Google built for publishers who wanted to maintain SEO traffic while opting out of AI synthesis.
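That trade-off is expressed directly in `robots.txt`. A sketch of the opt-out configuration (classic search stays crawlable; AI synthesis is declined):

```
# Keep classic Google search
User-agent: Googlebot
Allow: /

# Opt out of training and AI Overview synthesis
User-agent: Google-Extended
Disallow: /
```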
The AI Overview retrieval path is different from ChatGPT/Claude/Perplexity in that it reuses the existing Google search index rather than fetching live. This means:
- AEO score for Google AI Overviews is largely your existing SEO score.
- The biggest AEO-specific lever for Overviews is not blocking `Google-Extended`.
- Beyond that, the same content patterns that boost Google rankings boost Overview citations.
What this all means for AEO
A few takeaways that fall out of the architecture:
- There is no universal “AI bot.” Each engine has its own crawlers. Allowlist all of them.
- Live-fetch bots matter more than training bots. The `*-User` and `*-SearchBot` agents are the ones that decide whether your site gets cited today.
- Raw HTML quality is the upper bound. Once your page is fetched, the cleaned-text view of your raw HTML is what the model sees. Make sure that view is good.
- JSON-LD bridges the gap. When the model gets typed structured data, it has higher confidence in your page’s content type and is more likely to cite. JSON-LD is your way of telling the model “here’s what this page is, in machine-readable form.”
- Front-loaded content wins because the context budget is small. None of these engines feed your entire page to the model. They feed the first ~1000–2000 tokens of cleaned text. If your direct answer isn’t in that window, you don’t get cited.
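The context-budget point is easy to simulate. The helper below approximates tokens as whitespace-split words (real tokenizers differ; the ~1500-token default is an illustrative assumption, not a published figure):

```python
def context_window(cleaned_text: str, budget_tokens: int = 1500) -> str:
    """Crude simulation of the per-page context budget: keep only the
    first N 'tokens', approximated here as whitespace-split words."""
    words = cleaned_text.split()
    return " ".join(words[:budget_tokens])
```

Feed your page's cleaned text through this and check: does the truncated window still contain the direct answer your page exists to give? If not, front-load it.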
A debugging recipe
If your site isn’t getting cited where you’d expect:
- Check `robots.txt`. Use the AEO Site Checker to verify all 12 AI bots are allowed.
- Check the WAF. `curl` your site with each bot’s User-Agent and look for `cf-mitigated` or `403`.
- Check the raw HTML. View source on your most important pages. The text content should be visible without JS.
- Search Perplexity. It’s the most observable of the AI engines. Search for the kind of question your page answers and see if you appear in its sources.
- Check ChatGPT Search. Use the search-engine surface inside ChatGPT to look for your content directly.
- Re-audit. Run our checker every few weeks; the bots and their behaviors evolve.
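The WAF check from the recipe can be scripted. A sketch: probe your own site while sending each bot's name in the User-Agent header (the tokens below are the names from the tables above; real bot UA strings are longer, but these substrings are what WAF rules typically match on):

```python
import urllib.request
import urllib.error

# Bot name tokens to impersonate when probing your own site.
BOT_UAS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
           "ClaudeBot", "Claude-SearchBot", "Claude-User",
           "PerplexityBot", "Perplexity-User"]

def probe(url: str, ua: str) -> int:
    """Fetch `url` pretending to be `ua`; return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. 403 from a WAF block

def blocked(statuses: dict) -> list:
    """Given {bot_name: status_code}, return the bots that were refused."""
    return [ua for ua, code in statuses.items() if code != 200]

# Usage: blocked({ua: probe("https://example.com/", ua) for ua in BOT_UAS})
```

Any bot appearing in the `blocked` list is a WAF or firewall problem to fix before worrying about content.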
Further reading
- robots.txt for AI crawlers
- Why AI crawlers don’t run JavaScript
- JSON-LD for AI search
- Cloudflare bot protection and AEO
Ready to score your site? Run an audit →