AEO Site Checker — full reference
Every check, weight, and method behind the 0–100 score. Site-type detection, scoring rubric, audit pipeline, API, and limitations — one page, no fluff.
What "AEO" means in this tool
When somebody asks ChatGPT (or Claude, Perplexity, or Google AI Overviews) for a recommendation, the response includes one to three cited sources. The model picks those citations from a different mix of signals than Google's PageRank-era SEO checklist:
- Crawler reachability. The model has to be able to fetch the page in the first place. Cloudflare bot challenges, 403 walls, and JS-only single-page apps are the most common reasons a site is invisible in AI answers despite ranking on Google.
- Permission to crawl. The AI bot has to be allowed in robots.txt. Many sites accidentally block OAI-SearchBot, Claude-User, or PerplexityBot while leaving Googlebot allowed.
- Parseable structure. Semantic HTML, JSON-LD, and Mozilla-Readability-friendly article structure are easier for an LLM to extract from.
- The /llms.txt convention. A small markdown file at the site root that gives an AI agent a curated index of important pages — like sitemap.xml but optimized for LLMs to read instead of crawlers to follow (example after this list).
- Content shape. Princeton's GEO paper measured what actually moves the needle in generative answers — front-loaded direct answers, statistics density (+41% citation lift), quotations (+28%), citations to authorities. Heuristic, but directionally correct.
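For context, a minimal /llms.txt in the llmstxt.org shape is plain markdown: an H1 site name, a one-line blockquote summary, then H2 sections of annotated links. The site and URLs below are invented placeholders.

```markdown
# Example Brokerage

> Boutique real-estate brokerage in Austin, TX: listings, neighborhood guides, and monthly market reports.

## Listings

- [Active listings](https://example.com/listings): every property currently for sale
- [Recently sold](https://example.com/sold): closed sales with final prices

## Guides

- [Neighborhood guides](https://example.com/guides): area-by-area buying guides

## Optional

- [About the team](https://example.com/about): agent bios and contact details
```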
The tool produces a single 0–100 score and a letter grade so you can track changes over time.
Site-type detection
Every audit auto-classifies the page from JSON-LD types first (highest confidence), then
visible signals — tel: links, <address> blocks, listing
keywords, article-style bylines — with a generic fallback. The detected type
is surfaced in the result and controls which type-conditional checks run.
| Type | Signals |
|---|---|
| local_business | LocalBusiness JSON-LD, or visible tel: link + address + hours |
| real_estate_listing | RealEstateListing JSON-LD, or "for sale / bedrooms / sqft" patterns |
| article | Article / NewsArticle / BlogPosting schema, or a byline + dateline |
| product | Product / SoftwareApplication JSON-LD or pricing copy |
| person | Person JSON-LD or "About me" patterns |
| generic | Anything else — most marketing sites land here |
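A rough sketch of that priority order, assuming the JSON-LD @type values have already been collected; the helper name and the regexes are illustrative, not the tool's real implementation.

```ts
type SiteType =
  | "local_business" | "real_estate_listing" | "article"
  | "product" | "person" | "generic";

function detectSiteType(jsonLdTypes: string[], html: string): SiteType {
  // 1. JSON-LD @type wins: it is the highest-confidence signal.
  if (jsonLdTypes.includes("LocalBusiness")) return "local_business";
  if (jsonLdTypes.includes("RealEstateListing")) return "real_estate_listing";
  if (jsonLdTypes.some(t => ["Article", "NewsArticle", "BlogPosting"].includes(t))) return "article";
  if (jsonLdTypes.some(t => ["Product", "SoftwareApplication"].includes(t))) return "product";
  if (jsonLdTypes.includes("Person")) return "person";

  // 2. Fall back to visible signals (regex heuristics over the server-rendered HTML).
  if (/href="tel:/i.test(html) && /<address/i.test(html)) return "local_business";
  if (/for sale|bedrooms|sq\.?\s?ft/i.test(html)) return "real_estate_listing";

  // 3. Everything else lands in generic.
  return "generic";
}
```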
Scoring rubric
27 always-on checks and 1 type-conditional check, summed and normalized to 0–100. The total possible weight varies by page type — the runner sums earned and possible points across whichever checks ran for this page and divides at the end.
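In code, that final step amounts to something like the sketch below; CheckResult is an assumed shape, not the tool's actual type.

```ts
interface CheckResult {
  id: string;
  earned: number;   // points this page scored on the check
  possible: number; // the check's weight; only present when the check ran
}

function scoreAudit(results: CheckResult[]): number {
  const earned = results.reduce((sum, c) => sum + c.earned, 0);
  const possible = results.reduce((sum, c) => sum + c.possible, 0);
  // The denominator varies by page type, so normalize at the very end.
  return Math.round((100 * earned) / possible);
}
```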
Fetchability — 43 points
The largest category, deliberately. If AI crawlers can't fetch your page in the first place, nothing else matters.
fetch_direct is worth 18 raw points — more than any other single check. If ChatGPT-User gets a 403 from your origin, none of the other checks can rescue the score.
| Check | Weight | What it measures |
|---|---|---|
| fetch_direct | 18 | Direct (no-proxy) fetch returns 2xx/3xx with no bot challenge. Falling back to BrightData costs 16 of 18 points. |
| https | 4 | URL is HTTPS. No mixed content, no expired cert. |
| page_size | 3 | Decompressed HTML is under 4 MB. Anything larger is silently truncated by most crawlers. |
| robots_ai_bots | 10 | robots.txt allows the seven critical AI bots: ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended. ~1.43 pts per critical bot allowed. |
| ssr_content | 8 | Server-rendered HTML contains at least 200 words of clean content per Mozilla Readability extraction. Fails for pure SPAs and content hidden behind cookie walls. |
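For reference, a robots.txt that passes robots_ai_bots by explicitly allowing all seven bots might look like the group below; the blanket Allow and the sitemap URL are placeholders, and any rule set that simply does not disallow these user agents also passes.

```
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
Allow: /

Sitemap: https://example.com/sitemap.xml
```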
Core SEO — 21 points
Classic SEO foundations. Inherited from the Google era; AI search still relies on them.
| Check | Weight | What it measures |
|---|---|---|
| title_tag | 4 | <title> between 25 and 65 characters. Too short = no signal; too long = truncated in citations. |
| meta_description | 4 | <meta name="description"> between 80 and 175 characters. |
| canonical_url | 3 | <link rel="canonical"> present and points to a valid HTTPS URL. |
| og_tags | 3 | OpenGraph: og:title, og:description, og:url, og:image, og:type. |
| twitter_card | 2 | At minimum twitter:card and twitter:title. |
| html_lang | 1 | <html lang="…"> is set. |
| sitemap | 4 | /sitemap.xml (or sitemap-index) returns 200 with at least one valid URL. |
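Put together, the tag-level checks in this table are looking for a <head> roughly like this; every value is a placeholder.

```html
<html lang="en">
<head>
  <title>Example Brokerage | Austin Real Estate</title>
  <meta name="description"
        content="Boutique Austin brokerage with current listings, neighborhood guides, and monthly market reports.">
  <link rel="canonical" href="https://example.com/">
  <meta property="og:title" content="Example Brokerage | Austin Real Estate">
  <meta property="og:description" content="Listings, guides, and market reports.">
  <meta property="og:url" content="https://example.com/">
  <meta property="og:image" content="https://example.com/og.png">
  <meta property="og:type" content="website">
  <meta name="twitter:card" content="summary_large_image">
  <meta name="twitter:title" content="Example Brokerage | Austin Real Estate">
</head>
```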
Semantic HTML — 13 points
Structure the page so the model can parse it like a human reads it.
| Check | Weight | What it measures |
|---|---|---|
| single_h1 | 3 | Exactly one <h1>. Multiple confuse entity extraction. |
| heading_hierarchy | 3 | No skipped levels — h1 → h2 → h3, never h1 → h3. |
| landmarks | 4 | At least 3 of <header>, <nav>, <main>, <article>, <aside>, <footer>. |
| alt_text_coverage | 3 | At least 80% of <img> tags have non-empty alt. |
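As an illustration, a body skeleton that passes all four checks:

```html
<body>
  <header>Site banner</header>
  <nav>Primary navigation</nav>
  <main>
    <article>
      <h1>One page-level heading</h1>
      <h2>A subsection</h2>
      <h3>A nested point (no skipped levels)</h3>
      <img src="team.jpg" alt="The four agents on the Example Brokerage team">
    </article>
  </main>
  <footer>Contact and legal</footer>
</body>
```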
Answer Engine signals — 26 points
The category most specific to AEO. This is where AEO diverges from classic SEO.
| Check | Weight | What it measures |
|---|---|---|
| llms_txt | 5 | /llms.txt exists and parses as the llmstxt.org format. |
| llms_full_txt | 1 | /llms-full.txt exists. Bonus point. |
| json_ld_types | 6 | JSON-LD with recognized @type. Bonus weight for FAQPage, Article, Organization, LocalBusiness, Person, Product, RealEstateListing. |
| author_provenance | 3 | Article schema has an author object with at least one sameAs link to a recognizable profile (LinkedIn, GitHub, ORCID, Twitter). |
| date_modified | 2 | dateModified is present and within the last 12 months. |
| readability_extraction | 5 | Mozilla Readability extracts a non-empty article with > 100 words and confidence above 50. |
| site_breadth | 4 | Sitemap depth + presence of a /blog or /articles path + at least one URL with a 2024+ lastmod. |
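A single Article JSON-LD block can cover json_ld_types, author_provenance, and date_modified at once; the names, dates, and URLs below are placeholders.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How Austin property taxes are calculated",
  "datePublished": "2024-09-02",
  "dateModified": "2025-01-15",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "sameAs": [
      "https://www.linkedin.com/in/janedoe",
      "https://github.com/janedoe"
    ]
  }
}
</script>
```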
Content quality — 6 points
Encodes findings from the Princeton GEO study (Aggarwal et al., 2024) on what measurably lifts citation rate.
| Check | Weight | What it measures |
|---|---|---|
| front_loaded_answer | 2 | The first 200 words contain a direct, declarative answer-shaped sentence. |
| question_headings | 1 | At least one heading is question-shaped ("What is X?", "How do I Y?"). |
| statistics_density | 2 | At least 3 statistics or numeric facts. GEO study: +41% citation lift. |
| quotations | 1 | At least one <blockquote> or named quotation. +28% citation lift. |
Type-conditional checks
Some checks only count for certain page types. They contribute to both earned and possible points only when they apply, so a SaaS landing page is neither penalized nor rewarded for them.
| Check | Weight | Applies to |
|---|---|---|
| aeo_contact_signals | 5 | local_business, real_estate_listing, person, generic-with-business-signals |
The 27 always-on checks sum to 109 possible points, so the denominator is 114 when the contact check applies and 109 when it doesn't (SaaS, article, and plain-generic pages). Either way the score is normalized to 0–100:
score = round(100 * sum(check.earned) / sum(check.possible))
Letter grades
| Grade | Range | What it means |
|---|---|---|
| A | 90–100 | Site is correctly set up for AI search. AI engines can fetch, parse, and cite it. |
| B | 80–89 | Solid AEO. Maybe one or two checks failing — usually a missing schema or a small content fix. |
| C | 70–79 | Mostly there but missing a meaningful signal. Common pattern: good SEO, no llms.txt, no JSON-LD. |
| D | 60–69 | At least one critical failure. Often a JS-only render, a partial bot block, or absent structured data. |
| F | <60 | Page is invisible or near-invisible to AI search. Usually a Cloudflare challenge, SPA shell, or Disallow: / against AI bots. |
In practice, a well-built marketing site with no AEO work scores around C+ (70–78).
A site that does the basics — llms.txt, JSON-LD, sitemap, no bot blocks —
scores B+ (85–90). An A is achievable with a single afternoon of focused work on
most sites.
How an audit runs
- smartFetch(url). undici GET with a desktop-Chrome User-Agent, follows up to 5 redirects, decompresses gzip/brotli/deflate manually (undici's request() doesn't auto-decompress), reads up to 4 MB of body. A sketch follows this list.
- Bot-block detection. See the next section.
- Fallback. If the direct fetch fails or is blocked, retry through BrightData Web Unlocker. Sites that needed the unlocker lose most of the fetch_direct credit.
- Site-type detection. Classify the page from JSON-LD types and visible signals. The detected type controls which type-conditional checks run.
- Run all checks in parallel. robots.txt, llms.txt, llms-full.txt, and sitemap.xml are fetched concurrently. The HTML is parsed once with cheerio and once with jsdom (for Readability). The sitemap is followed one level deep if it's a sitemap index.
- Score. Total earned ÷ total weight, normalized to 0–100, letter grade applied.
- Persist. Every audit is saved to SQLite with a cuid, so results have a shareable permalink at /audit/:id.
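A condensed sketch of that first step, assuming undici and node:zlib; redirect and error handling are simplified, and the User-Agent string is a placeholder rather than the tool's real one.

```ts
import { request } from "undici";
import { gunzipSync, brotliDecompressSync, inflateSync } from "node:zlib";

const MAX_HTML_BYTES = 4 * 1024 * 1024; // mirror the 4 MB cap from page_size

async function smartFetch(url: string) {
  const res = await request(url, {
    maxRedirections: 5, // follow up to 5 redirects
    headers: {
      "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
      "accept-encoding": "gzip, br, deflate",
    },
  });

  // undici's request() does not decompress; do it by hand based on Content-Encoding.
  const raw = Buffer.from(await res.body.arrayBuffer());
  const encoding = String(res.headers["content-encoding"] ?? "");
  const decoded =
    encoding.includes("gzip") ? gunzipSync(raw) :
    encoding.includes("br") ? brotliDecompressSync(raw) :
    encoding.includes("deflate") ? inflateSync(raw) :
    raw;

  return {
    statusCode: res.statusCode,
    headers: res.headers,
    html: decoded.subarray(0, MAX_HTML_BYTES).toString("utf8"),
  };
}
```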
Bot-block detection
The auditor flags a fetch as blocked if any of these match:
- Response header cf-mitigated is present
- HTTP 403 with Server: cloudflare
- HTTP 503 or 429 with a cf-ray header
- Body matches a known challenge signature: Just a moment..., /cdn-cgi/challenge-platform/, __cf_chl_*, cf-browser-verification, Akamai Reference, _pxCaptcha, You have been blocked, etc.
- Tiny body from a Cloudflare-fronted host (fallback heuristic)
If anything matches, the fetcher falls back to BrightData Web Unlocker and the audit is recorded as fallback mode rather than direct.
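A sketch of that predicate; the signature list is abridged and the tiny-body threshold is an assumption, not the tool's real value.

```ts
const CHALLENGE_SIGNATURES = [
  "Just a moment...",
  "/cdn-cgi/challenge-platform/",
  "__cf_chl_",
  "cf-browser-verification",
  "Akamai Reference",
  "_pxCaptcha",
  "You have been blocked",
];

function looksBlocked(status: number, headers: Record<string, string>, body: string): boolean {
  if (headers["cf-mitigated"]) return true;
  if (status === 403 && headers["server"]?.toLowerCase() === "cloudflare") return true;
  if ((status === 503 || status === 429) && headers["cf-ray"]) return true;
  if (CHALLENGE_SIGNATURES.some(sig => body.includes(sig))) return true;
  // Fallback heuristic: a suspiciously tiny body from a Cloudflare-fronted host.
  if (headers["cf-ray"] && body.length < 1024) return true;
  return false;
}
```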
API
The hosted instance is open and unauthenticated. There is no rate limiting at the application layer.
| Method | Path | Body / params | Returns |
|---|---|---|---|
| GET | /api/health | — | { status: "ok" } |
| POST | /api/audits | { "url": "https://…" } | Full AuditResult with permalink id |
| GET | /api/audits/:id | — | A previously-saved audit |
| GET | /api/audits/recent | — | Last 25 audits |
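For example, kicking off an audit from a script; the hostname and the response field names are assumptions rather than part of the documented contract.

```ts
const res = await fetch("https://aeo-checker.example.com/api/audits", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({ url: "https://example.com" }),
});

const audit = await res.json();
// audit.id is the permalink id; fetch the saved result later via GET /api/audits/:id
console.log(audit.id, audit.score, audit.grade); // field names assumed
```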
Limitations and non-goals
- No JavaScript rendering. The auditor reads server-rendered HTML only. Sites that need JS to populate content fail ssr_content — intentional, because LLM crawlers also don't run JS reliably. If you need a Lighthouse-style headless-browser audit, this isn't that tool.
- Single URL per audit. No site-wide crawl, no <a> link following.
- Heuristic content checks. "Front-loaded answer" and "question-shaped heading" are regex heuristics, not language understanding. Useful directional signals, not absolute truth.
- No historical comparisons. Saved audits are independent rows; no diff view.
- No authentication. Public, no rate limiting beyond what nginx and undici provide.