Whitepaper · v2.0 · Single-page reference

AEO Site Checker — full reference

Every check, weight, and method behind the 0–100 score. Site-type detection, scoring rubric, audit pipeline, API, and limitations — one page, no fluff.

What "AEO" means in this tool

When somebody asks ChatGPT (or Claude, Perplexity, or Google AI Overviews) for a recommendation, the response includes one to three cited sources. The model picks those citations from a different mix of signals than Google's PageRank-era SEO checklist:

  • Crawler reachability. The model has to be able to fetch the page in the first place. Cloudflare bot challenges, 403 walls, and JS-only single-page apps are the most common reasons a site is invisible in AI answers despite ranking on Google.
  • Permission to crawl. The AI bot has to be allowed in robots.txt. Many sites accidentally block OAI-SearchBot, Claude-User, or PerplexityBot while leaving Googlebot allowed.
  • Parseable structure. Semantic HTML, JSON-LD, and Mozilla-Readability-friendly article structure are easier for an LLM to extract from.
  • The /llms.txt convention. A small markdown file at the site root that gives an AI agent a curated index of important pages — like sitemap.xml but optimized for LLMs to read instead of crawlers to follow.
  • Content shape. Princeton's GEO paper measured what actually moves the needle in generative answers — front-loaded direct answers, statistics density (+41% citation lift), quotations (+28%), citations to authorities. Heuristic, but directionally correct.
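The /llms.txt file itself is plain markdown: an H1 title, a blockquote summary, then H2 sections of annotated links, per the llmstxt.org spec. A minimal hypothetical example; the site name, paths, and descriptions here are invented:

```markdown
# Acme Widgets

> Acme makes industrial widget-measurement tools. Start with the docs below.

## Docs

- [Getting started](https://acme.example/docs/start): install and first run
- [Pricing](https://acme.example/pricing): plans and limits

## Optional

- [Changelog](https://acme.example/changelog)
```

The "Optional" section marks links an agent can skip when context is tight.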

The tool produces a single 0–100 score and a letter grade so you can track changes over time.

Site-type detection

Every audit auto-classifies the page from JSON-LD types first (highest confidence), then visible signals — tel: links, <address> blocks, listing keywords, article-style bylines — with a generic fallback. The detected type is surfaced in the result and controls which type-conditional checks run.

| Type | Signals |
|---|---|
| local_business | LocalBusiness JSON-LD, or visible tel: link + address + hours |
| real_estate_listing | RealEstateListing JSON-LD, or "for sale / bedrooms / sqft" patterns |
| article | Article / NewsArticle / BlogPosting schema, or a byline + dateline |
| product | Product / SoftwareApplication JSON-LD or pricing copy |
| person | Person JSON-LD or "About me" patterns |
| generic | Anything else — most marketing sites land here |
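The detection order can be sketched as a small classifier: JSON-LD types win, then visible signals, then the generic fallback. Everything below is illustrative; the real signal list is richer than these few regexes:

```typescript
// Illustrative sketch of the site-type detection order described above.
type SiteType =
  | "local_business" | "real_estate_listing" | "article"
  | "product" | "person" | "generic";

// JSON-LD @type values are the highest-confidence signal.
const JSONLD_MAP: Record<string, SiteType> = {
  LocalBusiness: "local_business",
  RealEstateListing: "real_estate_listing",
  Article: "article",
  NewsArticle: "article",
  BlogPosting: "article",
  Product: "product",
  SoftwareApplication: "product",
  Person: "person",
};

function detectSiteType(jsonLdTypes: string[], html: string): SiteType {
  // 1. JSON-LD first.
  for (const t of jsonLdTypes) {
    if (JSONLD_MAP[t]) return JSONLD_MAP[t];
  }
  // 2. Visible-signal heuristics, roughly in the order the table lists them.
  if (/href="tel:/.test(html) && /<address[\s>]/.test(html)) return "local_business";
  if (/for sale|bedrooms|sqft/i.test(html)) return "real_estate_listing";
  if (/class="byline"|rel="author"/.test(html)) return "article";
  // 3. Everything else.
  return "generic";
}
```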

Scoring rubric

27 always-on checks and 1 type-conditional check, summed and normalized to 0–100. The total possible weight is not a fixed 100: the runner sums earned and possible points across whichever checks ran for this page and divides at the end.

Fetchability — 43 points

The largest category, deliberately. If AI crawlers can't fetch your page in the first place, nothing else matters.

fetch_direct alone carries 18 raw points, more weight than any other single check. If ChatGPT-User gets a 403 from your origin, none of the other 26 checks matter.

| Check | Weight | What it measures |
|---|---|---|
| fetch_direct | 18 | Direct (no-proxy) fetch returns 2xx/3xx with no bot challenge. Falling back to BrightData costs 16 of 18 points. |
| https | 4 | URL is HTTPS. No mixed content, no expired cert. |
| page_size | 3 | Decompressed HTML is under 4 MB. Anything larger is silently truncated by most crawlers. |
| robots_ai_bots | 10 | robots.txt allows the seven critical AI bots: ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended. ~1.43 pts per critical bot allowed. |
| ssr_content | 8 | Server-rendered HTML contains at least 200 words of clean content per Mozilla Readability extraction. Fails for pure SPAs and content hidden behind cookie walls. |
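The robots_ai_bots scoring can be sketched as below, assuming the common `User-agent:` plus `Disallow: /` pattern. A real robots.txt matcher also handles Allow precedence and path prefixes, which this simplification skips:

```typescript
// Sketch: each of the seven critical AI user agents earns an equal share
// of the 10 points unless its group (or the "*" group) disallows the root.
const CRITICAL_BOTS = [
  "ChatGPT-User", "OAI-SearchBot", "Claude-User", "Claude-SearchBot",
  "PerplexityBot", "Perplexity-User", "Google-Extended",
];

function allowedBots(robotsTxt: string): string[] {
  // Parse robots.txt into user-agent groups mapped to their Disallow values.
  const groups = new Map<string, string[]>();
  let agents: string[] = [];
  let lastWasAgent = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.replace(/#.*/, "").trim();
    const m = line.match(/^(user-agent|disallow):\s*(.*)$/i);
    if (!m) continue;
    const [, field, value] = m;
    if (field.toLowerCase() === "user-agent") {
      if (!lastWasAgent) agents = []; // a new group starts
      agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      for (const a of agents) {
        const rules = groups.get(a) ?? [];
        rules.push(value);
        groups.set(a, rules);
      }
    }
  }
  return CRITICAL_BOTS.filter((bot) => {
    const rules = groups.get(bot.toLowerCase()) ?? groups.get("*") ?? [];
    return !rules.includes("/"); // blocked only by a root-wide Disallow
  });
}

// 10 points split evenly across the seven bots (~1.43 each).
const robotsScore = (txt: string) => (10 * allowedBots(txt).length) / CRITICAL_BOTS.length;
```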

Core SEO — 21 points

Classic SEO foundations. Inherited from the Google era; AI search still relies on them.

| Check | Weight | What it measures |
|---|---|---|
| title_tag | 4 | <title> between 25 and 65 characters. Too short = no signal; too long = truncated in citations. |
| meta_description | 4 | <meta name="description"> between 80 and 175 characters. |
| canonical_url | 3 | <link rel="canonical"> present and points to a valid HTTPS URL. |
| og_tags | 3 | OpenGraph: og:title, og:description, og:url, og:image, og:type. |
| twitter_card | 2 | At minimum twitter:card and twitter:title. |
| html_lang | 1 | <html lang="…"> is set. |
| sitemap | 4 | /sitemap.xml (or sitemap-index) returns 200 with at least one valid URL. |
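The two length checks at the top of the table can be approximated as below. The regex extraction stands in for the auditor's cheerio parse and is illustrative only (it assumes a simple attribute order for the meta tag):

```typescript
// title_tag: <title> length must fall in [25, 65] to earn its 4 points.
function titleTagScore(html: string): number {
  const m = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  const len = m ? m[1].trim().length : 0;
  return len >= 25 && len <= 65 ? 4 : 0;
}

// meta_description: content length must fall in [80, 175] for 4 points.
function metaDescriptionScore(html: string): number {
  const m = html.match(/<meta\s+name="description"\s+content="([^"]*)"/i);
  const len = m ? m[1].trim().length : 0;
  return len >= 80 && len <= 175 ? 4 : 0;
}
```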

Semantic HTML — 13 points

Structure the page so the model can parse it like a human reads it.

| Check | Weight | What it measures |
|---|---|---|
| single_h1 | 3 | Exactly one <h1>. Multiple confuse entity extraction. |
| heading_hierarchy | 3 | No skipped levels: h1 → h2 → h3, never h1 → h3. |
| landmarks | 4 | At least 3 of <header>, <nav>, <main>, <article>, <aside>, <footer>. |
| alt_text_coverage | 3 | At least 80% of <img> tags have non-empty alt. |
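The heading_hierarchy rule (no level more than one deeper than its predecessor) is easy to express directly. A regex-based sketch, not the auditor's actual DOM walk:

```typescript
// Walk headings in document order; fail if any heading skips a level
// relative to the one before it (h1 followed by h3, for example).
function headingLevelsOk(html: string): boolean {
  const levels = [...html.matchAll(/<h([1-6])\b/gi)].map((m) => Number(m[1]));
  let prev = 0; // forces the first heading to be an h1
  for (const level of levels) {
    if (level > prev + 1) return false; // skipped a level
    prev = level;
  }
  return true;
}
```

Moving back up (h3 to h2) is allowed; only downward jumps of more than one level fail.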

Answer Engine signals — 26 points

The category most specific to AEO. This is where AEO diverges from classic SEO.

| Check | Weight | What it measures |
|---|---|---|
| llms_txt | 5 | /llms.txt exists and parses as the llmstxt.org format. |
| llms_full_txt | 1 | /llms-full.txt exists. Bonus point. |
| json_ld_types | 6 | JSON-LD with recognized @type. Bonus weight for FAQPage, Article, Organization, LocalBusiness, Person, Product, RealEstateListing. |
| author_provenance | 3 | Article schema has an author object with at least one sameAs link to a recognizable profile (LinkedIn, GitHub, ORCID, Twitter). |
| date_modified | 2 | dateModified is present and within the last 12 months. |
| readability_extraction | 5 | Mozilla Readability extracts a non-empty article with > 100 words and confidence above 50. |
| site_breadth | 4 | Sitemap depth + presence of a /blog or /articles path + at least one URL with a 2024+ lastmod. |
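The date_modified freshness window can be sketched as follows; the 365-day cutoff is an assumption about how "within the last 12 months" is implemented:

```typescript
// date_modified: 2 points if dateModified parses and is recent.
// The 365-day window is an assumed interpretation of "last 12 months".
function dateModifiedScore(dateModified: string | undefined, now = new Date()): number {
  if (!dateModified) return 0;
  const d = new Date(dateModified);
  if (Number.isNaN(d.getTime())) return 0;
  const twelveMonthsMs = 365 * 24 * 60 * 60 * 1000;
  return now.getTime() - d.getTime() <= twelveMonthsMs ? 2 : 0;
}
```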

Content quality — 6 points

Encodes findings from the Princeton GEO study (Aggarwal et al., 2024) on what measurably lifts citation rate.

| Check | Weight | What it measures |
|---|---|---|
| front_loaded_answer | 2 | The first 200 words contain a direct, declarative answer-shaped sentence. |
| question_headings | 1 | At least one heading is question-shaped ("What is X?", "How do I Y?"). |
| statistics_density | 2 | At least 3 statistics or numeric facts. GEO study: +41% citation lift. |
| quotations | 1 | At least one <blockquote> or named quotation. +28% citation lift. |
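statistics_density is a regex heuristic. One plausible version is sketched below; the patterns the tool actually uses are not specified here, so treat these as invented stand-ins:

```typescript
// Count numeric facts: percentages, magnitudes, multipliers, and
// comma-grouped large numbers. Patterns are illustrative only.
function countStats(text: string): number {
  const matches = text.match(
    /\b\d+(\.\d+)?\s*(%|percent|million|billion|x)|\b\d{1,3}(,\d{3})+\b/gi
  );
  return matches ? matches.length : 0;
}

// statistics_density: 2 points when at least 3 numeric facts appear.
const statisticsDensityScore = (text: string) => (countStats(text) >= 3 ? 2 : 0);
```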

Type-conditional checks

Some checks only count for certain page types. They contribute to both earned and possible points only when they apply, so a SaaS landing page is neither penalized nor rewarded for them.

| Check | Weight | Applies to |
|---|---|---|
| aeo_contact_signals | 5 | local_business, real_estate_listing, person, generic-with-business-signals |

The always-on checks sum to 109 possible points. When the contact check applies the denominator is 114; for pages where it doesn't (SaaS, article, plain generic) it stays at 109. Either way the score is normalized to 0–100:

score = round(100 * sum(check.earned) / sum(check.possible))
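The normalization in runnable form, a direct transcription of the formula above:

```typescript
// One earned/possible pair per check that ran for this page.
interface CheckResult { earned: number; possible: number; }

function normalizedScore(results: CheckResult[]): number {
  const earned = results.reduce((s, r) => s + r.earned, 0);
  const possible = results.reduce((s, r) => s + r.possible, 0);
  return Math.round((100 * earned) / possible);
}
```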

Letter grades

| Grade | Range | What it means |
|---|---|---|
| A | 90–100 | Site is correctly set up for AI search. AI engines can fetch, parse, and cite it. |
| B | 80–89 | Solid AEO. Maybe one or two checks failing — usually a missing schema or a small content fix. |
| C | 70–79 | Mostly there but missing a meaningful signal. Common pattern: good SEO, no llms.txt, no JSON-LD. |
| D | 60–69 | At least one critical failure. Often a JS-only render, a partial bot block, or absent structured data. |
| F | <60 | Page is invisible or near-invisible to AI search. Usually a Cloudflare challenge, SPA shell, or Disallow: / against AI bots. |
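The grade bands map to a simple threshold function:

```typescript
// Letter grade from the normalized 0-100 score, per the bands above.
function letterGrade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}
```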

In practice, a well-built marketing site with no AEO work scores around C+ (70–78). A site that does the basics — llms.txt, JSON-LD, sitemap, no bot blocks — scores B+ (85–90). An A is achievable with a single afternoon of focused work on most sites.

How an audit runs

  1. smartFetch(url). undici GET with a desktop-Chrome User-Agent that follows up to 5 redirects, decompresses gzip/brotli/deflate manually (undici's request() doesn't auto-decompress), and reads up to 4 MB of body.
  2. Bot-block detection. See the next section.
  3. Fallback. If direct fetch fails or is blocked, retry through BrightData Web Unlocker. Sites that needed the unlocker lose the major fetch_direct credit.
  4. Site-type detection. Classify the page from JSON-LD types and visible signals. The detected type controls which type-conditional checks run.
  5. Run all checks in parallel. robots.txt, llms.txt, llms-full.txt, sitemap.xml fetched concurrently. HTML parsed once with cheerio, once with jsdom (for Readability). The sitemap is followed one level deep if it's a sitemap-index.
  6. Score. Total earned ÷ total weight, normalized to 0–100, letter grade applied.
  7. Persist. Every audit is saved to SQLite with a cuid, so results have a shareable permalink at /audit/:id.

Bot-block detection

The auditor flags a fetch as blocked if any of these match:

  • Response header cf-mitigated is present
  • HTTP 403 with Server: cloudflare
  • HTTP 503 or 429 with a cf-ray header
  • Body matches a known challenge signature: Just a moment..., /cdn-cgi/challenge-platform/, __cf_chl_*, cf-browser-verification, Akamai Reference, _pxCaptcha, You have been blocked, etc.
  • Tiny body from a Cloudflare-fronted host (fallback heuristic)

If anything matches, the fetcher falls back to BrightData Web Unlocker and the audit is recorded as fallback mode rather than direct.
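The rules above can be sketched as a predicate checked in order. The signature list below is abbreviated from the bullets, and the 512-byte threshold for the tiny-body heuristic is an assumption:

```typescript
// Known challenge-page signatures (abbreviated from the list above).
const CHALLENGE_SIGNATURES = [
  "Just a moment...", "/cdn-cgi/challenge-platform/", "__cf_chl_",
  "cf-browser-verification", "_pxCaptcha", "You have been blocked",
];

// Header names are lowercased, as undici returns them.
function isBlocked(status: number, headers: Record<string, string>, body: string): boolean {
  if ("cf-mitigated" in headers) return true;
  if (status === 403 && headers["server"]?.toLowerCase() === "cloudflare") return true;
  if ((status === 503 || status === 429) && "cf-ray" in headers) return true;
  if (CHALLENGE_SIGNATURES.some((sig) => body.includes(sig))) return true;
  // Fallback heuristic: tiny body from a Cloudflare-fronted host.
  if (body.length < 512 && "cf-ray" in headers) return true;
  return false;
}
```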

API

The hosted instance is open and unauthenticated. There is no rate limiting at the application layer.

| Method | Path | Body / params | Returns |
|---|---|---|---|
| GET | /api/health | (none) | { status: "ok" } |
| POST | /api/audits | { "url": "https://…" } | Full AuditResult with permalink id |
| GET | /api/audits/:id | (none) | A previously-saved audit |
| GET | /api/audits/recent | (none) | Last 25 audits |

Limitations and non-goals

  • No JavaScript rendering. The auditor reads server-rendered HTML only. Sites that need JS to populate content fail ssr_content — intentional, because LLM crawlers also don't run JS reliably. If you need a Lighthouse-style headless-browser audit, this isn't that tool.
  • Single URL per audit. No site-wide crawl, no <a> link following.
  • Heuristic content checks. "Front-loaded answer" and "question-shaped heading" are regex heuristics, not language understanding. Useful directional signals, not absolute truth.
  • No historical comparisons. Saved audits are independent rows; no diff view.
  • No authentication. Public, no rate limiting beyond what nginx and undici provide.

References