If language models don't cite your brand, it's not because you don't exist. It's because they aren't reading you well. This guide walks through the three paths through which an LLM cites a brand, the minimum schema your site needs, how to build a useful llms.txt, and which content patterns increase citation probability.
The three paths through which an LLM cites a brand
Before optimizing, you need to understand how a model can read you. There are three paths and each requires different levers.
1. Real-time web search
Perplexity, ChatGPT with browsing, Claude with web search, and Gemini with Search Grounding query Google or Bing, read the top URLs, and assemble the answer. If your site ranks well for that query, the model reads you. Classic SEO matters here, along with Core Web Vitals and server response speed.
2. Training data
Models like GPT, Claude or Gemini learn from curated datasets: Common Crawl, Wikipedia, GitHub, academic papers, editorial blogs. If your brand is represented there, the model knows you ahead of time. The training cycle is slow — 6 to 18 months between cut-offs — but the effect is durable.
3. Tools and enterprise RAG
In B2B implementations, the model connects to a controlled knowledge base via MCP (Model Context Protocol) or RAG. Public SEO doesn't matter here — what matters is exposing your documentation in a way consumable by an agent.
The minimum schema your site needs
An LLM reads entities. If your site doesn't declare them explicitly, the model has to infer them — and inference fails more often than you'd think. These are the pieces worth getting right:
Organization
One declaration per site, placed on the home page and reusable via @id on other pages. It tells the model what you are.
{
"@context": "https://schema.org",
"@type": "Organization",
"@id": "https://yourbrand.com/#organization",
"name": "Your Brand",
"url": "https://yourbrand.com",
"logo": "https://yourbrand.com/logo.png",
"description": "A clear sentence of what you are and who you serve.",
"knowsAbout": ["category 1", "category 2", "category 3"],
"sameAs": [
"https://www.linkedin.com/company/yourbrand",
"https://github.com/yourbrand",
"https://en.wikipedia.org/wiki/Your_Brand"
]
}
The sameAs field is underrated. It connects your site to other nodes the model already knows (Wikipedia, LinkedIn, GitHub) and reinforces the entity.
Service
If you sell services, declare them with canonical types. The model reads them when someone asks “who does X”.
{
"@context": "https://schema.org",
"@type": "Service",
"serviceType": "SEO consulting",
"provider": { "@id": "https://yourbrand.com/#organization" },
"areaServed": ["United States", "Canada", "United Kingdom"],
"description": "What you deliver, in one sentence.",
"hasOfferCatalog": {
"@type": "OfferCatalog",
"name": "Services",
"itemListElement": [
{ "@type": "Offer", "itemOffered": { "@type": "Service", "name": "Audit" } },
{ "@type": "Offer", "itemOffered": { "@type": "Service", "name": "Implementation" } }
]
}
}
FAQPage
Each key page should have an FAQ block with schema. LLMs love FAQs because they're already in question-answer format — exactly what they need to cite.
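Following the same pattern as the Organization block above, a minimal FAQPage sketch (the question and answer text are placeholders to replace with your own):

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What does Your Brand do?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A brief, direct answer the model can lift verbatim."
      }
    }
  ]
}
```

Each entry in mainEntity is one visible question-answer pair on the page; the schema should mirror what the reader actually sees.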
DefinedTerm
If your brand owns a category worth defining (e.g. “agentic team OS”, “hybrid data observatory”), declare the term. That positions you as the canonical source of the definition.
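A hedged sketch of such a declaration (the term, description, and glossary URL are placeholders):

```json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "agentic team OS",
  "description": "One-sentence canonical definition of the term.",
  "inDefinedTermSet": {
    "@type": "DefinedTermSet",
    "name": "Your Brand Glossary",
    "url": "https://yourbrand.com/glossary"
  }
}
```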
llms.txt: the file many ignore but is worth having
It's an emerging standard. A plain markdown file at the root of your domain (/llms.txt) that summarizes your site for AI crawlers without making them parse your CSS and JavaScript. Anthropic, Vercel, Mintlify and FastAPI already implement it.
# Your Brand
## What it is
A clear sentence of what you do.
## Why it matters
The concrete problem you solve.
## Services
- Service A — what it includes, in one line.
- Service B — what it includes, in one line.
## Who we are
Team, location, public sites.
## Contact
Email, web.
If your site has technical documentation (typical in B2B SaaS), there is also /llms-full.txt, a more extensive dump of the docs.
Permissive robots.txt: the call many get backwards
Most sites block AI crawlers in robots.txt fearing “content theft”. If your goal is to be cited, you have to do the exact opposite: allow them.
User-agent: *
Allow: /
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
Sitemap: https://yourbrand.com/sitemap.xml
Edge Middleware to serve a condensed version to bots
An advanced technique: detect the crawler's user-agent and serve it a semantically dense version of your home, without visual chrome, directly with the information it needs. In Next.js, this is done with Edge Middleware.
// src/middleware.ts
import { NextRequest, NextResponse } from "next/server";
const AI_BOT = /(GPTBot|Claude-Web|PerplexityBot|anthropic-ai|Google-Extended)/i;
export function middleware(req: NextRequest) {
const ua = req.headers.get("user-agent") || "";
if (req.nextUrl.pathname === "/" && AI_BOT.test(ua)) {
const url = req.nextUrl.clone();
url.pathname = "/agents";
return NextResponse.rewrite(url);
}
return NextResponse.next();
}
export const config = { matcher: ["/"] };
On the /agents route you serve dense plain text, no header, no navigation, with all the data you want the model to associate with your brand.
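A minimal sketch of what could back that /agents route. The helper below is illustrative (the AgentFacts shape and field names are my own, not a standard); the commented-out handler assumes the Next.js App Router at app/agents/route.ts:

```typescript
// Pure helper that renders the dense plain-text body for /agents.
// The record shape is illustrative — adapt fields to your brand.
interface AgentFacts {
  name: string;       // brand name
  what: string;       // one-sentence description
  services: string[]; // one line per service
  contact: string;    // email or URL
}

function buildAgentsText(f: AgentFacts): string {
  return [
    `# ${f.name}`,
    ``,
    f.what,
    ``,
    `## Services`,
    ...f.services.map((s) => `- ${s}`),
    ``,
    `## Contact`,
    f.contact,
  ].join("\n");
}

// In app/agents/route.ts (Next.js App Router) you would return it as:
// export function GET() {
//   return new Response(buildAgentsText(facts), {
//     headers: { "content-type": "text/plain; charset=utf-8" },
//   });
// }
```

Keeping the rendering in a pure function makes the bot-facing content easy to test and to regenerate whenever your positioning changes.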
Content patterns that raise citation probability
- Direct-answer first. The first 50 words of each page answer the implicit question. LLMs cite the block that resolves, not the one that surrounds it.
- One promise per page. If your home says twelve things, the model doesn't know which to use as definition. If it says one clear thing, it uses it.
- Attributable statistics. Every number with its source. Models prefer citing content with citations — that makes it verifiable.
- Category glossary. If you build the authoritative glossary of your industry's terms, the model picks you as source when someone asks for a definition.
- Cases with data. “We raised Perplexity citations 3.2× for 7 B2B clients in Q1-2026” beats “we improved AI visibility” by miles.
- FAQs per page. Question + brief answer + FAQPage schema. Winning pattern for citation.
How to measure being cited
Without measurement, you don't know if what you do works. A basic loop:
- Define 8–12 keywords that reflect how a buyer talks about your category. Not the ones your marketing uses — the ones the customer uses.
- Each week, run those queries on Perplexity, ChatGPT, Claude and Gemini. Log: do they cite you? what do they say? who else do they cite?
- Save the results with timestamp in a table. Compare the next week. The curve over time is the only serious metric.
- When a model updates (cut-off, new version, provider change), re-run the full batch. Changes are usually abrupt.
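The loop above fits in a simple append-only log. A minimal sketch (the record shape and file name are my own, not a standard):

```typescript
// One row per (model, query) check. The shape is illustrative.
interface CitationCheck {
  date: string;          // ISO timestamp of the check
  model: string;         // e.g. "perplexity" | "chatgpt" | "claude" | "gemini"
  query: string;         // the buyer-language keyword you ran
  cited: boolean;        // did the answer cite your brand?
  competitors: string[]; // other brands cited in the same answer
}

// Serialize one check as a JSONL line, ready to append to a log file.
function toJsonlLine(check: Omit<CitationCheck, "date">): string {
  const row: CitationCheck = { date: new Date().toISOString(), ...check };
  return JSON.stringify(row);
}

// Weekly usage, logged by hand after running each query:
// fs.appendFileSync("citations.jsonl",
//   toJsonlLine({ model: "perplexity", query: "...", cited: true, competitors: [] }) + "\n");
```

One JSON line per check makes the week-over-week comparison a one-liner with any tool that reads JSONL, and the timestamp lets you align jumps in the curve with model updates.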
Common mistakes
- Schema copied from a template. If your schema says you're “LocalBusiness” and you're actually a B2B SaaS, the model gets confused. Worth writing it by hand once.
- Stale llms.txt. If you publish llms.txt and later change your services or positioning, update it. Crawlers come back.
- Blocking AI bots “just in case.” Decision that costs visibility without gaining anything concrete.
- A cookie pop-up covering the content. Crawlers don't click modals. If your content lives behind a consent banner, to the model that content doesn't exist.
GEO is a months-long job, not a sprint. The brands building the asset today are the ones that will be the answer when someone asks an AI in 2027.