Making Your Content Discoverable by AI | Hillcraft

A practical guide for organizations sitting on years of content. What's happening when AI skips your work, and what to do about it: robots.txt, format, topic clusters, structured data, and llms.txt kept honest.

Most organizations sitting on years of content have an access and format problem, not a quality problem. AI tools cite generic blogs because they cannot read what you published, or were never allowed to. This guide walks through the fixes, in the order that matters.

SEO and GEO

SEO helps people find your page. GEO (generative engine optimization) helps an AI quote your page. ChatGPT search leans on Bing's index; Google's AI Overviews use Google's index. Crawlability still matters. GEO adds clear structure on top so a machine can pull a clean, quotable point.

How AI Finds and Uses Your Content

Two mechanisms. Training crawlers (GPTBot, ClaudeBot, CCBot) periodically fetch pages for model training. Retrieval crawlers (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot) fetch pages to index them for AI search, or in real time when a user asks a question. Google-Extended is a different kind of thing: it is a robots.txt token, not a separate crawler, and it controls whether content Google has already crawled may be used for Gemini. Most AI crawlers read the raw HTML the server sends and do not execute JavaScript. The thirty-second test: View Page Source on an important page and search for a sentence from the body. If it is not there, most AI crawlers see a blank page.
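The view-source test can also be scripted for a batch of pages. What follows is a minimal sketch, not a crawler simulator: it assumes that a sentence counts as visible when it appears in the text of the server-sent HTML outside script and style tags. The function names and the user-agent string are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects text from HTML, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def sentence_in_raw_html(html, sentence):
    """Return True if the sentence appears in the text of the raw HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalize whitespace and case so line breaks in the markup do not matter.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    needle = " ".join(sentence.split()).lower()
    return needle in text


def fetch_raw_html(url, user_agent="view-source-check/0.1"):
    """Fetch the page as the server sends it; no JavaScript is executed."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Call sentence_in_raw_html(fetch_raw_html(url), "a sentence from the body") for each important page. A False result suggests the text is injected by JavaScript, or sits in an image or PDF, and is worth a manual look.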

Readable vs Invisible

Invisible: scanned PDFs without text layers, JavaScript-only client-rendered apps, content behind login or paywall, video and audio without transcripts. Readable: server-rendered HTML with a sensible heading order, real selectable text, transcripts, clear author and date, summary near the top.

robots.txt First

Many sites accidentally block AI crawlers through platform defaults, security plugins, or a stale robots.txt. Allow the AI crawlers explicitly: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, Google-Extended, CCBot. Watch for a blanket User-agent: * with Disallow: /, which blocks every crawler that does not have its own group. Also check your CDN and bot protection (Cloudflare, for example), which can block crawlers before robots.txt is ever consulted.
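A quick way to audit a robots.txt is to run it through a parser and ask, agent by agent, whether a key URL is allowed. The sketch below uses Python's standard urllib.robotparser. The example file contents and the example.org URL are illustrative assumptions, and each real crawler may interpret edge cases slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this guide.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Google-Extended", "CCBot",
]

# Illustrative robots.txt: one group that allows every listed AI crawler,
# plus a default group that only protects a hypothetical /admin/ area.
EXPLICIT_ALLOW = "\n".join(
    [f"User-agent: {agent}" for agent in AI_CRAWLERS]
    + ["Allow: /", "", "User-agent: *", "Disallow: /admin/"]
)


def blocked_crawlers(robots_txt, url, agents=AI_CRAWLERS):
    """Return the agents that this robots.txt text disallows for the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # A stale file: everything blocked by default, only GPTBot allowed.
    stale = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /\n"
    page = "https://example.org/reports/annual"
    print("Blocked by stale file:", blocked_crawlers(stale, page))
    print("Blocked by explicit file:", blocked_crawlers(EXPLICIT_ALLOW, page))
```

This only models robots.txt. A CDN or bot-protection rule can still refuse the request, so check server logs or the CDN dashboard for blocked AI user agents as well.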

Fix the Format

Inventory your content, then prioritize flagship and evergreen material. Convert PDFs to real HTML pages with an H1, H2s, a summary, author, date, and the PDF as an optional download. OCR scanned PDFs first (Google Docs can do it for free). For video and audio, publish a page with a summary at the top, the player beneath it, and the full transcript below. Avoid hidden summary meta tags; AI does not read invented metadata. A visible summary at the top of real content does the work.
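As a rough illustration of what "real HTML page" means for a converted PDF, here is a minimal sketch of a page template. The function name, the CSS class names, and the sample values are assumptions for this example only; a real site would use its own CMS template.

```python
from html import escape


def render_report_page(title, author, published, summary, sections, pdf_url):
    """Build a server-rendered article: H1, byline, summary, H2 sections, PDF link.

    sections is a list of (heading, [paragraph, ...]) pairs.
    published is an ISO date string such as "2023-04-01".
    """
    parts = [
        "<article>",
        f"<h1>{escape(title)}</h1>",
        f'<p class="byline">By {escape(author)}, '
        f'<time datetime="{escape(published)}">{escape(published)}</time></p>',
        # The summary is visible text near the top, not a hidden meta tag.
        f'<p class="summary">{escape(summary)}</p>',
    ]
    for heading, paragraphs in sections:
        parts.append(f"<h2>{escape(heading)}</h2>")
        parts.extend(f"<p>{escape(text)}</p>" for text in paragraphs)
    # The original PDF stays available, but only as an optional download.
    parts.append(f'<p><a href="{escape(pdf_url)}">Download the original PDF</a></p>')
    parts.append("</article>")
    return "\n".join(parts)
```

The order is the point: title, author and date, visible summary, headed sections of selectable text, and the PDF link last.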

Topic Clusters

A hub page introduces a topic and links to pillar pages; pillar pages link to supporting articles. Every page links back up to its parent and to the hub. A connected cluster signals authority to machines. Strong candidate topics already have 20 to 40 pieces; the linking and hierarchy are usually what is missing.
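Missing links in a cluster are easy to find once the internal links are in a simple map. The sketch below makes two assumptions: "connected" means every page is reachable from the hub by following links, and "links back" means every page links directly to the hub. The paths are invented for illustration; in practice the map would come from a site crawl or a CMS export.

```python
from collections import deque


def cluster_gaps(links, hub):
    """Find pages that break the cluster.

    links maps each page in the cluster to the set of internal pages it links to.
    Returns pages the hub cannot reach and pages that never link to the hub.
    """
    # Breadth-first walk from the hub through pages that belong to the cluster.
    seen = {hub}
    queue = deque([hub])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in links and target not in seen:
                seen.add(target)
                queue.append(target)

    return {
        "unreachable_from_hub": sorted(set(links) - seen),
        "missing_link_to_hub": sorted(
            page for page in links if page != hub and hub not in links[page]
        ),
    }


if __name__ == "__main__":
    example = {
        "/water": {"/water/policy", "/water/data"},
        "/water/policy": {"/water", "/water/policy/2023-rules"},
        "/water/policy/2023-rules": {"/water/policy"},
        "/water/data": {"/water"},
        "/water/old-report": {"/water"},
    }
    print(cluster_gaps(example, "/water"))
```

In the example, /water/old-report is an orphan that nothing links to, and /water/policy/2023-rules links to its pillar but not to the hub.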

Structured Data

JSON-LD (schema.org) labels parts of your page so machines understand author, date, and organization. Embed it in the server-rendered HTML, typically in the head. Use ChatGPT or Claude to draft the schema, then validate it with Google's Rich Results Test. Schema is a supporting move, not a magic switch. Google said in May 2026 no special markup is required to appear in its AI surfaces. Fundamentals matter more.
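For a sense of what the markup looks like, here is a minimal sketch that builds an Article JSON-LD script tag. The property names are standard schema.org vocabulary; the function name and the sample values are assumptions, and the output should still go through a validator before publishing.

```python
import json


def article_jsonld(headline, author_name, publisher_name, date_published, url):
    """Return a <script type="application/ld+json"> tag describing an Article."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
        "datePublished": date_published,  # ISO 8601 date, e.g. "2023-04-01"
        "mainEntityOfPage": url,
    }
    # Escape "</" so a value can never close the script tag early.
    # "\/" is a valid JSON escape for "/", so the data is unchanged.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    print(article_jsonld(
        "Annual Water Report",
        "A. Rivera",
        "Example Water Trust",
        "2023-04-01",
        "https://example.org/reports/annual-water-report",
    ))
```

The tag belongs in the HTML the server sends, so it passes the same view-source test as the body text.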

llms.txt, Kept Honest

llms.txt is a Markdown index file proposed by Jeremy Howard in 2024. As of 2026 no major AI provider has committed to it as a ranking signal; Google does not use it. Useful for developer tooling and AI agents fetching documentation. Cheap to add, not a substitute for the fundamentals.
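If you do add one, the file is short. The sketch below follows the layout described in the llms.txt proposal as we understand it: an H1 with the site name, a blockquote summary, then H2 sections listing links with brief notes. The function name, section names, and URLs are invented for illustration.

```python
def build_llms_txt(site_name, summary, sections):
    """Build llms.txt content.

    sections maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, entries in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in entries:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Example Water Trust",
        "Research and guidance on regional water use.",
        {
            "Reports": [
                ("Annual Water Report",
                 "https://example.org/reports/annual-water-report",
                 "Yearly usage figures and findings"),
            ],
        },
    ))
```

Save the output as /llms.txt at the site root and link only to pages that already pass the readability checks above.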

30-Day Path

Week one: check robots.txt and CDN, run view-source test on three pages, submit sitemap to Google Search Console, add Organization schema. Weeks two through four: rebuild one trapped PDF as a web page, build one small topic cluster (hub plus five pieces), add transcript and summary page for your best video, add Article schema and validate it. Repeat.
