Crawlability
Definition
Crawlability is the ability of search engine crawlers (such as Googlebot) and AI bots (such as GPTBot and ClaudeBot) to access a website's pages and to read and parse their content.
Crawlability is the most fundamental condition for SEO and AEO performance. If bots cannot crawl pages, indexing is impossible; without indexing there is no ranking; and content cannot be used as AI answer data. Even with excellent content and backlinks, blocked crawlability yields no SEO benefit.
Summary
Crawlability essentials: ① A page must be crawlable, ② then indexable, ③ then rankable, in that order. ④ If stage 1 is blocked, stages 2 and 3 fail. ⑤ Check the 7 blocking causes. ⑥ AI bots do not execute JS, so SSR/SSG is required. ⑦ Audit regularly with GSC URL Inspection and Screaming Frog. Pay particular attention to Cloudflare settings that block AI bots automatically.
Crawlability, Indexing, and Ranking: 3 Stages
SEO operates through three sequential stages.
Stage 1: Crawlability (Accessibility)
Can bots access page URLs, download HTML, and parse content? Failure at this stage makes all later stages impossible.
Stage 2: Indexing (Storability)
Stage of storing crawled content in Google’s index. Storage may fail due to noindex tags or low-quality judgments. See Crawling vs Indexing for details.
Stage 3: Ranking (Exposure Order)
Stage where indexed pages appear at specific positions in actual search results.
All three stages must be satisfied for search traffic to occur.
7 Causes of Crawlability Blocking
1. robots.txt Disallow
The most common blocking cause. A Disallow: / rule in robots.txt blocks crawling of the entire site, and a path-level Disallow rule blocks every page under that path.
# Wrong example — block entire site
User-agent: *
Disallow: /
# AI bot block example
User-agent: GPTBot
Disallow: /
See Allowing AI Bots in robots.txt and llms.txt for details.
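The effect of rules like the ones above can be checked offline before deploying them. The following is a minimal sketch using Python's standard urllib.robotparser; the robots.txt text, the can_crawl helper, and the example.com URLs are illustrative assumptions, not settings from any real site. Python's parser applies the first matching rule, so results for overlapping Allow/Disallow rules may differ from how a given crawler resolves them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide, blocks /admin/ for all other bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Print an allow/block matrix for a few bots and paths.
for bot in ("Googlebot", "GPTBot", "ClaudeBot"):
    for url in ("https://example.com/blog/post", "https://example.com/admin/login"):
        print(bot, url, can_crawl(ROBOTS_TXT, bot, url))
```

Running the same check against the live file (by passing its downloaded text) gives a quick view of which bots are blocked from which paths.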
2. Firewall/CDN IP Blocking
Cloudflare bot defense may block some AI bots by default. WAF rules and rate limiting can block crawler requests.
3. Authentication/Login Required
Member-only content and login-required pages are inaccessible to bots. Intentional design is fine, but accidentally applying authentication to content that should be public makes indexing impossible.
4. JavaScript Rendering Dependency
CSR (client-side rendering) pages have no actual content in HTML source. Googlebot handles them with two-stage rendering (HTML parsing → JS execution) but with delay. GPTBot, ClaudeBot, PerplexityBot, etc. do not execute JavaScript and see blank pages. See JavaScript SEO and Rendering for details.
5. Deep Crawl Depth
Pages 5+ clicks from home are hard for bots to reach. See Crawl Depth for details.
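Click depth can be estimated from an internal-link graph with a breadth-first search. This is a minimal sketch: the SITE graph, the function names, and the max_depth threshold of 4 (so that pages 5+ clicks away are flagged) are assumptions for illustration. In practice the graph would come from a crawl export.

```python
from collections import deque


def crawl_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search from home; depth is the minimum number of clicks."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def deep_pages(links: dict[str, list[str]], home: str, max_depth: int = 4) -> list[str]:
    """Return pages that are more than max_depth clicks from home."""
    return sorted(p for p, d in crawl_depths(links, home).items() if d > max_depth)


# Hypothetical internal-link graph: page -> pages it links to.
SITE = {
    "/": ["/blog", "/about"],
    "/blog": ["/blog/page-2"],
    "/blog/page-2": ["/blog/page-3"],
    "/blog/page-3": ["/blog/page-4"],
    "/blog/page-4": ["/blog/old-post"],
}

print(deep_pages(SITE, "/"))
```

Pages that never appear in the result of crawl_depths are unreachable through internal links at all, which is a separate problem worth listing.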
6. Broken Internal Links
If linked pages do not exist (404) or redirect chains are deep, bots cannot reach the target page.
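The logic of detecting broken targets and long redirect chains can be sketched without any network access. The RESPONSES table below is a made-up stand-in for real HTTP responses, and the limit of 5 hops is an assumed cut-off for illustration, not a documented limit of any crawler.

```python
REDIRECT_CODES = {301, 302, 307, 308}

# Hypothetical responses: URL -> (status code, redirect target or None).
RESPONSES = {
    "/old": (301, "/older"),
    "/older": (301, "/new"),
    "/new": (200, None),
    "/a": (302, "/b"),
    "/b": (302, "/a"),  # redirect loop
}


def follow(url, responses, max_hops=5):
    """Follow redirects in a simulated response table.

    Returns (final_url, final_status, hops). final_status is None when the
    hop limit is reached, which signals an overly long chain or a loop.
    Unknown URLs are treated as 404.
    """
    hops = 0
    while True:
        status, location = responses.get(url, (404, None))
        if status not in REDIRECT_CODES or location is None:
            return url, status, hops
        if hops == max_hops:
            return url, None, hops
        url = location
        hops += 1


for start in ("/old", "/missing", "/a"):
    print(start, follow(start, RESPONSES))
```

Any internal link whose result is a 404, a None status, or more than one hop is a candidate for pointing directly at the final URL.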
7. Server Errors (5xx)
Repeated 500/503 responses cause bots to reduce retries and eventually lower crawl frequency for those URLs.
5-Step Crawlability Audit
Step 1: Check robots.txt Directly
Access example.com/robots.txt directly and review Disallow rules. Verify important paths are not accidentally blocked. Also check AI bot-specific settings.
Step 2: Check XML Sitemap
Access example.com/sitemap.xml and confirm all important pages are included. Pages not in the sitemap are hard for bots to discover.
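The sitemap check can be partly automated by comparing the URLs in the sitemap with a list of pages that must be discoverable. This is a minimal sketch with the standard xml.etree.ElementTree module; SAMPLE_SITEMAP and the list of important URLs are hypothetical, and a sitemap index file (which nests other sitemaps) would need one extra step that is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; in practice this is the downloaded sitemap.xml text.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/what-is-seo</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str) -> set[str]:
    """Return the set of <loc> URLs listed in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_sitemap(xml_text: str, important_urls: list[str]) -> list[str]:
    """Return important URLs that the sitemap does not list."""
    return sorted(set(important_urls) - sitemap_urls(xml_text))


print(missing_from_sitemap(
    SAMPLE_SITEMAP,
    ["https://example.com/", "https://example.com/pricing"],
))
```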
Step 3: Google Search Console URL Inspection
The URL Inspection tool in GSC shows the crawl status, last crawl time, and whether crawling is allowed for a specific page. Messages such as "URL is not on Google" or "Crawl error" require immediate action. See Google Search Console for details.
Step 4: Screaming Frog Full Crawl Audit
Crawl the entire site with Screaming Frog to identify 404 errors, redirect chains, crawl depth, and robots.txt-blocked pages at once. Change User-Agent to simulate Googlebot or AI bots.
Step 5: Firewall/CDN Bot Policy Review
Review bot-related rules in Cloudflare, AWS CloudFront, and other CDN/firewall settings. Confirm search bots and AI bots are not on block lists; add required bots to allowlists. Server access logs showing Googlebot, GPTBot, etc. visits provide the most accurate diagnosis.
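Server access logs can be summarized per bot and status code to see whether crawlers are being served or blocked. The sketch below assumes the common "combined" log format; the sample lines, IP addresses, and simplified User-Agent strings are invented for illustration, and real User-Agent strings are longer. User-Agent values can also be spoofed, so this is a first pass rather than proof of a genuine bot visit.

```python
import re
from collections import Counter

BOT_TOKENS = ("Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot")

# Combined log format: ... "METHOD path HTTP/x" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical log lines with simplified User-Agent strings.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Jan/2025:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.20 - - [10/Jan/2025:10:00:05 +0000] "GET /blog/ HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.30 - - [10/Jan/2025:10:00:09 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]


def bot_status_counts(lines):
    """Count requests per (bot, HTTP status) for known bot User-Agent tokens."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        for bot in BOT_TOKENS:
            if bot in match.group("ua"):
                counts[(bot, int(match.group("status")))] += 1
                break
    return counts


for (bot, status), n in sorted(bot_status_counts(SAMPLE_LOG).items()):
    print(bot, status, n)
```

A bot that appears only with 403 or 5xx statuses, or does not appear at all, points back to the firewall/CDN rules reviewed in this step.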
AI Bot Crawlability (AEO Core)
Googlebot JS Processing
Googlebot executes JavaScript but uses a two-stage rendering queue, so indexing takes longer. For CSR pages, the initial HTML is processed first, and the JavaScript-rendered content is indexed only after rendering completes, which adds delay. See JavaScript SEO for details.
AI Bots Read HTML Only
GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot do not execute JavaScript. On React, Vue, and Angular SPAs rendered with CSR, HTML source has no actual content, so AI bots see blank pages.
# CSR page HTML source seen by AI bots
<body>
<div id="root"></div> <!-- no text -->
</body>
# SSR/SSG page HTML source seen by AI bots
<body>
<h1>What is SEO?</h1>
<p>SEO is search engine optimization...</p> <!-- content present -->
</body>
From an AEO perspective, AI bot crawlability is a prerequisite for AI answer data citation. See Rendering for details.
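A rough way to see what a non-rendering bot receives is to extract the text from the raw HTML without running any JavaScript. This is a minimal sketch using the standard html.parser module; the TextExtractor class and the two sample documents are illustrative assumptions modeled on the examples above, and it does not claim to reproduce how any particular bot parses pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a parser sees in raw HTML, with no JavaScript executed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


CSR_HTML = '<body><div id="root"></div> <!-- no text --><script src="/app.js"></script></body>'
SSR_HTML = "<body><h1>What is SEO?</h1><p>SEO is search engine optimization...</p></body>"

print(repr(visible_text(CSR_HTML)))
print(repr(visible_text(SSR_HTML)))
```

If visible_text returns an empty or near-empty string for the downloaded source of a page, that page is likely to look blank to bots that read HTML only.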
4 Crawlability and Indexing Combinations
Crawlable + Indexable (Normal)
Most desirable state. Eligible for search results.
Crawlable + Indexing Blocked (Intentional)
Crawling allowed but indexing blocked with noindex tag. Used for admin pages, staging environments, etc. See Using noindex Tags for details.
Crawl Blocked + Indexable (Risky Combination)
Blocked by robots.txt but URL indexed due to external backlinks. URL indexed without content lowers quality signals.
Crawl Blocked + Indexing Blocked (Full Block)
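Crawling and indexing are both blocked, for example by repeated server errors, so the page is fully excluded from search results. Do not rely on a robots.txt + noindex combination for this: a bot blocked by robots.txt never fetches the page, so it cannot read the noindex tag, and the URL can still be indexed through external links (the risky combination above). To remove a page from the index, allow crawling and serve noindex.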
5 Ways to Improve Crawlability
1. Clean Up robots.txt and Explicitly Allow AI Bots
Confirm AI bots are not blocked; explicitly allow them if pursuing AEO. See Allowing AI Bots in robots.txt for details.
2. Apply SSR/SSG
Convert React, Vue, and Angular SPAs to Next.js/Nuxt.js SSG/SSR modes to include actual content in HTML source. See JavaScript SEO and Rendering for details.
3. Keep XML Sitemap Updated with Auto-Generation
Configure CMS to auto-update sitemap on content publish. Submit sitemap in GSC to encourage faster crawl requests.
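What such an auto-generation hook produces can be sketched in a few lines. The build_sitemap function and the page list below are hypothetical; how a given CMS triggers regeneration on publish is specific to that CMS and is not shown here.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages) -> str:
    """Build a sitemap XML string from (url, lastmod) pairs.

    lastmod is expected as an ISO date string such as "2025-01-15".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical list of published pages, e.g. queried from the CMS database.
PAGES = [
    ("https://example.com/", "2025-01-10"),
    ("https://example.com/blog/new-post", "2025-01-15"),
]

print(build_sitemap(PAGES))
```

Calling a function like this from the publish step, and writing the result to the sitemap location, keeps new URLs discoverable without manual edits.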
4. Strengthen Internal Links
Add bidirectional links from deep pages to new content and from new content to existing related content. See Internal Linking Strategy for details.
5. Manage Firewall/CDN Allowlists
Add Googlebot, GPTBot, ClaudeBot, and PerplexityBot IP ranges to allowlists on Cloudflare, AWS, etc., or exclude them from block rules. Monitor bot access in server logs periodically.
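Matching a request IP against allowlisted ranges can be sketched with the standard ipaddress module. The CIDR ranges and bot names below are placeholders taken from documentation-reserved address blocks, not real crawler ranges; the actual ranges should come from the lists each bot operator publishes.

```python
import ipaddress

# Placeholder ranges from documentation-reserved blocks, NOT real bot IP ranges.
ALLOWLIST = {
    "ExampleSearchBot": ["192.0.2.0/24"],
    "ExampleAIBot": ["198.51.100.0/25", "203.0.113.64/26"],
}


def matching_bot(ip: str, allowlist=ALLOWLIST):
    """Return the allowlisted bot whose range contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for bot, ranges in allowlist.items():
        if any(addr in ipaddress.ip_network(cidr) for cidr in ranges):
            return bot
    return None


for ip in ("192.0.2.15", "198.51.100.200", "203.0.113.70"):
    print(ip, matching_bot(ip))
```

Combining this with the User-Agent seen in server logs helps separate genuine bot traffic from requests that only claim a bot name.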
Korea Market Application
Bot Blocking Issues on Korean Hosting
Some Korean hosting services have strong default bot blocking for security. Cafe24 provides malicious bot blocking by default; settings may block Googlebot and AI bots together. Always verify search bot allowance under Settings > Security > Bot management.
Mobile m.example.com Split Sites
When operating separate mobile sites at m.example.com, audit desktop and mobile crawlability separately. Mobile pages alone may be crawl-blocked, or desktop pages may be missing.
See Mobile-First Indexing for details.
Naver Search Bot (Yeti)
Naver's search bot, Yeti, also respects robots.txt. Unintended rules may allow Googlebot but block Yeti, or the reverse, so audit robots.txt for each bot separately. See Naver SEO and Naver Search Advisor for details.
Frequently Asked Questions
Q. Are crawlability and indexability different concepts?
A. Yes. Crawlability is whether bots can access pages and read HTML. Indexability is whether crawled pages can be stored in Google’s index. Pages may be crawlable but not indexed due to noindex tags or low-quality judgments. See Crawling vs Indexing for details.
Q. Can AI bots be blocked when using Cloudflare?
A. Yes. Cloudflare Bot Fight Mode or WAF rules may block GPTBot, ClaudeBot, and other AI bots. Check settings under Cloudflare dashboard > Security > Bots and add allowed AI bot User-Agents to allowlist rules.
Q. Do SSG-built sites have no crawlability issues?
A. SSG (static site generation) is the most crawl-friendly option. All pages are pre-rendered as HTML, so every bot can read the full content without executing JavaScript. However, if builds are slow or infrequent, the published HTML can lag behind the latest content, so consider a hybrid approach such as ISR (incremental static regeneration). See Rendering for details.
Q. Why is indexing blocked even without crawl blocking?
A. The main reasons a page is crawled but not indexed are: ① a noindex tag, ② duplicate content (another URL is selected as canonical), ③ a thin-content judgment, ④ a misconfigured canonical tag, and ⑤ exhausted crawl budget. When GSC URL Inspection shows a status other than "Indexed", check the reason it reports. See Indexing Coverage Diagnosis for details.
Q. How often should crawlability be audited?
A. Audit small sites quarterly and medium sites monthly, and run an extra audit immediately after any bulk content addition. Checking the GSC indexing coverage report weekly catches most issues early. See Crawl Budget for details.
Related Sources
- Google Search Central (2024). Crawling and indexing. https://developers.google.com/search/docs/fundamentals/how-search-works
- Google Search Central (2024). Robots.txt specifications. https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
- Screaming Frog (2024). How to use Screaming Frog SEO Spider. https://www.screamingfrog.co.uk/seo-spider/