📘Concept⭐️ Pillar

Crawlability

Definition

Crawlability is the ability of search engine crawlers (such as Googlebot) and AI bots (such as GPTBot and ClaudeBot) to access a website's pages and to read and parse their content.

Crawlability is the most fundamental condition for SEO and AEO performance. If bots cannot crawl pages, indexing is impossible; without indexing there is no ranking; and content cannot be used as AI answer data. Even with excellent content and backlinks, blocked crawlability yields no SEO benefit.


Summary

Crawlability essentials: ① The stages run in sequence: crawlable → indexable → rankable. ② If stage 1 is blocked, stages 2 and 3 fail. ③ Check the 7 blocking causes. ④ AI bots do not execute JavaScript, so SSR/SSG is required. ⑤ Audit regularly with GSC URL Inspection and Screaming Frog. Pay particular attention to Cloudflare settings that block AI bots automatically.


Crawlability, Indexing, and Ranking: 3 Stages

[COMPARISON_TABLE: Crawlability → Indexing → Ranking 3-stage sequential structure]

SEO operates through three sequential stages.

Stage 1: Crawlability (Accessibility)
Can bots access page URLs, download HTML, and parse content? Failure at this stage makes all later stages impossible.

Stage 2: Indexing (Storability)
The stage where crawled content is stored in Google’s index. A page can be crawled and still not be stored, either because of a noindex tag or because it is judged low quality. See Crawling vs Indexing for details.

Stage 3: Ranking (Exposure Order)
Stage where indexed pages appear at specific positions in actual search results.

All three stages must be satisfied for search traffic to occur.


7 Causes of Crawlability Blocking

1. robots.txt Disallow

The most common blocking cause. In robots.txt, Disallow: / blocks the entire site, and a Disallow rule for a specific path blocks crawling of every page under that path.

# Wrong example: blocks the entire site for every bot
User-agent: *
Disallow: /

# AI bot block example
User-agent: GPTBot
Disallow: /

See Allowing AI Bots in robots.txt and llms.txt for details.
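
Rules like these can be checked offline before they are deployed. The following is a minimal sketch using Python's standard urllib.robotparser module. The sample rules, the URL, and the helper name allowed_bots are illustrative assumptions, not taken from any real site. Python's parser may also differ from a given crawler's own matching logic in edge cases, so treat the result as a first check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser


def allowed_bots(robots_txt: str, url: str, bots: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# Hypothetical robots.txt: GPTBot is blocked site-wide, everyone else only from /admin/.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

result = allowed_bots(
    SAMPLE, "https://example.com/blog/post", ["Googlebot", "GPTBot", "ClaudeBot"]
)
print(result)
```

Running the same check against each important URL makes an accidental site-wide or bot-specific block visible before it reaches production.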

2. Firewall/CDN IP Blocking

Cloudflare bot defense may block some AI bots by default. WAF rules and rate limiting can block crawler requests.

3. Authentication/Login Required

Member-only content and login-required pages are inaccessible to bots. Intentional design is fine, but accidentally applying authentication to content that should be public makes indexing impossible.

4. JavaScript Rendering Dependency

CSR (client-side rendering) pages have no actual content in the initial HTML source; the content appears only after JavaScript runs. Googlebot handles them with two-stage rendering (HTML parsing → JS execution), but with a delay. GPTBot, ClaudeBot, PerplexityBot, and similar AI bots do not execute JavaScript, so they see blank pages. See JavaScript SEO and Rendering for details.

5. Deep Crawl Depth

Pages 5+ clicks from home are hard for bots to reach. See Crawl Depth for details.
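
Click depth can be computed with a breadth-first search over the internal link graph. The sketch below assumes the link graph has already been collected (for example, exported from a crawler). The SITE dictionary and the function name click_depths are hypothetical.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search: minimum number of clicks from home to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph: page -> pages it links to.
SITE = {
    "/": ["/blog", "/about"],
    "/blog": ["/blog/page-2"],
    "/blog/page-2": ["/blog/page-3"],
    "/blog/page-3": ["/blog/page-4"],
    "/blog/page-4": ["/blog/old-post"],
}

depths = click_depths(SITE, "/")
deep_pages = sorted(page for page, depth in depths.items() if depth >= 5)
print(deep_pages)
```

Pages that appear in the site but not in the returned dictionary are unreachable through internal links, which is a separate problem worth flagging.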

6. Broken Internal Links

If linked pages do not exist (404) or redirect chains are too long, bots cannot reach the target page.
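
Both problems can be detected by following each internal link to its final response. This is a simplified sketch that works on a pre-recorded map of responses instead of making live HTTP requests. The RESPONSES data, the function name follow, and the max_hops limit of 5 are assumptions for illustration.

```python
from typing import Optional

REDIRECT_CODES = {301, 302, 307, 308}


def follow(
    url: str,
    responses: dict[str, tuple[int, Optional[str]]],
    max_hops: int = 5,
) -> tuple[Optional[int], list[str]]:
    """Follow redirects for at most max_hops lookups.

    Returns (final_status, chain). final_status is None when the chain
    is still redirecting after max_hops lookups (too long, or a loop).
    """
    chain = [url]
    for _ in range(max_hops):
        # URLs missing from the recorded map are treated as 404.
        status, location = responses.get(url, (404, None))
        if status in REDIRECT_CODES and location:
            url = location
            chain.append(url)
        else:
            return status, chain
    return None, chain


# Hypothetical recorded responses: url -> (status, Location header).
RESPONSES = {
    "/old": (301, "/older"),
    "/older": (301, "/new"),
    "/new": (200, None),
}

print(follow("/old", RESPONSES))
print(follow("/missing", RESPONSES))
```

A chain longer than one hop is usually worth collapsing into a single redirect to the final URL.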

7. Server Errors (5xx)

Repeated 500/503 responses cause bots to reduce retries and eventually lower crawl frequency for those URLs.


5-Step Crawlability Audit

Step 1: Check robots.txt Directly

Access example.com/robots.txt directly and review Disallow rules. Verify important paths are not accidentally blocked. Also check AI bot-specific settings.

Step 2: Check XML Sitemap

Access example.com/sitemap.xml and confirm all important pages are included. Pages missing from the sitemap are harder for bots to discover, especially when few internal links point to them.
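
The comparison between the sitemap and the list of important pages can be automated. The sketch below parses a sitemap with Python's standard xml.etree.ElementTree module and reports which important URLs are absent. The sample sitemap, the IMPORTANT list, and the function names are hypothetical, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> set[str]:
    """Collect every <loc> value from a plain (non-index) XML sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_sitemap(xml_text: str, important_urls: list[str]) -> list[str]:
    """Return the important URLs that the sitemap does not list."""
    return sorted(set(important_urls) - sitemap_urls(xml_text))


# Hypothetical sitemap content and list of pages that must be discoverable.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/what-is-seo</loc></url>
</urlset>"""

IMPORTANT = ["https://example.com/", "https://example.com/pricing"]

missing = missing_from_sitemap(SAMPLE_SITEMAP, IMPORTANT)
print(missing)
```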

Step 3: Google Search Console URL Inspection

The GSC URL Inspection tool shows the crawl status, last crawl time, and crawlability of a specific page. Messages such as "URL is not on Google" or "Crawl error" require immediate action. See Google Search Console for details.

Step 4: Screaming Frog Full Crawl Audit

Crawl the entire site with Screaming Frog to identify 404 errors, redirect chains, crawl depth, and robots.txt-blocked pages at once. Change User-Agent to simulate Googlebot or AI bots.

Step 5: Firewall/CDN Bot Policy Review

Review bot-related rules in Cloudflare, AWS CloudFront, and other CDN/firewall settings. Confirm search bots and AI bots are not on block lists; add required bots to allowlists. Server access logs showing Googlebot, GPTBot, etc. visits provide the most accurate diagnosis.
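
As a rough illustration of the log-based diagnosis, the sketch below counts responses per bot and status code in access log lines. It assumes the common "combined" log format. The log lines, IP addresses, and user-agent strings are made up for the example, and the function name bot_status_counts is hypothetical. User-agent strings can be spoofed, so a match here shows what the request claimed to be, not who sent it.

```python
import re
from collections import Counter

BOT_TOKENS = ("Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot")

# Combined log format: "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"\S+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_status_counts(lines: list[str]) -> Counter:
    """Count (bot, status) pairs for log lines whose user agent names a known bot."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line is not in the expected format
        for bot in BOT_TOKENS:
            if bot in match.group("agent"):
                counts[(bot, int(match.group("status")))] += 1
                break
    return counts


# Made-up log lines using documentation-only IP addresses.
LOG = [
    '198.51.100.7 - - [10/Jan/2025:10:00:00 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "ExampleAgent (compatible; Googlebot/2.1)"',
    '192.0.2.10 - - [10/Jan/2025:10:00:05 +0000] "GET /blog HTTP/1.1" 403 312 "-" "ExampleAgent (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2025:10:00:09 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"',
]

counts = bot_status_counts(LOG)
print(counts)
```

A bot that appears only with 403 or 5xx statuses, or does not appear at all, points to a firewall or CDN rule rather than a robots.txt issue.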


AI Bot Crawlability (AEO Core)

Googlebot JS Processing

Googlebot executes JavaScript but uses a two-stage rendering queue, so actual indexing takes time. For CSR pages, the initial HTML is processed first, and the JavaScript-rendered content is processed only after the page passes through the render queue, which adds a delay. See JavaScript SEO for details.

AI Bots Read HTML Only

GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot do not execute JavaScript. On React, Vue, and Angular SPAs rendered with CSR, HTML source has no actual content, so AI bots see blank pages.

# CSR page HTML source seen by AI bots
<body>
  <div id="root"></div>  <!-- no text -->
</body>

# SSR/SSG page HTML source seen by AI bots
<body>
  <h1>What is SEO?</h1>
  <p>SEO is search engine optimization...</p>  <!-- content present -->
</body>

From an AEO perspective, AI bot crawlability is a prerequisite for AI answer data citation. See Rendering for details.
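
One way to approximate what a non-JavaScript bot receives is to extract the visible text from the raw HTML response. The sketch below uses Python's standard html.parser module. The class and function names are hypothetical, the two HTML samples mirror the ones above, and real bots may extract text differently.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of script, style, and template."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text a parser sees without executing any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


CSR_HTML = '<body>\n  <div id="root"></div>\n  <script>window.__APP__ = {}</script>\n</body>'
SSR_HTML = "<body><h1>What is SEO?</h1><p>SEO is search engine optimization...</p></body>"

print(repr(visible_text(CSR_HTML)))
print(repr(visible_text(SSR_HTML)))
```

An empty or near-empty result for a page that looks complete in a browser suggests the content depends on client-side rendering.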


4 Crawlability and Indexing Combinations

Crawlable + Indexable (Normal)
Most desirable state. Eligible for search results.

Crawlable + Indexing Blocked (Intentional)
Crawling allowed but indexing blocked with noindex tag. Used for admin pages, staging environments, etc. See Using noindex Tags for details.

Crawl Blocked + Indexable (Risky Combination)
Crawling is blocked by robots.txt, but the URL is still indexed because external backlinks point to it. A URL indexed without its content lowers quality signals.

Crawl Blocked + Indexing Blocked (Full Block)
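A robots.txt block combined with noindex, or repeated server errors. The intent is full exclusion from search results, but a bot blocked by robots.txt cannot read the noindex tag, so the URL can still be indexed through external links. To guarantee exclusion, allow crawling and rely on noindex.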
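
The four combinations can be expressed as a small lookup, which is handy when labeling URLs in an audit spreadsheet or script. This is only a sketch. The function name classify and the label strings are assumptions that follow the headings above.

```python
def classify(crawl_allowed: bool, index_allowed: bool) -> str:
    """Map the two flags to one of the four combinations described above."""
    if crawl_allowed and index_allowed:
        return "Crawlable + Indexable (Normal)"
    if crawl_allowed:
        return "Crawlable + Indexing Blocked (Intentional)"
    if index_allowed:
        return "Crawl Blocked + Indexable (Risky Combination)"
    return "Crawl Blocked + Indexing Blocked (Full Block)"


# Example: a URL disallowed in robots.txt that carries no noindex directive.
print(classify(crawl_allowed=False, index_allowed=True))
```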


5 Ways to Improve Crawlability

1. Clean Up robots.txt and Explicitly Allow AI Bots

Confirm AI bots are not blocked; explicitly allow them if pursuing AEO. See Allowing AI Bots in robots.txt for details.

2. Apply SSR/SSG

Convert React, Vue, and Angular SPAs to Next.js/Nuxt.js SSG/SSR modes to include actual content in HTML source. See JavaScript SEO and Rendering for details.

3. Keep XML Sitemap Updated with Auto-Generation

Configure CMS to auto-update sitemap on content publish. Submit sitemap in GSC to encourage faster crawl requests.
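
Most CMSs provide this through a plugin or built-in feature. For a custom stack, generation can be as simple as the sketch below, which would be called from a publish hook. The function name build_sitemap, the page list, and the dates are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_sitemap(pages: list[tuple[str, str]]) -> str:
    """Build a sitemap from (url, last_modified) pairs; dates are YYYY-MM-DD strings."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical published pages, e.g. queried from the CMS database.
PAGES = [
    ("https://example.com/", "2025-01-10"),
    ("https://example.com/blog/what-is-seo", "2025-01-08"),
]

sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```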

4. Strengthen Internal Links

Add bidirectional links from deep pages to new content and from new content to existing related content. See Internal Linking Strategy for details.

5. Manage Firewall/CDN Allowlists

Add Googlebot, GPTBot, ClaudeBot, and PerplexityBot IP ranges to allowlists on Cloudflare, AWS, etc., or exclude them from block rules. Monitor bot access in server logs periodically.
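
When checking whether a request seen in the logs falls inside an allowlisted range, Python's standard ipaddress module is sufficient. In the sketch below, the CIDR ranges are documentation-only placeholders, not real bot ranges. Actual ranges should come from each bot operator's published list. The function name in_allowlist is hypothetical.

```python
import ipaddress


def in_allowlist(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# Placeholder ranges reserved for documentation; replace with published bot ranges.
ALLOWLIST = ["192.0.2.0/24", "2001:db8::/32"]

print(in_allowlist("192.0.2.44", ALLOWLIST))
print(in_allowlist("203.0.113.9", ALLOWLIST))
```

Checking the source IP in addition to the user agent helps separate genuine bot traffic from requests that only claim a bot's name.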


Korea Market Application

Bot Blocking Issues on Korean Hosting

Some Korean hosting services have strong default bot blocking for security. Cafe24 provides malicious bot blocking by default; settings may block Googlebot and AI bots together. Always verify search bot allowance under Settings > Security > Bot management.

Mobile m.example.com Split Sites

When operating a separate mobile site at m.example.com, audit desktop and mobile crawlability separately. Each host serves its own robots.txt, so the mobile pages alone may be crawl-blocked, or a page may exist on one host and be missing on the other.

See Mobile-First Indexing for details.

Naver Search Bot (Yeti)

Naver search bot Yeti also respects robots.txt. Unintentional settings may allow Googlebot but block Yeti or vice versa — audit robots.txt for both bots separately. See Naver SEO and Naver Search Advisor for details.


Frequently Asked Questions

Q. Are crawlability and indexability different concepts?
A. Yes. Crawlability is whether bots can access pages and read HTML. Indexability is whether crawled pages can be stored in Google’s index. Pages may be crawlable but not indexed due to noindex tags or low-quality judgments. See Crawling vs Indexing for details.

Q. Can AI bots be blocked when using Cloudflare?
A. Yes. Cloudflare Bot Fight Mode or WAF rules may block GPTBot, ClaudeBot, and other AI bots. Check settings under Cloudflare dashboard > Security > Bots and add allowed AI bot User-Agents to allowlist rules.

Q. Do SSG-built sites have no crawlability issues?
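A. SSG (static site generation) is the best option for crawlability as far as rendering is concerned. All pages are pre-rendered as HTML, so all bots can read the full content without executing JavaScript. The other blocking causes, such as robots.txt rules and firewall settings, still apply. Long build cycles may also fail to reflect the latest content, so consider hybrid approaches like ISR. See Rendering for details.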

Q. Why is indexing blocked even without crawl blocking?
A. The main reasons a page is crawled but not indexed are: ① a noindex tag, ② duplicate content (another URL is canonical), ③ a thin content judgment, ④ canonical misconfiguration, and ⑤ Google crawl budget exhaustion. If GSC URL Inspection shows a status other than "Indexed", check the reason it reports. See Indexing Coverage Diagnosis for details.

Q. How often should crawlability be audited?
A. Small sites: quarterly. Medium sites: monthly. After bulk content additions, audit immediately. Weekly monitoring of the GSC indexing coverage report catches most issues early. See Crawl Budget for details.


Related Items

📙How-to
llms.txt Writing Guide
llms.txt is a markdown-format metadata file that helps LLMs understand site content efficiently, placed at the site root (/) as an AI-friendly site guide.
📘Concept
Crawl Budget
Crawl budget is the number of pages Googlebot can and wants to crawl on your site within a given period — relevant for large sites where crawl allocation affects indexing speed and coverage.
📘Concept
Google Search Console
Google Search Console (GSC) is a free tool from Google for monitoring site search performance, diagnosing indexing issues, and submitting sitemaps — the essential foundation for SEO measurement.
📙How-to
Indexing Coverage Diagnosis
Indexing coverage diagnosis uses the GSC indexing report to check overall site indexing status, identify causes of unindexed pages, and fix them — a core SEO task.
📘ConceptPillar
GEO Master Guide: 5-Area Checklist
An execution guide for Generative AI Optimization covering GEO's five areas: content, structure, technical, off-site, and measurement.
📘ConceptPillar
What Is AEO?
AEO is the practice of optimizing content so AI answer engines cite it.
📙How-to
Naver Search Advisor Registration Guide
Naver Search Advisor is Naver's official free webmaster tool and an essential setup for the Korean market, providing site indexing status, sitemap submission, and search visibility analysis.
📘ConceptPillar
How Naver SEO Works
Naver SEO aims for top visibility in unified search on Naver, Korea's leading search platform, where the channel-trust-centered C-Rank algorithm differs fundamentally from Google.
📘ConceptPillar
Internal Linking Strategy
Internal linking strategy is the practice of semantically connecting pages within your own site to optimize topic authority and bot and user navigation.
📘Concept
Noindex
noindex is an on-page crawl control directive that tells search engine bots not to include a page in search results via robots meta tags or HTTP headers. It excludes pages that do not need or should not appear in search from the index, saving crawl budget and improving site quality signals.
📘Concept
301 Redirect
A 301 redirect is an HTTP status code that tells browsers and search engines a URL has permanently moved. It transfers PageRank and backlink authority from the old URL to the new one, enabling URL structure changes without SEO loss — a core technical SEO tool.
📘Concept
Crawl Depth
Crawl depth (click depth) is the number of clicks required to reach a page from the homepage. It is a core site structure metric that determines page discovery priority for search engine and AI bots and PageRank transfer efficiency.
📘Concept
Crawling vs Indexing
Crawling is the process where search engine bots follow links across the web and collect pages. Indexing is the process of analyzing collected pages and storing them in a search database. These are the first two stages of SEO’s three stages: crawling → indexing → ranking.
📘Concept
HTTP Status Codes
HTTP status codes are three-digit codes returned when a server responds to client requests. In SEO, codes such as 200 (OK), 301 (permanent redirect), 302 (temporary redirect), 404 (not found), 410 (gone), and 500 (server error) directly affect crawling, indexing, and PageRank transfer.
📘ConceptPillar
JavaScript SEO
JavaScript SEO is the technical SEO area of optimizing JavaScript-rendered web pages so search engines and AI bots recognize them correctly. The choice between SSR/SSG and CSR determines indexing feasibility.
📘ConceptPillar
Mobile-First Indexing
Mobile-first indexing is Google’s system for crawling, indexing, and ranking based on a site’s mobile version. With full rollout completed in 2024, it is now the default premise of SEO.
📘ConceptPillar
Rendering
Rendering is the process of processing HTML, CSS, and JavaScript to produce the final screen seen by users and bots. The choice among CSR, SSR, SSG, and ISR determines SEO and AEO feasibility.
📙How-to
How to Allow AI Bots in robots.txt
Allowing AI bots means explicitly permitting major AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot to access your site in robots.txt, exposing your content for citation in generative AI answers.
📘ConceptPillar
Site Architecture
Site architecture is the overall design of page hierarchy, URL structure, and internal linking on a website. It simultaneously determines crawl efficiency, indexing quality, and user navigation experience — a foundational SEO element.
📙How-to
Sitemap (XML Sitemap)
An XML sitemap is an XML file listing a website’s URLs along with last-modified dates, update frequency, and priority information. It helps search engine bots understand site structure and improves crawling efficiency and indexing speed as a technical SEO foundation tool.
