Crawling vs Indexing

Definition

Crawling is the process where search engine bots such as Googlebot and Naverbot (Yeti) follow links across the web and collect page HTML, CSS, and JavaScript.

Indexing is the process of analyzing pages collected through crawling, evaluating relevant keywords, structure, and quality, and storing them in a search database. Only indexed pages can appear in search results.

The two processes are separate. Being crawled does not necessarily mean being indexed, and being indexed does not guarantee search rankings.

Summary

Crawling vs indexing essentials:
① Crawling = bot visits and collects; indexing = stores in DB
② Reasons a page is crawled but not indexed: low quality, noindex, duplicate content, rendering failure
③ If crawling is blocked by robots.txt, noindex directives are not read
④ The GSC "Page indexing" report enables step-by-step diagnosis
⑤ JavaScript content may be crawled, but its indexing is delayed until after rendering

SEO 3-Stage Framework

[DIAGRAM: SEO 3 stages — Crawling → Indexing → Ranking flow]

To appear in search results, a page must pass all three stages.

Stage 1: Crawling

Bot discovers links and visits pages
Downloads HTML, CSS, and JavaScript files
Checks robots.txt and server response codes, and reads any noindex directive (which takes effect at the indexing stage)
Crawl failure causes: robots.txt block, server errors (5xx), access restrictions, crawl budget exhaustion
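
The robots.txt check in this stage can be reproduced locally. The sketch below uses Python's standard urllib.robotparser module. The robots.txt rules and the example.com URLs are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Yeti
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Yeti"):
        for path in ("/", "/admin/", "/private/"):
            url = f"https://example.com{path}"
            print(agent, path, can_crawl(ROBOTS_TXT, agent, url))
```

Note that a bot with its own User-agent group (Yeti here) follows that group rather than the wildcard group, so a rule written only under * does not apply to it.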

Stage 2: Indexing

Analyzes collected content after JavaScript rendering
Evaluates keywords, structure, links, and E-E-A-T quality
Stores in search database
Indexing rejection causes: low quality, noindex, duplicate content, rendering failure

Stage 3: Ranking

Determines the order of indexed pages that match a query, using a large number of signals (often cited as 200+)
Signals include E-E-A-T, backlinks, user signals, and technical quality

If any stage is blocked, later stages do not proceed. See Crawlability for details.
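
The gating behaviour can be pictured as a short-circuiting check. This is a toy model for reasoning about the framework, not a description of how any search engine is implemented. The function name, its parameters, and the result strings are all hypothetical.

```python
def furthest_stage(
    robots_allowed: bool = True,
    status_code: int = 200,
    noindex: bool = False,
    rendered_ok: bool = True,
) -> str:
    """Return the furthest stage a page reaches and why it stopped there.

    All parameters are simplified stand-ins for real signals.
    """
    # Stage 1: crawling. A block here means nothing later is evaluated.
    if not robots_allowed:
        return "not crawled: blocked by robots.txt"
    if status_code >= 500:
        return "not crawled: server error"
    # Stage 2: indexing. Only reachable if the page was crawled.
    if noindex:
        return "crawled, not indexed: noindex"
    if not rendered_ok:
        return "crawled, not indexed: rendering failure"
    # Stage 3: ranking is only possible for indexed pages.
    return "indexed: eligible for ranking"
```

In this model a page with both a robots.txt block and a noindex directive is reported as blocked, because the noindex check is never reached.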

5 Reasons a Page Is Crawled but Not Indexed

1. Insufficient Content Quality

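If content is too thin, largely duplicates other pages, or is judged to provide no user value, crawling may succeed but the page may not be indexed. Pages should meet Google's helpful-content criteria (the Helpful Content System, which was folded into Google's core ranking systems in March 2024).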

2. noindex Directive

A page that carries <meta name="robots" content="noindex"/> or the HTTP response header X-Robots-Tag: noindex can still be crawled but is excluded from the index. See noindex for details.
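
A quick way to confirm whether a page carries either form of the directive is to inspect the HTML and the response headers together. The following is a minimal sketch using only the standard library. It assumes the HTML and headers have already been fetched, and it ignores bot-specific forms such as a meta tag named googlebot or a header value prefixed with a user agent.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str, headers: dict[str, str]) -> bool:
    """Return True if the meta robots tag or X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Header names are case-insensitive, so compare in lowercase.
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    header_directives = {p.strip().lower() for p in header_value.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives
```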

3. Duplicate Handling via Canonical

When identical content exists at multiple URLs, Google indexes one canonical URL and excludes the rest. See Canonical Tag for details.
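
To see which URL a page declares as canonical, the link tag can be read directly from the HTML. The sketch below is an assumption-laden helper, not an official tool: canonical_status and its three return values are invented for illustration, and the comparison only normalizes a trailing slash. Keep in mind that rel="canonical" is a hint, so Google may still select a different canonical URL.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None


def canonical_status(page_url: str, html: str) -> str:
    """Classify a page as 'missing', 'self', or 'points elsewhere'."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return "missing"
    # Simplified comparison: only a trailing slash is normalized.
    same = parser.canonical.rstrip("/") == page_url.rstrip("/")
    return "self" if same else "points elsewhere"
```

A page reported as "points elsewhere" is one that asks to be excluded in favor of another URL, which is expected for parameter variants and worth checking for core pages.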

4. JavaScript Rendering Failure

For SPAs and other client-side rendered pages, crawling (HTML collection) and rendering (JavaScript execution) happen as separate steps. If rendering fails, the bot may see a blank page, and the page may not be indexed. See JavaScript SEO for details.
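
One rough way to spot pages that depend on rendering is to measure how much text exists in the raw HTML before any JavaScript runs. The sketch below is a heuristic only. The 200-character threshold and the function names are arbitrary assumptions, and a real audit should compare against the rendered output shown in GSC URL Inspection.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that is present in the raw HTML, skipping non-content tags."""

    SKIPPED = {"script", "style", "noscript", "template", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def raw_text_length(html: str) -> int:
    """Number of characters of text available without executing JavaScript."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def likely_needs_rendering(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little raw text suggests content is injected client-side."""
    return raw_text_length(html) < min_chars
```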

5. Repeated Server Errors

When a page repeatedly returns 5xx errors, crawlers may slow down or stop fetching it, and anything they do collect may not be indexed because the normal content was unavailable.
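
Server access logs are the most direct place to find repeated 5xx responses served to crawlers. The sketch below assumes logs in the common "combined" access log layout with the user agent as the last quoted field. The sample lines, the threshold of 2, and the function name are illustrative assumptions. Matching on the user-agent string alone does not prove a request came from the real bot.

```python
import re
from collections import Counter

# Assumed layout: ... "METHOD /path HTTP/x" STATUS BYTES "referer" "user agent"
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)

# Made-up sample lines for illustration.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Jan/2025:10:00:00 +0900] "GET /products/1 HTTP/1.1" 503 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Jan/2025:11:00:00 +0900] "GET /products/1 HTTP/1.1" 503 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Jan/2025:11:05:00 +0900] "GET /about HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Jan/2025:11:06:00 +0900] "GET /cart HTTP/1.1" 500 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]


def repeated_bot_errors(lines, bot_token="Googlebot", threshold=2):
    """Return {path: count} for paths that served 5xx to the bot at least threshold times."""
    errors = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line.strip())
        if not match:
            continue
        if bot_token in match["agent"] and match["status"].startswith("5"):
            errors[match["path"]] += 1
    return {path: count for path, count in errors.items() if count >= threshold}
```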

Diagnosing Crawling and Indexing with GSC

Google Search Console's "Page indexing" report shows crawling and indexing status step by step.

Path: GSC → Indexing → Pages

Main statuses

GSC Status	Meaning
Indexed	Crawling + indexing complete
Crawled - currently not indexed	Crawling complete, but not indexed (often a quality or duplication issue)
Discovered - currently not indexed	Discovered but crawling not complete
Blocked by robots.txt	Crawling blocked
Excluded by noindex tag	noindex applied
Page with redirect	URL redirects (301/302) to another URL, so it is not indexed itself
Not found (404)	URL does not exist

See Indexing Coverage for details.

AI Bot Crawling and Indexing

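AI services such as ChatGPT and Perplexity run their own crawlers (GPTBot, PerplexityBot, etc.). What these bots collect is not always a search index: GPTBot, for example, gathers data for LLM training, while other bots fetch pages so that an AI search product can retrieve and cite them. Google AI Overviews are part of Google Search and rely on Googlebot's crawling and indexing rather than a separate bot.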

For content to be cited in AI search answers, the relevant AI bots must not be blocked in robots.txt. In AEO (Answer Engine Optimization) strategy, allowing AI bot crawling is a prerequisite. See robots.txt and AI Bots for details.
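
A robots.txt file can be audited for AI bot access in the same way as for search bots. The sketch below checks the two bot tokens named above against a hypothetical robots.txt using the standard urllib.robotparser module. The rules and URL are made up, and the token list is not exhaustive.

```python
from urllib.robotparser import RobotFileParser

# Tokens named in the text above; other AI crawlers use other tokens.
AI_BOT_TOKENS = ("GPTBot", "PerplexityBot")

# Hypothetical robots.txt that blocks one AI bot and allows everything else.
ROBOTS_TXT_AI = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_ai_bots(robots_txt: str, url: str, tokens=AI_BOT_TOKENS) -> list[str]:
    """Return the AI bot tokens that the given robots.txt blocks from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, url)]
```

An empty result means none of the listed bots is blocked for that URL by robots.txt. It says nothing about blocking at the firewall or CDN level.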

Korea Market Application

Naverbot Crawling and Indexing

Naver's search bot (Yeti) operates separately from Googlebot. Naver Search Advisor shows Naverbot crawling status and indexing errors. Submitting a sitemap in Search Advisor can improve crawling efficiency.

Naverbot characteristics:

Naver Blog and Cafe content is handled separately from the bot's own web crawling
JavaScript rendering support is more limited than Googlebot's
Bot visit records can be checked in Naver Search Advisor logs

Common Indexing Issues on Korean Sites

JavaScript rendering: Indexing gaps are frequent in client-side rendered React apps that lack server-side rendering (for example, via Next.js)
Login-required content: Content behind authentication is inaccessible to Googlebot and therefore not indexed
IP-based blocking: Some security solutions block Googlebot's IP addresses and cause crawl failures
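
For the IP-based blocking case, a security layer can verify genuine Googlebot traffic instead of blocking it. Google documents a reverse DNS lookup followed by a forward lookup for this purpose. The sketch below follows that approach with the standard socket module. The function names are hypothetical, the forward lookup here covers IPv4 only, and production code would also want caching and timeouts.

```python
import socket

# Domains Google documents for its crawler hostnames.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_google_crawler_hostname(hostname: str) -> bool:
    """Return True if hostname is a subdomain of a documented Google crawler domain."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES)


def verify_googlebot_ip(ip: str) -> bool:
    """Reverse DNS lookup, suffix check, then forward lookup back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not is_google_crawler_hostname(hostname):
            return False
        # gethostbyname_ex resolves IPv4 addresses only.
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        # Lookup failures are treated as "not verified".
        return False
    return ip in addresses
```

The suffix check requires a leading dot so that look-alike hosts such as fakegooglebot.com do not pass.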

Frequently Asked Questions

Q. If a page blocked by robots.txt also has noindex added, does the noindex take effect?
A. No. If crawling is blocked by robots.txt, the bot cannot access the page and cannot read its noindex directive. A blocked URL can even end up indexed without its content if other pages link to it. For noindex to work, crawling must be allowed. To keep a page out of the index, remove the Disallow rule from robots.txt and rely on the noindex meta tag alone.

Q. Does registering a URL in a sitemap make crawling faster?
A. Not necessarily. A sitemap is a hint that gives Google a list of URLs; it does not guarantee faster crawling. It does help discovery: orphan pages with no internal links are hard to find without one, and on large sites or for new URLs, submitting a sitemap can help pages get discovered sooner. See Sitemap for details.
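
For reference, a sitemap is a small XML file in the sitemaps.org format. The sketch below builds one with Python's standard xml.etree.ElementTree module. The function name, URLs, and dates are placeholders, and the lastmod value is assumed to be supplied as a YYYY-MM-DD string.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries) -> str:
    """Build sitemap XML from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/", "2025-01-10"),
        ("https://example.com/orphan-page", None),
    ]))
```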

Q. If crawling succeeded but indexing did not, what should I check first?
A. Check the reason GSC reports for that URL (via URL Inspection or the Page indexing report). "Crawled - currently not indexed" usually points to low quality, duplicate content, or a canonical issue. Review content length and uniqueness, and check for duplication with other pages.

Q. When will Google recrawl a page after I edit it?
A. Google decides on its own schedule, so timing cannot be guaranteed. For faster recrawling, use GSC "URL Inspection → Request indexing." Important pages (home, categories, etc.) are crawled more often; reindexing typically occurs within days to weeks after edits.

Q. Is it a problem if indexed page count is much lower than actual page count?
A. Not necessarily. Many pages may be intentionally excluded: parameter pages, noindex pages, internal-only pages, etc. However, if core content pages are not indexed, that is a problem. Analyze GSC "Indexed" count and "Not indexed" reasons by category to check for missing core pages.
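
The per-category analysis can be done on exported data. The sketch below assumes a list of (URL, status) pairs, for example taken from a GSC export, and groups them by the first path segment. The function name and the grouping rule are illustrative choices, and real sites may need a different definition of "category".

```python
from collections import defaultdict
from urllib.parse import urlparse


def coverage_by_section(rows):
    """Return {section: (indexed_count, total_count)} from (url, status) pairs."""
    totals = defaultdict(lambda: [0, 0])
    for url, status in rows:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # Use the first path segment as the section; the home page is "/".
        section = "/" + segments[0] if segments else "/"
        totals[section][1] += 1
        if status == "Indexed":
            totals[section][0] += 1
    return {section: tuple(counts) for section, counts in totals.items()}
```

A section whose indexed count is far below its total, and which holds core content, is the one to investigate first.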
