Crawling vs Indexing
Definition
Crawling is the process where search engine bots such as Googlebot and Naverbot (Yeti) follow links across the web and collect page HTML, CSS, and JavaScript.
Indexing is the process of analyzing pages collected through crawling, evaluating relevant keywords, structure, and quality, and storing them in a search database. Only indexed pages can appear in search results.
The two processes are separate. Being crawled does not necessarily mean being indexed, and being indexed does not guarantee search rankings.
Summary
Crawling vs indexing essentials: ①Crawling = bot visits and collects, indexing = stores in DB → ②Reasons crawled but not indexed: low quality, noindex, duplicate content, rendering failure → ③If crawling is blocked by robots.txt, noindex directives are not read → ④GSC 'Page indexing report' enables step-by-step diagnosis → ⑤JavaScript content may be crawled but indexing is delayed until after rendering.
SEO 3-Stage Framework
[DIAGRAM: SEO 3 stages — Crawling → Indexing → Ranking flow]
To appear in search results, a page must pass all three stages.
Stage 1: Crawling
- Bot discovers links and visits pages
- Downloads HTML, CSS, and JavaScript files
- Checks robots.txt, noindex, and server response codes
- Crawl failure causes: robots.txt block, server errors (5xx), access restrictions, crawl budget exhaustion
Stage 2: Indexing
- Analyzes collected content after JavaScript rendering
- Evaluates keywords, structure, links, and E-E-A-T quality
- Stores in search database
- Indexing rejection causes: low quality, noindex, duplicate content, rendering failure
Stage 3: Ranking
- Determines the order of indexed pages matching a query using many signals (often cited as 200+)
- Signals include E-E-A-T, backlinks, user signals, and technical quality
If any stage is blocked, later stages do not proceed. See Crawlability for details.
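The stage-by-stage gating above can be sketched as a small check that reports the first stage at which a page stops. This is an illustrative model only: the `PageFacts` fields and the `first_blocked_stage` function are hypothetical names, and real search engines weigh many more factors than these five flags.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageFacts:
    """Hypothetical observations about one URL (not a GSC data model)."""
    robots_allowed: bool    # robots.txt permits fetching the URL
    status_code: int        # HTTP status returned to the bot
    has_noindex: bool       # noindex found in a meta tag or X-Robots-Tag
    is_canonical: bool      # URL is the canonical chosen for its content
    renders_content: bool   # main content is present after rendering


def first_blocked_stage(page: PageFacts) -> Optional[str]:
    """Return the first stage that stops the page, or None if it reaches ranking."""
    # Stage 1: crawling
    if not page.robots_allowed:
        return "crawling: blocked by robots.txt"
    if page.status_code >= 500:
        return "crawling: server error"
    # Stage 2: indexing (only evaluated when crawling succeeded)
    if page.has_noindex:
        return "indexing: noindex"
    if not page.is_canonical:
        return "indexing: duplicate of another canonical URL"
    if not page.renders_content:
        return "indexing: rendering failure"
    return None
```

Note that a page with both a robots.txt block and a noindex tag reports the crawling block, because the later check is never reached.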
5 Reasons a Page Is Crawled but Not Indexed
1. Insufficient Content Quality
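If content is too thin, heavily duplicates other pages, or is judged to offer no user value, crawling may succeed but indexing can still be declined. Pages should meet Google's helpful-content criteria (the Helpful Content System, which Google folded into its core ranking systems in March 2024).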
2. noindex Directive
A page that carries <meta name="robots" content="noindex"/> or the HTTP header X-Robots-Tag: noindex can still be crawled, but it is excluded from the index. See noindex for details.
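As a rough way to check both locations, the sketch below looks for a `noindex` token in the robots meta tag and in the `X-Robots-Tag` header, using only the Python standard library. The function name `has_noindex` is hypothetical, and the check is deliberately simplified: it ignores bot-specific meta names (such as `googlebot`), user-agent prefixes inside the header, and the `none` directive.

```python
from html.parser import HTMLParser
from typing import Mapping


class _RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.extend(
                d.strip().lower() for d in attr.get("content", "").split(",")
            )


def has_noindex(html: str, headers: Mapping[str, str]) -> bool:
    """True if a plain noindex token appears in the meta tag or the header."""
    parser = _RobotsMetaParser()
    parser.feed(html)

    # Header names are case-insensitive, so compare in lowercase.
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    return "noindex" in parser.directives or "noindex" in header_directives
```

For example, `has_noindex('<meta name="robots" content="noindex"/>', {})` is true, as is a page with no meta tag served with `X-Robots-Tag: noindex, nofollow`.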
3. Duplicate Handling via Canonical
When identical content exists at multiple URLs, Google indexes one canonical URL and excludes the rest. See Canonical Tag for details.
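To see which URL a page declares as canonical, one option is to read its `<link rel="canonical">` element, as in the sketch below. The helper names are hypothetical. Keep in mind that the declared canonical is a hint: the search engine may select a different canonical URL than the one the page declares.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _CanonicalParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.href is None:
            attr = dict(attrs)
            rel_values = (attr.get("rel") or "").lower().split()
            if "canonical" in rel_values and attr.get("href"):
                self.href = attr["href"]


def declared_canonical(html: str, page_url: str) -> Optional[str]:
    """Return the declared canonical as an absolute URL, or None if absent."""
    parser = _CanonicalParser()
    parser.feed(html)
    if parser.href is None:
        return None
    return urljoin(page_url, parser.href)  # resolve relative hrefs


def points_elsewhere(html: str, page_url: str) -> bool:
    """True if the page declares a canonical different from its own URL."""
    canonical = declared_canonical(html, page_url)
    return canonical is not None and canonical != page_url
```

A parameter URL such as `https://example.com/shoes?color=red` that declares `/shoes` as canonical would return true from `points_elsewhere`, which is the expected, intentional kind of exclusion.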
4. JavaScript Rendering Failure
For SPAs and other client-side rendered pages, crawling (HTML collection) and rendering (JavaScript execution) are separate steps. If rendering fails, the page can look blank to the search engine and indexing may be declined. See JavaScript SEO for details.
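A quick first check for this risk is to measure how much visible text the raw HTML contains before any JavaScript runs. The sketch below is a heuristic under stated assumptions: the function names and the 200-character threshold are arbitrary choices for illustration, and a low count only suggests that content depends on rendering, not that rendering will fail.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside script, style, and similar non-content tags."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Number of characters of visible text in the unrendered HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def looks_client_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little text in raw HTML suggests JS-dependent content."""
    return visible_text_length(html) < min_chars
```

A typical SPA shell that contains only `<div id="root"></div>` plus script tags yields a length of zero, while a server-rendered article passes the threshold easily.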
5. Repeated Server Errors
When a page repeatedly returns 5xx errors, crawl bots may give up fetching it. Even if the page is fetched, it may not be indexed because the bot never received its normal content.
Diagnosing Crawling and Indexing with GSC
Google Search Console's "Page indexing report" shows crawling and indexing status step by step.
Path: GSC → Indexing → Pages
Main statuses
| GSC Status | Meaning |
|---|---|
| Indexed | Crawling + indexing complete |
| Crawled - currently not indexed | Crawling complete, indexing declined (often a quality or duplication issue) |
| Discovered - currently not indexed | Discovered but crawling not complete |
| Blocked by robots.txt | Crawling blocked |
| Excluded by noindex tag | noindex applied |
| Page with redirect | URL redirects (301/302); the redirect target is evaluated instead |
| Not found (404) | URL does not exist |
See Indexing Coverage for details.
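When working through an exported list of URLs and statuses, it can help to group each status by the stage where the URL stopped. The sketch below encodes the table above as a lookup. The stage labels, the `summarize` function, and the input shape (pairs of URL and status text) are assumptions made for illustration, not a GSC export format or API.

```python
from collections import Counter
from typing import Iterable, Tuple

# Status text as listed in the table above, mapped to the stage where the URL stopped.
# "excluded" is this sketch's own label for URLs that are not meant to be indexed as-is.
STATUS_STAGE = {
    "Indexed": "done",
    "Crawled - currently not indexed": "indexing",
    "Discovered - currently not indexed": "crawling",
    "Blocked by robots.txt": "crawling",
    "Excluded by noindex tag": "indexing",
    "Page with redirect": "excluded",
    "Not found (404)": "excluded",
}


def summarize(rows: Iterable[Tuple[str, str]]) -> Counter:
    """Count URLs per stage; statuses missing from the table count as 'unknown'."""
    return Counter(STATUS_STAGE.get(status, "unknown") for _url, status in rows)
```

A large "crawling" count points toward discovery, robots.txt, or server issues, while a large "indexing" count points toward content quality, noindex, or duplication.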
AI Bot Crawling and Indexing
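AI services such as ChatGPT and Perplexity run their own crawlers (GPTBot, PerplexityBot, etc.). What these bots do with collected pages is not the same as search indexing and differs by bot: GPTBot collects data for LLM training, while bots such as OAI-SearchBot and PerplexityBot collect pages so they can be retrieved and cited in answers. Google AI Overviews draws on Google's regular search index, so it depends on Googlebot rather than a separate AI bot.
For content to be cited in AI search answers, robots.txt must not block the relevant AI bots (crawling is allowed by default when no rule applies). In AEO (Answer Engine Optimization) strategy, allowing AI bot crawling is a prerequisite. See robots.txt and AI Bots for details.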
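To confirm how a robots.txt file treats AI bots such as GPTBot and PerplexityBot, the rules can be evaluated with Python's standard `urllib.robotparser`. This is a minimal sketch: `allowed_agents` is a hypothetical helper, the sample rules are invented for the example, and Python's parser may resolve overlapping Allow/Disallow rules differently from a given crawler.

```python
from urllib.robotparser import RobotFileParser


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report, per user agent, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Invented example rules: GPTBot is blocked site-wide, others only from /private/.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    report = allowed_agents(
        SAMPLE_ROBOTS,
        "https://example.com/guide",
        ["GPTBot", "PerplexityBot"],
    )
    for agent, allowed in report.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these sample rules, GPTBot is blocked from the guide page while PerplexityBot falls through to the wildcard group and is allowed.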
Korea Market Application
Naverbot Crawling and Indexing
Naver's search bot (Yeti) operates separately from Googlebot. Naver Search Advisor shows Naverbot crawling status and indexing errors. Registering a sitemap in Search Advisor can improve crawling efficiency.
Naverbot characteristics:
- Processes Naver Blog and Cafe content separately from its own crawling
- JavaScript rendering support is more limited than Googlebot's
- Bot visit records can be checked in Naver Search Advisor logs
Common Indexing Issues on Korean Sites
- JavaScript rendering: Indexing gaps are frequent in React apps that lack server-side rendering (for example, via Next.js)
- Login-required content: Authenticated content is inaccessible to Googlebot → not indexed
- IP-based blocking: Some security solutions block Googlebot IPs and cause crawl failure
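For the IP-blocking case, one way to tell a genuine Googlebot request from an impostor before adjusting firewall rules is the reverse-then-forward DNS check that Google documents. The sketch below assumes IPv4 and uses the standard `socket` module. The function name is hypothetical, and the lookup functions are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Domains Google documents for its crawlers' reverse DNS names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address to host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward DNS: host name to IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """True if the IP reverse-resolves to a Google domain and resolves back to itself."""
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must return the original IP, or the name is spoofed.
        return ip in forward_lookup(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Requests that pass this check can then be exempted from the security rule that was blocking them, while requests that merely claim a Googlebot user agent remain blocked.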
Frequently Asked Questions
Q. If a page blocked by robots.txt also has noindex added, does the noindex take effect?
A. No. If crawling is blocked by robots.txt, the bot cannot access the page and therefore cannot read the noindex directive. For noindex to work, crawling must be allowed. A blocked URL can even end up indexed without its content if other pages link to it. To block indexing only, remove the Disallow rule from robots.txt and rely on the noindex meta tag alone.
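This conflict can be detected mechanically. The sketch below, which assumes you already know whether the page carries noindex, uses the standard `urllib.robotparser` to flag URLs where a Disallow rule hides the directive. The function name and the returned messages are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def diagnose_noindex(
    robots_txt: str,
    url: str,
    page_has_noindex: bool,
    user_agent: str = "Googlebot",
) -> str:
    """Describe whether a noindex directive on the page can actually be read."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)

    if page_has_noindex and not crawlable:
        return "conflict: robots.txt blocks crawling, so noindex cannot be read"
    if page_has_noindex:
        return "ok: page is crawlable and noindex can be read"
    if not crawlable:
        return "blocked: crawling is disallowed and no noindex is set"
    return "ok: page is crawlable and indexable"
```

For a robots.txt containing `Disallow: /members/`, a noindex page under `/members/` reports the conflict, which signals that the Disallow rule should be removed.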
Q. Does registering a URL in a sitemap make crawling faster?
A. A sitemap is a hint that gives Google a list of URLs; it does not guarantee crawl speed. However, orphan pages without internal links are hard to discover without a sitemap. On large sites, or for new URLs, submitting a sitemap can help pages get discovered sooner. See Sitemap for details.
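For reference, a minimal sitemap in the sitemaps.org format can be generated with the standard library, as sketched below. The `build_sitemap` name and the example URLs are placeholders, and only the `loc` and `lastmod` elements are emitted.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[tuple[str, str]]) -> str:
    """Build sitemap XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc          # special characters are escaped
        ET.SubElement(url, "lastmod").text = lastmod  # e.g. YYYY-MM-DD
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/", "2024-05-01"),
        ("https://example.com/orphan-page", "2024-05-03"),
    ]))
```

Listing orphan pages here gives the crawler a discovery path that internal links do not provide.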
Q. If crawling succeeded but indexing did not, what should I check first?
A. Check the "indexing error reason" for that URL in GSC. "Crawled - currently not indexed" is usually caused by low quality, duplicate content, or a canonical issue. Review content length and uniqueness, and check for duplication with other pages.
Q. When will Google recrawl a page after I edit it?
A. Google decides on its own schedule, so timing cannot be guaranteed. For faster recrawling, use GSC "URL Inspection → Request indexing." Important pages (home, categories, etc.) are crawled more often; reindexing typically occurs within days to weeks after edits.
Q. Is it a problem if indexed page count is much lower than actual page count?
A. Not necessarily. Many pages may be intentionally excluded: parameter pages, noindex pages, internal-only pages, etc. However, if core content pages are not indexed, that is a problem. Analyze GSC "Indexed" count and "Not indexed" reasons by category to check for missing core pages.
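The core-page check described here amounts to comparing a list of must-index URLs against what the report shows. The sketch below assumes three inputs you would prepare yourself: a list of core URLs, a set of indexed URLs, and a mapping of not-indexed URLs to their reported reason. The function name and data shapes are hypothetical.

```python
from typing import Dict, Iterable, Set


def audit_core_pages(
    core_urls: Iterable[str],
    indexed: Set[str],
    not_indexed_reasons: Dict[str, str],
) -> Dict[str, str]:
    """Return each core URL that is not indexed, with its reported reason."""
    report: Dict[str, str] = {}
    for url in core_urls:
        if url in indexed:
            continue
        # A core URL absent from both lists may not have been discovered at all.
        report[url] = not_indexed_reasons.get(url, "not present in the report")
    return report
```

An empty result means every core page is indexed, however low the overall indexed count is. Any entry that does appear deserves attention regardless of how many parameter or noindex pages are excluded on purpose.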
Related Sources
- Google Search Central (2024). How Google Search Works: Crawling, Indexing, and Ranking. Google Developers.
- Google Search Central (2024). Page indexing report. Google Search Console Help.
- John Mueller, Google (2023). Crawling vs. Indexing: What you need to know. Google Search Central Blog.