📘Concept

Crawling vs Indexing

Definition

Crawling is the process where search engine bots such as Googlebot and Naverbot (Yeti) follow links across the web and collect page HTML, CSS, and JavaScript.

Indexing is the process of analyzing pages collected through crawling, evaluating relevant keywords, structure, and quality, and storing them in a search database. Only indexed pages can appear in search results.

The two processes are separate. Being crawled does not necessarily mean being indexed, and being indexed does not guarantee a good position in search results.


Summary

Crawling vs indexing essentials:

  ① Crawling = a bot visits and collects pages; indexing = the pages are stored in the search database.
  ② Reasons a page is crawled but not indexed: low quality, noindex, duplicate content, rendering failure.
  ③ If crawling is blocked by robots.txt, noindex directives are never read.
  ④ The GSC "Page indexing" report enables step-by-step diagnosis.
  ⑤ JavaScript content may be crawled, but indexing is delayed until after rendering.


SEO 3-Stage Framework

[DIAGRAM: SEO 3 stages — Crawling → Indexing → Ranking flow]

To appear in search results, a page must pass all three stages.

Stage 1: Crawling

  • Bot discovers links and visits pages
  • Downloads HTML, CSS, and JavaScript files
  • Checks robots.txt, noindex, and server response codes
  • Crawl failure causes: robots.txt block, server errors (5xx), access restrictions, crawl budget exhaustion

Stage 2: Indexing

  • Analyzes collected content after JavaScript rendering
  • Evaluates keywords, structure, links, and E-E-A-T quality
  • Stores in search database
  • Indexing rejection causes: low quality, noindex, duplicate content, rendering failure

Stage 3: Ranking

  • Orders the indexed pages that match a query using many signals (a figure of 200+ is commonly cited)
  • Signals include E-E-A-T, backlinks, user signals, and technical quality

If any stage is blocked, later stages do not proceed. See Crawlability for details.


5 Reasons a Page Is Crawled but Not Indexed

1. Insufficient Content Quality

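If content is too thin, heavily duplicated with other pages, or judged to provide no user value, the page may be crawled successfully and still be left out of the index. Pages need to meet Google's helpful-content criteria (the Helpful Content System, which Google folded into its core ranking systems in 2024).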

2. noindex Directive

When a page carries <meta name="robots" content="noindex"/> or the HTTP response header X-Robots-Tag: noindex, it can still be crawled but is excluded from the index. See noindex for details.
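
As a rough illustration, both noindex sources can be checked from a fetched response. This is a minimal sketch, not how any search engine implements it: the function name has_noindex and the plain dict of headers are assumptions made for the example, and user-agent-scoped header values (such as "googlebot: noindex") are not handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html: str, headers: dict) -> bool:
    """Return True if the meta robots tag or X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()

    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    return "noindex" in parser.directives or "noindex" in header_directives
```

A bot can only run a check like this on a page it was allowed to fetch, which is why noindex requires crawling to be permitted.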

3. Duplicate Handling via Canonical

When identical content exists at multiple URLs, Google indexes one canonical URL and excludes the rest. See Canonical Tag for details.
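
To see which URL a page declares as canonical, the link element can be read from the HTML. The sketch below is illustrative only: canonical_points_elsewhere is a hypothetical helper, the URL comparison is a simple string match that ignores only a trailing slash, and search engines treat the declared canonical as a hint rather than a command.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def canonical_points_elsewhere(page_url: str, html: str) -> bool:
    """True if the page declares a canonical URL different from its own URL."""
    parser = CanonicalParser()
    parser.feed(html)
    parser.close()
    if parser.canonical is None:
        return False  # nothing declared, so nothing points away from this URL
    return parser.canonical.rstrip("/") != page_url.rstrip("/")
```

For example, a parameterized URL such as https://example.com/shoes?color=red that declares https://example.com/shoes as canonical returns True, which matches the case where only the canonical URL is expected in the index.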

4. JavaScript Rendering Failure

For SPAs and other client-side rendered pages, crawling (HTML collection) and rendering (JavaScript execution) happen as separate steps. If rendering fails, the page can look blank to the bot and may not be indexed. See JavaScript SEO for details.
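
One quick way to spot this risk is to look at how much visible text the raw, unrendered HTML contains. The following is a heuristic sketch under stated assumptions: looks_like_empty_shell and the 200-character threshold are made up for illustration, and a low count only suggests that the content depends on JavaScript, not that indexing will fail.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> elements."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(raw_html: str, min_chars: int = 200) -> bool:
    """True if the unrendered HTML has less visible text than min_chars."""
    parser = VisibleTextParser()
    parser.feed(raw_html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

A typical client-side rendered page that ships only a root div and a script bundle would be flagged, while a server-rendered page with its article text in the initial HTML would not.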

5. Repeated Server Errors

When a page repeatedly returns 5xx errors, bots may give up crawling it. Even if a response is collected, the page may not be indexed because its normal content was unavailable.


Diagnosing Crawling and Indexing with GSC

Google Search Console’s "Page Indexing Report" shows crawling and indexing status step by step.

Path: GSC → Indexing → Pages

Main status codes

| GSC Status | Meaning |
| --- | --- |
| Indexed | Crawling + indexing complete |
| Crawled - currently not indexed | Crawling complete, indexing rejected (quality issue) |
| Discovered - currently not indexed | Discovered but crawling not complete |
| Blocked by robots.txt | Crawling blocked |
| Excluded by noindex tag | noindex applied |
| Page with redirect | 301/302 redirect handled |
| Not found (404) | URL does not exist |
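
When reviewing many URLs at once, it can help to group statuses by the stage at which a page is stuck. The sketch below assumes rows of (URL, status) pairs, for example copied from a report export. The status labels are taken from the list above and may not match the exact strings in a real export, and the stage names are illustrative.

```python
from collections import Counter

# Stage at which a page is stuck, keyed by the status labels listed above.
STATUS_TO_STAGE = {
    "Indexed": "complete",
    "Crawled - currently not indexed": "indexing",
    "Discovered - currently not indexed": "crawling",
    "Blocked by robots.txt": "crawling",
    "Excluded by noindex tag": "indexing",
}


def stuck_stage(status: str) -> str:
    """Map a status label to a stage; unknown or non-stage statuses give 'other'."""
    return STATUS_TO_STAGE.get(status, "other")


def summarize(rows) -> Counter:
    """Count how many URLs are stuck at each stage."""
    return Counter(stuck_stage(status) for _url, status in rows)
```

A summary dominated by "crawling" points toward robots.txt, server, or discovery problems, while one dominated by "indexing" points toward quality, noindex, or duplication.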

See Indexing Coverage for details.


AI Bot Crawling and Indexing

AI bots behind services such as ChatGPT, Perplexity, and Google AI Overviews (GPTBot, PerplexityBot, etc.) also crawl pages. However, what happens after crawling differs from search indexing: depending on the bot, collected content may be used as LLM training data or for answer retrieval rather than stored in a conventional search database.

For content to be cited in AI search answers, AI bots must not be blocked in robots.txt. In an AEO (Answer Engine Optimization) strategy, allowing AI bot crawling is a prerequisite. See robots.txt and AI Bots for details.
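
Whether a given AI bot is allowed can be checked locally with Python's standard urllib.robotparser. In this sketch the robots.txt content, the example URL, and the helper name allowed_bots are all hypothetical. The parser follows the generic robots.txt rules and may differ in edge cases from how each bot interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot allowed, PerplexityBot blocked,
# every other bot blocked only from /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_bots(robots_txt: str, bots: list, url: str) -> dict:
    """Return {bot name: True/False} for whether each bot may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = allowed_bots(
    ROBOTS_TXT,
    ["GPTBot", "PerplexityBot", "ClaudeBot"],
    "https://example.com/guide/crawling",
)
```

Here ClaudeBot has no group of its own, so it falls back to the wildcard group and is allowed for any path outside /private/.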


Korea Market Application

Naverbot Crawling and Indexing

Naver's search bot (Yeti) operates separately from Googlebot. Naver Search Advisor shows Naverbot crawling status and indexing errors. Submitting a sitemap in Naver Search Advisor improves crawling efficiency.

Naverbot characteristics:

  • Handles Naver Blog and Cafe content separately from its regular web crawling
  • JavaScript rendering support is more limited than Googlebot's
  • Bot visit records can be checked in Naver Search Advisor logs

Common Indexing Issues on Korean Sites

  • JavaScript rendering: Indexing gaps are frequent in React apps that lack server-side rendering (for example, via Next.js)
  • Login-required content: Authenticated content is inaccessible to Googlebot → not indexed
  • IP-based blocking: Some security solutions block Googlebot IPs and cause crawl failure
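
For the IP-based blocking case, a security rule can verify a claimed Googlebot address before deciding to block it. The sketch below follows the commonly documented reverse-then-forward DNS check. The hostname suffixes are an assumption to confirm against Google's current documentation, and the injectable lookup parameters are an illustrative design so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers; verify before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP to rule out spoofed PTR records.
        return ip in forward_lookup(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Called with only an IP address, the function performs real DNS lookups. Passing stub lookups makes it testable offline.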

Frequently Asked Questions

Q. If noindex is added to a page that is blocked by robots.txt, does it take effect?
A. No. If crawling is blocked by robots.txt, the bot cannot access the page and therefore cannot read its noindex directive. For noindex to work, crawling must be allowed. To block indexing only, remove the Disallow rule from robots.txt and rely on the noindex meta tag alone.
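
The order of checks in this answer can be sketched with Python's standard urllib.robotparser. The function name index_outcome, the outcome strings, and the example rules are hypothetical. The point is only that the robots.txt check runs first, so a blocked page never reaches the noindex check.

```python
from urllib.robotparser import RobotFileParser


def index_outcome(robots_txt: str, url: str, page_has_noindex: bool,
                  user_agent: str = "Googlebot") -> str:
    """Apply robots.txt first; noindex only matters if the page can be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return "blocked by robots.txt: noindex is never read"
    if page_has_noindex:
        return "crawled: excluded from the index by noindex"
    return "crawled: eligible for indexing"


blocking_rules = "User-agent: *\nDisallow: /private/\n"
open_rules = "User-agent: *\nAllow: /\n"
page = "https://example.com/private/report"
```

With blocking_rules the noindex flag has no influence on the result. Switching to open_rules is what lets the directive take effect.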

Q. Does registering a URL in a sitemap make crawling faster?
A. A sitemap is a hint that gives Google a list of URLs; it does not guarantee crawl speed. However, orphan pages without internal links are hard to discover without a sitemap. For large sites or new URLs, submitting a sitemap can help Google discover pages sooner. See Sitemap for details.
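
A minimal sitemap can be generated with Python's standard xml.etree.ElementTree. This is a sketch assuming a list of (URL, last-modified date) pairs. The helper name build_sitemap and the example URLs are hypothetical, and only the loc and lastmod elements of the sitemap protocol are written.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries) -> bytes:
    """Build sitemap XML from (loc, lastmod) pairs; lastmod uses YYYY-MM-DD."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    # Returns UTF-8 bytes, including the XML declaration.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap_xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/guide/crawling", "2024-02-01"),
])
```

The resulting bytes can be written to a sitemap.xml file and submitted through the search engine's webmaster tool.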

Q. If crawling succeeded but indexing did not, what should I check first?
A. Check the "indexing error reason" for that URL in GSC. "Crawled - currently not indexed" usually points to low quality, duplicate content, or a canonical issue. Review content length and uniqueness, and check for duplication with other pages.

Q. When will Google recrawl a page after I edit it?
A. Google decides on its own schedule, so timing cannot be guaranteed. For faster recrawling, use GSC "URL Inspection → Request indexing." Important pages (home, categories, etc.) are crawled more often; reindexing typically occurs within days to weeks after edits.

Q. Is it a problem if indexed page count is much lower than actual page count?
A. Not necessarily. Many pages may be intentionally excluded: parameter pages, noindex pages, internal-only pages, etc. However, if core content pages are not indexed, that is a problem. Analyze GSC "Indexed" count and "Not indexed" reasons by category to check for missing core pages.
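
Checking for missing core pages amounts to comparing two URL lists. The sketch below assumes you maintain a list of core URLs yourself and have a list of indexed URLs (for example, exported from the report). The helper name is hypothetical and the only normalization applied is ignoring a trailing slash.

```python
def missing_core_pages(core_urls, indexed_urls) -> list:
    """Return core URLs that do not appear among the indexed URLs."""
    indexed = {url.rstrip("/") for url in indexed_urls}
    return sorted(url for url in set(core_urls) if url.rstrip("/") not in indexed)


core = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/guide",
]
indexed = ["https://example.com", "https://example.com/guide/"]
missing = missing_core_pages(core, indexed)
```

Any URL returned here is a candidate for URL Inspection, since intentionally excluded pages should not be on the core list in the first place.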


Related Entries

📘Concept
Crawl Budget
Crawl budget is the number of pages Googlebot can and wants to crawl on your site within a given period — relevant for large sites where crawl allocation affects indexing speed and coverage.
📘Concept
Google Search Console
Google Search Console (GSC) is a free tool from Google for monitoring site search performance, diagnosing indexing issues, and submitting sitemaps — the essential foundation for SEO measurement.
📙How-to
Indexing Coverage Diagnosis
Indexing coverage diagnosis uses the GSC indexing report to check overall site indexing status, identify causes of unindexed pages, and fix them — a core SEO task.
📘Concept · Pillar
What Is AEO?
AEO is the practice of optimizing content so AI answer engines cite it.
📘Concept · Pillar
Canonical Tag
A canonical tag is an HTML link element (rel="canonical") that tells search engines 'this URL is the representative version' when duplicate or similar content exists across multiple URLs. It resolves duplicate content problems and concentrates PageRank on the canonical URL, making it a core on-page SEO tool.
📘Concept
Noindex
noindex is an on-page indexing control directive that tells search engine bots, via robots meta tags or HTTP headers, not to include a page in search results. It excludes pages that do not need or should not appear in search from the index, saving crawl budget and improving site quality signals.
📘Concept
Crawl Depth
Crawl depth (click depth) is the number of clicks required to reach a page from the homepage. It is a core site structure metric that determines page discovery priority for search engine and AI bots and PageRank transfer efficiency.
📘Concept · Pillar
Crawlability
Crawlability is the ability of search engine and AI bots to access website pages and read content. It is the most basic condition for SEO and AEO, a required step that precedes indexing and ranking.
📘Concept · Pillar
JavaScript SEO
JavaScript SEO is the technical SEO area of optimizing JavaScript-rendered web pages so search engines and AI bots recognize them correctly. The choice between SSR/SSG and CSR determines indexing feasibility.
📘Concept · Pillar
Rendering
Rendering is the process of processing HTML, CSS, and JavaScript to produce the final screen seen by users and bots. The choice among CSR, SSR, SSG, and ISR determines SEO and AEO feasibility.
📙How-to
How to Allow AI Bots in robots.txt
Allowing AI bots means explicitly permitting major AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot to access your site in robots.txt, exposing your content for citation in generative AI answers.
📙How-to
Sitemap (XML Sitemap)
An XML sitemap is an XML file listing a website’s URLs along with last-modified dates, update frequency, and priority information. It helps search engine bots understand site structure and improves crawling efficiency and indexing speed as a technical SEO foundation tool.
