📘Concept

Crawl Budget

Definition

Crawl budget is the number of pages Googlebot can and wants to crawl on your site within a given time frame. It combines crawl capacity (Google's limit) and crawl demand (Google's interest in your site).


Summary

Crawl budget matters mainly for large sites. Reduce low-value URLs, fix errors, and strengthen internal links to important pages. The Google Search Console (GSC) crawl stats report shows crawl patterns. Most sites under 100K URLs should prioritize content quality and indexing over crawl budget optimization.


When crawl budget matters

Google guidance

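Google's crawl budget documentation is aimed at very large sites (roughly a million or more pages) and at medium-to-large sites (roughly 10,000 or more pages) whose content changes daily. As a working threshold, crawl budget is rarely a concern for sites with fewer than ~100,000 URLs that update at most once per day. Most sites should focus on: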

  • Content quality
  • Indexing coverage
  • Technical health

When to optimize crawl budget

  • Large sites: 100,000+ URLs, e-commerce, news, UGC
  • Frequent updates: Daily or hourly new content
  • Indexing delays: Important pages stuck in "Discovered — currently not indexed"
  • Crawl waste: Many duplicate, thin, or low-value URLs consuming crawl
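The checklist above can be sketched as a small helper. This is only an illustration: the `SiteProfile` fields are hypothetical, the 100,000-URL and once-per-day thresholds come from this page, and the 30% crawl-waste cutoff is an assumption rather than a Google figure.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    url_count: int                # total crawlable URLs
    updates_per_day: float        # how often new content is published
    stuck_discovered_pages: int   # important pages discovered but not indexed
    low_value_crawl_share: float  # share of Googlebot hits on low-value URLs (0..1)


def crawl_budget_signals(site: SiteProfile) -> list[str]:
    """Return the reasons, if any, that crawl budget deserves attention."""
    signals = []
    if site.url_count >= 100_000:
        signals.append("large site")
    if site.updates_per_day > 1:
        signals.append("frequent updates")
    if site.stuck_discovered_pages > 0:
        signals.append("indexing delays")
    if site.low_value_crawl_share >= 0.30:  # assumed cutoff, tune per site
        signals.append("crawl waste")
    return signals
```

An empty result suggests the site is better served by work on content quality and indexing coverage.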

Crawl budget components

Crawl capacity

Google's limit on how many URLs it can crawl without overloading your server. Affected by:

  • Server response time and stability
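  • 5xx and 429 responses (sustained errors reduce crawl)
  • Throttling signals from your server: the GSC crawl rate setting was retired in early 2024 and Googlebot ignores crawl-delay in robots.txt, so temporary 503/429 responses are the remaining way to slow crawling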

Crawl demand

Google's interest in crawling your site. Affected by:

  • Site popularity and authority
  • Content freshness and update frequency
  • Indexing value of URLs

Common crawl budget waste

| Waste type | Example | Fix |
|---|---|---|
| Duplicate URLs | Parameter variants, www/non-www | Canonical, parameter handling |
| Thin/low-value pages | Tag pages, search results, filters | noindex, consolidate, reduce |
| Redirect chains | Multiple hops before final URL | Direct 301 redirects |
| Soft 404s | Empty pages returning 200 | Fix content or 404/301 |
| Blocked resources | CSS/JS blocked unnecessarily | Allow critical resources in robots.txt |
| Infinite spaces | Calendar pagination, session IDs | noindex, limit pagination |
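To find duplicate URL variants in a crawl export, one option is to normalize each URL and group the ones that collapse to the same key. The sketch below assumes that www and non-www serve the same content, that a trailing slash is insignificant, and that the parameters in `IGNORED_PARAMS` (a hypothetical list) never change what the page shows. Adjust all three to your site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sort"}


def normalize(url: str) -> str:
    """Collapse host, trailing-slash, and ignored-parameter variants into one key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in IGNORED_PARAMS]
    kept.sort()  # parameter order should not create a new variant
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


def duplicate_groups(urls):
    """Return {normalized_url: [variants]} for keys with more than one variant."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Each group is a candidate for a single canonical URL, with the other variants redirected or canonicalized to it.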

See Indexing Coverage Diagnosis for unindexed page causes.


How to monitor crawl budget

GSC Crawl stats report

GSC → Settings → Crawl stats (if available)

Metrics: Total crawl requests, average response time, response code distribution

Use: Identify crawl spikes, error patterns, and response time issues.

Server log analysis

Analyze server logs for Googlebot requests. Tools: Screaming Frog Log File Analyser, custom scripts. Logs show which URLs Googlebot actually requests, how often, and with what response codes.
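A custom script can be quite small. The sketch below assumes logs in the common "combined" format used by Apache and nginx, and identifies Googlebot by user-agent string only. User agents can be spoofed, so a production version would also verify the requesting IP (for example by reverse DNS), which is omitted here.

```python
import re
from collections import Counter

# Combined log format: ip - - [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requested paths and status codes for lines whose agent mentions Googlebot."""
    paths, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue  # skip malformed lines and other clients
        paths[match.group("path")] += 1
        statuses[match.group("status")] += 1
    return paths, statuses
```

Calling `paths.most_common(20)` on the result lists the URLs that receive the most crawl, which can then be compared against the pages that matter.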

URL Inspection sampling

GSC URL Inspection → check "Last crawl" dates for important pages. Old crawl dates may indicate low crawl priority.
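Once last-crawl dates have been collected for a sample of important pages, flagging the stale ones is a short calculation. The sketch below assumes the dates were gathered by hand into a dictionary; the 30-day cutoff is an arbitrary illustration, not a Google threshold.

```python
from datetime import datetime, timedelta, timezone


def stale_pages(last_crawled, now, max_age_days=30):
    """Return (url, age_in_days) pairs older than max_age_days, oldest first.

    `last_crawled` maps URL -> datetime of the last crawl shown in URL Inspection.
    """
    limit = timedelta(days=max_age_days)
    stale = [(url, (now - crawled).days)
             for url, crawled in last_crawled.items()
             if now - crawled > limit]
    return sorted(stale, key=lambda item: item[1], reverse=True)


# Example with made-up dates.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
sample = {
    "/pricing": datetime(2024, 6, 25, tzinfo=timezone.utc),
    "/guide": datetime(2024, 3, 1, tzinfo=timezone.utc),
}
print(stale_pages(sample, now))  # [('/guide', 121)]
```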


Five crawl budget optimization strategies

1. Reduce low-value URL count

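Remove, consolidate, or noindex tag pages, internal search results, and faceted navigation duplicates. Keep in mind that noindex only keeps a page out of the index: Googlebot still has to fetch the page to see the directive, so URLs that should not be crawled at all are better blocked in robots.txt.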

2. Fix crawl errors

Resolve 5xx, redirect loops, and soft 404s. Errors waste crawl and reduce capacity.
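Redirect chains and loops are easy to detect once the redirect rules are exported as a source-to-target mapping. The sketch below works on such a mapping (the dictionary format and function names are illustrative assumptions) and reports which sources should point straight at their final URL.

```python
def resolve_redirects(start, redirects, max_hops=10):
    """Follow `redirects` (source -> target) from `start`.

    Returns (last_url_reached, hops, is_loop). If max_hops is hit first,
    the URL returned may itself still redirect.
    """
    seen = {start}
    current, hops = start, 0
    while current in redirects and hops < max_hops:
        current = redirects[current]
        hops += 1
        if current in seen:
            return current, hops, True  # came back to a URL already visited
        seen.add(current)
    return current, hops, False


def chains_to_flatten(redirects):
    """Map each source that needs more than one hop (and is not a loop) to its final URL."""
    flattened = {}
    for source in redirects:
        final, hops, is_loop = resolve_redirects(source, redirects)
        if hops > 1 and not is_loop:
            flattened[source] = final
    return flattened


rules = {"/a": "/b", "/b": "/c", "/x": "/y", "/y": "/x"}
print(chains_to_flatten(rules))  # {'/a': '/c'}
```

Loops such as `/x` and `/y` above are reported by `resolve_redirects` with `is_loop=True` and need a manual fix rather than flattening.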

3. Optimize sitemap

Submit a sitemap that lists only important, indexable URLs. Remove deleted, redirected, or noindex URLs from it. See Sitemap for details.
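As an illustration, a sitemap can be generated from a page inventory while filtering out URLs that should not be listed. The record keys used here (`url`, `status`, `noindex`, `lastmod`) are hypothetical names for whatever your CMS or crawler export provides.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from page records, keeping only indexable 200 URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if page["status"] != 200 or page.get("noindex"):
            continue  # deleted, redirected, erroring, or noindex pages stay out
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page["url"]
        if page.get("lastmod"):
            ET.SubElement(entry, "lastmod").text = page["lastmod"]
    return ET.tostring(urlset, encoding="unicode")
```

Regenerating the file from the same inventory on each deploy is one way to keep removed pages from lingering in the sitemap.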

4. Strengthen internal linking

Important pages need strong internal links to signal crawl priority. Orphan pages get less crawl.
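Orphan pages can be found by comparing the URLs you know about (for example from the sitemap) with the internal link graph from a site crawl. The sketch below assumes the graph is available as a dictionary of page to linked pages; that format and the function names are illustrative.

```python
from collections import Counter


def inbound_link_counts(link_graph):
    """Count distinct internal pages linking to each URL (self-links ignored)."""
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def orphan_pages(link_graph, known_urls, home="/"):
    """Return known URLs, other than the home page, that no internal page links to."""
    counts = inbound_link_counts(link_graph)
    return sorted(url for url in known_urls
                  if url != home and counts.get(url, 0) == 0)
```

Sorting `inbound_link_counts` from lowest to highest also shows which important pages are only weakly linked, not just fully orphaned.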

5. Improve server performance

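Faster response times allow more crawl within capacity. Server response time and stability are what matter most here; front-end metrics such as Core Web Vitals are measured for users and are at most an indirect signal.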
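To track server performance as Googlebot experiences it, response times from logs or monitoring can be summarized with a median and a 95th percentile. The sketch below assumes the times are already extracted as a list of milliseconds. Which percentile to watch is a judgment call, not a Google recommendation.

```python
import statistics


def response_time_summary(times_ms):
    """Summarize response times; needs at least two measurements."""
    # n=20 yields 19 cut points at 5% steps, so index 18 is the 95th percentile.
    cuts = statistics.quantiles(times_ms, n=20, method="inclusive")
    return {
        "median": statistics.median(times_ms),
        "p95": cuts[18],
        "max": max(times_ms),
    }
```

A median that looks healthy alongside a high 95th percentile usually points to a few slow templates or intermittent load problems rather than a uniformly slow server.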


robots.txt and crawl budget

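Blocking URLs in robots.txt prevents crawl but does not remove from index if already indexed. Use noindex for pages you don't want indexed, and leave those pages crawlable: if a URL is blocked in robots.txt, Googlebot cannot fetch it and never sees the noindex directive. See How to Allow AI Bots in robots.txt for AI bot considerations.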
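Before deploying robots.txt changes, it can help to test which URLs a draft would block. The sketch below uses Python's standard `urllib.robotparser`. The `/search` and `/cart` rules are made-up examples, and the standard-library parser uses simple prefix matching, so it will not reproduce every Google matching rule (wildcards, for instance).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]
```

Running the important-page list through `blocked_urls` is a quick guard against accidentally blocking pages that should be crawled.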


Local market application

Crawl budget for local sites

Most local business sites have few URLs, so crawl budget is rarely an issue. Focus on indexing and content quality. Multi-location sites with many location pages may need crawl optimization.

CMS and hosting

Some CMSs generate many low-value URLs (tags, archives). Audit these URLs and noindex, consolidate, or block them to preserve crawl for important pages.


Frequently asked questions

Q. Should I use "Crawl rate" in GSC?
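A. The crawl rate setting was retired from GSC in early 2024, so there is no longer a value to adjust. Google manages the rate automatically. If Googlebot is overloading your server, temporarily returning 503 or 429 responses slows it down, at the cost of delayed indexing. There is no setting that forces a higher crawl rate.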

Q. Does blocking CSS/JS hurt crawl budget?
A. Blocking render-critical resources can prevent proper rendering and indexing. Allow Googlebot to access CSS/JS needed to render content.

Q. How do I know if crawl budget is my problem?
A. Important pages stuck "Discovered — currently not indexed" for months, GSC crawl stats show high crawl on low-value URLs, or server logs show Googlebot hitting duplicates/thin pages heavily.

Q. Does crawl budget affect AI bots?
A. AI bots (GPTBot, etc.) have separate crawl behavior. robots.txt controls them independently. See AI Bots robots.txt Matrix for details.

Q. Can I request more crawl budget?
A. No direct request. Improve site quality, fix errors, reduce waste — Google allocates more crawl to valuable, healthy sites.


Related sources

Related items

📘Concept
Google Search Console
Google Search Console (GSC) is a free tool from Google for monitoring site search performance, diagnosing indexing issues, and submitting sitemaps — the essential foundation for SEO measurement.
📙How-to
Indexing Coverage Diagnosis
Indexing coverage diagnosis uses the GSC indexing report to check overall site indexing status, identify causes of unindexed pages, and fix them — a core SEO task.
📘Concept · Pillar
Duplicate Content
Duplicate content is a state where identical or very similar content exists on multiple URLs, causing authority dilution and indexing confusion—a common technical SEO problem.
📘Concept · Pillar
Thin Content
Thin content refers to shallow pages that fail to provide sufficient value to users. The Helpful Content system detects it and lowers overall site quality—a common SEO penalty trigger.
📘Concept · Pillar
Canonical Tag
A canonical tag is an HTML link element (rel="canonical") that tells search engines 'this URL is the representative version' when duplicate or similar content exists across multiple URLs. It resolves duplicate content problems and concentrates PageRank on the canonical URL, which makes it a core on-page SEO tool.
📘Concept
Noindex
noindex is an on-page crawl control directive that tells search engine bots not to include a page in search results via robots meta tags or HTTP headers. It excludes pages that do not need or should not appear in search from the index, saving crawl budget and improving site quality signals.
📘Concept · Pillar
Crawlability
Crawlability is the ability of search engine and AI bots to access website pages and read content. It is the most basic condition for SEO and AEO, a required step that precedes indexing and ranking.
📘Concept
HTTP Status Codes
HTTP status codes are three-digit codes returned when a server responds to client requests. In SEO, codes such as 200 (OK), 301 (permanent redirect), 302 (temporary redirect), 404 (not found), 410 (gone), and 500 (server error) directly affect crawling, indexing, and PageRank transfer.
📘Concept · Pillar
JavaScript SEO
JavaScript SEO is the technical SEO area of optimizing JavaScript-rendered web pages so search engines and AI bots recognize them correctly. The choice between SSR/SSG and CSR determines indexing feasibility.
📙How-to
How to Allow AI Bots in robots.txt
Allowing AI bots means explicitly permitting major AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot to access your site in robots.txt, exposing your content for citation in generative AI answers.
📘Concept · Pillar
Site Architecture
Site architecture is the overall design of page hierarchy, URL structure, and internal linking on a website. It simultaneously determines crawl efficiency, indexing quality, and user navigation experience — a foundational SEO element.
📙How-to
Sitemap (XML Sitemap)
An XML sitemap is an XML file listing a website’s URLs along with last-modified dates, update frequency, and priority information. It helps search engine bots understand site structure and improves crawling efficiency and indexing speed as a technical SEO foundation tool.
