How to Allow AI Bots in robots.txt

Why Allowing AI Bots Is Needed

If your content is never cited in ChatGPT, Claude, or Perplexity answers, one possible cause is that your site blocks AI crawlers.

Generative AI engines collect content in two ways.

  • Training data collection: Bots crawl the web ahead of LLM pre-training and collect content as training data.
  • Real-time search augmentation: Bots crawl the web when a user asks a question and augment the answer with the latest information (RAG).

In both cases, a robots.txt block keeps compliant bots away from your site. To get your content cited in AI answers, you must allow crawler access.

8 AI Bots to Allow

OpenAI, Anthropic, Perplexity, and others each operate their own User-agents. Each bot has a different role, so allow each one individually.

| Service | User-agent | Role |
|---|---|---|
| OpenAI | GPTBot | ChatGPT training data collection |
| OpenAI | OAI-SearchBot | ChatGPT Search real-time citation (separate from training) |
| OpenAI | ChatGPT-User | When users visit URLs directly from ChatGPT |
| Anthropic | ClaudeBot | Claude training and answer augmentation |
| Perplexity | PerplexityBot | Real-time answer citation |
| Google | Google-Extended | Gemini AI training (separate from Googlebot) |
| Common Crawl | CCBot | Major source of open-source LLM training data |
| Meta | Meta-ExternalAgent | Meta AI training |

Important: OpenAI operates GPTBot (training), OAI-SearchBot (search indexing), and ChatGPT-User (direct user requests) as three independent bots. Allowing only GPTBot does not help ChatGPT Search citations.

Google-Extended note: Google AI Overviews uses standard Googlebot, so blocking Google-Extended does not block AI Overviews exposure. Google-Extended controls Gemini model training data collection.

robots.txt Example

# Default allow
User-agent: *
Allow: /

# Explicitly allow OpenAI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic Claude
User-agent: ClaudeBot
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Google Gemini training
User-agent: Google-Extended
Allow: /

# Common Crawl (LLM training data source)
User-agent: CCBot
Allow: /

# Meta AI
User-agent: Meta-ExternalAgent
Allow: /

# Declare sitemap
Sitemap: https://example.com/sitemap.xml

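To allow only specific directories, list them with Allow and block everything else with Disallow: /. A path that matches no rule is allowed by default, and when several rules match, the most specific (longest) path wins:

User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /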
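
How a crawler resolves overlapping Allow and Disallow rules can be sketched in Python. This is a simplified illustration of longest-match precedence, not a full parser: it handles plain path prefixes only (no * or $ wildcards), and the function name is_allowed and both rule lists are hypothetical.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("allow", "/blog/")


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return True if `path` may be crawled under `rules`.

    Simplified longest-match precedence: the matching rule with the
    longest prefix wins, Allow wins a tie, and a path that matches
    no rule is allowed by default.
    """
    best_len = -1
    allowed = True  # default when nothing matches
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical rule sets for a single User-agent group.
only_listed = [("allow", "/blog/"), ("allow", "/docs/"), ("disallow", "/")]
no_catch_all = [("allow", "/blog/"), ("disallow", "/private/")]

if __name__ == "__main__":
    for path in ("/blog/post-1", "/private/report", "/about"):
        print(path, is_allowed(only_listed, path), is_allowed(no_catch_all, path))
```

Without a catch-all Disallow: /, a path such as /about stays crawlable because no rule matches it.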

4 Implementation Steps

Step 1: Diagnose Current Status

Open https://yourdomain.com/robots.txt directly and review the current settings. Update the file if the AI bot User-agent groups are missing or the entire site is blocked with Disallow: /.
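
One way to run this check for all eight bots at once is Python's standard urllib.robotparser module. The sketch below parses robots.txt text that you have already downloaded; the helper name blocked_bots and the sample file are illustrative. Note that this parser applies the first matching rule inside a group rather than the longest match, so its verdict can differ from a real crawler's when rules overlap.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "CCBot", "Meta-ExternalAgent",
]


def blocked_bots(robots_txt: str, path: str = "/") -> list:
    """Return the AI bots that may not fetch `path` under this robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, path)]


# Illustrative file: everything open by default, two bots blocked explicitly.
SAMPLE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

if __name__ == "__main__":
    print(blocked_bots(SAMPLE))
```

To check a live site, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.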

Step 2: Update robots.txt

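Merge the example above with your existing file. Always preserve existing Disallow rules for /admin, /private, etc. A crawler follows only the group that most specifically matches its User-agent, so rules under User-agent: * do not apply to a bot that has its own group. Repeat those Disallow rules inside each AI bot group.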

Step 3: Check Firewall/CDN Settings (Important)

Updating robots.txt alone has no effect if a CDN or firewall still blocks AI bots.

  • Cloudflare: Check Security → Bots → Bot Fight Mode or AI Scrapers and Crawlers settings. Some plans default to blocking — disable as needed.
  • AWS WAF: Check AI crawler classification in Bot Control rulesets.
  • NGINX/Apache: Review User-Agent-based block rules and add AI bot exceptions.
  • Hardware firewall: IP-based blocking may affect AI bots.
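
A rough way to spot a User-Agent-based block is to request a page while sending each bot's token as the User-Agent header and compare the status codes. The sketch below makes several assumptions: the short User-Agent strings are simplified stand-ins (real crawlers send longer strings containing these tokens), the function names are hypothetical, and a firewall that verifies crawler IP addresses may treat such a test request differently from the real bot, so the result is only indicative.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable

# Simplified User-Agent strings (assumption): real crawlers send longer
# strings that contain these product tokens.
AI_BOT_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send one GET request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses are raised as exceptions


def probe(
    url: str,
    agents: Iterable[str],
    fetch: Callable[[str, str], int] = fetch_status,
) -> Dict[str, str]:
    """Map each User-Agent to "ok" or a note carrying the suspicious status."""
    results = {}
    for agent in agents:
        status = fetch(url, agent)
        if 200 <= status < 400:
            results[agent] = "ok"
        else:
            results[agent] = f"blocked? (HTTP {status})"
    return results
```

Calling probe("https://yourdomain.com/", AI_BOT_AGENTS) returns one entry per bot. A 403 for a bot token while a browser User-Agent gets 200 suggests a rule at the CDN, WAF, or web server level. Connection failures raise urllib.error.URLError, which this sketch does not catch.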

Step 4: Verify

  • Confirm changes by accessing robots.txt directly
  • Use Google Search Console → Settings → robots.txt report (the standalone robots.txt tester has been retired)
  • Check server access logs for AI bot visits (bot visits expected within 2–7 days after change)
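
The log check can be scripted. The following is a minimal sketch that counts access-log lines mentioning each AI bot token; it assumes the User-Agent string appears somewhere in each log line (as in the common combined log format), and the sample lines use made-up IP addresses and User-Agent strings.

```python
from collections import Counter
from typing import Iterable

AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "CCBot", "Meta-ExternalAgent",
]


def count_ai_bot_hits(log_lines: Iterable[str]) -> Counter:
    """Count log lines per AI bot token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each line to at most one bot
    return hits


# Made-up sample lines in combined log format.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:04 +0000] "GET /docs/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

if __name__ == "__main__":
    for bot, count in count_ai_bot_hits(SAMPLE_LOG).items():
        print(bot, count)
```

For a real log, pass an open file object instead of the sample list. The log path depends on your server configuration.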

Korea Market Application

Setup differs across the platforms commonly used in Korea.

  • Cafe24: Edit robots.txt directly via FTP or admin panel. Also verify Cafe24 firewall settings separately.
  • Imweb: robots.txt editing is limited. Request configuration through Imweb customer support.
  • Gabia general hosting: Edit directly in root directory via FTP.
  • Vercel/Netlify: Commit a static robots.txt to the framework's public directory (public/robots.txt in Next.js) or, in the Next.js App Router, generate it from app/robots.ts.

Also specify Korean search engines:

# Naver search bot
User-agent: Yeti
Allow: /

User-agent: NaverBot
Allow: /

Korea-only IP policies that block overseas IP addresses may also block global AI bots. Consider separately allowing AI bot IP ranges in your WAF or firewall.
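
Where a bot operator publishes the IP ranges its crawler uses, an allowlist check reduces to CIDR membership. The sketch below uses Python's standard ipaddress module; the CIDR blocks are placeholder documentation ranges, not real crawler ranges, and the function name ip_in_ranges is hypothetical.

```python
import ipaddress
from typing import Iterable

# Placeholder CIDR blocks taken from reserved documentation ranges.
# Replace them with the ranges each bot operator actually publishes.
EXAMPLE_BOT_RANGES = ["192.0.2.0/24", "198.51.100.0/25"]


def ip_in_ranges(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    for candidate in ("192.0.2.44", "198.51.100.200", "203.0.113.1"):
        print(candidate, ip_in_ranges(candidate, EXAMPLE_BOT_RANGES))
```

The same membership test is what an allow rule in a WAF or firewall expresses: requests from listed ranges pass even when a country-level block would otherwise reject them.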

When Blocking AI Bots Is Needed

When blocking is needed for copyright protection or paid content protection:

# Block major training data collection bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

However, blocking carries a tradeoff from an AEO perspective: it forfeits the opportunity for your content to be cited in AI answers. Keep general marketing and blog content open, and block only paid subscription content or competitively sensitive information.
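
A selective block of this kind can be generated rather than hand-written. The following is a minimal sketch that emits one group per training bot and disallows only the protected paths, leaving everything else crawlable by default; the function name build_selective_block and the /premium/ path are hypothetical.

```python
from typing import Iterable, List


def build_selective_block(bots: Iterable[str], protected_paths: List[str]) -> str:
    """Build robots.txt groups that block only `protected_paths` for each bot."""
    groups = []
    for bot in bots:
        lines = [f"User-agent: {bot}"]
        lines += [f"Disallow: {path}" for path in protected_paths]
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Hypothetical paid-content paths; adjust to your own site structure.
    print(build_selective_block(["GPTBot", "ClaudeBot", "CCBot"], ["/premium/", "/members/"]))
```

Because paths that match no rule stay allowed, blog and marketing pages remain available for citation while the paid sections are excluded.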

Frequently Asked Questions

How long after changing robots.txt do effects appear?
Timing varies by crawler. OpenAI documentation states that robots.txt changes take about 24 hours to be reflected in its systems. New content shows up quickly in answers that use real-time search (RAG), but reaches training-based answers only after the next model retraining.

ChatGPT already knows our site — do we still need to allow access?
Your content may already be in the training data, but features that use real-time indexing, such as ChatGPT Search, depend on OAI-SearchBot currently being allowed to crawl. If it is blocked, updated content may not be reflected.

How are robots.txt and llms.txt different?
robots.txt controls "whether bots can access my site." llms.txt guides "how bots should understand my site." The two files are complementary; operating both together is ideal.

We use Cloudflare — is changing robots.txt enough?
If Cloudflare Bot Fight Mode or Super Bot Fight Mode is active, AI bots may be blocked regardless of robots.txt. Check and disable AI bot-related settings separately in the Cloudflare dashboard.

Should Meta-ExternalAgent also be allowed?
Meta AI still has lower usage in Korea than ChatGPT, Claude, and Perplexity. Allowing it adds little traffic burden, so allowing it is recommended for future readiness.
