How to Allow AI Bots in robots.txt
Why Allowing AI Bots Is Needed
A common reason content is never cited in ChatGPT, Claude, or Perplexity answers is that the site blocks AI crawlers.
Generative AI engines collect content in two ways.
- Training data collection: Crawl the web during LLM pre-training to collect content as training data.
- Real-time search augmentation: Crawl the web in real time during user questions to augment answers with latest information (RAG).
In both cases, a robots.txt block keeps compliant crawlers away from your site. For your content to be cited in AI answers, those crawlers must be allowed access.
8 AI Bots to Allow
OpenAI, Anthropic, Perplexity, and others each operate their own User-agents. Their roles differ by service, so set rules per User-agent rather than treating them as a single bot.
| Service | User-agent | Role |
|---|---|---|
| OpenAI | GPTBot | ChatGPT training data collection |
| OpenAI | OAI-SearchBot | ChatGPT Search real-time citation (separate from training) |
| OpenAI | ChatGPT-User | Fetches a URL on demand when a ChatGPT user's request requires it |
| Anthropic | ClaudeBot | Claude training and answer augmentation |
| Perplexity | PerplexityBot | Real-time answer citation |
| Google | Google-Extended | Gemini AI training (separate from Googlebot) |
| Common Crawl | CCBot | Major source of open-source LLM training data |
| Meta | Meta-ExternalAgent | Meta AI training |
Important: OpenAI operates GPTBot (training), OAI-SearchBot (search indexing), and ChatGPT-User (direct user requests) as three independent bots. Allowing only GPTBot does not help ChatGPT Search citations.
Google-Extended note: Google AI Overviews uses standard Googlebot, so blocking Google-Extended does not block AI Overviews exposure. Google-Extended controls Gemini model training data collection.
robots.txt Example
# Default allow
User-agent: *
Allow: /
# Explicitly allow OpenAI crawlers
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
# Anthropic Claude
User-agent: ClaudeBot
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Google Gemini training
User-agent: Google-Extended
Allow: /
# Common Crawl (LLM training data source)
User-agent: CCBot
Allow: /
# Meta AI
User-agent: Meta-ExternalAgent
Allow: /
# Declare sitemap
Sitemap: https://example.com/sitemap.xml
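If the bot list changes often, a file like the one above can be generated instead of edited by hand. The following is a minimal Python sketch, not a required tool: `AI_BOTS` and `build_robots_txt` are hypothetical names introduced for illustration, and the bot list simply mirrors the table above.

```python
# Hypothetical helper: builds an allow-all robots.txt for the bots in the table.
AI_BOTS = [
    ("OpenAI", ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]),
    ("Anthropic Claude", ["ClaudeBot"]),
    ("Perplexity", ["PerplexityBot"]),
    ("Google Gemini training", ["Google-Extended"]),
    ("Common Crawl", ["CCBot"]),
    ("Meta AI", ["Meta-ExternalAgent"]),
]


def build_robots_txt(sitemap_url: str, bots=AI_BOTS) -> str:
    """Return robots.txt text with a default allow plus one group per bot."""
    lines = ["# Default allow", "User-agent: *", "Allow: /", ""]
    for service, agents in bots:
        lines.append(f"# {service}")
        for agent in agents:
            # One group per User-agent, separated by a blank line.
            lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(build_robots_txt("https://example.com/sitemap.xml"))
```

Before deploying generated output, merge it with any existing Disallow rules rather than overwriting the live file.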
To allow specific directories while blocking others (paths not listed in any rule remain allowed by default):
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /private/
Disallow: /members/
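One way to sanity-check path rules like these is Python's standard `urllib.robotparser`. The sketch below assumes the GPTBot group above; the helper name `is_allowed` and the example paths are hypothetical. This parser applies the first matching rule in file order, which may differ from the longest-match behavior of individual crawlers, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# The directory rules from the example above.
RULES = """\
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /private/
Disallow: /members/
"""


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical paths used only for illustration.
for path in ["/blog/post-1", "/private/report", "/pricing"]:
    print(path, is_allowed(RULES, "GPTBot", path))
```

In this sketch `/pricing` is reported as allowed because no rule matches it. To restrict GPTBot to `/blog/` and `/docs/` only, a final `Disallow: /` line would be needed after the Allow lines.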
4 Implementation Steps
Step 1: Diagnose Current Status
Open https://yourdomain.com/robots.txt in a browser and review the current rules. Changes are needed if the file has no groups for the AI bot User-agents or if the entire site is blocked with Disallow: /.
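The same diagnosis can be scripted. The sketch below is one possible approach using Python's standard `urllib.robotparser`; `blocked_bots` and `SAMPLE` are hypothetical names, and the sample text stands in for whatever your live robots.txt contains.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens from the table in this article.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "CCBot", "Meta-ExternalAgent",
]


def blocked_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI bots that this robots.txt text disallows for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, path)]


# Hypothetical file: the whole site is blocked, with an exception for GPTBot only.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

print(blocked_bots(SAMPLE))
```

For a live site, the text can be loaded with `RobotFileParser.set_url()` and `read()` instead of a string. In the sample, every bot except GPTBot is reported as blocked, which matches the point that allowing only GPTBot leaves the other crawlers shut out.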
Step 2: Update robots.txt
Merge the example above with your existing file. Always preserve existing Disallow rules for /admin, /private, etc.
Step 3: Check Firewall/CDN Settings (Important)
Updating robots.txt alone has no effect if a CDN or firewall still blocks AI bots.
- Cloudflare: Check Security → Bots → Bot Fight Mode or AI Scrapers and Crawlers settings. Some plans default to blocking — disable as needed.
- AWS WAF: Check AI crawler classification in Bot Control rulesets.
- NGINX/Apache: Review User-Agent-based block rules and add AI bot exceptions.
- Hardware firewall: IP-based blocking may affect AI bots.
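A rough way to spot User-Agent-based blocking is to request a page with a bot-like User-Agent header and compare the status code with a normal request. The sketch below uses only the Python standard library. The names `probe` and `interpret_status` are hypothetical, the User-Agent value in the usage comment is a simplified placeholder rather than a real crawler string, and the status-code heuristic is an assumption. Firewalls that verify crawler IP ranges, or that return a challenge page with status 200, will not be detected this way.

```python
import urllib.error
import urllib.request


def interpret_status(status: int) -> str:
    """Heuristic (an assumption): 401/403/429 suggest a block, 2xx/3xx suggest access."""
    if status in (401, 403, 429):
        return "likely blocked"
    if 200 <= status < 400:
        return "reachable"
    return "inconclusive"


def probe(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch `url` with the given User-Agent header and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the status code is on the exception.
        return err.code


# Example usage (performs a network request, so it is left commented out):
# status = probe("https://yourdomain.com/", "GPTBot-test-placeholder")
# print(status, interpret_status(status))
```

If the placeholder User-Agent is rejected while an ordinary browser request succeeds, a User-Agent rule in the CDN, WAF, or web server is a likely cause.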
Step 4: Verify
- Confirm changes by accessing robots.txt directly
- Use the robots.txt report in Google Search Console (it replaced the retired robots.txt tester)
- Check server access logs for AI bot visits (bot visits expected within 2–7 days after change)
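The log check in the last item can be sketched as a simple substring count. This is an illustrative Python example, not a full log parser: `count_ai_bot_hits` is a hypothetical helper, and the sample lines are made-up entries in a common access-log style. Google-Extended is left out because it is a robots.txt control token and does not appear as a separate User-Agent in logs.

```python
from collections import Counter

# Tokens to look for in the User-Agent field of each log line.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "CCBot", "Meta-ExternalAgent",
]


def count_ai_bot_hits(log_lines) -> Counter:
    """Count log lines per AI bot token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each line once
    return hits


# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:02 +0000] "GET / HTTP/1.1" 200 900 "-" "ExampleBrowser/1.0"',
    '203.0.113.7 - - [01/Jan/2025:00:00:03 +0000] "GET /docs/ HTTP/1.1" 200 640 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:04 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

for bot, count in count_ai_bot_hits(SAMPLE_LOG).items():
    print(bot, count)
```

For a real log, pass an open file object instead of the sample list, for example `count_ai_bot_hits(open(path, encoding="utf-8"))` with the path to your server's access log.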
Korea Market Application
Setup differs across the platforms commonly used in Korea.
- Cafe24: Edit robots.txt directly via FTP or admin panel. Also verify Cafe24 firewall settings separately.
- Imweb: robots.txt editing is limited. Request configuration through Imweb customer support.
- Gabia general hosting: Edit directly in root directory via FTP.
- Vercel/Netlify: Manage via public/robots.txt file or next.config.js settings.
Also specify Korean search engines:
# Naver search bot
User-agent: Yeti
Allow: /
User-agent: NaverBot
Allow: /
IP-based blocking policies used by Korean sites may also affect global AI bots, so consider allowing AI bot IP ranges separately in the WAF or firewall.
When Blocking AI Bots Is Needed
When blocking is needed for copyright protection or paid content protection:
# Block all training data collection bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
From an AEO perspective, however, blocking is a tradeoff: it forfeits the opportunity for your content to be cited in AI answers. Keep general marketing and blog content open, and block only paid subscription content or competitively sensitive information.
Frequently Asked Questions
How long after changing robots.txt until effects appear?
Timing varies by crawler. OpenAI's documentation states that robots.txt changes take about 24 hours to be reflected in its systems. New content can appear quickly in answers that use real-time search (RAG), but answers based on training data change only after the next model retraining.
ChatGPT already knows our site — do we still need to allow access?
Content may already be in the training data, but features that rely on real-time indexing, such as ChatGPT Search, depend on OAI-SearchBot currently being allowed. Without that access, updated content may not be reflected.
How are robots.txt and llms.txt different?
robots.txt controls "whether bots can access my site." llms.txt guides "how bots should understand my site." The two files are complementary; operating both together is ideal.
We use Cloudflare — is changing robots.txt enough?
If Cloudflare Bot Fight Mode or Super Bot Fight Mode is active, AI bots may be blocked regardless of robots.txt. Check and disable AI bot-related settings separately in the Cloudflare dashboard.
Should Meta-ExternalAgent also be allowed?
Meta AI still has lower usage in Korea than ChatGPT, Claude, and Perplexity. Allowing it adds little traffic burden, so allowing it is recommended for future readiness.
Related Sources
- OpenAI Crawlers official documentation: https://platform.openai.com/docs/bots
- Anthropic ClaudeBot official documentation: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-the-web-and-how-can-site-owners-block-the-anthropic-crawler
- Perplexity crawler official documentation: https://docs.perplexity.ai/docs/resources/perplexity-crawlers
- Google-Extended explanation: https://developers.google.com/search/docs/crawling-indexing/google-extended