
How to Allow AI Bots in robots.txt

Last updated: May 6, 2026

Why Allowing AI Bots Is Needed

A common reason content is never cited in ChatGPT, Claude, or Perplexity answers is that the site blocks AI crawlers.

Generative AI engines collect content in two ways.

Training data collection: Crawlers gather web content during LLM pre-training to use as training data.
Real-time search augmentation: Crawlers fetch pages when a user asks a question, so the answer can be augmented with up-to-date information (retrieval-augmented generation, RAG).

In both cases, a Disallow rule in robots.txt tells compliant crawlers to stay off your site. To get your content cited in AI answers, crawler access must be allowed.

8 AI Bots to Allow

OpenAI, Anthropic, Perplexity, and others each operate their own User-agents. Roles differ from bot to bot, so each one has to be allowed individually.

Service	User-agent	Role
OpenAI	GPTBot	ChatGPT training data collection
OpenAI	OAI-SearchBot	ChatGPT Search real-time citation (separate from training)
OpenAI	ChatGPT-User	When users visit URLs directly from ChatGPT
Anthropic	ClaudeBot	Claude training and answer augmentation
Perplexity	PerplexityBot	Real-time answer citation
Google	Google-Extended	Gemini AI training (separate from Googlebot)
Common Crawl	CCBot	Major source of open-source LLM training data
Meta	Meta-ExternalAgent	Meta AI training

Important: OpenAI operates GPTBot (training), OAI-SearchBot (search indexing), and ChatGPT-User (direct user requests) as three independent bots. Allowing only GPTBot does not help ChatGPT Search citations.

Google-Extended note: Google AI Overviews uses standard Googlebot, so blocking Google-Extended does not block AI Overviews exposure. Google-Extended controls Gemini model training data collection.

robots.txt Example

# Default allow
User-agent: *
Allow: /

# Explicitly allow OpenAI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic Claude
User-agent: ClaudeBot
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Google Gemini training
User-agent: Google-Extended
Allow: /

# Common Crawl (LLM training data source)
User-agent: CCBot
Allow: /

# Meta AI
User-agent: Meta-ExternalAgent
Allow: /

# Declare sitemap
Sitemap: https://example.com/sitemap.xml

To allow only specific directories:

User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /private/
Disallow: /members/
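
The directory rules above can be checked before deployment. The following is a minimal sketch using Python's standard urllib.robotparser; the sample paths (/blog/post-1 and so on) and the helper name is_allowed are made up for illustration. This parser applies the first matching rule in file order, which may differ from crawlers that use longest-match precedence, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# The directory-scoped rules from the example above.
RULES = """\
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /private/
Disallow: /members/
"""


def is_allowed(robots_text: str, user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise each rule once.
    for path in ("/blog/post-1", "/docs/setup", "/private/report", "/members/profile"):
        print(path, is_allowed(RULES, "GPTBot", path))
```

Paths that match no rule at all are treated as allowed, so this configuration restricts only /private/ and /members/ for GPTBot.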

4 Implementation Steps

Step 1: Diagnose Current Status

Open https://yourdomain.com/robots.txt directly and review the current rules. Changes are needed if the AI bot User-agent groups are missing (those bots then fall back to the User-agent: * rules) or if the entire site is blocked with Disallow: /.
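
The diagnosis can also be scripted. The sketch below assumes the eight User-agents from the table above and uses Python's standard urllib.robotparser; the function names audit and fetch_robots are illustrative, and yourdomain.com is the same placeholder used in this step.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# The eight User-agents listed in the table above.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "CCBot", "Meta-ExternalAgent",
]


def audit(robots_text: str, path: str = "/") -> dict[str, bool]:
    """Map each AI bot to whether the given robots.txt lets it fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}


def fetch_robots(domain: str) -> str:
    """Download robots.txt for a domain (performs a network request)."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling audit(fetch_robots("yourdomain.com")) returns a dictionary such as {"GPTBot": True, ...}; any False entry marks a bot that the current file blocks from the site root.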

Step 2: Update robots.txt

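Merge the example above with your existing file. Always preserve existing Disallow rules for /admin, /private, etc. A crawler that matches its own User-agent group ignores the User-agent: * group, so repeat those Disallow rules inside each AI bot group you add.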

Step 3: Check Firewall/CDN Settings (Important)

Updating robots.txt alone has no effect if a CDN or firewall still blocks AI bots.

Cloudflare: Check Security → Bots → Bot Fight Mode or AI Scrapers and Crawlers settings. Some plans block by default; disable as needed.
AWS WAF: Check AI crawler classification in Bot Control rulesets.
NGINX/Apache: Review User-Agent-based block rules and add AI bot exceptions.
Hardware firewall: IP-based blocking may affect AI bots.
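
One rough way to spot a User-Agent-based block is to request a page while sending a bot's name as the User-Agent header and look at the HTTP status. The sketch below uses only Python's standard library; the function names and the status groupings in classify are assumptions chosen for illustration. Real crawlers send longer User-Agent strings and come from their operators' IP ranges, so a WAF that verifies IP addresses may treat this test request differently from the real bot.

```python
import urllib.error
import urllib.request


def classify(status: int) -> str:
    """Turn an HTTP status into a rough verdict (heuristic, not definitive)."""
    if 200 <= status < 400:
        return "reachable"
    if status in (401, 403, 429, 503):
        return "possibly blocked"
    return "inconclusive"


def probe(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Request url with the given User-Agent and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise HTTPError; the code is still the answer we want.
        return error.code
```

For example, classify(probe("https://yourdomain.com/", "GPTBot")) returning "possibly blocked" while a normal browser loads the page is a hint to review the CDN or server rules listed above.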

Step 4: Verify

Confirm changes by accessing robots.txt directly
Check the robots.txt report in Google Search Console (Settings → robots.txt), which replaced the retired robots.txt tester
Check server access logs for AI bot visits (bot visits expected within 2–7 days after change)
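
The log check can be sketched as a simple substring count over access log lines. The example assumes the User-Agent string appears somewhere in each line, as in the common combined log format; count_bot_hits is an illustrative name. Google-Extended is left out because it is a robots.txt control token and, as far as Google documents it, does not appear as a separate User-Agent in request logs.

```python
from collections import Counter
from typing import Iterable

# Bots from the table above that identify themselves in the User-Agent header.
LOGGED_BOTS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "CCBot", "Meta-ExternalAgent",
)


def count_bot_hits(log_lines: Iterable[str]) -> Counter:
    """Count access log lines per AI bot, matching names case-insensitively."""
    hits: Counter = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in LOGGED_BOTS:
            if bot.lower() in lowered:
                hits[bot] += 1
                break  # count each request once
    return hits
```

To run it against a real log, pass an open file object, for example count_bot_hits(open("access.log", encoding="utf-8", errors="replace")), where access.log stands in for your server's log path.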

Korea Market Application

Setup differs across the platforms commonly used in Korea.

Cafe24: Edit robots.txt directly via FTP or admin panel. Also verify Cafe24 firewall settings separately.
Imweb: robots.txt editing is limited. Request configuration through Imweb customer support.
Gabia general hosting: Edit directly in root directory via FTP.
Vercel/Netlify: Serve a static public/robots.txt file, or generate the file through the framework (for example, an app/robots.ts file in Next.js App Router projects).

Also specify Korean search engines:

# Naver search bot
User-agent: Yeti
Allow: /

User-agent: NaverBot
Allow: /

Korea-specific IP blocking policies (for example, blocking overseas IP addresses) may also affect global AI bots. Consider separately allowing AI bot IP ranges in your WAF or firewall.
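
An IP allowlist check of this kind can be sketched with Python's standard ipaddress module. The ranges below are placeholders taken from address blocks reserved for documentation, not real crawler ranges, and ExampleBot is a hypothetical name; substitute the ranges each bot operator actually publishes.

```python
import ipaddress
from typing import Dict, List, Optional

# Placeholder CIDR ranges (reserved documentation blocks), NOT real bot ranges.
ALLOWED_RANGES: Dict[str, List[str]] = {
    "ExampleBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def matching_bot(ip: str, ranges: Dict[str, List[str]] = ALLOWED_RANGES) -> Optional[str]:
    """Return the bot whose allowlisted range contains ip, or None."""
    address = ipaddress.ip_address(ip)
    for bot, cidrs in ranges.items():
        for cidr in cidrs:
            network = ipaddress.ip_network(cidr)
            # Compare only within the same IP version (IPv4 vs IPv6).
            if address.version == network.version and address in network:
                return bot
    return None
```

A request from an address for which matching_bot returns a bot name would be exempted from the overseas-IP block; all other addresses keep following the existing policy.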

When Blocking AI Bots Is Needed

When blocking is needed for copyright protection or paid content protection:

# Block training data collection bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

However, blocking involves a tradeoff from an AEO (answer engine optimization) perspective: it forfeits the opportunity for your content to be cited in AI answers. Keep general marketing and blog content open, and block only paid subscription content or competitively sensitive information.
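
One way to act on this tradeoff is to generate robots.txt from an explicit policy: block the training bots, and keep the real-time citation bots open except for protected paths. The sketch below is an illustration only; render_robots is a hypothetical helper, and the bot grouping follows the roles given in the table earlier in this article. Per that table, ClaudeBot also serves answer augmentation, so blocking it gives up that exposure as well.

```python
from typing import List

# Grouping based on the roles listed in this article's table.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]
CITATION_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def render_robots(blocked: List[str], allowed: List[str], protected_paths: List[str]) -> str:
    """Build robots.txt text: full block for some bots, partial access for others."""
    lines: List[str] = []
    for bot in blocked:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    for bot in allowed:
        lines.append(f"User-agent: {bot}")
        # Keep paid or sensitive areas closed even for allowed bots.
        lines += [f"Disallow: {path}" for path in protected_paths]
        lines += ["Allow: /", ""]
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_robots(TRAINING_BOTS, CITATION_BOTS, ["/members/", "/private/"]))
```

Generating the file from one policy definition keeps the protected paths identical across every allowed bot group, which is easy to get wrong when editing groups by hand.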

Frequently Asked Questions

How long after changing robots.txt until effects appear?
Timing varies by crawler. OpenAI's documentation states that robots.txt changes take about 24 hours to be reflected in its systems. New content can appear quickly in answers that use real-time search (RAG), but training-based answers reflect it only after the next model retraining.

ChatGPT already knows our site. Do we still need to allow access?
Your content may already be in the training data, but features that rely on real-time indexing, such as ChatGPT Search, depend on OAI-SearchBot currently being allowed to crawl. Without explicit allowance, updated content may not be reflected.

How are robots.txt and llms.txt different?
robots.txt controls "whether bots can access my site." llms.txt guides "how bots should understand my site." The two files are complementary; operating both together is ideal.

We use Cloudflare. Is changing robots.txt enough?
Not necessarily. If Cloudflare Bot Fight Mode or Super Bot Fight Mode is active, AI bots may be blocked regardless of robots.txt. Review the AI bot-related settings in the Cloudflare dashboard and disable any that block the crawlers you want to allow.

Should Meta-ExternalAgent also be allowed?
Meta AI still has lower usage in Korea than ChatGPT, Claude, and Perplexity. Allowing it adds little traffic burden, so allowing it is recommended for future readiness.