Google enforces a strict 500 KiB (512,000 bytes) limit on robots.txt files, ignoring any content beyond this threshold. This file size restriction, codified in RFC 9309 in September 2022, is just one of many technical requirements that can make or break your site’s crawlability. Understanding robots.txt best practices ensures search engines can efficiently discover your important pages while respecting your crawl budget.

This guide covers official file size limits, the critical difference between robots.txt and noindex, supported directives, proper syntax with examples, and common mistakes that hurt SEO performance.

Quick Reference: robots.txt Specifications

  • Maximum file size: 500 KiB (512,000 bytes)
  • File location: Root directory only (example.com/robots.txt)
  • Character encoding: UTF-8
  • Standard: RFC 9309 (September 2022)
  • Case sensitivity: Directives are case-insensitive; paths are case-sensitive
  • Supported wildcards: * (any sequence) and $ (end of URL)

Official File Size Limits

Google’s 500 KiB limit applies to the raw robots.txt file before any processing. That works out to exactly 512,000 bytes, or roughly half a million characters of plain ASCII text. While this sounds generous, large enterprise sites with thousands of disallow rules can approach this limit.

What happens when you exceed 500 KiB:

  • Google stops processing at the 500 KiB mark
  • Any directives after this point are completely ignored
  • No warning appears in Google Search Console
  • Crawlers may access pages you intended to block

The RFC 9309 specification, published in September 2022, formally standardized this behavior: compliant crawlers must process at least the first 500 KiB of a robots.txt file. Before the standard, different crawlers handled oversized files inconsistently.

Checking your file size:

To verify your robots.txt size, check the file properties on your server or use command-line tools. The file should remain well under the limit to account for future additions. As a rule of thumb, keep your robots.txt under 400 KiB to maintain a safety buffer.
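
If you prefer to check the live file, the sketch below fetches it over HTTP and compares its size to the limits discussed here. It uses only Python's standard library; the URL is a placeholder, and the 400 KiB threshold simply mirrors the safety buffer suggested above.

import urllib.request

GOOGLE_LIMIT = 500 * 1024    # 512,000 bytes: Google parses only up to this point
SAFETY_BUFFER = 400 * 1024   # the rule-of-thumb ceiling suggested above

# Placeholder URL: swap in your own domain
with urllib.request.urlopen("https://example.com/robots.txt") as response:
    body = response.read()

size = len(body)
print(f"robots.txt is {size:,} bytes ({size / 1024:.1f} KiB)")

if size > GOOGLE_LIMIT:
    print("Over the 500 KiB limit: rules past the cutoff will be ignored")
elif size > SAFETY_BUFFER:
    print("Approaching the limit: consider consolidating rules with wildcards")
else:
    print("Comfortably under the limit")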

If your robots.txt approaches the limit, consider consolidating rules using wildcards or restructuring your URL patterns. Multiple specific disallow rules can often be replaced with a single wildcard pattern.

robots.txt vs noindex: Understanding the Critical Difference

One of the most misunderstood concepts in technical SEO is the difference between blocking crawling with robots.txt and blocking indexing with noindex. These serve different purposes and using them incorrectly can cause serious SEO problems.

robots.txt blocks crawling:

  • Prevents search engine bots from accessing a page
  • The URL can still appear in search results (without content)
  • Google may index the URL based on external signals (links, anchor text)
  • Crawlers cannot see meta robots tags on blocked pages

noindex blocks indexing:

  • Allows crawlers to access the page
  • Tells search engines not to include the page in results
  • Requires the crawler to actually visit the page to see the directive
  • Works via the robots meta tag (<meta name="robots" content="noindex">) or an X-Robots-Tag: noindex HTTP response header

The dangerous combination:

Never use both robots.txt disallow AND noindex on the same page. If you block a page in robots.txt, search engines cannot crawl it to see your noindex tag. The page may still appear in search results based on external links, which is often the opposite of what site owners intend.

  • Block from results: robots.txt No, noindex Yes → page is crawled but not indexed (correct)
  • Save crawl budget: robots.txt Yes, noindex No → page is not crawled, but may still appear in the index
  • Incorrect approach: robots.txt Yes, noindex Yes → noindex is never seen, so the page may still appear

When to use each:

Use robots.txt when you want to save crawl budget on pages that don’t need indexing and have no external links pointing to them. Use noindex when you need to ensure a page never appears in search results, regardless of inbound links.

Supported Directives and Syntax

The robots.txt protocol supports a limited set of directives. Understanding which directives work with which search engines prevents wasted effort on unsupported rules.

User-agent

Specifies which crawler the following rules apply to. The asterisk (*) targets all crawlers.

User-agent: *
User-agent: Googlebot
User-agent: Bingbot

Rules under a specific user-agent group apply only to crawlers matching that name, and each crawler obeys the most specific group that matches it, regardless of where the group appears in the file. Listing specific user-agent groups before the general (*) group is still a good habit for readability.

Disallow

Prevents crawling of specified paths. An empty disallow allows full access.

Disallow: /admin/
Disallow: /private/document.pdf
Disallow:

The disallow directive matches URL prefixes. A rule like Disallow: /blog blocks /blog, /blog/, /blog/post, and /blogging.
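
You can watch this prefix behavior in action with Python's standard-library urllib.robotparser. This is a minimal sketch for illustration; note that the stdlib parser handles plain prefix rules like this one but does not implement the * and $ wildcard extensions described later.

from urllib.robotparser import RobotFileParser

# Parse an in-memory rule set equivalent to the example above
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /blog",
])

# Prefix matching: every path that starts with /blog is blocked
for path in ["/blog", "/blog/", "/blog/post", "/blogging", "/about"]:
    allowed = parser.can_fetch("ExampleBot", "https://example.com" + path)
    print(f"{path:12} {'allowed' if allowed else 'blocked'}")

Running this prints "blocked" for the first four paths and "allowed" for /about, matching the prefix rule described above.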

Allow

Permits crawling of specific paths within a disallowed directory. When both an allow and a disallow rule match a URL, Google applies the most specific rule (the one with the longest path) and resolves ties in favor of allow.

Disallow: /folder/
Allow: /folder/public/

In this example, /folder/public/ and its contents are crawlable, while the rest of /folder/ is blocked.

Sitemap

Points crawlers to your XML sitemap location. Can appear anywhere in the file and applies globally.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml

Multiple sitemap directives are allowed. Use absolute URLs including the protocol.
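
On Python 3.8 or newer, the standard-library parser can also extract these sitemap URLs for you. A short sketch, with a placeholder URL:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")  # placeholder URL
parser.read()              # fetch and parse the live file
print(parser.site_maps())  # list of Sitemap: URLs, or None if the file has none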

Crawl-delay (Not Supported by Google)

The crawl-delay directive tells crawlers to wait a specified number of seconds between requests. However, Google explicitly does not support this directive.

Crawl-delay: 10

While Bing and some other crawlers respect crawl-delay, it has no effect on Googlebot. Google has also retired the Search Console crawl rate setting; if you need to slow Googlebot down, temporarily return 500, 503, or 429 status codes or address the load at the server level rather than in robots.txt.

Wildcard Patterns and Special Characters

robots.txt supports two wildcard characters that enable powerful pattern matching without listing every URL individually.

Asterisk (*) Wildcard

Matches any sequence of characters, including an empty sequence.

Disallow: /*/private/
Disallow: /*.pdf
Disallow: /product/*?sort=

Examples:

  • /*/private/ blocks /a/private/, /folder/private/, /x/y/private/
  • /*.pdf blocks any URL containing .pdf
  • /product/*?sort= blocks product pages with sort parameters

Dollar Sign ($) End Matcher

Indicates the pattern must match the end of the URL.

Disallow: /*.pdf$
Disallow: /page$

Examples:

  • /*.pdf$ blocks URLs ending in .pdf but allows /pdf-guide/
  • /page$ blocks /page but allows /page/ and /pages

Combining Wildcards

Wildcards can be combined for precise targeting:

Disallow: /*/print$

This blocks URLs that end in /print at any depth, such as /article/my-post/print, while still allowing /article/my-post/printable and /article/my-post/print/preview. Keep in mind that * also matches slashes, so a pattern like /category/*/page/ blocks every URL containing that sequence, no matter how deep it goes.
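
If you want to sanity-check a wildcard pattern before relying on it, the matching rules are easy to prototype. The helper below is a hypothetical illustration, not a library function or Google's actual implementation: it escapes the pattern, turns each * into "match any sequence", and honors a trailing $ as an end-of-URL anchor.

import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # Hypothetical helper: translate a robots.txt path pattern into a regex
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*/print$")
for path in ["/article/my-post/print", "/article/my-post/print/preview", "/print"]:
    # re.match anchors at the start of the string, mirroring robots.txt prefix matching
    print(path, "blocked" if rule.match(path) else "allowed")

Only the first URL prints "blocked", which matches the explanation above.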

Common robots.txt Syntax Examples

Here are practical examples for common scenarios, with correct syntax.

Block a Single Directory

User-agent: *
Disallow: /admin/

Blocks all crawlers from /admin/ and all subdirectories.

Block Multiple Directories

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /temp/

Each disallow rule goes on its own line. Do not combine paths.

Allow Specific Subdirectory

User-agent: *
Disallow: /members/
Allow: /members/public/

Blocks /members/ except for /members/public/.

Block Query Parameters

User-agent: *
Disallow: /*?sessionid=
Disallow: /*&sort=
Disallow: /*?ref=

Prevents crawling of URLs with tracking or session parameters.

Block Specific File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.xls$

Blocks document files while allowing HTML pages.

Different Rules for Different Crawlers

User-agent: Googlebot
Disallow: /internal/

User-agent: Bingbot
Disallow: /internal/
Disallow: /beta/

User-agent: *
Disallow: /private/

Googlebot follows only its specific rules. Bingbot follows its rules. All other crawlers follow the asterisk rules.
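
You can confirm this group-selection behavior with urllib.robotparser. For exact user-agent names like these, the stdlib parser reaches the same conclusions as Google's most-specific-group rule, although internally it is a simpler first-match lookup:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: Googlebot
Disallow: /internal/

User-agent: Bingbot
Disallow: /internal/
Disallow: /beta/

User-agent: *
Disallow: /private/
""".splitlines())

# Each crawler is judged only against its own group (or * if it has none)
for bot in ["Googlebot", "Bingbot", "SomeOtherBot"]:
    for path in ["/internal/docs", "/beta/app", "/private/data"]:
        allowed = parser.can_fetch(bot, "https://example.com" + path)
        print(f"{bot:13} {path:15} {'allowed' if allowed else 'blocked'}")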

Complete Example File

# robots.txt for example.com

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?sessionid=
Disallow: /*&sort=
Allow: /admin/public-resources/

User-agent: Googlebot-Image
Disallow: /private-images/

Sitemap: https://example.com/sitemap.xml

Comments begin with # and are ignored by crawlers. Use them to document your rules.

Best Practices for robots.txt

Following these practices ensures your robots.txt works effectively without causing crawl issues.

Place File in Root Directory Only

The robots.txt file must be at your domain’s root: https://example.com/robots.txt. Files in subdirectories are ignored. Subdomains require their own robots.txt files.

Use UTF-8 Encoding

Save your robots.txt with UTF-8 encoding without BOM (byte order mark). Other encodings may cause parsing errors or incorrect character interpretation.

Keep Rules Organized

Group rules logically by user-agent. Place specific user-agent blocks before the general asterisk block. Use comments to explain complex rules.

Test Before Deploying

Use the robots.txt report in Google Search Console (which replaced the retired robots.txt Tester) to confirm that your file is fetched and parsed without errors, and use the URL Inspection tool to verify that specific URLs are blocked or allowed as intended.
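
A useful complement is a local regression test: before uploading a new robots.txt, assert that a handful of must-crawl and must-block URLs behave as expected. A minimal sketch follows; the file path and URL lists are hypothetical, and because the stdlib parser ignores the * and $ wildcards, rules that rely on them should still be verified in Search Console.

from urllib.robotparser import RobotFileParser

# Hypothetical candidate file and URL lists -- adjust to your own site
with open("robots.txt") as f:
    parser = RobotFileParser()
    parser.parse(f.read().splitlines())

must_allow = ["https://example.com/", "https://example.com/blog/latest-post"]
must_block = ["https://example.com/admin/", "https://example.com/checkout/"]

for url in must_allow:
    assert parser.can_fetch("Googlebot", url), f"unexpectedly blocked: {url}"
for url in must_block:
    assert not parser.can_fetch("Googlebot", url), f"unexpectedly allowed: {url}"

print("robots.txt behaves as expected for all test URLs")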

Monitor File Size

Track your robots.txt size, especially on large sites. Set alerts if the file approaches 400 KiB. Consolidate rules using wildcards to reduce file size.

Update When Site Structure Changes

Review robots.txt whenever you reorganize your site, add new sections, or change URL patterns. Outdated rules may block important pages or allow access to sensitive areas.

Don’t Block CSS and JavaScript

Google needs to render pages to understand their content. Blocking CSS and JavaScript files can hurt your rankings because Googlebot cannot see how users experience your pages.

Use Sitemap Directive

Always include your sitemap URL in robots.txt. This helps search engines discover your sitemap even if it’s not linked from your pages.

Common Mistakes to Avoid

These errors frequently appear in robots.txt files and can significantly impact SEO performance.

Blocking Important Resources

# Wrong - blocks all images
Disallow: /images/

# Wrong - blocks CSS needed for rendering
Disallow: /css/
Disallow: /js/

Google needs access to CSS, JavaScript, and images to properly render and understand your pages. Only block resources that truly shouldn’t be crawled.

Using Noindex in robots.txt

# Wrong - noindex is not a valid robots.txt directive
User-agent: *
Noindex: /private/

The noindex directive only works as a meta tag or HTTP header. Including it in robots.txt has no effect and may indicate confusion about how crawl control works.

Forgetting the Trailing Slash

# This blocks /admin, /admin/, /admins, /administrator
Disallow: /admin

# This blocks only /admin/ and contents
Disallow: /admin/

Without the trailing slash, the rule blocks any URL starting with /admin. Include the trailing slash to block only that specific directory.

Case Sensitivity Errors

# These are different paths
Disallow: /Private/
Disallow: /private/

While directives are case-insensitive, URL paths are case-sensitive. Match the exact case of your actual URLs.

Blocking Pages You Want Indexed

# Wrong - prevents indexing of important content
Disallow: /blog/

If you want blog posts to appear in search results, don’t block the blog directory. Use noindex selectively on specific pages instead.

Multiple Sitemaps on One Line

# Wrong
Sitemap: https://example.com/sitemap1.xml, https://example.com/sitemap2.xml

# Correct
Sitemap: https://example.com/sitemap1.xml
Sitemap: https://example.com/sitemap2.xml

Each sitemap URL requires its own sitemap directive on a separate line.

Assuming All Crawlers Respect robots.txt

robots.txt is a protocol, not a security measure. Malicious bots ignore it entirely. Never use robots.txt to protect sensitive data. Use proper authentication and access controls instead.

Frequently Asked Questions

What happens if I don’t have a robots.txt file?

Search engines assume all pages are available for crawling. A missing robots.txt returns a 404, which crawlers interpret as “no restrictions.” This is fine for most sites that want full crawling, but you lose the ability to provide sitemap locations through this file.

Can robots.txt remove pages already in Google’s index?

No. robots.txt only controls crawling, not indexing. To remove indexed pages, use the noindex meta tag or the URL Removal tool in Google Search Console. Pages blocked by robots.txt may remain in the index indefinitely.

How often do search engines check robots.txt?

Google typically caches robots.txt for up to 24 hours, sometimes longer. After making changes, it may take a day or more for crawlers to see your updated rules. There’s no way to force immediate refresh.

Should I block AI crawlers in robots.txt?

Many AI companies claim to respect robots.txt. You can add rules for known AI user-agents like GPTBot or CCBot. However, not all AI systems honor these restrictions, and new crawlers appear regularly. Blocking requires ongoing maintenance of user-agent lists.

Does robots.txt affect PageRank or link equity?

Blocking a page in robots.txt doesn’t prevent link equity from flowing to that URL. However, blocked pages cannot pass link equity to other pages because crawlers can’t see their outbound links. This is another reason to prefer noindex over robots.txt for pages you want to exclude from search results.

Can I use robots.txt to block specific IP addresses?

No. robots.txt works based on user-agent strings, not IP addresses. To block specific IPs, use server configuration files (.htaccess for Apache) or firewall rules.

Key Takeaways

  • Google enforces a 500 KiB (512,000 bytes) limit on robots.txt files, ignoring content beyond this threshold
  • robots.txt blocks crawling while noindex blocks indexing; never use both on the same page
  • The file must be in your root directory only, using UTF-8 encoding
  • Google does not support the crawl-delay directive; manage Googlebot’s crawl rate at the server level rather than in robots.txt
  • Use wildcards (* and $) to create efficient rules and keep file size manageable
  • Always test changes in Google Search Console before deploying to production

Conclusion

A properly configured robots.txt file helps search engines crawl your site efficiently while protecting pages that don’t need indexing. Remember the key distinctions: robots.txt controls crawling behavior, not indexing, and Google’s 500 KiB limit means large sites must optimize their rules carefully. Test your configuration regularly, especially after site structure changes, and always verify that important resources remain accessible to crawlers.

Try our free letter counter to check that your meta descriptions and title tags stay within search engine limits.