Robots.txt Explained: How to Create One and Control What Google Crawls

Every website has a /robots.txt file — or should. It's one of the first things search engine crawlers check when they visit your site. And yet most website owners either skip it entirely or copy a generic one without understanding what it does.

A badly written robots.txt can accidentally block Google from crawling your entire site. A well-written one keeps crawlers out of sections that don't belong in search, reduces crawl waste, and points search engines at the content they should focus on.

What is robots.txt?

Robots.txt is a plain text file placed in the root of your website that tells search engine crawlers which pages or sections they're allowed to access.

It follows the Robots Exclusion Protocol, a convention dating back to 1994 (standardized as RFC 9309 in 2022) that all major search engines (Google, Bing, DuckDuckGo, etc.) respect.

When a crawler visits your site, it checks yourdomain.com/robots.txt first before crawling anything else.
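
As a small illustration of where that file lives, here is a sketch that derives the robots.txt URL for any page URL. The helper name robots_url is made up for this example. Note that the file is looked up per scheme, host, and port, so a subdomain has its own robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yourdomain.com/blog/post?id=7"))
# prints https://yourdomain.com/robots.txt
```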

The Basic Syntax

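A robots.txt file is made up of "groups." Each group names a crawler and lists the rules for it, while the Sitemap line stands apart from any group:

User-agent - which crawler the group applies to
Disallow - paths the crawler should NOT visit
Allow - paths explicitly permitted (when Allow and Disallow both match a URL, Google and RFC 9309 apply the most specific rule, and Allow wins ties)
Sitemap - the location of your XML sitemap (applies to the whole file, not to one User-agent group)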

## Allow all crawlers to access everything
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

## Block all crawlers from the admin section
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
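
To see how a crawler might read the second example, here is a minimal sketch using Python's standard urllib.robotparser. The bot name ExampleBot and the sample paths are placeholders. Python's parser is only an approximation of a real search engine: it may resolve overlapping rules differently from Google, though for this file the outcome is the same either way.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a placeholder crawler name; it falls under the "*" group.
for path in ("/", "/blog/hello", "/admin/users", "/dashboard/"):
    allowed = rules.can_fetch("ExampleBot", "https://yourdomain.com" + path)
    print(f"{path}: {'allowed' if allowed else 'blocked'}")

# site_maps() requires Python 3.8 or newer.
print(rules.site_maps())
```

The first two paths come back as allowed and the last two as blocked, and the sitemap URL is reported separately from the crawl rules.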

Generate Your Robots.txt File

Robots.txt Generator: create a robots.txt file to guide search engine crawlers.

What You Should and Shouldn't Block

Block these paths (common examples):

/admin/ — Admin panels should never be indexed
/login/ and /register/ — Auth pages don't add SEO value
/checkout/ — E-commerce checkout pages
/api/ — API endpoints aren't pages
/search? — Internal search result pages create duplicate content issues
/wp-admin/ — WordPress admin directory

Don't block these (common mistakes):

Your CSS and JavaScript files — Googlebot needs to render your pages. Blocking /static/ or /assets/ prevents Google from seeing how your pages look.
Pages you want indexed — sounds obvious, but copy-paste errors happen
Your entire site — Disallow: / blocks everything. This is occasionally correct but usually catastrophic
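
One way to catch these mistakes before deploying is to test a draft against the URLs that must stay crawlable. The sketch below uses Python's standard urllib.robotparser. The DRAFT rules, the MUST_CRAWL list, and the blocked_urls helper are all made up for the example. Python's parser does not handle wildcard patterns the way Google does, so treat this as a sanity check rather than a replica of Googlebot.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs from `urls` that `user_agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical draft that blocks the asset directory by mistake.
DRAFT = """\
User-agent: *
Disallow: /admin/
Disallow: /assets/
"""

# Hypothetical list of URLs that search engines must be able to reach.
MUST_CRAWL = [
    "https://yourdomain.com/",
    "https://yourdomain.com/assets/site.css",
    "https://yourdomain.com/assets/app.js",
    "https://yourdomain.com/blog/",
]

for url in blocked_urls(DRAFT, "Googlebot", MUST_CRAWL):
    print("blocked by draft:", url)
```

With this draft, the CSS and JavaScript files are reported as blocked, which is exactly the rendering problem described above.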

Crawl Budget Optimization and the Crawl-delay Directive

When search engines crawl your site, they allocate a "crawl budget": a limit on how many pages they will fetch during a given time period. For very large e-commerce sites or news portals with millions of pages, crawl budget optimization is critical. If Googlebot spends its budget crawling low-value filter pages (e.g., ?color=red&size=large), it might miss your newly published, high-value blog posts. Managing your crawl budget helps search engines reach your most valuable, revenue-generating content sooner, rather than wasting time on infinite pagination loops or obsolete archive directories.
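
To find out which paths attract the most parameter variations, you could count distinct query strings per path in a list of crawled URLs. The sketch below assumes you already have such a list (for example, extracted from server logs). The CRAWLED sample and the helper name are invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def query_variants_by_path(urls):
    """Map each path to the number of distinct query strings seen for it."""
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.query:  # ignore URLs without parameters
            variants[parts.path].add(parts.query)
    return {path: len(queries) for path, queries in variants.items()}


# Hypothetical sample of URLs a crawler requested.
CRAWLED = [
    "https://yourdomain.com/shoes?color=red&size=large",
    "https://yourdomain.com/shoes?color=red&size=small",
    "https://yourdomain.com/shoes?color=blue&size=large",
    "https://yourdomain.com/blog/new-post",
    "https://yourdomain.com/search?q=boots",
]

counts = query_variants_by_path(CRAWLED)
for path, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{path}: {count} parameter variants")
```

Paths with a high variant count are the first candidates to review for a Disallow rule.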

Disallowing parameter combinations that can expand without limit is the most direct way to conserve crawl budget in robots.txt. Separately, the Crawl-delay directive can slow down aggressive bots that would otherwise overload your server.

The Crawl-delay directive asks a bot to wait a given number of seconds between requests. It is a non-standard extension, so support varies by crawler.

User-agent: Bingbot
Crawl-delay: 10

In the example above, Bingbot is asked to wait 10 seconds after each page fetch before requesting the next one. Spacing requests out this way reduces the load a single crawler puts on your server.
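
If you write your own crawler, honoring this directive could look roughly like the sketch below, which reads the delay with Python's standard urllib.robotparser and pauses between requests. The helpers delay_for and polite_crawl are hypothetical names, and the demo swaps the real sleep for a print so it runs instantly.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10
"""


def delay_for(robots_txt, user_agent, default=0.0):
    """Return the Crawl-delay in seconds for user_agent, or `default` if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def polite_crawl(urls, delay, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            sleep(delay)
        fetch(url)


delay = delay_for(ROBOTS_TXT, "Bingbot")
polite_crawl(
    ["https://yourdomain.com/a", "https://yourdomain.com/b"],
    delay,
    fetch=lambda url: print("fetch", url),
    # Demo only: print instead of sleeping; omit this argument to really wait.
    sleep=lambda seconds: print(f"wait {seconds:g}s"),
)
```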

Important Note on Googlebot: Google does not support the Crawl-delay directive and ignores it in robots.txt. The crawl-rate limiter that used to be available in Google Search Console was retired in early 2024. If Googlebot is hitting your server too hard, Google's documented short-term remedy is to return 500, 503, or 429 status codes until the load eases. Support among other search engines varies: Bing honors Crawl-delay, but check each crawler's documentation rather than assuming the rule is respected.

The Relationship with Meta Tags

Robots.txt controls crawler access — whether a bot can visit a URL at all. But you can also control indexing at the page level using HTML meta tags:

<!-- Tell Google: index this page but don't follow links on it -->
<meta name="robots" content="index, nofollow">

<!-- Tell Google: don't index this specific page -->
<meta name="robots" content="noindex">

The meta tag approach is more precise — it applies to a single page. Robots.txt rules apply to entire path patterns.
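
To check which directives a page actually carries, you could parse its HTML and collect the robots meta values. This is a minimal sketch built on Python's standard html.parser. The class and function names and the sample PAGE are invented for the example, and it only looks at the generic name="robots" tag, not crawler-specific ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page source.
PAGE = '<html><head><meta name="robots" content="index, nofollow"></head></html>'
print(sorted(robots_directives(PAGE)))
```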

Important Limitations

Robots.txt is advisory, not enforced. Legitimate crawlers respect it. Malicious bots and scrapers don't. Don't rely on robots.txt to protect sensitive content — use proper authentication instead.

Disallowed pages can still appear in search results. If other sites link to a page you've disallowed in robots.txt, Google may still show it in results (without a description). To keep a page out of search, use a noindex meta tag and leave the page crawlable: a crawler that is blocked by robots.txt never fetches the page, so it never sees the noindex.
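
Because a crawler has to fetch a page before it can read a noindex tag, a page that is both disallowed and marked noindex is a configuration conflict. The sketch below flags such pages with Python's standard urllib.robotparser. It assumes you already know which URLs carry noindex; the function name, the rules, and the URL list are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex(robots_txt, user_agent, noindex_urls):
    """Return noindex URLs the crawler may not fetch, so the tag is never seen."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules and a hypothetical list of pages that carry noindex.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

NOINDEX_PAGES = [
    "https://yourdomain.com/checkout/thanks",
    "https://yourdomain.com/old-landing-page",
]

for url in hidden_noindex(ROBOTS_TXT, "Googlebot", NOINDEX_PAGES):
    print("noindex will not be seen on:", url)
```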

Crawl budget matters on large sites. For sites with thousands of pages, blocking low-value URLs (search result pages, filter combinations, duplicate content) with robots.txt helps Google spend its crawl budget on your important pages.

Regional and Compliance Considerations

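EU (GDPR): Private user account pages, order history, and personal data URLs must never be accessible to crawlers. Authentication is what actually enforces this. Robots.txt only asks well-behaved crawlers to stay away, and because the file is public, it should never be the only thing standing between a crawler and personal data.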
For all markets: Internal search result pages create duplicate content problems globally — blocking them with robots.txt improves crawl efficiency and prevents content dilution.

Frequently Asked Questions

Do I need a robots.txt file if I want everything indexed?

Not strictly: if the file is missing, crawlers assume everything may be crawled. It's still worth including one with Allow: / and your sitemap URL, because it makes your intent explicit and points crawlers to your sitemap.

Can I block a specific crawler but allow Google?

Yes. Use separate User-agent blocks:

User-agent: Bingbot
Disallow: /

User-agent: Googlebot
Allow: /
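
You can sanity-check a per-crawler setup like this with Python's standard urllib.robotparser. The sketch below is only an approximation of how real crawlers match their names, and the /pricing URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# DuckDuckBot has no group of its own and there is no "*" group,
# so nothing in this file restricts it.
for bot in ("Bingbot", "Googlebot", "DuckDuckBot"):
    allowed = parser.can_fetch(bot, "https://yourdomain.com/pricing")
    print(bot, "allowed" if allowed else "blocked")
```

Bingbot is reported as blocked, while Googlebot and the unlisted DuckDuckBot are allowed. To restrict every crawler you haven't named, add a User-agent: * group.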

Will blocking a URL in robots.txt remove it from Google's index?

No. Disallowing a URL prevents Google from crawling it, but if the URL is already indexed, it can stay in search results. To remove an indexed page, add a noindex meta tag and keep the page crawlable so Google can see the tag, or use Google Search Console's removal tool, which hides the URL temporarily.

Does robots.txt protect my private data?

No. It tells well-behaved crawlers not to access certain paths, but it doesn't add any authentication or security. Anyone can still access those URLs directly, and because robots.txt itself is public, anyone can read the paths it lists.

Does FluxToolkit store my robots.txt configuration?

No. The generator runs entirely in your browser. Your configuration is never sent to our servers.