Robots.txt Testing Checklist for Custom Website Launches

Collin D Johnson
What robots.txt can and cannot do

Robots.txt gives crawlers instructions about which paths they may request. It sits at the root of a host, such as https://example.com/robots.txt, and uses user-agent rules to allow or disallow crawling.

It does not secure private content. Anyone can open the file. If a staging URL, private PDF, or internal tool needs protection, use authentication, access control, or remove it from the public web.

It also does not guarantee deindexing. A blocked URL can still appear in search results if Google discovers it through links or other signals. If you need a page out of the index, use a noindex directive or a removal request, and keep the page crawlable until Google has seen that instruction.

For launch QA, treat robots.txt as a crawl control file. Do not treat it as a security tool or an index cleanup tool.

The pre-launch robots.txt checklist

1. Test the exact production hostname

Check the robots.txt file on the host that will serve the live site.

Test these if they apply:

  • https://example.com/robots.txt
  • https://www.example.com/robots.txt
  • Any regional, app, docs, or marketing subdomain with indexable content

Robots.txt rules attach to a specific protocol, host, and port. A clean file on www does not fix a bad file on the root domain.
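As a minimal sketch of that scoping rule, the helper below derives the robots.txt URL that governs any given page URL. The function name `robots_url` and the `docs.example.com:8443` host are illustrative assumptions, not part of any tool named in this article.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # netloc keeps the host and any explicit port, so each
    # scheme/host/port combination maps to its own robots.txt.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical launch hosts: each one needs its own check.
for url in (
    "https://example.com/pricing",
    "https://www.example.com/blog/post?utm=x",
    "https://docs.example.com:8443/guide",
):
    print(url, "->", robots_url(url))
```

Running this over every hostname in the launch plan gives you the list of files to open, one per host.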

If the launch includes a domain cutover, test the file again after DNS, redirects, and CDN rules settle. Teams often validate the preview URL, then miss the production host.

2. Remove staging blocks from production

Many builds use a broad staging rule:

```txt
User-agent: *
Disallow: /
```

That rule makes sense on a preview environment. It can wreck a launch on production.

Before launch, confirm that the production file does not include a full-site block unless you have a deliberate reason. Check the final deployed file, not a local template or old CMS setting.

Also check for environment-specific robots files in the repo, CMS, hosting platform, and edge middleware. The file that reaches the browser wins.
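One way to catch a leftover staging block is to ask a parser whether the root path is crawlable. The sketch below uses Python's standard `urllib.robotparser`; the function name `blocks_root` and the two sample files are assumptions for illustration. In practice you would pass in the text of the file fetched from the production host.

```python
from urllib.robotparser import RobotFileParser


def blocks_root(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True when robots.txt tells user_agent not to request '/'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A blocked root path is the usual signature of a staging rule
    # such as "Disallow: /" that reached production.
    return not parser.can_fetch(user_agent, "/")


STAGING_FILE = "User-agent: *\nDisallow: /\n"
PRODUCTION_FILE = "User-agent: *\nDisallow: /admin/\n"

print(blocks_root(STAGING_FILE))
print(blocks_root(PRODUCTION_FILE))
```

Python's parser does not implement the `*` and `$` path wildcards that some search engines support, so treat the result as a first pass and not as a model of any specific crawler.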

3. Keep admin and system paths blocked

A normal marketing site may need to block CMS admin paths, search result pages, internal API routes, cart utilities, or parameter-heavy pages.

Examples:

```txt
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search
```

Keep these rules specific. Broad patterns create collateral damage. Robots.txt rules match by path prefix, so Disallow: /app also blocks /applications, /apply, and any other public path that starts with /app.

Review each disallow rule against the final sitemap and navigation.
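The prefix behavior is easy to see with a small experiment. This sketch uses Python's standard `urllib.robotparser`; the helper name `blocked_paths` and the sample paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(rule_path, paths):
    """Return the paths that a single 'Disallow: rule_path' rule blocks."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule_path}"])
    return [p for p in paths if not parser.can_fetch("*", p)]


PATHS = ["/app/login", "/applications", "/apply", "/about"]

# Without a trailing slash the rule matches every path starting with /app.
print(blocked_paths("/app", PATHS))   # ['/app/login', '/applications', '/apply']
# With a trailing slash it only matches paths inside the /app/ directory.
print(blocked_paths("/app/", PATHS))  # ['/app/login']
```

The trailing slash is often the whole difference between blocking one directory and blocking several public sections.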

4. Confirm the XML sitemap path

If robots.txt lists a sitemap, open the exact URL.

```txt
Sitemap: https://example.com/sitemap.xml
```

Then check three things:

  • The sitemap returns a 200 status.
  • The sitemap uses the production domain.
  • The listed URLs are indexable pages, not staging URLs or redirected leftovers.

This matters because sitemap and robots rules often ship from different parts of the stack. The site generator may create the sitemap. The hosting layer may serve robots.txt. The CMS may store SEO settings. Someone needs to compare the final outputs.
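The production-domain check can be sketched as a small script. It assumes a standard sitemap using the sitemaps.org namespace and works on the sitemap text you have already downloaded; `off_host_urls` and the `staging.example.com` entry are illustrative names, not output from a real site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def off_host_urls(sitemap_xml: str, production_host: str) -> list:
    """Return sitemap URLs whose hostname is not the production host."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return [u for u in locs if urlsplit(u).hostname != production_host]


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://staging.example.com/pricing</loc></url>
</urlset>"""

print(off_host_urls(SAMPLE_SITEMAP, "example.com"))
```

This only covers the domain check. Status codes and redirected leftovers still need a live request against each URL.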

5. Compare robots.txt against the sitemap

Open the sitemap and scan the main URL patterns.

Then ask a simple question for each pattern: does robots.txt allow crawlers to request this page?

Check these sections on a B2B or SaaS site:

  • Homepage
  • Product or service pages
  • Pricing or plan pages
  • Comparison pages
  • Approved case studies
  • Blog posts and category pages
  • Resource pages
  • Legal pages

If a URL appears in the sitemap, robots.txt should not block it. If you do not want crawlers to request a URL, remove it from the sitemap and choose the right exclusion method.
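That comparison can be automated once you have the robots.txt text and the list of sitemap URLs. The sketch below is one possible approach using Python's standard `urllib.robotparser`; the function name, the sample rules, and the sample URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls_blocked(robots_txt, sitemap_urls, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in sitemap_urls if not parser.can_fetch(user_agent, u)]


ROBOTS = """User-agent: *
Disallow: /admin/
Disallow: /blog
"""

SITEMAP_URLS = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/blog/launch-notes",
]

# Any URL printed here appears in the sitemap but is blocked from crawling.
print(sitemap_urls_blocked(ROBOTS, SITEMAP_URLS))
```

An empty list is the result you want before launch. Anything else means the sitemap and robots.txt disagree.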

6. Test redirects before crawl rules

A migration can send old URLs through redirects before the crawler reaches the new page. That makes robots testing more than a file check.

Test a sample of old URLs from the redirect map:

  1. Request the old URL.
  2. Follow the redirect chain.
  3. Confirm the final URL returns the right page.
  4. Confirm robots.txt allows the final path.
  5. Confirm the final URL appears in the sitemap when it should.

Do this for high-value old pages first. Start with pages that already rank, pages that attract qualified leads, and pages with external links.
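Before testing against live servers, you can dry-run the redirect map itself. The sketch below models the map as a plain dictionary of old path to new path, which is an assumption about how your map is stored, and checks each final path against robots.txt. It does not make HTTP requests; a live check would request each URL with an HTTP client and read the real status codes.

```python
from urllib.robotparser import RobotFileParser


def resolve_redirect(path, redirect_map, max_hops=5):
    """Follow path through redirect_map and return (final_path, hops)."""
    chain = [path]
    while chain[-1] in redirect_map:
        nxt = redirect_map[chain[-1]]
        if nxt in chain:
            raise ValueError("redirect loop: " + " -> ".join(chain + [nxt]))
        chain.append(nxt)
        if len(chain) - 1 > max_hops:
            raise ValueError("too many hops: " + " -> ".join(chain))
    return chain[-1], len(chain) - 1


def audit_redirects(old_paths, redirect_map, robots_txt, user_agent="*"):
    """Return (old, final, hops, crawl_allowed) for each old path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    report = []
    for old in old_paths:
        final, hops = resolve_redirect(old, redirect_map)
        report.append((old, final, hops, parser.can_fetch(user_agent, final)))
    return report


# Hypothetical redirect map and rules for illustration only.
REDIRECTS = {"/services": "/solutions", "/solutions": "/product", "/plans": "/search/pricing"}
ROBOTS = "User-agent: *\nDisallow: /search\n"

for row in audit_redirects(["/services", "/plans"], REDIRECTS, ROBOTS):
    print(row)
```

In this sample, /plans resolves in one hop but lands on a blocked path, which is exactly the case a file-only check would miss.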

7. Check canonical pages, not templates alone

Template-level QA can miss path-level problems.

A sample blog post may pass while /blog/category/private-notes/ still fails. A product page may pass while /compare/ fails. A pricing page may work on the new route while the old redirected route lands on a blocked path.

Pick real pages from the production sitemap and test those URLs. Include one page from each major template.

8. Confirm noindex and robots rules do not fight

A common mistake: a team blocks a page in robots.txt and also adds a noindex tag to the page.

That sounds safe. It can create a problem.

If the crawler cannot request the page, it may not see the noindex tag. Use robots.txt to manage crawling. Use noindex when you want search engines to remove or exclude a page from the index and can let them crawl it.

For launch decisions, write down the intent for each excluded section:

| Page type | Goal | Better control |
| --- | --- | --- |
| Admin routes | Prevent crawling | robots.txt plus real access control |
| Thank-you pages | Keep out of search results | noindex when crawlable |
| Staging site | Keep private | authentication |
| Faceted search pages | Reduce crawl waste | specific robots rules or canonical strategy |
| Old pages with replacements | Preserve equity | redirects |
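The conflict between a robots.txt block and a noindex tag can be detected mechanically. This sketch assumes you already have the page HTML and the robots.txt text, and it only looks at the robots meta tag, not at an X-Robots-Tag response header. The names `RobotsMetaFinder` and `noindex_is_hidden` are illustrative.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Record whether the page has <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def noindex_is_hidden(robots_txt, path, html, user_agent="*"):
    """True when the page carries noindex but crawlers may not request it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex and not parser.can_fetch(user_agent, path)


PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(noindex_is_hidden("User-agent: *\nDisallow: /thank-you\n", "/thank-you", PAGE))
print(noindex_is_hidden("User-agent: *\nDisallow: /admin/\n", "/thank-you", PAGE))
```

A True result means the two controls fight: the tag exists, but a compliant crawler cannot request the page to read it.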

9. Inspect with Search Console after launch

Pre-launch tools help, but Search Console tells you how Google sees the live site.

After launch, inspect priority URLs:

  • Homepage
  • One service or product page
  • One pricing or conversion page
  • One comparison or BOFU page
  • One blog post
  • One old URL that redirects to a new URL

Look for crawl permission, indexing status, selected canonical URL, and sitemap discovery. If Search Console reports a page as blocked by robots.txt, fix the rule before you ask for indexing again.

10. Keep a robots.txt change log

Teams give robots.txt changes less review than application code. They should treat both as production changes.

A useful change log records:

  • The date
  • The person who changed the file
  • The reason
  • The exact rule added or removed
  • The pages tested after the change

This protects the team later. When organic traffic drops or a page disappears from search, you can rule robots.txt in or out within minutes.
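A change log entry can be generated from the old and new file instead of typed by hand. The sketch below is one possible shape, with hypothetical field names, author, and date. It compares rule lines only and ignores which user-agent group a rule belongs to, so review the output when a file has several groups.

```python
from dataclasses import dataclass, field
from datetime import date


def rule_changes(old_txt, new_txt):
    """Return (added, removed) non-empty lines between two robots.txt versions."""
    old = [line.strip() for line in old_txt.splitlines() if line.strip()]
    new = [line.strip() for line in new_txt.splitlines() if line.strip()]
    added = [line for line in new if line not in old]
    removed = [line for line in old if line not in new]
    return added, removed


@dataclass
class RobotsChange:
    changed_on: date
    changed_by: str
    reason: str
    added: list
    removed: list
    pages_tested: list = field(default_factory=list)


OLD = "User-agent: *\nDisallow: /\n"
NEW = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n"

added, removed = rule_changes(OLD, NEW)
entry = RobotsChange(
    changed_on=date(2024, 1, 15),       # hypothetical date
    changed_by="launch-owner",          # hypothetical author
    reason="Remove staging block for launch",
    added=added,
    removed=removed,
    pages_tested=["/", "/pricing"],
)
print(entry)
```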

A practical launch test sequence

Run this sequence the day before launch:

  1. Open the production robots.txt URL.
  2. Confirm the file returns 200.
  3. Search for Disallow: / and any broad path blocks.
  4. Open the sitemap URL listed in robots.txt.
  5. Test five to ten URLs from the sitemap.
  6. Test the highest-value old redirects.
  7. Confirm crawl permission for final URLs.
  8. Check that staging and preview environments use authentication or remain blocked.
  9. Save a copy of the final robots.txt file in the launch QA notes.
  10. Repeat the checks after DNS and CDN changes are live.

The goal is not a perfect robots.txt file. The goal is a file that matches the launch plan.

Common robots.txt launch mistakes

Leaving the whole site blocked

This starts in staging. A preview environment needs a full-site block, and then the same rule ships to production.

Fix it by separating production and preview robots files in the deployment process. Do not rely on someone remembering to edit one line on launch day.
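One way to separate the files is to select the robots.txt body from the deployment environment and fail the build on anything unexpected. The environment names and the `robots_for` function below are assumptions for illustration; your platform may expose the environment differently.

```python
PRODUCTION_ROBOTS = """User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search
Sitemap: https://example.com/sitemap.xml
"""

PREVIEW_ROBOTS = """User-agent: *
Disallow: /
"""

# Hypothetical environment names; map each one explicitly.
ROBOTS_BY_ENV = {
    "production": PRODUCTION_ROBOTS,
    "staging": PREVIEW_ROBOTS,
    "preview": PREVIEW_ROBOTS,
}


def robots_for(environment: str) -> str:
    """Return the robots.txt body for a named deployment environment."""
    try:
        return ROBOTS_BY_ENV[environment]
    except KeyError:
        # Fail loudly so a typo in the environment name stops the deploy
        # instead of silently shipping the wrong file.
        raise ValueError(f"unknown environment: {environment!r}") from None


print(robots_for("production"))
```

Raising on an unknown environment is a deliberate choice here: a silent default in either direction can block production or expose a preview site.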

Blocking the blog by accident

A rule like Disallow: /blog can block every article. That may happen when a team wants to hide one unfinished post or an old blog archive.

Use narrower rules. Better yet, control unpublished content in the CMS instead of robots.txt.

Listing blocked URLs in the sitemap

This sends mixed instructions. The sitemap says the page matters. Robots.txt says crawlers should not request it.

Fix the source of truth. If the page should rank, allow it. If it should stay out, remove it from the sitemap and choose the right exclusion method.

Hiding staging URLs in robots.txt

A robots file can reveal staging paths because the file is public. If a URL should stay private, protect it with authentication.

Robots.txt asks compliant crawlers to stay away. It does not keep people out.

Ignoring subdomains

Marketing teams often check the main domain and forget docs, app, help, or landing-page subdomains.

If the subdomain has search value, it needs its own robots.txt QA.

Who should own robots.txt QA?

Robots.txt sits between SEO, development, hosting, and content operations. That makes it easy for everyone to assume someone else checked it.

Assign one owner for launch QA. That person does not need to write every rule, but they need authority to stop the launch if production blocks important pages.

For a custom website build, Virdis treats this as part of launch readiness. Robots.txt affects crawl access, sitemap consistency, migration quality, and the trustworthiness of post-launch reporting. It belongs in the launch checklist, not in a cleanup ticket after traffic goes quiet.
