Robots.txt wildcards - the devs had a disagreement - which is correct?

McTaggart

Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”

The second developer suggested that this wildcard would only block URLs featuring a ? that comes immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 - BUT argued that this robots.txt directive would not block URLs featuring a ? in subdirectories - e.g. /shirts/blue?mprice=100&maxp=20

So which of the developers is correct?

Beyond that, I assumed that the ? should feature a * on each side of it – for example - /*?* - to work as intended above? Am I correct in assuming that?

McTaggart

Thanks Logan - much appreciated, as ever - that really helps - if I was to add another * to Allow: /*?resultspage= so that it becomes Allow: /*?*resultspage= - what would happen then?

LoganRay

Ok, gotcha. Add the following directives:

Disallow: /shirts/*?*

This prevents crawling of the following:

/shirts/golden/?minprice=10&maxprice=20
/shirts/?minprice=10&maxprice=20

Allow: /*?resultspage=

Allows crawling of the following:

/shirts/navy/?resultspage=02
/shirts/?resultspage=01
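
To illustrate how an Allow rule can override a broader Disallow, here is a rough Python sketch. It assumes Google-style matching: * matches any run of characters, a trailing $ anchors the end, the longest matching pattern wins, and Allow wins a tie. The helper names (matches, can_crawl) and the /shirts/navy/ test URL are made up for illustration, and the Disallow rule is assumed to be written as /shirts/*?*. Other crawlers may resolve conflicts differently, so treat it as a sketch rather than a reference implementation.

```python
import re


def matches(pattern, url_path):
    """Google-style match: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape each literal piece (so '?' stays literal) and join pieces with '.*'.
    regex = ".*".join(re.escape(part) for part in core.split("*"))
    return re.match(regex + ("$" if anchored else ""), url_path) is not None


def can_crawl(rules, url_path):
    """Return True if crawling is allowed.

    Assumption: the longest matching pattern wins and Allow wins a tie.
    URLs that match no rule are allowed.
    """
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if not matches(pattern, url_path):
            continue
        is_allow = kind.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


rules = [
    ("Disallow", "/shirts/*?*"),
    ("Allow", "/*?resultspage="),
]

for url in (
    "/shirts/navy/?resultspage=02",
    "/shirts/?resultspage=01",
    "/shirts/golden/?minprice=10&maxprice=20",
    "/shirts/?minprice=10&maxprice=20",
    "/shirts/navy/",  # hypothetical URL with no parameters
):
    print(url, "->", "crawl" if can_crawl(rules, url) else "blocked")
```

In this sketch the Allow pattern is longer than the Disallow pattern, which is why the resultspage URLs come out as crawlable while the price-filter URLs stay blocked. Note that /*?resultspage= only matches when resultspage is the first parameter after the ?; under the same assumptions, a second wildcard as in /*?*resultspage= would also match a URL where it appears later in the query string.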

McTaggart

Thanks Logan - much appreciated - the aim would be to prevent bots crawling any parameter'd URL but only in the products section, and not all of them - see below.

I noticed the shirt URLs can produce many pages of results - e.g. if you look for a type of shirt you can get up to 20 pages of results - the resulting URLs also feature a ?

So you end up with - for example - /shirts/?resultspage=01 and then /shirts/?resultspage=02 or /shirts/navy/?resultspage=01 and /shirts/navy/?resultspage=02 - and so on - and it would be good to index them somehow. So I wonder how I can override the robots.txt disallow rule for parameters, but only for specific paths and even individual pages?

LoganRay

Disallow: /shirts/?* will only block URLs that end with /shirts/ before beginning a parameter string. If you want to block /shirts/golden/?minprice=10&maxprice=20 you'll have to add the asterisk before and after the ?

What's the end goal here? Preventing bots from crawling any parameter'd URL?

McTaggart

I suppose the nub of the disagreement is this: would Disallow: /shirts/?* block /shirts/?minprice=10&maxprice=20 and also block URLs further down the URL directory structure - e.g. /shirts/mens/navyblue/?minprice=10&maxprice=20 ?

McTaggart

Thanks Logan - the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”

If I amended the URL to
/shirts/?minprice=10&maxprice=20 would robots.txt work as intended right there?

and would that robots.txt work as intended further down the directory structure of the URLs? E.g.
/shirts/golden/?minprice=10&maxprice=20

LoganRay

Hi Luke,

The second developer is correct... well, more correct than the first. Your example of /shirts?minprice=10&maxprice=20 would not be blocked by this directive, since there's no slash after shirts.

For future reference, you can test how directives function in Google Search Console. Under the 'Crawl' menu, there's a robots.txt tester in which you can manually edit the robots.txt directives (they don't apply to the live file) and enter test URLs to see which directive, if any, would prevent crawling.

You are correct in your assumption that a * on either side of the ? would prevent crawling of both /shirts/blue?mprice=100&maxp=20 and /shirts/?minprice=10&maxprice=20
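
If it helps to see the difference side by side, here is a minimal Python sketch of the wildcard matching described above. It assumes Google-style rules: * matches any run of characters (including none), a trailing $ anchors the end of the URL, everything else (including ?) is literal, and patterns match from the start of the path. The function names are hypothetical and this is not any crawler's actual code.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes Google-style matching: '*' is a wildcard, a trailing '$'
    anchors the end, and all other characters are literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces so '?' is not treated as a regex operator.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_matched(pattern, url_path):
    """True if the pattern matches the URL path from its first character."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


urls = [
    "/shirts?minprice=10&maxprice=20",
    "/shirts/?minprice=10&maxprice=20",
    "/shirts/blue?mprice=100&maxp=20",
]

for pattern in ("/shirts/?*", "/shirts/*?*"):
    for url in urls:
        print(f"{pattern:12} {url:34} {is_matched(pattern, url)}")
```

Under these assumptions, /shirts/?* only matches the URL where the ? directly follows /shirts/, while /shirts/*?* also matches the subdirectory URL. Neither pattern matches /shirts?minprice=10&maxprice=20, because both require the slash after shirts.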
