Robots.txt wildcards - the devs had a disagreement - which is correct?
-
Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”
The second developer suggested that this wildcard would only block URLs featuring a ? that come immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 BUT argued that this robots.txt directive would not block URLS featuring a ? in sub directories - e.g. /shirts/blue?mprice=100&maxp=20
So which of the developers is correct?
Beyond that, I assumed that the ? should feature a * on each side of it – for example - /? - to work as intended above? Am I correct in assuming that?
-
Thanks Logan - much appreciated, as ever - that really helps - if I was to add another * to **Allow: /?resultspage= > so **Allow: /?*resultspage= - what would happen then? ****
-
Ok, gotcha. Add the following directives:
Disallow: /shirts/?
This prevents crawling of the following:
- /shirts**/golden/**?minprice=10&maxprice=20
- /shirts/?minprice=10&maxprice=20
Allow: /*?resultspage=
Allows crawling of the following:
- /shirts/navy/?resultspage=02
- /shirts/?resultspage=01
-
Thanks Logan - much appreciated - the aim would be to prevent bots crawling any parameter'd URL but only in the products section, and not all of them - see below.
I noticed the shirt URLs can be produce many pages of results - e.g. if you look for a type of shirt you can get up to 20 pages of results - the resulting URLs also feature a ?
So you end up with - for example - /shirts/?resultspage=01 and then /shirts/?resultspage=02 or shirts/navy/?resultspage=01 and /shirts/navy/?resultspage=02 - and so on - and it would be good to index them somehow. So I wonder how I can override disallow parameters robots.txt instruction only for specific paths and even individual pages?
-
Disallow: /shirts/?* will only block URLs that end with /shirts/ before beginning a parameter string. If you want to block /shirts**/golden/**?minprice=10&maxprice=20 you'll have to add the asterisk before and after the ?
What the end goal here? Preventing bots from crawling any parameter'd URL?
-
I suppose the nub of the disagreement is this: would Disallow: /shirts/?* block /shirts/?minprice=10&maxprice=20 and also block URLS further down the URL directory structure - e.g. /shirts/mens/navyblue/?minprice=10&maxprice=20 ?
-
Thanks Logan - the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”
If I amended the URL to
/shirts/?minprice=10&maxprice=20 would robots.txt work as intended right there?and would that robots.txt work as intended further down the directory structure of the URLs? E.g.
/shirts**/golden/**?minprice=10&maxprice=20 -
Hi Luke,
The second developer is correct....well, more correct than the first. Your example of /shirts?minprice=10&maxprice=20 would not be blocked by this direction, since there's no slack after shirts.
For future reference, you can test how directives function in Google Search Console. Under the 'Crawl' menu, there's a robots.txt tester in which you can manually edit the robots.txt directives (they don't apply to the live file) and enter test URLs to see which directive, if any, would prevent crawling.
You are correct in your assumption that a * on either side of the ? would prevent crawling of both /shirts/blue?mprice=100&maxp=20 and /shirts/?minprice=10&maxprice=20
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
If my website do not have a robot.txt file, does it hurt my website ranking?
After a site audit, I find out that my website don't have a robot.txt. Does it hurt my website rankings? One more thing, when I type mywebsite.com/robot.txt, it automatically redirect to the homepage. Please help!
Intermediate & Advanced SEO | | binhlai0 -
Robots.txt Help
I need help to create robots.txt file. Please let me know what to add in the file. any real example or working example.?
Intermediate & Advanced SEO | | Michael.Leonard0 -
Robots txt is case senstive? Pls suggest
Hi i have seen few urls in the html improvements duplicate titles Can i disable one of the below url in the robots.txt? /store/Solar-Home-UPS-1KV-System/75652
Intermediate & Advanced SEO | | Rahim119
/store/solar-home-ups-1kv-system/75652 if i disable this Disallow: /store/Solar-Home-UPS-1KV-System/75652 will the Search engines scan this /store/solar-home-ups-1kv-system/75652 im little confused with case senstive.. Pls suggest go ahead or not in the robots.txt0 -
Regular Expression / Wildcard Redirect Situation
I am dealing with an interesting situation. Here's what's going on: Current URLs Example1:
Intermediate & Advanced SEO | | NakulGoyal
www.domain.com/red-widgets-cid-1234.html
www.domain.com/red-widgets-cid-1234-1.html
www.domain.com/red-widgets-cid-1234-1-1.html Canonical on All Above URLs:
www.domain.com/red-widgets-cid-1234.html New URL:
www.domain.com/red-widgets-cid-4567.html Current URLs Example2:
www.domain.com/red-widgets-cid-1234+10.html
www.domain.com/red-widgets-cid-1234+10-1.html
www.domain.com/red-widgets-cid-1234+10-1-1.html Canonical on All Above URLs:
www.domain.com/red-widgets-cid-1234+10.html New URL:
www.domain.com/red-widgets-cid-6789.html I want to make sure all variations of the above URL redirect to the new url. What wildcard 301 redirect / regular expression can I use to tackle these ?0 -
How to Disallow Tag Pages With Robot.txt
Hi i have a site which i'm dealing with that has tag pages for instant - http://www.domain.com/news/?tag=choice How can i exclude these tag pages (about 20+ being crawled and indexed by the search engines with robot.txt Also sometimes they're created dynamically so i want something which automatically excludes tage pages from being crawled and indexed. Any suggestions? Cheers, Mark
Intermediate & Advanced SEO | | monster990 -
Robots.txt unblock
I'm currently having trouble with what appears to be a cached version of robots.txt. I'm being told via errors in my Google sitemap account that I'm denying Googlebot access to the entire site. I uploaded clean and "Allow" robots.txt yesterday, but receive the same error. I've tried "Fetch as Googlebot" on the index and other pages, but still the error. Here is the latest: | Denied by robots.txt |
Intermediate & Advanced SEO | | Elchanan
| 11/9/11 10:56 AM | As I said, there in not blocking on the robots.txt for 24 hours. HELP!0 -
Negative impact on crawling after upload robots.txt file on HTTPS pages
I experienced negative impact on crawling after upload robots.txt file on HTTPS pages. You can find out both URLs as follow. Robots.txt File for HTTP: http://www.vistastores.com/robots.txt Robots.txt File for HTTPS: https://www.vistastores.com/robots.txt I have disallowed all crawlers for HTTPS pages with following syntax. User-agent: *
Intermediate & Advanced SEO | | CommercePundit
Disallow: / Does it matter for that? If I have done any thing wrong so give me more idea to fix this issue.0 -
XML Sitemap instruction in robots.txt = Worth doing?
Hi fellow SEO's, Just a quick one, I was reading a few guides on Bing Webmaster tools and found that you can use the robots.txt file to point crawlers/bots to your XML sitemap (they don't look for it by default). I was just wondering if it would be worth creating a robots.txt file purely for the purpose of pointing bots to the XML sitemap? I've submitted it manually to Google and Bing webmaster tools but I was thinking more for the other bots (I.e. Mozbot, the SEOmoz bot?). Any thoughts would be appreciated! 🙂 Regards, Ash
Intermediate & Advanced SEO | | AshSEO20110