Partial Match or RegEx in Search Console's URL Parameters Tool?

Ria_

So I currently have approximately 1000 of these URLs indexed, when I only want roughly 100 of them.

Let's say the URL is www.example.com/page.php?par1=ABC123=&par2=DEF456=&par3=GHI789=

All the indexed URLs follow that same kinda format, but I only want to index the URLs that have a par1 of ABC (but that could be ABC123 or ABC456 or whatever). Using URL Parameters tool in Search Console, I can ask Googlebot to only crawl URLs with a specific value. But is there any way to get a partial match, using regex maybe?

Am I wasting my time with Search Console, and should I just disallow any page.php without par1=ABC in robots.txt?

Andy.Drinkwater

No problem

Hope you get it sorted!

-Andy

Ria_

Thank you!

Ria_

Haha, I think the train passed the station on that one. I would have realised eventually... XD

Thanks for your help!

DirkC

Don't forget that . and ? have a specific meaning within regex - if you want to use them for pattern matching you will have to escape them. Also be aware that not all bots are capable of interpreting regex in robots.txt - you might want to be more explicit on the user agent - only using regex for Googlebot.

User-agent: Googlebot
#disallowing page.php and any parameters after it
disallow: /page.php
#but leaving anything that starts with par1=ABC
allow: /page.php?par1=ABC
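
For anyone who wants to sanity-check rules like these outside the Tester, here is a rough Python sketch of the matching as Google's documentation describes it. Rules are path prefixes rather than full regex: only * (any characters) and $ (end of URL) are special, so . and ? match literally. The longest matching rule wins, and Allow wins a tie. The function names and the sample URLs are made up for illustration, and this only approximates Googlebot's behaviour.

```python
import re

def rule_to_regex(pattern):
    """Turn a robots.txt path pattern into a compiled regex (illustrative helper)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped '*' back into "any characters".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", pattern) pairs.

    The longest matching pattern wins; on a tie, allow wins.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for kind, pattern in rules:
        if rule_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [("disallow", "/page.php"), ("allow", "/page.php?par1=ABC")]
print(is_allowed(rules, "/page.php?par1=ABC123=&par2=DEF456="))  # True
print(is_allowed(rules, "/page.php?par1=XYZ999=&par2=DEF456="))  # False
```

Because the rules match from the start of the path, the Allow line only helps when par1 is the first parameter in the URL.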

Dirk

Andy.Drinkwater

Ah sorry I missed that bit!

-Andy

Andy.Drinkwater

"Disallowing them would be my first priority really, before removing from index."

The trouble with this is that if you disallow first, Google won't be able to crawl the pages to act on the noindex. If you add a noindex flag first, Google won't index them the next time it comes a-crawling, and then you will be good to disallow.

I'm not actually sure of the best way for you to get the noindex into the page header of those pages though.

-Andy

Ria_

Yep, have done. (Briefly mentioned in my previous response.) Doesn't pass.

Ria_

I thought so too, but according to Google the trailing wildcard is completely unnecessary, and only needs to be used mid-URL.

Ria_

Hi Andy,

Disallowing them would be my first priority really, before removing from index. Didn't want to remove them before I've blocked Google from crawling them in case they get added back again next time Google comes a-crawling, as has happened before when I've simply removed a URL here and there. Does that make sense or am I getting myself mixed up here?

My other hack of a solution would be to check the URL in page.php, and if the URL doesn't include par1=ABC then insert a noindex meta tag. (Not sure if that would work well or not...)
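
A rough sketch of that check, in Python purely for illustration (page.php itself would need the PHP equivalent). It assumes the goal from the top of the thread, keeping only par1 values that start with ABC in the index, so the noindex tag goes on every other URL. The function name and the KEEP_PREFIX constant are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

KEEP_PREFIX = "ABC"  # assumption: par1 values starting with this stay indexable

def robots_meta_tag(url):
    """Return a noindex meta tag for URLs that should drop out of the index, else ''."""
    query = parse_qs(urlsplit(url).query)
    par1_values = query.get("par1", [])
    # Prefix check, so ABC123, ABC456 and so on all count as "keep".
    keep = any(value.startswith(KEEP_PREFIX) for value in par1_values)
    return "" if keep else '<meta name="robots" content="noindex">'

print(robots_meta_tag("http://www.example.com/page.php?par1=XYZ999=&par2=DEF456="))
```

The tag only does anything while Google can still crawl the page, so this would need to be in place before any disallow rule goes into robots.txt.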

Martijn_Scheijbeler

My guess would be that this line needs an * at the end.
Allow: /page.php?par1=ABC*

Andy.Drinkwater

Sorry Martijn, just to jump in here for a second - Ria, you can test this via the robots.txt testing tool in Search Console before going live to make sure it works.

-Andy

Ria_

Hi Martijn, thanks for your response!

I'm currently looking at something like this...

user-agent: *
#disallowing page.php and any parameters after it
disallow: /page.php
#but leaving anything that starts with par1=ABC
allow: /page.php?par1=ABC

I would have thought that you could disallow things broadly like that and give an exception, as you can with files in disallowed folders. But it's not passing Google's robots.txt Tester.

One thing that's probably worth mentioning really is that there are only two variables that I want to allow of the par1 parameter. For example's sake, ABC123 and ABC456. So would need to be either a partial match or "this or that" kinda deal, disallowing everything else.
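
As far as Google's documentation goes, robots.txt has no "this or that" syntax, so one way to express it is one Allow line per value under the broad Disallow. Here is a small Python sketch that writes those lines out. The function name is made up, and it assumes par1 is always the first parameter in the URL, because the rules match from the start of the path.

```python
def build_rules(script, param, values, user_agent="*"):
    """Build a robots.txt group: block the script, then allow one line per wanted value."""
    lines = [f"User-agent: {user_agent}", f"Disallow: {script}"]
    lines += [f"Allow: {script}?{param}={value}" for value in values]
    return "\n".join(lines)

print(build_rules("/page.php", "par1", ["ABC123", "ABC456"]))
```

Each Allow line is still a prefix match, so par1=ABC123 would also let through a value like ABC1234. Worth running the result through the Tester before going live.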

Andy.Drinkwater

Hi Ria,

I have never tried regular expressions in this way, so I can't tell you if this would work or not.

However, if all 1000 of these URLs are already indexed, just disallowing access won't then remove them from Google. You would ideally place a noindex tag on those pages and let Google act on them; then you will be good to disallow. I am pretty sure there is no option to noindex under the URL Parameters tool.

I hope that makes sense?

-Andy

Martijn_Scheijbeler

Hi Ria,

What you could do, though it depends on the rest of your site structure, is disallow these URLs based on the parameters. In a worst-case scenario you could disallow all of the URLs and then add an Allow exception to make sure the right URLs are still being indexed.

Martijn.
