Regular Expressions for Filtering BOT Traffic?
-
I've set up a filter to remove bot traffic from Analytics. I relied on regular expressions posted in an article, which eliminate what appears to be most of them.
However, there are other bots I would like to filter, but I'm having a hard time determining the regular expressions for them.
How do I determine the regular expressions for additional bots so I can add them to the filter?
I read an Analytics "how to," but it's over my head, and I'm hoping for some "dumbed down" guidance.
-
No problem; feel free to reach out if you have any other RegEx-related questions.
Regards,
Chris
-
I will definitely do that for Rackspace bots, Chris.
Thank you for taking the time to walk me through this and tweak my filter.
I'll give the site you posted a visit.
-
If you copy and paste my RegEx, it will filter out the rackspace bots. If you want to learn more about Regular Expressions, here is a site that explains them very well, though it may not be quite kindergarten speak.
-
Crap.
Well, I guess the vernacular is what I need to know.
Knowing what to put where is the trick, isn't it? Is there a dummies' guide somewhere that spells this out in kindergarten speak?
I could really see myself botching this filtering business.
-
Not unless there's a . after the word "servers" in the name. The \ is escaping the . at the end of stumbleupon inc. so it matches a literal period.
-
Does it need the . before the )?
-
Ok, try this:
^(microsoft corp|inktomi corporation|yahoo! inc.|google inc.|stumbleupon inc.|rackspace cloud servers)$|gomez
I just added rackspace as another match; it should work if the name is exactly right.
Hope this helps,
Chris
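If you want to sanity-check the pattern before saving it as a filter, here's a minimal sketch, assuming you have Python handy. The sample provider names are placeholders, and the periods are escaped with \ so they only match a literal "." (see the note about escaping further down the thread):

```python
import re

# The filter pattern from the post above, with the periods escaped so that
# "inc\." only matches a literal "inc." in the provider name.
pattern = re.compile(
    r"^(microsoft corp|inktomi corporation|yahoo! inc\.|google inc\."
    r"|stumbleupon inc\.|rackspace cloud servers)$|gomez"
)

# Hypothetical service provider names as they might appear in the ISP report.
samples = [
    "rackspace cloud servers",      # exact name -> filtered
    "rackspace cloud servers llc",  # extra text -> kept (anchored part fails)
    "gomez networks",               # contains "gomez" -> filtered (unanchored)
    "comcast cable",                # kept
]

for name in samples:
    verdict = "filtered" if pattern.search(name) else "kept"
    print(f"{name}: {verdict}")
```

The ^( ... )$ part only matches when the provider name is exactly one of the listed strings, while |gomez catches "gomez" anywhere in the name.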
-
Agreed! That's why I suggest using it in combination with the variables you mentioned above.
-
rackspace cloud servers
Maybe my problem is I'm not looking in the right place.
I'm in Audience > Technology > Network, and the column shows "Service Provider."
-
How is it titled in the ISP report exactly?
-
For example:
Since I implemented the filter four days ago, rackspace cloud servers have visited my site 848 times, visited 1 page each time, spent 0 seconds on the page, and bounced 100% of the time.
What is the regular expression for rackspace?
-
Time on page can be a tricky one because sometimes actual visits can record 00:00:00 due to the way it is measured. I'd recommend using other factors like the ones I mentioned above.
-
"...a combination of operating system, location, and some other factors can do the trick."
Yep, combined with those, look for "Avg. Time on Page = 00:00:00"
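If you'd rather eyeball this outside the Analytics UI, here's a rough sketch that builds a shortlist from a CSV export of the ISP report. This is just an illustration: the file name and column headers below are assumptions about the export format, so adjust them to whatever your download actually contains.

```python
import csv

suspects = set()
with open("isp_report.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Assumed column names: "Service Provider", "Avg. Time on Page", "Bounce Rate"
        zero_time = row["Avg. Time on Page"].strip() == "00:00:00"
        full_bounce = float(row["Bounce Rate"].strip().rstrip("%")) >= 100.0
        if zero_time and full_bounce:
            suspects.add(row["Service Provider"])

print("Providers worth a closer look:")
for name in sorted(suspects):
    print("  " + name)
```

As noted above, 00:00:00 on its own isn't proof of a bot, so treat the output as a list to cross-check against operating system and location rather than something to filter automatically.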
-
Ok, can you provide some information on the bots that are getting through this filter that you want to sort out? If they can be filtered by ISP organization like the ones in your current RegEx, you can simply add them to the list: (microsoft corp| ... ... |stumbleupon inc.|ispnamefromyourbots|ispname2|etc.)$|gomez
Otherwise, you might need to get creative and find another way to isolate them (a combination of operating system, location, and some other factors can do the trick). When adding to the list, make sure to escape special characters like . or / by using a \ before them, or else your RegEx will fail.
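To see why that \ matters, here's a tiny sketch (Python rather than Analytics, purely for illustration): an unescaped . matches any single character, so the pattern can also catch a made-up lookalike name.

```python
import re

# "." unescaped: matches ANY single character in that position.
loose = re.compile(r"^stumbleupon inc.$")
# "\." escaped: matches only a literal period.
strict = re.compile(r"^stumbleupon inc\.$")

for name in ["stumbleupon inc.", "stumbleupon incx"]:  # the second name is invented
    print(name, "| loose:", bool(loose.match(name)), "| strict:", bool(strict.match(name)))

# loose matches both names; strict matches only the real one.
```

Escaping is the safer habit because the escaped pattern can't accidentally match a similar-looking provider name.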
-
Sure. Here's the post for filtering the bots.
Here's the RegEx posted: ^(microsoft corp|inktomi corporation|yahoo! inc.|google inc.|stumbleupon inc.)$|gomez
-
If you give me an idea of how you are isolating the bots, I might be able to help come up with a RegEx for you. What is the RegEx you have in place to sort out the other bots?
Regards,
Chris
Related Questions
-
How to handle sorting, filtering, and pagination in ecommerce? Canonical is enough?
Hello, after reading various articles and watching several videos I'm still not sure how to handle faceted navigation (sorting/filtering) and pagination on my ecommerce site.
Current indexation status:
"Real" pages (from my sitemap): 2,000
Google Search Console (Valid): 8,000
Google Search Console (Excluded): 44,000
Additional info: the vast majority of those 50k additional pages (44 + 8 - 2) are created by sorting, filtering and pagination. Example of how the URL changes while applying filters/sorting: example.com/category --> example.com/category/1/default/1/pricefrom/100. Every additional page is canonicalized properly, yet as you can see 6k are still indexed. When I enter site:example.com/category in Google it returns at least several results (in most cases the main page is in the 1st position). In Google Analytics I can see that ~1.5% of Google traffic comes to the sorted/filtered pages. The number of pages indexed daily (from GSC stats) is 3,000.
And so I have a few questions:
1. Is it OK to have those additional pages indexed, or will the "real" pages rank higher if the additional ones were not indexed?
2. If it's better not to have them indexed, should I add "noindex" to sorting/filtering links, or add e.g. Disallow: /default/ in robots.txt? Or perhaps add "noindex, nofollow" to the links? Google would then have 50k fewer pages to crawl, but perhaps it'd somehow impact my rankings in a negative way?
3. As sorting/filtering is not based on URL parameters, I can't add it in GSC. Is there another way of doing that for this filtering/sorting URL structure?
Thanks in advance, Andrew
Intermediate & Advanced SEO | thpchlk
-
Site under attack from Android SEO bots - expert help needed
For the last 25 days we have been facing a weird attack on our site. We are getting 10x the normal mobile traffic, all from Android, searching for our name specifically. We are sure this is not authentic traffic, as it comes in through organic searches and bounces straight off. Initially we thought this was a DDoS attack, but that does not seem to be the case. It looks like someone is trying to damage our Google reputation by performing too many searches and bouncing off. Has anyone else faced a similar issue before? What can be done to mitigate the impact on the site? (FYI: we get ~2M visits month on month, 80% from Google organic searches.) Any help would be highly appreciated.
Intermediate & Advanced SEO | KJ_AV
-
Should I noindex the site search page? It is generating 4% of my organic traffic.
I read about some recommendations to noindex the URL of the site search. I checked in Analytics that the site search URL generated about 4% of my total organic search traffic (<2% of sales). My reasoning is that site search may create duplicate content issues and may prevent the more relevant product or category pages from showing up instead. Would you noindex this page or not? Any thoughts?
Intermediate & Advanced SEO | lcourse
-
What to do after a sudden drop in traffic on May 8?
Hello, I own Foodio54.com, which provides restaurant recommendations (mostly for the US). I apologize in advance for the lengthy questions below, but we're not sure what else to do.
On May 8 we first noticed a dip in Google results; however, the full impact of this sudden change was masked by an increase in Mother's Day traffic and is only today fully apparent. It seems as though we've lost between 30% and 50% of our traffic. We have received no notices in Google Webmaster Tools of any unnatural links, nor do we engage in link buying or anything else that's shady, and we have no reason to believe this is a manual action. I have several theories, and I was hoping to get feedback on them or anything else that anyone thinks could be helpful.
1. We have a lot of pictures of restaurants, each picture has its own page, and aside from the image these pages are very similar. I decided to put a noindex,follow on the picture pages (just last night), especially considering Google's recent changes to image search that send less traffic anyway. Is there any way to remove these faster? There are about 3.5 million of them. I was going to exclude them in robots.txt, but that won't help the ones that are already indexed. Example photo page: http://foodio54.com/photos/trulucks-austin-2143458
2. We recently (within the last 2 months) got menu data from SinglePlatform, which also provides menus to UrbanSpoon, Yelp, and many others. We were worried that adding a page just for menus that was identical to what is on Urbanspoon et al. would just be duplicate content, so we added these inline on our listing pages. We've added menus to about 200k listings.
A. Is Google considering the entire listing page duplicate content because the menu is identical to everyone else's?
B. If it is, should we move the menus to their own pages and just exclude them with robots.txt? We have an idea of how to make these menus unique for us, but it's going to be a while before we can create enough content to make that worthwhile. Example listing with menu: http://foodio54.com/restaurant/Austin-TX/d66e1/Trulucks
3. Anything else?
Thank you in advance. Any insight anyone in the community has would be greatly appreciated. --Mike Van Heyde
Intermediate & Advanced SEO | MikeVH
-
Will Google bots crawl tablet optimized pages of our site?
We are in the process of creating a tablet experience for a portion of our site. We haven't yet decided whether we will use a single URL structure for pages that have a tablet experience or create separate URLs that can only be accessed by tablet users. Either way, will the tablet versions of these pages/URLs be crawled by Google bots?
Intermediate & Advanced SEO | kbbseo
-
Best way to handle traffic from links brought in from old domain.
I've seen many versions of answers to this question both in the forum and throughout the internet; however, none of them seem to specifically address this particular situation. Here goes: I work for a company that has a website (www.example.com) but has also operated under a few different names in the past. I discovered that a friend of the company was still holding onto one of the domains that belonged to an older version of the company (www.asample.com), and he was kind enough to transfer it into our account.
My first reaction was to simply 301 redirect the older domain to the newer one. After I did this, I discovered that there were still quite a few active and very relevant links to that domain. Upon reporting this to the company owners, they became concerned that a customer might feel misdirected by clicking www.asample.com and having www.example.com pop up. So I built a single page on the old domain explaining that www.asample.com is now called www.example.com, with a link.
We recently did a little housecleaning and moved all of our online holdings "under one roof," so to speak, and when our rep was going over things with the owners he exclaimed that this was a horrible idea, and that the domain should instead have its own hosting account with WordPress (or some other CMS) installed and a few pages of content about the companies/subject posted.
So the question: which of these is the most beneficial to the site and business that are currently operating (www.example.com)? I don't see a real problem with any of these answers, but I do see a potentially unneeded expense in the third solution if a simple 301 will bring the most value. Has anyone else dealt with a situation like this?
Intermediate & Advanced SEO | modulusman
-
Drop in Traffic on Friday April 20th.
Just curious if anyone noticed a drop in traffic last Friday. I got hammered with about a 20% drop overall. Didn't know if there was an update or what. Thanks in advance!
Intermediate & Advanced SEO | astahl11
-
Fading Text Links Look Like Spammy Hidden Links to a g-bot?
Ah, hello Mozzers, it's been a while since I was here. Wanted to run something by you... I'm looking to incorporate some fading text using JavaScript on a site homepage, using the method described here: http://blog.thomascsherman.com/2009/08/text-slideshow-or-any-content-with-fades/
So my question is: does anyone think that Google might see this text as a possible dark hat SEO anchor text manipulation (similar to hidden links)? The text will contain various links (4 or 5) that cycle through one another, fading in and out, but to a bot the text may appear initially invisible, like so: style="display: none;"><a href="">Link Here</a>
All links will be internal. My gut instinct is that I'm just being stupid here, but I wanted to stay on the side of caution with this one! Thanks for your time 🙂
Intermediate & Advanced SEO | PeterAlexLeigh