Robots.txt to disallow /index.php/ path
-
Hi SEOmoz,
I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?.
I don't use that extension, but would it cause me any problems from an SEO perspective?
How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
-
Hi Cyrus,
Thanks for your reply!
Unfortunately the problem is yet to be fixed, I hope that my disallow will work shortly.
It seems that most of the index.php links to each other internally (and from old /index.php/ pages that no longer exist), which is super weird. How google found them does not make any sense to me.
I don't beleive that external sources are linking to these pages either - I mean, how would they find these links anyway?.
-
Hi Mikkel,
Like Chris, I spidered your site and couldn't find any links to /index.php files, which probably indicates one of two things:
- You've fixed the problem - Yay!
- Or Google is finding those links from external sources
- Google found those links at one time in the past, and is still trying to crawl them.
In the Crawl Errors report in Google Webmaster Tools, if you click on the link of each 404, there's often a "linked from" source where you can see where Google discovered the broken link. This is really helpful in rooting out the cause.
Regardless, I'm going to go with #1 and optimistically believe that you were able to fix the problem.
-
If I spider your site I'm not seeing any /index.php urls. Does that mean you did get Joomla to cooperate with your rewriting?
Or was your problem that you'd previously had urls indexed with /index.php/ paths and you needed to remove them?
-
Hi Mikkel, I have checked your robots.txt, it looks perfect. If you redirect /index.php to home page that using httaccess file or by using any joomla plugin that would great for you. And its also a permanent solution.
-
Well, I tried the sensible solution and redirecting to the correct URL instead. However the SEF program is quite limited and keep on creating new URLs regardless of my modification. Im looking for a more permanent solution, and the disallow seems at bit simple as I'm not a super programmer.
By the way - thanks for quick replys, kudos to both of you!
-
Sure, the website in question is www.vauni.dk
I don't think that there is any inbound links to the index.php pages. They are not easily found.
-
Couldn't you rewrite those /index.php/ urls to remove the /index.php/?
Like this in .htaccess:
RewriteRule ^(.*)$ /index.php/$1 [L]
Only used Joomla once, but there must be a way to configure joomla to just use "/" instead of "/index.php/"?
Update:
Here's a solution to your /index.php/ issue:
http://www.eprcreations.com/remove-index-php-from-joomla-urls/
Once you've updated that, and have your urls working properly without the /index.php/, you could add this slight modification of the rewrite rule above so that all your old /index.php/ urls would be 301'd to your new ones:
RewriteRule ^(.*)$ /index.php/$1 [R=301,L]
Put it underneath the RewriteBase / line they describe in that post.
-
Hi Mikkel,
Do you inbound link pointing to you index.php pages ? If yes, then it might affect your seo. Disallow: /index.ph/ is perfect but after implementing it don't inter link those index.php pages. Can you share me your website URL so that I can show you with example. How to do it.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Google Sites website https://www.opcfitness.com/ title NOT GOOD FOR SEO
We set up a website https://www.opcfitness.com/home on google sites. but google sites page title not good for SEO. How to fix it?
Technical SEO | | ahislop5740 -
How can I make it so that robots.txt is not ignored due to a URL re-direct?
Recently a site moved from blog.site.com to site.com/blog with an instruction like this one: /etc/httpd/conf.d/site_com.conf:94: ProxyPass /blog http://blog.site.com
Technical SEO | | rodelmo4
/etc/httpd/conf.d/site_com.conf:95: ProxyPassReverse /blog http://blog.site.com It's a Wordpress.org blog that was set as a subdomain, and now is being redirected to look like a directory. That said, the robots.txt file seems to be ignored by Google bot. There is a Disallow: /tag/ on that file to avoid "duplicate content" on the site. I have tried this before with other Wordpress subdomains and works like a charm, except for this time, in which the blog is rendered as a subdirectory. Any ideas why? Thanks!0 -
Z-indexed content
I have some content on a page that I am not using any type of css hiding techniques, but I am using an image with a higher z-index in order to prevent the text from being seen until a user clicks a link to have the content scroll down. Are there any negative repercussions for doing this in regards to SEO?
Technical SEO | | cokergroup0 -
Have I constructed my robots.txt file correctly for sitemap autodiscovery?
Hi, Here is my sitemap: User-agent: * Sitemap: http://www.bedsite.co.uk/sitemaps/sitemap.xml Directories Disallow: /sendfriend/
Technical SEO | | Bedsite
Disallow: /catalog/product_compare/
Disallow: /media/catalog/product/cache/
Disallow: /checkout/
Disallow: /categories/
Disallow: /blog/index.php/
Disallow: /catalogsearch/result/index/
Disallow: /links.html I'm using Magento and want to make sure I have constructed my robots.txt file correctly with the sitemap autodiscovery? thanks,0 -
Removing indexed website
I had a .in TLD version of my .com website floated for about 15 days, which was a duplicate copy of .com website. I did not wish to use the .in further for SEO duplication reasons and had let the .in domain expire on 26th April. But still now when I search from my website the .in version also shows up in results and even in google webmaster it shows the the website with maximum (190) number of links to my .com website. I am sure this is hurting the ranking of my .com website. How can the .in website be removed from googles indexing and search results. Given that is has expired also. thanks
Technical SEO | | geekwik0 -
Can I rely on just robots.txt
We have a test version of a clients web site on a separate server before it goes onto the live server. Some code from the test site has some how managed to get Google to index the test site which isn't great! Would simply adding a robots text file to the root of test simply blocking all be good enough or will i have to put the meta tags for no index and no follow etc on all pages on the test site also?
Technical SEO | | spiralsites0 -
/index.php/ page
I was wondering if my system creates this page www my domain com/index.php/ is it better to block with robot.txt or just canonize?
Technical SEO | | ciznerguy0 -
How do you disallow HTTPS?
I currently have a site (startuploans.org) that runs everything as http, recently we decided to start an online application to process loan apps. Now, for one certain section we configured ssl to work (https://www.startuploans.org/secure/). If I go to the HTTPS url for any of my other pages they show up...I was going to just 301 everything from https but because it is in a subdirectiory I can't... Also, canonical URL's won't work either because it's a totally different system and the pages are generated in an odd manor. It's really just 1 page that needs to be disallowed.. Is there any way to disallow all HTTPS requests from robots.txt while keeping all the HTTP requests working as normal?
Technical SEO | | WebsiteConsultants0