URL Rewriting With PHP
October 9, 2008 – 9:06 pm
Checking Google Webmaster Tools data for one of the web sites I finished recently, I found that all the URLs in Google’s site listing had PHPSESSID=ae334597413160fd8e2a3979a84840ef tacked onto the end of the query string. I had set up the site to use PHP sessions to keep track of shopping basket items, and it seems that when the browser doesn’t accept cookies, PHP adds the session ID to the end of the URL as a query string parameter.
This is a problem, as you really don’t want everyone who arrives via a Google link to pick up the same session that Googlebot got when it indexed the site. It also doesn’t look good in the Google listing.
After a bit of digging around on the web, I hit on a couple of solutions. Firstly, to prevent it happening again, I set the PHP configuration setting session.use_only_cookies to 1. I did this from PHP using ini_set, partly because it didn’t seem to work when I put it in .htaccess and I couldn’t work out from the documentation whether it should have worked or not. Doing it from within the code, using ini_set, had the added benefit that it won’t get overlooked if I move the site to a different server.
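As a rough sketch of that first fix (not the site’s actual code), the setting can be applied at the top of the script, before the session starts. The function name configureCookieOnlySessions is invented for this example, and the extra session.use_trans_sid line is an assumption that goes beyond what is described above: it is the PHP setting that controls whether session IDs get written into URLs. The code also assumes PHP 7 or later for the return type declaration.

```php
<?php
// Hypothetical helper (the name is invented for this example).
// Call it before session_start(): session settings can no longer be
// changed once the session is active.
function configureCookieOnlySessions(): bool
{
    // Only accept session IDs from cookies, never from the URL.
    $cookiesOnly = ini_set('session.use_only_cookies', '1');

    // Assumption: also switch off transparent session ID rewriting,
    // so PHP never appends PHPSESSID to links.
    $noTransSid = ini_set('session.use_trans_sid', '0');

    // ini_set() returns false when a setting could not be changed.
    return $cookiesOnly !== false && $noTransSid !== false;
}

configureCookieOnlySessions();
// session_start();  // start the session afterwards, as usual
```

Keeping this in code rather than in .htaccess means it travels with the site. The .htaccess route (php_flag or php_value) depends on PHP running as an Apache module, which may be one reason it can appear not to work on some hosts.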
Secondly, I wanted to rewrite the URL by returning a 301 Moved Permanently header. This removes the session ID from the URL for human visitors, and it should encourage Google to eventually forget the PHPSESSID URLs in its index and replace them with the correct URLs.
I spent several hours struggling to make sense of the vast array of bad documentation kicking around on the web about URL rewriting using .htaccess statements. Firstly I tried to get RedirectMatch to work, but I gave up after struggling with that for ages. Then I tried mod_rewrite, using RewriteCond and RewriteRule statements, but I gave that up in the end too.
I’m not sure why I couldn’t get either of those to work. I read several web pages about them, including examples of code that was supposed to do exactly what I was trying to do. Some of the examples were obviously wrong, and none of the docs gave a clear and comprehensive explanation of the way those statements and their associated regexes work. I’m quite comfortable with regular expressions (although I do find some of the Perl-type regexes a bit baffling, I have to admit), but I still couldn’t work out exactly what I was doing wrong. It might have helped if I could have found a clear explanation of Perl regexes and of the way the mod_rewrite statements work, but the problem seems to be that the people who understand them can’t communicate with humans.
In the end, after a few hours of pulling my hair out, I finally had a flash of inspiration. I couldn’t make head or tail of these damn .htaccess methods of doing redirections, but I knew how to do it in PHP! It didn’t take me long to work out how to do exactly what I wanted with a few PHP statements. I tested it and it worked first time, after all that!
Here’s the PHP code I used to do the job:
// check if we've been called with PHPSESSID in the URL query string
if ( isset( $_GET['PHPSESSID'] )) {
    // find the start position of the PHPSESSID string
    $sessidPos = strpos( $_SERVER['REQUEST_URI'] , "PHPSESSID" );
    // create a new URI without the PHPSESSID string - assume it's at the
    // end of the query string (which it should be)
    $newURI = substr( $_SERVER['REQUEST_URI'] , 0 , $sessidPos - 1 );
    // and send back a 301 Moved Permanently header
    header( "Location: $newURI" , TRUE , 301 );
    exit;
}
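The code above assumes PHPSESSID is the last thing in the query string, so anything after it would be cut off too. If that assumption ever turned out to be wrong, a variation along these lines might be safer. This is only a sketch: stripSessionId is a made-up helper name, it assumes PHP 7 or later, and it rebuilds the query string with http_build_query, so parameter encoding may change slightly along the way.

```php
<?php
// Hypothetical helper (not part of the original code): returns $uri with
// the session ID parameter removed, wherever it sits in the query string.
function stripSessionId(string $uri, string $param = 'PHPSESSID'): string
{
    $parts = parse_url($uri);
    if ($parts === false || !isset($parts['query'])) {
        // Nothing to strip (or the URI could not be parsed): leave it alone.
        return $uri;
    }

    // Split the query string into an array and drop the session ID.
    // Note: parse_str() turns dots and spaces in parameter names into
    // underscores, so unusual names may not round-trip exactly.
    parse_str($parts['query'], $query);
    unset($query[$param]);

    $path = $parts['path'] ?? '/';

    // Re-attach the remaining parameters, if there are any left.
    return $query ? $path . '?' . http_build_query($query) : $path;
}
```

It would slot into the same place as before: inside the isset( $_GET['PHPSESSID'] ) check, pass stripSessionId( $_SERVER['REQUEST_URI'] ) to the Location header instead of the substr result.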
Note (if you’re not fully familiar with PHP): this code must run before any output is sent to the browser. If there has been any output at all (even a blank line before the ‘<?php’ tag), the header() call won’t work, unless output buffering is switched on.
What I also need to do is check for Googlebot and a number of other regularly visiting spiders, and not try to start a session with them in the first place. But that can wait till I’m not so busy!
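For what it’s worth, a first stab at that spider check might look something like the following. It is untested against real traffic: isLikelyCrawler is a made-up function name, the list of user-agent fragments is only illustrative and far from complete, and user-agent strings can be faked, so it is a convenience rather than a security measure. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: guesses whether a request comes from a search
// engine spider by looking for known fragments in the User-Agent header.
function isLikelyCrawler(string $userAgent): bool
{
    // Illustrative list only; real spiders use many more names.
    $fragments = ['googlebot', 'bingbot', 'slurp'];

    foreach ($fragments as $fragment) {
        // stripos() is case-insensitive and returns false when not found.
        if (stripos($userAgent, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

// Possible usage: only start a session for visitors that don't look
// like spiders.
// if (!isLikelyCrawler($_SERVER['HTTP_USER_AGENT'] ?? '')) {
//     session_start();
// }
```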