Google Crawling Database-Driven Pages
July 26th, 2005
Here’s a neat little trick I figured out a little over a year ago with a couple of my PHP-driven websites. I noticed that Google and other search engines weren’t picking up some of my new database-driven content pages, while other database-driven pages were picked up and updated regularly. Of course we’ve all read that the big search engines are making an effort to include these pages today, but coverage is far from complete, even by the search engines’ own admissions. So maybe this trick isn’t that useful anymore, but I thought I would share it in case I’m the only one who figured it out.
The pages that were picked up by the search engines were my “state navigation pages”, and their URLs typically looked something like: database.php?state=CA&sort=name. The pages that weren’t getting picked up (which just so happened to be the bulk of my content) were those with a URL similar to: trail.php?id=14. Can you spot the pattern? As far as I could tell, the search engines weren’t interested in pages generated through numeric keys, but they were interested in pages whose URLs contained only letters (and question marks, ampersands, periods, etc.). I quickly threw together a little code to convert my number references to alpha characters, digit by digit: 1 = a, 2 = b, etc., so 14 becomes ad. I prolly could have used hex or something like that but I’m no computer scientist. So now a request for trail.php?id=ad was translated in my PHP script to call the trail data associated with primary key 14.
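For anyone who wants to try the same thing, here is a rough sketch of that digit-to-letter conversion. My original was a few lines of PHP; this version is in Python, and the function names `encode_id` and `decode_id` are made up for illustration. The post only pins down 1 = a through 9 = i, so mapping 0 to j is an assumption on my part here.

```python
# Per-digit mapping described in the post: 1 -> a, 2 -> b, ..., 9 -> i.
DIGIT_TO_LETTER = {str(d): chr(ord("a") + d - 1) for d in range(1, 10)}
# ASSUMPTION: the post never says how 0 was handled; "j" is just a placeholder choice.
DIGIT_TO_LETTER["0"] = "j"
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def encode_id(numeric_id: int) -> str:
    """Turn a numeric primary key into its letters-only form, e.g. 14 -> 'ad'."""
    if numeric_id < 0:
        raise ValueError("ids must be non-negative")
    return "".join(DIGIT_TO_LETTER[ch] for ch in str(numeric_id))


def decode_id(alpha_code: str) -> int:
    """Turn a letters-only code back into the numeric primary key."""
    try:
        return int("".join(LETTER_TO_DIGIT[ch] for ch in alpha_code))
    except (KeyError, ValueError):
        # KeyError: a character outside a-j; ValueError: an empty string.
        raise ValueError(f"not a valid alpha code: {alpha_code!r}") from None


print(encode_id(14))    # ad
print(decode_id("ad"))  # 14
```

Links on the site would be written with `encode_id`, and the page script would run the incoming `id` through `decode_id` before querying the database.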
Wonder of wonders, within a week all of my trail pages were picked up by Google. I’ve since taken most references to my alpha-code down, since Google now indexes the bulk of my pages regardless of the URL. I understand that search engines most prefer pages with static looking URLs (they must think the more ghetto and unsophisticated your site is, the better the info), but I’d be interested to know how pages under my alpha scheme stack up to those under the numeric scheme (I’d guess they’re equal now but who knows?). What about a page whose URL looks like /trails/14 versus a page whose URL is masked using my method: /trails/ad? Clearly the more data a site has, the more likely it is to use some kind of content database, yet these sites are penalized under current algorithms (unless they are sophisticated enough to look ghetto by building a static looking scheme like the /trails/14 example above). Crazy.
My scripts still accept either the alpha-code or the numeric id, but I’ve removed all the alpha-code links from the site (as far as I know). Interestingly, Google still indexes both versions of the pages (often as supplemental results, but not always!). I’m sure this is bad, and perhaps grounds for banishment from Google, but it’s an interesting result nonetheless. I don’t know of any advantage to having multiple listings of an identical page in a search engine.
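Accepting both forms of the id might look roughly like the sketch below. Again this is Python rather than the PHP my site runs on, `resolve_trail_id` is a hypothetical name, and treating j as 0 is an assumption, since the scheme above only spells out 1 = a through 9 = i.

```python
# Letters back to digits: a -> 1, b -> 2, ..., i -> 9.
LETTER_TO_DIGIT = {chr(ord("a") + d - 1): str(d) for d in range(1, 10)}
LETTER_TO_DIGIT["j"] = "0"  # ASSUMPTION: the post does not say how 0 was encoded.


def resolve_trail_id(raw_id: str) -> int:
    """Accept either a numeric id ('14') or an alpha-code ('ad') from the query string."""
    if raw_id.isascii() and raw_id.isdigit():
        return int(raw_id)
    try:
        return int("".join(LETTER_TO_DIGIT[ch] for ch in raw_id))
    except (KeyError, ValueError):
        # Mixed or unknown characters, or an empty value.
        raise ValueError(f"unrecognised trail id: {raw_id!r}") from None


print(resolve_trail_id("14"))  # 14
print(resolve_trail_id("ad"))  # 14
```

Because both spellings resolve to the same primary key, trail.php?id=14 and trail.php?id=ad serve the identical page, which is exactly how the duplicate listings end up in the index.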
