What’s a URL to do? – How to Save URLs
The World Wide Web is an apt analogy. We’re all spiders spinning threads of links. Some people spin their threads with blogs, while others do it with social bookmarking sites like del.icio.us, Digg, reddit or Netscape (see them all at popurls).
One thing that bothers me about these social bookmarking sites is that they don’t do a good job of knowing when two links point to the same document. Ignoring the malicious users who purposely try to resubmit something using slightly different links, there are the flaws in the social bookmarking sites themselves.
As an example I’m going to look at one of my blog posts that has been saved to del.icio.us in at least 10 different ways. It’s crazy how many ways a URL [wiki] can be saved.
1682 people saved it using the trailing slash
132 people saved it with no trailing slash and the named anchor “holygrail”
36 people saved it with a bad URL
25 people saved it without the trailing slash
5 people saved it with the named anchor “comments”
4 people saved it with a different bad URL
4 people saved it with the trailing slash and the named anchor “holygrail”
2 people saved it with a trailing query string
… and several more people saved cached copies of the document, or versions served through an anonymizer or translator proxy.
There are several reasons why the same document could be referenced by different URLs.
Easy to Fix Mistakes
Trailing slash – Remove the trailing slash from the end of the URL.
Query string – Remove it if it isn’t taking any arguments.
Named anchors – Remove completely before storing. The named anchor can be used with or without the trailing slash.
- http://internetducttape.com#comments vs http://internetducttape.com/#comments
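The easy fixes above can all be applied mechanically before storing a URL. Here’s a minimal sketch using Python’s standard library; the exact rules (always dropping the fragment, the trailing slash, and a bare "?") are my own assumptions about what a bookmarking site should do, not how del.icio.us actually works:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_easy(url: str) -> str:
    """Apply the 'easy to fix' cleanups: trailing slash, empty query, anchor."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Named anchors (#comments, #holygrail) point into the same document.
    fragment = ""
    # Treat /foo/ and /foo as the same resource; "/" collapses to "".
    path = path.rstrip("/")
    # A bare "?" parses to an empty query, which urlunsplit simply drops.
    return urlunsplit((scheme, netloc, path, query, fragment))
```

With this, `http://internetducttape.com/post/`, `http://internetducttape.com/post#comments`, and `http://internetducttape.com/post/#holygrail` all collapse to the same key, while a query string that actually carries arguments (`?id=3`) is left alone.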
Hard to Fix Mistakes
For these mistakes it comes down to the website itself: it should provide a redirect to the canonical URL and/or avoid creating duplicate ways of accessing the same content.
Index file – different web server software uses different names for the index file, but if you request the directory the server will serve the index file automatically.
WWW prefix – there is a movement to get rid of using www. at the start of domain names [wiki].
Useless query string – on some sites you can append any query string to the end of a URL and it will be ignored.
Duplicate pages – through poor planning the web developer can create multiple URLs for the same content.
- e.g. http://internetducttape.com/tag/engtech-blogging-how-tos-tech-news-and-reviews/
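Some of the "hard" cases can still be normalized with heuristics, though only the site itself can do it reliably (a few sites really do serve different content with and without "www.", and index file names depend on server configuration). A sketch under those caveats; the list of index file names is my assumption:

```python
from urllib.parse import urlsplit, urlunsplit

# Common index file names; the real set depends on the web server's config.
INDEX_FILES = {"index.html", "index.htm", "index.php", "default.asp"}

def normalize_hard(url: str) -> str:
    """Heuristic cleanups: strip the www. prefix and a trailing index file."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # WWW prefix: treat www.example.com and example.com as the same host.
    if netloc.lower().startswith("www."):
        netloc = netloc[4:]
    # Index file: /dir/index.html is what the server returns for /dir/ anyway.
    head, _, tail = path.rpartition("/")
    if tail in INDEX_FILES:
        path = head + "/"
    return urlunsplit((scheme, netloc, path, query, fragment))
```

Because these rules can be wrong for a given site, a cautious bookmarking service might use them only to *suggest* that two saved URLs are duplicates rather than silently merging them.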
The moral of the story: when developing websites, always try to use semantic URLs [wiki] that are cruft-free, not too long, and readable by humans. And when writing an application that uses URLs as keys to store data, make sure you clean them first.
Who knows where they’ve been?