Prioritising URL Normalization Steps for Search Engine Optimisation of a Legacy Site
I have skimmed through a paper on normalization techniques (by Sang Ho Lee, Sung Jin Kim, and Seok Hoo Hong):
See: http://dblab.ssu.ac.kr/publication/LeKi05a.pdf
In the paper they describe some extra normalizations that go beyond the standard normalizations specified by the various "standards committees" (RFC 3986, for example).
They tested three types of extra normalization and monitored the effect each one had on getting accurate information from the web.
This paper considers the three URL normalization steps that are beyond the standard URL normalization. Discussed three steps are
* the case sensitivity at the path component of a URL,
* the last slash symbol in the path component of URLs, and
* the designation of a default page.
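To make the three steps concrete, here is a minimal Python sketch of how I read them. It is not the paper's implementation: the function names and the DEFAULT_PAGES list are my own assumptions, and I have assumed the "last slash" step means removing a trailing slash.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: a hypothetical list of default page names. Real servers
# configure their own, which is exactly why step three is hard to guess.
DEFAULT_PAGES = ("index.htm", "index.html", "default.htm")


def lowercase_path(url):
    """Step 1: treat the path component as case-insensitive."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path.lower()))


def strip_trailing_slash(url):
    """Step 2: drop the last slash in the path (the root path '/' is kept)."""
    parts = urlsplit(url)
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    return urlunsplit(parts._replace(path=path))


def strip_default_page(url):
    """Step 3: replace an assumed default page with its directory URL."""
    parts = urlsplit(url)
    head, _, tail = parts.path.rpartition("/")
    if tail.lower() in DEFAULT_PAGES:
        return urlunsplit(parts._replace(path=head + "/"))
    return url
```

For example, strip_default_page("http://example.com/mydirectory/index.htm") gives back "http://example.com/mydirectory/", but only because "index.htm" happens to be in my guessed list.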
One interesting finding is that the first two normalizations can be implemented by search engines without much loss of accuracy. But the third normalization (the designation of a default page) causes a significant number of "fails".
Therefore, you can extrapolate this finding to state:
If I were a search engine trying to crawl and index accurate information from the web, I could implement the first two normalizations on behalf of most sloppy webmasters. But it's best not to implement the third, because the loss of accuracy is too risky. It might also be impossible to fix with any real accuracy, even when combining site-specific knowledge into the mix.
Basically, a search engine does not want to be presenting duplicate data to its users. So, given two web addresses (URLs) that contain similar content, it needs to be able to pick the best one to use. If http://example.com/mydirectory/ and http://example.com/mydirectory/index.htm both have the same content, it might seem obvious to pick the first, shorter URL. However, as mentioned above, the findings in the paper imply that following that rule blindly might lead to a lot of errors.
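A cautious crawler could check the content before trusting the "shorter URL wins" rule. The sketch below is only an illustration under my own assumptions: `pages` is a hypothetical dictionary of already-fetched page bodies keyed by URL, and "same content" is simplified to an exact match, which is stricter than the "similar" content a real search engine would look for.

```python
import hashlib


def content_fingerprint(body):
    """Return a SHA-256 hex digest of the page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def safe_to_merge(pages, url_a, url_b):
    """True only if both URLs were fetched and their bodies are identical.

    `pages` is a hypothetical mapping of URL -> page body (str).
    """
    if url_a not in pages or url_b not in pages:
        return False
    return content_fingerprint(pages[url_a]) == content_fingerprint(pages[url_b])
```

If safe_to_merge returns False for a directory URL and its supposed default page, treating them as one page would be one of the "fails" described above.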
Therefore, if you are overwhelmed by the task of implementing a comprehensive normalization strategy for your site and are looking to prioritise the most important things to fix, then fixing the DUST (different URLs with similar text) issues regarding the "default page" might be one to put at the top of your to-do list.
That is, it might have a bigger impact on SEO than many of the other normalization steps. This, I propose, is because search engines can fix the other steps accurately on your behalf without prior knowledge of your site, whereas only you know for certain which page is your default page.
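If you want a rough idea of how much default-page DUST a legacy site has, something like the following sketch could help. It assumes you already have a list of the site's known URLs (from a sitemap or server logs, say); the DEFAULT_PAGES set and both function names are my own inventions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Assumption: adjust to whatever your server treats as a default page.
DEFAULT_PAGES = {"index.htm", "index.html", "default.htm"}


def directory_form(url):
    """Map a default-page URL to its directory URL; leave others unchanged."""
    parts = urlsplit(url)
    head, _, tail = parts.path.rpartition("/")
    if tail.lower() in DEFAULT_PAGES:
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)


def find_default_page_dust(urls):
    """Group URLs that collapse to the same directory URL.

    Returns only the groups with more than one variant, i.e. the places
    where a directory and its default page are both being linked to.
    """
    groups = defaultdict(set)
    for url in urls:
        groups[directory_form(url)].add(url)
    return {
        canonical: sorted(variants)
        for canonical, variants in groups.items()
        if len(variants) > 1
    }
```

Each group it reports is a candidate for a redirect or a canonical link pointing at a single URL, though you should confirm the pages really are the same before merging them.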
I’d welcome your comments on this issue. Add one NOW or I’ll stick Mentos up your nose and immerse your face in Diet Coke!
Further Reading
- Paper: On URL Normalization, Sang Ho Lee, Sung Jin Kim, and Seok Hoo Hong
- Wikipedia on URL Normalization
- RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
- A guide to SEO Canonicalization and URL Normalization - Be a Normalizer - a C14N Exterminator
