Personal Research Into Character Encoding - Those Funny Characters Explained

This blog post follows on from the post that I wrote whilst I was digging around the net for good info on ‘screen scraping’.

The only thing that alerted me to the whole matter of ‘character encoding issues on the net’ was that I had ventured into the realm of the ‘browsers’ by, in effect, trying to write my own robot to crawl and ‘screen scrape’ web sites. I stepped off the shoulders of the internet giants - Macromedia, Mozilla, Microsoft etc - and was floundering beside them.

The key problem is that the software makers had been shielding me, the web developer, from character encoding issues by dealing with them for me. For example, when text is submitted to my website via a web form, IE or Firefox will do its best to ensure that this text is not ‘corrupted’ during the transfer. By writing a screen scraper I became responsible for these issues but had very little understanding of them. Hence, I was mixing character encodings - a classic schoolboy error - and creating silly problems for myself, e.g. outputting unreadable text like: É – on my site.
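To make the ‘mixing encodings’ mistake concrete, here is a minimal sketch in Python (not PHP, purely for illustration). It assumes the scraped text arrived as UTF-8 bytes and was then read as if it were Windows-1252; the function name misread_as_cp1252 is made up for this example and is not from any library.

```python
# Hypothetical illustration: UTF-8 bytes read back as Windows-1252 (cp1252).

def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as cp1252.

    Note: a few byte values are undefined in cp1252, so this sketch only
    works for characters whose UTF-8 bytes avoid them.
    """
    return text.encode("utf-8").decode("cp1252")

garbled_e = misread_as_cp1252("\u00c9")     # "É" is the two bytes C3 89 in UTF-8
garbled_dash = misread_as_cp1252("\u2013")  # en dash is the three bytes E2 80 93

print(garbled_e)     # Ã‰
print(garbled_dash)  # â€“

# The damage is reversible only while the wrong decoding is still known:
restored = garbled_e.encode("cp1252").decode("utf-8")
print(restored)      # É
```

Each byte of a multi-byte UTF-8 character gets shown as its own single-byte character, which is why one accented letter turns into two or three ‘funny’ ones.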

As a self-taught web developer of many years, I am embarrassed to have spent so long without even the most basic understanding of character encodings.

I then read an article by Joel Spolsky, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”. It was a wake-up call. IT IS ESSENTIAL READING!

The following part of this post is a collection of ‘copy and paste links’ and snippets that I have stored for my personal future reference - but may be of use to you too.

In fact most of this stems from the writings of the famous Harry Fuecks…

SitePoint article on PHP and UTF-8
http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/

Quote from Harry’s response to a comment: “If there are two key points to getting it in PHP I’d say it’s to consider PHP’s problem
- http://www.phpwact.org/php/i18n/charsets#php_s_problem_with_character_encoding then look closely at the table
here: http://en.wikipedia.org/wiki/UTF-8#Description - examine the 0’s and 1’s it’s describing. Eventually it will fall into place.”
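For anyone who wants to see those 0’s and 1’s directly, here is a small Python sketch (my own illustration, not something from Harry’s article) that prints the UTF-8 bytes of a character in binary. The helper name utf8_bits is hypothetical.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated binary."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))

# One-, two- and three-byte examples: "A", "É" and the euro sign.
for ch in ["A", "\u00c9", "\u20ac"]:
    print(f"U+{ord(ch):04X}", utf8_bits(ch))

# U+0041 01000001
# U+00C9 11000011 10001001
# U+20AC 11100010 10000010 10101100
```

Compare the output with the Wikipedia table: a single-byte character starts with 0, the first byte of a two-byte sequence starts with 110, the first byte of a three-byte sequence starts with 1110, and every following (continuation) byte starts with 10.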

That article led to:
PDF slide show outlining PHP issues
http://www.webtuesday.ch/_media/meetings/utf-8_survival.pdf?id=meetings%3A20060808&cache=cache

Which led to many useful links:

MUST READ: “A tutorial on character code issues”
This document tries to clarify the concepts of character repertoire, character code, and character encoding especially in the Internet context.
http://www.cs.tut.fi/~jkorpela/chars.html

Do you know your character encodings?
http://www.sitepoint.com/blogs/2006/03/15/do-you-know-your-character-encodings/

UTF-8 site
http://www.utf-8.com/

http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
“This is the first of a three-part essay on modern character string processing for computer programmers. Here I explain and illustrate the methods for storing Unicode characters in byte sequences in computers, and discuss their advantages and disadvantages. These methods have well-known names like UTF-8 and UTF-16.”

Handling UTF-8 with PHP
http://www.phpwact.org/php/i18n/utf-8

Lists of UTF-8 dangerous PHP functions
http://www.phpwact.org/php/i18n/utf-8
http://wiki.silverorange.com/UTF-8_Notes
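As I understand it, the functions on these lists are ‘dangerous’ because PHP’s plain string functions work on bytes rather than characters. The sketch below is a rough Python analogue, not PHP: a bytes object stands in for a byte-oriented PHP string, which is an assumption made only for illustration.

```python
text = "caf\u00e9"             # "café": four characters
raw = text.encode("utf-8")     # bytes stand in for a byte-oriented string

char_count = len(text)         # 4 characters
byte_count = len(raw)          # 5 bytes, because "é" takes two bytes in UTF-8

# Cutting at a byte offset can split a multi-byte character in half.
truncated = raw[:4]            # ends with only the first byte of "é"
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, cut_is_valid)  # 4 5 False
```

The same mismatch is what bites when a byte-counting length or substring function is applied to UTF-8 text: lengths come out too large and cuts can land in the middle of a character.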

DokuWiki UTF8 conversion helper
http://wiki.splitbrain.org/wiki:utf8update

Interesting stuff
General reading
http://www.phpwact.org/php/i18n/charsets

Set of UTF-8 helper functions for PHP
http://phputf8.sourceforge.net/

PHP UTF-8 project page
http://sourceforge.net/projects/phputf8
“PHP UTF-8 is a UTF-8 aware library of functions mirroring
PHP’s own string functions. Does not require PHP mbstring
extension though will use it, if found, for a (small)
performance gain.”

One man’s story of encoding woe:
http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss

– that’s enough, Ed

