Encoding The Tough Stuff

It turned out that this was the first subscriber whose name contained a character not representable in 7-bit ASCII. The character is one that I can type in my Emacs text editor (using its insert-ascii function) as the integer 232 (hex E8), thusly: è. What you will see, in your browser, depends on the encoding that it’s using. For many of us, that encoding will be ISO-8859-1, and you will see the character whose Unicode number is 00E8, and whose name is LATIN SMALL LETTER E WITH GRAVE: è

But if your encoding is set to ISO-8859-2, you will instead see the character whose Unicode number is 010D, and whose name is LATIN SMALL LETTER C WITH CARON: č
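To make the ambiguity concrete, here is a minimal Python sketch (not part of the original system) that decodes the same single byte under both encodings. The helper name `describe` is made up for illustration; the codec names and `unicodedata.name` are standard library.

```python
import unicodedata

def describe(raw: bytes, encoding: str) -> str:
    """Decode a single byte and report the Unicode character it maps to."""
    char = raw.decode(encoding)
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

raw = bytes([0xE8])  # the one byte in question
for encoding in ("iso-8859-1", "iso-8859-2"):
    print(encoding, "->", describe(raw, encoding))
# iso-8859-1 -> U+00E8 LATIN SMALL LETTER E WITH GRAVE
# iso-8859-2 -> U+010D LATIN SMALL LETTER C WITH CARON
```

The byte never changes; only the table used to interpret it does.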

The Web form that accepted this non-ASCII 0xE8 character relayed it to a backend business system that happily stored it. But that backend system also communicated the character, by way of XML-RPC, to another system. And that system (specifically, its XML parser) choked on the character. It did so because the parser, MSXML, defaults to UTF-8. This, by the way, is one of those infuriating industry acronyms that is often used but rarely spelled out, and that must also be recursively expanded. Thus, UTF-8 stands for UCS Transformation Format 8, and UCS in turn stands for Universal Multiple-Octet Coded Character Set.

I found an excellent description of the properties of UTF-8 in the UTF-8 and Unicode FAQ for Unix/Linux:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings that contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

All possible 2^31 UCS codes can be encoded.

UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.

The sorting order of Bigendian UCS-4 byte strings is preserved.

The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
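A few of the properties quoted above can be checked with Python's built-in UTF-8 codec. This is only a sketch: the function name `check_utf8_shape` and the sample characters are my own choices, and Python's codec implements the later, stricter definition of UTF-8 (at most four bytes per character), so the six-byte case is not exercised.

```python
def check_utf8_shape(char: str) -> bytes:
    """Encode one character and assert the byte-range rules quoted above."""
    data = char.encode("utf-8")
    if ord(char) <= 0x7F:
        # ASCII compatibility: same single byte under both encodings.
        assert data == char.encode("ascii")
    else:
        # Lead byte announces the sequence; the rest are continuation bytes.
        assert len(data) > 1
        assert 0xC0 <= data[0] <= 0xFD
        assert all(0x80 <= b <= 0xBF for b in data[1:])
    return data

samples = ["e", "\u00e8", "\u010d", "\u20ac"]  # e, e-grave, c-caron, euro sign
for ch in samples:
    print(f"U+{ord(ch):04X} -> {check_utf8_shape(ch).hex(' ')}")

# Sorting by code point and sorting by encoded bytes give the same order.
assert sorted(samples) == sorted(samples, key=lambda s: s.encode("utf-8"))
```

Note that U+00E8 comes out as the two bytes c3 a8, not as the lone byte e8.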

So, MSXML saw the 0xE8 as the first byte of a multibyte sequence, expected the next byte in the sequence to be in the range 0x80 to 0xBF, and choked when that wasn’t the case.
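The same failure is easy to reproduce outside MSXML. The sketch below uses Python's UTF-8 decoder as a stand-in for the parser, and a made-up payload (`Agn\xe8s`) rather than the subscriber's real name. The helper `expected_length` follows the original one-to-six-byte scheme described in the FAQ; both function names are hypothetical.

```python
def expected_length(lead: int) -> int:
    """Bytes in the sequence announced by a lead byte (original 1-6 byte scheme)."""
    if lead < 0x80:
        return 1  # plain ASCII
    if lead <= 0xBF:
        raise ValueError("continuation byte, not a lead byte")
    if lead >= 0xFE:
        raise ValueError("0xFE and 0xFF never appear in UTF-8")
    length, mask = 0, 0x80
    while lead & mask:  # count the leading 1 bits
        length += 1
        mask >>= 1
    return length

def explain_failure(data: bytes) -> str:
    """Try to decode as UTF-8 and describe the first offending byte, if any."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"byte 0x{data[err.start]:02X} at offset {err.start}: {err.reason}"
    return "valid UTF-8"

payload = b"Agn\xe8s"  # ISO-8859-1 bytes, hypothetical name
print(expected_length(0xE8))     # 3: a lead byte promising two continuation bytes
print(explain_failure(payload))  # the 's' that follows is not in 0x80-0xBF
```

Encoding the same name as UTF-8 first (`"Agn\u00e8s".encode("utf-8")`) yields bytes the decoder accepts.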

There are a couple of ways to “fix” this problem. One would be to use only 7-bit ASCII characters, which mean the same thing in UTF-8. To do this, you’d represent 0xE8 as the entity &#232;. Note that to ensure you see what I intend in the last sentence, I actually had to write &amp;#232; (that is, “escape” the ampersand). Then, of course, to show you how I did that, I actually had to write &amp;amp;#232;. And to show you how I did that, I had to write…oh, never mind, just view the source in your browser.
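For what it’s worth, Python can produce both the numeric reference and the successive layers of ampersand-escaping. A small sketch, with `to_ascii_xml` as a made-up helper name; the `xmlcharrefreplace` error handler and `html.escape` are standard library:

```python
import html

def to_ascii_xml(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

ref = to_ascii_xml("\u00e8")           # what goes into the document
shown_once = html.escape(ref)          # what you type to display the reference
shown_twice = html.escape(shown_once)  # what you type to display *that*

print(ref)          # &#232;
print(shown_once)   # &amp;#232;
print(shown_twice)  # &amp;amp;#232;
```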

Another “fix”, the one we in fact adopted, was to tell the XML-RPC module that sent the packet of data containing the 00E8 character to use the ISO-8859-1 encoding. That meant that instead of beginning like this:

<?xml version="1.0"?>

it instead begins like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
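The effect of that one attribute can be seen with any parser that defaults to UTF-8. The sketch below uses Python’s `xml.etree.ElementTree` (backed by expat) rather than MSXML, and a made-up `<name>` element and payload, so treat it as an illustration of the principle and not of our actual XML-RPC traffic.

```python
import xml.etree.ElementTree as ET

body = b"<name>Agn\xe8s</name>"  # hypothetical payload, ISO-8859-1 bytes

def parse_name(document: bytes) -> str:
    """Parse a one-element document and return its text content."""
    return ET.fromstring(document).text

undeclared = b'<?xml version="1.0"?>' + body
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

try:
    parse_name(undeclared)  # parser assumes UTF-8 and rejects the byte
except ET.ParseError as err:
    print("undeclared:", err)

print("declared:", parse_name(declared))  # byte 0xE8 is read as U+00E8
```

The bytes on the wire are identical in both cases; the declaration only tells the parser how to read them.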

As my quotation marks around “fix” suggest, neither of these approaches is a complete solution. When I mentioned this issue in the newsgroup, a longtime correspondent (whose name includes a Unicode 00E1, LATIN SMALL LETTER A WITH ACUTE) responded:

You have to use an encoding that suits the parser that will read the XML *AND* preserves the information you are trying to convey.

If I am writing XML that will be read by a UTF-only parser, I will have no choice but to encode it so, even if I (as a Portuguese-speaking Brazilian) somewhat prefer ISO-8859-1 or Windows-1252 to anything else (and wonder why ASCII had to be 7 bits wide). That’s sad, but true; XML is not that portable.

There are parsers that don’t care about encoding, but won’t be able to change the encoding. If they get UTF-8, they will spit UTF-8 out.

I wouldn’t encourage using numeric entities, as they depend on the encoding (a “Ç” may have the same numeric under ISO-8859-1 code as “[yen]” under ISO-8859-12; of course, I made this up) and this information is sure to be lost somewhere. I find “&Ccedil;” preserves the meaning much better. However, it is useless if you intend to put it inside an e-mail message or print it on a POS printer or sort it on a relational database.

Welcome to the wonders of XML.

Of course it wasn’t really XML’s charter to solve this problem. That’s what Unicode is for, and XML can do no better than to follow the evolving Unicode saga.

 
