Instant Gratification   



Mon, 09 Jun 2008

Fighting Mojibake at Home

At work, we’re pretty serious about internationalization. Plus, I’d recently read a good book that covered the topic, along with some posts on the interwebs. There was a term that was unfamiliar to me (mojibake), but it immediately made sense that there should be a word for the phenomenon.

Mojibake is basically what happens when text is declared as one character encoding but is actually encoded in another. Either read my del.icio.us links or make do with the metaphor that it’s like writing something using the Caesar Cipher but declaring that it’s ROT-13 (except that with real encodings you usually only notice once you use é’s and stuff). Have you ever seen boxes or question marks on the internet? That’s mojibake.
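To see the mismatch for real, here’s a minimal Python sketch (Python is just my pick for illustration, and the helper name misread is made up). It encodes a string one way and decodes the bytes under a different declaration, which is where the gibberish and the question marks come from.

```python
def misread(text, actual="utf-8", declared="latin-1"):
    """Encode text one way, then decode the bytes as if they were another encoding."""
    return text.encode(actual).decode(declared, errors="replace")

original = "caf\u00e9"  # "café"

# UTF-8 bytes read back under a Latin-1 declaration: the two-byte é becomes two characters.
garbled = misread(original)

# Forcing the text into ASCII: é has no ASCII form, so it is replaced with "?".
lossy = original.encode("ascii", errors="replace").decode("ascii")

print(garbled)  # cafÃ©
print(lossy)    # caf?
```

Plain ASCII letters survive both trips untouched, which is why the problem tends to hide until an accented character shows up.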

The good news is that if you’re a developer, you can fight it. Read the articles I’ve bookmarked on I18N. The short version, continuing with the cipher metaphor in the examples below: strings are never just strings anymore, unless you know their encoding.

(note: this text is encoded using ROT-13)

GUVF VF EBG GUVEGRRA


(note: this text is encoded using the Caesar Cipher)

WKLV LV FDHVDU FLSKHU
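If you want to check the two jumbles yourself, here’s a rough Python sketch. The shift_decode helper is my own invention for this post, and it assumes the classic Caesar shift of 3 and uppercase text only. Decoding with the declared shift gives you words; decoding with the wrong one gives you a different jumble.

```python
import string

def shift_decode(ciphertext, shift):
    """Undo a Caesar-style shift on uppercase letters; other characters pass through."""
    upper = string.ascii_uppercase
    # Map each ciphertext letter back to the letter `shift` places earlier.
    table = str.maketrans(upper, upper[-shift:] + upper[:-shift])
    return ciphertext.translate(table)

rot13_text = "GUVF VF EBG GUVEGRRA"
caesar_text = "WKLV LV FDHVDU FLSKHU"

right_rot13 = shift_decode(rot13_text, 13)
right_caesar = shift_decode(caesar_text, 3)
wrong_caesar = shift_decode(caesar_text, 13)  # Caesar text "declared" as ROT-13

print(right_rot13)   # THIS IS ROT THIRTEEN
print(right_caesar)  # THIS IS CAESAR CIPHER
print(wrong_caesar)  # still a jumble
```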

This goes for every piece of text that you own, store, process, export, send over a network, render in a webpage, read from a database, write to a filesystem, enter into a textbox, put into an email, etc, etc, etc: YOU MUST DECLARE THE ENCODING. That’s one aspect of I18N in a nutshell. And it’s also the simplest answer to fighting mojibake.

If I just gave you those jumbles of letters above without naming the encoding, they’d be effectively meaningless. Unless your primary text is seriously non-roman (and probably even if it is), UTF-8 should be your default encoding. Most programming languages and tools are leaning towards Unicode strings with UTF-8 as the default encoding, so that is currently the path of least resistance.
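In code, “declare the encoding” mostly means never letting a read or a write fall back to some platform default. A small Python sketch, assuming a throwaway temp file (the file name is hypothetical):

```python
import os
import tempfile

text = "na\u00efve caf\u00e9"  # "naïve café"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # Write with an explicit encoding instead of relying on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the same declaration round-trips cleanly.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Reading with a different declaration produces mojibake, not an error.
    with open(path, "r", encoding="latin-1") as f:
        mismatched = f.read()

print(round_trip == text)  # True
print(mismatched == text)  # False
```

The nasty part is that the mismatched read doesn’t fail. It quietly hands you the wrong characters, which is why the declaration has to travel with the bytes.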

The title of this post is “Fighting Mojibake at Home”, and the inspiration for this post was a stupid link I’d bookmarked: NOTES ON AN ?INSURGENCE OF QUALITY? (question marks for posterity). It was showing up on my syndication sidebar (oooh, web2.0gasm) with stupid question marks and at my current pace of bookmarking, it’d be there taunting me for at least a month. This led me to take the fairly simple corrective actions of:

  1. Set Content-Type headers to include “text/html; charset=utf-8”
  2. Add <meta> tag for UTF-8 (in case you save my HTML to disk)
  3. Change Magpie-RSS to output UTF-8 instead of ISO-8859-1
  4. Add “:set encoding=utf-8” to my .vimrc
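My actual fixes were in Blosxom and Magpie-RSS configuration, not Python, but the first three steps boil down to something like the sketch below. The function names to_utf8 and build_response are hypothetical, and the feed item is a stand-in for bytes arriving from a Latin-1 source.

```python
def to_utf8(raw, source_encoding="iso-8859-1"):
    """Re-encode bytes from a known source encoding to UTF-8 (step 3)."""
    return raw.decode(source_encoding).encode("utf-8")

def build_response(body_html):
    """Return (headers, body bytes) that both declare UTF-8 (steps 1 and 2)."""
    page = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head><body>" + body_html + "</body></html>"
    )
    headers = {"Content-Type": "text/html; charset=utf-8"}
    return headers, page.encode("utf-8")

# Pretend this arrived from a feed that was serving ISO-8859-1.
feed_item = "r\u00e9sum\u00e9".encode("iso-8859-1")

headers, body = build_response(to_utf8(feed_item).decode("utf-8"))
print(headers["Content-Type"])  # text/html; charset=utf-8
```

The header covers the page as served; the meta tag covers the copy someone saves to disk, where the header is long gone.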

This was in addition to poking around in the Blosxom source to make sure there was no obviously wrong string manipulation going on (none that I saw straight away). If you’re serious about development, see if your code passes the Turkey test, a completely awesome checklist of how your software (right now) will break in Turkey.

Gobble Gobble.

23:46 CST | category / entries