2.6.08

Translation, i18n and L10n

The task of turning software in one language into software in another language is something that all but the biggest companies outsource.

I've often thought that this is a great "wedge" issue to use with American engineers who are reluctant participants in outsourcing. It's a good, non-controversial issue to discuss because no matter what strategic yardstick you use, it's unlikely that you'd make a strong case for insourcing translation or the translation-related engineering services of localization and internationalization.

Through the years, I've had to put a lot of thought into these issues. I have been involved in internationalization of several software products and web sites, and I have attempted to simultaneously ship multiple releases of one particular product that was localized into 14 languages. I have always outsourced the translation, and have been involved with a variety of hybrid sourcing arrangements around internationalization and localization.

I was speaking with someone about this recently, and the framework I use to understand and break down this space was useful enough in that conversation to warrant a blog post all on its own. Hence the tower of Babel above, and my attempts to debabelize this messy space.

But before I get into the specifics of my thought-model, it's worth explaining i18n and L10n, in case some readers don't know about these abbreviations.

As always, Wikipedia has a good informative post on the topic. I quote here:

Due to their length, the terms are frequently abbreviated to i18n (where 18 stands for the number of letters between the i and the n in internationalization and localization, a usage coined at DEC in the 1970s or 80s) and L10n respectively. The capital L on L10n helps to distinguish it from the lowercase i in i18n.

As an editorial aside, I'll say that I think these are silly abbreviations. They usually require a long conversational preface in order to use them in any but the most geeky circles. But nonetheless, they are commonly used. To make my point that these are silly, I've tried to coin other similar alpha-numeric abbreviations, but unfortunately mine never caught on:
  • p2l for perl
  • j2a for java
  • m0e for me
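
For the curious, the rule behind all of these fits in a few lines. This is only a sketch of the counting rule as I understand it; the function name numeronym is my own invention, not anything standard.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of letters in between + last letter."""
    if len(word) < 2:
        return word  # nothing in between to count
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for word in ["internationalization", "localization", "perl", "java", "me"]:
    print(word, "->", numeronym(word))
```

Note that the rule gives a lowercase "l10n"; the capital L is a convention people apply by hand.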
Anyway, now that you know what these abbreviations stand for, I'll describe my simple mental model.

Given that most of my work on localized software products was on the quality assurance side, it's no surprise that my model refers to the classes or kinds of problems or defects you're likely to see with localized software.

The first is the hardest to diagnose and fix, though the least severe from a software engineering perspective. These are "translation" bugs. Software engineers typically view language as, well, language. It's a rule-based symbolic representation of abstract and concrete ideas. To many in my trade, English, Hindi and C++ are thought of as roughly equivalent.

But computer languages and human languages aren't even close to equivalent. Human languages are infinitely more complex and nuanced. And while we'd all like to believe that "Babelfish" does a good job, the truth is that machine translation is in its infancy, and that real humans are required to translate any given sentence into a new language with any hope of retaining situational context. This is all a long-winded way of saying that translation is art, not science.

This has been famously discussed, and apocryphal blunders like these have found their way into our urban legend lexicon:
  • The Chevy Nova, which translated as "It doesn't go!" in Spanish markets.
  • The Ford Pinto, which had a euphemistic translation of "very small male reproductive organs" in Brazil.
  • The Schweppes Tonic Water ad in Italy that translated their flagship product as "Schweppes Toilet Water."
  • Or even JFK's famous "I am a jelly donut" speech.

These profound and obvious translation errors are easy to find and easy to fix. While these examples are mostly fiction, it does sometimes happen that translators just straight-out mess up, and select the wrong word, the wrong syntax, or in any of a hundred other ways really get it wrong. Picking the wrong word for something just means you've got a weak translator with poor grammar or idiomatic vocabulary skills in either the source or the target language.

But because translation is art, it is subject to the whims of fancy. And because of this, a second subcategory of very nuanced "possible errors" results. That is, two different translators will often translate the same sentence differently. This results in cosmetic defects wherein the reader or end-user just doesn't like the way something was translated. These defects are difficult to discover, because you have to involve a large audience of native-language testers or users, and they're hard to fix, because they're about art and opinion. I've had to play King Solomon in many arguments between my user community (or sales teams) and my translators, where the "baby" in question was the translator's decision on a particular phrase or passage.

Anyway, that gives us Category One: Artful Translation Bugs.

The second group of issues has to do more with on-screen display. In my experience and to my way of thinking, this is mostly about software graphical user interfaces, and it's mostly the result of static layout and bad software design. In lay terms, graphical user interfaces (GUIs) are built by telling the computer to draw a window, and to put stuff in it or on it. Designers like to control the stuff, so they tell the computer how big to make buttons, where to put them, what they should look like, etc. The designers typically try to make their screens "pretty" in whatever version of pretty is popular in that particular shop on that particular day.

The trouble comes when you write text on the top of your pretty button. For instance, you might have a "Go" button on your application or in your web page. You might constrain the software to use Courier font, size 12 (a popular choice with designers!). Then, you might tell the application that the Go button needs to be 25 pixels wide and 18 pixels tall. It works great, and it looks great, and the word "Go" just fits. Then, you decide this product would sell well in Germany, and you localize it into German. The "Go" in the "Go" button gets translated, and the whole thing now becomes the "Gehen Sie" button. "Gehen Sie" doesn't fit onto your 25x18 button. Depending on the application, maybe you have a "Ge" button, or possibly an "ie" button, both of which are meaningless. Or maybe "Gehen Sie" gets written past the boundaries of your button, over the top of some other text or object. Either way, the software suffers, and it's a bad defect. This problem gets even worse if you localize to languages that use graphemes, phonemes, morphemes or ideograms instead of a "Western alphabet." And it gets worse yet again in cultures where the written page doesn't start at the top left.

One obvious way to get around this is to make your GUI objects conditional on the text they have to hold, and on the display properties of the language in question, and never to paint screens or objects of a fixed size. But this seldom happens - designers hate that approach because it messes with the pretty.
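
To make the "Go" versus "Gehen Sie" problem concrete, here is a rough sketch of sizing a button from its text rather than the other way around. The 7-pixel character width and 4-pixel padding are numbers I made up for a monospaced font; a real GUI toolkit would measure the rendered text for you, and the function names are hypothetical.

```python
CHAR_WIDTH_PX = 7  # assumed width of one monospaced character
PADDING_PX = 4     # assumed padding on each side of the label


def label_fits(label: str, button_width_px: int) -> bool:
    """Check whether a label fits on a button of fixed width."""
    return len(label) * CHAR_WIDTH_PX + 2 * PADDING_PX <= button_width_px


def button_width_for(labels: dict) -> int:
    """Derive the button width from the longest translation of its label."""
    longest = max(len(text) for text in labels.values())
    return longest * CHAR_WIDTH_PX + 2 * PADDING_PX


go_labels = {"en": "Go", "de": "Gehen Sie"}

for locale, text in go_labels.items():
    print(locale, repr(text), "fits in 25px:", label_fits(text, 25))

print("width needed for every locale:", button_width_for(go_labels))
```

Under these assumptions the English label fits the 25-pixel button and the German one does not, which is the whole defect in miniature.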

Another subset of problems within this class is the periodic "untranslatable" phrase, word or value within the application. The word in question is always quite translatable. It just may be that the software application doesn't get the word from a list of resources designated for translation, but instead derives it with some business logic embedded deep in the application. I can tell you from experience that French customers are positively delighted when they see something like: Votre prochain balayage a lieu le Tuesday. They actually have laws against this kind of bug in France.
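
Here is a minimal sketch of the fix for the "Tuesday" bug: the day name comes from the same translated resources as the sentence around it, instead of from business logic. The MESSAGES and WEEKDAYS tables, the English wording and the function name are all hypothetical stand-ins for whatever resource files a real product would use.

```python
from datetime import date

# Hypothetical translated resources; a real product would load these from files.
MESSAGES = {
    "en": "Your next scan is on {day}.",
    "fr": "Votre prochain balayage a lieu le {day}.",
}
WEEKDAYS = {
    "en": ["Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"],
    "fr": ["lundi", "mardi", "mercredi", "jeudi",
           "vendredi", "samedi", "dimanche"],
}


def next_scan_message(locale: str, when: date) -> str:
    """Build the message with a day name taken from the same locale's resources."""
    day = WEEKDAYS[locale][when.weekday()]  # weekday() returns 0 for Monday
    return MESSAGES[locale].format(day=day)


scan_day = date(2008, 11, 4)  # a Tuesday
print(next_scan_message("en", scan_day))
print(next_scan_message("fr", scan_day))
```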

Anyway, this broad issue of how usable and comprehensible the software is once it's been translated gives us Category Two: Display and Usability Bugs.

The last group of issues is more complex, more insidious, and more dangerous. These issues involve the real "guts" of internationalization: Character sets and data structures. To illustrate my point, just look at this date:
  • 11/5/2008
Now, is that May 11, 2008, or November 5, 2008?

The correct answer is "Yes." If an application takes user input for dates, or displays time in the "local" convention, how it handles date stamps is very important.
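
A quick sketch shows how the same string turns into two different dates depending on which convention the code assumes. It uses Python's standard datetime module; the choice of ISO 8601 as the unambiguous form to store is my suggestion, not the only option.

```python
from datetime import datetime

raw = "11/5/2008"

# Month-first convention, as in the United States.
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

# Day-first convention, as in much of Europe.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# ISO 8601 (year-month-day) removes the ambiguity once the input is parsed.
print("month first:", month_first.isoformat())
print("day first:  ", day_first.isoformat())
```

The point is that the application has to know the user's convention at the moment of input; after that, it should store something that can't be misread.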

Character sets are another class of issues that similarly result in vastly different and unexpected behavior, depending on how well the given piece of software handles them. Basically, each language system in popular use has its own character set encoding, or is covered by a "catch-all" standard. Not handling this well can result in on-screen garbage. Again referencing Wikipedia, here's a great essay on the phenomenon called Mojibake:

Mojibake is often caused by forced display of writing systems or character encodings that are "foreign" to the user's computer system: if a computer does not have the software required to process a foreign language's characters, it will attempt to process them in its default language encoding, usually resulting in gibberish.

This works in both directions, and bad internationalization can cause gibberish to be displayed, or it can store gibberish, instead of, say, your password.
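
Both directions are easy to reproduce. In this small sketch, the word "café" and the particular encodings are just examples I picked: the first half displays gibberish by decoding UTF-8 bytes with the wrong encoding, and the second half stores gibberish by forcing the text into ASCII.

```python
original = "caf\u00e9"  # "café", with the accented e written as an escape

# Display direction: UTF-8 bytes read back as if they were Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # the single accented letter shows up as two wrong characters

# Storage direction: forcing the text into ASCII silently loses the letter.
lossy = original.encode("ascii", errors="replace")
print(lossy)

# The display-side damage is reversible if you know which mistake was made.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The storage case is the nasty one: once the "?" has been written, the original character is gone for good.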

Another example of this comes from static encoding of values that change in localized operating systems. For instance, many installer programs are "hard coded" to install software into the file system location specified by this string of characters:
  • C:/windows/program files/MyNewSoftware.

Well, again going to the French, on a French operating system, this "logical" directory would be represented by:
  • C:/Windows/Dossiers de Programme/MonNouveauLogiciel

There are ways to abstract your software away from this kind of information, so that it is either derived from the operating system or referenced using some locale-independent pointer. But this is often not done during the original design and "1.0" version of an application, so it becomes an afterthought. Finding all instances where static strings are referenced, or where character set encoding is assumed to be ASCII or ANSI, is tough, important work. These defects are bad, and can do horrible, unpredictable things to software.
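
As one example of deriving the location from the operating system, here is a sketch that reads the ProgramFiles environment variable that Windows sets, instead of hard coding a folder name. The install_dir function is hypothetical, and the two environments passed to it are illustrative values I typed in by hand so the example runs anywhere; a real installer would pass os.environ or use its platform's own API.

```python
from pathlib import PureWindowsPath


def install_dir(app_name: str, env: dict) -> PureWindowsPath:
    """Build the install path from the OS-provided program folder."""
    program_files = env.get("ProgramFiles")
    if not program_files:
        # Refuse to guess; a guessed path is the bug we are trying to avoid.
        raise RuntimeError("ProgramFiles is not set")
    return PureWindowsPath(program_files) / app_name


# Illustrative environments only; the values are assumptions, not queried.
english_env = {"ProgramFiles": r"C:\Program Files"}
localized_env = {"ProgramFiles": r"C:\Programme"}

print(install_dir("MyNewSoftware", english_env))
print(install_dir("MyNewSoftware", localized_env))
```

The same code produces the right path on both machines because the folder name never appears in it.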

Important stuff this, and representative of Category Three: Data Structure Bugs.

That's my mental model, and my three categories of problems associated with changing the language on a piece of software. I take some liberties, I know, but this is useful for talking about a complex and wide ranging realm of software engineering.

I'll wrap with this: There's a new alpha-numeric abbreviation coming into vogue. It's globalization, in the context of "globalizing" your piece of software. It's abbreviated as g11n. The word is already too overburdened with meaning, so I really hope this one doesn't catch on.

3 comments:

Clint Tustison said...

Hey Tom, I like what you said about translation. I do agree that translation is more of an art than a science, but I'd have to disagree with the part where you said that:

"Picking the wrong word for something just means you've got a weak translator with poor grammar or idiomatic vocabulary skills in either the source or the target language."

Not necessarily. One thing that people often forget is that translation is so subjective. There are many different levels of "wrong words" and what one person deems as a wrong word can be perfectly acceptable to another translator.

Also, I would suggest you be careful to not propagate some of the "translation blunders" you mention, specifically the Chevy Nova and Kennedy examples. You can check out these links for clarification:

http://urbanlegends.about.com/cs/historical/a/jfk_berliner.htm

http://www.snopes.com/business/misxlate/nova.asp

Tom Hickman said...

Clint:

Thanks for the comments.

I think we're in agreement about the nuances of translation. I was trying to point out that sometimes translators just plain get it wrong (wrong word, bad grammar, etc.). I speak later in the blog post about the more nuanced, harder to nail down "artful" nature of translation, where mostly it's a matter of personal preference. But you're right, two different translators will often pick different words, nuances, etc. for the same strings!

Regarding the urban legends, I thought I covered my bases with the "apocryphal" but I'll go back and amend it, per your suggestion. Thanks for keeping me honest!

Hong Zhang said...

Yes, it's often a daunting challenge to support multiple languages. For example, I have a Mac that is meant to be sold worldwide, and can be tuned to act like a local machine that displays and inputs Chinese characters. Still, it displays gibberish for certain characters, which unfortunately include my last name. They are stored correctly, though, so when I send email out, I know it will be right if others use Windows.

On the other hand, when I use Yahoo to write email or store contact info in Chinese characters, sometimes it will display right on my own screen, but show up as numbers at the destination. The mysterious thing is that it's pretty random and I can't predict when the problem will come up.

With the global source of Apple and Yahoo, it is still not bug free to fit every language setting. I can see the difficulty faced by the other software developers.