i18n and L10n: 2 - Unicode

Article about Unicode and UTF-8 for Internationalization

Author: Matthew Wittering | Published: 26th June 2009

This article extends my previous piece on Localization (L10n) with an introduction to Unicode. Unicode is a critical component of Internationalization and Localization and affects content in both desktop and web-based applications. I will, however, focus on the latter.

Character-encoding schemes

Unicode, like ASCII (American Standard Code for Information Interchange), is a character-encoding scheme for representing letters, numbers and control characters in the digital environment. ASCII uses seven bits to store upper- and lower-case letters, punctuation, numbers and control characters from the Latin alphabet used by the English language and its dialects. Seven bits offer computer scientists 128 different combinations of bits, or code points, with which to represent data. In ASCII, 33 of those code points are reserved for non-printing control characters such as backspace, delete and escape.
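The arithmetic above can be checked with a short sketch. Python is used purely for illustration (the article itself names no language), and the variable names are my own.

```python
# Seven bits give 2**7 distinct values, numbered 0 to 127.
ASCII_SIZE = 2 ** 7

# The control characters occupy codes 0-31 plus 127 (delete).
control_codes = [code for code in range(ASCII_SIZE) if code < 32 or code == 127]

print(ASCII_SIZE)                # 128
print(len(control_codes))        # 33
print(ord("A"), ord("a"))        # 65 97
print(format(ord("A"), "07b"))   # 1000001, the seven bits that store "A"
```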

The failing of ASCII is simple: 128 code points do not offer enough space to account for all combinations of upper- and lower-case letters, with and without accents, needed across all the languages that use the basic Latin alphabet. In contrast to ASCII, Unicode to date holds more than 100,000 character entries in its database for representing glyphs on the computer.

The History of Unicode

The origins of Unicode date back to 1987 when Joe Becker from Xerox and Lee Collins and Mark Davis from Apple started investigating the practicalities of creating a universal character set.

Quote 1: History of Unicode, Credit Wikipedia.

Such work would not only remove compatibility issues between different systems but also allow the inclusion of languages beyond basic Latin. Unicode, in contrast to ASCII, defines a code space of 1,114,112 code points (U+0000 to U+10FFFF), which needs 21 bits to address. Its fixed-width encoding form, UTF-32, stores each code point in 32 bits; 32 bits could in principle hold over 4 billion (4,294,967,296) values, far more than are needed for current and historic glyphs.
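For readers who want to inspect code points directly, here is a small sketch using Python's standard library. The three sample characters are my own choice. The upper limit of the code space, U+10FFFF, comes from the Unicode Standard rather than from this article.

```python
import unicodedata

# The Unicode code space runs from U+0000 to U+10FFFF.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT + 1)  # 1114112 possible code points

# Sample characters written as escapes: A, e with acute, and a Chinese character.
for ch in ["A", "\u00e9", "\u4e2d"]:
    # ord() returns the code point; unicodedata.name() returns its official name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# UTF-32 stores every code point in 32 bits, that is, four bytes.
utf32_length = len("\u4e2d".encode("utf-32-be"))
print(utf32_length)  # 4
```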

To date the Unicode project supports a large number of scripts. For a detailed list, please see the Unicode charts at http://unicode.org/charts/.

Representing Glyphs

A glyph is the smallest element of a written language. I have used the word glyph rather than character or letter because the written languages of the Near and Far East are derived from calligraphy.

In Arabic, glyphs may change their shape as a word grows in length, because a letter's form depends on its position in the word. In contrast to Western written languages, Chinese characters, which also appear in Japanese and Korean writing, can depict an activity or object emblematically, presenting an entire word in one graphical cell.
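The shape-changing behaviour of Arabic can be glimpsed in Unicode itself. The sketch below looks at the letter beh (my own choice of example), for which Unicode keeps four compatibility "presentation forms". Modern text normally stores only the base letter and leaves the choice of shape to the rendering engine, so treat this as an illustration rather than as how Arabic text should be written.

```python
import unicodedata

# Compatibility presentation forms of the Arabic letter beh (base letter U+0628).
forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

for form in forms:
    # NFKC normalization folds each positional form back to the base letter.
    base = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X}", unicodedata.name(form), "->", f"U+{ord(base):04X}")
```

Each line names one positional form (isolated, final, initial, medial) and shows that it maps back to the same underlying letter.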

If you study the Chinese language for any length of time you will learn of a radical difference from Latin-based languages. In English, for example, the Latin alphabet has variable-width glyphs; Chinese, Japanese and Korean do not. In Chinese each written unit has the same width, so the language appears as a grid when penned.
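Unicode records a related property, East Asian Width, which can be queried from Python's standard library. The sketch below is only a rough illustration: the property classifies characters as narrow or wide for East Asian typography and says nothing about the exact width of a glyph in any particular font. The sample characters are my own choice.

```python
import unicodedata

# One sample per script, written as escapes: Latin A, a Chinese character,
# a Japanese hiragana letter, and a Korean Hangul syllable.
samples = {
    "Latin": "A",
    "Chinese": "\u4e2d",
    "Japanese": "\u3042",
    "Korean": "\ud55c",
}

# "Na" means narrow; "W" means wide (one full cell of the grid).
widths = {label: unicodedata.east_asian_width(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(label, width)
```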

Chinese Example:

Translation and Localization

Now that there is a virtually universal character encoding for most of the glyphs used in written languages, there is the opportunity to translate and localize content for other cultures. Unicode and its eight-bit Unicode Transformation Format, UTF-8, allow Unicode code points to be represented while remaining backward compatible with ASCII, and so with many applications that are not yet Unicode-aware. By using services such as Google Translate, users can translate content into their preferred language. Below you will find a series of translations and localizations of the sentence 'Hello, my name is Matthew'.
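The backward compatibility of UTF-8 can be seen in a few lines of Python. This is a minimal sketch with sample characters of my own choosing: plain ASCII text produces exactly the same bytes in both encodings, while characters outside ASCII take two or more bytes.

```python
sentence = "Hello, my name is Matthew"

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
same_bytes = sentence.encode("utf-8") == sentence.encode("ascii")
print(same_bytes)  # True

# Characters beyond ASCII use multi-byte sequences.
for ch in ["A", "\u00e9", "\u4e2d"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())
# U+0041 1 41
# U+00E9 2 c3a9
# U+4E2D 3 e4b8ad
```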

Examples of Localizations:

Summary

I do greatly appreciate that I could have, and probably should have, dug deeper into Unicode and explained how the character-encoding scheme works. There are, however, numerous pages devoted to this already on the Internet. If you would like to learn more, I suggest first reading the Unicode page on Wikipedia.

It is my desire to raise the profile of Unicode and to encourage more people to use Unicode characters when constructing websites and applications. I would urge all who have read this article not to use HTML character entities for characters such as é. I would instead prefer you to use the Unicode Transformation Format 8, or simply UTF-8, to put Unicode characters directly in your projects. However, more on this another time.
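As a small taste of what this means in practice, the sketch below converts an entity-encoded word into its literal Unicode form and encodes it as UTF-8. The sample word and the variable names are my own, and the meta tag shown is one common way of declaring the encoding, not the only one.

```python
import html

# A word written with an HTML character entity for the accented letter.
entity_form = "caf&eacute;"

# html.unescape() replaces the entity with the literal Unicode character.
literal_form = html.unescape(entity_form)
print(literal_form)

# Saved as UTF-8, the accented letter becomes the two bytes C3 A9.
utf8_bytes = literal_form.encode("utf-8")
print(utf8_bytes)  # b'caf\xc3\xa9'

# The page must also declare its encoding so that browsers decode it correctly.
META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(META_TAG)
```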

Links

  1. http://en.wikipedia.org/wiki/Unicode
  2. http://www.unicode.org/
  3. http://translate.google.com/

This work is licenced under a Creative Commons Licence

A brief introduction

Matthew Wittering: I am a graduate of Loughborough University, where I read Computing and Management BSc (Hons), earning a 2:1 classification.

Currently I am working in the Product Team as a Junior Product Manager at Ask Jeeves UK.