i18n and L10n: 6 - International Websites
Blog post detailing how to configure websites using Unicode and UTF-8
In this post I cover how to use Unicode in your next PHP project. Configuring PHP5 to work with Unicode, and more importantly with the UTF-8 character encoding, is extremely simple. We will start with a quick recap of Unicode, then look at UTF-8, and finish with how to change the internal encoding of PHP5.
Unicode
Unicode, like ASCII (American Standard Code for Information Interchange), is a character-encoding standard for representing letters, numbers and control characters in the digital environment. ASCII uses seven bits to store the upper and lower case letters of the basic Latin alphabet used by English, along with punctuation, numbers and control characters. Seven bits give 128 different bit combinations, so ASCII has 128 code points. Of those, 33 are reserved for non-printing control characters such as backspace, delete and escape.
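As a small illustration of the seven-bit limit, the sketch below checks whether a string stays inside the ASCII range. The helper name is_seven_bit_ascii() is made up for this example and is not part of PHP; the accented sample is written as explicit UTF-8 bytes so the result does not depend on how the file is saved.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns true when every byte
// of $text falls inside the 7-bit ASCII range 0x00-0x7F.
function is_seven_bit_ascii($text)
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/^[\x00-\x7F]*$/D', $text) === 1;
}

echo ord('A'), "\n";                          // 65, the ASCII code for "A"
var_dump(is_seven_bit_ascii('Hello, world')); // bool(true)
var_dump(is_seven_bit_ascii("caf\xC3\xA9"));  // bool(false): "café" needs bytes above 0x7F
```

Any byte above 0x7F means the text has left plain ASCII and some other encoding is in play.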
The failing of ASCII is simple: 128 code points are not enough to cover all the upper and lower case letters, with and without accents, needed by every language that uses the Latin alphabet. In contrast, Unicode to date catalogues more than 100,000 characters for representing glyphs on the computer.
The importance of this work is that it removes compatibility issues between different systems and brings in languages beyond basic Latin. In contrast to ASCII, Unicode defines a codespace running from U+0000 to U+10FFFF, which is 1,114,112 code points and leaves ample room for current and historic glyphs. A code point fits in 21 bits; the fixed-width UTF-32 encoding stores each one in 32 bits.
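To make the idea of a code point concrete, here is a minimal sketch that takes a code point number, writes it as a fixed 4-byte (UTF-32, big-endian) value, and asks mbstring to convert that to UTF-8. The function name codepoint_to_utf8() is hypothetical, and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: encode a single Unicode code point as UTF-8.
// Assumes the mbstring extension is available.
function codepoint_to_utf8($codepoint)
{
    // pack('N', ...) yields the 4-byte big-endian form, i.e. UTF-32BE.
    $utf32 = pack('N', $codepoint);
    return mb_convert_encoding($utf32, 'UTF-8', 'UTF-32BE');
}

// Print the UTF-8 bytes (in hex) for a few sample code points.
foreach (array(0x41, 0xE9, 0x20AC, 0x1F600) as $codepoint) {
    printf("U+%04X -> %s\n", $codepoint, bin2hex(codepoint_to_utf8($codepoint)));
}
```

The hex output shows that the same code points take between one and four bytes once converted, which is the subject of the UTF-8 section.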
To date the Unicode project supports a large number of scripts, including:
- Arabic
- Basic and Extended Latin
- Cyrillic
- CJK Unified Ideographs (Chinese, Japanese, Korean)
For a more detailed list, see the Unicode charts at http://unicode.org/charts/.
UTF-8
UTF-8 is an entire topic in itself, with no quick answer. However, here is an attempt to summarise the key points.
UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width character encoding for Unicode that uses one to four bytes per character. It can represent any character in the Unicode standard, yet it is backwards compatible with ASCII: the first 128 characters (US-ASCII) need only one byte each and keep the same values they have in ASCII.
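The variable width is easy to see by comparing byte counts with character counts. The sketch below is illustrative only: the sample characters are my own choice, written as explicit UTF-8 byte escapes, and it assumes the mbstring extension is loaded.

```php
<?php
// Sample characters as explicit UTF-8 byte escapes, so the result does
// not depend on the encoding this source file is saved in.
$samples = array(
    'A (U+0041)'         => "A",
    'e-acute (U+00E9)'   => "\xC3\xA9",
    'euro sign (U+20AC)' => "\xE2\x82\xAC",
    'emoji (U+1F600)'    => "\xF0\x9F\x98\x80",
);

foreach ($samples as $label => $char) {
    // strlen() counts bytes; mb_strlen() with 'UTF-8' counts characters.
    printf("%-20s bytes=%d chars=%d\n", $label, strlen($char), mb_strlen($char, 'UTF-8'));
}
```

Each sample is a single character, yet the byte counts run from one for the ASCII letter up to four for the emoji.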
PHP
By default PHP5 does not come configured to use the UTF-8 character encoding. However, do not fret; this can be rectified. To configure your PHP script to handle Unicode characters correctly, you must first set the internal encoding.
Internal Encoding Configuration
Setting the script to handle Unicode data using the UTF-8 character encoding lets the multibyte string functions recognise both single-byte ASCII characters and the variable-length byte sequences used for non-ASCII glyphs from the Unicode library.
Configuring the internal encoding is simple. At runtime you can define the character encoding with the mb_internal_encoding() function, which is part of the mbstring extension. To unlock the advantages of Unicode and UTF-8, call mb_internal_encoding() in a shared configuration file or at the beginning of your scripts.
Configure Internal Encoding Example
mb_internal_encoding("utf-8");
Return Internal Encoding Example
echo mb_internal_encoding();
Now that you have added the mb_internal_encoding() call to your project, it is ready to handle Unicode characters, and your application is more flexible in handling non-Latin scripts.
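To see what the setting changes in practice, here is a short sketch, assuming the mbstring extension is loaded. The sample word is arbitrary and written with explicit UTF-8 bytes. Once the internal encoding is set, the mb_* functions pick it up without an explicit encoding argument, while the plain byte-oriented functions such as strlen() are unaffected.

```php
<?php
// Assumes the mbstring extension is loaded.
mb_internal_encoding('UTF-8');

$word = "caf\xC3\xA9"; // "café" as UTF-8 bytes

echo strlen($word), "\n";                     // 5: strlen() counts bytes
echo mb_strlen($word), "\n";                  // 4: mb_strlen() counts characters
echo bin2hex(mb_substr($word, 3, 1)), "\n";   // c3a9: the whole "é", not half of it
echo bin2hex(mb_strtoupper($word)), "\n";     // 434146c389: "CAFÉ"
```

The practical point is that string handling of non-ASCII text should go through the mb_* functions; setting the internal encoding does not change what strlen() or substr() do.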
Summary
Configuring your PHP5 application to use Unicode is a simple activity with far-reaching results. Setting the character encoding to UTF-8 at the beginning of the script, program or project lets the developer offer more to the users of the service. It helps you produce an application free of the classic compatibility issues associated with storing wrongly encoded data, such as unexpected characters and question marks. You can then focus on presenting content and running the site rather than troubleshooting strange characters.
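One way to guard against wrongly encoded data is to check input before storing it. The sketch below is only a suggestion and goes slightly beyond the steps in this post: it uses mbstring's mb_check_encoding(), and the wrapper name is_valid_utf8() is hypothetical.

```php
<?php
// Hypothetical wrapper: true when $input is well-formed UTF-8.
// Assumes the mbstring extension is loaded.
function is_valid_utf8($input)
{
    return mb_check_encoding($input, 'UTF-8');
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): "é" as a valid two-byte sequence
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): "é" as a lone ISO-8859-1 byte
```

Input that fails the check can be rejected or converted before it reaches storage.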
Links
- internationalization-localization-unicode.htm
- http://unicode.org/
- http://unicode.org/charts/
- http://en.wikipedia.org/wiki/Unicode
- http://en.wikipedia.org/wiki/UTF-8
- http://www.php.net/manual/en/function.mb-internal-encoding.php
This work is licenced under a Creative Commons Licence.