Unicode background
In the ever expanding world of e-commerce and information technology, one thing above all is coming to the forefront of internet design - globalization. For many large companies, web documents are translated into many different languages. However, the computer technologies we use for storing data weren't originally designed to deal with information in multiple languages. A web page could show only one set of characters, be it Latin, Cyrillic, Greek, Japanese, or any other character set. Luckily, there is now a standard that is implemented in most browsers called
Unicode. This standard encodes characters differently than older technologies, allowing almost all the characters in all human printed languages to be displayed in one page.
Let's clarify the difference between some commonly used terms. A character is a textual unit, such as a letter, number, symbol, punctuation mark, etc. A character set is a set of characters you would like to use. For instance, English uses a Latin character set, while Russian uses a Cyrillic character set. Unicode is a character set that includes characters needed for almost all current written human languages.
When we request text data from the web the data must be encoded. As you know, all data is stored as numbers in computers, and this is what the encoding is. Think of the old decoder rings you would get in cereal boxes. "1" would stand for "A", "2" for "B", etc. The character encoding is the same thing, just on a much larger scale. In computers it matches up integers to characters. The Unicode standard provides several different character encodings, which may be appropriate for different technologies. The one that is leading the way in web development is called UTF-8. UTF-8 contains a numerical representation for over 100,000 characters (all the characters in the Unicode character set).
A glyph is a pictorial representation of a character. For instance the typeset "g" and the handwritten "g" are both glyphs. They look different, but they both mean "letter g." UTF-8 encodes characters, not glyphs. There is only one code for the "letter g." A font is a collection of glyphs. It takes the characters and maps them to glyphs. You must have a font capable of producing the correct glyphs to correctly view Unicode text. Most fonts only have glyphs for subsets of the entire Unicode character set, for instance Latin and Greek. If you have Cyrillic characters in a document and you attempt to render them with such a font, the Cyrillic characters may be rendered as open rectangle, question marks, or not at all. This does not mean that the data is not there, it just means that the font you have selected does not include the glyphs needed to display the data. Very few fonts have glyphs for all the characters in Unicode, primarily because creating more than 100,000 passes the point of diminishing returns for commercial fonts.
Code2000 is a $5 shareware font that is the most comprehensive with over 61,000 glyphs. Other fonts can be found at
Alan Wood's Unicode site.
How Unicode fits in your Dynamic Web Site
PHP 5 supports UTF-8 natively (without special compilation options) as does MySQL 4.1. However some care needs to be taken to ensure your data is stored and displayed correctly. This article is about storing and retrieving Unicode data in a MySQL 4.1 database using PHP 5. It is not about support for using Unicode characters in variable names and other PHP code, or in the names of tables and columns in MySQL. As Unicode support is still young, I would recommend avoiding such things at this time.
Preparing MySQL for Unicode
Although MySQL has support for UTF-8, it doesn't use it as its default character encoding. If you have control over your server, you can configure it at compilation or through its configuration files to use UTF-8 as default. But since most people don't have complete control, we'll focus on things you can do at the table level.
Let's create a table to hold our data. What is most important here is the "CHARACTER SET utf8" portion. This tells MySQL that all the text in this table will be encoded in UTF-8.
CREATE TABLE document (
id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
unicodeText VARCHAR(45) NOT NULL
)
CHARACTER SET utf8 COLLATE utf8_general_ci;
Note that MySQL uses a non-standard name "utf8" to mean UTF-8. The COLLATE command is used to tell how to sort the data when using the SORT BY command. Also note that you should always use VARCHAR instead of CHAR with UTF-8. (UTF-8 uses variable sized numbers for different characters. For instance, Latin letters use 1 byte codes, while Japanesee characters are 3 bytes. Using CHAR(10) would force the database to reserve 30 bytes, because it doesn't know ahead of time which length with be used, so it reserves the maximum.)
Telling MySQL how to store the data is just half of the equation. You must also tell MySQL that the data you are passing into it is UTF-8 otherwise it will assume it is in its default encoding. (If you've been doing a lot of searching on Google, and haven't been able to get things to work, this is probably the information that you haven't found.) The command for this is:
SET NAMES 'utf8';
Using PHP and XHTML to encode
Here's our code snippet. You'll want to modify this so that it's valid XHTML before you use it in production:
1 <?php
2 $DB = new mysqli('localhost', 'user', 'root', 'dbname');
3 $DB->query("SET NAMES 'utf8'");
4 if (!empty($_POST['ta'])) {
5 $DB->query("UPDATE document SET unicodeText='{$_POST['ta']}' WHERE ID=1");
6 }
7 $result = $DB->query("SELECT unicodeText FROM document WHERE ID=1");
8 $return = $result->fetch_object();
9 $result->close();
10 ?>
11 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
12 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
13 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
14 <head>
15 <title>Test</title>
16 </head>
17 <body>
18 <p>Posted: <?php echo $_POST['ta'];?></p>
19 <form enctype="multipart/form-data" method="post" action="test2.php">
20 <fieldset>
21 <textarea name="ta"><?php echo $return->unicodeText;?></textarea>
22 <input type="submit" />
23 </fieldset>
24 </form>
25 </body>
26 </html>
Because we are using MySQL 4.1 with PHP 5 it makes sense to use the MySQLi extension. I like using the OO version of the extension. Line 2 is the connection to the database. Line 3 tells the database to expect UTF-8 data. Lines 4-6 update a row in the database with the Unicode data we send through the form. (Make sure to have some sample data in row 1 of the database when you test this code) Lines 7-8 pull the data back out of the database. Lines 11-12 are important. XHTML defaults to using UTF-8 encoding, unless you specifically tell it otherwise. Using the correct Doctype declaration will alert your browser to use UTF-8 encoding. You will see a lot of things if you search on Google about how you need to put tags about UTF-8 in all sorts of places, like the form, or the form controls, or in META tags. If you are serving valid XHTML this is not necessary. The rest of the code is the form. Lines 18 and 21 will let you compare the raw POST data with what the database returns.
By taking a little care with setting up the database, and using valid XHTML, one can properly store and serve UTF-8 code. Here's a link to a page that will allow you to
copy some Unicode that you can paste into your form for testing.