
Unicode (UTF-8) with PHP 5.3, MySQL 5.5 and HTML5 Cheat Sheet


CONVERSION

How to transform file encoding

Example with PHP files on Linux:

    find . -name "*.php" \
        -exec iconv -f ISO-8859-1 -t UTF-8 {} -o /path/to/utf8_files/{} \;

How to transform character encoding in MySQL databases

Procedure² (use the INFORMATION_SCHEMA database to build a script automatically):

● Create a temporary, identical structure in a new database,
● copy all data to that structure,
● drop the initial structure and
● recreate it with the new character encoding:
      CHARACTER SET utf8
      COLLATE utf8_general_ci
● Copy all data from the temporary structure to the new structure, converting all texts to the new encoding, and finally
● drop the temporary structure.


CONFIGURATION

HTTP and HTML

In php.ini [1]:

    default_charset = UTF-8

or in httpd.conf or .htaccess [5]:

    AddDefaultCharset UTF-8

or in the PHP code [5]:

    header('Content-type: text/html; charset=UTF-8');

Additionally, put this in your HTML <head> block:

    <meta charset="UTF-8"/>

PHP

In php.ini [1.1]:

    mbstring.language = Neutral
    mbstring.internal_encoding = UTF-8
    mbstring.encoding_translation = On
    mbstring.http_input = auto
    mbstring.http_output = UTF-8
    mbstring.detect_order = auto
    mbstring.substitute_character = none

or in httpd.conf or .htaccess:

    php_value <php.ini directive> <value>

or in the PHP code³ [1]:

    mb_internal_encoding('UTF-8');
    mb_http_output('UTF-8');
    mb_detect_order('auto');
    mb_substitute_character('none');

MySQL

Right after each connection, call¹ [2]:

    SET NAMES 'utf8';

Ordering in MySQL

Ordering in MySQL depends on the collation you choose. Detailed information about this subject may be found in the documentation on MySQL.com [2]. Look especially at the Unicode Character Sets section: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html.


MYSQL CODE

Stored Procedures and Functions

Old function:

    CREATE FUNCTION example_function (
        parameter_name VARCHAR(255)
    )
    RETURNS VARCHAR(255)
    READS SQL DATA
    BEGIN
        DECLARE data VARCHAR(255);
        ...
        RETURN data;
    END;

New function:

    CREATE FUNCTION example_function (
        parameter_name VARCHAR(255) CHARACTER SET utf8
    )
    RETURNS VARCHAR(255) CHARACTER SET utf8
    READS SQL DATA
    BEGIN
        DECLARE data VARCHAR(255) CHARACTER SET utf8;
        ...
        RETURN data;
    END;

Verifications⁴

Run this small PHP script:

    if ( ! extension_loaded('mbstring'))
        die('mb functions not loaded');
    if (1 != preg_match('/^.{1}$/u', "ñ", $UTF8_ar))
        die('PCRE is not compiled with UTF-8 support');
    exit('ok');
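
The file conversion under "How to transform file encoding" can also be sketched in PHP for individual strings. This is a minimal sketch, not the author's method: to_utf8_sketch() is a hypothetical helper, and it assumes that any input which is not already valid UTF-8 is ISO-8859-1. That assumption is a heuristic, since a few ISO-8859-1 byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper, for illustration only (not part of PHP).
// Assumption: input that is not valid UTF-8 is encoded in ISO-8859-1.
function to_utf8_sketch($text)
{
    // Valid UTF-8 (including plain ASCII) is returned unchanged,
    // so applying the function twice does not double-encode.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "ma\xF1ana";              // "mañana" as ISO-8859-1 bytes
$utf8   = to_utf8_sketch($latin1);  // "mañana" as UTF-8 bytes
$again  = to_utf8_sketch($utf8);    // unchanged on the second pass
```

The bytes are written with escape sequences so that the example does not depend on the encoding of the PHP source file itself.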

1 Once this is done, PHP sees MySQL databases as if each TEXT, CHAR or VARCHAR field were encoded in UTF-8, no matter what the actual encoding is. Thus, there is no need to prepare the encoding of query parameters or to convert the results in PHP.
2 Trying to change the character encoding of TEXT, CHAR and VARCHAR fields directly with an ALTER TABLE will corrupt existing data.
3 Some php.ini directives cannot be modified in the PHP code.
4 Source: utf8.php in PHP UTF-8 library [3]

Unicode (UTF-8) with PHP 5.3, MySQL 5.5 and HTML5 Cheat Sheet, version 1.0 Date: 2011-05-12 Author, sources, copyright and license on page 3 Page 1 / 3
PHP CODE (1/2)

Multibyte string functions [1]

    Replace           With
    strlen()          mb_strlen();   // How many characters
                      strlen();      // How many bytes
                      mb_strwidth(); // Width in monospaced characters
    substr()          mb_substr()
    strstr()          mb_strstr()
    stristr()         mb_stristr()
    strrchr()         mb_strrchr()⁵
    strpos()          mb_strpos()
    stripos()         mb_stripos()
    strrpos()         mb_strrpos()
    strripos()        mb_strripos()
    strtolower()      mb_strtolower()
    strtoupper()      mb_strtoupper()
    substr_count()    mb_substr_count()⁶

String access by character [1]

Search for all uses of curly or square brackets to extract single characters of strings:

    $string{$position} // old syntax
    $string[$position] // new syntax

Regular expressions to find them:

    /\$\w(\w|\d)*\{(\d+|\$\w(\w|\d)*)\}/
    /\$\w(\w|\d)*\[(\d+|\$\w(\w|\d)*)\]/

Replace

    $char = $string{$pos};

with

    $char = mb_substr($string, $pos, 1);

Replace

    $string{$pos} = $char;

with

    $string = mb_substr($string, 0, $pos)
        . $char
        . mb_substr($string, $pos + 1);

UTF-8-safe functions⁷

    addslashes()      stripslashes()
    bin2hex()         strip_tags()
    explode() [4]     str_repeat()
    implode()         str_replace() [4]
    nl2br()

Escaping functions

The functions htmlentities()⁸ and htmlspecialchars() both have a third parameter which corresponds to the character set used during conversion. Unlike with the multibyte functions (mb_*()), this third parameter is mandatory whenever the character set is not 'ISO-8859-1', no matter what the internal encoding is!

The functions urlencode() and rawurlencode() do not have any character encoding parameter. The safest solution is to put your UTF-8 strings in session variables instead of URL arguments.
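
The replacements listed under "Multibyte string functions" and "String access by character" can be sketched as two small helpers. The names utf8_char_at_sketch() and utf8_set_char_at_sketch() are hypothetical. The encoding is passed explicitly here, which is an assumption made so that the result does not depend on the mbstring.internal_encoding setting.

```php
<?php
// Hypothetical helpers, for illustration only.

// Character-based replacement for reading $string[$pos].
function utf8_char_at_sketch($string, $pos)
{
    return mb_substr($string, $pos, 1, 'UTF-8');
}

// Character-based replacement for writing $string[$pos] = $char.
function utf8_set_char_at_sketch($string, $pos, $char)
{
    $length = mb_strlen($string, 'UTF-8');
    return mb_substr($string, 0, $pos, 'UTF-8')
        . $char
        . mb_substr($string, $pos + 1, $length, 'UTF-8');
}

$word = "ni\xC3\xB1o"; // "niño" as UTF-8 bytes

$bytes      = strlen($word);                          // counts bytes
$characters = mb_strlen($word, 'UTF-8');              // counts characters
$third      = utf8_char_at_sketch($word, 2);          // the whole "ñ"
$changed    = utf8_set_char_at_sketch($word, 2, 'n'); // "nino"
```

With byte-based access, position 2 of this string would return only the first byte of "ñ", which is not a valid UTF-8 string on its own.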

Comparing strings and sorting arrays

Use the Collator class: http://www.php.net/manual/en/class.collator.php

SimpleXML

SimpleXML uses UTF-8 internally and converts all XML content to UTF-8 [1.2], so usually nothing needs to be done.

PCRE functions [1]

Search for all PCRE function calls (preg_*) and append the /u pattern modifier⁹.

Storable representation of variables

The serialize() and unserialize() functions can be used transparently. However, be careful when reading or writing serialized UTF-8 strings with languages other than PHP [4].
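
The effect of the /u pattern modifier on the preg_* functions can be seen in a short sketch. The variable names are illustrative only, and the test string is written as escaped bytes so that the source file encoding does not matter.

```php
<?php
$enye = "\xC3\xB1"; // "ñ" as UTF-8: one character, two bytes

// Without /u, "." matches a single byte, so a one-character pattern
// does not match this two-byte string.
$withoutU = preg_match('/^.$/', $enye);

// With /u, pattern and subject are treated as UTF-8 and "." matches
// one whole character.
$withU = preg_match('/^.$/u', $enye);

// Splitting on the empty pattern with /u yields whole characters.
$chars = preg_split('//u', 'a' . $enye . 'b', -1, PREG_SPLIT_NO_EMPTY);
```

This is the same check the verification script on page 1 relies on: the match only succeeds when PCRE has UTF-8 support and the modifier is present.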

5 Note that the mb_strrchr() function has one additional argument, which may be ignored since we just want to adapt existing function calls. Note also that there is an mb_strrichr() function, which has no equivalent among the standard PHP functions.
6 Be careful because the 3rd and 4th arguments of substr_count() no longer exist with mb_substr_count(). You can use mb_substr() to circumvent this limitation.
7 The strcmp() function is UTF-8 safe as well. However, to perform a locale-aware comparison, use Collator::compare instead: http://www.php.net/manual/en/collator.compare.php
8 The function htmlentities() converts Latin characters only [todo source]. Moreover, according to Handling UTF-8 with PHP [4]: “Using [htmlentities] on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output”, and “when using UTF-8, you don’t need entities”.
9 However, there may still be problems, as explained in Handling UTF-8 with PHP [4].
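
The workaround mentioned in note 6 can be sketched as follows: cut the haystack with mb_substr() first, then count in the slice. The helper name utf8_substr_count_sketch() is hypothetical, offsets and lengths are counted in characters, and the warnings that substr_count() raises for out-of-range values are not reproduced.

```php
<?php
// Hypothetical helper, for illustration only: emulates the 3rd and 4th
// arguments of substr_count() with character-based positions.
function utf8_substr_count_sketch($haystack, $needle, $offset = 0, $length = null)
{
    if ($length === null) {
        // No length given: count up to the end of the string.
        $length = mb_strlen($haystack, 'UTF-8');
    }
    $slice = mb_substr($haystack, $offset, $length, 'UTF-8');
    return mb_substr_count($slice, $needle, 'UTF-8');
}

$text = "\xC3\xB1a\xC3\xB1a\xC3\xB1"; // "ñañañ" as UTF-8 bytes

$all    = utf8_substr_count_sketch($text, "\xC3\xB1");       // whole string
$middle = utf8_substr_count_sketch($text, "\xC3\xB1", 1, 3); // only inside "aña"
```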

PHP CODE (2/2)

String functions that are problematic and for which there is no built-in replacement function

Each entry gives the function to replace, the replacement from the PHP UTF8 Library¹⁰ [3] where one exists, and a comment.

Replace: ord($chr)¹¹
With:    utf8_ord($chr)

Replace: sprintf()
Comment: The x and X type specifiers could be an issue, according to [4].

Replace: str_ireplace($search, $replace, $subject [, &$count])
With:    utf8_ireplace($search, $replace, $subject [, &$count])
Comment: Alternatively, write your own implementation using preg_replace().

Replace: str_pad($input, $length, $padStr, $type)
With:    utf8_str_pad($input, $length, $padStr, $type)

Replace: str_split($str, $split_len)
With:    utf8_str_split($str, $split_len)
Comment: Alternatively, use this function: http://www.php.net/manual/ref.mbstring.php#95192

Replace: strcasecmp($str1, $str2)
Comment: Write your own implementation using collator_compare() and mb_strtolower()

Replace: strncmp($str1, $str2, $len)
Comment: Cut the two strings at the specified length, and use collator_compare()

Replace: strncasecmp($str1, $str2, $len)
Comment: Write your own implementation using your replacement of strncmp() and mb_strtolower()

Replace: strspn($str1, $str2[, $start[, $len]])
With:    utf8_strspn($str1, $str2[, $start[, $len]])

Replace: strcspn($str1, $str2[, $start[, $len]])
With:    utf8_strcspn($str1, $str2[, $start[, $len]])

Replace: strrev($string)
With:    utf8_strrev($string)

Replace: strtr()
Comment: This function doesn't work if any parameter is UTF-8. Write your own implementation.

Replace: substr_replace()
With:    utf8_substr_replace()

Replace: trim($str, $charlist), ltrim($str, $charlist), rtrim($str, $charlist)
With:    utf8_trim($str, $charlist), utf8_ltrim($str, $charlist), utf8_rtrim($str, $charlist)
Comment: The original functions trim(), ltrim() and rtrim() are UTF-8-safe as long as the 2nd parameter is not used [4].

Replace: ucfirst($str)
With:    utf8_ucfirst($str)

Replace: ucwords($str)
With:    utf8_ucwords($str)

Replace: wordwrap()
Comment: Write your own implementation


CREDITS

Sources

[1] PHP.net documentation
[1.1] PHP.net, Multibyte String Runtime Configuration, http://www.php.net/manual/en/mbstring.configuration.php
[1.2] A comment about SimpleXML on PHP.net: http://www.php.net/manual/en/ref.simplexml.php#79258
[2] MySQL.com documentation
[3] Harry Fuecks, PHP UTF-8 library, http://sourceforge.net/projects/phputf8
[4] Web Application Component Toolkit, Handling UTF-8 with PHP, http://www.phpwact.org/php/i18n/utf-8
[5] W3C, Setting the HTTP charset parameter, http://www.w3.org/International/O-HTTP-charset.php

Author and Copyright

Copyright © François Cardinaux 2011
Feel free to contact me at: http://www.linkedin.com/in/francoiscardinaux

License

Creative Commons Attribution-Non-Commercial-Share Alike 3.0
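
For the strcasecmp(), strncmp() and strncasecmp() entries of PHP CODE (2/2), which ask for a hand-written implementation, a possible shape is sketched here. The function names are hypothetical. As an assumption, strcmp() (byte-wise, hence code point order on UTF-8) stands in for collator_compare(), so the comparison is not locale-aware; where the intl extension is available, collator_compare() may be substituted as the cheat sheet suggests.

```php
<?php
// Hypothetical helpers, for illustration only.
// Assumption: strcmp() is used instead of collator_compare(), so the
// ordering follows code points and ignores locale rules.

function utf8_strcasecmp_sketch($str1, $str2)
{
    return strcmp(
        mb_strtolower($str1, 'UTF-8'),
        mb_strtolower($str2, 'UTF-8')
    );
}

function utf8_strncmp_sketch($str1, $str2, $len)
{
    // Cut both strings at $len characters (not bytes), then compare.
    return strcmp(
        mb_substr($str1, 0, $len, 'UTF-8'),
        mb_substr($str2, 0, $len, 'UTF-8')
    );
}

function utf8_strncasecmp_sketch($str1, $str2, $len)
{
    return utf8_strncmp_sketch(
        mb_strtolower($str1, 'UTF-8'),
        mb_strtolower($str2, 'UTF-8'),
        $len
    );
}

$sameIgnoringCase = utf8_strcasecmp_sketch("\xC3\x91and\xC3\x9A", "\xC3\xB1AND\xC3\xBA");
$samePrefix       = utf8_strncmp_sketch("\xC3\xB1ab", "\xC3\xB1ac", 2);
$fullCompare      = utf8_strncmp_sketch("\xC3\xB1ab", "\xC3\xB1ac", 3);
```

Only the sign of the results should be relied on, since the exact non-zero value returned by strcmp() differs between PHP versions.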

10 Version 0.5
11 Underlined parameters fail if they are UTF-8-encoded
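
To illustrate why ord() appears among the problematic functions: it only looks at the first byte of its argument. The sketch below shows one way to obtain the code point of a UTF-8 character with mbstring alone. utf8_ord_sketch() is a hypothetical name and is not the implementation used by the PHP UTF8 Library [3].

```php
<?php
// Hypothetical helper, for illustration only: returns the Unicode code
// point of the first character of a UTF-8 string.
function utf8_ord_sketch($chr)
{
    if ($chr === '') {
        return 0; // mirrors ord(''), which also returns 0
    }
    $first = mb_substr($chr, 0, 1, 'UTF-8');
    // UCS-4BE stores each character as a 4-byte big-endian integer.
    $ucs4  = mb_convert_encoding($first, 'UCS-4BE', 'UTF-8');
    $parts = unpack('N', $ucs4); // 'N' = unsigned 32-bit, big-endian
    return $parts[1];
}

$firstByte = ord("\xC3\xB1");             // first byte of "ñ" only
$codePoint = utf8_ord_sketch("\xC3\xB1"); // code point of "ñ" (U+00F1)
```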

