Perl Programming/Unicode UTF-8


Overview
In the context of web application development, Unicode with UTF-8 encoding is the best way to support multiple
languages in your web application. Multiple languages can even be supported on the same web page.
Unicode (usually in UTF-8 form) is replacing ASCII and the use of 8-bit "code pages" such as ISO-8859-1 and
Windows-1252.

Unicode
Unicode [1] is a standard that specifies all of the characters for most of the world's writing systems. Each character is
assigned a unique codepoint, such as U+0030. The first 256 code points are the same as ISO-8859-1 [2] to make it
trivial to convert existing Western/Latin-1 text.
To view properties [3] for a particular codepoint:

use Unicode::UCD 'charinfo';
use Data::Dumper;
print Dumper(charinfo(0x263a)); # U+263a

If you view the Unicode character reference [4], you will notice that not every codepoint has an assigned character.
Also, because of backward compatibility with legacy encodings, some characters have multiple codepoints.

UTF-8
UTF-8 [5] is a specific encoding of Unicode — the most popular encoding. Other encodings include UTF-7, UTF-16,
UTF-32, etc. You will probably want to use UTF-8, if you decide to use Unicode.
An encoding defines how each Unicode codepoint maps to bits and bytes. In UTF-8 encoding, the first 128 Unicode
codepoints use one byte. These byte values are the same as US-ASCII [6], making UTF-8 encoding and ASCII
encoding interchangeable if only ASCII characters are used. The next 1,920 codepoints use 2-byte encoding in
UTF-8. Three or four bytes are needed to encode the remaining codepoints.
Note that although Unicode codepoints 128-255 are the same as ISO-8859-1, UTF-8 encodes each of these
codepoints differently. UTF-8 uses two bytes to encode each of these codepoints, whereas ISO-8859-1 only uses one
byte for each character in that range. Therefore, ISO-8859-1 and UTF-8 are not interchangeable. (If only ASCII
characters are used, then they are all interchangeable, since ASCII, ISO-8859-1, and UTF-8 all share the same
encoding for the first 128 Unicode codepoints.)
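The byte counts described above can be checked with a minimal sketch that uses the core Encode module. The helper name encoded_length and the sample codepoints are illustrative choices, not part of the original text.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the number of octets needed to encode one codepoint
# in the given character encoding.
sub encoded_length {
    my ($encoding, $codepoint) = @_;
    return length( encode($encoding, chr($codepoint)) );
}

# U+0041 (A), U+00F1 (n with tilde), U+263A (smiley), U+1F600 (emoji)
for my $cp (0x41, 0xF1, 0x263A, 0x1F600) {
    printf "U+%04X: %d byte(s) in UTF-8\n", $cp, encoded_length('UTF-8', $cp);
}

# The same codepoint, U+00F1, needs only one byte in ISO-8859-1.
printf "U+00F1: %d byte(s) in ISO-8859-1\n", encoded_length('ISO-8859-1', 0xF1);
```

The loop reports 1, 2, 3 and 4 bytes respectively, while U+00F1 takes a single byte in ISO-8859-1. That difference is why the two encodings are not interchangeable above codepoint 127.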
So, to reiterate, with UTF-8, not all characters are encoded into a single byte (unlike ASCII and ISO-8859-1). Think
about that for a moment... how might that affect editors (like Notepad), web pages and forms, databases, Perl itself,
Perl IO, your Perl source code (if you want to include a character with a multi-byte encoding)? How might that affect
passing strings around, if the strings contain characters with multi-byte encodings? Do regular expressions still
work?

Character Encoding Comparison


Character Encoding   # characters   128 US-ASCII characters   Next 128 characters   Remaining characters
US-ASCII             128            1 byte                    N/A                   N/A
ISO-8859-1           256            1 byte                    1 byte                N/A
UTF-8                > 100,000      1 byte                    2 bytes               2 - 4 bytes

As you can see from the table above, codepoints 128-255 (0x80-0xff) are where you need to be careful. Later, you
will find out that codepoints 128-159 (0x80-0x9F) are even trickier, due to the fact that the popular Windows-1252
character set (another one-byte-per-character encoding) is incompatible with ISO-8859-1 in this range.

Do I need UTF-8?
If your web application will only ever need to use one language (or one character encoding), then you may not need
UTF-8. However, if maybe someday your application will need to handle multiple languages, it is better to start
using UTF-8 now, rather than later. Reason? It is easier to use UTF-8 everywhere, rather than in just a few places
(more about this below). And it is more difficult to convert all of the pieces later.

How much does UTF-8 "cost"?


• some functions are slower [7] with UTF-8 encoded strings in Perl
• you have to write some additional Perl code to ensure that data coming into Perl is decoded properly, and that
data going out of Perl is encoded properly — but you have to do this anytime you use a character set other than
the native 8-bit character set of the platform (which we'll now refer to as N8CS[8]), which is often
ISO-8859-1/Latin-1
• you have to interact with your database appropriately -- is it using UTF-8?
• you have to ensure your web pages specify that pages are encoded in UTF-8
• you may need to make a web server adjustment (if it is configured to always serve some particular character set,
which is not UTF-8)

How do I use UTF-8?


The "best practice" approach is to use UTF-8 everywhere, if possible. This includes web pages and hence web forms,
databases, HTML templates, and strings stored internally in Perl. One exception might be your Perl source code
itself. If N8CS is sufficient (i.e., if you don't need any UTF-8 characters or strings in your source code), your source
code does not have to be encoded as UTF-8. (Okay, another exception might be your HTML templates. If your
templates only require/contain N8CS, they do not have to be encoded as UTF-8 either.)
To properly use UTF-8 in a Perl web application, here is a summary of what must be done:
• All text (non-binary) data/octets coming into Perl (hence form data, database data, file reading, HTML templates,
etc.) must be properly decoded. If the incoming text/octets are UTF-8 encoded, they must be UTF-8 decoded. If
they are N8CS (usually ISO-8859-1) encoded, they should be N8CS decoded. If they are encoded with some
other character set, they must be decoded with that character set.
• All text data going out of Perl (hence to the browser, database, files, etc.) must be properly encoded (into an octet
stream). STDOUT (which goes to the browser) must be UTF-8 encoded.
• the browser needs to be told that web pages are UTF-8 encoded via an HTTP header and a <meta> tag
Do not use Perl versions prior to 5.8.1. Although support for UTF-8 began with v5.6.0, regular expressions do not
work even in the next release, v5.6.1. v5.8.1 added some speed improvements [7]. (By the way, PHP planned native
Unicode support for v6.0, a release that was ultimately abandoned.)


Before we start getting into the finer details about how to use UTF-8, we need to first define some terms, and then
talk a bit about Perl's dual personality when it comes to internally storing text.

Terminology
A character is a logical entity. Characters must be encoded (using a character set) in order to be used, stored,
written, exchanged between programs, etc. Encoding turns a logical character into something we can use in a program.
Depending on which character set is used for encoding, a single character may require one or more bytes to
represent it.
We'll use the term octets when referring to data passing into or out of a Perl program. An octet is a byte, 8 bits.
Encoded characters make up an octet stream. When an octet stream comes into Perl, the bytes should be decoded
(using the correct character set -- the character set they were encoded with) so that Perl can determine which logical
characters are contained in the encoded octet stream. Perl can then store these as strings -- a sequence of characters.
Binary data also comes in as an octet stream. It should not be decoded using a character set, because it likely either
doesn't contain any characters, or it contains information in addition to characters, and hence cannot be decoded with
a character set.

Perl strings/text
Internally, Perl stores each string in one of the following encodings:
• native encoding — byte encoding. It uses N8CS[8], the native 8-bit character set of the platform (often
ISO-8859-1/Latin-1). This is a one-byte-per-character encoding, and hence a maximum of only 256 characters
can be encoded. This is the default encoding for all incoming text/octets if Perl is not instructed to decode (bad
idea). Strings using this encoding are called byte strings or binary strings.
• UTF-8 encoding — character encoding. It uses (obviously) UTF-8. Strings using this encoding are called
character strings or text strings or Unicode strings.
When creating your own strings, Perl uses N8CS when possible (for backwards compatibility and efficiency
reasons). However, if a character cannot be represented in N8CS, UTF-8 is used. In other words, if all code points in
a string are <= 0xFF, N8CS is used, otherwise UTF-8 is used.

$native_string = "\xf1";
$native_string = "\x{00f1}"; # still N8CS, since <= 0xff
$native_string = chr(0xf1); # still N8CS, since <= 0xff
$utf8_string = "\x{0100}";

You can convert a N8CS string to a UTF-8 string using utf8::upgrade():

$my_string = "\xf1";        # N8CS byte string (one byte is used internally to encode)
utf8::upgrade($my_string);  # UTF-8 character string now (two bytes are used internally to encode)

Your program can have a mix of strings in both of Perl's internal formats. Perl uses a "UTF8 flag" to keep track of
which encoding a string is internally using. Thankfully, the format/flag follows the string. Perl keeps a string in
N8CS as long as possible. However, when a N8CS/native string is used together with a UTF-8 string, the native
string is silently and implicitly decoded using N8CS, and upgraded (encoded) to UTF-8. In other words, the native byte
string gets decoded with the native character set, and then it gets internally encoded into UTF-8. The resulting
character string will have the UTF8 flag set.

$my_string = 'a'; # N8CS byte string
$my_string .= "\x{0100}"; # UTF-8 character string now
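To watch the UTF8 flag follow a string, the built-in utf8::is_utf8() function can be used. The following is a small sketch for experimentation only. The variable names are illustrative, and production code should rarely need to inspect the flag.

```perl
use strict;
use warnings;

my $native = "\xf1";       # one character, stored natively (flag off)
my $wide   = "\x{0100}";   # cannot be stored natively (flag on)

# Record the flag state before and after mixing the two strings.
my $flag_before = utf8::is_utf8($native) ? 1 : 0;
my $joined      = $native . $wide;    # native part is implicitly upgraded
my $flag_joined = utf8::is_utf8($joined) ? 1 : 0;

# An explicit upgrade changes the internal storage, not the characters.
utf8::upgrade($native);
my $flag_after = utf8::is_utf8($native) ? 1 : 0;

printf "before=%d joined=%d after=%d length=%d\n",
    $flag_before, $flag_joined, $flag_after, length $native;
```

The string still has a length of one character after the upgrade, and it still compares equal to "\xf1". Only its internal representation has changed.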

Normally, you should not need to know about how Perl is internally storing/encoding text. An exception to this rule
is if you have a natively encoded string with bytes in the 0x80-0xFF range — in other words, a natively encoded
string with non-ASCII characters. In this case, some string operations may not work as expected — see Perl 5
"Unicode Bug". Normally, you should upgrade these N8CS byte strings to UTF-8 character strings using
utf8::upgrade(). String operations will then work as expected, although sometimes slower.

UTF-8 Flow
Any Perl IO needs to correctly handle decoding and encoding of strings/text. Since there are multiple character
encodings in use in the world, Perl can't correctly guess which character encoding was used to encode some
particular incoming text/octets, nor can it know which character encoding you want to use for outgoing text/octets.
An incoming stream of UTF-8 octets is not the same as, say, an incoming stream of Windows-1252 octets. For
example, Unicode character U+201c (left double quotation mark) is encoded in one byte in Windows-1252 (0x93),
but UTF-8 encodes it using three octets (0xE2 0x80 0x9C). If you want Perl to interpret your incoming text/octets
correctly, you must tell Perl which character set was used to encode them, so they can be decoded properly.
The typical flow of UTF-8 text/octets in to and out of a Perl program is as follows:
1. Receive an external UTF-8 encoded text/octet stream and correctly decode it — i.e., tell Perl which character set
the octets are encoded in (in this case, the encoding is UTF-8). Perl may check for malformed data (bad encoding)
while decoding, depending on which decoding method you select. Perl stores the string internally as N8CS or
UTF-8, depending on which decoding method you select, and what characters are found to be in the octet stream.
(Normally, the string will be internally stored as UTF-8.)
2. Process the string as you normally would.
3. Encode the string into a UTF-8 encoded octet stream and output it.
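The three steps above can be sketched with the core Encode module as follows. The sample octets and variable names are assumptions made for illustration. The FB_CROAK check flag is one possible choice that makes decode() die on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1. Decode: octets as they might arrive from a file, form, or socket.
#    These five octets are "cafe" with an acute accent on the e, in UTF-8.
my $octets_in = "caf\xC3\xA9";
my $text = decode('UTF-8', $octets_in,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());

# 2. Process: the string now holds four characters, not five octets.
my $char_count = length $text;
my $shouted    = uc $text;

# 3. Encode: turn the characters back into UTF-8 octets for output.
my $octets_out = encode('UTF-8', $shouted);

printf "%d octets in, %d characters, %d octets out\n",
    length $octets_in, $char_count, length $octets_out;
```

Because the string was decoded, uc() also uppercases the accented character. On the undecoded octets it would have left the two bytes of that character untouched.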

1. Decoding Text Input


External input includes submitted HTML form data, database data (e.g., from SELECT statements), HTML
templates, text files, sockets, other programs, etc. If any of these might contain UTF-8 encoded data/text, you must
decode it. UTF-8 decoding in Perl involves two steps:
1. Decoding the text according to UTF-8 format rules. This may generate decoding errors, depending on which
decoding method you select. Using decode() always results in the string being internally stored as UTF-8,
with the UTF8 flag set (despite what the documentation for Encode says). Using utf8::decode() may result
in N8CS or UTF-8 internal encoding. If the incoming text only contains ASCII characters, N8CS is used,
otherwise UTF-8 is used.
2. Encoding the text (this might be a no-op) and storing it internally as N8CS or UTF-8. If it is stored as UTF-8, the
UTF8 flag is set.
If you are certain that the incoming data/octets only contains N8CS (often this means ISO-8859-1) text, you do not
need to explicitly decode it (because Perl's default internal encoding is N8CS, which is a one-byte-per-character
encoding). However, "best practice" suggests that all incoming data/octets should be explicitly decoded — you can
explicitly decode ISO-8859-1, ASCII, and a number of other character encodings.
If you don't decode, Perl assumes input text/octets are N8CS encoded, hence each octet is treated as a separate
character — clearly, this is not what you want if you have a multi-byte UTF-8 encoded octet stream/text coming in.
Improper decoding can lead to double encoding, and this can be difficult to locate due to implicit decoding (discussed
above).
Another important point to make here: you need to know which encoding was used for each input text. Do not guess,
do not assume.
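As a hedged illustration of what goes wrong when the decode step is skipped, the sketch below encodes the same three incoming octets twice, once without decoding and once after decoding. The sample data is U+201C (left double quotation mark), and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xE2\x80\x9C";    # U+201C encoded as UTF-8 (three octets)

# Without decoding, Perl sees three separate one-byte characters.
my $undecoded_length = length $octets;

# Encoding the undecoded string encodes each octet again: double encoding.
my $double_encoded = encode('UTF-8', $octets);
my $double_hex = join ' ', map { sprintf '%02X', ord($_) } split //, $double_encoded;

# Decoding first yields one character, and encoding it restores the input.
my $text       = decode('UTF-8', $octets);
my $round_trip = encode('UTF-8', $text);

print "undecoded length: $undecoded_length\n";
print "double encoded:   $double_hex\n";
print "decoded length:   ", length($text), "\n";
```

The double-encoded result is six octets (C3 A2 C2 80 C2 9C) instead of three, which a browser would display as gibberish.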

Input - Files, File Handles


Perl can automatically decode data as it comes into Perl using PerlIO layers:

open my $in_fh, "<:encoding(utf8)", $filename or die; # auto UTF-8 decoding on read

If you already have an open filehandle:

binmode $in2_fh, ':encoding(utf8)';

Do not use :utf8 since it does not check that your incoming text is valid UTF-8, it simply marks it as UTF-8 —
see Perlmonks [9].
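A decoding layer can be tried without creating a file by opening an in-memory filehandle on a scalar that holds raw octets. The sketch below assumes the stricter ':encoding(UTF-8)' spelling of the layer name. The sample octets and variable names are illustrative.

```perl
use strict;
use warnings;

# Seven characters plus a newline, encoded as ten UTF-8 octets
# (the third and sixth characters take two octets each).
my $octets = "sm\xC3\xB6rg\xC3\xA5s\n";

# Open the scalar as if it were a file, decoding UTF-8 on read.
open my $in_fh, '<:encoding(UTF-8)', \$octets
    or die "Cannot open in-memory file: $!";
my $line = <$in_fh>;
close $in_fh;
chomp $line;

printf "%d octets read as %d characters\n", length($octets) - 1, length $line;
```

The same open() call works with a real filename in place of the scalar reference.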

Input - HTML Templates


If you are using a CGI framework or template engine to pull in UTF-8 encoded HTML template files, you may need
to inform it about the UTF-8 encoding, so that it can "UTF-8 decode" the template files as they are read in. Basically,
the framework or template engine needs to do what we talked about in the previous section.
For Template::Toolkit [10], if you use an appropriate Byte Order Mark (BOM) [11] in your template files to indicate
the encoding, the toolkit will decode them appropriately, automatically. If the templates do not use BOMs, use the
ENCODING option:
my $template = Template->new({ ENCODING => 'utf8' });
HTML::Template [12] currently does not support decoding of UTF-8 encoded HTML template files. This is a known
limitation/bug [13]. There are a few workarounds:
• A patch [13] is available.
• You can use TMPL_VARs to insert UTF-8 content [14] into an N8CS (or even ASCII) encoded template file.
UTF-8 decode your parameters/content before inserting them into an HTML template using TMPL_VARs, and
implicit decoding should upgrade the resulting text (i.e., the template and the filled-in variables) to UTF-8
internally. For many applications, this is often sufficient.

Input - Web Forms


By default, CGI.pm [15] does not decode your form parameters. You can use the -utf8 pragma, which will treat
(and decode) all parameters as UTF-8 strings, but this will fail if you have any binary file upload fields. A better
solution involves overriding the param method:

package CGI::as_utf8;
BEGIN {
    use strict;
    use warnings;
    use CGI 3.47; # earlier versions have a UTF-8 double-decoding bug
    {
        no warnings 'redefine';
        my $param_org = \&CGI::param;
        my $might_decode = sub {
            my $p = shift;
            # make sure upload() filehandles are not modified
            return $p if !$p || ( ref $p && fileno($p) );
            utf8::decode($p); # returns false if $p is not well-formed UTF-8
            $p
        };
        *CGI::param = sub {
            # setting a param goes through the original interface
            goto &$param_org if scalar @_ != 2;
            my ($q, $p) = @_; # assume object calls always
            return wantarray
                ? map { $might_decode->($_) } $q->$param_org($p)
                : $might_decode->( $q->$param_org($p) );
        }
    }
}
1;
---
# put this line in your app, e.g., in your CGI::Application module(s)
use CGI::as_utf8;

The above is rhesa's solution [16] with a slight modification — utf8::decode() is used instead of Encode's [17]
decode_utf8(), as it is more efficient when only ASCII characters are involved (since the UTF8 flag is not set).
Note that the module assumes that web pages and forms are always UTF-8 encoded, and that the OO interface of
CGI.pm is always used.
Note, browsers should encode form data in the same character encoding that was used to display the form. So, if you
are sending UTF-8 forms, you should get UTF-8 encoded data back for text fields. You should not have to use
accept-charset in your HTML markup.

Input - STDIN
When a web form is POSTed, form data comes into Perl via STDIN. If you are using CGI.pm, text form data is
available via CGI.pm's param() method, and the previous section describes how to properly handle UTF-8 encoded
text form data.
If you don't have any file uploads (i.e., all of your data is text), then instead of the CGI::as_utf8 module, you could
add the following line of code to the beginning of your script to cause all data received on STDIN (i.e., all POSTed
form data) to be automatically decoded as UTF-8:

binmode STDIN, ":encoding(utf8)";

Do not use

binmode STDIN, ":utf8"; # do NOT use this!

since it does not check that your incoming text is valid UTF-8; it simply marks it as UTF-8 (see Perl 5 Wiki [18]).
The approach in the previous section is preferred, since it will "do the right thing" if there is any binary form data
(file uploads).
If you are writing some other (non-CGI) program that receives data on STDIN, decode appropriately:

use Encode qw(decode);

my $utf8_text = decode('UTF-8', readline STDIN);
my $iso8859_text = decode('ISO-8859-1', readline STDIN);
my $binary_data = read(...); # don't decode

Note that decode() always sets Perl's internal UTF8 flag.

Input - Database
In the "use UTF-8 everywhere" model, configure your database to store values in UTF-8.
When reading data from a UTF-8 database, ensure incoming UTF-8 encoded string field data is UTF-8 decoded, but
do not decode incoming binary field data.

Input - MySQL
With MySQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the
mysql_enable_utf8 database handle attribute [19]:

use DBI();
my $dbh = DBI->connect('dbi:mysql:test_db', $username, $password,
{mysql_enable_utf8 => 1}
);

This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field
data — the driver will do that for you. If the incoming data for a field only contains ASCII octets, the UTF8 flag is
not set for that field (so it appears to be using utf8::decode()). The driver is also smart enough to not decode
binary data.
Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the
system default.

Input - PostgreSQL
With PostgreSQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the
pg_enable_utf8 database handle attribute [20]:

use DBI();
my $dbh = DBI->connect('dbi:Pg:dbname=test_db', $username, $password,
{pg_enable_utf8 => 1}
);

This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field
data — the DBD::Pg driver will do that for you. The driver is also smart enough to not decode binary data.
You may (TBD: when?) also need to tell PostgreSQL to use UTF-8 when sending data out of the database:
SET CLIENT_ENCODING TO 'UTF8';
or
SET NAMES 'UTF8';
For example, with Rose::DB:

__PACKAGE__->register_db(
domain => 'development',
...
connect_options => {
pg_server_prepare => 0,
pg_enable_utf8 => 1,
},
post_connect_sql => "SET CLIENT_ENCODING TO 'UTF8';",
);

See Automatic Character Set Conversion Between Server and Client [21]

2. Processing strings
Once all incoming strings have been decoded into UTF-8 internally, you can process your text as normal. Regular
expressions will work (if using Perl v5.8 or higher).
If you create any strings in your source code that contain non-ASCII characters (characters above 0x7f), ensure you
upgrade them to internal UTF-8 encoding:

my $text = "\xE0"; # 0xE0 = à in ISO-8859-1
utf8::upgrade($text);

my $unicode_char = "\x{00f1}"; # U+00F1 = ñ
utf8::upgrade($unicode_char);

Perl 5 "Unicode Bug"


Without a locale specified, if you have native/N8CS strings with characters in the 0x80-0xFF (128-255) range, then
\d, \s, \w, \D, \S, \W (hence regular expressions), and uc(), lc(), etc. may not work as expected,
since the non-ASCII part (0x80-0xFF) of the character set is ignored for those operations. (This is another reason to
try and use UTF-8 everywhere.) Without a locale, Perl can't properly interpret characters in this range, since different
encodings use different characters in this range, so it ignores them -- this is called ASCII semantics.
There are two ways to avoid this "Unicode Bug". Both involve getting the natively encoded string to switch to
UTF-8 encoding — because when the internal encoding is UTF-8, Unicode semantics are used, which always work
as expected.
1. Follow "best practice" and always properly decode all external input text/octets. During decoding, any text/octets
found to contain non-ASCII characters will be converted to UTF-8 internal encoding. For example

use Encode;
# Suppose $windows1252_octets contains text from an external input, and it contains
# the character "\xE0" (0xE0 = à). String $windows1252_octets will exhibit the
# Unicode bug -- it won't match /\w/
my $utf8_string = decode('cp1252', $windows1252_octets); # no Unicode bug, $utf8_string matches /\w/

2. Use utf8::upgrade($native_string) to force $native_string to switch to UTF-8 internal encoding.
(Even if the string only contains ASCII characters, it is still "upgraded" to UTF-8.)

my $text = "\xE0"; # will exhibit Unicode bug, won't match /\w/
utf8::upgrade($text); # no Unicode bug, matches /\w/
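The effect of the upgrade can be observed directly. The sketch below records the results of /\w/ and uc() on the same one-character string before and after utf8::upgrade(). It assumes a script that does not enable the unicode_strings feature (for example via "use v5.12" or later) and does not use a locale. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "\xE0";    # a with grave accent, stored natively

# Native storage: ASCII semantics apply to the 0x80-0xFF range.
my $matches_before = ($text =~ /\w/) ? 1 : 0;
my $upper_before   = uc $text;

utf8::upgrade($text);

# UTF-8 storage: Unicode semantics apply to the very same character.
my $matches_after = ($text =~ /\w/) ? 1 : 0;
my $upper_after   = uc $text;

printf "matches \\w: before=%d after=%d\n", $matches_before, $matches_after;
printf "uc() changed the character: before=%d after=%d\n",
    ($upper_before ne "\xE0") ? 1 : 0,
    ($upper_after  ne "\xE0") ? 1 : 0;
```

Before the upgrade the character does not match /\w/ and uc() leaves it alone. Afterwards it matches, and uc() returns U+00C0.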

Note that with internal UTF-8 encoding, \w represents a much, much larger set of characters, so regex operations
will be slower (vs. native encoding). TBD: what is the actual performance degradation? What is the character set for
\w with Unicode semantics?
See also Unicode::Semantics [22]. This issue should be fixed in Perl v5.12.
4/19/10 update: v5.12 is now available, and the "case changing component" has been fixed: "Perl 5.12 now bundles
Unicode 5.2. The “feature” pragma now supports the new “unicode_strings” feature:

use feature "unicode_strings";


This will turn on Unicode semantics for all case changing operations on strings, regardless of how they are currently
encoded internally." Read more [23].
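A minimal sketch of the feature described in the quote, assuming Perl 5.12 or later: with unicode_strings enabled, uc() applies Unicode rules to a natively stored string, so no explicit utf8::upgrade() is needed for case changes. The variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # requires Perl 5.12 or later

my $native = "\xE0";              # a with grave accent, stored natively

# Case changing now uses Unicode semantics regardless of internal storage.
my $upper = uc $native;

# The original string has not been upgraded by the operation.
my $still_native = utf8::is_utf8($native) ? 0 : 1;

printf "uc gives U+%04X, original still native: %d\n", ord $upper, $still_native;
```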

3. Encoding and output


Output from a web program includes STDOUT (which is sent to your browser for a CGI program), STDERR (which
usually goes to the web server's error log), database writes, log file output, etc.
If outgoing text is not encoded, the text will be sent using the bytes in Perl's internal format, which could be a
mixture of native/N8CS and UTF-8. This may work, but don't take a chance — "best practice" calls for explicitly
encoding all output appropriately.
Perl will warn you if you print a string with a character that has an ordinal value greater than 255:

$ perl -e 'print "\x{0100}\n"'
Wide character in print at -e line 1.
Ā

To avoid this warning, explicitly encode output (as described below).

Output - STDOUT
To ensure all output going back to the web browser (i.e., STDOUT) is UTF-8 encoded, add the following near the top
of your Perl script:

binmode STDOUT, ":encoding(utf8)";
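To see which octets an encoding layer actually emits, the same kind of layer can be attached to an in-memory filehandle instead of STDOUT. This is a sketch for experimentation only. It assumes the ':encoding(UTF-8)' spelling of the layer name, and the buffer and variable names are illustrative.

```perl
use strict;
use warnings;

my $buffer = '';

# Open a scalar as an output "file" with a UTF-8 encoding layer.
open my $out_fh, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory file: $!";
print {$out_fh} "\x{0100}\n";    # no "Wide character" warning: the layer encodes
close $out_fh or die "Cannot close in-memory file: $!";

# Show the octets that were written.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
print "octets written: $hex\n";
```

The single character U+0100 is written as the two octets C4 80, followed by the newline.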

If you want to be a little more efficient (but not follow "best practice"), you can opt to only encode the outgoing page
if it is flagged as UTF-8:

if(utf8::is_utf8($page)) {
utf8::encode($page);
}
# else, $page is natively encoded, so skip encoding for output

Here is a snippet [24] that can be used with the CGI::Application [25] framework:

__PACKAGE__->add_callback('postrun', sub {
my $self = shift;
# Make sure the output is utf8 encoded if it needs it
if($_[0] && ${$_[0]} && utf8::is_utf8(${$_[0]}) ){
utf8::encode( ${$_[0]} );
# ${$_[0]} .= 'utf8::encode() called'; # useful for debugging
}
});

The above code should be put into CGI::Application base class(es). Optionally, the code can be added to
cgiapp_postrun().
Note that all of the above encoding techniques will only work properly if all of the input UTF-8 octets were properly
decoded.

Output - Database
As mentioned above, in the "use UTF-8 everywhere" model, configure your database to store values in UTF-8.
When writing data to a UTF-8 database (INSERT, UPDATE, etc.), ensure your UTF-8 strings get UTF-8 encoded
before being written to the database. Do not encode binary field data.

Output - MySQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the
mysql_enable_utf8 database handle attribute [19]. This means you should not call utf8::encode() (or
any other UTF-8 encode function) on your strings when using this attribute — the driver will do that for you. The
driver is also smart enough to not encode binary data.
Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the
system default.

Output - PostgreSQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the
pg_enable_utf8 database handle attribute [20]. This means you should not call utf8::encode() (or any
other UTF-8 encode function) on your strings when using this attribute — the DBD::Pg driver will do that for you.
The driver is also smart enough to not encode binary data.
You may (TBD: when?) also need to tell PostgreSQL to expect UTF-8 coming into the database:
SET CLIENT_ENCODING TO 'UTF8';
or
SET NAMES 'UTF8';
See Automatic Character Set Conversion Between Server and Client [21]

Output - Files, File Handles


If you need to write to files, Perl can automatically encode data as it is written using PerlIO layers:

open my $out_fh, ">:utf8", $filename or die; # auto UTF-8 encoding on write

If you already have an open filehandle:

binmode $out2_fh, ':utf8';

Tell the Browser to use UTF-8


To serve a UTF-8 encoded page to a browser, "best practice" is to specify the UTF-8 charset in an HTTP
Content-Type header and inside the HTML file in a content-type <meta> tag. CGI.pm defaults to sending the
following Content-Type header:
Content-Type: text/html; charset=ISO-8859-1
Add the following to cause UTF-8 to be used instead of ISO-8859-1, where $q is your CGI object:
$q->charset('UTF-8');
If you are using the CGI::Application framework, put the above line in cgiapp_init().
If you are not using CGI.pm to generate your HTML markup, put the following meta tag as the first meta tag in the
<head> section of your HTML markup:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

Perl source code


If you only need to embed a few Unicode characters in a few strings in your source code, you do not need to save
your source code/file in UTF-8. Instead, use \x{...} or chr() in your code, followed by utf8::upgrade():

my $smiley = "\x{263a}";
# or: my $smiley = chr(0x263a);
utf8::upgrade($smiley); # convert to internal UTF-8 encoding

If you have a lot of Unicode characters, or you prefer to save your source code in UTF-8, then you need to tell Perl
that your source code is UTF-8 encoded. Do this by adding the following line to your source code:

use utf8; # this script is in UTF-8

This is the only reason your program should ever have the above line -- see utf8 [26].
If your source code is UTF-8 encoded, make sure your editor supports reading, editing, and writing in UTF-8!

Gotchas
Often you may not notice Unicode issues until characters with codepoints above 127 are used. This is because
ASCII, ISO-8859-1, Windows-1252 and UTF-8 are all encoded with the same one-byte values for the first 128
Unicode codepoints. To give your application a good Unicode test, try a character in the 0x80 - 0x9F (128-159)
range, and a character above 0xFF (255).

Wide character in print at ...


Perl will warn you [27] if you print a string that has a character with an ordinal value greater than 255 (hence it is a
"wide" character that requires more than one byte of storage):
Wide character in print at ... line ...
Explicitly encode your output to avoid this warning.

Cannot decode string with wide characters at ...


If you receive this error, your code is probably trying to decode the same string a second time, which will fail.

Web Server Always Sends an ISO-8859-1 Header


If you followed the steps above, but your pages are not being displayed properly, it could be that your web server is
configured to always send a particular character encoding in a header, such as ISO-8859-1. To determine if a
Content-Type header is being sent by the web server:
$ lwp-request -de www.bing.com | grep Content
Apache may be configured with the following:
AddDefaultCharset ISO-8859-1
If you can, remove that line, or change it to
AddDefaultCharset UTF-8
if all of the pages served by the server use UTF-8. See also When Apache and UTF-8 Fight [28].

ISO-8859-1 vs Windows-1252
Since you are learning about character encodings, you need to be aware of the difference between ISO-8859-1 and
Windows-1252. From Windows-1252 [29]: "The Windows-1252 encoding is a superset of ISO-8859-1, but differs
from ISO-8859-1 by using displayable characters rather than control characters in the 0x80 to 0x9F (128 - 159)
range. It ... also contains all the printable characters that are in ISO-8859-15 (though some are mapped to different
code points). It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web
browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to
accommodate such mislabeling... the draft HTML 5 specification requires that documents advertised as ISO-8859-1
actually be parsed with the Windows-1252 encoding" since it is a superset of ISO-8859-1.
Here's a fun program to try:

my @undefined_chars_in_windows_1252 = (0x81, 0x8d, 0x8f, 0x90, 0x9d);
my %h = map { $_ => undef } @undefined_chars_in_windows_1252;
foreach my $i (0x80 .. 0x9f) {
    next if exists $h{$i};
    printf "%02x:%c ", $i, $i;
}

What do you see? Do you see the Windows-1252 characters, no characters, square boxes? If you are using PuTTY,
Change Settings... Window, Translation and try selecting ISO-8859-1 or Windows-1252 and run the program again.

Microsoft "Smart" Quotes


MS-Word (TBD: only older versions?) uses those nice left and right fancy/smart quotes. If you copy-paste those
characters into a web form that was served with a Windows-1252 charset (or possibly even an ISO-8859-1 charset),
the characters may be submitted to the web server using the nebulous 0x80-0x9F (128-159) range. (Recall that
Unicode defines control characters in this range — not printable characters like smart quotes.) If your Perl script
does not decode the submitted form properly (i.e., according to the same character encoding that the web form used),
you will get gibberish.
Decode and encode correctly and you will not have any problems with Microsoft smart quotes or any of the other
characters in the nebulous range. Better yet, if you serve all web pages as UTF-8, submitted forms should never
contain these nebulous values, since the "paste" operation should automagically convert these characters to valid
Unicode characters. Your Perl script will then only receive valid UTF-8 encoded characters.
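As a hedged illustration, the sketch below decodes a pair of smart quotes as a Windows-1252 form might submit them, first with the correct character set and then with the commonly mislabeled ISO-8859-1. The sample octets and variable names are assumptions. 'cp1252' is the Encode module's name for Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "quoted" wrapped in smart quotes, as Windows-1252 octets (0x93 and 0x94).
my $form_octets = "\x93quoted\x94";

# Correct: decoding as Windows-1252 yields the real quotation marks.
my $text  = decode('cp1252', $form_octets);
my $right = sprintf 'U+%04X', ord( substr($text, 0, 1) );

# Wrong: decoding as ISO-8859-1 yields a control character instead.
my $as_latin1 = decode('ISO-8859-1', $form_octets);
my $wrong     = sprintf 'U+%04X', ord( substr($as_latin1, 0, 1) );

# Properly decoded text can then be encoded as UTF-8 for output.
my $utf8_out = encode('UTF-8', $text);

print "cp1252 gives $right, ISO-8859-1 gives $wrong\n";
printf "UTF-8 output is %d octets\n", length $utf8_out;
```

Decoded as Windows-1252, the first character is U+201C. Decoded as ISO-8859-1, it is the control character U+0093, which is the source of the gibberish.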

Strange Characters in my Browser


Strange character: �
This is Unicode's "replacement character" (codepoint U+FFFD), which is used to indicate when a Unicode parser
(such as a browser) was not able to decode a stream of Unicode encoded data. The problem is likely an
encode/decode problem somewhere in the chain. (U+FFFD encodes to EF BF BD in UTF-8. If you save the web
page and then open it in bvi, you may see EF BF BD.) IE displays the replacement character as the empty square
box. Firefox uses the black diamond with the question mark.
Usually, these replacement characters appear because the HTML data is Windows-1252 encoded, but the browser
was instructed to use UTF-8 encoding. In your browser, select View->Character Encoding and see if it is set to
UTF-8. If so, try selecting Windows-1252 or Western European (Windows) and see if that resolves the problem. If it
does, then you know that the web server is serving up the wrong character encoding: there is a mismatch between
what is being sent (i.e., how the data is encoded), and what character set the browser is being told to use (i.e., HTTP
header and/or meta tag). If it doesn't resolve the problem, it might be that you don't have a Unicode font installed on
your computer, or the Unicode font does not have a glyph for that particular character.
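The replacement character is easy to reproduce in Perl. The sketch below encodes U+FFFD to confirm the EF BF BD octets, then decodes some Latin-1 octets as if they were UTF-8; with no CHECK argument, Encode substitutes U+FFFD for the malformed byte. The sample string is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+FFFD as UTF-8 octets, shown in hex.
my $octets = encode('UTF-8', "\x{FFFD}");
my $hex    = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
print "U+FFFD in UTF-8: $hex\n";

# Latin-1 octets wrongly decoded as UTF-8: 0xE9 is not a valid UTF-8 sequence here.
my $latin1_octets = "caf\xE9!";
my $decoded       = decode('UTF-8', $latin1_octets);

print "contains U+FFFD\n" if index($decoded, "\x{FFFD}") >= 0;
```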
Strange characters: â€˜ â€™ â€œ â€ â€¢ â€“ â€”
These are the individual characters that correspond to the multi-byte UTF-8 encodings for the following
Windows-1252 characters:
‘’“”•–—
which are in the nebulous 0x80-0x9F (128-159) range. Usually, these characters appear because the HTML data is
UTF-8 encoded, but the browser was instructed to use ISO-8859-1 or Windows-1252. In your browser, try changing
the encoding to UTF-8 and see if that resolves the problem. If that doesn't resolve the problem, or if the encoding is
already set to UTF-8, there may be a double encoding problem somewhere.
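The effect can be reproduced in Perl by encoding a character as UTF-8 and then decoding those octets as Windows-1252, which is what a browser with the wrong encoding selected does. The helper name misread_utf8_as_cp1252 is invented for this sketch.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: simulate a viewer that reads UTF-8 octets as Windows-1252.
sub misread_utf8_as_cp1252 {
    my ($text) = @_;
    return decode('cp1252', encode('UTF-8', $text));
}

my $garbled = misread_utf8_as_cp1252("\x{2019}");   # right single smart quote

# Reversing the two steps recovers the original character.
my $repaired = decode('UTF-8', encode('cp1252', $garbled));

binmode(STDOUT, ':encoding(UTF-8)');
printf "garbled: %s (%d characters)\n", $garbled, length $garbled;
printf "repaired codepoint: U+%04X\n", ord $repaired;
```

One smart quote becomes three characters because its UTF-8 encoding is three octets long, and each octet is displayed as its own Windows-1252 character.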
Strange characters: Ã¢â‚¬Ëœ Ã¢â‚¬â„¢ Ã¢â‚¬Å“ Ã¢â‚¬Â Ã¢â‚¬Â¢ Ã¢â‚¬â€œ Ã¢â‚¬â€
These also correspond to some of the characters in the nebulous 0x80-0x9F (128-159) range. If you see the above
sequences, it is likely that you forgot to decode incoming UTF-8 data (such as form data submitted from a UTF-8
encoded HTML form) in your Perl program and then UTF-8 encoded it for output, so a natively encoded string
was UTF-8 encoded a second time (not good). Fix the problem by calling utf8::decode() on the incoming UTF-8 encoded
data.
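A minimal sketch of the mistake and the fix, using a left smart quote (U+201C) as the incoming data. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $incoming = "\xE2\x80\x9C";   # U+201C as it arrives: three UTF-8 octets

# The bug: encode for output without ever decoding the input.
my $forgot = $incoming;
utf8::encode($forgot);           # each octet is treated as a character and encoded again

# The fix: decode on the way in, encode on the way out.
my $fixed = $incoming;
utf8::decode($fixed);            # now one character
utf8::encode($fixed);            # back to the original three octets

printf "without decode: %d octets\n", length $forgot;
printf "with decode:    %d octets\n", length $fixed;
```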

Strange Characters in my Editor


• ensure your editor supports reading, editing, and writing in UTF-8
• ensure you set your editor to use a Unicode font
• ensure you have a Unicode font installed

Install a Unicode Font on Windows


If you have one of the Microsoft products listed on this page [30], you should have the Arial Unicode MS font.
If it is not installed, follow these steps to install it: Add/Remove Programs, select MS-Office, Add or Remove
Features, click "Choose advanced", Office Shared Features, International Support, Universal Font. Apply the
changes and restart your web browser.

I asked for UTF-8 but I Got Something Else!?


If you specifically asked for UTF-8 text, but the octet stream you receive is not valid UTF-8 encoding, in many cases
you can probably assume that the incoming text/octets are ISO-8859-1/Latin-1 or Windows-1252. Decode with
Windows-1252, since it is a superset of ISO-8859-1.
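One way to sketch this fallback is to attempt a strict UTF-8 decode and fall back to Windows-1252 if it croaks. The helper name decode_utf8_or_cp1252 is hypothetical, and the approach is a heuristic: octets that happen to be valid UTF-8 are always treated as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: prefer strict UTF-8, fall back to Windows-1252.
sub decode_utf8_or_cp1252 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $octets unmodified for the fallback.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;
    return decode('cp1252', $octets);
}

for my $octets ("\xE2\x80\x9C", "\x93") {
    printf "U+%04X\n", ord decode_utf8_or_cp1252($octets);
}
```

Both inputs represent the same left smart quote, once as UTF-8 and once as Windows-1252, and both decode to the same codepoint.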

Double Encoding
If you don't decode UTF-8 text/octets, Perl will assume they are encoded with N8CS (often ISO-8859-1/Latin-1).
This means that the individual octets of a multi-byte UTF-8 character are seen as separate characters (not good). If
these separate characters are later encoded to UTF-8 for output, a "double encoding" results. This is similar to
HTML double encoding — e.g., &amp;gt; instead of &gt;.
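If double-encoded data has already been stored, it can sometimes be repaired by peeling off one layer of encoding. The following is only a heuristic sketch with an invented helper name, undo_double_encoding. It assumes the inner layer was treated as ISO-8859-1/Latin-1, and it will misfire on legitimate text that merely looks double encoded, so treat it as a diagnostic aid rather than a general fix.

```perl
use strict;
use warnings;

# Hypothetical helper: return single-encoded UTF-8 octets if $octets looks
# double encoded; otherwise return $octets unchanged.
sub undo_double_encoding {
    my ($octets) = @_;

    my $once = $octets;
    return $octets unless utf8::decode($once);        # not valid UTF-8 at all
    return $octets unless utf8::downgrade($once, 1);  # has characters above 0xFF

    # $once is now a byte string again; check whether it is itself valid UTF-8.
    my $twice = $once;
    return $octets unless utf8::decode($twice);

    return $once;
}

my $double = "\xC3\xA2\xC2\x80\xC2\x9C";   # U+201C encoded twice
my $single = undo_double_encoding($double);
printf "%d octets reduced to %d\n", length $double, length $single;
```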
Automatic Font Substitution


Most modern browsers and word processors perform font substitution [31], which means that if a character is not in
the current font, the application will search through all of your fonts until it finds one containing that character and it
will then display that character using the glyph in that font.
Sometimes IE7 and IE8 do not seem to perform font substitution correctly. One workaround is to specify a Unicode
font as the first font in the CSS font-family property. IE6 is not considered a modern browser, and it does not
perform font substitution.

Misc

Create Unicode characters


On Windows, you can always use the Character Map application to select, copy, and (switch to your application
then) paste a Unicode character. Ensure the "Character set" drop-down box is set to "Unicode". You can also use the
application to view fonts, characters, and Unicode codepoint values for each character.

In Perl
my $utf8_char = "\x{263a}"; # for codepoints above 0xFF
$utf8_char =~ /\x{263a}/; # same syntax for regex
my $cloud_char = chr(0x2601); # run-time, ord() does the reverse

If your Perl source code file is in UTF-8 format, you can enter the Unicode characters directly:

use utf8; # tells Perl this file is UTF-8 encoded


my $utf8_char = "☺"; # U+263a, "White Smiling Face"
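To check which codepoint a character has, or how many octets it needs once encoded, something like the following can help. The helper name codepoint_label is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: format a character's codepoint in U+XXXX notation.
sub codepoint_label {
    my ($char) = @_;
    return sprintf 'U+%04X', ord($char);
}

my $smiley = chr(0x263a);        # built at run time
my $same   = "\x{263a}";         # built from a string escape

# Count the octets the character needs when encoded as UTF-8.
my $encoded = $smiley;
utf8::encode($encoded);

binmode(STDOUT, ':encoding(UTF-8)');
printf "%s %s: 1 character, %d octets in UTF-8\n",
    codepoint_label($smiley), $smiley, length $encoded;
```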

In Web Forms
On Windows:
• To insert a character from the Windows-1252 codepage [29]: set the Num Lock key on, hold down Alt, then using
the numeric keypad, type 0 followed by the decimal value of the character you want.
• To insert a character from the current DOS code page (usually CP-437 [32]): follow the same steps as above, but
without the initial 0.
But wait, we wanted to insert a Unicode character, not a Windows-1252 or CP-437 character! Well, Windows will
convert those characters to Unicode/UTF-8 for us if the application expects UTF-8.
In a web form (textbox or textarea) type Alt-0147 to generate one of those pesky smart quotes from the
Windows-1252 character set. If the web page's character encoding is set for UTF-8, Windows should translate the
147 character into the corresponding UTF-8 encoding. (Internally, Windows probably translates the 0147 to UTF-16,
which is then translated into the character set in use by the application. In this scenario, the character set is Unicode,
and Windows-1252 character 147 is translated to its Unicode codepoint equivalent, U+201C.) When the form is
submitted, the character should be sent to the web server UTF-8 encoded as three octets: E2 80 9C — this is what
U+201C looks like when encoded with UTF-8.
If the web page's character encoding is instead set to Windows-1252, the character should be sent as a single octet:
0x93 (which is 147 decimal). If the web page's character encoding is instead set to ISO-8859-1, the character will
also be sent as a single octet, but the value may be either 0x93 or 0x22 (0x22 is the ASCII and ISO-8859-1 quote
character). If the browser uses the superset Windows-1252 encoding when ISO-8859-1 is specified, 0x93 is sent.
Otherwise, the character will be translated to the only quote character officially defined in ISO-8859-1, 0x22.
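The octets described above can be checked from Perl by encoding U+201C with each charset. The helper name hex_octets is an assumption for this sketch. Note that Encode does not mimic the browser behaviour for ISO-8859-1: by default it emits a substitution character for unmappable characters rather than falling back to 0x22.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text with $charset and return the octets in hex.
sub hex_octets {
    my ($charset, $text) = @_;
    return join ' ',
        map { sprintf '%02X', ord($_) } split //, encode($charset, $text);
}

my $left_quote = "\x{201C}";
for my $charset ('UTF-8', 'cp1252', 'iso-8859-1') {
    printf "%-10s %s\n", $charset, hex_octets($charset, $left_quote);
}
```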
Hopefully you see why it is imperative to know which encoding was used for the incoming form/text, so that it can
be decoded properly (as UTF-8 or Windows-1252) in your Perl program.
See also How do I enter ... [33] - Yahoo Answers.

UTF-8 vs utf8
As of Perl 5.8.7, UTF-8 is the strict, official UTF-8. The Encode module will complain if you try to encode or
decode invalid UTF-8, e.g.,

encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

In contrast, utf8 is the liberal, lax version, allowing just about any 4-byte values:

encode("utf8", "\x{FFFF_FFFF}", 1); # okay


encode_utf8("\x{FFFF_FFFF}", 1); # okay

Encode [34] as of version 2.10 knows the difference.


utf8::encode() and utf8::decode() use official UTF-8.
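The difference between the two names can be sketched with a codepoint just past the Unicode maximum of U+10FFFF. The codepoint 0x110000 is an arbitrary choice for illustration. With a CHECK value of FB_CROAK, the strict encoding should refuse it while the lax one produces octets anyway.

```perl
use strict;
use warnings;
no warnings 'utf8';              # silence warnings about non-Unicode codepoints
use Encode qw(encode);

my $beyond = chr(0x110000);      # one past the last valid Unicode codepoint

# Lax "utf8": encodes whatever Perl can represent internally.
my $lax = encode('utf8', $beyond);

# Strict "UTF-8": croaks when asked to with FB_CROAK; LEAVE_SRC protects $beyond.
my $strict = eval {
    encode('UTF-8', $beyond, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};

printf "lax utf8:     %d octets\n", length $lax;
printf "strict UTF-8: %s\n", defined $strict ? 'accepted' : 'refused';
```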

Encode Module vs Built-in/Core utf8::


To decode and encode UTF-8, you can use the Encode [17] module or the functions defined in the utf8:: [35]
package by the Perl core. The Encode module is more flexible, allowing different ways of handling malformed data.
However, the utf8:: package can do some different tricks.
You should be aware of a bug [36] in the Encode module: whenever text is decoded using the Encode module, the
UTF8 flag is always turned on. The documentation would lead you to believe that the UTF8 flag is off if the text
only contains ASCII characters and you are decoding UTF-8. This is not what happens — the flag is always turned
on, as the table below depicts.
There are performance gains to be had if the UTF8 flag can be kept off after decoding (and this is fine if the text only
contains ASCII octets). Use utf8::decode() to obtain this efficiency, since it does not turn the flag on if the
octet sequence only contains ASCII octets. (This is the decode function I normally use.)
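The flag behaviour can be inspected with utf8::is_utf8(). In this sketch, flag_state is a made-up helper for display. The result printed for Encode::decode() may depend on your Encode version, so it is shown only for comparison.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report the UTF8 flag as 'on' or 'off'.
sub flag_state {
    my ($string) = @_;
    return utf8::is_utf8($string) ? 'on' : 'off';
}

my $ascii = 'plain ASCII';
utf8::decode($ascii);                  # only ASCII octets: flag stays off

my $accented = "caf\xC3\xA9";          # contains a two-octet UTF-8 sequence
utf8::decode($accented);               # multi-byte character: flag turned on

my $via_encode = Encode::decode('UTF-8', 'plain ASCII');

printf "utf8::decode, ASCII only:   %s\n", flag_state($ascii);
printf "utf8::decode, non-ASCII:    %s\n", flag_state($accented);
printf "Encode::decode, ASCII only: %s\n", flag_state($via_encode);
```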
See Encode's documentation [17] for the CHECK options that appear below, which relate to how the module handles
malformed data.

UTF-8 Functions
Function | UTF8 flag | Description / Notes
--- | --- | ---
$flag = utf8::is_utf8($string); | N/A | Tests whether $string is internally encoded as UTF-8. Returns false if not; otherwise returns true.
$flag = utf8::decode($utf8_octets); | depends | Attempts to convert in-place the UTF-8 octet sequence into the corresponding N8CS or UTF-8 string, as appropriate. If $utf8_octets contains non-ASCII octets (i.e., multi-byte UTF-8 encoded characters), the UTF8 flag is turned on, and the resulting string is UTF-8. Otherwise, the UTF8 flag remains off, and the resulting string is N8CS. This is the only decode function that may result in an N8CS byte string. Returns false if $utf8_octets is not UTF-8 encoded properly; otherwise returns true.
$utf8_string = decode('UTF-8', $utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Strict, official UTF-8 decoding rules (see previous section for discussion) are followed.
$utf8_string = decode('utf8', $utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Lax, liberal decoding rules (see previous section for discussion) are followed.
$utf8_string = decode_utf8($utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Equivalent to decode("utf8", $utf8_octets), hence lax decoding is employed.
$octet_count = utf8::upgrade($n8cs_string); | turned on | Converts in-place the N8CS byte string into the corresponding UTF-8 character string. Returns the number of octets now used to represent the string internally as UTF-8. This function should be used to convert N8CS byte strings with characters in the 0x80-0xFF range to UTF-8, thereby avoiding the Perl 5 "Unicode Bug".
utf8::encode($string) | turned off | Converts in-place the N8CS or UTF-8 $string into a UTF-8 octet sequence.
$utf8_octets = encode('UTF-8', $string [, CHECK]) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Strict, official UTF-8 encoding rules (see previous section for discussion) are followed.
$utf8_octets = encode('utf8', $string) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Lax, liberal UTF-8 encoding rules (see previous section for discussion) are followed. Since all possible characters have a lax utf8 representation, this function cannot fail.
$utf8_octets = encode_utf8($string) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Equivalent to encode("utf8", $string), hence lax encoding is employed. Since all possible characters have a lax utf8 representation, this function cannot fail.
$flag = utf8::downgrade($utf8_string [, FAIL_OK]); | turned off | Converts in-place the UTF-8 character string to the equivalent N8CS byte string. Fails if $utf8_string cannot be represented in N8CS encoding. On failure dies, unless FAIL_OK is true, then returns false. Returns true on success.
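A short sketch of utf8::upgrade() and utf8::downgrade() in action. The sample strings are arbitrary. Note that upgrading changes only the internal representation; the string still compares equal to the original.

```perl
use strict;
use warnings;

my $native = "caf\xE9";                 # N8CS byte string, UTF8 flag off

my $upgraded    = $native;
my $octet_count = utf8::upgrade($upgraded);   # internal UTF-8, flag on

my $back = $upgraded;
utf8::downgrade($back);                 # back to N8CS, flag off

my $wide      = "\x{263a}";             # cannot be represented in N8CS
my $wide_okay = utf8::downgrade($wide, 1);    # FAIL_OK: returns false, no die

printf "octets after upgrade: %d\n", $octet_count;
printf "still equal: %s\n", $upgraded eq $native ? 'yes' : 'no';
printf "wide downgrade succeeded: %s\n", $wide_okay ? 'yes' : 'no';
```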

Perl Character encodings


To determine which character encodings your Perl supports:
perl -MEncode -le "print for Encode->encodings(':all')"
It is important to remember that Perl only uses two character encodings internally: native/byte and UTF-8/character.
Any characters encoded with something other than N8CS, the platform's native 8-bit character set (often
ISO-8859-1/Latin-1), must be decoded as they enter Perl.
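Besides listing every encoding, you can ask Encode whether it recognizes a particular label and what canonical name it maps to. The helper name canonical_name and the list of labels are assumptions for this sketch.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return Encode's canonical name for a label, or undef.
sub canonical_name {
    my ($label) = @_;
    my $encoding = find_encoding($label);
    return $encoding ? $encoding->name : undef;
}

for my $label (qw(latin1 UTF-8 utf8 Windows-1252 no-such-encoding)) {
    my $name = canonical_name($label);
    printf "%-18s => %s\n", $label, defined $name ? $name : '(not supported)';
}
```

The canonical names also show the strict versus lax distinction: UTF-8 and utf8 resolve to different encoding objects.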

What does Website "x" Use?


View a page, then in your browser, View->Character Encoding to see which encoding was selected. Also look at the
HTML source and see if the meta tag is present:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
You can also see what Content-Type header is being returned using:
$ lwp-request -de www.bing.com | grep Content
This wiki uses UTF-8.
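Once you have a Content-Type value, whether from the HTTP header or the meta tag, the charset can be pulled out with a small regex. This is a simplified sketch with a made-up helper name, charset_from_content_type; it does not handle every form the header syntax allows.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a Content-Type value.
sub charset_from_content_type {
    my ($value) = @_;
    return undef unless defined $value;
    return $value =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i ? $1 : undef;
}

for my $value ('text/html; charset=UTF-8', 'text/html') {
    my $charset = charset_from_content_type($value);
    printf "%-26s => %s\n", $value, defined $charset ? $charset : '(none given)';
}
```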

HTML Character Entities


In your UTF-8 travels, you may come across HTML Character Entities. Starting with HTML 4.0, 252 character
entities [37] are supported. Each of these has a Unicode codepoint and an entity name. Either can be used in HTML
markup. For example, the registered sign can be represented in HTML as either &#174; or &reg;
Many fonts support this set of characters, and if the set is sufficient for your application, UTF-8 may not be required,
but your application will need to use the HTML entity encoding wherever a special character is needed.
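If you take that route, non-ASCII characters can be converted to numeric entities with a substitution. The sketch below uses an invented helper name, to_numeric_entities, and assumes its input is a decoded character string, not raw UTF-8 octets.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with &#N; notation.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

print to_numeric_entities("Acme\x{AE}"), "\n";
print to_numeric_entities("\x{201C}quoted\x{201D}"), "\n";
```

The output is pure ASCII, so it displays correctly regardless of the character encoding the page is served with.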
Operating Systems and Unicode


It is interesting to note which Unicode encoding popular Operating Systems use. From Wikipedia [38]: "Windows NT
(and its descendants, Windows 2000, Windows XP, Windows Vista and Windows 7), uses UTF-16 as the sole
internal character encoding. The Java and .NET bytecode environments, Mac OS X, and KDE also use it for internal
representation. UTF-8 has become the main storage encoding on most Unix-like operating systems (though others
are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character
sets."

References
• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and
Character Sets (No Excuses!) [39] - by Joel Spolsky
• FMTYEWTK about Characters vs Bytes [40] - Perlmonks
• CGI::Application and UTF-8 Form Processing example [41] - by Mark Rajcok
• Perl Unicode tutorial [42]
• Perl Unicode FAQ [43]
• Perl utf8 pragma [26]
• Perl Encode module [17] - handles all character encoding and decoding
• Unicode [1] - Wikipedia
• Perl Unicode introduction [44]
• Unicode support in Perl [45]
• Unicode::Semantics [22] - work around the Perl 5 Unicode bug
• There are many Unicode::xxx modules [46] on CPAN
• UTF-8 round trip with MySQL [47] - Perlmonks
• CGI::Application - Which is the proper way of handling and outputting utf8 [48] - Perlmonks
• Understanding CGI.pm and UTF-8 handling [9] - Perlmonks
• UTF-8 and Unicode FAQ for Unix/Linux [49]
• Perl Unicode Mailing List <perl-unicode@perl.org>

Footnotes
^ - N8CS is a term that was coined for this document. Do not expect to see this term used elsewhere.

References
[1] http://en.wikipedia.org/wiki/Unicode
[2] http://en.wikipedia.org/wiki/ISO_8859
[3] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Character_properties
[4] http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
[5] http://en.wikipedia.org/wiki/UTF-8
[6] http://en.wikipedia.org/wiki/ASCII
[7] http://search.cpan.org/perldoc?perlunicode#Speed
[8] http://en.wikipedia.org/wiki/Perl_programming%2Funicode_utf-8#endnote_N8CS
[9] http://www.perlmonks.org/?node_id=626470
[10] http://search.cpan.org/dist/Template-Toolkit/lib/Template/FAQ.pod#Why_do_I_get_rubbish_for_my_utf-8_templates?
[11] http://en.wikipedia.org/wiki/Byte_Order_Mark
[12] http://search.cpan.org/perldoc?HTML::Template
[13] https://rt.cpan.org/Public/Bug/Display.html?id=30586
[14] http://sourceforge.net/mailarchive/forum.php?thread_name=4607245C.8030702%40netratings.com.au&forum_name=html-template-users
[15] http://search.cpan.org/perldoc?CGI
[16] http://www.perlmonks.org/?node_id=651574
[17] http://search.cpan.org/perldoc?Encode
[18] http://www.perlfoundation.org/perl5/index.cgi?the_utf8_perlio_layer
[19] http://search.cpan.org/perldoc?DBD::mysql#DATABASE_HANDLES
[20] http://search.cpan.org/perldoc?DBD::Pg#pg_enable_utf8_(boolean)
[21] http://www.postgresql.org/docs/8.4/interactive/multibyte.html#AEN29751
[22] http://search.cpan.org/perldoc?Unicode::Semantics
[23] http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22
[24] http://www.mail-archive.com/cgiapp@lists.erlbaum.net/msg08043.html
[25] http://cgi-app.org/index.cgi
[26] http://search.cpan.org/perldoc?utf8
[27] http://search.cpan.org/perldoc?perldiag#Wide_character_in_%s
[28] http://www.personal.psu.edu/ejp10/blogs/gotunicode/2009/02/when-apache-and-utf-8-fight.html
[29] http://en.wikipedia.org/wiki/Windows-1252
[30] http://www.microsoft.com/typography/fonts/font.aspx?FMID=1081
[31] http://en.wikipedia.org/wiki/Font_substitution
[32] http://en.wikipedia.org/wiki/Code_page_437
[33] http://answers.yahoo.com/question/index?qid=20081226081225AA2NMGi
[34] http://search.cpan.org/perldoc?Encode#UTF-8_vs._utf8_vs._UTF8
[35] http://perldoc.perl.org/utf8.html
[36] https://rt.cpan.org/Ticket/Display.html?id=34259
[37] http://www.alanwood.net/demos/ent4_frame.html
[38] http://en.wikipedia.org/wiki/Unicode#Operating_systems
[39] http://joelonsoftware.com/articles/Unicode.html
[40] http://perlmonks.org/?node_id=330567
[41] http://cgi-app.org/index.cgi?Utf8Example
[42] http://search.cpan.org/perldoc?perlunitut
[43] http://search.cpan.org/perldoc?perlunifaq
[44] http://search.cpan.org/perldoc?perluniintro
[45] http://search.cpan.org/perldoc?perlunicode
[46] http://search.cpan.org/search?query=unicode
[47] http://www.perlmonks.org/index.pl?node_id=620803
[48] http://www.perlmonks.org/?node_id=651403
[49] http://www.cl.cam.ac.uk/~mgk25/unicode.html
Article Sources and Contributors


Perl Programming/Unicode UTF-8  Source: http://en.wikibooks.org/w/index.php?oldid=1957841  Contributors: Adrignola, Mrajcok, 5 anonymous edits

License
Creative Commons Attribution-Share Alike 3.0 Unported
http:/ / creativecommons. org/ licenses/ by-sa/ 3. 0/
