Unicode
Unicode [1] is a standard that specifies all of the characters for most of the world's writing systems. Each character is
assigned a unique codepoint, such as U+0030. The first 256 code points are the same as ISO-8859-1 [2] to make it
trivial to convert existing Western/Latin-1 text.
To view properties [3] for a particular codepoint:
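One way to do this from Perl is with the core Unicode::UCD module. The sketch below assumes its charinfo() function, and uses U+263A purely as an illustrative codepoint:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# U+263A is only an example codepoint; substitute the one you care about.
my $info = charinfo(0x263A);

# charinfo() returns a hash reference of properties for the codepoint.
printf "U+%s %s (category %s)\n",
    $info->{code}, $info->{name}, $info->{category};
```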
If you view the Unicode character reference [4], you will notice that not every codepoint has an assigned character.
Also, because of backward compatibility with legacy encodings, some characters have multiple codepoints.
UTF-8
UTF-8 [5] is a specific encoding of Unicode — the most popular encoding. Other encodings include UTF-7, UTF-16,
UTF-32, etc. You will probably want to use UTF-8, if you decide to use Unicode.
An encoding defines how each Unicode codepoint maps to bits and bytes. In UTF-8 encoding, the first 128 Unicode
codepoints use one byte. These byte values are the same as US-ASCII [6], making UTF-8 encoding and ASCII
encoding interchangeable if only ASCII characters are used. The next 1,920 codepoints use 2-byte encoding in
UTF-8. Three or four bytes are needed to encode the remaining codepoints.
Note that although Unicode codepoints 128-255 are the same as ISO-8859-1, UTF-8 encodes each of these
codepoints differently. UTF-8 uses two bytes to encode each of these codepoints, whereas ISO-8859-1 only uses one
byte for each character in that range. Therefore, ISO-8859-1 and UTF-8 are not interchangeable. (If only ASCII
characters are used, then they are all interchangeable, since ASCII, ISO-8859-1, and UTF-8 all share the same
encoding for the first 128 Unicode codepoints.)
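The byte counts described above can be checked with the Encode module. This is a minimal sketch; the helper name utf8_octet_count and the sample codepoints are chosen here for illustration only:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of octets UTF-8 needs for one codepoint.
sub utf8_octet_count {
    my ($codepoint) = @_;
    return length encode('UTF-8', chr $codepoint);
}

# One sample codepoint from each size class.
for my $codepoint (0x41, 0xF1, 0x263A, 0x1F600) {
    printf "U+%04X: %d octet(s) in UTF-8\n",
        $codepoint, utf8_octet_count($codepoint);
}

# U+00F1 needs only one octet in ISO-8859-1, but two in UTF-8.
my $latin1_octets = encode('ISO-8859-1', chr 0xF1);
printf "U+00F1: %d octet(s) in ISO-8859-1\n", length $latin1_octets;
```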
So, to reiterate, with UTF-8, not all characters are encoded into a single byte (unlike ASCII and ISO-8859-1). Think
about that for a moment... how might that affect editors (like Notepad), web pages and forms, databases, Perl itself,
Perl IO, your Perl source code (if you want to include a character with a multi-byte encoding)? How might that affect
passing strings around, if the strings contain characters with multi-byte encodings? Do regular expressions still
work?
Perl Programming/Unicode UTF-8 2
As you can see from the table above, codepoints 128-255 (0x80-0xff) are where you need to be careful. Later, you
will find out that codepoints 128-159 (0x80-0x9F) are even trickier, due to the fact that the popular Windows-1252
character set (another one-byte-per-character encoding) is incompatible with ISO-8859-1 in this range.
Do I need UTF-8?
If your web application will only ever need to use one language (or one character encoding), then you may not need
UTF-8. However, if maybe someday your application will need to handle multiple languages, it is better to start
using UTF-8 now, rather than later. Reason? It is easier to use UTF-8 everywhere, rather than in just a few places
(more about this below). And it is more difficult to convert all of the pieces later.
Terminology
A character is a logical entity. Characters must be encoded (using a character set) in order to be used, stored,
written, exchanged between programs, etc. Encoding turns a logical character into something we can use in a program.
Depending on which character set is used for encoding, a single character may require one or more bytes to
represent it.
We'll use the term octets when referring to data passing into or out of a Perl program. An octet is a byte, 8 bits.
Encoded characters make up an octet stream. When an octet stream comes into Perl, the bytes should be decoded
(using the correct character set -- the character set they were encoded with) so that Perl can determine which logical
characters are contained in the encoded octet stream. Perl can then store these as strings -- a sequence of characters.
Binary data also comes in as an octet stream. It should not be decoded using a character set, because it likely either
doesn't contain any characters, or it contains information in addition to characters, and hence cannot be decoded with
a character set.
Perl strings/text
Internally, Perl stores each string in one of the following encodings:
• native encoding — byte encoding. It uses N8CS[8], the native 8-bit character set of the platform (often
ISO-8859-1/Latin-1). This is a one-byte-per-character encoding, and hence a maximum of only 256 characters
can be encoded. This is the default encoding for all incoming text/octets if Perl is not instructed to decode (bad
idea). Strings using this encoding are called byte strings or binary strings.
• UTF-8 encoding — character encoding. It uses (obviously) UTF-8. Strings using this encoding are called
character strings or text strings or Unicode strings.
When creating your own strings, Perl uses N8CS when possible (for backwards compatibility and efficiency
reasons). However, if a character cannot be represented in N8CS, UTF-8 is used. In other words, if all code points in
a string are <= 0xFF, N8CS is used, otherwise UTF-8 is used.
$native_string = "\xf1";
$native_string = "\x{00f1}"; # still N8CS, since <= 0xff
$native_string = chr(0xf1); # still N8CS, since <= 0xff
$utf8_string = "\x{0100}";
Your program can have a mix of strings in both of Perl's internal formats. Perl uses a "UTF8 flag" to keep track of
which encoding a string is internally using. Thankfully, the format/flag follows the string. Perl keeps a string in
N8CS as long as possible. However, when a N8CS/native string is used together with a UTF-8 string, the native
string is silently implicitly decoded using N8CS, and upgraded (encoded) to UTF-8. In other words, the native byte
string gets decoded with the native character set, and then it gets internally encoded into UTF-8. The resulting
character string will have the UTF8 flag set.
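The implicit upgrade can be observed with utf8::is_utf8(). A minimal sketch, with variable names chosen here for illustration:

```perl
use strict;
use warnings;

my $native = "\xf1";       # stored in N8CS, UTF8 flag off
my $wide   = "\x{263a}";   # cannot fit in N8CS, UTF8 flag on

# Combining the two yields a UTF-8 character string.
my $joined = $native . $wide;

printf "native flag: %d\n", utf8::is_utf8($native) ? 1 : 0;
printf "wide flag:   %d\n", utf8::is_utf8($wide)   ? 1 : 0;
printf "joined flag: %d, length %d characters\n",
    utf8::is_utf8($joined) ? 1 : 0, length $joined;
```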
Normally, you should not need to know about how Perl is internally storing/encoding text. An exception to this rule
is if you have a natively encoded string with bytes in the 0x80-0xFF range — in other words, a natively encoded
string with non-ASCII characters. In this case, some string operations may not work as expected — see Perl 5
"Unicode Bug". Normally, you should upgrade these N8CS byte strings to UTF-8 character strings using
utf8::upgrade(). String operations will then work as expected, although sometimes slower.
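The following sketch illustrates the bug and the utf8::upgrade() fix. It assumes Perl v5.12 or later, and explicitly switches off the unicode_strings feature so that the older behavior is visible:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the legacy semantics apply

my $string = "\xe0";            # a-grave, stored in N8CS

# Native string: 0xE0 is not treated as a word character.
my $matched_before = $string =~ /\w/ ? 1 : 0;

utf8::upgrade($string);         # same character, now stored as UTF-8

# Upgraded string: Unicode semantics apply, so 0xE0 is a word character.
my $matched_after = $string =~ /\w/ ? 1 : 0;

print "before upgrade: $matched_before, after upgrade: $matched_after\n";
```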
UTF-8 Flow
Any Perl IO needs to correctly handle decoding and encoding of strings/text. Since there are multiple character
encodings in use in the world, Perl can't correctly guess which character encoding was used to encode some
particular incoming text/octets, nor can it know which character encoding you want to use for outgoing text/octets.
An incoming stream of UTF-8 octets is not the same as, say, an incoming stream of Windows-1252 octets. For
example, Unicode character U+201c (left double quotation mark) is encoded in one byte in Windows-1252 (0x93),
but UTF-8 encodes it using three octets (0xE2 0x80 0x9C). If you want Perl to interpret your incoming text/octets
correctly, you must tell Perl which character set was used to encode them, so they can be decoded properly.
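As a sketch of the example above, the Encode module can decode both octet sequences to the same character, provided the right character set is named. The variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, U+201C, as it arrives in two different encodings.
my $from_cp1252 = decode('cp1252', "\x93");
my $from_utf8   = decode('UTF-8',  "\xE2\x80\x9C");

# Naming the wrong character set gives a different character entirely:
# in ISO-8859-1, octet 0x93 is a control character.
my $wrong_guess = decode('ISO-8859-1', "\x93");

printf "cp1252: U+%04X, UTF-8: U+%04X, wrong guess: U+%04X\n",
    ord $from_cp1252, ord $from_utf8, ord $wrong_guess;
```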
The typical flow of UTF-8 text/octets in to and out of a Perl program is as follows:
1. Receive an external UTF-8 encoded text/octet stream and correctly decode it — i.e., tell Perl which character set
the octets are encoded in (in this case, the encoding is UTF-8). Perl may check for malformed data (bad encoding)
while decoding, depending on which decoding method you select. Perl stores the string internally as N8CS or
UTF-8, depending on which decoding method you select, and what characters are found to be in the octet stream.
(Normally, the string will be internally stored as UTF-8.)
2. Process the string as you normally would.
3. Encode the string into a UTF-8 encoded octet stream and output it.
Another important point to make here: you need to know which encoding was used for each input text. Do not guess,
do not assume.
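The three steps can be sketched as a single function. The name uppercase_utf8 is hypothetical, and uppercasing stands in for whatever processing your program does:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper showing the decode / process / encode flow.
sub uppercase_utf8 {
    my ($octets_in) = @_;

    # 1. Decode, dying on malformed UTF-8 instead of guessing.
    my $text = decode('UTF-8', $octets_in,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # 2. Process the character string as usual.
    my $processed = uc $text;

    # 3. Encode back into UTF-8 octets for output.
    return encode('UTF-8', $processed);
}

my $octets_out = uppercase_utf8("\xC3\xB1");   # n-tilde as UTF-8 octets
printf "%vX\n", $octets_out;
```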
Do not use the :utf8 layer, since it does not check that your incoming text is valid UTF-8; it simply marks it as
UTF-8 (see Perlmonks [9]).
package CGI::as_utf8;
BEGIN {
    use strict;
    use warnings;
    use CGI 3.47; # earlier versions have a UTF-8 double-decoding bug
    {
        no warnings 'redefine';
        my $param_org = \&CGI::param;
        my $might_decode = sub {
            my $p = shift;
            # make sure upload() filehandles are not modified
            return $p if !$p || ( ref $p && fileno($p) );
            utf8::decode($p); # returns false (leaving $p unchanged) if $p is not valid UTF-8
            $p
        };
        *CGI::param = sub {
            # setting a param goes through the original interface
            goto &$param_org if scalar @_ != 2;
            my ($q, $p) = @_; # assume object calls always
            return wantarray
                ? map { $might_decode->($_) } $q->$param_org($p)
                : $might_decode->( $q->$param_org($p) );
        }
    }
}
1;
---
use CGI::as_utf8; # put this line in your app, e.g., in your CGI::Application module(s)
The above is rhesa's solution [16] with a slight modification — utf8::decode() is used instead of Encode's [17]
decode_utf8(), as it is more efficient when only ASCII characters are involved (since the UTF8 flag is not set).
Note that the module assumes that web pages and forms are always UTF-8 encoded, and that the OO interface of
CGI.pm is always used.
Note, browsers should encode form data in the same character encoding that was used to display the form. So, if you
are sending UTF-8 forms, you should get UTF-8 encoded data back for text fields. You should not have to use
accept-charset in your HTML markup.
Input - STDIN
When a web form is POSTed, form data comes into Perl via STDIN. If you are using CGI.pm, text form data is
available via CGI.pm's param() method, and the previous section describes how to properly handle UTF-8 encoded
text form data.
If you don't have any file uploads (i.e., all of your data is text), then instead of the CGI::as_utf8 module, you could
add the following line of code to the beginning of your script to cause all data received on STDIN (i.e., all POSTed
form data) to be automatically decoded as UTF-8:
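A minimal sketch of such a line, assuming the strict :encoding(UTF-8) layer is the one intended:

```perl
use strict;
use warnings;

# Decode everything read from STDIN as UTF-8, checking validity as it goes.
my $layer_applied = binmode STDIN, ':encoding(UTF-8)';
warn "could not set the UTF-8 layer on STDIN\n" unless $layer_applied;
```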
Do not use
binmode STDIN, ":utf8"; # do NOT use this!
since it does not check that your incoming text is valid UTF-8; it simply marks it as UTF-8 (see Perl 5 Wiki [18]).
The approach in the previous section is preferred, since it will "do the right thing" if there is any binary form data
(file uploads).
If you are writing some other (non-CGI) program that receives data on STDIN, decode appropriately:
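For example, assuming the input is known to be UTF-8, each line can be decoded explicitly as it is read. The helper name read_decoded_lines is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: reads lines of UTF-8 octets from a filehandle and
# returns them as decoded character strings. Dies on malformed UTF-8.
sub read_decoded_lines {
    my ($fh) = @_;
    my @lines;
    while (my $octets = <$fh>) {
        push @lines, decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    }
    return @lines;
}

# Typical call: my @lines = read_decoded_lines(\*STDIN);
```

If the input uses some other encoding, pass that encoding's name to decode() instead of 'UTF-8'.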
Input - Database
In the "use UTF-8 everywhere" model, configure your database to store values in UTF-8.
When reading data from a UTF-8 database, ensure incoming UTF-8 encoded string field data is UTF-8 decoded, but
do not decode incoming binary field data.
Input - MySQL
With MySQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the
mysql_enable_utf8 database handle attribute [19]:
use DBI();
my $dbh = DBI->connect('dbi:mysql:test_db', $username, $password,
    {mysql_enable_utf8 => 1}
);
This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field
data — the driver will do that for you. If the incoming data for a field only contains ASCII octets, the UTF8 flag is
not set for that field (so it appears to be using utf8::decode()). The driver is also smart enough to not decode
binary data.
Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the
system default.
Input - PostgreSQL
With PostgreSQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the
pg_enable_utf8 database handle attribute [20]:
use DBI();
my $dbh = DBI->connect('dbi:Pg:dbname=test_db', $username, $password,
    {pg_enable_utf8 => 1}
);
This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field
data — the DBD::Pg driver will do that for you. The driver is also smart enough to not decode binary data.
You may (TBD: when?) also need to tell PostgreSQL to use UTF-8 when sending data out of the database:
SET CLIENT_ENCODING TO 'UTF8';
or
SET NAMES 'UTF8';
For example, with Rose::DB:
__PACKAGE__->register_db(
    domain => 'development',
    ...
    connect_options => {
        pg_server_prepare => 0,
        pg_enable_utf8 => 1,
    },
    post_connect_sql => "SET CLIENT_ENCODING TO 'UTF8';",
);
See Automatic Character Set Conversion Between Server and Client [21]
2. Processing strings
Once all incoming strings have been decoded into UTF-8 internally, you can process your text as normal. Regular
expressions will work (if using Perl v5.8 or higher).
If you create any strings in your source code that contain non-ASCII characters (characters above 0x7f), ensure you
upgrade them to internal UTF-8 encoding:
use Encode;
# Suppose $windows1252_octets contains text from an external input, and it
# contains the character "\xE0" (0xE0 = à). String $windows1252_octets will
# exhibit the Unicode bug -- it won't match /\w/.
my $utf8_string = decode('cp1252', $windows1252_octets); # no Unicode bug, $utf8_string matches /\w/
Note that with internal UTF-8 encoding, \w represents a much, much larger set of characters, so regex operations
will be slower (vs. native encoding). TBD: what is the actual performance degradation? What is the character set for
\w with Unicode semantics?
See also Unicode::Semantics [22]. This issue should be fixed in Perl v5.12.
4/19/10 update: v5.12 is now available, and the "case changing component" has been fixed: "Perl 5.12 now bundles
Unicode 5.2. The “feature” pragma now supports the new “unicode_strings” feature:
This will turn on Unicode semantics for all case changing operations on strings, regardless of how they are currently
encoded internally." Read more [23].
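The effect of the unicode_strings feature on case changing can be sketched as follows, assuming Perl v5.12 or later. The variable names are illustrative:

```perl
use strict;
use warnings;

my $native = "\xe0";   # a-grave, stored in N8CS
my ($without_feature, $with_feature);

{
    no feature 'unicode_strings';
    $without_feature = uc $native;   # left as 0xE0: only ASCII is case-mapped
}
{
    use feature 'unicode_strings';
    $with_feature = uc $native;      # 0xC0: Unicode case mapping applies
}

printf "without: 0x%02X, with: 0x%02X\n",
    ord $without_feature, ord $with_feature;
```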
Output - STDOUT
To ensure all output going back to the web browser (i.e., STDOUT) is UTF8-encoded, add the following near the top
of your Perl script:
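One common way to do this, assuming the strict :encoding(UTF-8) layer is acceptable for your output:

```perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

# Character strings can now be printed directly.
print "Content-Type: text/html; charset=UTF-8\n\n";
print "\x{263a}\n";
```

With this layer in place, do not also call an encode function on the page, or the output will be encoded twice.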
If you want to be a little more efficient (but not follow "best practice"), you can opt to only encode the outgoing page
if it is flagged as UTF-8:
if (utf8::is_utf8($page)) {
    utf8::encode($page);
}
# else, $page is natively encoded, so skip encoding for output
Here is a snippet [24] that can be used with the CGI::Application [25] framework:
__PACKAGE__->add_callback('postrun', sub {
    my $self = shift;
    # Make sure the output is utf8 encoded if it needs it
    if ($_[0] && ${$_[0]} && utf8::is_utf8(${$_[0]})) {
        utf8::encode( ${$_[0]} );
        # ${$_[0]} .= 'utf8::encode() called'; # useful for debugging
    }
});
The above code should be put into CGI::Application base class(es). Optionally, the code can be added to
cgiapp_postrun().
Note that all of the above encoding techniques will only work properly if all of the input UTF-8 octets were properly
decoded.
Output - Database
As mentioned above, in the "use UTF-8 everywhere" model, configure your database to store values in UTF-8.
When writing data to a UTF-8 database (INSERT, UPDATE, etc.), ensure your UTF-8 strings get UTF-8 encoded
before being written to the database. Do not encode binary field data.
Output - MySQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the
mysql_enable_utf8 database handle attribute [19]. This means you should not call utf8::encode() (or
any other UTF-8 encode function) on your strings when using this attribute — the driver will do that for you. The
driver is also smart enough to not encode binary data.
Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the
system default.
Output - PostgreSQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the
pg_enable_utf8 database handle attribute [20]. This means you should not call utf8::encode() (or any
other UTF-8 encode function) on your strings when using this attribute — the DBD::Pg driver will do that for you.
The driver is also smart enough to not encode binary data.
You may (TBD: when?) also need to tell PostgreSQL to expect UTF-8 coming into the database:
SET CLIENT_ENCODING TO 'UTF8';
or
SET NAMES 'UTF8';
See Automatic Character Set Conversion Between Server and Client [21]
my $smiley = "\x{263a}";
or
my $smiley = chr(0x263a);
utf8::upgrade($smiley); # convert to internal UTF-8 encoding
If you have a lot of Unicode characters, or you prefer to save your source code in UTF-8, then you need to tell Perl
that your source code is UTF-8 encoded. Do this by adding the following line to your source code:
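The line in question is presumably the utf8 pragma. A minimal sketch, assuming this file is saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;   # tells Perl that this source file is UTF-8 encoded

# The literal below is a single character, not three separate octets.
my $smiley = "☺";

printf "length %d, codepoint U+%04X\n", length $smiley, ord $smiley;
```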
This is the only reason your program should ever have the above line -- see utf8 [26].
If your source code is UTF-8 encoded, make sure your editor supports reading, editing, and writing in UTF-8!
Gotchas
Often you may not notice Unicode issues until characters with codepoints of 128 or above are used. This is because
ASCII, ISO-8859-1, Windows-1252 and UTF-8 are all encoded with the same one-byte values for the first 128
Unicode codepoints. To give your application a good Unicode test, try a character in the 0x80 - 0x9F (128-159)
range, and a character above 0xFF (255).
ISO-8859-1 vs Windows-1252
Since you are learning about character encodings, you need to be aware of the difference between ISO-8859-1 and
Windows-1252. From Windows-1252 [29]: "The Windows-1252 encoding is a superset of ISO-8859-1, but differs
from ISO-8859-1 by using displayable characters rather than control characters in the 0x80 to 0x9F (128 - 159)
range. It ... also contains all the printable characters that are in ISO-8859-15 (though some are mapped to different
code points). It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web
browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to
accommodate such mislabeling... the draft HTML 5 specification requires that documents advertised as ISO-8859-1
actually be parsed with the Windows-1252 encoding" since it is a superset of ISO-8859-1.
Here's a fun program to try:
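A minimal sketch of such a program, assuming it simply writes each octet in the 0x80-0x9F range to the terminal without any encoding:

```perl
use strict;
use warnings;

binmode STDOUT;   # write raw octets, with no encoding layer

# Assumption: one raw octet for each value from 0x80 to 0x9F.
my $octets = join '', map { chr } 0x80 .. 0x9F;

print "$octets\n";
```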
What do you see? Do you see the Windows-1252 characters, no characters, square boxes? If you are using PuTTY,
Change Settings... Window, Translation and try selecting ISO-8859-1 or Windows-1252 and run the program again.
your computer, or the Unicode font does not have a glyph for that particular character.
Strange characters: â€˜ â€™ â€œ â€ â€¢ â€“ â€”
These are the individual characters that correspond to the multi-byte UTF-8 encodings for the following
Windows-1252 characters:
‘’“”•–—
which are in the nebulous 0x80-0x9F (128-159) range. Usually, these characters appear because the HTML data is
UTF-8 encoded, but the browser was instructed to use ISO-8859-1 or Windows-1252. In your browser, try changing
the encoding to UTF-8 and see if that resolves the problem. If that doesn't resolve the problem, or if the encoding is
already set to UTF-8, there may be a double encoding problem somewhere.
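The effect can be reproduced in Perl by encoding a character as UTF-8 and then decoding the octets as Windows-1252, which is roughly what a misconfigured browser does. A minimal sketch with illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $quote  = "\x{2018}";                # left single quotation mark
my $octets = encode('UTF-8', $quote);   # three octets: E2 80 98

# Misread those octets as Windows-1252: one character per octet.
my $misread = decode('cp1252', $octets);

# Print the codepoints rather than the characters themselves.
printf "U+%04X ", ord($_) for split //, $misread;
print "\n";
```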
Strange characters: â â â â ⢠â â
These also correspond to some of the characters in the nebulous 0x80-0x9F (128-159) range. If you see the above
sequences, it is likely that you forgot to decode incoming UTF-8 data (such as form data submitted from a UTF-8
encoded HTML form) in your Perl program and then you UTF-8 encoded it for output — a natively encoded string
was UTF-8 encoded (not good). Fix the problem by calling utf8::decode() on the incoming UTF-8 encoded
data.
Double Encoding
If you don't decode UTF-8 text/octets, Perl will assume they are encoded with N8CS (often ISO-8859-1/Latin-1).
This means that the individual octets of a multi-byte UTF-8 character are seen as separate characters (not good). If
these separate characters are later encoded to UTF-8 for output, a "double encoding" results. This is similar to
HTML double encoding — e.g., &gt; instead of >.
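A minimal sketch of double encoding and its fix, using the Encode module and illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+201C arrives from a form as three UTF-8 octets.
my $octets_in = "\xE2\x80\x9C";

# Mistake: encoding without decoding first. Each octet is treated as a
# separate character, so the output grows to six octets.
my $double_encoded = encode('UTF-8', $octets_in);

# Fix: decode on the way in, then encode on the way out.
my $text    = decode('UTF-8', $octets_in);
my $correct = encode('UTF-8', $text);

printf "double: %d octets, correct: %d octets\n",
    length $double_encoded, length $correct;
```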
Misc
In Perl
my $utf8_char = "\x{263a}"; # for codepoints above 0xFF
$utf8_char =~ /\x{263a}/; # same syntax for regex
my $cloud_char = chr(0x2601); # run-time, ord() does the reverse
If your Perl source code file is in UTF-8 format, you can enter the Unicode characters directly:
In Web Forms
On Windows:
• To insert a character from the Windows-1252 codepage [29]: set the Num Lock key on, hold down Alt, then using
the numeric keypad, type 0 followed by the decimal value of the character you want.
• To insert a character from the current DOS code page (usually CP-437 [32]): follow the same steps as above, but
without the initial 0.
But wait, we wanted to insert a Unicode character, not a Windows-1252 or CP-437 character! Well, Windows will
convert those characters to Unicode/UTF-8 for us if the application expects UTF-8.
In a web form (textbox or textarea) type Alt-0147 to generate one of those pesky smart quotes from the
Windows-1252 character set. If the web page's character encoding is set for UTF-8, Windows should translate the
147 character into the corresponding UTF-8 encoding. (Internally, Windows probably translates the 0147 to UTF-16,
which is then translated into the character set in use by the application. In this scenario, the character set is Unicode,
and Windows-1252 character 147 is translated to its Unicode codepoint equivalent, U+201C.) When the form is
submitted, the character should be sent to the web server UTF-8 encoded as three octets: E2 80 9C — this is what
U+201C looks like when encoded with UTF-8.
If the web page's character encoding is instead set to Windows-1252, the character should be sent as a single octet:
0x93 (which is 147 decimal). If the web page's character encoding is instead set to ISO-8859-1, the character will
also be sent as a single octet, but the value may be either 0x93 or 0x22 (0x22 is the ASCII and ISO-8859-1 quote
character). If the browser uses the superset Windows-1252 encoding when ISO-8859-1 is specified, 0x93 is sent.
Otherwise, the character will be translated to the only quote character officially defined in ISO-8859-1, 0x22.
Hopefully you see why it is imperative to know which encoding was used for the incoming form/text, so that it can
be decoded properly (as UTF-8 or Windows-1252) in your Perl program.
See also How do I enter ... [33] - Yahoo Answers.
UTF-8 vs utf8
As of Perl 5.8.7, UTF-8 is the strict, official UTF-8. The Encode module will complain if you try to encode or
decode invalid UTF-8, e.g.,
In contrast, utf8 is the liberal, lax version, allowing just about any 4-byte values:
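The difference can be sketched with a codepoint just beyond the Unicode range. The value 0x110000 is chosen here for illustration, and FB_CROAK is passed so that the strict encoder dies rather than substituting a replacement character:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $beyond_unicode = chr(0x110000);   # one past the last Unicode codepoint

# Lax utf8: accepted, and encoded as four octets.
my $lax_octets = encode('utf8', $beyond_unicode);

# Strict UTF-8: rejected, so the eval fails and $@ holds the complaint.
my $strict_octets = eval {
    encode('UTF-8', $beyond_unicode,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $strict_error = $@;

printf "lax: %d octets, strict: %s\n",
    length $lax_octets, defined $strict_octets ? 'accepted' : 'rejected';
```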
UTF-8 Functions
Each entry below gives the function, the state of the UTF8 flag on the result, and a description.
• $flag = utf8::is_utf8($string);
  UTF8 flag: N/A
  Tests whether $string is internally encoded as UTF-8. Returns false if not; otherwise returns true.
• $flag = utf8::decode($utf8_octets);
  UTF8 flag: depends
  Attempts to convert in-place the UTF-8 octet sequence into the corresponding N8CS or UTF-8 string, as
  appropriate. If $utf8_octets contains non-ASCII octets (i.e., multi-byte UTF-8 encoded characters), the UTF8
  flag is turned on, and the resulting string is UTF-8. Otherwise, the UTF8 flag remains off, and the resulting
  string is N8CS. This is the only decode function that may result in an N8CS byte string. Returns false if
  $utf8_octets is not UTF-8 encoded properly; otherwise returns true.
• $utf8_string = decode('UTF-8', $utf8_octets [, CHECK])
  UTF8 flag: turned on
  Decodes the UTF-8 octet sequence into a UTF-8 character string. Strict, official UTF-8 decoding rules (see
  previous section for discussion) are followed.
• $utf8_string = decode('utf8', $utf8_octets [, CHECK])
  UTF8 flag: turned on
  Decodes the UTF-8 octet sequence into a UTF-8 character string. Lax, liberal decoding rules (see previous
  section for discussion) are followed.
• $utf8_string = decode_utf8($utf8_octets [, CHECK])
  UTF8 flag: turned on
  Decodes the UTF-8 octet sequence into a UTF-8 character string. Equivalent to decode("utf8", $utf8_octets),
  hence lax decoding is employed.
• $octet_count = utf8::upgrade($n8cs_string);
  UTF8 flag: turned on
  Converts in-place the N8CS byte string into the corresponding UTF-8 character string. Returns the number of
  octets now used to represent the string internally as UTF-8. This function should be used to convert N8CS
  byte strings with characters in the 0x80-0xFF range to UTF-8, thereby avoiding the Perl 5 "Unicode Bug".
• utf8::encode($string)
  UTF8 flag: turned off
  Converts in-place the N8CS or UTF-8 $string into a UTF-8 octet sequence.
• $utf8_octets = encode('UTF-8', $string [, CHECK])
  UTF8 flag: turned off
  Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Strict, official UTF-8 encoding rules (see
  previous section for discussion) are followed.
• $utf8_octets = encode('utf8', $string)
  UTF8 flag: turned off
  Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Lax, liberal UTF-8 encoding rules (see
  previous section for discussion) are followed. Since all possible characters have a lax utf8 representation,
  this function cannot fail.
• $utf8_octets = encode_utf8($string)
  UTF8 flag: turned off
  Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Equivalent to encode("utf8", $string), hence
  lax encoding is employed. Since all possible characters have a lax utf8 representation, this function cannot
  fail.
• $flag = utf8::downgrade($utf8_string [, FAIL_OK]);
  UTF8 flag: turned off
  Converts in-place the UTF-8 character string to the equivalent N8CS byte string. Fails if $utf8_string cannot
  be represented in N8CS encoding. On failure dies, unless FAIL_OK is true, then returns false. Returns true on
  success.
References
• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and
Character Sets (No Excuses!) [39] - by Joel Spolsky
• FMTYEWTK about Characters vs Bytes [40] - Perlmonks
• CGI::Application and UTF-8 Form Processing example [41] - by Mark Rajcok
• Perl Unicode tutorial [42]
• Perl Unicode FAQ [43]
• Perl utf8 pragma [26]
• Perl Encode module [17] - handles all character encoding and decoding
• Unicode [1] - Wikipedia
• Perl Unicode introduction [44]
• Unicode support in Perl [45]
• Unicode::Semantics [22] - work around the Perl 5 Unicode bug
• The many Unicode::xxx modules [46] on CPAN
• UTF-8 round trip with MySQL [47] - Perlmonks
• CGI::Application - Which is the proper way of handling and outputting utf8 [48] - Perlmonks
• Understanding CGI.pm and UTF-8 handling [9] - Perlmonks
• UTF-8 and Unicode FAQ for Unix/Linux [49]
• Perl Unicode Mailing List <perl-unicode@perl.org>
Footnotes
^ - N8CS is a term that was coined for this document. Do not expect to see this term used elsewhere.
References
[1] http://en.wikipedia.org/wiki/Unicode
[2] http://en.wikipedia.org/wiki/ISO_8859
[3] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Character_properties
[4] http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
[5] http://en.wikipedia.org/wiki/UTF-8
[6] http://en.wikipedia.org/wiki/ASCII
[7] http://search.cpan.org/perldoc?perlunicode#Speed
[8] http://en.wikipedia.org/wiki/Perl_programming%2Funicode_utf-8#endnote_N8CS
[9] http://www.perlmonks.org/?node_id=626470
[10] http://search.cpan.org/dist/Template-Toolkit/lib/Template/FAQ.pod#Why_do_I_get_rubbish_for_my_utf-8_templates?
[11] http://en.wikipedia.org/wiki/Byte_Order_Mark
[12] http://search.cpan.org/perldoc?HTML::Template
[13] https://rt.cpan.org/Public/Bug/Display.html?id=30586
[14] http://sourceforge.net/mailarchive/forum.php?thread_name=4607245C.8030702%40netratings.com.au&forum_name=html-template-users
[15] http://search.cpan.org/perldoc?CGI
License
Creative Commons Attribution-Share Alike 3.0 Unported
http://creativecommons.org/licenses/by-sa/3.0/