Unicode (UTF-8) with PHP 5.3, MySQL 5.5 and HTML5 Cheat Sheet
CONVERSION CONFIGURATION MYSQL CODE
How to transform file encoding
Example with PHP files on Linux:
find . -name "*.php" \
-exec iconv \
-f ISO-8859-1 -t UTF-8 \
{} -o /path/to/utf8_files/{} \;
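Where the shell tools are not available, the same conversion can be sketched for a single file in PHP with the iconv() function. The helper name convert_file_to_utf8() and its parameters are assumptions made for illustration, and the sketch assumes the source file really is encoded in ISO-8859-1.

```php
<?php
// Hypothetical helper (not part of the cheat sheet): re-encode one file
// from a known source encoding to UTF-8 and write the result to a new path.
function convert_file_to_utf8($sourcePath, $targetPath, $fromEncoding = 'ISO-8859-1')
{
    $bytes = file_get_contents($sourcePath);
    if ($bytes === false) {
        return false; // source file could not be read
    }

    // iconv() returns false when the input cannot be converted
    // from the given source encoding.
    $utf8 = iconv($fromEncoding, 'UTF-8', $bytes);
    if ($utf8 === false) {
        return false;
    }

    return file_put_contents($targetPath, $utf8) !== false;
}
```

Running such a conversion twice on the same file would double-encode it, so it is safer to write to a separate target path, as the find command above does.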
HTTP and HTML
In php.ini [1]:
default_charset = UTF-8
or in httpd.conf or .htaccess [5]:
AddDefaultCharset UTF-8
or in the PHP code [5]:
header('Content-type: text/html; charset=UTF-8');
Additionally, put this in your HTML <head> block:
<meta charset="UTF-8"/>
MySQL
Right after each connection, call¹ [2]:
SET NAMES 'utf8';
Ordering in MySQL
Ordering in MySQL depends on the collation you choose. Detailed information about this subject may be found in the documentation on MySQL.com [2]. Look especially at the Unicode Character Sets section:
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html
Stored Procedures and Functions
Old function:
CREATE FUNCTION example_function (
parameter_name VARCHAR(255)
)
RETURNS VARCHAR(255)
READS SQL DATA
BEGIN
DECLARE data VARCHAR(255);
...
RETURN data;
END;
New function:
CREATE FUNCTION example_function (
parameter_name VARCHAR(255)
CHARACTER SET utf8
)
RETURNS VARCHAR(255)
CHARACTER SET utf8
READS SQL DATA
BEGIN
DECLARE data VARCHAR(255)
CHARACTER SET utf8;
...
RETURN data;
END;
How to transform character encoding in MySQL databases
Procedure² (use the INFORMATION_SCHEMA database to build a script automatically):
● Create a temporary, identical structure in a new database.
● Copy all data to that structure.
● Drop the initial structure.
● Recreate it with the new character encoding:
CHARACTER SET utf8
COLLATE utf8_general_ci
● Copy all data from the temporary structure to the new structure, converting all text to the new encoding.
● Drop the temporary structure.
PHP
In php.ini [1.1]:
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On
mbstring.http_input = auto
mbstring.http_output = UTF-8
mbstring.detect_order = auto
mbstring.substitute_character = none
or in httpd.conf or .htaccess:
php_value <php.ini directive> <value>
or in the PHP code³ [1]:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_detect_order('auto');
mb_substitute_character('none');
Verifications⁴
Run this small PHP script:
if ( ! extension_loaded('mbstring'))
die('mb functions not loaded');
if (1 != preg_match('/^.{1}$/u', "ñ", $UTF8_ar))
die('PCRE is not compiled with UTF-8 support');
exit('ok');
¹ Once this is done, PHP sees MySQL databases as if each TEXT, CHAR or VARCHAR field were encoded in UTF-8, no matter what the actual encoding is. Thus, there is no need to prepare the encoding of query parameters or to convert the results in PHP.
² Trying to change the character encoding of TEXT, CHAR and VARCHAR fields directly with an ALTER TABLE can corrupt existing data, notably when the bytes actually stored do not match the character set declared for the column.
³ Some php.ini directives cannot be modified in the PHP code.
⁴ Source: utf8.php in PHP UTF-8 library [3]
Unicode (UTF-8) with PHP 5.3, MySQL 5.5 and HTML5 Cheat Sheet, version 1.0 Date: 2011-05-12 Author, sources, copyright and license on page 3 Page 1 / 3
PHP CODE (1/2)
Multibyte string functions [1]
Replace           With
strlen()          mb_strlen();   // How many characters
                  strlen();      // How many bytes
                  mb_strwidth(); // Display width in monospaced columns
substr()          mb_substr()
strstr()          mb_strstr()
stristr()         mb_stristr()
strrchr()         mb_strrchr()⁵
strpos()          mb_strpos()
stripos()         mb_stripos()
strrpos()         mb_strrpos()
strripos()        mb_strripos()
strtolower()      mb_strtolower()
strtoupper()      mb_strtoupper()
substr_count()    mb_substr_count()⁶
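The difference between bytes, characters and display width is easiest to see on a short string. The following is a minimal sketch; the sample text is an assumption chosen for illustration, and the encoding is passed explicitly so the result does not depend on the mbstring configuration.

```php
<?php
// "日本" written as explicit UTF-8 bytes, so the example does not depend
// on the encoding in which this source file is saved.
$text = "\xE6\x97\xA5\xE6\x9C\xAC";

$bytes      = strlen($text);               // counts bytes
$characters = mb_strlen($text, 'UTF-8');   // counts characters
$columns    = mb_strwidth($text, 'UTF-8'); // East Asian wide characters count as 2

echo "$bytes bytes, $characters characters, $columns columns\n";
// prints "6 bytes, 2 characters, 4 columns"

// substr() cuts in the middle of a character; mb_substr() does not.
$brokenFirst = substr($text, 0, 1);             // "\xE6", an incomplete sequence
$firstChar   = mb_substr($text, 0, 1, 'UTF-8'); // "日"
```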
String access by character [1]
Search for all uses of curly or square brackets to extract single characters from strings:
$string{$position} // old syntax
$string[$position] // new syntax
Regular expressions to find them:
/\$\w(\w|\d)*\{(\d+|\$\w(\w|\d)*)\}/
/\$\w(\w|\d)*\[(\d+|\$\w(\w|\d)*)\]/
Replace
$char = $string{$pos};
with
$char = mb_substr($string, $pos, 1);
Replace
$string{$pos} = $char;
with
$string = mb_substr($string, 0, $pos)
. $char
. mb_substr($string, $pos + 1);
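The two replacements above can be wrapped in small helpers so that each call site changes only once. This is a sketch: the names char_at() and set_char_at() are hypothetical, and the encoding is passed explicitly rather than taken from mb_internal_encoding().

```php
<?php
// Hypothetical helper: read the character (not the byte) at $pos.
function char_at($string, $pos)
{
    return mb_substr($string, $pos, 1, 'UTF-8');
}

// Hypothetical helper: return a copy of $string with the character
// at $pos replaced by $char.
function set_char_at($string, $pos, $char)
{
    $length = mb_strlen($string, 'UTF-8');

    return mb_substr($string, 0, $pos, 'UTF-8')
         . $char
         . mb_substr($string, $pos + 1, $length, 'UTF-8');
}

$word = "\xC3\xB1and\xC3\xBA"; // "ñandú" as UTF-8 bytes

$byteAccess = $word[0];                    // "\xC3": only half of "ñ"
$charAccess = char_at($word, 0);           // "ñ"
$replaced   = set_char_at($word, 4, 'u');  // "ñandu"
```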
UTF-8-safe functions⁷
addslashes()
bin2hex()
explode() [4]
implode()
nl2br()
stripslashes()
strip_tags()
str_repeat()
str_replace() [4]
Escaping functions
The functions htmlentities()⁸ and htmlspecialchars() both have a third parameter that specifies the character set used during conversion. Unlike with the multibyte functions (mb_*()), this third parameter must be passed explicitly whenever the character set is not 'ISO-8859-1' (the default in PHP 5.3), no matter what the internal encoding is!
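A short sketch of the third parameter in use, together with what happens when the wrong character set is given. The sample string is an assumption chosen for illustration.

```php
<?php
$input = "<b>caf\xC3\xA9</b>"; // "<b>café</b>" as UTF-8 bytes

// Charset passed explicitly: only the markup characters are escaped.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
// &lt;b&gt;café&lt;/b&gt;

// htmlentities() with the matching charset also converts "é" to an entity.
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');
// &lt;b&gt;caf&eacute;&lt;/b&gt;

// With the wrong charset, each byte of "é" is treated as a separate
// ISO-8859-1 character and the output is junk.
$junk = htmlentities($input, ENT_QUOTES, 'ISO-8859-1');
// &lt;b&gt;caf&Atilde;&copy;&lt;/b&gt;
```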
The functions urlencode() and rawurlencode()
do not have any character encoding parameter. The
safest solution is to put your UTF-8 strings in session
variables instead of URL arguments.
Comparing strings and sorting arrays
Use the Collator class:
http://www.php.net/manual/en/class.collator.php
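A minimal sketch of locale-aware sorting compared with the built-in byte-order sort. It assumes the intl extension is loaded (it provides Collator); the helper name sort_for_locale() and the 'fr_FR' locale are illustrative choices.

```php
<?php
// Hypothetical helper: return a copy of $words sorted by the rules
// of the given locale. Requires the intl extension.
function sort_for_locale(array $words, $locale)
{
    $collator = new Collator($locale);
    $collator->sort($words); // sorts the array in place

    return $words;
}

$words = array('zebra', "\xC3\xA9t\xC3\xA9", 'apple'); // second word is "été"

// Byte-order sort: "été" lands after "zebra" because its first byte is 0xC3.
$byBytes = $words;
sort($byBytes, SORT_STRING);

// Locale-aware sort, only attempted when Collator is available.
if (class_exists('Collator')) {
    $byLocale = sort_for_locale($words, 'fr_FR'); // apple, été, zebra
}
```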
SimpleXML
SimpleXML uses UTF-8 internally and converts all
XML content to UTF-8 [1.2], so usually nothing needs
to be done.
PCRE functions [1]
Search for all PCRE function calls (preg_*) and append the /u pattern modifier⁹
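The effect of the /u modifier shows up as soon as a pattern has to match a single multibyte character. A small sketch, with sample strings chosen for illustration:

```php
<?php
$enye = "\xC3\xB1"; // "ñ": one character, two bytes in UTF-8

// Without /u the dot matches a single byte, so "exactly one character" fails.
$withoutU = preg_match('/^.$/', $enye);  // 0

// With /u the subject is treated as UTF-8 and the dot matches "ñ" as a whole.
$withU = preg_match('/^.$/u', $enye);    // 1

// With /u, an empty pattern splits a string into characters, not bytes.
$chars = preg_split('//u', 'a' . $enye . 'o', -1, PREG_SPLIT_NO_EMPTY);
// array('a', 'ñ', 'o')
```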
Storable representation of variables
The serialize() and unserialize() functions can be used transparently. However, be careful when reading or writing serialized UTF-8 strings with languages other than PHP [4].
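The reason for the caution is visible in the serialized form itself: the length prefix counts bytes, not characters. A brief sketch:

```php
<?php
$enye = "\xC3\xB1"; // "ñ" as UTF-8: one character, two bytes

$serialized = serialize($enye);
echo $serialized, "\n"; // s:2:"ñ";

// Within PHP the round trip is transparent.
$restored = unserialize($serialized);
```

A reader or writer in a language that measures strings in characters would have to compute the byte length of the UTF-8 form to stay compatible with this format.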
⁵ Note that the mb_strrchr() function accepts additional optional arguments, which may be ignored since we just want to adapt existing function calls. Note also that there is an mb_strrichr() function, which has no equivalent among the standard PHP functions.
⁶ Be careful: the 3rd and 4th arguments of substr_count() do not exist in mb_substr_count(). You can use mb_substr() to work around this limitation.
⁷ The strcmp() function is UTF-8-safe as well. However, to perform a locale-aware comparison, use Collator::compare instead: http://www.php.net/manual/en/collator.compare.php
⁸ The function htmlentities() converts Latin characters only [todo source]. Moreover, according to Handling UTF-8 with PHP [4]: “Using [htmlentities] on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output”, and “when using UTF-8, you don’t need entities”.
⁹ However, there may still be problems, as explained in Handling UTF-8 with PHP [4].
PHP CODE (2/2) CREDITS
String functions that are problematic and for which there is no built-in replacement function

Replace                                                  With a function from the PHP UTF8 Library¹⁰ [3]
ord($chr)¹¹                                              utf8_ord($chr)
sprintf()                                                -
    Comment: The x and X type specifiers could be an issue, according to [4].
str_ireplace($search, $replace, $subject [, &$count])    utf8_ireplace($search, $replace, $subject [, &$count])
    Comment: Alternatively, write your own implementation using preg_replace().
str_pad($input, $length, $padStr, $type)                 utf8_str_pad($input, $length, $padStr, $type)
str_split($str, $split_len)                              utf8_str_split($str, $split_len)
    Comment: Alternatively, use this function: http://www.php.net/manual/ref.mbstring.php#95192
strcasecmp($str1, $str2)                                 -
    Comment: Write your own implementation using collator_compare() and mb_strtolower().
strncmp($str1, $str2, $len)                              -
    Comment: Cut the two strings at the specified length, and use collator_compare().
strncasecmp($str1, $str2, $len)                          -
    Comment: Write your own implementation using your replacement of strncmp() and mb_strtolower().
strspn($str1, $str2[, $start[, $len]])                   utf8_strspn($str1, $str2[, $start[, $len]])
strcspn($str1, $str2[, $start[, $len]])                  utf8_strcspn($str1, $str2[, $start[, $len]])
strrev($string)                                          utf8_strrev($string)
strtr()                                                  -
    Comment: This function doesn't work if any parameter is UTF-8. Write your own implementation.
substr_replace()                                         utf8_substr_replace()
trim($str, $charlist)                                    utf8_trim($str, $charlist)
ltrim($str, $charlist)                                   utf8_ltrim($str, $charlist)
rtrim($str, $charlist)                                   utf8_rtrim($str, $charlist)
    Comment: The original functions trim(), ltrim() and rtrim() are UTF-8-safe as long as the 2nd parameter is not used [4].
ucfirst($str)                                            utf8_ucfirst($str)
ucwords($str)                                            utf8_ucwords($str)
wordwrap()                                               -
    Comment: Write your own implementation.
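To illustrate why the byte-oriented originals fail, and what a hand-written replacement might look like, here is a sketch of character-based versions of strrev() and str_split(). These are illustrative stand-ins with hypothetical names, not the implementations found in the PHP UTF-8 library, and they work on code points, so combining marks are not kept with their base character.

```php
<?php
// Hypothetical stand-in for str_split(): split into chunks of
// $splitLen characters (not bytes).
function sketch_str_split($str, $splitLen = 1)
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($splitLen <= 1) {
        return $chars;
    }

    $chunks = array();
    foreach (array_chunk($chars, $splitLen) as $chunk) {
        $chunks[] = implode('', $chunk);
    }

    return $chunks;
}

// Hypothetical stand-in for strrev(): reverse the order of characters.
function sketch_strrev($str)
{
    return implode('', array_reverse(sketch_str_split($str)));
}

$word = "a\xC3\xB1o"; // "año" as UTF-8 bytes

$byteReversed = strrev($word);        // "o\xB1\xC3a": no longer valid UTF-8
$charReversed = sketch_strrev($word); // "oña"
```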
Sources
[1] PHP.net documentation
[1.1] PHP.net, Multibyte String Runtime Configuration, http://www.php.net/manual/en/mbstring.configuration.php
[1.2] A comment about SimpleXML on PHP.net: http://www.php.net/manual/en/ref.simplexml.php#79258
[2] MySQL.com documentation
[3] Harry Fuecks, PHP UTF-8 library, http://sourceforge.net/projects/phputf8
[4] Web Application Component Toolkit, Handling UTF-8 with PHP, http://www.phpwact.org/php/i18n/utf-8
[5] W3C, Setting the HTTP charset parameter, http://www.w3.org/International/O-HTTP-charset.php
Author and Copyright
Copyright © François Cardinaux 2011
Feel free to contact me at:
http://www.linkedin.com/in/francoiscardinaux
License
Creative Commons
Attribution-Non-Commercial-Share Alike 3.0
¹⁰ Version 0.5
¹¹ Underlined parameters fail if they are UTF-8-encoded