From: Michael Friedman
Date: 2011-05-16T12:55:51+09:00
Subject: [ruby-core:36221] [Ruby 1.9 - Feature #2034] Consider the ICU Library for Improving and Expanding Unicode Support

Issue #2034 has been updated by Michael Friedman.

Hi. I'm a newcomer to Ruby - studying it right now - but I've been writing multi-lingual systems for 15 years, so I think I can shed some light on internationalization issues.

First, I have to say that I was pretty amazed when I discovered that Ruby is neither a multi-character-set system nor natively Unicode. I just assumed that since it comes from Japan and is a relatively new language, multi-byte and Unicode support would have been automatic for its developers. Well, that's life, and you can't go back in time and change things.

Today, for the vast majority of serious applications, Unicode support is simply required, regardless of what kind of system you are building. Think about almost anything. Would it be reasonable if a web mail system didn't allow Chinese e-mails? If a bulletin board system didn't allow Japanese posts? If a task management system didn't at least let users create their own tasks in Thai? If a blogging platform didn't support Arabic?

I understand the concerns about backward compatibility if you convert String to a native Unicode type, but I think the pain would be worth it. Legacy applications could stay on old versions of Ruby if necessary; new applications would run in native Unicode.

If you do go to native Unicode, you have three realistic choices - UTF-8, UTF-16, and UTF-32:

o UTF-8 - A variable-width encoding: ASCII is 1 byte, many characters are 2 bytes, some (including most East Asian characters) are 3 bytes, and supplementary-plane characters are 4 bytes. The variable width carries a significant performance cost, but it is nice that it preserves ASCII semantics. Its biggest disadvantage is that it encourages lazy or ignorant programmers working in English to assume that CHAR = BYTE.

o UTF-16 - 2 bytes for most characters, but 4 bytes (a surrogate pair) for characters in the Supplementary Planes (see http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane). You can choose to ignore the Supplementary Planes (which was the initial design choice for Java), but that significantly hurts your product's suitability for East Asian languages, especially name handling. Java has since been modified into what I consider a bastard hybrid that supports the Supplementary Planes, but with great pain; see http://java.sun.com/developer/technicalArticles/Intl/Supplementary/. I strongly recommend against this approach. Sun had legacy constraints from starting with 16-bit Unicode that don't apply to Ruby. Unless developers understand Unicode well enough to test with Supplementary Plane characters (and most don't), they are going to have all sorts of fun bugs with this approach.

o UTF-32 - A 4-byte fixed-width representation. Definitely the fastest and simplest implementation, but with a painful memory penalty: "Hello" becomes 20 bytes instead of 5.

If I were creating Ruby from scratch, I would use two types - string_utf8 and string_utf32. utf8 is optimized for storage and utf32 is optimized for speed. "string" would be aliased to string_utf32, since most applications care more about the speed of string operations than about memory. Documentation would strongly encourage storing files in UTF-8.
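To make the size trade-offs concrete, here is a quick sketch using the transcoding Ruby 1.9 already ships (String#encode); the values assume the snippet itself is saved as UTF-8:

  s = "Hello"
  s.encode("UTF-8").bytesize     # => 5,  1 byte per ASCII character
  s.encode("UTF-16BE").bytesize  # => 10, 2 bytes per character
  s.encode("UTF-32BE").bytesize  # => 20, 4 bytes per character

  # Non-ASCII and supplementary-plane characters shift the balance:
  kana = "\u{30EB}\u{30D3}\u{30FC}"   # "ruby" in katakana
  kana.encode("UTF-8").bytesize       # => 9, 3 bytes per character
  astral = "\u{10330}"                # GOTHIC LETTER AHSA, outside the BMP
  astral.encode("UTF-8").bytesize     # => 4
  astral.encode("UTF-16BE").bytesize  # => 4, i.e. a surrogate pair
  astral.length                       # => 1, Ruby counts characters here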
Next issue, of course, is sorting. In the discussion above, no one has mentioned locales and collations. A straight byte-order sort in Unicode is usually useless. In fact, sort order differs from language to language. For example, in English the correct sort is leap, llama, lover. In traditional Spanish collation, however, it is leap, lover, llama - "ll" sorts after all other "l" words (a toy Ruby sketch of this appears below). String sorts therefore need an extra argument - a locale or collation. See how database systems like Oracle and SQL Server handle this to get a better understanding. Note that sorting strings from multiple languages together usually makes no sense - how do you sort "中文", "العربية", "English"? [Those are the strings "Chinese", "Arabic", "English" in native scripts - hopefully they will work on this forum... if not, well, here's an example of why Unicode support is necessary everywhere.]

String comparisons also require complex semantics. Many Unicode characters can be encoded either as a single code point or as multiple code points (a base character plus accents and similar combining marks). For example, "é" can be the single code point U+00E9 or the sequence U+0065 U+0301; the two are canonically equivalent but bytewise different. So you need to normalize before comparing - a pure bytewise comparison will give false negatives (also sketched below). The suggestion above that you can do bytewise comparisons on opaque strings is simply incorrect.

This leads into the next issue - handling other character sets. If the world were full of smart, foresightful, multi-lingual people who did everything right from the beginning, we would have had Unicode from day one of computing, and ASCII and the other legacy character sets would never have existed. Well, tough... they do exist, and amazingly enough people are still creating web pages, files, etc. in ISO-8859-1, Big5-HKSCS, GB18030, and so on. YEEARGH. For example, if I am writing a web spider, I need to pull in web pages in any character set, parse them for links, and then follow those links; I must support any and all character sets. So you either need character-set-typed strings or the ability to convert any character set to your native types for processing. See here for a list of character sets supported for conversion in Java: http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html. Also, check out http://download.oracle.com/javase/6/docs/api/java/lang/String.html and the String constructors that take Charset arguments to properly convert byte arrays to the native character set. See also http://download.oracle.com/javase/1.5.0/docs/api/java/io/InputStreamReader.html - "An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted."
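Returning to the collation point above: here is a toy sketch in plain Ruby of why byte order is not enough. The spanish_key lambda is a made-up illustration of traditional Spanish collation, not a real API - production code should use a real collation library such as ICU's:

  # Code point order happens to match English expectations:
  %w[lover llama leap].sort
  # => ["leap", "llama", "lover"]

  # Toy traditional-Spanish sort key: treat "ll" as one letter that
  # sorts after every other "l"; "~" is above "z" in ASCII, so "l~"
  # lands after any plain "l" sequence.
  spanish_key = lambda { |s| s.downcase.gsub("ll", "l~") }
  %w[lover llama leap].sort_by(&spanish_key)
  # => ["leap", "lover", "llama"]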
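The normalization false-negative is just as easy to demonstrate. One caveat: core Ruby 1.9 has no normalization API (String#unicode_normalize only arrived later, in Ruby 2.2), which is exactly the kind of gap ICU's normalization functions fill:

  composed   = "\u00E9"   # "é" as one precomposed code point
  decomposed = "e\u0301"  # "e" plus U+0301 COMBINING ACUTE ACCENT

  composed == decomposed  # => false, though the two render identically

  # With a normalization API (here Ruby >= 2.2's String#unicode_normalize),
  # both sides reduce to the same NFC form and compare equal:
  composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
  # => true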
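And for legacy character sets, Ruby 1.9's String#force_encoding and String#encode already provide the raw ingredients of the InputStreamReader bridge quoted above. A sketch for the web-spider case, where the file name and charset are made-up examples (in practice the charset would come from HTTP headers or a meta tag):

  raw = File.binread("page.html")      # raw bytes, tagged ASCII-8BIT
  page = raw.force_encoding("GB18030") # declare what the bytes already are
  utf8 = page.encode("UTF-8")          # transcode once, process uniformly
  utf8.encoding                        # => #<Encoding:UTF-8>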
I hope this is helpful. Proper support for i18n is a big job, but it is necessary if Ruby is really going to be a serious platform for building global systems.

----------------------------------------
Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support
http://redmine.ruby-lang.org/issues/2034

Author: Run Paint Run Run
Status: Assigned
Priority: Normal
Assignee: Yui NARUSE
Category: M17N
Target version: 2.0

=begin
Has consideration been recently given to employing the ICU library (http://site.icu-project.org/) in Ruby? The bindings are in C and the library is mature. My ignorance of the Ruby source notwithstanding, this would allow existing String methods, among others, to support non-ASCII characters in an incremental manner.

For a trivial example, consider String#to_i. It currently understands only ASCII characters which represent digits. ICU provides a u_charDigitValue(code_point) function which returns the integer corresponding to the given Unicode code point. Were String#to_i to use this, it would work with non-ASCII counting systems, thus removing at least one of the "as long as it's ASCII" caveats currently associated with String methods.

More generally, if it's desirable for String methods to properly support Unicode, and if the principal barrier is the difficulty of the implementation, then might there be at least a partial solution in marrying Ruby with ICU? If ICU is infeasible, I'd appreciate understanding why. There are multiple approaches to what I term the second phase of Unicode support in Ruby, and it will be easier to choose between them if I understand the constraints. :-) (Of course, if a direction has already been determined, and work on it is underway, I will gladly bow out ;-)).
=end

--
http://redmine.ruby-lang.org