[ruby-core:63987] [CommonRuby - Feature #10084] Add Unicode String Normalization to String class
From: duerst@...
Date: 2014-07-24 06:58:26 UTC
List: ruby-core #63987
Issue #10084 has been updated by Martin Dürst.

Nobuyoshi Nakada wrote:

> What will happen for a non-unicode string, raising an exception?

This is a very good question. I'm okay with whatever Matz and the community think is best.

There are many potential approaches. In general, these will be:

1) Make the operation a no-op.
2) Convert to UTF-8, normalize, then convert back.
3) Implement normalization directly in the encoding.
4) Raise an exception.

There is also the question of what a "non-unicode" or "unicode" string is. UTF-8 is the preferred way to handle Unicode in Ruby, and is where normalization is really needed and will be used.

For the other encodings, unless we go with 1) or 4), the following considerations apply.

UTF8-Mac, UTF8-DoCoMo, UTF8-KDDI and UTF8-Softbank are essentially UTF-8 but with slightly different character conversions. For these encodings, the easiest thing to do is force_encoding to UTF-8, normalize, and force_encoding back. A C-level implementation may not actually need force_encoding, but a Ruby implementation does. There are some questions about what normalizing UTF8-Mac means, so that may have to be treated separately. The DoCoMo/KDDI/Softbank variants are mostly about emoji, which as far as I know are not affected by normalization.

Then there are UTF-16LE/BE and UTF-32LE/BE. For these, it depends on the implementation. A Ruby-level implementation (unless it is to be very slow) may want to convert to UTF-8 and back. A C-level implementation may not need to do this.

Then there is also GB18030. Conversion to UTF-8 and back seems to be the best solution. Doing normalization directly in GB18030 will need too much data.

For other, truly non-unicode encodings, implementing normalization directly in the encoding would mean the following: Analyze to what extent the normalization applies to the encoding in question, and apply this part.
As an example, '①'.nfkc produces '1' in UTF-8; it could do the same in Windows-31J. The analysis might take some time (but can be automated), and the data needed for each encoding would mostly be very small.

----------------------------------------
Feature #10084: Add Unicode String Normalization to String class
https://bugs.ruby-lang.org/issues/10084#change-48005

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
----------------------------------------
Unicode string normalization is a frequent operation when comparing or normalizing strings. This should be available directly on the String class.

The proposed syntax is:

    'string'.normalize        # normalize 'string' according to NFC (most frequent on the Web)
    'string'.normalize :nfc   # normalize 'string' according to NFC; :nfd, :nfkc, :nfkd also usable
    'string'.nfc              # shorter variant, but maybe too many methods

There are several "unofficial" but convenient normalization variants that could be offered, e.g.:

    'string'.normalize :mac   # use Macintosh file system normalization variant

Implementations are already available in pure Ruby (easy for other Ruby implementations; e.g. eprun: https://github.com/duerst/eprun) and in C (unf, …, http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/)

---Files--------------------------------
Normalization.pdf (576 KB)

-- 
https://bugs.ruby-lang.org/
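
The per-encoding handling discussed in the reply can be sketched in Ruby roughly as follows. This is only an illustration under stated assumptions, not the implementation proposed in the issue: `normalize_any`, `UTF8_VARIANT_NAMES`, `UTF8_VARIANTS` and `VIA_UTF8` are hypothetical names, and the actual normalization is delegated to `String#unicode_normalize`, which exists in Ruby 2.2 and later rather than the `normalize` spelling proposed above. The sketch picks approach 2) for UTF-16/UTF-32/GB18030, the force_encoding round trip for the UTF-8 variants, and approach 4) for everything else. UTF8-MAC is relabeled like the other variants here, although the reply notes it may need separate treatment.

```ruby
# Sketch only: normalize_any and the constants below are hypothetical names.
# Assumes Ruby 2.2+ so that String#unicode_normalize is available.

# UTF-8 variants share UTF-8's byte layout, so relabeling the bytes is enough.
UTF8_VARIANT_NAMES = %w[utf8-mac utf8-docomo utf8-kddi utf8-softbank].freeze
UTF8_VARIANTS = Encoding.list.select { |enc|
  UTF8_VARIANT_NAMES.include?(enc.name.downcase)
}.freeze

# Encodings handled by transcoding to UTF-8 and back (approach 2).
VIA_UTF8 = [
  Encoding::UTF_16LE, Encoding::UTF_16BE,
  Encoding::UTF_32LE, Encoding::UTF_32BE,
  Encoding::GB18030
].freeze

def normalize_any(str, form = :nfc)
  enc = str.encoding
  if enc == Encoding::UTF_8
    str.unicode_normalize(form)
  elsif UTF8_VARIANTS.include?(enc)
    # force_encoding to UTF-8, normalize, force_encoding back.
    # dup first so the caller's string keeps its encoding label.
    str.dup.force_encoding(Encoding::UTF_8)
       .unicode_normalize(form)
       .force_encoding(enc)
  elsif VIA_UTF8.include?(enc)
    # Convert to UTF-8, normalize, then convert back.
    str.encode(Encoding::UTF_8).unicode_normalize(form).encode(enc)
  else
    # Approach 4: refuse to normalize truly non-Unicode strings.
    raise Encoding::CompatibilityError, "cannot normalize #{enc} string"
  end
end

# Usage: "e" followed by a combining acute accent, stored as UTF-16LE.
decomposed = "e\u0301".encode(Encoding::UTF_16LE)
composed   = normalize_any(decomposed)   # still UTF-16LE, now one code point
```

Swapping the final `raise` for `str` would give approach 1) (a no-op) instead; approach 3) would need per-encoding data tables and is not sketched here.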