[ruby-core:63987] [CommonRuby - Feature #10084] Add Unicode String Normalization to String class

From: duerst@...
Date: 2014-07-24 06:58:26 UTC
List: ruby-core #63987
Issue #10084 has been updated by Martin Dürst.


Nobuyoshi Nakada wrote:
> What will happen for a non-unicode string, raising an exception?

This is a very good question. I'm okay with whatever Matz and the community think is best.

There are many potential approaches. In general, these will be:
1) Make the operation a no-op.
2) Convert to UTF-8, normalize, then convert back.
3) Implement normalization directly in the encoding.
4) Raise an exception.
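
As a rough illustration of how approaches 1), 2) and 4) differ in behaviour, here is a minimal sketch. It assumes a Ruby that ships String#unicode_normalize (Ruby 2.2 and later), which is not the `normalize`/`nfc` naming discussed in this issue. The method name `normalize_with_policy` and the `policy:` keyword are hypothetical and only serve the example. Approach 3) is left out because it needs per-encoding data.

```ruby
# Hypothetical helper, for illustration only.
# Assumes String#unicode_normalize is available (Ruby 2.2+).
def normalize_with_policy(str, form = :nfc, policy: :raise)
  # UTF-8 strings are normalized directly.
  return str.unicode_normalize(form) if str.encoding == Encoding::UTF_8

  case policy
  when :noop
    # Approach 1: leave the string untouched.
    str.dup
  when :via_utf8
    # Approach 2: convert to UTF-8, normalize, convert back.
    # The final encode raises if the result is not representable.
    str.encode(Encoding::UTF_8).unicode_normalize(form).encode(str.encoding)
  when :raise
    # Approach 4: refuse to normalize.
    raise Encoding::CompatibilityError, "cannot normalize #{str.encoding} string"
  else
    raise ArgumentError, "unknown policy: #{policy.inspect}"
  end
end
```

Note that with `:via_utf8` the conversion back can fail, for example when NFD produces a combining character that the original encoding cannot express.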

There is also the question of what a "non-unicode" or "unicode" string is.

UTF-8 is the preferred way to handle Unicode in Ruby, and is where normalization is really needed and will be used.

For the other encodings, unless we go with 1) or 4), the following considerations apply.

UTF8-Mac, UTF8-DoCoMo, UTF8-KDDI and UTF8-Softbank are essentially UTF-8 but with slightly different character conversions. For these encodings, the easiest thing to do is force_encoding to UTF-8, normalize, and force_encoding back. A C-level implementation may not actually need force_encoding, but a Ruby implementation does. There are some questions about what normalizing UTF8-Mac means, so that may have to be treated separately. The DoCoMo/KDDI/Softbank variants are mostly about emoji, which as far as I know are not affected by normalization.
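
The force_encoding round trip described above could look roughly like this at the Ruby level. This is only a sketch: the helper name `normalize_utf8_variant` is hypothetical, and String#unicode_normalize (Ruby 2.2 and later) stands in for whatever normalization method ends up being adopted.

```ruby
# Hypothetical helper, for illustration only.
# Assumes String#unicode_normalize is available (Ruby 2.2+).
def normalize_utf8_variant(str, form = :nfc)
  original = str.encoding
  # force_encoding only relabels the bytes; nothing is transcoded.
  relabeled = str.dup.force_encoding(Encoding::UTF_8)
  normalized = relabeled.unicode_normalize(form)
  # Put the original encoding label back on the normalized bytes.
  normalized.force_encoding(original)
end
```

The receiver is duplicated first so that the caller's string keeps its encoding label. A call would take a string tagged with one of the UTF-8 variants and return a normalized string with the same tag.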

Then there are UTF-16LE/BE and UTF-32LE/BE. For these, it depends on the implementation. A Ruby-level implementation (unless very slow) may want to convert to UTF-8 and back. A C-level implementation may not need to do this.

Then there is also GB18030. Conversion to UTF-8 and back seems to be the best solution. Doing normalization directly in GB18030 will need too much data.

For other, truly non-Unicode encodings, implementing normalization directly in the encoding would mean the following: analyze to what extent the normalization applies to the encoding in question, and apply this part. As an example, '①'.nfkc produces '1' in UTF-8; it could do the same in Windows-31J. The analysis might take some time (but can be automated), and the data needed for each encoding would mostly be very small.
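
To give an idea of how such an analysis could be automated, here is a minimal sketch. The helper names `legacy_normalization_table` and `normalize_with_table` are hypothetical, and String#unicode_normalize (Ruby 2.2 and later) is assumed for the UTF-8 side. The sketch only handles mappings that start from a single character; sequences that compose across characters are out of scope.

```ruby
# Hypothetical helpers, for illustration only.
# Assumes String#unicode_normalize is available (Ruby 2.2+).

# For each candidate character (given in UTF-8), record its normalized form
# if both the character and the result are representable in `encoding`.
def legacy_normalization_table(chars, encoding, form = :nfkc)
  chars.each_with_object({}) do |ch, table|
    normalized = ch.unicode_normalize(form)
    next if normalized == ch # unaffected by this normalization form
    begin
      table[ch.encode(encoding)] = normalized.encode(encoding)
    rescue EncodingError
      # Either side cannot be expressed in the legacy encoding; skip it.
    end
  end
end

# Apply the table directly to a string in the legacy encoding.
def normalize_with_table(str, table)
  return str.dup if table.empty?
  str.gsub(Regexp.union(table.keys), table)
end

# "\u2460" is CIRCLED DIGIT ONE; "A" is unaffected and gets no entry.
table = legacy_normalization_table(["\u2460", "A"], "Windows-31J")
result = normalize_with_table("\u2460A".encode("Windows-31J"), table)
```

In a real analysis the candidate list would be every character of the encoding rather than a hand-picked sample, and the resulting table would be the per-encoding data mentioned above.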


----------------------------------------
Feature #10084: Add Unicode String Normalization to String class
https://bugs.ruby-lang.org/issues/10084#change-48005

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
----------------------------------------
Unicode string normalization is a frequent operation when comparing or normalizing strings.

This should be available directly on the String class.

The proposed syntax is:

   'string'.normalize       # normalize 'string' according to NFC (most frequent on the Web)
   'string'.normalize :nfc  # normalize 'string' according to NFC; :nfd, :nfkc, :nfkd also usable
   'string'.nfc             # shorter variant, but maybe too many methods

There are several "unofficial" but convenient normalization variants that could be offered, e.g.:

   'string'.normalize :mac  # use Macintosh file system normalization variant

Implementations are already available in pure Ruby (easy for other Ruby implementations; e.g. eprun: https://github.com/duerst/eprun) and in C (unf, …, http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/)

---Files--------------------------------
Normalization.pdf (576 KB)


-- 
https://bugs.ruby-lang.org/
