[ruby-core:64092] [CommonRuby - Feature #10084] Add Unicode String Normalization to String class

From: duerst@...
Date: 2014-07-28 05:32:20 UTC
List: ruby-core #64092
Issue #10084 has been updated by Martin Dürst.


Copying notes from the 2014/7/26 developers' meeting (Google Docs):

Proposed method names by Matz: unicode_normalize or normalize_kd,... (not too short)
How to deal with non-Unicode encodings: Matz: raise Exception
Other than UTF-8: UTF8-Mac: return type should be UTF-8, or deal with it as legacy (not really Unicode). UTF8-DoCoMo,..? Yui should decide. UTF-16/32: Needed data,... differs by whether implementation is internal (C) or pure Ruby.
Todo (for eprun): measure load time, compare with unf, avoid Module Normalize

require "unicode_normalize"
method name: String#unicode_normalize(form)
form: :nfc, :nfd, :nfkc, :nfkd
encoding: UTF-32BE/LE, UTF-16BE/LE, UTF-8
Allowing UTF8-MAC is confusing.
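
To make the notes above concrete, here is a minimal sketch of how the agreed interface could be used. It assumes a Ruby version (2.2 or later) in which String#unicode_normalize is built in with the four forms listed above; the sample strings and variable names are only illustrations, and the exact exception class for non-Unicode encodings is an assumption (the notes only say "raise Exception").

```ruby
# Assumption: Ruby 2.2+ where String#unicode_normalize is available.

decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

nfc  = decomposed.unicode_normalize(:nfc)    # canonical composition
nfd  = nfc.unicode_normalize(:nfd)           # canonical decomposition
nfkc = "\uFB01".unicode_normalize(:nfkc)     # compatibility form: "fi" ligature -> "fi"

# UTF-16/32 input is accepted and the result keeps the input encoding.
utf16 = decomposed.encode(Encoding::UTF_16LE).unicode_normalize(:nfc)

# Non-Unicode encodings are rejected with an exception.
# Rescuing EncodingError is an assumption about the exception hierarchy.
begin
  "\u00E9".encode(Encoding::ISO_8859_1).unicode_normalize(:nfc)
  raised = false
rescue EncodingError
  raised = true
end
```

The form is passed as a symbol, so adding or restricting variants later would not require new method names.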

----------------------------------------
Feature #10084: Add Unicode String Normalization to String class
https://bugs.ruby-lang.org/issues/10084#change-48104

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
----------------------------------------
Unicode string normalization is a frequent operation when comparing or normalizing strings.

This should be available directly on the String class.
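
As an illustration of the comparison use case, the sketch below shows two strings that render identically but only compare equal after normalization. It uses String#unicode_normalize (the name from the meeting notes, assuming Ruby 2.2 or later) rather than the `normalize` name proposed in this description; `same_text?` is a hypothetical helper, not part of any library.

```ruby
# Assumption: Ruby 2.2+ where String#unicode_normalize is available.

precomposed = "D\u00FCrst"    # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed  = "Du\u0308rst"   # "u" followed by U+0308 COMBINING DIAERESIS

# Hypothetical helper: compare two strings after normalizing both to the same form.
def same_text?(a, b, form = :nfc)
  a.unicode_normalize(form) == b.unicode_normalize(form)
end

plain_equal      = (precomposed == decomposed)          # code point sequences differ
normalized_equal = same_text?(precomposed, decomposed)  # equal once normalized
```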

The proposed syntax is:

   'string'.normalize       # normalize 'string' according to NFC (most frequent on the Web)
   'string'.normalize :nfc  # normalize 'string' according to NFC; :nfd, :nfkc, :nfkd also usable
   'string'.nfc             # shorter variant, but maybe too many methods

There are several "unofficial" but convenient normalization variants that could be offered, e.g.:

   'string'.normalize :mac  # use Macintosh file system normalization variant

Implementations are already available in pure Ruby (easy for other Ruby implementations; e.g. eprun: https://github.com/duerst/eprun) and in C (unf,…, http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/)

---Files--------------------------------
Normalization.pdf (576 KB)


-- 
https://bugs.ruby-lang.org/
