[ruby-core:65460] [CommonRuby - Feature #10084] Add Unicode String Normalization to String class

From: duerst@...
Date: 2014-10-07 11:00:32 UTC
List: ruby-core #65460
Issue #10084 has been updated by Martin Dürst.

Assignee changed from Nobuyoshi Nakada to Yukihiro Matsumoto

This feature is going to add one or more methods to class String (String#unicode_normalize and probably String#unicode_normalize! and String#unicode_normalized?).

The implementation also internally needs various additional methods and constants that an end user should not ever want or need to use. Just adding them to class String is the easiest solution, but this may confuse a user when e.g. calling String.instance_methods(false). In the current implementation, these methods and constants are put in module Normalize (see https://github.com/duerst/eprun/blob/master/lib/normalize.rb).

In order to proceed with the implementation of this feature, I'd like to get advice from Matz (and others) on the best alternative. I have thought through the following alternatives:

1) Use a standalone module (probably better to change the name, e.g. to UnicodeNormalizeImplementation or so)

2) Use a module inside String, e.g. String::Normalize. Advantage: the module name can be shorter, because it is local.

3) Use an anonymous module. This is possible, but it requires that all the related code and data is in the same physical file, which restricts potential memory optimizations. Also, it requires that all the code is re-written e.g. replacing 'def' with 'define_method', which will look rather clumsy.

4) Just add the necessary methods and constants to String, but use longer, more explicit names. This should be slightly faster, because currently many of the methods take an explicit string parameter, which would then just be the receiver. We can also make the methods private to reduce user temptation.

5) Use a refinement. The advantages are that this can be distributed over more than one file, and that we can directly call the methods on Strings (see 4). The disadvantage is that we still need a public module as the refinement container.
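
To make the difference between alternatives 2 and 5 above more concrete, here is a minimal sketch. All names in it (String::NormalizeHelpers, NormalizeRefinement, NormalizeDemo, combining_acute?) are hypothetical and chosen only for illustration; they are not taken from eprun or from any proposed patch, and the single helper stands in for the much larger set of internal methods and tables.

```ruby
# Alternative 2 (sketch): helpers live in a module nested inside String.
# String::NormalizeHelpers is a hypothetical name.
class String
  module NormalizeHelpers
    # Stand-in for an internal data table; real normalization data is far larger.
    COMBINING_ACUTE = "\u0301"

    # The helper takes the string as an explicit parameter.
    def self.combining_acute?(string)
      string.include?(COMBINING_ACUTE)
    end
  end
end

# Alternative 5 (sketch): the same helper as a refinement, so the string
# is the receiver. The container module still needs a public name.
module NormalizeRefinement
  refine String do
    def combining_acute?
      include?("\u0301")
    end
  end
end

module NormalizeDemo
  using NormalizeRefinement

  # The refined method is visible only where the refinement is active.
  def self.check(string)
    string.combining_acute?
  end
end
```

In both variants the helper stays out of String.instance_methods(false). With the refinement, code that has not activated it (via `using`) cannot call the helper on a String at all.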

I personally would like to avoid 3) if at all possible. I don't have strong preferences among the other solutions. There may also be other solutions that I haven't thought about yet.

I would like to get Matz's preference(s) as soon as possible to proceed with the implementation. Any advice from others, e.g. with respect to similar cases, performance tradeoffs, other ideas, and so on, is also greatly appreciated.


----------------------------------------
Feature #10084: Add Unicode String Normalization to String class
https://bugs.ruby-lang.org/issues/10084#change-49242

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee: Yukihiro Matsumoto
* Category:
* Target version: Ruby 2.2.0
----------------------------------------
Unicode string normalization is a frequent operation when comparing or normalizing strings.

This should be available directly on the String class.

The proposed syntax is:

   'string'.normalize       # normalize 'string' according to NFC (most frequent on the Web)
   'string'.normalize :nfc  # normalize 'string' according to NFC; :nfd, :nfkc, :nfkd also usable
   'string'.nfc             # shorter variant, but maybe too many methods
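
As an illustration of why normalization matters for comparison, the following sketch uses the method names mentioned in the update above (String#unicode_normalize and String#unicode_normalized?) rather than the `normalize` spelling proposed here. It assumes a Ruby version in which those methods are available (2.2 or later); `same_text?` is a hypothetical helper name.

```ruby
# Assumes String#unicode_normalize is available (Ruby 2.2 or later).
# same_text? is a hypothetical helper, not part of the proposal.
def same_text?(a, b)
  # Compare two strings after bringing both into NFC.
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

composed   = "\u00E9"   # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

composed == decomposed                # => false (different code points)
same_text?(composed, decomposed)      # => true
composed.unicode_normalize(:nfd)      # same code points as decomposed
decomposed.unicode_normalized?(:nfc)  # => false
```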

There are several "unofficial" but convenient normalization variants that could be offered, e.g.:

   'string'.normalize :mac  # use Macintosh file system normalization variant

Implementations are already available in pure Ruby (easy for other Ruby implementations; e.g. eprun: https://github.com/duerst/eprun) and in C (unf, …, http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/)

---Files--------------------------------
Normalization.pdf (576 KB)


-- 
https://bugs.ruby-lang.org/
