From: Run Paint Run Run <redmine@...>
Date: 2009-09-13T09:21:22+09:00
Subject: [ruby-core:25540] [Bug #2095] Oniguruma No Longer Understands Unihan Characters

Bug #2095: Oniguruma No Longer Understands Unihan Characters
http://redmine.ruby-lang.org/issues/show/2095

Author: Run Paint Run Run
Status: Open, Priority: High
ruby -v: ruby 1.9.2dev (2009-09-11) [i686-linux]

As Oniguruma was undocumented, the recent update was based mainly on guesswork. While working on a Unicode library to create an exhaustive test suite I noticed that the update introduced a serious regression. We based the update on UnicodeData.txt and Scripts.txt, but as the former omits Unihan characters, their properties are no longer recognized. To fix this we can have tool/enc-unicode.rb parse Unihan.txt (or, rather, the files into which it has been split as of Unicode 5.2). However, I'd prefer instead to update the script to use the new XML dump Unicode has made available. This is comprehensive, and its simpler, standardized file format means parsing bugs are far less likely. In addition, it makes it easier to expand our Unicode support in the future simply by selecting additional attributes. Unfortunately, both approaches preclude storing the data file(s) in SVN (as we currently do with UnicodeData.txt and Scripts.txt) because the Unihan.txt file alone is 28MB uncompressed. (The XML dump is, of course, even bigger.)
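
To illustrate why the XML dump is simpler to consume, here is a minimal sketch of reading script data from it with REXML. It is not the actual change to tool/enc-unicode.rb. The helper name script_ranges and the SAMPLE fragment are made up for illustration, and the sketch assumes the dump describes characters as char elements carrying cp (or first-cp/last-cp for a range) and sc (script) attributes.

```ruby
require 'rexml/document'

# Made-up fragment in the assumed shape of the XML dump; the real file
# is far larger and carries many more attributes per character.
SAMPLE = <<XML
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" sc="Latn" gc="Lu"/>
    <char first-cp="4E00" last-cp="9FCB" sc="Hani" gc="Lo"/>
  </repertoire>
</ucd>
XML

# Hypothetical helper: map each script code to the code point ranges
# that carry it. Elements without a script or code point are skipped.
def script_ranges(xml)
  ranges = Hash.new { |hash, key| hash[key] = [] }
  REXML::Document.new(xml).root.each_recursive do |element|
    next unless element.name == 'char'
    script = element.attributes['sc']
    first  = element.attributes['cp'] || element.attributes['first-cp']
    last   = element.attributes['cp'] || element.attributes['last-cp']
    next unless script && first && last
    ranges[script] << (first.hex..last.hex)
  end
  ranges
end

ranges = script_ranges(SAMPLE)
ranges.each do |script, list|
  formatted = list.map { |r| format('%04X..%04X', r.first, r.last) }
  puts "#{script}: #{formatted.join(', ')}"
end
```

Supporting a further property would then mostly be a matter of reading one more attribute from each element, which is the appeal of this format over a custom parser for each text file.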

In the next 24 hours I will update the script to download the latest XML dump and parse it.


----------------------------------------
http://redmine.ruby-lang.org