From: "Ondřej Bílka"
Date: 2012-01-14T05:42:20+09:00
Subject: [ruby-core:42122] Re: [ruby-trunk - Bug #5871] regexp \W matches some word characters when inside a case-insensitive character class

So regular expressions don't offer level 1 (basic) Unicode support?
See http://unicode.org/reports/tr18/

On Tue, Jan 10, 2012 at 06:07:13PM +0900, Yui NARUSE wrote:
>
> Issue #5871 has been updated by Yui NARUSE.
>
>
> Martin Dürst wrote:
> > Shohei Urabe writes:
> >
> > > Martin Dürst wrote:
> > > > Shouhei Urabe writes:
> > > >
> > > > > Quite generally speaking you are advised not to use /i in Unicode.
> > > >
> > > > Are there other examples where /i is advised against? If yes, please let's look at them and try to fix them, too.
> > >
> > > /Dĳkstra/i.match("DIJKSTRA") or something like that.
> >
> > What about /Dĳkstra/.match("Dijkstra") ?
>
> $ ruby -e "puts /D\u0133kstra/.match('Dijkstra').inspect"
>
> nil
>
> It is not an issue of case equivalence.
>
> > If this doesn't match without case equivalence, why should it match with case equivalence?
> > (I'm assuming that matching is transitive and that matching by /i should be a superset of matching without.)
>
> irb(main):005:0> /[^a-z]/=~"A"
> => 0
> irb(main):006:0> /[^a-z]/i=~"A"
> => nil
> ----------------------------------------
> Bug #5871: regexp \W matches some word characters when inside a case-insensitive character class
> https://bugs.ruby-lang.org/issues/5871
>
> Author: Gareth Adams
> Status: Rejected
> Priority: Normal
> Assignee:
> Category:
> Target version:
> ruby -v: ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-darwin10.8.0]
>
>
> =begin
> The following replacement, which should do nothing, has removed the upper- and lower-case "K"s and "S"s from the result:
>
> > "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".gsub(/[\W]/i,"")
> => "ABCDEFGHIJLMNOPQRTUVWXYZabcdefghijlmnopqrtuvwxyz"
>
> The result is correct (the same as the input string) if I remove either the character class:
>
> > "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".gsub(/\W/i,"")
> => "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
>
> or the case insensitive flag:
>
> > "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".gsub(/[\W]/,"")
> => "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
>
> This has been observed in two separate ruby 1.9 installs:
>
> * ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-darwin10.8.0]
> * ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0]
>
> but works correctly in 1.8
> =end
>
>
>
> --
> http://bugs.ruby-lang.org/

--
old inkjet cartridges emanate barium-based fumes
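
For anyone who wants to re-check the cases quoted above on their own interpreter, here is a minimal sketch. The names ALPHABET and removed_letters are hypothetical helpers, not anything from the thread. The script only reports what it observes for /[\W]/i, on the assumption that this result may differ between Ruby versions.

```ruby
# Sketch for re-checking the cases quoted in this thread.
# ALPHABET and removed_letters are hypothetical names used for illustration.
ALPHABET = (("A".."Z").to_a + ("a".."z").to_a).join

# Returns the letters of ALPHABET that gsub(pattern, "") deletes.
def removed_letters(pattern)
  ALPHABET.chars.to_a - ALPHABET.gsub(pattern, "").chars.to_a
end

puts "ruby #{RUBY_VERSION}"

# The three patterns from the bug report. Only the first one (character
# class plus /i) was reported to delete K, S, k and s on 1.9.2 and 1.9.3.
[/[\W]/i, /\W/i, /[\W]/].each do |pattern|
  puts "#{pattern.inspect} removes #{removed_letters(pattern).inspect}"
end

# The negated-class example from the reply.
p(/[^a-z]/ =~ "A")   # => 0
p(/[^a-z]/i =~ "A")  # => nil

# U+0133 is a single code point, so it does not match the two letters "ij".
p(/D\u0133kstra/.match("Dijkstra"))  # => nil
```

An empty array in the first line of pattern output means the interpreter under test does not show the reported behaviour.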