Bug #19867

closed

Unicode line and paragraph separator are not stripped

Added by iainbeeston (Iain Beeston) almost 2 years ago. Updated almost 2 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
[ruby-core:114662]

Description

The Unicode line and paragraph separators (U+2028 and U+2029) are not removed by any of the strip methods:

"\u2028\u2029\u0000\t\n\v\f\r ".strip # => "\u2028\u2029"

I would have expected strip (and lstrip, rstrip) to remove Unicode whitespace as well. It looks like #7154 reported something similar, but for regular expressions and way back in Ruby 1.9.

I think that fixing this should be simple (just checking for U+2028 and U+2029 in ctype.h), but I'm not sure if it's supposed to behave this way or if changing it could introduce unexpected consequences.

Updated by iainbeeston (Iain Beeston) almost 2 years ago

I can see that the [[:space:]] regex class does match Unicode whitespace characters ("\u2028" =~ /[[:space:]]/ # => 0) but \s does not ("\u2028" =~ /\s/ # => nil).
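
To make the difference between the two classes easier to see, here is a small sketch that checks a few characters against both. The helper name space_classes and the choice of sample characters are only illustrative, and it assumes UTF-8 strings.

```ruby
# Illustrative helper (not part of Ruby): reports whether a single
# character matches \s and whether it matches [[:space:]].
def space_classes(ch)
  [ch.match?(/\s/), ch.match?(/[[:space:]]/)]
end

# Sample characters chosen for illustration; assumes UTF-8 source/strings.
samples = {
  "U+0020" => " ",       # ASCII space
  "U+00A0" => "\u00A0",  # no-break space
  "U+2028" => "\u2028",  # line separator
  "U+2029" => "\u2029"   # paragraph separator
}

samples.each do |name, ch|
  ascii, posix = space_classes(ch)
  puts format("%-6s  \\s: %-5s  [[:space:]]: %s", name, ascii, posix)
end
```

Only the ASCII space should match both classes here; the other three are expected to match [[:space:]] alone.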

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

Yes, \s, \w, etc. match only single-byte ASCII characters.
I don't think changing the behavior by default is a good idea.
An optional (keyword) argument may be better.
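
For anyone who needs this today, an opt-in behavior along those lines can be approximated in plain Ruby. This is only a sketch: strip_ws and its unicode: keyword are hypothetical names, String#strip itself accepts no such argument, and the regex assumes UTF-8 input.

```ruby
# Hypothetical helper, NOT part of Ruby: String#strip has no `unicode:` keyword.
# Matches a run of Unicode whitespace or NUL at either end of the string.
UNICODE_EDGE_WS = /\A[[:space:]\x00]+|[[:space:]\x00]+\z/

def strip_ws(str, unicode: false)
  # Default: keep the built-in, ASCII-only behavior.
  return str.strip unless unicode

  # Opt-in: also remove U+2028, U+2029 and other Unicode whitespace.
  str.gsub(UNICODE_EDGE_WS, "")
end

strip_ws("\u2028 hello \u2029")                 # separators are kept
strip_ws("\u2028 hello \u2029", unicode: true)  # separators are removed
```

Keeping the default at false leaves existing callers unaffected, which is the concern raised above about changing the behavior by default.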

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

As for the implementation, changing ctype.h is not desirable.
There is already an rb_enc_isspace function for this purpose.

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

  • Status changed from Open to Rejected