[ruby-core:117427] [Ruby master Misc#20406] Question about Regexp encoding negotiation
From:
"Eregon (Benoit Daloze) via ruby-core" <ruby-core@...>
Date:
2024-04-03 10:29:28 UTC
List:
ruby-core #117427
Issue #20406 has been updated by Eregon (Benoit Daloze).
I found another case which does not seem to respect those rules:
```
$ ruby -ve 'p /#{"é".dup}/e.encoding'
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
#<Encoding:UTF-8>
$ ruby -e 'p /a#{"é".dup}b/e.encoding'
#<Encoding:UTF-8>
```
It seems to behave a bit like string interpolation here, but that's very confusing when mixed with the above rules.
How can one know which rule is applied when, and which takes precedence?
When mixing two incompatible encodings there is an error, which makes sense:
```
$ ruby -e 'p /a#{"é".dup}\xc2\xa1/e.encoding'
-e:1:in `<main>': encoding mismatch in dynamic regexp : UTF-8 and EUC-JP (RegexpError)
$ ruby -e '"é" + "\xc2\xa1".force_encoding("EUC-JP")'
-e:1:in `+': incompatible character encodings: UTF-8 and EUC-JP (Encoding::CompatibilityError)
```
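The string half of that comparison can be reproduced without a shell. The following is only a sketch of the general String compatibility rule, not of the regexp code path; the variable names are illustrative, and `"\u00E9"` is used so the example does not depend on the file's source encoding.

```ruby
# "\u00E9" is "é"; the \u escape makes the literal UTF-8.
utf8  = "\u00E9"
# Bytes C2 A1 reinterpreted as a single EUC-JP character.
eucjp = "\xc2\xa1".dup.force_encoding(Encoding::EUC_JP)

# Encoding.compatible? returns the encoding the combined string would
# have, or nil when the two operands cannot be combined.
compat = Encoding.compatible?(utf8, eucjp)
p compat

# Concatenating two non-ASCII strings in different encodings raises.
concat_error =
  begin
    utf8 + eucjp
    nil
  rescue Encoding::CompatibilityError => e
    e
  end
p concat_error.class

# A 7-bit-only operand is compatible even if it is tagged EUC-JP.
ascii_eucjp = "a".dup.force_encoding(Encoding::EUC_JP)
mixed = utf8 + ascii_eucjp
p mixed.encoding
```

The last case is the one that resembles the `/a#{...}b/e` result above: an ASCII-only part does not force an encoding on the combined value.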
Without the `.dup` there is a compile error:
```
$ ruby -e 'p /#{"é"}/e.encoding'
-e:1: regexp encoding option 'e' differs from source encoding 'UTF-8'
-e:1: regexp encoding option 'e' differs from source encoding 'UTF-8'
-e: compile error (SyntaxError)
```
Which is not so nice because this breaks referential transparency (e.g. a string literal can be replaced by a variable referencing that string literal) and adds more edge cases. I think the compiler should not look inside `#{}` for interpolated regexps.
OTOH this error seems OK, because it's something that can be detected at parse time:
```
$ ruby -e 'p /é/e.encoding'
-e:1: regexp encoding option 'e' differs from source encoding 'UTF-8'
-e: compile error (SyntaxError)
```
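To compare these cases across Ruby versions or implementations, the literals can be evaluated one at a time with `eval`, so that a parse-time error in one does not abort the others. This is a hypothetical probe script, not part of any test suite; it assumes the file is saved as UTF-8, and it deliberately records whatever happens instead of asserting a particular outcome.

```ruby
# Regexp literals exactly as they would appear in a UTF-8 source file.
# Single quotes keep the #{...} text from being interpolated here.
snippets = [
  '/#{"é".dup}/e', # interpolation of a non-literal expression
  '/#{"é"}/e',     # interpolation of a plain string literal
  '/é/e'           # no interpolation at all
]

# Evaluate each literal in isolation and record either the resulting
# encoding or the class of the exception that was raised.
results = snippets.to_h do |src|
  outcome =
    begin
      eval(src).encoding
    rescue SyntaxError, StandardError => e
      # SyntaxError is a ScriptError, so it must be named explicitly.
      e.class
    end
  [src, outcome]
end

results.each { |src, outcome| puts "#{src.ljust(16)} => #{outcome.inspect}" }
```

Running the same script under different interpreters gives a side-by-side view of which cases are rejected at parse time and which at run time.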
I think it would be tempting semantically to assign the encoding of the static parts of an interpolated regexp with a `/nesu` flag to that encoding.
IOW, `/nesu` would take precedence over the source encoding for the "static parts/static string literals" in an interpolated regexp.
That would however allow `/é/e`, unless there is an extra check for such parts being all 7-bit or not, which seems OK to have.
----------------------------------------
Misc #20406: Question about Regexp encoding negotiation
https://bugs.ruby-lang.org/issues/20406#change-107802
* Author: andrykonchin (Andrew Konchin)
* Status: Open
----------------------------------------
I am wondering what the rules are for calculating a Regexp literal's encoding when an encoding modifier is specified.
From the documentation:
> By default, a regexp with only US-ASCII characters has US-ASCII encoding:
> ...
> A regular expression containing non-US-ASCII characters is assumed to use the source encoding. This can be overridden with one of the following modifiers.
> //n ...
> //u ...
> //e ...
> //s ...
Looking at the following examples I would assume that these rules are followed except one case:
```ruby
p /\xc2\xa1/e .encoding # EUC-JP
p /#{ }\xc2\xa1/e .encoding # EUC-JP
p /a/e .encoding # EUC-JP
p /a #{} a/e .encoding # EUC-JP
p /#{} a/e .encoding # US-ASCII
```
The last Regexp `/#{} a/e` is supposed to have `EUC-JP` encoding but has `US-ASCII`. So I am wondering what rule is applied in this case.
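When inspecting cases like these it can help to look at `Regexp#fixed_encoding?` alongside `Regexp#encoding`. The sketch below covers only non-interpolated, ASCII-only literals with each modifier, where the documented rules apply directly; the `describe` helper is a name introduced here for illustration.

```ruby
# Hypothetical helper: summarize how a Regexp reports its encoding.
def describe(regexp)
  { encoding: regexp.encoding, fixed: regexp.fixed_encoding? }
end

plain   = describe(/a/)   # no modifier
ascii8  = describe(/a/n)  # "no encoding" modifier
eucjp   = describe(/a/e)  # EUC-JP modifier
win31j  = describe(/a/s)  # Windows-31J modifier
unicode = describe(/a/u)  # UTF-8 modifier

{ "/a/" => plain, "/a/n" => ascii8, "/a/e" => eucjp,
  "/a/s" => win31j, "/a/u" => unicode }.each do |literal, info|
  puts "#{literal.ljust(5)} encoding=#{info[:encoding]} fixed=#{info[:fixed]}"
end
```

Applying the same helper to the interpolated literals from the examples shows whether the modifier was carried through to the resulting Regexp.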