ruby-core

NARUSE, Yui wrote on 2010-11-15 11:07:
> This is what Japanese people often say "Americans don't consider
> non-ASCII

Sure, many people want to handle non-ASCII text. But:

* I would consider all non-Unicode character sets to be legacy, do you
disagree?
* ruby 1.9's model doesn't handle stateful encodings like ISO-2022-JP,
so these need transcoding at the edge anyway
* hence why not just transcode everything that's not Unicode into
Unicode?
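As a rough sketch of what "transcoding at the edge" could look like (the
helper name to_utf8 is made up for illustration, and String#b needs ruby 2.0
or later): bytes arriving from outside are labelled with their real encoding
and converted to UTF-8 once, so everything inside the program sees a single
encoding.

```ruby
# Hypothetical helper: label raw bytes with their real encoding, then
# transcode to UTF-8 so the rest of the program deals with one encoding only.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

hello = "\u3053\u3093\u306B\u3061\u306F"    # five hiragana characters, UTF-8

# Simulate data arriving from a file or socket as untagged bytes.
euc_bytes = hello.encode("EUC-JP").b
jis_bytes = hello.encode("ISO-2022-JP").b   # stateful: uses escape sequences

# ISO-2022-JP is a "dummy" encoding: it can be transcoded but not processed
# in place, which is why it needs converting at the edge in any case.
puts Encoding::ISO_2022_JP.dummy?

puts to_utf8(euc_bytes, "EUC-JP") == hello
puts to_utf8(jis_bytes, "ISO-2022-JP") == hello

# Transcode back only at the point of writing data out.
puts hello.encode("EUC-JP").b == euc_bytes
```

The cost is one conversion on input and one on output; the rest of the code
never has to ask which encoding a string is in.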

I would choose UTF-8 as the internal Unicode representation, since the
majority of external Unicode data is already UTF-8.  (*)

Then you end up with the design used by both Python 3.0 and Erlang: you have
two data types, one for binary strings, and one for UTF-8 text.  (I should
add this to the document as an explicit alternative)

This would wipe out most of the complexity associated with ruby 1.9 at a
stroke. What you would lose is:

* the ability to handle things like EUC-JP or GB2312 "natively", that is,
without transcoding them to UTF-8 and back
* the ability to write ruby programs in non-UTF-8 character sets

How big a loss is that?

(*) There's an argument which says use UTF-16 or UTF-32 internally as it's
better suited to character indexing.  I would say that this is outweighed by
the extra RAM bandwidth used, and the fact that most data is UTF-8 so would
have to be transcoded.
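To put rough numbers on the RAM argument, here is a small sketch comparing
the size of the same text in the three representations.  The sample strings
and the method name sizes_for are only illustrative; the LE variants are
used so that no byte-order mark is added.

```ruby
# Byte sizes of one string in UTF-8, UTF-16 and UTF-32.
def sizes_for(str)
  %w[UTF-8 UTF-16LE UTF-32LE].map { |enc| str.encode(enc).bytesize }
end

ascii = "hello world"                       # 11 characters
kana  = "\u3053\u3093\u306B\u3061\u306F"    # 5 hiragana characters

[ascii, kana].each do |s|
  puts "#{s.length} chars: #{sizes_for(s).join(' / ')} bytes"
end
```

ASCII-heavy data doubles in UTF-16 and quadruples in UTF-32.  Kana text is
actually smaller in UTF-16 than in UTF-8 (2 bytes per character instead of
3), so the trade-off depends on the data, but UTF-32 is the largest in both
cases.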

> > Have a universally-compatible "BINARY" encoding.
> > Any operation between BINARY and FOO gives encoding BINARY,
> > and transcoding between BINARY and any other encoding is a null operation.
>
> This will hide unexpectedly mixed BINARY string.
> You'll realize hard to debug such strings.

I would much rather have a program which outputs a plausible binary string
from its inputs than one which crashes given unexpected data.  ruby 1.9
hugely magnifies the number of unit test cases needed to achieve coverage of
these edge cases.
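A small sketch of the kind of edge case I mean (the helper name mixes? is
made up; String#b needs ruby 2.0 or later).  The same concatenation works or
raises depending only on whether the data happens to contain bytes above
0x7F, so a test suite which feeds in ASCII never sees the failure.

```ruby
# Hypothetical helper: true if a + b succeeds, false if ruby refuses to
# combine the two strings because of their encodings.
def mixes?(a, b)
  a + b
  true
rescue Encoding::CompatibilityError
  false
end

ascii_binary = "abc".b          # ASCII-8BIT, only 7-bit bytes
high_binary  = "\xFF\xFE".b     # ASCII-8BIT, bytes above 0x7F
ascii_text   = "def"            # 7-bit text
accented     = "caf\u00E9"      # UTF-8 with a non-ASCII character

puts mixes?(ascii_binary, ascii_text)   # both 7-bit: fine
puts mixes?(ascii_binary, accented)     # one side 7-bit: fine
puts mixes?(high_binary, ascii_text)    # one side 7-bit: fine
puts mixes?(high_binary, accented)      # both have high bytes: raises
```

Only the last combination fails, so every place where two strings meet needs
testing with non-ASCII data on both sides.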

> > Treat invalid characters in the same way as String#[] does,
> > i.e. never raise an exception. In particular, regexp matching always succeeds.
>
> This will raise security issue.

In what way is it a security issue?  Why is it not a security issue that
String#[] doesn't error?  Why is it not a security issue that 'sed' handles
such files successfully?
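For reference, a sketch of the inconsistency being discussed, as I
understand the current behaviour: indexing into a string containing an
invalid byte quietly returns that byte, while a regexp match on the same
string raises.  The sample data is made up; String#b needs ruby 2.0 and
String#scrub needs ruby 2.1.

```ruby
# A string tagged as UTF-8 which contains a byte that is never valid UTF-8.
bad = "abc\xFFdef".b.force_encoding(Encoding::UTF_8)

puts bad.valid_encoding?        # the string is known to be broken
p bad[3].bytes                  # String#[] returns the bad byte without complaint

begin
  bad =~ /def/
  puts "matched"
rescue ArgumentError => e
  puts "regexp raised: #{e.class}"
end

# Two ways to get a match anyway: treat the data as binary, or replace the
# invalid bytes first.
p(bad.b =~ /def/)
p(bad.scrub("?") =~ /def/)
```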

Roger Pack wrote:
> Maybe another option is a command line switch like --binary_only
> (avoids all encodings, does only ASCII-8BIT) or make it so that
> setting Encoding.default_external and default_internal to ASCII-8BIT
> would have the same effect...
> Just thinking out loud.

I think that's a bad idea, because then your program behaves in
different ways depending on how it's launched, and that breaks libraries
which depend on this global flag being set a particular way, and applications
which are run in different environments. This problem already affects 1.8:
see http://www.ruby-forum.com/topic/216511

I think if you wanted this behaviour to be opt-in you'd have a completely
different encoding, call it 'true-binary' for sake of argument.  Then you'd
say:

  # encoding:true-binary

at the top of the source file to make all your string literals be this
universal binary encoding.

Martin J. Dürst wrote:
> But the fact that US-ASCII and BINARY (==ASCII8BIT)
> currently mix easily will need some very careful thought and work (it
> has been discussed many times in great detail on ruby-dev). The short
> summary of why US-ASCII and BINARY currently mix easily is that if you
> want to check the magic number at the start of a GIF file, you want to
> write that
>     ~= /^GIF8/
> and not
>     ~= /^\x47\x49\x46\x38/b

I think you mean:

      =~ /\AGIF8/

Since this is binary data, you'll have read it using 'read' not 'gets'.
This demonstrates a far more dangerous default behaviour in Ruby: ^ matches
at the start of any line, not only at the start of the string, so the regexp
you wrote will match things you didn't expect.  (This isn't really relevant
to the discussion in hand, except
where people have said it's somehow dangerous to allow binary strings to
interact with strings in different encodings without raising an error)
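A quick sketch of the difference, using made-up sample data (String#b needs
ruby 2.0 or later):

```ruby
real_gif  = "GIF89a\x01\x00".b        # starts with the GIF magic number
disguised = "<script>\nGIF89a".b      # magic number only after a newline

# ^ matches at the start of any line, so both strings match.
p(real_gif  =~ /^GIF8/)
p(disguised =~ /^GIF8/)

# \A matches only at the very start of the string.
p(real_gif  =~ /\AGIF8/)
p(disguised =~ /\AGIF8/)

# For a fixed prefix, no regexp is needed at all.
p real_gif.start_with?("GIF8")
p disguised.start_with?("GIF8")
```

With ^ the second string is accepted as a GIF even though its first bytes
are something else entirely.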

Regards,

Brian.
