Bug #21765
Open
Stop using the C runtime _read() on Windows
Description
When an IO instance is created on Windows, the default data mode is text mode.
In practice, when no encoding conversion is requested, the IO encoding conversion mechanism is bypassed and CRLF conversion is delegated to the C runtime's _read().
This is explicitly for speed.
https://bugs.ruby-lang.org/issues/6401#note-4
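As a point of reference, the visible effect of newline conversion can be reproduced on any platform with explicit open modes. This is only a minimal sketch: the file name prefix is arbitrary, and which layer performs the conversion for "rt" (the C runtime or Ruby's own converter) depends on the platform and build, so the example shows the result, not the code path discussed above.

```ruby
require "tempfile"

# Write CRLF-terminated lines in binary mode so the bytes on disk are exact.
file = Tempfile.new("crlf-demo")
file.binmode
file.write("a\r\nb\r\n")
file.close

# Binary mode returns the bytes unchanged.
raw = File.open(file.path, "rb") { |f| f.read }

# "rt" requests universal newline conversion, so CRLF is read as LF.
text = File.open(file.path, "rt") { |f| f.read }

p raw   # "a\r\nb\r\n"
p text  # "a\nb\n"
```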
As a trade-off, SET_BINARY_MODE(fptr) and SET_BINARY_MODE_WITH_SEEK_CUR(fptr) are used in various places within io.c, altering the state of the file descriptor.
This makes the flow of operations difficult to follow and changes hard to implement, especially for developers on other platforms.
Additionally, the issues I recently reported were discovered while verifying the impact of modifying the CRLF conversion to utilize the encoding conversion mechanism.
#21691 On Windows some of binary read functions of IO are not functional
#21687 IO#pos goes wrong after EOF character(ctrl-z) met
#21634 Combining read(1) with eof? causes dropout of results unexpectedly on Windows.
These issues arise because data read into the rbuf does not match the stream due to newline conversion, or because the buffer end and file position do not align when CTRLZ is detected.
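The mismatch can be illustrated with a small pure-Ruby model. This is a sketch under stated assumptions, not the logic in io.c: `text_mode_view` and `raw_offset_for` are hypothetical helpers that mimic a text-mode read (CRLF folded to LF, input ending at the first Ctrl-Z) and map a count of consumed visible bytes back to an offset in the raw stream.

```ruby
CTRL_Z = "\x1A".b

# Hypothetical helper: what a text-mode read would expose to the caller.
def text_mode_view(raw)
  visible = raw.b
  eof_at = visible.index(CTRL_Z)
  visible = visible[0, eof_at] if eof_at
  visible.gsub("\r\n".b, "\n".b)
end

# Hypothetical helper: raw offset after consuming `n` visible bytes.
def raw_offset_for(raw, n)
  bytes = raw.b
  offset = 0
  seen = 0
  while seen < n && offset < bytes.bytesize
    # A CRLF pair occupies two raw bytes but only one visible byte.
    offset += bytes[offset, 2] == "\r\n".b ? 2 : 1
    seen += 1
  end
  offset
end

raw = "ab\r\ncd\r\nef\x1Atail".b
buffered = text_mode_view(raw)

p raw.bytesize            # 15
p buffered.bytesize       # 8
p raw_offset_for(raw, 3)  # 4
```

Because the buffered length (8) differs from the number of raw bytes it represents (10, before the Ctrl-Z), any code that rewinds the file position by the buffered length lands at the wrong offset unless it re-scans the data.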
As a fix for Bug #21687, I created PR #15216. However, this relies on the internal behavior of the C runtime's _read() function, and it seems there is no way to avoid this dependency.
I propose removing the use of the C runtime's _read().
Reason for Proposal
- The mismatch between rbuf and stream contents complicates io_unread() and makes maintenance difficult.
- Changing the O_BINARY/O_TEXT state of the file descriptor in various places hinders understanding of the behavior and makes modifications difficult.
Two methods to remove C runtime _read() while maintaining current behavior
- Interpret CRLF and CTRLZ when reading rbuf within io.c.
- Interpret CRLF and CTRLZ within the encoding conversion framework.
My initial idea was to implement the second, using encoding conversion.
However, this internally changes the read operation from rbuf to cbuf, resulting in a change to the behavior of ungetc.
The proposal in Bug #21682 attempted to generalize this change to minimize its impact.
https://bugs.ruby-lang.org/issues/21682
This issue proposes the first method: CRLF conversion when reading from rbuf.
Problems caused by inconsistencies between the rbuf and stream contents are avoided, and io_unread() becomes the same as on other platforms.
Compared to implementing it as an encoding conversion, the advantage is that behavior does not change.
The drawback is that each read method in io.c requires individual handling, whereas the encoding conversion approach would keep the changes more localized.
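The idea behind the first method can be sketched in Ruby as follows. This is an illustrative model only and not the code in the PR: `RawBufferTextReader`, its `chunk_size` parameter, and the decision to hold back a trailing CR until the next chunk arrives are all assumptions made for the example. The point it shows is that the buffer always holds raw bytes, and CRLF and Ctrl-Z are interpreted only when data is handed to the caller.

```ruby
require "stringio"

# Hypothetical reader: rbuf holds raw bytes; translation happens on the way out.
class RawBufferTextReader
  CTRL_Z = "\x1A".b
  CRLF = "\r\n".b
  LF = "\n".b

  def initialize(io, chunk_size: 4)
    @io = io
    @chunk_size = chunk_size
    @rbuf = "".b
    @eof = false
  end

  # Returns all remaining text with CRLF folded to LF, ending at Ctrl-Z.
  def read_all
    out = "".b
    until @eof
      fill
      out << take_translated
    end
    out
  end

  private

  # Append one raw chunk to the buffer, or mark EOF when the source is drained.
  def fill
    chunk = @io.read(@chunk_size)
    if chunk.nil?
      @eof = true
    else
      @rbuf << chunk
    end
  end

  def take_translated
    if (stop = @rbuf.index(CTRL_Z))
      @rbuf = @rbuf[0, stop]
      @eof = true
    end
    # Hold back a trailing CR until the next chunk shows whether LF follows.
    keep = (!@eof && @rbuf.end_with?("\r")) ? 1 : 0
    ready = @rbuf[0, @rbuf.bytesize - keep]
    @rbuf = @rbuf[ready.bytesize, keep]
    ready.gsub(CRLF, LF)
  end
end

reader = RawBufferTextReader.new(StringIO.new("ab\r\ncd\r\nef\x1Atail"), chunk_size: 3)
p reader.read_all  # "ab\ncd\nef"
```

Since the buffer contents match the stream byte for byte, pushing unread data back only requires moving the file position by the number of raw bytes left in the buffer.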
Updated by YO4 (Yoshinao Muramatsu) 18 days ago
I created PR #15408.
For comparison, there is a patch [WIP] that utilizes encoding conversion.
https://github.com/YO4/ruby/commits/avoid-windows-crt-read/
Updated by YO4 (Yoshinao Muramatsu) 14 days ago
The current io.c code for Windows is intended to avoid the performance limitations of the encoding converter.
Therefore, performance is important. Here are some benchmarks before and after the patch.
(some empty lines removed manually)
** generate fixture **
> c:\ruby34-x64\bin\ruby -e "open('long.txt', 'wb') { |f| 8.times { f.write 'x'*1024*1024 } }"
> c:\ruby34-x64\bin\ruby -e "open('medium.txt', 'w') { |f| 65536.times { f.puts 'x'*80 } }"
> c:\ruby34-x64\bin\ruby -e "open('dense.txt', 'w') { |f| (65536*8).times { f.puts '' } }"
** IO#read with no line breaks **
> master\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 128.times { f.read; f.rewind }; puts Time.now - t }"
1.5247651
> patched\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 128.times { f.read; f.rewind }; puts Time.now - t }"
1.260006
** IO#read with moderate line breaks **
> master\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 128.times { f.read; f.rewind }; puts Time.now - t }"
0.8651427
> patched\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 128.times { f.read; f.rewind }; puts Time.now - t }"
1.0337131
** IO#read with dense line breaks **
> master\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 1024.times { f.read; f.rewind }; puts Time.now - t }"
1.2129726
> patched\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 1024.times { f.read; f.rewind }; puts Time.now - t }"
1.3608866
** IO#readlines with no line breaks **
> master\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 1.times { f.readlines; f.rewind }; puts Time.now - t }"
0.9637844
> patched\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 1.times { f.readlines; f.rewind }; puts Time.now - t }"
0.9633248
** IO#readlines with moderate line breaks **
> master\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 64.times { f.readlines; f.rewind }; puts Time.now - t }"
1.2625658
> patched\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 64.times { f.readlines; f.rewind }; puts Time.now - t }"
1.2737932
** IO#readlines with dense line breaks **
> master\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 32.times { f.readlines; f.rewind }; puts Time.now - t }"
1.8690925
> patched\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 32.times { f.readlines; f.rewind }; puts Time.now - t }"
1.7237312
** IO#gets with no line breaks **
> master\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 1.times { while f.gets do end; f.rewind }; puts Time.now - t }"
0.9798694
> patched\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 1.times { while f.gets do end; f.rewind }; puts Time.now - t }"
1.0012916
** IO#gets with moderate line breaks **
> master\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 64.times { while f.gets do end; f.rewind }; puts Time.now - t }"
1.3087455
> patched\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 64.times { while f.gets do end; f.rewind }; puts Time.now - t }"
1.3204693
** IO#gets with dense line breaks **
> master\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 16.times { while f.gets do end; f.rewind }; puts Time.now - t }"
1.2261222
> patched\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 16.times { while f.gets do end; f.rewind }; puts Time.now - t }"
1.1575592
** IO#getc with no line breaks **
> master\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 1.times { while f.getc do end; f.rewind }; puts Time.now - t }"
0.8297943
> patched\ucrt64\bin\ruby -e "open('long.txt', 'r') { |f| t = Time.now; 1.times { while f.getc do end; f.rewind }; puts Time.now - t }"
0.767664
** IO#getc with moderate line breaks **
> master\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 2.times { while f.getc do end; f.rewind }; puts Time.now - t }"
1.0667623
> patched\ucrt64\bin\ruby -e "open('medium.txt', 'r') { |f| t = Time.now; 2.times { while f.getc do end; f.rewind }; puts Time.now - t }"
0.9615283
** IO#getc with dense line breaks **
> master\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 16.times { while f.getc do end; f.rewind }; puts Time.now - t }"
0.9435967
> patched\ucrt64\bin\ruby -e "open('dense.txt', 'r') { |f| t = Time.now; 16.times { while f.getc do end; f.rewind }; puts Time.now - t }"
0.8731051
** ruby versions **
> master\ucrt64\bin\ruby -v
ruby 4.0.0dev (2025-12-04T09:08:23Z master 8099e9d2d1) +PRISM [x64-mingw-ucrt]
> patched\ucrt64\bin\ruby -v
ruby 4.0.0dev (2025-12-04T12:54:48Z remove_crt_read a72e32e5e4) +PRISM [x64-mingw-ucrt]
The variation in the IO#read results after the patch is due to the characteristics of memchr.
For the other results, the differences appear to reflect a balance between the added overhead and a reduction in the interlocks that set_mode() uses internally.
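For anyone who wants to repeat measurements of this kind, the one-liners above could be wrapped in a small script. This is a sketch only: `time_reads` is a hypothetical helper, the fixture mirrors medium.txt but is written to a temporary file, and it uses the monotonic clock in place of Time.now, which is a substitution and not what the figures above were measured with.

```ruby
require "tempfile"

# Hypothetical helper: time `rounds` passes of a read-and-rewind loop
# using the monotonic clock, which is unaffected by wall-clock adjustments.
def time_reads(path, rounds)
  File.open(path, "r") do |f|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    rounds.times do
      yield f
      f.rewind
    end
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Same shape as medium.txt: 65536 lines of 80 characters.
fixture = Tempfile.new("medium")
65536.times { fixture.puts "x" * 80 }
fixture.close

elapsed = time_reads(fixture.path, 8) { |f| f.read }
puts elapsed
```

Running the same script under both builds and comparing the printed values gives a before/after figure for one read method; swapping the block for `{ |f| while f.gets do end }` or `{ |f| f.readlines }` covers the other cases.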