From: "Dan0042 (Daniel DeLorme) via ruby-core"
Date: 2023-06-14T02:05:04+00:00
Subject: [ruby-core:113902] [Ruby master Feature#19315] Lazy substrings in CRuby

Issue #19315 has been updated by Dan0042 (Daniel DeLorme).

Bumping this because it's kinda shocking to me that strings don't already work this way. My mental model of ruby strings has always been that
```
m = rx.match(very_large_string)
before, match, after = m.pre_match, m[0], m.post_match
```
is memory-wise a cheap operation because we only allocate 3 object slots which point to the same string data. I have a lot of code built on this assumption. But it turns out this was false! The `before` and `match` strings actually copy the string data as well.

Same thing for `File.read(very_large_file).split("\n")`, which I assumed allocated one large blob and then had pointers to various parts of that blob for each string of the resulting array. But actually it needs double the memory.

Allocating and copying memory is not free; I expect fixing this will lead to a large performance improvement.

----------------------------------------
Feature #19315: Lazy substrings in CRuby
https://bugs.ruby-lang.org/issues/19315#change-103553

* Author: Eregon (Benoit Daloze)
* Status: Open
* Priority: Normal
----------------------------------------
CRuby should implement lazy substrings, i.e., "abcdef"[1..3] must not copy bytes.

Currently CRuby only reuses the char* if the substring runs to the end of the buffer. But it should also work wherever the substring starts and ends. Yes, it means RSTRING_PTR() might need to allocate to \0-terminate, so be it, it's worth it.

There is already code for this (`SHARABLE_MIDDLE_SUBSTRING`), but it's disabled by default and `RSTRING_PTR()` needs to be changed to deal with this.

It seems a good idea to introduce a variant of `RSTRING_PTR` which doesn't guarantee \0-termination, so such callers can then always use the existing bytes without a copy.
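The difference between a substring that runs to the end of the buffer and one that stops earlier can be inspected from Ruby itself. The following is a minimal sketch, not part of the original message: the helper name `substring_memsizes` is hypothetical, and `ObjectSpace.memsize_of` only reports an implementation-dependent hint, so the exact numbers will vary between CRuby versions and builds.

```ruby
require "objspace"

# Hypothetical helper (not from the original message): reports the memory
# that ObjectSpace.memsize_of attributes to a few substrings of one large
# string. The figures are hints and depend on the CRuby version.
def substring_memsizes(size = 1_000_000)
  source = "x" * size        # large enough to live in a separate heap buffer
  half   = size / 2

  head   = source[0, half]   # stops before the end of the buffer
  middle = source[10, half]  # starts and stops inside the buffer
  tail   = source[half..]    # runs to the end of the buffer

  {
    source: ObjectSpace.memsize_of(source),
    head:   ObjectSpace.memsize_of(head),
    middle: ObjectSpace.memsize_of(middle),
    tail:   ObjectSpace.memsize_of(tail),
  }
end

substring_memsizes.each do |name, bytes|
  puts format("%-6s %d bytes", name, bytes)
end
```

If the behavior is as described in the issue, `tail` should report roughly the size of an object slot, while `head` and `middle` should each account for their own copy of the bytes.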
There are countless workarounds for this missing optimization, none of them worth it once lazy substrings exist, and all less readable:
* https://bugs.ruby-lang.org/issues/19314
* https://bugs.ruby-lang.org/issues/18598#note-3
* https://github.com/ruby/net-protocol/pull/14
* Manual lazy substrings which track string + index + length
* More, but I don't remember them all now; feel free to comment or link more URLs/tickets.
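For illustration, the "manual lazy substring" workaround from the list above might look roughly like the sketch below. The class name `LazySubstring` and its methods are hypothetical and not taken from the issue or from any library. It assumes byte offsets and assumes the source string is not mutated while views exist.

```ruby
# Hypothetical illustration of a manual lazy substring: a view that only
# tracks the source string, a byte offset and a byte length.
class LazySubstring
  attr_reader :offset, :length

  def initialize(source, offset, length)
    if offset < 0 || length < 0 || offset + length > source.bytesize
      raise ArgumentError, "view lies outside the source string"
    end
    @source = source
    @offset = offset
    @length = length
  end

  # Narrow the view; only a new small view object is allocated.
  def slice(offset, length)
    if offset < 0 || length < 0 || offset + length > @length
      raise ArgumentError, "slice lies outside the view"
    end
    LazySubstring.new(@source, @offset + offset, length)
  end

  # Read one byte through the view without building a String.
  def getbyte(index)
    return nil if index < 0 || index >= @length
    @source.getbyte(@offset + index)
  end

  # Materialize a real String; this is where bytes may be copied.
  def to_s
    @source.byteslice(@offset, @length)
  end
end

view = LazySubstring.new("abcdef", 1, 3)
puts view.to_s             # the same bytes as "abcdef"[1..3]
puts view.slice(1, 2).to_s # a narrower view into the same source
```

Every caller has to know about the wrapper and call `to_s` before handing the data to anything that expects a String, which is the readability cost the issue refers to.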