A Deeper Look at Ruby Strings (2022)

0
45
A Deeper Look at Ruby Strings (2022)


Contributing to any of the Ruby implementations can be a daunting task. A lot of internal functionality has evolved over the years or been ported from one implementation to another, and much of it is undocumented. This post is an informal look at what makes encoding-aware strings in Ruby functional and performant. I hope it’ll help you get started digging into Ruby on your own or provide some additional insight into all the wonderful things the Ruby VM does for you.

Ruby has an incredibly flexible, if not unusual, string representation. Ruby strings are generally mutable, although the core library has both immutable and mutable variants of many operations. There’s also a mechanism for freezing strings that makes String objects immutable on a per-object or per-file basis. If a string literal is frozen, the VM will use an interned version of the string. Additionally, strings in Ruby are encoding-aware, and Ruby ships with 100+ encodings that can be applied to any string, which is in sharp contrast to other languages that use one universal encoding for all its strings or prevent the construction of invalid strings.

Depending on the context, different encodings are applied when creating a string without an explicit encoding. By default, the three primary ones used are UTF-8, US-ASCII, and ASCII-8BIT (aliased as BINARY). The encoding associated with a string can be changed with or without validation. It is possible to create a string with an underlying byte sequence that is invalid in the associated encoding.

The Ruby approach to strings allows the language to adapt to many legacy applications and esoteric platforms. The cost of this flexibility is the runtime overhead necessary to consider encodings in nearly all string operations. When two strings are appended, their encodings must be checked to see if they’re compatible. For some operations, it’s critical to know whether the string has valid data for its attached encoding. For other operations, it’s necessary to know where the character or grapheme boundaries are.

Depending on the encoding, some operations are more efficient than others. If a string contains only valid ASCII characters, each character is one byte wide. Knowing each character is only a byte allows operations like String#[], String#chr, and String#downcase to be very efficient. Some encodings are fixed width—each “character” is exactly N bytes wide. (The term “character” is vague when it comes to Unicode. Ruby strings (as of Ruby 3.1) have methods to iterate over bytes, characters, code points, and grapheme clusters. Rather than get bogged down in the minutiae of each, I’ll focus on the output from String#each_char and use the term “character” throughout.) Many operations with fixed-width encodings can be efficiently implemented as character offsets are trivial to calculate. In UTF-8, the default internal string encoding in Ruby (and many other languages), characters are variable width, requiring 1 – 4 bytes each. That generally complicates operations because it’s not possible to determine character offsets or even the total number of characters in the string without scanning all of the bytes in the string. However, UTF-8 is backwards-compatible with ASCII. If a UTF-8 string consists of only ASCII characters, each character will be one byte wide, and if the runtime knows that it can optimize operations on such strings the same as if the string had the simpler ASCII encoding.

Code Ranges

In general, the only way to tell if a string consists of valid characters for its associated encoding is to do a full scan of all the bytes. This is an O(n) process, and while not the least efficient operation in the world, it is something we want to avoid. Languages that don’t allow invalid strings only need to do the validation once, at string creation time. Languages that ahead-of-time (AOT) compile can validate string literals during compilation. Languages that only have immutable strings can guarantee that once a string is validated, it can never become invalid. Ruby has none of those properties, so its solution to reducing unnecessary string scans is to cache information about each string in a field known as a code range.

There are four code range values:

  • ENC_CODERANGE_UNKNOWN
  • ENC_CODERANGE_7BIT
  • ENC_CODERANGE_VALID
  • ENC_CODERANGE_BROKEN

The code range occupies an odd place in the runtime. As a place for the runtime to record profile information, it’s an implementation detail. There is no way to request the code range directly from a string. However, since the code range records information about validity, it also impacts how some operations perform. Consequently, a few String methods allow you to derive the string’s code range, allowing you to adapt your application accordingly.

The mappings are:

Code range

Ruby code equivalent

ENC_CODERANGE_UNKNOWN

No representation*

ENC_CODERANGE_7BIT

str.ascii_only?

ENC_CODERANGE_VALID

str.valid_encoding? && !str.ascii_only?

ENC_CODERANGE_BROKEN

!str.valid_encoding?

Table 1:  Mapping of internal code range values to public Ruby methods.

* – Code ranges are lazily calculated in most cases. However, when requesting information about a property that a code range encompasses, the code range is calculated on demand. As such, you may pass strings around that have an ENC_CODERANGE_UNKNOWN code range, but asking information about its validity or other methods that require the code range, such as a string’s character length, will calculate and cache the code range before returning a value to the caller.

Given its odd standing as somewhat implementation detail, somewhat not, every major Ruby implementation associates a code range with a string. If you ever work on a Ruby implementation’s internals or a native extension involving String objects, you’ll almost certainly run into working with and potentially managing the code range value.

Semantics

In MRI, the code range value is stored as an int value in the object header with bitmask flags representing the values. Each of the values is mutually exclusive from one another. This is important to note because logically, every string with an ASCII-compatible encoding and consists of only ASCII characters is a valid string. However, such a string will never have a code range value of ENC_CODERANGE_VALID. You should use the ENC_CODERANGE(obj) macro to extract the code range value and then compare it against one of the defined code range constants, treating the code range constants essentially the same as an enum (e.g., if (cr == ENC_CODERANGE_7BIT) { ... }).

If you try to use the code range values as bitmasks directly, you’ll have very confusing and difficult to debug results. Due to the way the masks are defined, if a string is annotated as being both ENC_CODERANGE_7BIT and ENC_CODERANGE_VALID it will appear to be ENC_CODERANGE_BROKEN. Conversely, if you try to branch on a combined mask like if (cr & (ENC_CODERANGE_7BIT | ENC_CODERANGE_VALID)) { … }, that will include ENC_CODERANGE_BROKEN strings. This is because the four valid values are only represented by two bits in the object header. The compact representation makes efficient use of the limited space in the object header but can be misleading to anyone used to working with bitmasks to match and set attributes.

To help illustrate the point a bit better, I’ve ported some of the relevant C code to Ruby (see Listing 1):

Listing 1: MRI’s original C code range representation recreated in Ruby.

JRuby has a very similar implementation to MRI, storing the code range value as an int compactly within the object header, occupying only two bits. In TruffleRuby, the code range is represented as an enum and stored as an int in the object’s shape. The enum representation takes up additional space but prevents the class of bugs from misapplication of bitmasks.

String Operations and Code Range Changes

The object’s code range is a function of both its sequence of bytes and the encoding associated with the object to interpret those bytes. Consequently, when either the bytes change or the encoding changes, the code range value has the potential to be invalidated. When such an operation occurs, the safest thing to do is to perform a complete code range scan of the resulting string. To the best of our ability, however, we want to avoid recalculating the code range when it is not necessary to do so.

MRI avoids unnecessary code range scans via two primary mechanisms. The first is to simply scan for the code range lazily by changing the string’s code range value to ENC_CODERANGE_UNKNOWN. When an operation is performed that needs to know the real code range, MRI calculates it on demand and updates the cached code range with the new result. If the code range is never needed, it’s never calculated. (MRI will calculate the code range eagerly when doing so is cheap. In particular, when lexing a source file, MRI already needs to examine every byte in a string and be aware of the string’s encoding, so taking the extra step to discover and record the code range value is rather cheap.)

The second way MRI avoids code range scans is to reason about the code range values of any strings being operated on and how an operation might result in a new code range. For example, when working with strings with an ENC_CODERANGE_7BIT code range value, most operations can preserve the code range value since all ASCII characters stay within the 0x00 – 0x7f range. Whether you take a substring, change the casing of characters, or strip off whitespace, the resulting string is guaranteed to also have the ENC_CODERANGE_7BIT value, so performing a full code range scan would be wasteful. The code in Listing 1 demonstrates some operations on a string with an ENC_CODERANGE_7BIT code range and how the resulting string always has the same code range.

Listing 2: Changing the case of a string with an ENC_CODERANGE_7BIT code range will always result in a string that also has an ENC_CODERANGE_7BIT code range.

Sometimes the code range value on its own is insufficient for a particular optimization, in which case MRI will consider additional context. For example, MRI tracks whether a string is “single-byte optimizable.” A string is single-byte optimizable if its code range is ENC_CODERANGE_7BIT or if the associated encoding uses characters that are only one-byte wide, such as is the case with the ASCII-8BIT/BINARY encoding used for I/O. If a string is single-byte optimizable, we know that String#reverse must retain the same code range because each byte corresponds to a single character, so reversing the bytes can’t change their meaning.

Unfortunately, the code range is not always easily derivable, particularly when the string’s code range is ENC_CODERANGE_VALID or ENC_CODERANGE_BROKEN, in which case a full code range scan may prove to be necessary. Operations performed on a string with an ENC_CODERANGE_VALID code range might result in an ENC_CODERANGE_7BIT string if the source string’s encoding is ASCII-compatible; otherwise, it would result in a string with an ENC_CODERANGE_VALID encoding. (We’ve deliberately set aside the case of String#setbyte which could cause a string to have an ENC_CODERANGE_BROKEN code range value. Generally, string operations in Ruby are well-defined and won’t result in a broken string.) In Listing 2, you can see some examples of operations performed against a string with an ENC_CODERANGE_VALID code range resulting in strings with either an ENC_CODERANGE_7BIT code range or an ENC_CODERANGE_VALID coderange.

Listing 3: Changing the case of a string with an ENC_CODERANGE_VALID code range might result in a string with a different code range.

Since the source string may have an ENC_CODERANGE_UNKNOWN value and the operation may not need the resolved code range, such as String#reverse called on a string with the ASCII-8BIT/BINARY encoding, it’s possible to generate a resulting string that also has an ENC_CODERANGE_UNKNOWN code range. That is to say, it’s quite possible to have a string that is ASCII-only but which has an unknown code range that, when operated on, still results in a string that may need to have a full code range scan performed later on. Unfortunately, this is just the trade-off between lazily computing code ranges and deriving the code range without resorting to a full byte scan of the string. To the end user, there is no difference because the code range value will be computed and be accurate by the time it is needed. However, if you’re working on a native extension, a Ruby runtime’s internals, or are just profiling your Ruby application, you should be aware of how a code range can be set or deferred.

TruffleRuby and Code Range Derivations

As a slight digression, I’d like to take a minute to talk about code ranges and their derivations in TruffleRuby. Unlike other Ruby implementations, such as MRI and JRuby, TruffleRuby eagerly computes code range values so that strings never have an ENC_CODERANGE_UNKNOWN code range value. The trade-off that TruffleRuby makes is that it may calculate code range values that are never needed, but string operations are simplified by never needing to calculate a code range on-demand. Moreover, TruffleRuby can derive the code range of an operation’s result string without needing to perform a full byte scan in more situations than MRI or JRuby can.

While eagerly calculating the code range may seem wasteful, it amortizes very well over the lifetime of a program due to TruffleRuby’s extensive reuse of string data. TruffleRuby uses ropes as its string representation, a tree-based data structure where the leaves look like a traditional C-style string, while interior nodes represent string operations linking other ropes together. (If you go looking for references to “rope” in TruffleRuby, you might be surprised to see they’re mostly gone. TruffleRuby still very much uses ropes, but the TruffleRuby implementation of ropes was promoted to a top-level library in the Truffle family of language implementations, which TruffleRuby has adopted. If you use any other language that ships with the GraalVM distribution, you’re also using what used to be TruffleRuby’s ropes.) A Ruby string points to a rope, and a rope holds the critical string data.

For instance, on a string concatenation operation, rather than allocate a new buffer and copy data into it, with ropes we create a “concat rope” with each of the strings being concatenated as its children (see Fig.1). The string is then updated to point at the new concat rope. While that concat rope does not contain any byte data (delegating that to its children), it does store a code range value, which is easy to derive because each child rope is guaranteed to have both a code range value and an associated encoding object.

Figure 1: A sample rope for the result of “Hello “ + “François”

Moreover, rope metadata are immutable, so getting a rope’s code range value will never incur more overhead than a field read. TruffleRuby takes advantage of that property to use ropes as guards in inline caches for its JIT compiler. Additionally, TruffleRuby can specialize string operations based on the code ranges for any argument strings. Since most Ruby programs never deal with ENC_CODERANGE_BROKEN strings, TruffleRuby’s JIT will eliminate any code paths that deal with that code range. If a broken string does appear at runtime, the JIT will deoptimize and handle the operation on a slow path, preserving Ruby’s full semantics. Likewise, while Ruby supports 100+ encodings out of the box, the TruffleRuby JIT will optimize a Ruby application for the small number of encodings it uses.

A String By Any Other Name

Often string performance discussions are centered around web template rendering or text processing. While important use cases, strings are also used extensively within the Ruby runtime. Every symbol or regular expression has an associated string, and they’re consulted for various operations. The real fun comes with Ruby’s metaprogramming facilities: strings can be used to access instance variables, look up methods, send messages to objects, evaluate code snippets, and more. Improvements (or degradations) in string performance can have large, cascading effects.

Backing up a step, I don’t want to oversell the importance of code ranges for fast metaprogramming. They are an ingredient in a somewhat involved recipe. The code range can be used to quickly disqualify strings known not to match, such as those with the ENC_CODERANGE_BROKEN code range value. In the past, the code range was used to fail fast when particular identifiers were only allowed to be ASCII-only. While not currently implemented in MRI, such a check could be used to dismiss strings with the ENC_CODERANGE_VALID code range when all identifiers are known to be ENC_CODERANGE_7BIT, and vice versa. However, once a string passes the code range check, there’s still the matter of seeing if it matches an identifier (instance variable, method, constant, etc.). With TruffleRuby, that check can be satisfied quickly because its immutable ropes are interned and can be compared by reference. In MRI and JRuby, the equality check may involve a linear pass over the string data as the string is interned. Even that process gets murky depending on whether you’re working with a dynamically generated string or a frozen string literal. If you’re interested in a deep dive on the difficulties and solutions to making metaprogramming fast in Ruby, Chris Seaton has published a paper about the topic and I’ve presented a talk about it at RubyKaigi.

Conclusion

More so than many other contemporary languages, Ruby exposes functionality that is difficult to optimize but which grants developers a great deal of expressivity. Code ranges are a way for the VM to avoid repeated work and optimize operations on a per-string basis, guiding away from slow paths when that functionality isn’t needed. Historically, that benefit has been most keenly observed when running in the interpreter. When integrated with a JIT with deoptimization capabilities, such as TruffleRuby, code ranges can help eliminate generated code for the types of strings used by your application and the VM internally.

Knowing what code ranges are and what they’re used for can help you debug issues, both for performance and correctness. At the end of the day, a code range is a cache, and like all caches, it may contain the wrong value. While such instances within the Ruby VM are rare, they’re not unheard of. More commonly, a native extension manipulating strings may fail to update a string’s code range properly. Hopefully, with a firm understanding of code ranges, you’ll find Ruby’s handling of strings less daunting.

Kevin is a Staff Developer on the Ruby & Rails Infrastructure team where he works on TruffleRuby. When he’s not working on Ruby internals, he enjoys collecting browser tabs, playing drums, and hanging out with his family.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.



Source link

Leave a reply

Please enter your comment!
Please enter your name here