Jul 24 2013
TL;DR: Subclassing core classes in Ruby can lead to unexpected side effects. I suggest composition over inheritance in all these cases.
Subclassing Review
If you’re familiar with the concept of subclassing, skip down to “The Problem.”
In Ruby, you can make your own classes:
class List
end
You can also make subclasses of those classes:
class OrderedList < List
end
puts OrderedList.new.kind_of?(List) # => true
Now, subclassing represents an “is a” relationship. This means that our OrderedList
should be a List
in every respect, but with some added behavior. The Liskov Substitution Principle is one formulation of this idea.
The Problem
Ruby has two major bits of code that it provides for your use: the core library and the standard library. The core library can be found here, and contains cllasses that you know and love, like String
, Hash
, and Array
. The standard library can be found here, and contains your favorite hits, like CSV
, JSON
, and Logger
.
One way to think about the difference between core and the standard library is that core is written in C, while the standard library is written in Ruby. Core are the classes that are used the most, so they’re implemented in as low-level a fashion as possible. They’ll be in every single Ruby program, so might as well make them fast! The standard library only gets pulled in by bits and pieces; another way of thinking about the difference is that you need to require
everything in the standard library, but nothing in core.
What do you think this code should do?
class List < Array
end
puts List.new.to_a.class
If you said “it prints Array
,” you’d be right. This behavior really confuses me, though, because List
is already an Array
; in my mind, this operation shouldn’t suddenly change the class.
Why does this happen? Let’s check out the implementation of [Array#to_a](https://github.com/ruby/ruby/blob/trunk/array.c#L2064-L2082)
:
static VALUE
rb_ary_to_a(VALUE ary)
{
if (rb_obj_class(ary) != rb_cArray) {
VALUE dup = rb_ary_new2(RARRAY_LEN(ary));
rb_ary_replace(dup, ary);
return dup;
}
return ary;
}
If the class is not an Array
, (represented by rb_cArray
), then we make a new array of the same length, call replace
on it, and then return the new array. If this C scares you, here’s a direct port to pure Ruby:
def array_to_a(ary)
if ary.class != Array
dup = []
dup.replace(ary)
return dup
end
return ary
end
array_to_a(List.new).class # => Array
So why do this? Well, again, this class will be used all over the place. For example, I made a brand new Rails 4 application, generated a controller and view, and put this in it:
ObjectSpace.count_objects[:T_ARRAY]: <%= ObjectSpace.count_objects[:T_ARRAY] %>
ObjectSpace allows you to inspect all of the objects that exist in the system. Here’s the output:
rails arrays
That’s a lot of arrays! This kind of shortcut is generally worth it: 99.99% of the time, this code is perfect.
That last 0.01% is the problem. If you don’t know exactly how these classes operate at the C level, you’re gonna have a bad time. In this case, this behavior is odd enough that someone was kind enough to document it.
Here’s the Ruby version of what I’d expect to happen:
def array_to_a2(ary)
return ary if ary.is_a?(Array)
dup = []
dup.replace(ary)
dup
end
array_to_a2(List.new).class # => List
This has the exact same behavior except when we’re already dealing with an Array, which is what I’d expect.
Let’s take another example: reverse.
l = List.new
l << 1
l << 2
puts l.reverse.class # => Array
I would not expect that calling #reverse
on my custom Array
would change its class. Let’s look at the C again:
static VALUE
rb_ary_reverse_m(VALUE ary)
{
long len = RARRAY_LEN(ary);
VALUE dup = rb_ary_new2(len);
if (len > 0) {
const VALUE *p1 = RARRAY_RAWPTR(ary);
VALUE *p2 = (VALUE *)RARRAY_RAWPTR(dup) + len - 1;
do *p2-- = *p1++; while (--len > 0);
}
ARY_SET_LEN(dup, RARRAY_LEN(ary));
return dup;
}
We get the length of the array, make a new blank array of the same length, then do some pointer stuff to copy everything over, and return the new copy. Unlike #to_a
, this behavior is not currently documented.
Now: you could make the case that this behavior is expected, in both cases: after all, the point of the non-bang methods is to make a copy. However, there’s a difference to me between “make a new array with this stuff in it” and “make a new copy with this stuff in it”. Most of the time, I get the same class back, so I expect the same class back in these circumstances.
Let’s talk about a more pernicious issue: Strings.
As you know, the difference between interpolation and concatenation is that interpolation calls #to_s
implicitly on the object it’s interpolating:
irb(main):001:0> "foo" + 2
TypeError: no implicit conversion of Fixnum into String
from (irb):1:in `+'
from (irb):1
from /opt/rubies/ruby-2.0.0-p195/bin/irb:12:in `<main>'
irb(main):002:0> "foo#{2}"
=> "foo2"
irb(main):001:0> class MyClass
irb(main):002:1> def to_s
irb(main):003:2> "yup"
irb(main):004:2> end
irb(main):005:1> end
=> nil
irb(main):006:0> "foo#{MyClass.new}"
=> "fooyup"
So what about a custom String
?
class MyString < String
def to_s
"lol"
end
end
s = MyString.new
s.concat "Hey"
puts s
puts s.to_s
puts "#{s}"
What does this print?
$ ruby ~/tmp/tmp.rb HeylolHey
That’s right! With String
s, Ruby doesn’t call #to_s
: it puts the value in directly. How does this happen?
Well, dealing with string interpolation deals with the parser, so let’s check out the bytecode that Ruby generates. Thanks to Aaron Patterson for suggesting this approach. <3
irb(main):013:0> x = RubyVM::InstructionSequence.new(%q{puts "hello #{'hey'}"})
=> <RubyVM::InstructionSequence:<compiled>@<compiled>>
irb(main):014:0> puts x.disasm
== disasm: <RubyVM::InstructionSequence:<compiled>@<compiled>>==========
0000 trace 1 ( 1)
0002 putself
0003 putstring "hello hey"
0005 opt_send_simple <callinfo!mid:puts, argc:1, FCALL|ARGS_SKIP>
0007 leave
=> nil
irb(main):015:0> x = RubyVM::InstructionSequence.new(%q{puts "hello #{Object.new}"})
=> <RubyVM::InstructionSequence:<compiled>@<compiled>>
irb(main):016:0> puts x.disasm
== disasm: <RubyVM::InstructionSequence:<compiled>@<compiled>>==========
0000 trace 1 ( 1)
0002 putself
0003 putobject "hello "
0005 getinlinecache 12, <ic:0>
0008 getconstant :Object
0010 setinlinecache <ic:0>
0012 opt_send_simple <callinfo!mid:new, argc:0, ARGS_SKIP>
0014 tostring
0015 concatstrings 2
0017 opt_send_simple <callinfo!mid:puts, argc:1, FCALL|ARGS_SKIP>
0019 leave
=> nil
You can see with a string, the bytecode actually puts the final concatenated string. But with an object. it ends up calling tostring
, and then concatstrings
.
Again, 99% of the time, this is totally fine, and much faster. But if you don’t know this trivia, you’re going to get bit.
Here is an example from an older version of Rails. Yes, you might think “Hey idiot, there’s no way it will store your custom String
class,” but the whole idea of subclassing is that it’s a drop-in replacement.
I know that there’s some case where Ruby will not call your own implementation of #initialize
on a custom subclass of String
, but I can’t find it right now. This is why this problem is so tricky: most of the time, things are fine, but then occasionally, something strange happens and you wonder what’s wrong. I don’t know about you, but my brain needs to focus on more important things than the details of the implementation.
Since I first wrote this post, James Edward Gray II helped me remember what this example is. One of the early exercises in http://exercism.io/ is based on making a DNA type, and then doing some substitution operations on it. Many people inherited from String
when doing their answers, and while the simple case that passes the tests works, this case won’t:
class Dna < String
def initialize(*)
super
puts "Building Dna: #{inspect}"
end
end
result = Dna.new("CATG").tr(Dna.new("T"), Dna.new("U"))
p result.class
p result
This prints:
Building Dna: "CATG"
Building Dna: "T"
Building Dna: "U"
Dna
"CAUG"
It never called our initializer for the new string. Let’s check the source of #tr
:
static VALUE
rb_str_tr(VALUE str, VALUE src, VALUE repl)
{
str = rb_str_dup(str);
tr_trans(str, src, repl, 0);
return str;
}
rb_str_dup
has a pretty simple definition:
VALUE
rb_str_dup(VALUE str)
{
return str_duplicate(rb_obj_class(str), str);
}
and so does str_duplicate
:
static VALUE
str_duplicate(VALUE klass, VALUE str)
{
VALUE dup = str_alloc(klass);
str_replace(dup, str);
return dup;
}
So there you have it: MRI doesn’t go through the whole initialization process when duplicating a string: it just allocates the memory and then replaces the contents.
If you re-open String
, it’s also weird:
class String
alias_method :string_initialize, :initialize
def initialize(*args, &block)
string_initialize(*args, &block)
puts "Building MyString: #{inspect}"
end
end
result = String.new("CATG").tr("T", "U") # => Building MyString: "CATG"
p result.class # => String
p result # => "CAUG"
Again, unless you know exactly how this works at a low level, surprising things happen.
The Solution
Generally speaking, subclassing isn’t the right idea here. You want a data structure that uses one of these core classes internally, but isn’t exactly like one. Rather than this:
class Name < String
end
do this:
require 'delegate'
class Name < SimpleDelegator
def initialize
super("")
end
end
This allows you to do the same thing, but without all of the pain:
class Name
def to_s
"hey"
end
end
"#{Name.new}" # => "hey"
However, this won’t solve all problems:
require 'delegate'
class List < SimpleDelegator
def initialize
super([])
end
end
l = List.new
l << 1
l << 2
puts l.reverse.class # => Array
In general, I’d prefer to delegate things manually, anyway: a Name
is not actually a drop-in for a String
it’s something different that happens to be a lot like one:
class List
def initialize(list = [])
@list = list
end
def <<(item)
@list << item
end
def reverse
List.new(@list.reverse)
end
end
l = List.new
l << 1
l << 2
puts l.reverse.class # => List
You can clean this up by using Forwardable
to only forward the messages you want to forward:
require 'forwardable'
class List
extend Forwardable
def_delegators :@list, :<<, :length # and anything else
def initialize(list = [])
@list = list
end
def reverse
List.new(@list.reverse)
end
end
l = List.new
l << 1
l << 2
puts l.reverse.class # => List
Now you know! Be careful out there!