Ruby Oneliner: Benchmarking a string concordance

Just the other week, in one of my university Comp. Sci. classes, I was asked to use a supplied linked list to create a concordance from standard input (in C, I might add). The problem wasn’t particularly hard. In fact, it was simple enough that some friends and I realized it was a great candidate for a Ruby one-liner. Sure enough, this was the result after no more than a minute of jabbering:

hash = Hash.new(0); str.split.each { |m| hash[m] += 1}

Well, that’s all fine and dandy… a plain old Ruby one-liner. My friend Stef, however, suggested this close alternative:

hash = Hash.new(0); str.scan(/\w+/m) { |m| hash[m] += 1}

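What’s different? Stef’s code runs a regex scan over the string (with the “m”ultiline flag set) and increments the hash entry for each match. His regex treats any run of one or more “\w”ord characters as a match. In Ruby the /m flag only changes whether “.” matches newlines, so it makes no difference to this particular pattern. My code, by contrast, uses Ruby’s built-in “split” method to split on whitespace, then iterates over the resulting array.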
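To see the two one-liners side by side, here is a small sketch that runs both on the same input. The sample string and the variable names are made up for illustration; on punctuation-free text like this the two approaches should agree.

```ruby
str = "the cat and the hat and the bat"

# Approach 1: split on whitespace, then count each word.
split_counts = Hash.new(0)
str.split.each { |word| split_counts[word] += 1 }

# Approach 2: scan for runs of word characters, counting inside the block.
scan_counts = Hash.new(0)
str.scan(/\w+/) { |word| scan_counts[word] += 1 }

puts split_counts["the"]          # 3
puts split_counts["and"]          # 2
puts split_counts == scan_counts  # true for this input
```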

This is how split works:

str = "My name is Ryan."
str.split #=> ["My","name","is","Ryan."]

For simple strings like “My name is Ryan.”, Stef’s regex scan works almost identically. For this example we will ignore the fact that “\w” won’t match characters like ‘-’; it’s not really important at the moment.
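The “almost” is worth spelling out. A quick sketch of where the two differ, using the sentence above plus a made-up hyphenated phrase:

```ruby
str = "My name is Ryan."

split_words = str.split        # splits on whitespace, punctuation stays attached
scan_words  = str.scan(/\w+/)  # keeps only runs of word characters

p split_words  # ["My", "name", "is", "Ryan."]
p scan_words   # ["My", "name", "is", "Ryan"]

hyphenated = "well-known fact"
split_hyphen = hyphenated.split        # hyphenated word stays whole
scan_hyphen  = hyphenated.scan(/\w+/)  # hyphen is not a word character

p split_hyphen  # ["well-known", "fact"]
p scan_hyphen   # ["well", "known", "fact"]
```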

Like any good computer scientists, our divergence of methods led to a great argument: which one was better? From my point of view, “split.each” is much more readable, clearly (without a regex) splits on whitespace, and is nearly as terse as the regex equivalent. From Stef’s point of view, he A) didn’t have to use “each” and B) had more control over the split. We agreed to disagree; clearly each works best in different situations. Split is best for a simple split, but scan is far more versatile.
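As a rough illustration of the “more control” argument, the pattern can be extended to keep hyphenated words together. The sample text here is invented. Note that once a pattern contains capturing groups, scan yields an array of captures per match rather than a plain string, so a non-capturing group is used for the word list.

```ruby
text = "a well-known one-liner"

# Plain \w+ breaks hyphenated words apart.
plain = text.scan(/\w+/)
p plain     # ["a", "well", "known", "one", "liner"]

# A non-capturing group keeps hyphenated words whole and still yields strings.
hyphen = text.scan(/\w+(?:-\w+)?/)
p hyphen    # ["a", "well-known", "one-liner"]

# With capturing groups, each match becomes an array of its captures.
captured = "well-known a".scan(/(\w+(-\w+)?)/)
p captured  # [["well-known", "-known"], ["a", nil]]
```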

Having put semantics aside, we began wrestling over which one would be faster. We threw together this “Benchmark” script:

require 'benchmark'

str = "1 2 3 4-5 6 7 8-9"
Benchmark.bm do |bm|
  bm.report("split: ") { 10000.times do hash = Hash.new(0); str.split.each { |m| hash[m] += 1 }; end }
  bm.report("scan: (\\w+) ") { 10000.times do hash = Hash.new(0); str.scan(/\w+/m) { |m| hash[m] += 1 }; end }
  # "\w" inside a double-quoted label becomes "w", which is why this row prints as (w+(-w+)?)
  bm.report("scan: (\w+(-\w+)?) ") { 10000.times do hash = Hash.new(0); str.scan(/(\w+(-\w+)?)/m) { |m| hash[m] += 1 }; end }
end
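One variation worth trying is to allocate the hash once and call “clear” on each iteration instead of building a new hash every time. The following is only a guess at how such a variant could be written, not the exact script we ran; the labels, the label width passed to “bm”, and the “n” variable are my own choices.

```ruby
require 'benchmark'

str = "1 2 3 4-5 6 7 8-9"
n = 10_000  # iterations per report

results = Benchmark.bm(12) do |bm|
  # Allocate a fresh hash on every iteration.
  bm.report("new hash:") do
    n.times do
      hash = Hash.new(0)
      str.split.each { |m| hash[m] += 1 }
    end
  end

  # Reuse one hash and empty it each time; the default value of 0 is kept.
  bm.report("hash.clear:") do
    hash = Hash.new(0)
    n.times do
      hash.clear
      str.split.each { |m| hash[m] += 1 }
    end
  end
end
```

The same swap can be applied to the two scan reports.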

Here’s the result of those benchmarks on various Ruby versions:

## Native environment tests - 1.8.7
# Creating one hash and clearing it (hash.clear instead of hash = Hash.new(0)):
#                     user     system      total        real
# split:            0.490000   0.150000   0.640000 (  0.656165)
# scan: (\w+)       0.800000   0.180000   0.980000 (  1.003529)
# scan: (w+(-w+)?)  1.390000   0.340000   1.730000 (  1.745792)

#Creating a new hash every time:
#                     user     system      total        real
# split:            0.470000   0.140000   0.610000 (  0.643760)
# scan: (\w+)       0.800000   0.180000   0.980000 (  0.989383)
# scan: (w+(-w+)?)  1.170000   0.260000   1.430000 (  1.457280)

## Variety tests by Stef Penner
#mbp:rubinius stefan$ ruby -v
# -> ruby 1.8.7 (2008-06-20 patchlevel 22) [i686-darwin9.3.0]
#mbp:rubinius stefan$ macruby -v
# -> MacRuby version 0.3 (ruby 1.9.0 2008-06-03) [universal-darwin9.0]
#mbp:rubinius stefan$ jruby -v
# -> ruby 1.8.6 (2008-06-22 rev 6555) [i386-jruby1.1.1]
#mbp:rubinius stefan$ rbx -v
# -> rubinius 0.9.0 (ruby 1.8.6 compatible) (8038487c4) (10/19/2008) [i686-apple-darwin9.5.0]

# $ rubinous regx.rb
#                   user     system      total        real
# split:           1.422384   0.000000   1.422384 (  1.422366)
# scan: (\w+)      1.458300   0.000000   1.458300 (  1.458299)
# scan: (w+(-w+)?) 2.127930   0.000000   2.127930 (  2.127929)

# $ ruby regx.rb
#                   user     system      total        real
# split:           0.410000   0.140000   0.550000 (  0.559599)
# scan: (\w+)      0.670000   0.180000   0.850000 (  0.862585)
# scan: (w+(-w+)?) 0.990000   0.270000   1.260000 (  1.268065)

# $ ruby1.9 regx.rb
#                   user     system      total        real
# split:           0.090000   0.000000   0.090000 (  0.096752)
# scan: (\w+)      0.170000   0.000000   0.170000 (  0.164321)
# scan: (w+(-w+)?) 0.280000   0.000000   0.280000 (  0.291374)

# $ macruby regx.rb
#                   user     system      total        real
# split:           0.440000   0.030000   0.470000 (  0.490660)
# scan: (\w+)      4.310000   0.050000   4.360000 (  4.449849)
# scan: (w+(-w+)?) 4.380000   0.040000   4.420000 (  4.503897)

# $ jruby regx.rb
#                   user     system      total        real
# split:           0.456000   0.000000   0.456000 (  0.456000)
# scan: (\w+)      0.261000   0.000000   0.261000 (  0.260000)
# scan: (w+(-w+)?) 0.369000   0.000000   0.369000 (  0.369000)

# $jruby 1.1.3 regx.rb
#                   user     system      total        real
# split:           0.235000   0.000000   0.235000 ( 0.234993)
# scan: (\w+)      0.228000   0.000000   0.228000 ( 0.228318)
# scan: (w+(-w+)?) 0.329000   0.000000   0.329000 ( 0.328884)

It’s rather interesting to see how each version of Ruby compares. Yes, Rubinius is slower, but WOW, on the split test Ruby 1.9.1 takes only about 16% of the time 1.8.7 takes!
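For anyone who wants to check that figure, here is the arithmetic as a short sketch. The numbers are the “total” columns copied from the “ruby regx.rb” (1.8.7) and “ruby1.9 regx.rb” runs above; the hash keys are just labels I picked.

```ruby
# Total seconds per test, copied from the result tables above.
totals_187 = { :split => 0.55, :scan_word => 0.85, :scan_hyphen => 1.26 }
totals_19  = { :split => 0.09, :scan_word => 0.17, :scan_hyphen => 0.28 }

# Time on 1.9 as a percentage of the time on 1.8.7.
ratios = {}
totals_187.each do |name, old_total|
  ratios[name] = (totals_19[name] / old_total * 100).round(1)
end

ratios.each { |name, pct| puts "#{name}: #{pct}%" }
# split: 16.4%
# scan_word: 20.0%
# scan_hyphen: 22.2%
```

So the 16% applies to the split test; the two scan tests come in at roughly 20% and 22%.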