Just the other week, in one of my university Comp. Sci. classes, I was asked to use a supplied linked list to create a concordance from standard input (in C, I might add). The problem wasn’t particularly hard; in fact, it was simple enough that some friends and I realized it was a great candidate for a Ruby one-liner. Sure enough, this was the result after no more than a minute of jabbering:
hash = Hash.new(0); str.split.each { |m| hash[m] += 1}
Well, that’s all fine and dandy… a plain old Ruby one-liner. My friend Stef, however, suggested this close alternative:
hash = Hash.new(0); str.scan(/\w+/m) { |m| hash[m] += 1}
What’s different? Stef’s code scans the string with a regex and adds 1 to the hash entry for each match. His regex treats any run of one or more “\w”ord characters as a match. (In Ruby the “m” flag only makes “.” match newlines, so it has no effect on this particular pattern.) My code, on the other hand, uses Ruby’s built-in “split” method to split on whitespace, then iterates over the resulting array.
This is how split works:
str = "My name is Ryan."
str.split #=> ["My", "name", "is", "Ryan."]
For simple strings like “My name is Ryan.”, Stef’s regex scan works almost identically, except that it drops the trailing period. For this example we will ignore the fact that “\w” won’t match things like ‘-’; it’s not really that important at the moment.
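To make the difference concrete, here is a small sketch that runs both approaches on a slightly messier sentence. The sample string is my own assumption, chosen only to include punctuation, an apostrophe and a hyphen.

```ruby
# Sample text is an assumption, picked to include punctuation and a hyphen.
str = "My name is Ryan. It's a well-known name."

# split breaks on whitespace only, so punctuation stays attached to each token.
split_words = str.split
p split_words
#=> ["My", "name", "is", "Ryan.", "It's", "a", "well-known", "name."]

# scan(/\w+/) keeps only runs of word characters, so "." disappears and
# "'" and "-" both act as separators.
scan_words = str.scan(/\w+/)
p scan_words
#=> ["My", "name", "is", "Ryan", "It", "s", "a", "well", "known", "name"]
```

Neither result is “right” for a concordance as-is: split counts “name” and “name.” separately, while scan splits “well-known” in two.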
As with any good computer scientists, our divergence of methods led to a great argument: which one was better? From my point of view, “split.each” is much more readable, clearly splits on whitespace (without a regex), and is nearly as terse as the regex equivalent. From Stef’s point of view, he A) didn’t have to use “each” and B) had more control over the split. We agreed to disagree; clearly each works best in different situations. Split is best for a simple split, but scan is far more versatile.
Having put semantics aside, we began wrestling over which one would be faster. We threw together this “Benchmark” script:
require 'benchmark'
str = "1 2 3 4-5 6 7 8-9"
Benchmark.bm do |bm|
bm.report("split: ") {10000.times do hash = Hash.new(0); str.split.each { |m| hash[m] += 1}; end }
bm.report("scan: (\\w+) ") { 10000.times do hash = Hash.new(0); str.scan(/\w+/m) { |m| hash[m] += 1}; end }
bm.report("scan: (\w+(-\w+)?) ") { 10000.times do hash = Hash.new(0); str.scan(/(\w+(-\w+)?)/m) { |m| hash[m] += 1} end }
end
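One detail of the third benchmark case is worth spelling out: when the pattern contains capture groups, scan hands the block an array of the groups rather than the matched string, so the hash ends up keyed by arrays. The sketch below shows that next to a non-capturing variant, which is my own suggestion and not what was benchmarked.

```ruby
str = "1 2 3 4-5 6 7 8-9"

# With capture groups, each m is an array with one entry per group,
# e.g. ["4-5", "-5"] or ["1", nil], so the hash keys are arrays.
grouped = Hash.new(0)
str.scan(/(\w+(-\w+)?)/) { |m| grouped[m] += 1 }
p grouped.keys.first #=> ["1", nil]

# With a non-capturing group (?:...), each m is the matched string itself.
flat = Hash.new(0)
str.scan(/\w+(?:-\w+)?/) { |m| flat[m] += 1 }
p flat.keys #=> ["1", "2", "3", "4-5", "6", "7", "8-9"]
```

For a real concordance the second form is probably what you want, since the keys are the words themselves.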
Here’s the result of those benchmarks on various Ruby versions:
## Native environment tests - 1.8.7
# Creating one hash and clearing it (hash.clear instead of hash = Hash.new(0)):
#                         user     system      total        real
# split:              0.490000   0.150000   0.640000 (  0.656165)
# scan: (\w+)         0.800000   0.180000   0.980000 (  1.003529)
# scan: (w+(-w+)?)    1.390000   0.340000   1.730000 (  1.745792)
# Creating a new hash every time:
#                         user     system      total        real
# split:              0.470000   0.140000   0.610000 (  0.643760)
# scan: (\w+)         0.800000   0.180000   0.980000 (  0.989383)
# scan: (w+(-w+)?)    1.170000   0.260000   1.430000 (  1.457280)

## Variety tests by Stef Penner
# mbp:rubinius stefan$ ruby -v
#   -> ruby 1.8.7 (2008-06-20 patchlevel 22) [i686-darwin9.3.0]
# mbp:rubinius stefan$ macruby -v
#   -> MacRuby version 0.3 (ruby 1.9.0 2008-06-03) [universal-darwin9.0]
# mbp:rubinius stefan$ jruby -v
#   -> ruby 1.8.6 (2008-06-22 rev 6555) [i386-jruby1.1.1]
# mbp:rubinius stefan$ rbx -v
#   -> rubinius 0.9.0 (ruby 1.8.6 compatible) (8038487c4) (10/19/2008) [i686-apple-darwin9.5.0]

## Variety tests by Stef Penner
# $ rubinous regx.rb
#                         user     system      total        real
# split:              1.422384   0.000000   1.422384 (  1.422366)
# scan: (\w+)         1.458300   0.000000   1.458300 (  1.458299)
# scan: (w+(-w+)?)    2.127930   0.000000   2.127930 (  2.127929)
# $ ruby regx.rb
#                         user     system      total        real
# split:              0.410000   0.140000   0.550000 (  0.559599)
# scan: (\w+)         0.670000   0.180000   0.850000 (  0.862585)
# scan: (w+(-w+)?)    0.990000   0.270000   1.260000 (  1.268065)
# $ ruby1.9 regx.rb
#                         user     system      total        real
# split:              0.090000   0.000000   0.090000 (  0.096752)
# scan: (\w+)         0.170000   0.000000   0.170000 (  0.164321)
# scan: (w+(-w+)?)    0.280000   0.000000   0.280000 (  0.291374)
# $ macruby regx.rb
#                         user     system      total        real
# split:              0.440000   0.030000   0.470000 (  0.490660)
# scan: (\w+)         4.310000   0.050000   4.360000 (  4.449849)
# scan: (w+(-w+)?)    4.380000   0.040000   4.420000 (  4.503897)
# $ jruby regx.rb
#                         user     system      total        real
# split:              0.456000   0.000000   0.456000 (  0.456000)
# scan: (\w+)         0.261000   0.000000   0.261000 (  0.260000)
# scan: (w+(-w+)?)    0.369000   0.000000   0.369000 (  0.369000)
# $jruby 1.1.3 regx.rb
#                         user     system      total        real
# split:              0.235000   0.000000   0.235000 (  0.234993)
# scan: (\w+)         0.228000   0.000000   0.228000 (  0.228318)
# scan: (w+(-w+)?)    0.329000   0.000000   0.329000 (  0.328884)
It’s rather interesting to see how each version of Ruby compares. Yes, Rubinius is slower, but WOW, on the split test Ruby 1.9.1 takes only about 16% of the time 1.8.7 takes!
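The numbers above come from a single pass of Benchmark.bm. Ruby’s Benchmark module also has bmbm, which runs a rehearsal pass before the measured one; that may matter on implementations that need warm-up, such as JRuby. Below is a rough sketch of the same comparison using it. The helper names count_with_split and count_with_scan are hypothetical, introduced here only to keep the report lines short.

```ruby
require 'benchmark'

# Hypothetical helpers (not part of the original script) wrapping each one-liner.
def count_with_split(str)
  counts = Hash.new(0)
  str.split.each { |word| counts[word] += 1 }
  counts
end

def count_with_scan(str)
  counts = Hash.new(0)
  str.scan(/\w+/) { |word| counts[word] += 1 }
  counts
end

str = "1 2 3 4-5 6 7 8-9"
iterations = 10_000

# bmbm runs every report twice: a rehearsal pass first, then the measured pass.
Benchmark.bmbm do |bm|
  bm.report("split:")       { iterations.times { count_with_split(str) } }
  bm.report("scan: (\\w+)") { iterations.times { count_with_scan(str) } }
end
```

I haven’t re-run the interpreters above with this version, so treat it as a starting point rather than a replacement for the published timings.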
khelll | 06-Nov-08 at 1:16 am | Permalink
what about jruby 1.1.5 ?
neufelry | 06-Nov-08 at 6:28 am | Permalink
My apologies about that. My friend Stef, who ran those tests, noted MacPorts wasn’t up to date when he ran the tests, and he wasn’t about to compile from source or update the Port file.
lopex | 06-Nov-08 at 8:36 am | Permalink
That’s good to hear jruby is comparable with 1.9 even with client vm! With server vm and rehearsal run jruby is almost 2x faster than 1.9 (I think ‘benchmark’ is doing something bad against jruby performance in your benchmark).
for http://pastie.org/308721: (last run)
1.9: (ruby 1.9.0 (2008-11-06 revision 20114) [i686-linux])
0.178986392
0.214989566
0.680372902
jruby trunk: (jruby 1.1.5 (ruby 1.8.6 patchlevel 114) (2008-11-06 rev 8003) [i386-java])
0.096919
0.137741
0.375569