Tuesday, November 24, 2009

Squeezing a String

Ruby has a nice string function called squeeze which replaces redundant runs of a given character with a single instance of that character. This is handy for things like uniformly formatting whitespace in a string. You can easily accomplish something similar with Clojure by using a simple regexp, but just for fun, I took a stab at doing the same thing using reduce.
(defn squeeze [string character]
    (fn [x y]
      (if (and (= (last (str x)) character) 
               (= y character))
        (str x)
        (str x y)))
    (seq string)))
Here's an example invocation:
user=> (squeeze "  foo  bar " \space)
" foo bar "
A regexp equivalent for squeezing spaces would be:
(use 'clojure.contrib.str-utils)
(re-gsub #"\s+" " " some-string)
In this case, the regexp version out performs the reduce version by an order of magnitude.
(use 'clojure.contrib.str-utils)

(def long-seq (take 10000 (cycle " a  b  c ")))
(def long-str (apply str long-seq))

(time (re-gsub #"\s+" " " long-str))
(time (squeeze long-str \space))
"Elapsed time: 23.255887 msecs"
"Elapsed time: 1638.638565 msecs"
Wondering why this was the case, I went on the Clojure IRC channel and asked. Rich Hickey was kind enough to offer some help. He explained that, "building strings by successive concatenation is a bad idea in every programming language", and he made the suggestion to build the data up and then apply str to it at the end of the transformation. He also explained that using the last function is dangerous because it runs in linear time. With that advice, I wrote a revised squeeze function which took advantage of the speed of list head insertion and then reversed the final result before converting to a string.
(defn squeeze [string character]
  (apply str
             (fn [x y]
               (if (and (= (first x) character) 
                        (= y character))
                 (conj x y)))
             (cons '() (seq string))))))
Re-running the benchmark showed a remarkable 27X improvement in performance.
"Elapsed time: 35.504864 msecs"
"Elapsed time: 60.679711 msecs"
27X is nice, but why settle for that when you can push the envelope even farther? Chousuke contribed the following snippit which makes use of Clojure's new-ish transient feature.
(defn squeeze-transient [string character]
  (apply str
             (fn [x y]
               (if (and (= (nth x 0 nil) character)
                        (= y character))
                 (conj! x y)))
             (transient []) (seq string)))))
Running all three implementations against a million character string produced the following data.
"Elapsed time: 437.926123 msecs" # regexp version
"Elapsed time: 1145.87153 msecs" # reversed list version
"Elapsed time: 265.120159 msecs" # transient version
As you can see, with a little effort, you can get some very appreciable performance gains with Clojure.

1 comment:

  1. Please note that the transient version is broken. It always compares to the first character in accumulator. But conj adds at the end for vectors! Here is my try:

    (defn squeeze-string
    [string character]
    (->> string
    (reduce (fn [[accum prev :as state] ch]
    (if (= prev ch character)
    [(conj! accum ch) ch]))
    (transient []))
    (apply str)))

    (Sorry, dunno how to format code here...)