Tuesday, November 24, 2009

Squeezing a String

Ruby has a nice string function called squeeze which replaces redundant runs of a given character with a single instance of that character. This is handy for things like uniformly formatting whitespace in a string. You can easily accomplish something similar with Clojure by using a simple regexp, but just for fun, I took a stab at doing the same thing using reduce.
(defn squeeze [string character]
  (reduce 
    (fn [x y]
      (if (and (= (last (str x)) character) 
               (= y character))
        (str x)
        (str x y)))
    (seq string)))
Here's an example invocation:
user=> (squeeze "  foo  bar " \space)
" foo bar "
A regexp equivalent for squeezing spaces would be:
(use 'clojure.contrib.str-utils)
(re-gsub #"\s+" " " some-string)
In this case, the regexp version out performs the reduce version by an order of magnitude.
(use 'clojure.contrib.str-utils)

(def long-seq (take 10000 (cycle " a  b  c ")))
(def long-str (apply str long-seq))

(time (re-gsub #"\s+" " " long-str))
(time (squeeze long-str \space))
"Elapsed time: 23.255887 msecs"
"Elapsed time: 1638.638565 msecs"
Wondering why this was the case, I went on the Clojure IRC channel and asked. Rich Hickey was kind enough to offer some help. He explained that, "building strings by successive concatenation is a bad idea in every programming language", and he made the suggestion to build the data up and then apply str to it at the end of the transformation. He also explained that using the last function is dangerous because it runs in linear time. With that advice, I wrote a revised squeeze function which took advantage of the speed of list head insertion and then reversed the final result before converting to a string.
(defn squeeze [string character]
  (apply str
         (reverse 
           (reduce 
             (fn [x y]
               (if (and (= (first x) character) 
                        (= y character))
                 x
                 (conj x y)))
             (cons '() (seq string))))))
Re-running the benchmark showed a remarkable 27X improvement in performance.
"Elapsed time: 35.504864 msecs"
"Elapsed time: 60.679711 msecs"
27X is nice, but why settle for that when you can push the envelope even farther? Chousuke contribed the following snippit which makes use of Clojure's new-ish transient feature.
(defn squeeze-transient [string character]
  (apply str
         (persistent!
           (reduce
             (fn [x y]
               (if (and (= (nth x 0 nil) character)
                        (= y character))
                 x
                 (conj! x y)))
             (transient []) (seq string)))))
Running all three implementations against a million character string produced the following data.
"Elapsed time: 437.926123 msecs" # regexp version
"Elapsed time: 1145.87153 msecs" # reversed list version
"Elapsed time: 265.120159 msecs" # transient version
As you can see, with a little effort, you can get some very appreciable performance gains with Clojure.

Tuesday, October 27, 2009

Clojure Syntax

Not many languages are simple enough to have their entire syntax conveniently summarized in a table that fits on one screen. Luckily, Clojure is. The following is a summary put together based on the Clojure reader documentation. I'm a fan of tabular data, so I put this together for my own personal reference.

Reader forms
Symbolsatom, foo-bar, *foo*, etc.
Literals42, "foo", nil, true, false, \c, :foo
Keywords:foo (like symbols, prefixed with colon)
Lists(a b c)
Vectors[a b c]
Maps (hashes){:a 1 :b 2} or {:a 1, :b 2}
Sets#{:a :b :c}
Macro Characters
Quote'form
Character\a \b \c
Comment; txt
Meta^a ^b ^c
Deref@form
Syntax-quote` (backtick)
Unquote~ (tilde)
Unquote-splicing~@ (tilde at)
Macro Characters (Dispatch)
Sets#{:a :b :c}
Regexp#"pattern"
Metadata#^{:a 1 :b 2} [1 2 3] or #^Type x
Var-quote#'x
Anonymous function#(fn [args] (...))
Ignore next form#_

Thursday, October 15, 2009

Crushing PNGs With Ruby

Nothing too special here... just a quick hack. I needed to recursively run pngcrush over a directory structure, and I couldn't find any examples of how to do this online. The following solves the problem nicely. Hopefully someone else finds this helpful.
#!/usr/bin/ruby

require 'fileutils'
require 'find'

input_dir = ARGV[0]
rewrite_dir = "/tmp/crush/"

raise("Usage #{$0} dirname") unless input_dir
Dir.mkdir(rewrite_dir) unless test(?d, rewrite_dir)

Find.find(input_dir) do |file|
  if file =~ /\.png$/
    base_dir = File.dirname(file)
    base_file = File.basename(file)
    new_dir = rewrite_dir + base_dir
    FileUtils.mkdir_p(new_dir) unless test(?d, new_dir)
    puts(%Q!pngcrush -reduce -brute "#{file}" "#{new_dir}/#{base_file}"!)
  end
end
From there, you can just do something like: crush.rb images | sh.

Tuesday, September 8, 2009

Hot Code Swapping with Clojure

Reading the Clojure mailing list, you'll notice that people frequently ask about updating code at runtime without restarting their program. Erlang includes this feature out of the box and has dubbed it, "Hot Code Swapping". It's not baked into Clojure nor is it regularly advertised as a feature, but it's easy to do and potentially very useful.

Before we go forward with an example, I'd like to remind you that this isn't Erlang, and that yanking the rug out from under a library and shoving in a new one could potentially lead to disastrous results if you haven't planned ahead in your program design.

If your library is purely functional with no side effects, then updating it at runtime is going to be much safer than swapping in a file containing mutable data. There's also an issue of atomicity to consider. I've come to find out that reloading a library is not an atomic operation. This could potentially cause some issues depending on your application structure. The def function and defn macro are both atomic which will prevent the possibility of partially defined functions creeping in, but you'll need to watch out for functions that depend on each other being temporarily out of date. With the warnings out of the way, let's get to it.

The easiest way to update a running program is to embed a remote REPL into your code. This is a relatively simple thing to do since Clojure's builtin repl function is designed to allow custom input and output streams. The clojure.contrib.server-socket library actually includes some code to assist with spawning a remote REPL, but I tend to take the hard road everywhere, so I just rolled my own solution.
; file: repl_server.clj
(ns repl-server
  (:import
     (java.net ServerSocket Socket)
     (java.io InputStreamReader PrintWriter)
     (java.util.concurrent ThreadPoolExecutor TimeUnit LinkedBlockingQueue)
     (java.net BindException)
     (clojure.lang LineNumberingPushbackReader)))

(defn bind-server [port backlog]
  "binds to the given port with the backlog specified"
   (new #^ServerSocket ServerSocket port backlog))

(defn server-loop [tasklet socket min-threads max-threads]
  "accepts connections and delegates them to the specified task"
  (let [exec (ThreadPoolExecutor. min-threads max-threads 60 TimeUnit/SECONDS
                                  (LinkedBlockingQueue.))]
    (loop []
      (let [accepted-socket (.accept #^ServerSocket socket)]
        (.submit exec #^Callable #(with-open [accepted-socket accepted-socket]
                                             (tasklet accepted-socket))))
      (recur))))

(defn spawn-repl [socket]
  "spawns a repl on a given socket"
  (let [input (new LineNumberingPushbackReader 
                   (new InputStreamReader (.getInputStream socket)))
        output (new PrintWriter (.getOutputStream socket) true)]
    (binding [*in*  input
              *out* output
              *err* output]
      (clojure.main/repl))))

(def #^ServerSocket *ss* (bind-server 12345 25))
(def repl-server (future (server-loop spawn-repl *ss* 5 100)))
Here we have a simple REPL server ready for embedding into any program we see fit. It runs in a loop accepting connections and then invoking spawn-repl for any client that connects. We use the binding form to set *in*, *out*, and *err* temporarily and then return them to their original values. We set the server to run on port 12345 and then run it in it's own thread via a call to future. It's worth noting that setting the port and actually invoking the server would be better done elsewhere, but this is just a trivial example, so don't crucify me.

Next we'll create a stupidly simple library which we'll be hot swapping.
; file: greeting.clj
(ns greeting)

(def message "hello world")

(defn print-message []
  (println "message:" message))
We follow this with a main file that ties it all together.
(ns main
  (:use greeting
        repl-server))

(defn main []
  (loop []
    (print-message)
    (Thread/sleep 1000)
    (recur)))

(main)
The main function simply runs in an infinite loop calling print-message every one second. From here we're ready to test some hot code swapping.
$ clj main.clj
message: hello world
message: hello world
...
So far so good, now let's connect to the REPL.
$ nc localhost 12345
clojure.core=> (ns greeting)
nil
greeting=> (def message "hola amigo!")
#'greeting/message
At this point the main window is now printing the updated message.
message: hola amigo!
message: hola amigo!
...
Now let's make a change to greeting.clj.
; file: greeting.clj
(ns greeting)

(def message "8675309")

(defn print-message []
  (println "jenny:" message))
With the new file in place, we request a reload. Keep in mind that this isn't atomic (as previously stated), but it is convenient for loading larger chunks of code.
$ nc localhost 12345
clojure.core=> (require 'greeting :reload)
The output of our main loop confirms that the reload worked as intended.
jenny: 8675309
jenny: 8675309
jenny: 8675309
...
This has obviously been a dead-simple example, but it should demonstrate the flexibility that Clojure provides when it comes to changing things at runtime.

Tuesday, August 18, 2009

Flipping Coins with Clojure

So... there was a question on Reddit today regarding flipping a series of coins. Quite a bit of dispute arose about the actual phrasing of the question, so rather than plagiarize one of the many suggested variations, I'll spell it out here in plain english.

Bob has a serious problem affecting dozens around the U.S. today. He's a habitual coin flipper. You see... Bob's life fell apart a few years ago after his wife Janet left him, and now he sits around his house flipping coins and writing down how they land. Let's take a peek into Bob's journal for a couple days and see how things have been.

Day 1, Monday:

Woke up today and put some more vaseline on the tip of my thumb. All this flipping has been good, but there's some things that go along with it. Damned blisters keep flaring, and my thumbnail is cracked. Oh well, time to pour us up a drink and see what's what. Today I'll be flipping just to see what the big man upstairs has in store. I'm gonna flip until the sun sets and write it down everytime the coins go head-tails-head.

Day 2, Tuesday:

Well, shit, yesterday was a good time, but all this flippin' just makes me wanna flip more. Time to pour us up a drink and see what ol' mister flipper has ready for me. I'm gonna flip till my flipper-flapper is all tuckered out and write it down everytime the coins go head-tails-tails.

If Bob only knew what Clojure could do for him, he could get his old life back, and Janet and the kids would come home, but I digress... Let's explore Bob's habit and try to answer the following question: "In an infinite sequence of coin flips, which pattern will occur more often: HTH or HTT?" (ok... so I plagiarized a little).

We'll start off by defining the patterns we're looking for and the number of samples we want to take. In this case, we'll be representing a head as zero and a tail as one.
(def head-tail-head [0 1 0])
(def head-tail-tail [0 1 1])
(def num-samples 100000)
Next we'll need a function to flip the coin and calculate the average number of tries before Bob cracks the champagne.
(defn flip-coin []
  (rand-int 2))

(defn avg [coll]
  (/ (apply + coll) (count coll)))
Super easy so far, right? If only Bob knew... Now for something slightly more advanced, our main worker function, which I'll break down in pieces.
(defn num-flips-to-match-pattern [pattern]
  (let [coin-flips (repeatedly flip-coin)]
    (loop [x 0]
      (if (= pattern (nth-flip-sequence x coin-flips))
        (+ 3 x)
        (recur (inc x))))))
This function takes a given pattern and returns the number of flips required to match the pattern. The flipping is obviously random, but with a large enough sample size, some pattern should emerge. Side note: I add three to the return result because it takes a minimum number of three flips to get a valid result and we're starting with zero (offset math FTW).

To make sure we have enough coin flips at our disposal, we'll generate them from an infinite lazy sequence (on line #2). Clojure makes these kinds of infinite sets trivial to define, and they can be very convenient when you're not sure how much data you'll need.

Next up you'll see a yet to be defined function, nth-flip-sequence. I would have defined it up above, but it makes better sense in context, so I'll define it now.
(defn nth-flip-sequence [x coin-flips]
  (nth (partition 3 1 coin-flips) x))
The nth-flip-sequence function takes an offset and an infinite sequence (in this case a sequence of coin flips). From here, it uses partition to grab three elements with a step of one. As an example, it would break up a dataset as follows.
(def coin-flips [0 1 1 0 0 1 1 1])
(nth (partition 3 1 coin-flips) 0)
(0 1 1)
(nth (partition 3 1 coin-flips) 1)
(1 1 0)
(nth (partition 3 1 coin-flips) 2)
(1 0 0)
As usual, Clojure has some powerful batteries included to do most of the work for us. With the hard part out of the way, we write a function to average the number of flips required to match a given pattern for a specified number of samples.
(defn take-samples [pattern num-samples]
  (float (avg (for [x (range num-samples)]
                   (num-flips-to-match-pattern pattern)))))
We'll go ahead and define Bob's two incredibly fulfilling days.
(def monday (future (take-samples head-tail-head num-samples)))
(def tuesday (take-samples head-tail-tail num-samples))
If you noticed the future function above, good eye. Bob likes to get things done fast, so this is a trivial way to calculate both days in parallel. We are using Clojure afterall...

After this, we simply print the results and cleanup the thread pool (since we used future).
(println "mon: " @monday)
(println "tue: " tuesday)
(shutdown-agents)
A sample run confirms that the reasoning in the original article is sound and hopefully brings Bob some peace.

mon:  10.03817 # avg num of coin flips to hit head-tails-head
tue:  8.00416 # avg num of coin flips to hit head-tails-tails

Wednesday, July 22, 2009

Sweeping Networks with Clojure

One of the great things about Clojure is the Java toolbox that it brings to the table. Socket programming has never been one of Lisp's strong points, and although it has always been possible to write network applications in Lisp, platform specific details and portability are generally a concern. Luckily, the JVM provides one of the most tested networking stacks on the planet, so we can cast any of these historical concerns to the wind.

Enter my situation this evening. I was recently given a new workstation at my place of employment. Our DHCP server delegates IPs by MAC address, so I was assigned a new IP, which I haven't committed to memory. I glanced at the output from ifconfig this afternoon before leaving and briefly noticed that my workstation fell somewhere in the dot one-hundred range. After getting home and noticing a bug in the application I've been working on, I needed to access my workstation to recompile.

Not knowing the IP, my only option was to hop on the VPN and sweep the network for machines running ssh. There are obviously tools available to do this rather easily (nmap for example), but what fun is that?

I'm generally a fan of keeping functions as pure as possible and avoiding side-effects at all costs. That being said, socket programming like all forms of IO is inherently side-effect laden, so in this case I've opted to employ certain facilities I might otherwise avoid.

Kicking things off with a bit of ceremony, we'll import the required objects.
(import '(java.io IOException)
        '(java.net Socket)
        '(java.net InetSocketAddress)
        '(java.net SocketTimeoutException)
        '(java.net UnknownHostException))
With that out of the way, we'll create a function to see if a given host / port combination is connectable. To avoid indefinite blocking, we'll make it so the connection can timeout (thanks to nikkomega from reddit for helping me improve this function).
(defn host-up? [hostname timeout port]
  (let [sock-addr (InetSocketAddress. hostname port)]
    (try
     (with-open [sock (Socket.)]
       (. sock connect sock-addr timeout)
       hostname)
     (catch IOException e false)
     (catch SocketTimeoutException e false)
     (catch UnknownHostException e false))))
As you can see, the use of with-open ensures that the connection is closed regardless of the outcome. Any exceptions that may occur result in a return value of false. We'll use this later to filter through the relevant results. To avoid ruining the flexibility of the host-up? function, we'll add a second function to test specifically for ssh servers running on port 22.
(defn ssh-host-up? [hostname]
      (host-up? hostname 5000 22))
The timeout is hardcoded at 5000 milliseconds, which is probably much longer than needed. Performance will suffer in a single-threaded application, but we'll address this later. With the hard work out of the way, we'll simply apply the functions to the desired data.
(def network "192.168.1.")
; scan 192.168.1.1 - 192.168.1.254
(def ip-list (for [x (range 1 255)] (str network x)))
(doseq [host (filter ssh-host-up? ip-list)]
       (println (str host " is up")))
After running this, I was able to retrieve the desired results and locate my machine; however, it took over twelve minutes to sweep the entire network. This is due to the long timeout and the fact that we're testing each host in a serial fashion. Seeing as we're using Clojure, a few small changes should improve the situation dramatically.

Before multi-threading the program:
real    12m19.390s
user    0m1.684s
sys     0m0.364s
There are varying ways to add concurrency to a Clojure app, but agents provide a send-off function specifically designed for blocking tasks. Given the fact that we're sitting around waiting for most of these hosts to timeout, agents are a logical choice in this case. Since the first part of our program was written in a generic fashion, all we need to change is the application of the functions.
(def network "192.168.1.")
; scan 192.168.1.1 - 192.168.1.254
(def ip-list (for [x (range 1 255)] (str network x)))
(def agents (for [ip ip-list] (agent ip)))

(doseq [agent agents]
  (send-off agent ssh-host-up?))

(apply await agents)

(doseq [host (filter deref agents)]
  (println (str @host " is up")))

(shutdown-agents)
Running the modified code reduces the runtime from twelve minutes to six seconds. Who can argue with that?
real 0m6.731s
user 0m1.996s
sys 0m0.268s

Thursday, June 18, 2009

Clojure's Agents: Scientists, Monkeys, and MD5

Just recently, I decided to make my first attempt at doing something useful with Clojure's agents. Now please don't confuse useful with helpful. I use the term useful in the most pejorative sense imaginable and reserve it for things like solar powered flashlights and inflatable dartboards. That being said, I did actually write a program that does something marginally interesting, but before we dive into that, let's take a step back.

The official description of agents written by Rich Hickey himself says the following, "Agents provide independent, asynchronous change of individual locations. Agents are bound to a single storage location for their lifetime, and only allow mutation of that location (to a new state) to occur as a result of an action. Actions are functions (with, optionally, additional arguments) that are asynchronously applied to an Agent's state and whose return value becomes the Agent's new state."

This is an accurate description of what agents do, but I had to read that paragraph many times to really get it. If your eyes are still returning to fixed position, cast your fears aside, and let me take you by the hand. We're about to take baby steps through what agents offer followed by an incredibly contrived (and somewhat long-winded) example.

At the most basic level, agents provide an easy way to work with some data in the background (another thread). To create an agent, you bind a variable to it and give it a starting value (zero in this case).

user=> (def counter (agent 0))
#'user/counter

An agents state is modified by sending it a function. In the case of a counter, we'd typically want a function that takes a value and increments it by one. Additional arguments can be passed to this function, but we're keeping things simple for now.

user=> (defn add-one [x] (inc x))
#'user/add-one

We could have used an anonymous function, but I'm being overly explicit for the sake of example. Note that the agent will pass its current state as the first argument to our function. Having composed the appropriate function, we can fire it off to the agent.

user=> (send counter add-one)
#<Agent@8edf3c: 1>

The agent passes its current state (zero) to add-one as the first argument. The add-one function returns the new result (one) back to the agent, and it is stored as the agent's new state. The send function's return value is the agent in its new state. If we want to examine the agent's current state, we simply dereference it.

user=> @counter
1

It's worth mentioning that you can retrieve an agent's state from any thread at any time without blocking. Once send is called, the agent begins its processing in the background and program execution continues. If you anticipate that your agent will engage in blocking operations, you should use send-off rather than send. The usage of send-off verses send has implications on how threads are managed in the background and should be explored in the privacy of your own home.

With the introductions out of the way, I will make good on my promise to present an incredibly contrived example. I was just going to lay out some code and explain it, but I had spicy mustard on my pancakes this morning, so I've decided to evolve this into a full blown scenario.

Imagine for a moment that you've been kidnapped by a mad scientist. Let's call him Dr. Moreau. Being completely insane, Dr. Moreau has decided to transplant your brain with that of a chimpanzee and then conduct experiments involving extended programming sessions with ASP. He leaves you alone with the chimp in his laboratory while he goes to fetch his bone saw.

Luckily, the chimp knows sign language and explains that Dr. Moreau stores all his passwords as MD5 sums of lowercase four letter strings in the character range of a-z. The doctor has OCD, so he has sixty passwords. If you can decrypt them all and feed them into his mainframe in the next five minutes, the lab will self-destruct and you and the monkey will go free.

The monkey gestures towards a terminal and reminds you that time is of the essence, so the job must be done in parallel. He also lets you know that the machine has four processors and one programming language installed. That language is Clojure.

Dispelling disbelief that a situation so incredibly contrived is actually happening, you decide a proof of concept would be the best place to start. You roll up your sleeves and begin programming.

First things first, it's obvious that this program will need to be capable of generating MD5 sums given an input string. Thankfully, this is a very common task. The monkey directs you to g-o-o-g-l-e-dot-com, and you quickly find the following nugget.

(ns agentmd5
(:refer-clojure)
(:import
   (java.security
     NoSuchAlgorithmException
     MessageDigest)
   (java.math BigInteger)))

; computes an MD5 sum of a string
; (http://www.holygoat.co.uk/blog/entry/2009-03-26-1)
(defn md5-sum
"Compute the hex MD5 sum of a string."
[#^String str]
(let [alg (doto (MessageDigest/getInstance "MD5")
            (.reset)
            (.update (.getBytes str)))]
  (try
    (.toString (new BigInteger 1 (.digest alg)) 16)
    (catch NoSuchAlgorithmException e
      (throw (new RuntimeException e))))))

Knowing that this program will run on multiple cores and that identical MD5s may be requested several times, you decide it might be wise to memoize this function so that redundant calls will come from cache.

; memoized version of MD5
(def md5-memo (memoize md5-sum))
That problem out of the way, the next step is to brute force all possible combinations of four character strings. Having done some Perl and Ruby programming in the past, you initially think, "I'll just use the range operator." Unfortunately, range operators aren't on the menu today, so we'll be doing things the old fashioned way. For your first attempt, you adapt some Java code you find online.
; starts at "aaaa" and rotates string based on `n'
; "aaaa", "aaab", "aaac", etc...
(defn base26 [n]
    (let [seed-string "aaaa"
                      s (new StringBuilder seed-string)
                      a_val (int \a)]
      (loop [pos 3 x (int n)]
            (when (pos? x)
              (let [digit (char (+ a_val (unchecked-remainder x 26)))]
                (.setCharAt s pos digit)
                (when (pos? pos)
                  (recur (int (dec pos)) (unchecked-divide x 26))))))
      (.toString s)))
This code works just fine initially, and thanks to some quick help from some friendly Clojurians it runs nearly as fast as the Java code from which it was adapted. You're pleased with this until you realize that an obscure bug prevents this code from working on the Java 5 SE VM. What if Dr. Moreau's mainframe is running Java 5 SE?! Never satisfied with marginal quality even when your life is on the line, you make a second attempt.
(defn string-to-base36 [string]
          (Integer/parseInt string 36))

(doall (for [x (range (string-to-base36 "aaaa")
                       (string-to-base36 "zzzz"))]
                       (Integer/toString x 36)))
This works, but it's much slower, and it includes a slew of character combinations outside of what's needed. The monkey reassuringly squeezes your shoulder and encourages one last attempt.
; returns the next iteration of a given character sequence
(defn next-string [#^"[C" chars start-char end-char]
  (loop [i (int (- (alength chars) 1))]
    (if (>= i 0)
      (let [last-char (= (aget #^"[C" chars i) (char end-char))
            recur-val (int (if last-char (- i 1) -1))]
        (if last-char
          (aset chars i (char start-char))
          (aset chars i (char (inc (int (aget chars i))))))
        (recur recur-val))))
  (new String chars))

; returns a sequence of strings given a length, range, and result size
; length=4 start-char=a end-char=z result-size=5 would yield
; ["aaaa" "aaab" "aaac" "aaad" "aaae"]
(defn alpha-gen [length start-char end-char result-size]
  (let [#^"[C" chars (make-array (Character/TYPE) length)]
    ; initialize the array
    (loop [i 0]
      (if (< i length)
        (do (aset chars (int i) (char start-char))
          (recur (inc i)))))
    (for [i (range result-size)]
      (next-string chars start-char end-char))))

The code is somewhat dense and sprinkled with type hints, but it does exactly what you need. You benchmark it and see that it's as fast as your initial attempt, and it works beautifully on Java 5 and Java 6 (as it should...).

Thinking that these generated values will be needed multiple times, you decide to pre-calculate all the combinations and then make them available to your agents via cache. This will use more memory, but hey, your life is on the line! We could just use Clojure's memoize function once more, but we're having fun here, so let's make a closure instead and make use of Clojure's refs at the same time.

; creates an initial cached copy of all character combinations
; subsequent calls return the cached copy rather than calculating
(def string-perms
  (let [xs (ref [])]
    (fn []
      (if (empty? @xs)
        (dosync (ref-set xs (doall (alpha-gen 4 \a \z (Math/pow 26 4)))))
        @xs))))

Next you need a function to find the string that corresponds to a given MD5. The hard work has already been done, so we just glue some of our functions together.

; cycles through character combinations for a given md5 string
; until the matching result is found
(defn decode-md5 [md5]
   (first (filter #(= md5 (md5-memo %1)) (string-perms))))

It's worth mentioning that filter is lazy. This means that it will only provide as much data as it's asked for. Once the first match is found, the calculation will cease to prevent wasting additional resources.

Feeling very confident in your programming prowess, you quickly slop together a function to apply your decode-md5 function to all possible character combinations. The monkey looks at you with a faint air of suspicion but shrugs his shoulders and goes back to eating a banana.

; decodes a bucket
(defn decode [bucket]
   (map decode-md5 bucket))

You decide that the best way to divide up the work load across agents will be to break the input data into buckets. Not worrying about how this will actually work, you throw all your concrete instincts to the wind.

; returns a collection of agents for a given set of work buckets
(defn spawn-agents [agents buckets]
  (if (empty? buckets)
    agents
    (recur (conj agents (agent (first buckets)))
         (rest buckets))))

The monkey can hardly contain himself. You've done all the work so far in just under three minutes (did I mention you're in a time warp?). With two minutes to spare, you take care of the initializing the data your program will use. For your initial dry run, you decide to populate your dataset with an MD5'd test string of "cloj".

; initialize string permutations
(string-perms)

; number of tasks to doll out to the agents
(def num-work-units 60)

; generate requested number of work units
(def work-units (map md5-sum (for [x (range num-work-units)] "cloj")))

; number of agents to spawn
(def num-agents 4)

; divide the units into buckets for each agent
(def work-buckets (partition (int (/ (count work-units)
                                   num-agents)) work-units))

You realize that the strategy you're using to partition work units into work buckets allows for some data loss if things don't divide perfectly but then chuckle at the ultra-controlled and incredibly contrived situation you're in.

With your task nearly complete and almost a minute left before Dr. Moreau returns, you spawn the agents and send them their jobs.

; create an agent for each bucket
(def agents (spawn-agents [] work-buckets))

; send each agent a job to do
(doseq [agent agents]
  (send agent decode))

; wait on the agents to finish
(apply await agents)

; view the results
(doseq [agent agents]
  (doseq [result @agent] (println result)))

; clean up
(shutdown-agents)

Dr. Moreau runs Linux, so you fire up top. Eager to see his incredibly evil CPU pegged at 400%, you invoke your program. Text begins streaming across the terminal, but to your complete horror only one processor is being used. Luckily, you're the world's greatest debugger, so the thirty seconds you have left is more than ample.

In this case, the problem lies in the same laziness that we took advantage of with filter earlier. In many instances, Clojure will only calculate what it believes you will need to speed up processing in the present. In some cases, this can work against you.

The decode function above passes the decode-md5 function to map with a bucket as the input data. The map function sees that you're not asking for any specific number of items, so it says, "I'll take care of it later. Baywatch is on."

When you begin requesting results from your agents, none of the work has been done. To make matters worse, you're iterating over the agents in series, so the lazy map function is being called in a completely serial fashion.

With only fifteen seconds left, the monkey begins banging on the keyboard. You're about to crawl under a table when you realize the error of your ways. If you could force the values from the map to be materialized when it's initially called, all your problems would go away (well, there's that IRS thing, but I digress...).

Invoking the Vim editor, you make a small change.

; decodes a bucket
(defn decode [bucket]
  (doall (map decode-md5 bucket)))

The doall function tells Clojure we want the work done now. With just seconds to spare, you test the revised copy. All four processors are utterly consumed as bytes begin spinning out of the ether. There's an electric charge in the monkey's eye, and the test completes rapidly.

Some additional testing shows that processing a 6,000 item dataset takes one minute twenty-six seconds with four agents, and three minutes fifteen seconds with one agent. It's also helpful to consider that Dr. Moreau's co-worker has MySQL running buckwild on the same machine, and it frequently pegs two or three cores. This makes accurate benchmarking difficult.

Triumphs aside, there's bad news. You've wasted all your time screwing around with generating characters and testing. Dr. Moreau has returned with the bone saw. As he approaches, you think back to ancient times and realize that these kinds of situations used to be reconciled without Clojure. You grab a nearby stapler and begin flogging the doctor. The monkey joins in and inflicts a painful bite. The doctor rapidly succumbs.

Exhausted after a hard day's work, you take the monkey to Ben & Jerry's for some well deserved ice cream. You enjoy a nice evening together chatting about Dr. Moreau, Clojure, and possible content for upcoming blog articles.

The full source code for the example program is available here. No animals were harmed during the development of this program.

Tuesday, June 2, 2009

Markov Chaining

A friend of mine wrote a very nice Markov Chainer in Python a while back. I had a lot of fun playing with it and decided to try my hand at rewriting it in Clojure. This proved to be trickier than I expected as it was my first Clojure program beyond simple factorial and fibonacci generators, but I learned a lot in the process. The main issue I ran into was my own misunderstanding of how Clojure adds elements to it's various sequence types.

It's fast to add elements to the head of a list and the tail of a vector. Clojure takes advantage of this by adding elements to the optimal position of a sequence based on it's type. Being totally oblivious to this fact, I spent quite a while trying to pin down the difference in behavior between my Python reference implementation and my own version. Once I discovered the source of the program, I had a classic hacker aha moment and a great deal of satisfaction.

Note that this program stays in the realm of immutable data 100% of the time. I believe this may be the reason that it's much slower than the Python version, but I could be wrong. For whatever reason, there's a pretty large speed difference between the two programs. If anyone has any suggestions, I'll be happy to try them.

(ns markov 
  (use clojure.contrib.str-utils)) 

(set! *warn-on-reflection* true)

(defn flatten 
  "Takes any nested combination of sequential things (lists, vectors, 
  etc.) and returns their contents as a single, flat sequence. 
  (flatten nil) returns nil." 
  [x] 
  (filter (complement sequential?) 
          (rest (tree-seq sequential? seq x))))

(defn rand-elt 
  "Return a random element of this seq" 
  [s] 
  (nth s (rand-int (count s)))) 

(defn clean [txt] 
  "clean given txt for symbols disruptive to markov chains" 
  (let [new-txt (re-gsub #"[:;,^\"()]" "" txt) 
        new-txt (re-gsub #"'(?!(d|t|ve|m|ll|s|de|re))" "" new-txt)] new-txt)) 

(defn chain-lengths [markov-chain] 
  "return a set of lengths for each element in the collection" 
  (let [markov-keys (map keys markov-chain)] 
    (set (for [x markov-keys] (count x))))) 

(defn max-chain-length [markov-chain] 
  "return the length lf the longest chain" 
  (apply max (chain-lengths markov-chain))) 

(defn chain 
  "Take a list of words and build a markov chain out of them. 
  The length is the size of the key in number of words." 
  ([words] 
   (chain words 3)) 
  ([words length] 
   (loop [markov-chain {} 
          keychain (for [x (range length)] nil) 
          words (map clean words)] 
     (let [first-word (first words)] 
       (if (seq words) 
         (recur (assoc markov-chain keychain 
                       (cons first-word (get markov-chain keychain))) 
                (concat (rest keychain) [first-word]) 
                (rest words)) 
         (assoc markov-chain keychain [])))))) 

(defn split-sentence [text] 
  "Convert a string to a collection on common boundaries" 
  (filter seq (re-split #"[,.!?()\d]+\s*" text))) 

(defn file-chain 
  "Create a markov chain from the contents of a given file" 
  ([file] 
   (file-chain file 3)) 
  ([file length] 
   (let [sentences (split-sentence (slurp file)) 
         flatten-list (fn [& x] (flatten (list x)))] 
     (loop [markov-chain {} words sentences] 
       (if (seq words) 
         (recur (merge-with flatten-list 
                            markov-chain 
                            (chain (re-split #"\s+" (first words)))) 
                (rest words)) 
         markov-chain))))) 

(defn construct-sentence 
   "Build a sentence from a markov chain structure.  Given a 
   Markov chain (any size key),  Seed (to start the sentence) and 
   Proc (a function for choosing the next word), returns a sentence 
   composed until is reaches the end of a chain (an end of sentence)." 
  ([markov-chain] 
   (construct-sentence markov-chain nil rand-elt)) 
  ([markov-chain seed] 
   (construct-sentence markov-chain seed rand-elt)) 
  ([markov-chain seed proc] 
   (loop [words (if seed seed (rand-elt (keys markov-chain))) 
          sentence (str-join " " (filter identity words))] 
     (if (seq (markov-chain words)) 
       (let [word-new (proc (markov-chain words))] 
         (recur (concat (rest words) [word-new]) 
                (str-join " " (into [sentence] [word-new])))) 
       sentence))))
Here's some example usage:
(ns main (use markov))
(def markov (file-chain "corpus.txt"))
(doseq [x (range 100)]
  (doseq [x (range 3)] (println (construct-sentence markov)))
  (println))

SyntaxHighlighter For Clojure

I'm really surprised that Blogger doesn't include a solution for syntax highlighting as it hosts a ton of blogs related to programming. Fortunately, there are plenty of third party solutions out there. I've settled on SyntaxHighlighter, and while it's not perfect, it does a good enough job. With Clojure being as new as it is, SyntaxHighlighter doesn't include a brush file for the language. This should change in the future since I have submitted a brush to the author and have been informed that it will be included in the next release. In the meantime, here's the file which provides highlighting support for a good number of common functions.
SyntaxHighlighter.brushes.Clojure = function()
{
  // Contributed by Travis Whitton

  var funcs = ':arglists :doc :file :line :macro :name :ns :private :tag :test new alias alter ' +
              'and apply assert class cond conj count def defmacro defn defstruct deref do '     +
              'doall dorun doseq dosync eval filter finally find first fn gen-class gensym if '  +
              'import inc keys let list loop map ns or print println quote rand recur reduce '   +
              'ref repeat require rest send seq set sort str struct sync take test throw '       +
              'trampoline try type use var vec when while';

  this.regexList = [
          { regex: new RegExp(';.*$', 'gm'),                               css: 'comments' },
          { regex: SyntaxHighlighter.regexLib.multiLineDoubleQuotedString, css: 'string' },
          { regex: /\[|\]/g,                                               css: 'keyword' },
          { regex: /'[a-z][A-Za-z0-9_\-]*/g,                               css: 'color1' }, // symbols
          { regex: /:[a-z][A-Za-z0-9_\-]*/g,                               css: 'color2' }, // keywords
          { regex: new RegExp(this.getKeywords(funcs), 'gmi'),             css: 'functions' }
      ];

  this.forHtmlScript(SyntaxHighlighter.regexLib.aspScriptTags);
}

SyntaxHighlighter.brushes.Clojure.prototype     = new SyntaxHighlighter.Highlighter(); 
SyntaxHighlighter.brushes.Clojure.aliases       = ['clojure', 'Clojure', 'clj'];

Monday, June 1, 2009

Method Introspection

As a Ruby hacker, I enjoy the some of the introspection features it provides. When I first started with Clojure, I missed the ability to introspect on methods easily when reaching into the Java world. Luckily, it only took a few lines of code to get what I wanted.

(defn class-methods [x]
 (let [c (if (class? x) x (class x))]
  (distinct (sort (seq 
                   (map #(.getName %1)
                    (.getMethods c)))))))

(println (let [x (new Integer 1)] (class-methods x)))

Tuesday, May 5, 2009

Clojure 1.0 Released

Yesterday marked the official release of Clojure 1.0. After several years of heavy development, the 1.0 branch will enable people to consume a stable version of Clojure and move to bugfix-only incremental versions while preserving API stability. Clojure was produced over the course of a self-funded sabbatical by it's author, Rich Hickey, and aims to provide a modern Lisp running on the JVM with excellent support for concurrency and Java interoperability.

Thursday, April 16, 2009

Happy Birthday Blog

And so it begins... A nascent blogs takes it's first breath of fresh air and sets foot onto the internet. The possibilities are endless, but the competition is fierce. The only real chance of survival is for the young blog to carve out a niche in the unforgiving wilderness. Luckily, this blog has a steward to guide it along it's way. He takes pity on the poor child and shares his benevolent wisdom. His name is Travis Whitton. Introductions aside, this is my blog. I'll be talking about whatever I want here, but most of what I share will be related to technology. I have a strong interest in computer programming and spend a good amount of time studying functional programming languages such as Haskell, Clojure, Erlang, Scheme, and Common Lisp. I'll be talking about these things as I continue to tinker, but I make no claims regarding competence. Some of these topics are dense, and I welcome feedback. For the morbidly curious, I've been a professional programmer and Linux systems administrator for approximately ten years. I have another blog specifically related to the Vim text editor, which you might have seen. One of the main purposes of this blog is to partition my writings and keep my Vim blog on topic. I work on a number of projects at Grooveshark which I will touch upon from time to time. I'm also a musician, and I play in a couple of bands. Now that the long-winded first post is out of the way, it's time to get down to business. Stay tuned!