Thursday, June 18, 2009

Clojure's Agents: Scientists, Monkeys, and MD5

Just recently, I decided to make my first attempt at doing something useful with Clojure's agents. Now please don't confuse useful with helpful. I use the term useful in the most pejorative sense imaginable and reserve it for things like solar powered flashlights and inflatable dartboards. That being said, I did actually write a program that does something marginally interesting, but before we dive into that, let's take a step back.

The official description of agents written by Rich Hickey himself says the following, "Agents provide independent, asynchronous change of individual locations. Agents are bound to a single storage location for their lifetime, and only allow mutation of that location (to a new state) to occur as a result of an action. Actions are functions (with, optionally, additional arguments) that are asynchronously applied to an Agent's state and whose return value becomes the Agent's new state."

This is an accurate description of what agents do, but I had to read that paragraph many times to really get it. If your eyes are still returning to fixed position, cast your fears aside, and let me take you by the hand. We're about to take baby steps through what agents offer followed by an incredibly contrived (and somewhat long-winded) example.

At the most basic level, agents provide an easy way to work with some data in the background (another thread). To create an agent, you bind a variable to it and give it a starting value (zero in this case).

user=> (def counter (agent 0))
#'user/counter

An agents state is modified by sending it a function. In the case of a counter, we'd typically want a function that takes a value and increments it by one. Additional arguments can be passed to this function, but we're keeping things simple for now.

user=> (defn add-one [x] (inc x))
#'user/add-one

We could have used an anonymous function, but I'm being overly explicit for the sake of example. Note that the agent will pass its current state as the first argument to our function. Having composed the appropriate function, we can fire it off to the agent.

user=> (send counter add-one)
#<Agent@8edf3c: 1>

The agent passes its current state (zero) to add-one as the first argument. The add-one function returns the new result (one) back to the agent, and it is stored as the agent's new state. The send function's return value is the agent in its new state. If we want to examine the agent's current state, we simply dereference it.

user=> @counter
1

It's worth mentioning that you can retrieve an agent's state from any thread at any time without blocking. Once send is called, the agent begins its processing in the background and program execution continues. If you anticipate that your agent will engage in blocking operations, you should use send-off rather than send. The usage of send-off verses send has implications on how threads are managed in the background and should be explored in the privacy of your own home.

With the introductions out of the way, I will make good on my promise to present an incredibly contrived example. I was just going to lay out some code and explain it, but I had spicy mustard on my pancakes this morning, so I've decided to evolve this into a full blown scenario.

Imagine for a moment that you've been kidnapped by a mad scientist. Let's call him Dr. Moreau. Being completely insane, Dr. Moreau has decided to transplant your brain with that of a chimpanzee and then conduct experiments involving extended programming sessions with ASP. He leaves you alone with the chimp in his laboratory while he goes to fetch his bone saw.

Luckily, the chimp knows sign language and explains that Dr. Moreau stores all his passwords as MD5 sums of lowercase four letter strings in the character range of a-z. The doctor has OCD, so he has sixty passwords. If you can decrypt them all and feed them into his mainframe in the next five minutes, the lab will self-destruct and you and the monkey will go free.

The monkey gestures towards a terminal and reminds you that time is of the essence, so the job must be done in parallel. He also lets you know that the machine has four processors and one programming language installed. That language is Clojure.

Dispelling disbelief that a situation so incredibly contrived is actually happening, you decide a proof of concept would be the best place to start. You roll up your sleeves and begin programming.

First things first, it's obvious that this program will need to be capable of generating MD5 sums given an input string. Thankfully, this is a very common task. The monkey directs you to g-o-o-g-l-e-dot-com, and you quickly find the following nugget.

(ns agentmd5
(:refer-clojure)
(:import
   (java.security
     NoSuchAlgorithmException
     MessageDigest)
   (java.math BigInteger)))

; computes an MD5 sum of a string
; (http://www.holygoat.co.uk/blog/entry/2009-03-26-1)
(defn md5-sum
"Compute the hex MD5 sum of a string."
[#^String str]
(let [alg (doto (MessageDigest/getInstance "MD5")
            (.reset)
            (.update (.getBytes str)))]
  (try
    (.toString (new BigInteger 1 (.digest alg)) 16)
    (catch NoSuchAlgorithmException e
      (throw (new RuntimeException e))))))

Knowing that this program will run on multiple cores and that identical MD5s may be requested several times, you decide it might be wise to memoize this function so that redundant calls will come from cache.

; memoized version of MD5
(def md5-memo (memoize md5-sum))
That problem out of the way, the next step is to brute force all possible combinations of four character strings. Having done some Perl and Ruby programming in the past, you initially think, "I'll just use the range operator." Unfortunately, range operators aren't on the menu today, so we'll be doing things the old fashioned way. For your first attempt, you adapt some Java code you find online.
; starts at "aaaa" and rotates string based on `n'
; "aaaa", "aaab", "aaac", etc...
(defn base26 [n]
    (let [seed-string "aaaa"
                      s (new StringBuilder seed-string)
                      a_val (int \a)]
      (loop [pos 3 x (int n)]
            (when (pos? x)
              (let [digit (char (+ a_val (unchecked-remainder x 26)))]
                (.setCharAt s pos digit)
                (when (pos? pos)
                  (recur (int (dec pos)) (unchecked-divide x 26))))))
      (.toString s)))
This code works just fine initially, and thanks to some quick help from some friendly Clojurians it runs nearly as fast as the Java code from which it was adapted. You're pleased with this until you realize that an obscure bug prevents this code from working on the Java 5 SE VM. What if Dr. Moreau's mainframe is running Java 5 SE?! Never satisfied with marginal quality even when your life is on the line, you make a second attempt.
(defn string-to-base36 [string]
          (Integer/parseInt string 36))

(doall (for [x (range (string-to-base36 "aaaa")
                       (string-to-base36 "zzzz"))]
                       (Integer/toString x 36)))
This works, but it's much slower, and it includes a slew of character combinations outside of what's needed. The monkey reassuringly squeezes your shoulder and encourages one last attempt.
; returns the next iteration of a given character sequence
(defn next-string [#^"[C" chars start-char end-char]
  (loop [i (int (- (alength chars) 1))]
    (if (>= i 0)
      (let [last-char (= (aget #^"[C" chars i) (char end-char))
            recur-val (int (if last-char (- i 1) -1))]
        (if last-char
          (aset chars i (char start-char))
          (aset chars i (char (inc (int (aget chars i))))))
        (recur recur-val))))
  (new String chars))

; returns a sequence of strings given a length, range, and result size
; length=4 start-char=a end-char=z result-size=5 would yield
; ["aaaa" "aaab" "aaac" "aaad" "aaae"]
(defn alpha-gen [length start-char end-char result-size]
  (let [#^"[C" chars (make-array (Character/TYPE) length)]
    ; initialize the array
    (loop [i 0]
      (if (< i length)
        (do (aset chars (int i) (char start-char))
          (recur (inc i)))))
    (for [i (range result-size)]
      (next-string chars start-char end-char))))

The code is somewhat dense and sprinkled with type hints, but it does exactly what you need. You benchmark it and see that it's as fast as your initial attempt, and it works beautifully on Java 5 and Java 6 (as it should...).

Thinking that these generated values will be needed multiple times, you decide to pre-calculate all the combinations and then make them available to your agents via cache. This will use more memory, but hey, your life is on the line! We could just use Clojure's memoize function once more, but we're having fun here, so let's make a closure instead and make use of Clojure's refs at the same time.

; creates an initial cached copy of all character combinations
; subsequent calls return the cached copy rather than calculating
(def string-perms
  (let [xs (ref [])]
    (fn []
      (if (empty? @xs)
        (dosync (ref-set xs (doall (alpha-gen 4 \a \z (Math/pow 26 4)))))
        @xs))))

Next you need a function to find the string that corresponds to a given MD5. The hard work has already been done, so we just glue some of our functions together.

; cycles through character combinations for a given md5 string
; until the matching result is found
(defn decode-md5 [md5]
   (first (filter #(= md5 (md5-memo %1)) (string-perms))))

It's worth mentioning that filter is lazy. This means that it will only provide as much data as it's asked for. Once the first match is found, the calculation will cease to prevent wasting additional resources.

Feeling very confident in your programming prowess, you quickly slop together a function to apply your decode-md5 function to all possible character combinations. The monkey looks at you with a faint air of suspicion but shrugs his shoulders and goes back to eating a banana.

; decodes a bucket
(defn decode [bucket]
   (map decode-md5 bucket))

You decide that the best way to divide up the work load across agents will be to break the input data into buckets. Not worrying about how this will actually work, you throw all your concrete instincts to the wind.

; returns a collection of agents for a given set of work buckets
(defn spawn-agents [agents buckets]
  (if (empty? buckets)
    agents
    (recur (conj agents (agent (first buckets)))
         (rest buckets))))

The monkey can hardly contain himself. You've done all the work so far in just under three minutes (did I mention you're in a time warp?). With two minutes to spare, you take care of the initializing the data your program will use. For your initial dry run, you decide to populate your dataset with an MD5'd test string of "cloj".

; initialize string permutations
(string-perms)

; number of tasks to doll out to the agents
(def num-work-units 60)

; generate requested number of work units
(def work-units (map md5-sum (for [x (range num-work-units)] "cloj")))

; number of agents to spawn
(def num-agents 4)

; divide the units into buckets for each agent
(def work-buckets (partition (int (/ (count work-units)
                                   num-agents)) work-units))

You realize that the strategy you're using to partition work units into work buckets allows for some data loss if things don't divide perfectly but then chuckle at the ultra-controlled and incredibly contrived situation you're in.

With your task nearly complete and almost a minute left before Dr. Moreau returns, you spawn the agents and send them their jobs.

; create an agent for each bucket
(def agents (spawn-agents [] work-buckets))

; send each agent a job to do
(doseq [agent agents]
  (send agent decode))

; wait on the agents to finish
(apply await agents)

; view the results
(doseq [agent agents]
  (doseq [result @agent] (println result)))

; clean up
(shutdown-agents)

Dr. Moreau runs Linux, so you fire up top. Eager to see his incredibly evil CPU pegged at 400%, you invoke your program. Text begins streaming across the terminal, but to your complete horror only one processor is being used. Luckily, you're the world's greatest debugger, so the thirty seconds you have left is more than ample.

In this case, the problem lies in the same laziness that we took advantage of with filter earlier. In many instances, Clojure will only calculate what it believes you will need to speed up processing in the present. In some cases, this can work against you.

The decode function above passes the decode-md5 function to map with a bucket as the input data. The map function sees that you're not asking for any specific number of items, so it says, "I'll take care of it later. Baywatch is on."

When you begin requesting results from your agents, none of the work has been done. To make matters worse, you're iterating over the agents in series, so the lazy map function is being called in a completely serial fashion.

With only fifteen seconds left, the monkey begins banging on the keyboard. You're about to crawl under a table when you realize the error of your ways. If you could force the values from the map to be materialized when it's initially called, all your problems would go away (well, there's that IRS thing, but I digress...).

Invoking the Vim editor, you make a small change.

; decodes a bucket
(defn decode [bucket]
  (doall (map decode-md5 bucket)))

The doall function tells Clojure we want the work done now. With just seconds to spare, you test the revised copy. All four processors are utterly consumed as bytes begin spinning out of the ether. There's an electric charge in the monkey's eye, and the test completes rapidly.

Some additional testing shows that processing a 6,000 item dataset takes one minute twenty-six seconds with four agents, and three minutes fifteen seconds with one agent. It's also helpful to consider that Dr. Moreau's co-worker has MySQL running buckwild on the same machine, and it frequently pegs two or three cores. This makes accurate benchmarking difficult.

Triumphs aside, there's bad news. You've wasted all your time screwing around with generating characters and testing. Dr. Moreau has returned with the bone saw. As he approaches, you think back to ancient times and realize that these kinds of situations used to be reconciled without Clojure. You grab a nearby stapler and begin flogging the doctor. The monkey joins in and inflicts a painful bite. The doctor rapidly succumbs.

Exhausted after a hard day's work, you take the monkey to Ben & Jerry's for some well deserved ice cream. You enjoy a nice evening together chatting about Dr. Moreau, Clojure, and possible content for upcoming blog articles.

The full source code for the example program is available here. No animals were harmed during the development of this program.

7 comments:

  1. Thanks very much for this useful and interesting article.

    But:

    "it's" == "it is"

    "its" == "belonging to it, of it"

    ReplyDelete
  2. Great work! You really showed how all the different parts of Clojure work beautifully together.

    ReplyDelete
  3. hi i like this post , thanks!

    ReplyDelete
  4. Great post!

    @Jonathan Feinberg

    Q: Do you know what's worse than when you post useless comments with grammar corrections?

    A: When you actually don't have a clue what you're talking about in your attempt to do so.

    http://www.dummies.com/how-to/content/using-apostrophes-to-show-possession.html

    ReplyDelete