Tuesday, June 2, 2009

Markov Chaining

A friend of mine wrote a very nice Markov Chainer in Python a while back. I had a lot of fun playing with it and decided to try my hand at rewriting it in Clojure. This proved to be trickier than I expected as it was my first Clojure program beyond simple factorial and fibonacci generators, but I learned a lot in the process. The main issue I ran into was my own misunderstanding of how Clojure adds elements to it's various sequence types.

It's fast to add elements to the head of a list and the tail of a vector. Clojure takes advantage of this by adding elements to the optimal position of a sequence based on it's type. Being totally oblivious to this fact, I spent quite a while trying to pin down the difference in behavior between my Python reference implementation and my own version. Once I discovered the source of the program, I had a classic hacker aha moment and a great deal of satisfaction.

Note that this program stays in the realm of immutable data 100% of the time. I believe this may be the reason that it's much slower than the Python version, but I could be wrong. For whatever reason, there's a pretty large speed difference between the two programs. If anyone has any suggestions, I'll be happy to try them.

(ns markov 
  (use clojure.contrib.str-utils)) 

(set! *warn-on-reflection* true)

(defn flatten 
  "Takes any nested combination of sequential things (lists, vectors, 
  etc.) and returns their contents as a single, flat sequence. 
  (flatten nil) returns nil." 
  [x] 
  (filter (complement sequential?) 
          (rest (tree-seq sequential? seq x))))

(defn rand-elt 
  "Return a random element of this seq" 
  [s] 
  (nth s (rand-int (count s)))) 

(defn clean [txt] 
  "clean given txt for symbols disruptive to markov chains" 
  (let [new-txt (re-gsub #"[:;,^\"()]" "" txt) 
        new-txt (re-gsub #"'(?!(d|t|ve|m|ll|s|de|re))" "" new-txt)] new-txt)) 

(defn chain-lengths [markov-chain] 
  "return a set of lengths for each element in the collection" 
  (let [markov-keys (map keys markov-chain)] 
    (set (for [x markov-keys] (count x))))) 

(defn max-chain-length [markov-chain] 
  "return the length lf the longest chain" 
  (apply max (chain-lengths markov-chain))) 

(defn chain 
  "Take a list of words and build a markov chain out of them. 
  The length is the size of the key in number of words." 
  ([words] 
   (chain words 3)) 
  ([words length] 
   (loop [markov-chain {} 
          keychain (for [x (range length)] nil) 
          words (map clean words)] 
     (let [first-word (first words)] 
       (if (seq words) 
         (recur (assoc markov-chain keychain 
                       (cons first-word (get markov-chain keychain))) 
                (concat (rest keychain) [first-word]) 
                (rest words)) 
         (assoc markov-chain keychain [])))))) 

(defn split-sentence [text] 
  "Convert a string to a collection on common boundaries" 
  (filter seq (re-split #"[,.!?()\d]+\s*" text))) 

(defn file-chain 
  "Create a markov chain from the contents of a given file" 
  ([file] 
   (file-chain file 3)) 
  ([file length] 
   (let [sentences (split-sentence (slurp file)) 
         flatten-list (fn [& x] (flatten (list x)))] 
     (loop [markov-chain {} words sentences] 
       (if (seq words) 
         (recur (merge-with flatten-list 
                            markov-chain 
                            (chain (re-split #"\s+" (first words)))) 
                (rest words)) 
         markov-chain))))) 

(defn construct-sentence 
   "Build a sentence from a markov chain structure.  Given a 
   Markov chain (any size key),  Seed (to start the sentence) and 
   Proc (a function for choosing the next word), returns a sentence 
   composed until is reaches the end of a chain (an end of sentence)." 
  ([markov-chain] 
   (construct-sentence markov-chain nil rand-elt)) 
  ([markov-chain seed] 
   (construct-sentence markov-chain seed rand-elt)) 
  ([markov-chain seed proc] 
   (loop [words (if seed seed (rand-elt (keys markov-chain))) 
          sentence (str-join " " (filter identity words))] 
     (if (seq (markov-chain words)) 
       (let [word-new (proc (markov-chain words))] 
         (recur (concat (rest words) [word-new]) 
                (str-join " " (into [sentence] [word-new])))) 
       sentence))))
Here's some example usage:
(ns main (use markov))
(def markov (file-chain "corpus.txt"))
(doseq [x (range 100)]
  (doseq [x (range 3)] (println (construct-sentence markov)))
  (println))

2 comments:

  1. Cool, looks good. It's always fun to see how much you can do with a little amount of code in Clojure. I had a go at this myself back in January, you can see the code on my blog.

    ReplyDelete
  2. Hi, i am looking for a markov rewriter myself. can you pls give the python code. my website is http://www.chainsrewriting.com

    ReplyDelete