It's fast to add elements to the head of a list and the tail of a vector. Clojure takes advantage of this by adding elements to the optimal position of a sequence based on it's type. Being totally oblivious to this fact, I spent quite a while trying to pin down the difference in behavior between my Python reference implementation and my own version. Once I discovered the source of the program, I had a classic hacker aha moment and a great deal of satisfaction.
Note that this program stays in the realm of immutable data 100% of the time. I believe this may be the reason that it's much slower than the Python version, but I could be wrong. For whatever reason, there's a pretty large speed difference between the two programs. If anyone has any suggestions, I'll be happy to try them.
(ns markov
(use clojure.contrib.str-utils))
(set! *warn-on-reflection* true)
(defn flatten
"Takes any nested combination of sequential things (lists, vectors,
etc.) and returns their contents as a single, flat sequence.
(flatten nil) returns nil."
[x]
(filter (complement sequential?)
(rest (tree-seq sequential? seq x))))
(defn rand-elt
"Return a random element of this seq"
[s]
(nth s (rand-int (count s))))
(defn clean [txt]
"clean given txt for symbols disruptive to markov chains"
(let [new-txt (re-gsub #"[:;,^\"()]" "" txt)
new-txt (re-gsub #"'(?!(d|t|ve|m|ll|s|de|re))" "" new-txt)] new-txt))
(defn chain-lengths [markov-chain]
"return a set of lengths for each element in the collection"
(let [markov-keys (map keys markov-chain)]
(set (for [x markov-keys] (count x)))))
(defn max-chain-length [markov-chain]
"return the length lf the longest chain"
(apply max (chain-lengths markov-chain)))
(defn chain
"Take a list of words and build a markov chain out of them.
The length is the size of the key in number of words."
([words]
(chain words 3))
([words length]
(loop [markov-chain {}
keychain (for [x (range length)] nil)
words (map clean words)]
(let [first-word (first words)]
(if (seq words)
(recur (assoc markov-chain keychain
(cons first-word (get markov-chain keychain)))
(concat (rest keychain) [first-word])
(rest words))
(assoc markov-chain keychain []))))))
(defn split-sentence [text]
"Convert a string to a collection on common boundaries"
(filter seq (re-split #"[,.!?()\d]+\s*" text)))
(defn file-chain
"Create a markov chain from the contents of a given file"
([file]
(file-chain file 3))
([file length]
(let [sentences (split-sentence (slurp file))
flatten-list (fn [& x] (flatten (list x)))]
(loop [markov-chain {} words sentences]
(if (seq words)
(recur (merge-with flatten-list
markov-chain
(chain (re-split #"\s+" (first words))))
(rest words))
markov-chain)))))
(defn construct-sentence
"Build a sentence from a markov chain structure. Given a
Markov chain (any size key), Seed (to start the sentence) and
Proc (a function for choosing the next word), returns a sentence
composed until is reaches the end of a chain (an end of sentence)."
([markov-chain]
(construct-sentence markov-chain nil rand-elt))
([markov-chain seed]
(construct-sentence markov-chain seed rand-elt))
([markov-chain seed proc]
(loop [words (if seed seed (rand-elt (keys markov-chain)))
sentence (str-join " " (filter identity words))]
(if (seq (markov-chain words))
(let [word-new (proc (markov-chain words))]
(recur (concat (rest words) [word-new])
(str-join " " (into [sentence] [word-new]))))
sentence))))
Here's some example usage:
(ns main (use markov)) (def markov (file-chain "corpus.txt")) (doseq [x (range 100)] (doseq [x (range 3)] (println (construct-sentence markov))) (println))
Cool, looks good. It's always fun to see how much you can do with a little amount of code in Clojure. I had a go at this myself back in January, you can see the code on my blog.
ReplyDeleteHi, i am looking for a markov rewriter myself. can you pls give the python code. my website is http://www.chainsrewriting.com
ReplyDelete