;; I often find that the default behaviour of exceptions in emacs/slime is a
;; bit annoying. Up pops a debugger window, but that moves the focus away from
;; where you're working, and the new window then needs dismissing, and has
;; often buggered up the size of other windows, etc...

;; Also, when playing with experimental clojure versions, the behaviour is a
;; bit random, and often the expression just gets swallowed, with no clue as
;; to what it was.

;; Anyway, I wrote this little macro, which turns an exception into a return
;; value:

(defmacro tryc [& e]
  `(try ~@e
        (catch Exception a# a#)))

(tryc (java.io.File "."))
;; #<ClassCastException java.lang.ClassCastException: java.lang.Class cannot be cast to clojure.lang.IFn>

;; Obviously you should only use this at the top level!
;; I find I'm using it all the time when writing code in emacs.
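Since the complaint above is that the failing expression gets swallowed, here is a hedged variation of the same idea (my own sketch, not from the post; `tryc*` is a hypothetical name): return the failing form along with the exception.

```clojure
;; A sketch of a variant that also remembers what was being evaluated.
;; On success it behaves like tryc; on failure it returns a map holding
;; the quoted form and the exception.
(defmacro tryc* [& e]
  `(try ~@e
        (catch Exception a#
          {:form '~(cons 'do e) :exception a#})))

(tryc* (+ 1 2)) ; 3
(tryc* (/ 1 0))
;; {:form (do (/ 1 0)), :exception #<ArithmeticException ...>}
```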
Sunday, January 30, 2011
Turning Exceptions into Return Values
Finding Something in a Vector, Parsing CSV Files
;; Indexing Into a Vector: A Sequence of Answers

;; The other day, I had a csv file to read (csv files are comma-separated
;; values, often produced by exporting spreadsheets; a reasonably standard
;; human-readable format for grids of data).

;; There's a lovely library which takes all the hard work out of reading csv
;; files. The library's on clojars, so you use it by adding this to your
;; maven pom.xml:

;; <dependency>
;;   <groupId>clojure-csv</groupId>
;;   <artifactId>clojure-csv</artifactId>
;;   <version>1.2.0</version>
;; </dependency>

;; and then you can use clojure-csv.core/parse-csv to read your file in as a
;; sequence. It seems to magically handle all the intricacies of all the
;; different possible csv file arrangements.

;; Thank you David Santiago, it works like a charm.

;; But then I hit a problem: I wanted to read the email addresses from every
;; record. I've rather simplified here, but let's imagine that the data after
;; parsing looked like this:

(def csv-data
  [["Name" "Email 1" "Address"       "Email 2"              "Email 3"]
   ["John" "john@mailinator.com" "3 Church St"   "xyz@learnclojure.com" ""]
   ["Fred" "fred@mailinator.com" "7 Park Street" "fred@gmail.com"       "fred@googlemail.com"]])

(def header (first csv-data))
(def data (rest csv-data))

;; As it happened, most of the email addresses were stored in the second
;; column, but some records had two or even three addresses, and it looked as
;; though the generating program might be labelling its columns according to
;; the number of email addresses it needed to output.

;; It seemed very likely that another data set might have an "Email 4"
;; column, and it seemed unwise to rely on the needed columns always being
;; 1, 3, 4 and possibly 5. What if the program introduced another field
;; entirely?

;; So obviously I needed a function to look up things in the header row
;; somehow. And there didn't seem to be one, although there were the
;; interesting functions keep-indexed and map-indexed, which I hadn't noticed
;; before.
;; And I couldn't find one. So I figured that I could either write one, or I
;; could ask on Stack Overflow. And so I did both, expecting to find that I'd
;; reinvented something in the standard library, or at least in contrib, that
;; I hadn't been able to find.

;; http://stackoverflow.com/questions/4830900/how-do-i-find-the-index-of-an-item-in-a-vector

;; So the function(s) I came up with were:

(defn indices-of [f coll]
  (keep-indexed #(if (f %2) %1 nil) coll))

(defn first-index-of [f coll]
  (first (indices-of f coll)))

(defn find-thing [value coll]
  (first-index-of #(= % value) coll))

;; And here are some examples:

(indices-of #(= "Name" %) header) ; (0)
(indices-of (partial re-matches #".*Email.*") header) ; (1 3 4)
(first-index-of #(= "Name" %) header) ; 0
(find-thing "Name" header) ; 0

;; But I was a bit nervous about these solutions, because I thought I must
;; just have reinvented some sort of wheel, and also because they're happy to
;; return answers for sets and maps, where the question doesn't really make
;; much sense.

;; fine
(find-thing "two" ["one" "two" "three" "two"]) ; 1
(find-thing "two" '("one" "two" "three")) ; 1
;; but these answers are a bit silly
(find-thing "two" #{"one" "two" "three"}) ; 1
(find-thing "two" {"one" "two" "two" "three"}) ; nil

;; But I went back to Stack Overflow, in order to answer my own question, and
;; found a couple of much better answers.

;; Brian Carper pointed out:
(.indexOf header "Name") ; 0
;; Which is clearly the answer for searching vectors.

;; And ponzao pointed out this lovely thing, originally due to Stuart
;; Halloway:
(require 'clojure.contrib.seq)
(first (clojure.contrib.seq/positions #{"Name"} header)) ; 0

;; positions is like indices-of, but using a set as the predicate is really
;; clever.
;; Anyway, now I could say:

(map #(% (.indexOf header "Name")) data) ; ("John" "Fred")
(map #(% (.indexOf header "Email 1")) data) ; ("john@mailinator.com" "fred@mailinator.com")

;; Or even, for fans of terseness and crypticity (and forgive me Lord, for I
;; am such a fan), something like:

(map #(vector
       (% (.indexOf header "Name"))
       (for [i (clojure.contrib.seq/positions (partial re-matches #"Email \d+") header)]
         (% i)))
     data)
;; (["John" ("john@mailinator.com" "xyz@learnclojure.com" "")]
;;  ["Fred" ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com")])

;; But a little later, there was another answer from cgrand, who warned me
;; against using indices on general principles. And I agree with that, so I
;; asked what I should do in the particular case of csv files. And there was
;; an answer from Alex Stoddard, which I believe is the answer to the
;; question that I should have asked.

;; We can make a map from strings to indices:
(def hmap (zipmap header (iterate inc 0)))

;; And use it like this:
(map #(% (hmap "Name")) data) ; ("John" "Fred")

;; or this:
(def emails (filter (partial re-matches #"Email \d+") header))
(map #(vector (% (hmap "Name")) (for [e emails] (% (hmap e)))) data)

;; or this:
(map #(vector (% (hmap "Name"))
              (for [e (filter (partial re-matches #"Email \d+") header)]
                (% (hmap e))))
     data)

;; to get:
;; (["John" ("john@mailinator.com" "xyz@learnclojure.com" "")]
;;  ["Fred" ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com")])

;; or even this (although now you do have to rely on the name coming before
;; the emails):
(map #(for [e (filter (partial re-matches #"Email \d+|Name") header)]
        (% (hmap e)))
     data)

;; to get:
;; (("John" "john@mailinator.com" "xyz@learnclojure.com" "")
;;  ("Fred" "fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com"))

;; If we want to abstract out a pattern, then:
(defn columns [f header data]
  (let [hmap (zipmap header (iterate inc 0))
        cols (filter f header)]
    (map #(for [e cols] (% (hmap e))) data)))
;; allows:

(columns #{"Name"} header data) ; (("John") ("Fred"))
(columns (partial re-matches #"Email \d+") header data)
;; (("john@mailinator.com" "xyz@learnclojure.com" "")
;;  ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com"))
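Another way to dodge indices entirely, offered here as my own sketch rather than anything from the post: turn each row into a map keyed by the header with zipmap, and then ordinary map lookups do the work.

```clojure
;; Build one map per record, keyed by column name. Self-contained: the data
;; is repeated from above so this snippet runs on its own.
(def csv-data
  [["Name" "Email 1" "Address"       "Email 2"              "Email 3"]
   ["John" "john@mailinator.com" "3 Church St"   "xyz@learnclojure.com" ""]
   ["Fred" "fred@mailinator.com" "7 Park Street" "fred@gmail.com"       "fred@googlemail.com"]])

(def header (first csv-data))
(def records (map #(zipmap header %) (rest csv-data)))

(map #(get % "Name") records) ; ("John" "Fred")
```

The trade-off is that you lose the column ordering, but you gain lookups that survive any rearrangement of the columns.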
Thursday, January 27, 2011
£500 if you can find me a job (version 1.0)
That worked a treat, £500 awarded to lucky winner who found me a fun short contract.
Now I'd like another one, so I repeat the offer:
If, within the next six months, I take a job which lasts longer than one month, and that is not obtained through an agency, then on the day the first cheque from that job cashes, I'll give £500 to the person who provided the crucial introduction.
If there are a number of people involved somehow, then I'll apportion it fairly between them. And if the timing conditions above are not quite met, or if someone points me at a short contract which the £500 penalty makes not worth taking, then I'll do something fair and proportional anyway. (The thing Simon pointed me at only lasted three weeks and I paid him in full anyway, because it was neat.)
And this offer applies even to personal friends, and to old contacts who I have not got round to calling yet, and to people who are themselves offering work, because why wouldn't it?
And obviously if I find one through my own efforts then I'll keep the money. But my word is generally thought to be good, and I have made a public promise on my own blog to this effect, so if I cheat you you can blacken my name and ruin my reputation for honesty, which is worth much more to me than £500.
Anyhow, my CV is at http://www.aspden.com, and any advice on how it could be improved will be gratefully received.
I'll also repeat the original advert: (jobhunt.contrib 1.0)
Anyone in Cambridge need a programmer? Obviously Clojure is a speciality, and my current obsession, but I'm also pretty good with C (especially the embedded variety), microcontrollers, and Python, and I have a particular facility with mathematical concepts and algorithms of all kinds. My obsessions can be pretty quickly changed when I need them to be.
I have a (deserved) reputation for being able to produce heavily optimised but nevertheless bug-free and readable code, but I also know how to hack together sloppy, bug-ridden prototypes, and I know which style is appropriate when, and how to slide along the continuum between them.
I've worked in telecoms, commercial research, banking, university research, a chip design company, server virtualization, a couple of startups, and occasionally completely alone.
I've worked on many sizes of machine. I've written programs for tiny 8-bit microcontrollers, and once upon a time every IBM machine in one building at Imperial College was running my partial differential equation solvers in parallel in the background.
I'm smart and I get things done. I'm confident enough in my own abilities that if I can't do something I admit it and find someone who can.
I also have various ancient and rusty skills with things like Java, C++, R, OCaml, Common LISP, Scheme, FORTRAN and Pascal which can be brushed up if necessary. Like all lispers, I occasionally write toy interpreters for made-up languages for fun.
If you're a local company using Java, who might be interested in giving Clojure a try (motivation here, in Paul Graham's classic Beating the Averages), I'd love to try to show you what all the fuss is about.
CV here if you're interested: http://www.aspden.com
I've never used a CV before, having always found work through word of mouth. So I expect that it can be improved. If anyone's got any suggestions as to how it could be better written, do please leave comments or email cv@aspden.com.
Tuesday, January 25, 2011
K-means : An Algorithm for Clustering Data
;; K-means
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; K-means is a clustering algorithm. We use it when we have some data, and
;; we want to split the data into separate categories.

;; For instance, an early biologist, let's call him Adam, might measure all
;; sorts of things about the living objects he encounters in the world:

;; (black, feathers, flies)
;; (green, tasty)
;; (green, slithers, poisonous)
;; ... and so on

;; After he collects enough data, he might go looking for structure in it.
;; Uninformed by theory, he might nevertheless notice that many things that
;; do not move are green, and that many things that are not green move.
;; He might name these two obvious groups the Animals and the Plants.

;; Further analysis of the data might split the Plants into Trees and
;; Flowers, and the Animals into Mammals, Birds, and Fish.

;; Theoretically, this process could continue further, extracting 'natural
;; categories' from the observed structure of the data, without any theory
;; about how the various properties come about.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Let's consider a very simple clustering situation. We have some numbers,
;; and we'd like to see if they form groups.

;; Suppose we want to cluster the points 2 3 5 6 10 11 100 101 102:
(def data '(2 3 5 6 10 11 100 101 102))

;; You may be able to see some natural groupings in this data.

;; It's easy enough to say how far one number is from another:
(defn distance [a b]
  (if (< a b) (- b a) (- a b)))

;; To do K-means, we need to start with some guesses about where the clusters
;; are. They don't have to be terribly good guesses.
(def guessed-means '(0 10))

;; Given a particular point, we want to know which of our means is closest:
(defn closest [point means distance]
  (first (sort-by #(distance % point) means)))

;; In our little data set, 2 is closest to the guess of 0, and 100 is closest
;; to the guess of 10:
(closest 2 guessed-means distance) ; 0
(closest 100 guessed-means distance) ; 10

;; So we can talk about the group of all points for which 0 is the best
;; guess, and the group of all points for which 10 is the best guess.
(defn point-groups [means data distance]
  (group-by #(closest % means distance) data))

(point-groups guessed-means data distance) ; {0 [2 3 5], 10 [6 10 11 100 101 102]}

;; We can take an average of a group of points:
(defn average [& list]
  (/ (reduce + list) (count list)))

(average 6 10 11 100 101 102) ; 55

;; So we can take the average of each group, and use it as a new guess for
;; where the clusters are. If a mean doesn't have a group, then we'll leave
;; it where it is.
(defn new-means [average point-groups old-means]
  (for [o old-means]
    (if (contains? point-groups o)
      (apply average (get point-groups o))
      o)))

(new-means average (point-groups guessed-means data distance) guessed-means) ; (10/3 55)

;; So if we know we've got a particular set of points, and a particular idea
;; of distance, and a way of averaging things, that gives us a way of making
;; a new list of guesses from our original list of guesses:
(defn iterate-means [data distance average]
  (fn [means]
    (new-means average (point-groups means data distance) means)))

((iterate-means data distance average) '(0 10)) ; (10/3 55)

;; and of course we can use that as a new guess, and improve it again:
((iterate-means data distance average) '(10/3 55)) ; (37/6 101)

;; We can do this repeatedly until it settles down:
(iterate (iterate-means data distance average) '(0 10))
;; ((0 10) (10/3 55) (37/6 101) (37/6 101) .....)
;; K-means with two means seems to be trying to tell us that we've got a
;; group centred around 6 and another centred around 101.

;; These groups are:
(defn groups [data distance means]
  (vals (point-groups means data distance)))

(groups data distance '(37/6 101)) ; ([2 3 5 6 10 11] [100 101 102])

;; Ideally we'd like to iterate until the groups stop changing.
;; I described a function for doing this in a previous post:
(defn take-while-unstable
  ([sq] (lazy-seq (if-let [sq (seq sq)]
                    (cons (first sq) (take-while-unstable (rest sq) (first sq))))))
  ([sq last] (lazy-seq (if-let [sq (seq sq)]
                         (if (= (first sq) last)
                           '()
                           (take-while-unstable sq))))))

(take-while-unstable '(1 2 3 4 5 6 7 7 7 7)) ; (1 2 3 4 5 6 7)

(take-while-unstable
 (map #(groups data distance %)
      (iterate (iterate-means data distance average) '(0 10))))
;; (([2 3 5] [6 10 11 100 101 102])
;;  ([2 3 5 6 10 11] [100 101 102]))

;; This shows that our first guesses group 2, 3 and 5 (nearer to 0 than 10)
;; against all the rest. K-means modifies that instantly to separate out the
;; large group of three.

;; We can make a function which takes our data, notion of distance, and
;; notion of average, and gives us back a function which, for a given set of
;; initial guesses at the means, shows us how the group memberships change:
(defn k-groups [data distance average]
  (fn [guesses]
    (take-while-unstable
     (map #(groups data distance %)
          (iterate (iterate-means data distance average) guesses)))))

(def grouper (k-groups data distance average))

(grouper '(0 10))
;; (([2 3 5] [6 10 11 100 101 102])
;;  ([2 3 5 6 10 11] [100 101 102]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Nothing we said above limits us to only having two guesses:
(grouper '(1 2 3))
;; (([2] [3 5 6 10 11 100 101 102])
;;  ([2 3 5 6 10 11] [100 101 102])
;;  ([2 3] [5 6 10 11] [100 101 102])
;;  ([2 3 5] [6 10 11] [100 101 102])
;;  ([2 3 5 6] [10 11] [100 101 102]))

;; The more means we start with, the more detailed our clustering.
(grouper '(0 1 2 3 4))
;; (([2] [3] [5 6 10 11 100 101 102])
;;  ([2] [3 5 6 10 11] [100 101 102])
;;  ([2 3] [5 6 10 11] [100 101 102])
;;  ([2 3 5] [6 10 11] [100 101 102])
;;  ([2] [3 5 6] [10 11] [100 101 102])
;;  ([2 3] [5 6] [10 11] [100 101 102]))

;; We have to be careful not to start off with too many means, or we just get
;; our data back:
(grouper (range 200))
;; (([2] [3] [100] [5] [101] [6] [102] [10] [11]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Generalizing to Other Spaces

;; In fact, nothing we said above depends on our inputs being numbers. We can
;; use any data where we can define a distance, and a method of averaging.
;; One of the easiest things to do this for would be vectors:

(defn vec-distance [a b]
  (reduce + (map #(* % %) (map - a b))))

(defn vec-average [& l]
  (map #(/ % (count l)) (apply map + l)))

(vec-distance [1 2 3] [5 6 7]) ; 48
(vec-average [1 2 3] [5 6 7]) ; (3 4 5)

;; Here's a little set of vectors:
(def vector-data '([1 2 3] [3 2 1] [100 200 300] [300 200 100] [50 50 50]))

;; And choosing three guesses in a fairly simple-minded manner, we can see
;; how the algorithm divides them into three different groups:
((k-groups vector-data vec-distance vec-average) '([1 1 1] [2 2 2] [3 3 3]))
;; (([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100] [50 50 50]])
;;  ([[1 2 3] [3 2 1] [50 50 50]] [[100 200 300] [300 200 100]])
;;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100]] [[50 50 50]]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Pedantic Footnote

;; Note that the algorithm as described above isn't quite the classic
;; K-means. I don't think the difference is terribly important, and I thought
;; it would complicate the explanation to deal with it.

;; In the usual K-means, if you have two identical means, then you're only
;; supposed to update one of them.
;; Here our two identical guesses are both getting updated:
(new-means average (point-groups '(0 0) '(0 1 2 3 4) distance) '(0 0)) ; (2 2)

;; Our update function:
(defn new-means [average point-groups old-means]
  (for [o old-means]
    (if (contains? point-groups o)
      (apply average (get point-groups o))
      o)))

;; needs to be changed so that if there are two identical means, only one of
;; them will be changed:
(defn update-seq [sq f]
  (let [freqs (frequencies sq)]
    (apply concat
           (for [[k v] freqs]
             (if (= v 1)
               (list (f k))
               (cons (f k) (repeat (dec v) k)))))))

(defn new-means [average point-groups old-means]
  (update-seq old-means
              (fn [o]
                (if (contains? point-groups o)
                  (apply average (get point-groups o))
                  o))))

;; Now only one will get updated at once:
(new-means average (point-groups '(0 0) '(0 1 2 3 4) distance) '(0 0)) ; (2 0)

;; Now we don't lose groups when the means get aliased:
((k-groups '(0 1 2 3 4) distance average) '(0 1))
;; (([0] [1 2 3 4]) ([0 1] [2 3 4]))
((k-groups '(0 1 2 3 4) distance average) '(0 0))
;; (([0 1 2 3 4]) ([0] [1 2 3 4]) ([0 1] [2 3 4]))

((k-groups vector-data vec-distance vec-average) '([1 1 1] [1 1 1] [1 1 1]))
;; (([[1 2 3] [3 2 1] [100 200 300] [300 200 100] [50 50 50]])
;;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100] [50 50 50]])
;;  ([[1 2 3] [3 2 1] [50 50 50]] [[100 200 300] [300 200 100]])
;;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100]] [[50 50 50]]))

;; Although it's still possible that a mean never acquires any points, so we
;; can still get out fewer groups than means:
((k-groups '(0 1 2 3 4) distance average) '(0 5 10))
;; (([0 1 2] [3 4]))
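The initial guesses above were picked by hand. A common K-means initialisation, which the post doesn't use, is to sample k distinct points from the data itself; here is a hedged sketch of that (`random-guesses` is my own hypothetical name, not from the post).

```clojure
;; Pick k distinct starting means by sampling the data itself. This is a
;; common K-means initialisation, not the hand-picked guesses used above.
(defn random-guesses [k data]
  (take k (distinct (shuffle (vec data)))))

(random-guesses 2 '(2 3 5 6 10 11 100 101 102))
;; e.g. (11 3) -- the result varies from run to run
```

Starting from actual data points guarantees every mean has at least one nearby point on the first iteration, which avoids the empty-group case mentioned above, at least initially.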
Cleaning Old Definitions from the REPL : shred-user
;; I often find myself restarting the clojure repl in order to get rid of old
;; definitions which introduce subtle bugs.

;; Consider:
(defn factorial [n]
  (if (< n 2) 1
      (* n (factorial (dec n))))) ; #'user/factorial

(factorial 10) ; 3628800

;; Suppose instead I wanted (factorial2 n) to be (* 1 2 2 2 3 2 4 .... 2 n)
(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial (dec n))))) ; #'user/factorial2

;; There is a subtle bug introduced by my failing to rename the recursive
;; call:
(map factorial2 '(1 2 3 4)) ; (1 4 24 192)

;; But (factorial2 4) should be (* 1 2 2 2 3 2 4) = 192

;; I suspect that there is an old definition hanging around confusing
;; matters. To find it, I can run the program from the command line, or
;; restart the repl and recompile, or:
(doseq [s (map first (ns-interns 'user))] (ns-unmap 'user s)) ; nil

(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial (dec n)))))
;; Unable to resolve symbol: factorial in this context

;; Aha!
(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial2 (dec n))))) ; #'user/factorial2

(factorial2 4) ; 192

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; For convenience I'd like to make that a function:
(defn shred-user []
  (doseq [s (map first (ns-interns 'user))]
    (ns-unmap 'user s)))

(shred-user) ; nil
(shred-user) ; Unable to resolve symbol: shred-user in this context ; bugger!

(defn shred-user []
  (doseq [s (filter (complement #{'shred-user})
                    (map first (ns-interns 'user)))]
    (ns-unmap 'user s)))

(shred-user) ; nil
(shred-user) ; nil
take-while-unstable
;; I often find myself wanting to iterate something until it converges.
(defn collatz [n]
  (if (even? n)
    (recur (/ n 2))
    (+ 1 (* 3 n))))

(iterate collatz 100) ; (100 76 58 88 34 52 40 16 4 4 4 4 4 4 ...)

;; And so I often find myself writing:
(defn take-while-unstable
  ([sq] (if (empty? sq) '()
            (cons (first sq) (#'take-while-unstable (rest sq) (first sq)))))
  ([sq last] (cond (empty? sq) '()
                   (= (first sq) last) '()
                   :else (cons (first sq)
                               (#'take-while-unstable (rest sq) (first sq))))))

;; Which allows me to stop the iteration once it's settled down:
(take-while-unstable (iterate collatz 100)) ; (100 76 58 88 34 52 40 16 4)

;; You can see how it works here:
(require 'clojure.contrib.trace)

(binding [*print-length* 20] ; taking care not to look at the medusa
  (clojure.contrib.trace/dotrace (take-while-unstable)
    (take-while-unstable (iterate collatz 100))))

;; TRACE t15884: (take-while-unstable (100 76 58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 ...))
;; TRACE t15885:  (take-while-unstable (76 58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 100)
;; TRACE t15886:   (take-while-unstable (58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 76)
;; TRACE t15887:    (take-while-unstable (88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 58)
;; TRACE t15888:     (take-while-unstable (34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 88)
;; TRACE t15889:      (take-while-unstable (52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 34)
;; TRACE t15890:       (take-while-unstable (40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 52)
;; TRACE t15891:        (take-while-unstable (16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 40)
;; TRACE t15892:         (take-while-unstable (4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 16)
;; TRACE t15893:          (take-while-unstable (4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 4)
;; TRACE t15893:          => ()
;; TRACE t15892:         => (4)
;; TRACE t15891:        => (16 4)
;; TRACE t15890:       => (40 16 4)
;; TRACE t15889:      => (52 40 16 4)
;; TRACE t15888:     => (34 52 40 16 4)
;; TRACE t15887:    => (88 34 52 40 16 4)
;; TRACE t15886:   => (58 88 34 52 40 16 4)
;; TRACE t15885:  => (76 58 88 34 52 40 16 4)
;; TRACE t15884: => (100 76 58 88 34 52 40 16 4)
;; (100 76 58 88 34 52 40 16 4)

;; Anyway, this function comes in so handy that I thought I'd lazy-ize it and
;; give it some tests:
(use 'clojure.test)

(with-test
  (defn take-while-unstable
    ([sq] (lazy-seq (if-let [sq (seq sq)]
                      (cons (first sq) (take-while-unstable (rest sq) (first sq))))))
    ([sq last] (lazy-seq (if-let [sq (seq sq)]
                           (if (= (first sq) last)
                             '()
                             (take-while-unstable sq))))))
  (is (= (take-while-unstable '()) '()))
  (is (= (take-while-unstable '(1)) '(1)))
  (is (= (take-while-unstable '(1 1)) '(1)))
  (is (= (take-while-unstable '(1 2)) '(1 2)))
  (is (= (take-while-unstable '(1 1 2)) '(1)))
  (is (= (take-while-unstable '(1 1 1)) '(1)))
  (is (= (take-while-unstable '(1 1 2 1)) '(1)))
  (is (= (take-while-unstable '(1 2 3 4 5 6 7 7 7 7)) '(1 2 3 4 5 6 7)))
  (is (= (take 10000 (take-while-unstable (range))) (take 10000 (range))))
  (is (= (take-while-unstable (concat (take 10000 (range)) '(10000) (drop 10000 (range))))
         (range 10001)))
  (is (= (take-while-unstable '(nil)) '(nil)))
  (is (= (take-while-unstable '(nil nil)) '(nil)))
  (is (= (take-while-unstable '[nil nil]) '(nil)))
  (is (= (take-while-unstable [:a :b :c :d :d :e]) '(:a :b :c :d))))

(run-tests) ; {:type :summary, :test 1, :pass 14, :fail 0, :error 0}

;; Hopefully someone will find it useful. Can anyone break it?
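One property worth checking explicitly (my addition, not from the post): because the rewritten version wraps everything in lazy-seq, it can pull a finite prefix from an infinite, never-stabilising sequence without hanging. The definition is repeated here so the snippet stands alone.

```clojure
;; The lazy definition from above, repeated for self-containment.
(defn take-while-unstable
  ([sq] (lazy-seq (if-let [sq (seq sq)]
                    (cons (first sq) (take-while-unstable (rest sq) (first sq))))))
  ([sq last] (lazy-seq (if-let [sq (seq sq)]
                         (if (= (first sq) last)
                           '()
                           (take-while-unstable sq))))))

;; An infinite strictly-increasing sequence never settles down, but laziness
;; means we can still take a prefix:
(take 5 (take-while-unstable (iterate inc 0))) ; (0 1 2 3 4)
```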
Monday, January 17, 2011
A Very Gentle Introduction to Information Theory: Guessing the Entropy
;; A Very Gentle Introduction to Information Theory : Part IV

;; Entropy and Huffman Coding

;; Once again, I'll keep some code from the earlier parts, without
;; explanation.

;; Probability distributions
(defn combine-keywords [& a]
  (keyword (apply str (mapcat #(drop 1 (str %)) a))))

(defn combine-distributions
  ([P] P)
  ([P1 P2] (into {} (for [[s1 p1] P1 [s2 p2] P2]
                      [(combine-keywords s1 s2) (* p1 p2)])))
  ([P1 P2 & Plist] (reduce combine-distributions (cons P1 (cons P2 Plist)))))

;; Coding and Decoding
(defn decoder
  ([code-tree stream] (decoder code-tree code-tree stream))
  ([current-code-tree code-tree stream]
     (lazy-seq
      (if (keyword? current-code-tree)
        (cons current-code-tree (decoder code-tree code-tree stream))
        (if-let [stream (seq stream)]
          (if (= (first stream) 1)
            (decoder (first current-code-tree) code-tree (rest stream))
            (decoder (second current-code-tree) code-tree (rest stream))))))))

(defn encoder [code stream] (mapcat code stream))

(defn make-encoder [code] (fn [s] (encoder code s)))
(defn make-decoder [code-tree] (fn [s] (decoder code-tree s)))

;; Huffman encoding
(defn symbols [prefix code-tree]
  (if (keyword? code-tree)
    (list prefix code-tree)
    (concat (symbols (cons 1 prefix) (first code-tree))
            (symbols (cons 0 prefix) (second code-tree)))))

(defn make-code [code-tree]
  (into {} (map (fn [[c s]] [s (reverse c)])
                (partition 2 (symbols '() code-tree)))))

;; The original make-code-tree was very slow, because it re-sorted the list
;; every iteration. We can speed it up considerably by using a priority queue
;; instead of sorting the list every iteration. Clojure doesn't have one
;; built in, so here's a poor man's version built out of a sorted map of
;; lists.
(defn madd [m [k p]] (assoc m p (cons k (m p))))

(defn mpop [m]
  (let [[k vlist] (first m)]
    [k (first vlist)
     (if (empty? (rest vlist))
       (dissoc m k)
       (assoc m k (rest vlist)))]))

(defn mcombine [m]
  (let [[pa a pop1] (mpop m)
        [pb b pop2] (mpop pop1)]
    (madd pop2 [[a b] (+ pa pb)])))

(defn make-code-tree [P]
  (second (mpop (nth (iterate mcombine (reduce madd (sorted-map) P))
                     (dec (count P))))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(def fair-coin {:H 1 :T 1})
(def unfair-coin {:H 3 :T 1})

(def unfair-triples (combine-distributions unfair-coin unfair-coin unfair-coin))

(def triple-code-tree (make-code-tree unfair-triples))
(def triple-code (make-code triple-code-tree))

;; We don't need to estimate the cost of a code by generating a long random
;; stream and transmitting it. Given a probability distribution and a code,
;; we can just calculate the expected cost.

;; We make a distribution over the transmitted symbols:
(defn code-distribution [P code]
  (for [s (keys P)] [(P s) (code s)]))

;; e.g.
(code-distribution unfair-triples triple-code)
;; ([27 (0)] [9 (1 1 1)] [9 (1 1 0)] [3 (1 0 0 1 0)] [9 (1 0 1)]
;;  [3 (1 0 0 0 1)] [3 (1 0 0 0 0)] [1 (1 0 0 1 1)])

;; And from that, it's easy to calculate the expected length of a sequence:
(defn expected-code-length [P code]
  (let [cd (code-distribution P code)]
    (/ (reduce + (for [[k v] cd] (* k (count v))))
       (reduce + (map first cd)))))

;; So the expected cost per symbol is:
(expected-code-length unfair-triples triple-code) ; 79/32

;; or per coin-toss:
(float (/ 79/32 3)) ; 0.8229167

;; So we can get the noise out of our table:
(defn cost-for-n-code [P n]
  (let [Pn (apply combine-distributions (repeat n P))
        code (make-code (make-code-tree Pn))]
    (float (/ (expected-code-length Pn code) n))))

(cost-for-n-code unfair-coin 1) ; 1.0
(cost-for-n-code unfair-coin 2) ; 0.84375
(cost-for-n-code unfair-coin 3) ; 0.8229167
(cost-for-n-code unfair-coin 4) ; 0.8183594
(cost-for-n-code unfair-coin 5) ; 0.81777346
(cost-for-n-code unfair-coin 6) ; 0.8186849
(cost-for-n-code unfair-coin 7) ; 0.81685966
(cost-for-n-code unfair-coin 8) ; 0.81575775
(cost-for-n-code unfair-coin 9) ; 0.81493336
(cost-for-n-code unfair-coin 10) ; 0.8141917
(cost-for-n-code unfair-coin 11) ; 0.8137328
(cost-for-n-code unfair-coin 12) ; 0.81351095

;; It looks like something is converging, although the convergence isn't monotonic.
;; I'm now revising my estimate of the cost of sending the results of a 1:3 process to be about 0.813

;; But we don't know whether that's the limit of the coding process, or a
;; property of the 1:3 distribution, or whether it's in some way specific to
;; transmitting over a binary channel.

;; Let's look at some other distributions.

;; For the fair coin distribution, Huffman coding triples doesn't help at all.
(cost-for-n-code fair-coin 1) ; 1.0
(cost-for-n-code fair-coin 2) ; 1.0
(cost-for-n-code fair-coin 3) ; 1.0
(cost-for-n-code fair-coin 4) ; 1.0

;; But for an even choice between three things, it does:
(def triad {:A 1 :B 1 :C 1})

(cost-for-n-code triad 1) ; 1.6666666
(cost-for-n-code triad 2) ; 1.6111112
(cost-for-n-code triad 3) ; 1.6049383
(cost-for-n-code triad 4) ; 1.6049383
(cost-for-n-code triad 5) ; 1.5893004
(cost-for-n-code triad 6) ; 1.5992227
(cost-for-n-code triad 7) ; 1.5895878
(cost-for-n-code triad 8) ; 1.5939262

;; For a choice between four things, it makes no difference
(def quad {:A 1 :B 1 :C 1 :D 1})

(cost-for-n-code quad 1) ; 2.0
(cost-for-n-code quad 2) ; 2.0
(cost-for-n-code quad 3) ; 2.0
(cost-for-n-code quad 4) ; 2.0
(cost-for-n-code quad 5) ; 2.0

;; For five it's a good thing to do
(def quint {:A 1 :B 1 :C 1 :D 1 :E 1})

(cost-for-n-code quint 1) ; 2.4
(cost-for-n-code quint 2) ; 2.36
(cost-for-n-code quint 3) ; 2.3253334
(cost-for-n-code quint 4) ; 2.3404
(cost-for-n-code quint 5) ; 2.337856
(cost-for-n-code quint 6) ; 2.3252373

;; And again, for the next power of two, no difference.
(def octet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1 :G 1 :H 1})

(cost-for-n-code octet 1) ; 3.0
(cost-for-n-code octet 2) ; 3.0
(cost-for-n-code octet 3) ; 3.0

;; I think that we might have guessed that it would take three bits to decide
;; between eight equally likely things, and two bits for four things, but what
;; about the other numbers?
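The value the table is creeping towards looks suspiciously like the Shannon entropy of the 3:1 distribution. This is my own aside, not part of the original post; the `entropy` function below is the standard -Σ p·log₂ p, computed from a weight map like the ones used throughout this series:

```clojure
;; Sketch (my addition): Shannon entropy of a weight map, -sum(p * log2 p).
(defn log2 [x] (/ (Math/log x) (Math/log 2)))

(defn entropy [P]
  (let [total (reduce + (vals P))]
    (- (reduce + (for [w (vals P)
                       :let [p (/ w total)]]
                   (* p (log2 p)))))))

(entropy {:H 3 :T 1}) ; ≈ 0.8113
```

0.8113 sits just below every number in the table, which is at least consistent with the guess that it is the limit of the process.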
;; If 8 = 2*2*2 -> 3, and 4 = 2*2 -> 2, and 2 -> 1, what's the easiest pattern
;; we can fit to that?
(defn bits [n] (/ (Math/log n) (Math/log 2))) ; Also known as log2

(map bits (range 2 10))
;; (1.0 1.5849625007211563 2.0 2.321928094887362 2.584962500721156 2.807354922057604 3.0 3.1699250014423126)

;; So let's make a prediction
(def sextet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1})

(bits 6) ; 2.584962500721156

(cost-for-n-code sextet 1) ; 2.6666667
(cost-for-n-code sextet 2) ; 2.6111112
(cost-for-n-code sextet 3) ; 2.6049383
(cost-for-n-code sextet 4) ; 2.6049383
(cost-for-n-code sextet 5) ; 2.5893004
(cost-for-n-code sextet 6) ; 2.5992227

;; It looks as though the cost of coding an even distribution using Huffman
;; encoding of runs is pretty close to being the logarithm (to base 2) of the
;; number of symbols.
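That closing observation can be checked against the general entropy formula (again my aside, not the post's): for n equally likely symbols, the sum -Σ (1/n)·log₂(1/n) over the n symbols collapses to exactly log₂ n, which is what `bits` computes.

```clojure
;; Sketch (my addition): for a uniform distribution over n symbols,
;; the n-term sum -sum((1/n) * log2(1/n)) equals log2 n.
(defn log2 [x] (/ (Math/log x) (Math/log 2)))

(defn uniform-entropy [n]
  (- (reduce + (repeat n (* (/ 1.0 n) (log2 (/ 1.0 n)))))))

(uniform-entropy 6) ; ≈ 2.585, the same as (bits 6)
```

So the pattern the tables are showing for triad, quint, sextet and octet is just the entropy of the uniform distribution.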
Saturday, January 15, 2011
A Very Gentle Introduction to Information Theory: Huffman Coding
;; A Very Gentle Introduction to Information Theory : Part III
;; Entropy and Huffman Coding

;; Once again, I'll keep some code from the first two parts, without explanation

(defn random-stream [P]
  (let [pseq (vec (mapcat (fn [[k v]] (repeat v k)) P))]
    (for [i (range)] (rand-nth pseq))))

(defn cost [encoder decoder message]
  (let [coded (encoder message)]
    (if (= (decoder coded) message)
      (count coded)
      :fail)))

(defn decoder
  ([code-tree stream]
     (decoder code-tree code-tree stream))
  ([current-code-tree code-tree stream]
     (lazy-seq
      (if (keyword? current-code-tree)
        (cons current-code-tree (decoder code-tree code-tree stream))
        (if-let [stream (seq stream)]
          (if (= (first stream) 1)
            (decoder (first current-code-tree) code-tree (rest stream))
            (decoder (second current-code-tree) code-tree (rest stream))))))))

(defn encoder [code stream]
  (mapcat code stream))

(defn make-encoder [code] (fn [s] (encoder code s)))

(defn make-decoder [code-tree] (fn [s] (decoder code-tree s)))

(defn combine-keywords [& a]
  (keyword (apply str (mapcat #(drop 1 (str %)) a))))

(defn split-keyword [a]
  (map #(keyword (str %)) (drop 1 (str a))))

(defn make-combination-encoder [code n]
  (fn [s] (encoder code (map #(apply combine-keywords %) (partition n s)))))

(defn make-combination-decoder [code-tree]
  (fn [s] (mapcat split-keyword (decoder code-tree s))))

;; So far we've looked at three probability distributions:
(def fair-coin    {:H 1 :T 1})
(def unfair-coin  {:H 3 :T 1})
(def unfair-pairs {:HH 9, :HT 3, :TH 3, :TT 1})

;; And two codes:
(def fair-code-tree [:H :T])
(def fair-code {:H '(1) :T '(0)})

(def unfair-code-tree [:HH [:HT [:TH :TT]]])
(def unfair-code {:HH '(1) :HT '(0 1) :TH '(0 0 1) :TT '(0 0 0)})

;; We should add a fourth probability distribution to represent pairs of fair
;; coin toss results
(def fair-pairs {:HH 1 :HT 1 :TH 1 :TT 1})

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; We found that
(defn estimate-cost [P encoder decoder]
  (let [n 100000
        c (cost encoder decoder (take n (random-stream P)))]
    (if (number? c) (float (/ c n)) c)))

;; Using the best code we can think of for the fair coin resulted in a
;; transmission cost of 1 (£/symbol)
(estimate-cost fair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; And that that was also the cost for the unfair coin with this code:
(estimate-cost unfair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; But that we could come up with a code for pairs of coin tosses
;; which did substantially better for pairs of unfair coin tosses
(estimate-cost unfair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 1.68338

;; but substantially worse for pairs of fair coin tosses
(estimate-cost fair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 2.24722

;; Remember that that's the transmission cost per symbol, and that each symbol
;; represents two coin tosses.

;; In case you think there's any sleight of hand going on there, here's how
;; we'd use the pairs code to transmit the original unpaired streams
(estimate-cost unfair-coin (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 0.84561
(estimate-cost fair-coin   (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 1.12407

;; Notice that the costs here are per-toss, showing that the unfair code is
;; actually an improvement

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Now it seems that, if we can send the results of unfair-coin more
;; efficiently by considering {:HH 9, :HT 3, :TH 3, :TT 1}, the distribution
;; of pairs of tosses, then we should have a look at the distribution of
;; triples, and work out a code for that:

(defn combine-distributions
  ([P] P)
  ([P1 P2]
     (into {}
           (for [[s1 p1] P1
                 [s2 p2] P2]
             [(combine-keywords s1 s2) (* p1 p2)])))
  ([P1 P2 & Plist]
     (reduce combine-distributions (cons P1 (cons P2 Plist)))))

(def unfair-triples (combine-distributions unfair-coin unfair-coin unfair-coin))
;; unfair-triples is {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}
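It's worth pausing to see why those triple weights come out as they do (this check is my addition): combine-distributions just multiplies the single-toss weights, so for instance :HHT gets 3·3·1 = 9, and the eight weights sum to 4³ = 64. The names `w` and `triple-weight` below are mine, for illustration only.

```clojure
;; Sketch: recomputing the triple weights by hand as products of the
;; single-toss weights {:H 3 :T 1}. w and triple-weight are my own names.
(def w {:H 3 :T 1})

(defn triple-weight [[a b c]]
  (* (w a) (w b) (w c)))

(triple-weight [:H :H :H]) ; 27
(triple-weight [:H :H :T]) ; 9
(triple-weight [:T :T :T]) ; 1

;; All eight weights together sum to 4 * 4 * 4 = 64:
(reduce + (for [a [:H :T] b [:H :T] c [:H :T]]
            (triple-weight [a b c]))) ; 64
```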
;; Now how should we work out a code for this distribution?

;; Huffman tells us that we should combine the two lowest-probability events,
;; so that {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}
;; and then do it again, so that {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, {:THT 3, {:TTH 3, :TTT 1} 4} 7}
;; and so on .....

(defn huffman-combine [P]
  (let [plist (sort-by second P)
        new-element (into {} (take 2 plist))]
    (into {} (cons [new-element (reduce + (vals new-element))]
                   (drop 2 plist)))))

(nth (iterate huffman-combine unfair-triples) (dec (dec (count unfair-triples))))
;; {{{:HHT 9, :HTH 9} 18, {:THH 9, {{:TTT 1, :HTT 3} 4, {:THT 3, :TTH 3} 6} 10} 19} 37, :HHH 27}

;; At the end, we get a sort of nested binary probability distribution.
;; You could think of this as being a way to generate the triples by tossing
;; strangely biased coins!

;; From that, we can generate our code tree directly by just throwing away
;; the numbers
(require 'clojure.walk)

(defn make-code-tree [P]
  (clojure.walk/postwalk
   #(if (map? %) (into [] (map first %)) %)
   (nth (iterate huffman-combine P) (dec (dec (count P))))))

(def triple-code-tree (make-code-tree unfair-triples))
;; [[[:HHT :HTH] [:THH [[:TTT :HTT] [:THT :TTH]]]] :HHH]

;; If we have the decoder, then we can use it to generate the coder!
(defn symbols [prefix code-tree]
  (if (keyword? code-tree)
    (list prefix code-tree)
    (concat (symbols (cons 1 prefix) (first code-tree))
            (symbols (cons 0 prefix) (second code-tree)))))

(defn make-code [code-tree]
  (into {}
        (map (fn [[c s]] [s (reverse c)])
             (partition 2 (symbols '() code-tree)))))

(def triple-code (make-code triple-code-tree))
;; {:HHT (0 0 0), :HTH (0 0 1), :THH (0 1 0), :TTT (0 1 1 0 0), :HTT (0 1 1 0 1), :THT (0 1 1 1 0), :TTH (0 1 1 1 1), :HHH (1)}

;; Let's see how this does
(estimate-cost unfair-triples (make-encoder triple-code) (make-decoder triple-code-tree)) ; 2.4615

;; £2.46 per symbol, or about 82p per toss.

;; So while going from single tosses to pairs allowed us to go from 1 -> 0.85,
;; going from pairs to triples only got us from 0.85 -> 0.82.

;; Is there an end to this process?
(defn bit-rate [P n]
  (let [Pn (apply combine-distributions (repeat n P))
        tree (make-code-tree Pn)]
    (/ (estimate-cost Pn (make-encoder (make-code tree)) (make-decoder tree)) n)))

(bit-rate unfair-coin 1)  ; 1.0
(bit-rate unfair-coin 2)  ; 0.844435
(bit-rate unfair-coin 3)  ; 0.82466
(bit-rate unfair-coin 4)  ; 0.8196275
(bit-rate unfair-coin 5)  ; 0.818912
(bit-rate unfair-coin 6)  ; 0.81896996
(bit-rate unfair-coin 7)  ; 0.8166514
(bit-rate unfair-coin 8)  ; 0.815995
(bit-rate unfair-coin 9)  ; 0.81352335
(bit-rate unfair-coin 10) ; 0.81462896
(bit-rate unfair-coin 11) ; 0.8137691

;; To me, this is at least suggestive that there might be something fundamental
;; about a cost of 82p to transmit the result of a 3:1 random choice over a
;; binary channel.
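One property of the Huffman construction worth making explicit (this check is mine, using a minimal greedy decoder of my own rather than the post's tree-walking one): because no codeword is a prefix of another, a decoder can always peel exactly one codeword off the front of the bit stream and recover the message unambiguously. The names `decode-map` and `greedy-decode` are hypothetical, for illustration.

```clojure
;; Sketch: round-trip through the triple code with a tiny greedy decoder.
;; code here is a copy of triple-code; decode-map/greedy-decode are my names.
(def code {:HHH '(1)         :HHT '(0 0 0)     :HTH '(0 0 1)     :THH '(0 1 0)
           :TTT '(0 1 1 0 0) :HTT '(0 1 1 0 1) :THT '(0 1 1 1 0) :TTH '(0 1 1 1 1)})

;; Invert the code: bit-sequence -> symbol
(def decode-map (into {} (map (fn [[s c]] [c s]) code)))

;; Try successively longer prefixes until one matches a codeword,
;; then recurse on the rest of the stream.
(defn greedy-decode [bitstream]
  (when (seq bitstream)
    (some (fn [n]
            (when-let [s (decode-map (take n bitstream))]
              (cons s (greedy-decode (drop n bitstream)))))
          (range 1 (inc (count bitstream))))))

(let [msg [:HHH :TTT :HHT]]
  (= (greedy-decode (mapcat code msg)) msg)) ; true
```

The prefix property is exactly what the nested code tree guarantees, which is why the post's decoder never needs separators between codewords.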