;; A Very Gentle Introduction to Information Theory : Part IV ;; Entropy and Huffman Coding ;; Once again, I'll keep some code from the earlier parts, without explanation ;; Probability distributions (defn combinekeywords [& a] (keyword (apply str (mapcat #(drop 1 (str %)) a)))) (defn combinedistributions ([P] P) ([P1 P2] (into {} (for [[s1 p1] P1 [s2 p2] P2] [(combinekeywords s1 s2) (* p1 p2)]))) ([P1 P2 & Plist] (reduce combinedistributions (cons P1 (cons P2 Plist))))) ;; Coding and Decoding (defn decoder ([codetree stream] (decoder codetree codetree stream)) ([currentcodetree codetree stream] (lazyseq (if (keyword? currentcodetree) (cons currentcodetree (decoder codetree codetree stream)) (iflet [stream (seq stream)] (if (= (first stream) 1) (decoder (first currentcodetree) codetree (rest stream)) (decoder (second currentcodetree) codetree (rest stream)))))))) (defn encoder [code stream] (mapcat code stream)) (defn makeencoder [code] (fn [s] (encoder code s))) (defn makedecoder [codetree] (fn[s] (decoder codetree s))) ;; Huffman encoding (defn symbols [prefix codetree] (if (keyword? codetree) (list prefix codetree) (concat (symbols (cons 1 prefix) (first codetree)) (symbols (cons 0 prefix) (second codetree))))) (defn makecode [codetree] (into {} (map (fn[[c s]][s (reverse c)]) (partition 2 (symbols '() codetree))))) ;; The original makecodetree was very slow, because it resorted the list every ;; iteration. We can speed it up considerably by using a priority queue instead ;; of sorting the list every iteration. Clojure doesn't have one built in, so ;; here's a poor man's version built out of a sorted map of lists. (defn madd [m [k p]] (assoc m p (cons k (m p)))) (defn mpop [m] (let [[k vlist] (first m)] [k (first vlist) (if (empty? (rest vlist)) (dissoc m k) (assoc m k (rest vlist)))])) (defn mcombine [m] (let [[pa a pop1] (mpop m) [pb b pop2] (mpop pop1)] (madd pop2 [[a b] (+ pa pb)]))) (defn makecodetree [P] (second (mpop (nth (iterate mcombine (reduce madd (sortedmap) P)) (dec (count P)))))) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; (def faircoin {:H 1 :T 1}) (def unfaircoin {:H 3 :T 1}) (def unfairtriples (combinedistributions unfaircoin unfaircoin unfaircoin)) (def triplecodetree (makecodetree unfairtriples)) (def triplecode (makecode triplecodetree)) ;; We don't need to estimate the cost of a code by generating a long random stream and transmitting it. ;; Given a probability distribution and a code, we can just calculate the expected cost: ;; We make a distribution over the transmitted symbols (defn codedistribution [P code] (for [s (keys P)] [(P s) (code s)])) ;; e.g. (codedistribution unfairtriples triplecode) ;;([27 (0)] [9 (1 1 1)] [9 (1 1 0)] [3 (1 0 0 1 0)] [9 (1 0 1)] [3 (1 0 0 0 1)] [3 (1 0 0 0 0)] [1 (1 0 0 1 1)]) ;; And from that, it's easy to calculate the expected length of a sequence (defn expectedcodelength [P code] (let [cd (codedistribution P code)] (/ (reduce + (for [[k v] cd] (* k (count v)))) (reduce + (map first cd))))) ;; So the expected cost per symbol is: (expectedcodelength unfairtriples triplecode) ;79/32 ;; or per cointoss: (float ( / 79/32 3)) ; 0.8229167 ;; So we can get the noise out of our table: (defn costforncode [ P n ] (let [Pn (apply combinedistributions (repeat n P)) code (makecode (makecodetree Pn))] (float (/ (expectedcodelength Pn code) n)))) (costforncode unfaircoin 1) ; 1.0 (costforncode unfaircoin 2) ; 0.84375 (costforncode unfaircoin 3) ; 0.8229167 (costforncode unfaircoin 4) ; 0.8183594 (costforncode unfaircoin 5) ; 0.81777346 (costforncode unfaircoin 6) ; 0.8186849 (costforncode unfaircoin 7) ; 0.81685966 (costforncode unfaircoin 8) ; 0.81575775 (costforncode unfaircoin 9) ; 0.81493336 (costforncode unfaircoin 10) ; 0.8141917 (costforncode unfaircoin 11) ; 0.8137328 (costforncode unfaircoin 12) ; 0.81351095 ;; It looks like something is converging, although the convergence isn't monotonic. ;; I'm now revising my estimate of the cost of sending the results of a 1:3 process to be about 0.813 ;; But we don't know whether that's the limit of the coding process, or a property of the 1:3 distribution, ;; or whether it's in some way specific to transmitting over a binary channel. ;; Let's look at some other distributions. ;; For the fair coin distribution, huffman coding triples doesn't help at all. (costforncode faircoin 1) ; 1.0 (costforncode faircoin 2) ; 1.0 (costforncode faircoin 3) ; 1.0 (costforncode faircoin 4) ; 1.0 ;; But for an even choice between three things, it does: (def triad {:A 1 :B 1 :C 1}) (costforncode triad 1) ; 1.6666666 (costforncode triad 2) ; 1.6111112 (costforncode triad 3) ; 1.6049383 (costforncode triad 4) ; 1.6049383 (costforncode triad 5) ; 1.5893004 (costforncode triad 6) ; 1.5992227 (costforncode triad 7) ; 1.5895878 (costforncode triad 8) ; 1.5939262 ;; For a choice between four things, it makes no difference (def quad {:A 1 :B 1 :C 1 :D 1}) (costforncode quad 1) ; 2.0 (costforncode quad 2) ; 2.0 (costforncode quad 3) ; 2.0 (costforncode quad 4) ; 2.0 (costforncode quad 5) ; 2.0 ;; For five it's a good thing to do (def quint {:A 1 :B 1 :C 1 :D 1 :E 1}) (costforncode quint 1) ; 2.4 (costforncode quint 2) ; 2.36 (costforncode quint 3) ; 2.3253334 (costforncode quint 4) ; 2.3404 (costforncode quint 5) ; 2.337856 (costforncode quint 6) ; 2.3252373 ;; And again, for the next power of two, no difference. (def octet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1 :G 1 :H 1}) (costforncode octet 1) ; 3.0 (costforncode octet 2) ; 3.0 (costforncode octet 3) ; 3.0 ;; I think that we might have guessed that it would take three bits to decide between ;; eight equally likely things, and two bits for four things, but what about the other numbers? ;; If 8 = 2*2*2 > 3, and 4 = 2*2 > 2, and 2 > 1, what's the easiest pattern we can fit to that? (defn bits [n] (/ (Math/log n)(Math/log 2))) ;; Also known as log2 (map bits (range 2 10)) ; (1.0 1.5849625007211563 2.0 2.321928094887362 2.584962500721156 2.807354922057604 3.0 3.1699250014423126) ;; So let's make a prediction (def sextet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1}) (bits 6) ; 2.584962500721156 (costforncode sextet 1) ; 2.6666667 (costforncode sextet 2) ; 2.6111112 (costforncode sextet 3) ; 2.6049383 (costforncode sextet 4) ; 2.6049383 (costforncode sextet 5) ; 2.5893004 (costforncode sextet 6) ; 2.5992227 ;; It looks as though the cost of coding an even distribution using huffman ;; encoding of runs is pretty close to being the logarithm (to base 2) of the number of symbols.
Blog Archive

▼
2011
(26)

▼
January
(10)
 Turning Exceptions into Return Values
 Finding Something in a Vector, Parsing CSV Files
 £500 if you can find me a job (version 1.0)
 Kmeans : An Algorithm for Clustering Data
 Cleaning Old Definitions from the REPL : shreduser
 takewhileunstable
 A Very Gentle Introduction to Information Theory: ...
 A Very Gentle Introduction to Information Theory: ...
 A Very Gentle Introduction to Information Theory :...
 A Very Gentle Introduction to Information Theory :...

▼
January
(10)
Search This Blog
Monday, January 17, 2011
A Very Gentle Introduction to Information Theory: Guessing the Entropy
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment