;; A Very Gentle Introduction to Information Theory : Part III

;; Entropy and Huffman Coding

;; Once again, I'll keep some code from the first two parts, without explanation

(defn random-stream [P]
  (let [pseq (vec (mapcat (fn [[k v]] (repeat v k)) P))]
    (for [i (range)] (rand-nth pseq))))

(defn cost [encoder decoder message]
  (let [coded (encoder message)]
    (if (= (decoder coded) message)
      (count coded)
      :fail)))

(defn decoder
  ([code-tree stream] (decoder code-tree code-tree stream))
  ([current-code-tree code-tree stream]
     (lazy-seq
      (if (keyword? current-code-tree)
        (cons current-code-tree (decoder code-tree code-tree stream))
        (if-let [stream (seq stream)]
          (if (= (first stream) 1)
            (decoder (first current-code-tree)  code-tree (rest stream))
            (decoder (second current-code-tree) code-tree (rest stream))))))))

(defn encoder [code stream]
  (mapcat code stream))

(defn make-encoder [code]
  (fn [s] (encoder code s)))

(defn make-decoder [code-tree]
  (fn [s] (decoder code-tree s)))

(defn combine-keywords [& a]
  (keyword (apply str (mapcat #(drop 1 (str %)) a))))

(defn split-keyword [a]
  (map #(keyword (str %)) (drop 1 (str a))))

(defn make-combination-encoder [code n]
  (fn [s] (encoder code (map #(apply combine-keywords %) (partition n s)))))

(defn make-combination-decoder [code-tree]
  (fn [s] (mapcat split-keyword (decoder code-tree s))))

;; So far we've looked at three probability distributions:
(def fair-coin    {:H 1 :T 1})
(def unfair-coin  {:H 3 :T 1})
(def unfair-pairs {:HH 9, :HT 3, :TH 3, :TT 1})

;; And two codes:
(def fair-code-tree [:H :T])
(def fair-code {:H '(1) :T '(0)})

(def unfair-code-tree [:HH [:HT [:TH :TT]]])
(def unfair-code {:HH '(1) :HT '(0 1) :TH '(0 0 1) :TT '(0 0 0)})

;; We should add a fourth probability distribution, to represent pairs of fair coin toss results:
(def fair-pairs {:HH 1 :HT 1 :TH 1 :TT 1})

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; We found that

(defn estimate-cost [P encoder decoder]
  (let [n 100000
        c (cost encoder decoder (take n (random-stream P)))]
    (if (number? c) (float (/ c n)) c)))

;; Using the best code we can think of for the fair coin resulted in a transmission cost of 1 (£/symbol):
(estimate-cost fair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; And that that was also the cost for the unfair coin with this code:
(estimate-cost unfair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; But that we could come up with a code for pairs of coin tosses
;; which did substantially better for pairs of unfair coin tosses
(estimate-cost unfair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 1.68338

;; but substantially worse for pairs of fair coin tosses
(estimate-cost fair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 2.24722

;; Remember that that's the transmission cost per symbol, and that each symbol represents two coin tosses.

;; In case you think there's any sleight of hand going on there, here's how we'd
;; use the pairs code to transmit the original unpaired streams:
(estimate-cost unfair-coin (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 0.84561
(estimate-cost fair-coin   (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 1.12407

;; Notice that the costs here are per toss, showing that the unfair code is actually an improvement.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Now it seems that, if we can send the results of unfair-coin more efficiently
;; by considering {:HH 9, :HT 3, :TH 3, :TT 1}, the distribution of pairs of
;; tosses, then we should have a look at the distribution of triples, and work
;; out a code for that:

(defn combine-distributions
  ([P] P)
  ([P1 P2]
     (into {} (for [[s1 p1] P1
                    [s2 p2] P2]
                [(combine-keywords s1 s2) (* p1 p2)])))
  ([P1 P2 & Plist]
     (reduce combine-distributions (cons P1 (cons P2 Plist)))))

(def unfair-triples (combine-distributions unfair-coin unfair-coin unfair-coin))
;; unfair-triples is {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}
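;; [An added cross-check, not part of the original post.] Rather than estimating
;; the cost of the pairs code from a 100,000-toss random stream, we can compute
;; it exactly from the weights and the code-word lengths: the cost per symbol is
;; just the weighted average of (count code-word).

```clojure
;; Exact expected cost per symbol of a code under a weighted distribution.
;; (An added sanity check; this helper is not from the original post.)
(defn expected-cost [P code]
  (let [total (reduce + (vals P))]
    (float (/ (reduce + (for [[sym weight] P]
                          (* weight (count (code sym)))))
              total))))

;; The pairs distribution and code from above, written out literally:
(expected-cost {:HH 9, :HT 3, :TH 3, :TT 1}
               {:HH '(1), :HT '(0 1), :TH '(0 0 1), :TT '(0 0 0)})
;; => 1.6875  (i.e. 27/16 per pair)
```

;; 27/16 = 1.6875 agrees nicely with the stream estimate of 1.68338 per pair,
;; and halving it gives the 0.84-ish per-toss figure.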
;; Now how should we work out a code for this distribution?

;; Huffman tells us that we should combine the two lowest probability events,

;; so that {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}

;; and then do it again, so that {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, {:THT 3, {:TTH 3, :TTT 1} 4} 7}

;; and so on .....

(defn huffman-combine [P]
  (let [plist (sort-by second P)
        new-element (into {} (take 2 plist))]
    (into {} (cons [new-element (reduce + (vals new-element))] (drop 2 plist)))))

(nth (iterate huffman-combine unfair-triples) (dec (dec (count unfair-triples))))
;; {{{:HHT 9, :HTH 9} 18, {:THH 9, {{:TTT 1, :HTT 3} 4, {:THT 3, :TTH 3} 6} 10} 19} 37, :HHH 27}

;; At the end, we get a sort of nested binary probability distribution.
;; You could think of this as being a way to generate the triples by tossing strangely biased coins!

;; From that, we can generate our code tree directly by just throwing away the numbers:
(require 'clojure.walk)

(defn make-code-tree [P]
  (clojure.walk/postwalk
   #(if (map? %) (into [] (map first %)) %)
   (nth (iterate huffman-combine P) (dec (dec (count P))))))

(def triple-code-tree (make-code-tree unfair-triples))
;; [[[:HHT :HTH] [:THH [[:TTT :HTT] [:THT :TTH]]]] :HHH]

;; If we have the decoder, then we can use it to generate the coder!
;; Note that, to match the decoder above, a 1 must select the first branch
;; of the tree and a 0 the second.
(defn symbols [prefix code-tree]
  (if (keyword? code-tree)
    (list prefix code-tree)
    (concat (symbols (cons 1 prefix) (first code-tree))
            (symbols (cons 0 prefix) (second code-tree)))))

(defn make-code [code-tree]
  (into {} (map (fn [[c s]] [s (reverse c)]) (partition 2 (symbols '() code-tree)))))

(def triple-code (make-code triple-code-tree))
;; {:HHT (1 1 1), :HTH (1 1 0), :THH (1 0 1), :TTT (1 0 0 1 1),
;;  :HTT (1 0 0 1 0), :THT (1 0 0 0 1), :TTH (1 0 0 0 0), :HHH (0)}

;; Let's see how this does:
(estimate-cost unfair-triples (make-encoder triple-code) (make-decoder triple-code-tree)) ; 2.4615

;; £2.46 per symbol, or 82p per toss.

;; So while going from single tosses to pairs allowed us to go from 1 -> 0.85,
;; going from pairs to triples only allowed us to get from 0.85 -> 0.82.

;; Is there an end to this process?

(defn bit-rate [P n]
  (let [Pn (apply combine-distributions (repeat n P))
        tree (make-code-tree Pn)]
    (/ (estimate-cost Pn (make-encoder (make-code tree)) (make-decoder tree)) n)))

(bit-rate unfair-coin 1)  ; 1.0
(bit-rate unfair-coin 2)  ; 0.844435
(bit-rate unfair-coin 3)  ; 0.82466
(bit-rate unfair-coin 4)  ; 0.8196275
(bit-rate unfair-coin 5)  ; 0.818912
(bit-rate unfair-coin 6)  ; 0.81896996
(bit-rate unfair-coin 7)  ; 0.8166514
(bit-rate unfair-coin 8)  ; 0.815995
(bit-rate unfair-coin 9)  ; 0.81352335
(bit-rate unfair-coin 10) ; 0.81462896
(bit-rate unfair-coin 11) ; 0.8137691

;; To me, this is at least suggestive that there might be something fundamental
;; about a cost of 82p to transmit the result of a 3:1 random result over a
;; binary channel.
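;; [An added check, anticipating the "entropy" of the title; the original post
;; leaves the limit as a teaser.] The number the bit-rates above seem to be
;; creeping towards is the Shannon entropy of the 3:1 distribution,
;; H = - (sum of p * log2 p) over the outcomes:

```clojure
;; Shannon entropy, in bits per symbol, of a weighted distribution.
;; (This helper is an addition, not from the original post.)
(defn entropy [P]
  (let [total (reduce + (vals P))]
    (reduce + (for [w (vals P)
                    :let [p (/ w total)]]
                (- (* p (/ (Math/log p) (Math/log 2))))))))

(entropy {:H 3 :T 1})
;; => ~0.8112781
```

;; That is consistent with the bit-rates levelling off at a little over 0.81,
;; which is a hint at why no amount of further blocking gets below about 82p.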
;; Saturday, January 15, 2011