;; A Very Gentle Introduction to Information Theory : Part I ;; Entropy and Huffman Coding ;; What is the 'information content' of a random process? ;; We might think about a Victorian bookmaker transmitting horse racing results ;; over an expensive telegraph connection. A more modern example would be ;; streaming files over an internet connection. ;; To think about the essence of these things, let us choose very simple models ;; for both the random process and the channel over which the message is to be ;; sent. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; Setting up a model problem ;; Our communications channel will be a device over which we can transmit either ;; 0 or 1, paying a cost of £1 for every symbol. ;; Our random number generators will make arbitrary symbols from a known set, ;; one a second, with simple integer odds. ;; One example might produce the symbols A B and C, with A:B:C in a 2:1:1 ratio. ;; Sample output from this particular random number generator might be ;; BACAAAABCAACBCAACAAABAABCBAAACBC.... ;; And our challenge is to send a message down our channel that will allow our ;; friend at the other end to reconstruct the random stream. ;; We'd be interested in minimizing the cost, but not at the expense of accuracy. ;; If the random number generator is someone repeatedly tossing a fair coin, then we can say: (def faircoin {:H 1 :T 1}) (defn randomstream [P] (let [pseq (vec (mapcat (fn[[k v]](repeat v k )) P))] (for [i (range)] (randnth pseq)))) ;; And if the fair coin were to come up with: (def coinstream (randomstream faircoin)) (take 20 coinstream) ; (:T :H :H :T :H :H :T :H :T :T :T :T :H :T :T :T :T :H :T :T) ;; We might want to transmit the results by sending 1 for tails and 0 for heads (defn coincoder [sq] (map #(if (= % :T) 0 1) sq)) (take 20 (coincoder (randomstream faircoin))) ; (1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0) ;; Our friend on the other end, with whom we have agreed this scheme, might decode like this: (defn coindecoder [sq] (map #(if (= % 0) :T :H) sq)) (take 20 (coindecoder (coincoder coinstream))) ; (:T :H :H :T :H :H :T :H :T ; :T :T :T :H :T :T :T :T :H ; :T :T) ;; And finally, the world might judge our scheme thus: (defn cost [encoder decoder message] (let [coded (encoder message)] (if (= (decoder coded) message) (count coded) :fail))) (cost coincoder coindecoder (take 200000 coinstream)) ; £200000 ;; Under this scheme, our message gets through accurately, and we have spent ;; £200000 to transmit 200000 symbols. ;; If anyone can think of a better encoding scheme, please let me know. Until I ;; see a counterexample, I'll take the information content of a fair coin under ;; these conditions to be £1/symbol. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; A 'less random' random generator (def unfaircoin {:H 3 :T 1}) (def unfairstream (randomstream unfaircoin)) (take 20 unfairstream) ; (:T :H :H :T :H :H :H :H :H :H :H :H :H :H :T :H :H :H ; :H :H) ;; It seems as though there might be less information in the unfair coin tosses. ;; Why do I say this? ;; Well, memory is a kind of channel. You put something in, and later you get ;; something out, and it costs you something to hold the memory. ;; And the 20 tosses of the unfair coin above seem easier to remember than 20 ;; tosses of the fair coin would be. ;; It's quite hard to make this intuition precise, though. After all, all HT ;; sequences are still possible. But it does seem as though we should be able ;; to send these sorts of streams through our channel for less cost than the ;; fair coin streams, ON AVERAGE. ;; To think about how that might work, let's consider what the stream above ;; looks like when split into pairs. (take 20 (partition 2 unfairstream)) ;;((:T :H) (:H :T) (:H :H) (:H :H) (:H :H) ;;(:H :H) (:H :H) (:T :H) (:H :H) (:H :H) ;;(:T :H) (:H :H) (:H :H) (:T :H) (:H :H) ;;(:T :H) (:H :T) (:T :H) (:T :H) (:T :H)) (frequencies (take 16000 (partition 2 unfairstream))) ;; {(:T :H) 2981, (:H :T) 3083, (:H :H) 8964, (:T :T) 972} ;; Because the frequencies of the underlying coin are distorted 3:1, the ;; frequencies of the pairs are even more distorted, it looks like 1:3:3:9 ;; A little bird is telling me that the next random number generator we try to ;; code should be: (def unfairpairs {:HH 9, :HT 3, :TH 3, :TT 1}) (def unfairpairstream (randomstream unfairpairs)) (take 15 unfairpairstream ) ; (:HH :TH :HH :HH :HT :HT :HH :HT :HH :TH :TH :HH ; :HH :TH :HT) ;; How might we go about encoding this to send through our channel? ;; If we do it the obvious way: (defn paircoder [sq] (mapcat #(case % :HH '(0 0) :HT '(0 1) :TH '(1 0) :TT '(1 1)) sq)) (take 20 (paircoder unfairpairstream)) ; (0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0) (defn pairdecoder [sq] (map #(case % '(0 0) :HH '(0 1) :HT '(1 0) :TH '(1 1) :TT) (partition 2 sq))) (take 20 (pairdecoder (paircoder unfairpairstream))) ; (:HH :TH :HH :HH :HT :HT :HH :HT :HH :TH :TH :HH :HH :TH :HT :HH :HH :HT :HH :TH) (cost paircoder pairdecoder (take 200000 unfairpairstream)) ; £400000 ;; Then we can see that it costs £2 per symbol to transmit the data now, which ;; sounds about right, since we made each symbol out of a pair of the previous ;; symbols, so we could use them as a cheaper transmission method if they cost ;; any less. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; Sneaking more data down the telegraph ;; As Samuel Morse knew, when transmitting in binary, we should use shorter ;; sequences for more common letters. ;; Let's try HH > 1, HT >01 TH>001, TT> 000 (defn variablelengthpaircoder [sq] (mapcat #(case % :HH '(1) :HT '(0 1) :TH '(0 0 1) :TT '(0 0 0)) sq)) (take 20 (variablelengthpaircoder unfairpairstream)) ; (1 0 0 1 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1) ;; decoding this is trickier (defn variablelengthpairdecoder [sq] (lazyseq (iflet [sq (seq sq)] (if (= (first sq) 1) (cons :HH (variablelengthpairdecoder (rest sq))) (iflet [sq2 (seq (rest sq))] (if (= (first sq2) 1) (cons :HT (variablelengthpairdecoder (rest sq2))) (iflet [sq3 (seq (rest sq2))] (if (= (first sq3) 1) (cons :TH (variablelengthpairdecoder (rest sq3))) (cons :TT (variablelengthpairdecoder (rest sq3))))))))))) ;; but it can be done: (take 20 (variablelengthpairdecoder (variablelengthpaircoder unfairpairstream))) ;;(:HH :TH :HH :HH :HT :HT :HH :HT :HH :TH :TH :HH :HH :TH :HT :HH :HH :HT :HH :TH) ;; and it saves on the number of transmitted symbols: (cost variablelengthpaircoder variablelengthpairdecoder (take 200000 unfairpairstream)) ; £337179 ;; For this particular stream, with this particular code, we're now only paying ;; £ 1.69 per symbol, as opposed to £2 per symbol for the naive code. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Blog Archive

▼
2011
(26)

▼
January
(10)
 Turning Exceptions into Return Values
 Finding Something in a Vector, Parsing CSV Files
 £500 if you can find me a job (version 1.0)
 Kmeans : An Algorithm for Clustering Data
 Cleaning Old Definitions from the REPL : shreduser
 takewhileunstable
 A Very Gentle Introduction to Information Theory: ...
 A Very Gentle Introduction to Information Theory: ...
 A Very Gentle Introduction to Information Theory :...
 A Very Gentle Introduction to Information Theory :...

▼
January
(10)
Search This Blog
Thursday, January 13, 2011
A Very Gentle Introduction to Information Theory : Setting the Scene
I've been reading David Mackay's wonderful book 'Information Theory, Inference, and Learning Algorithms', and writing a lot of clojure code as I go.
I thought I might try to explain the concept of Entropy, or Information Content, as it's used in Communications Theory.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment