Search This Blog

Monday, November 22, 2010

Levenshtein Distance (Edit Distance, String Difference)

Jeff Foster's excellent blog FatVat is www.fatvat.co.uk

Brenton Ashworth points out below that this blows stack on quite short strings. My bad. I've learned to avoid using recursion for infinite loops, but it really never occurred to me that a couple of hundred self calls would blow stack on anything larger than a digital watch.

I am shame. 

Now the JVM is open source, can we mend it? I would quite like to be able to change directory too.


;; The other day I needed a way of measuring the distance between two strings

;; Levenshtein distance is one such way:

;; avada kedavra
;; abada kedavra
;; abrada kedavra
;; abrada cedavra
;; abrada cadavra
;; abrada cadabra
;; abraa cadabra
;; abra cadabra
;; abracadabra

;; Eight changes get you from one to the other

;; Thank you google, wikipedia and Jeff Foster's excellent blog fatvat (http://www.fatvat.co.uk/)

(defn levenshtein-distance
  "Calculates the edit-distance between two sequences"
  [seq1 seq2]
  (cond
   (empty? seq1) (count seq2)
   (empty? seq2) (count seq1)
   :else (min
          (+ (if (= (first seq1) (first seq2)) 0 1)
             (#'levenshtein-distance (rest seq1) (rest seq2))) 
          (inc (#'levenshtein-distance (rest seq1) seq2))      
          (inc (#'levenshtein-distance seq1 (rest seq2))))))

(levenshtein-distance "avada kedavra" "abracadabra") ;; much grinding......

;; We cast a spell more powerful than either:
(def levenshtein-distance (memoize levenshtein-distance))

(levenshtein-distance "avada kedavra" "abracadabra") ; ok, it was seven.

6 comments:

  1. There's also an implementation of levenshtein-distance in incanter.stats if you want to compare:

    https://github.com/liebke/incanter/blob/master/modules/incanter-core/src/incanter/stats.clj

    ReplyDelete
  2. Looks more traditional, certainly.

    But I like these 'memoizations to turn a tree into a table' type solutions.

    I can think off the top of my head of fibonnacci, pascal's triangle, change counting, and levenshtein.

    Anyone else have any nice ones?

    One where the table was an evil shape so that it couldn't be easily rephrased would be sweet.

    ReplyDelete
  3. Is (inc (#'levenshtein-distance ...) the same as (recur ...)?

    ReplyDelete
  4. John,

    It is a beautiful algorithm. I love when problems can be solved like this. Unfortunately it will stack overflow with large sequences.

    I am working on a diff library for Clojure which currently includes diff, patch and edit-distance functions. Levenshtein distance is not exactly the same as edit distance but the latter can be used for comparing the difference between sequences in much the same way.

    My focus has been on performance. For example: for a sequence of 1000 elements with 100 changes it can calculate the edit-distance in ~70 milliseconds (it actually generates a diff and then gets the edit distance from that). Incanter's algorithm takes 34 seconds and yours overflows.

    I thought you might be interested.

    http://github.com/brentonashworth/clj-diff

    I may add a levenshtein-distance function in the future as it can also be calculated from a diff.

    Brenton

    ReplyDelete
  5. Duncan,

    Well spotted.

    It's a recursive function call, but clojure would normally compile that to a direct jump rather than going through the variable. If I use #' then it forces the call to go through the var, and that means that memoize works as I intended. Try it with and without the #'.

    It might look a bit clearer if you try to write a fibonacci function that can calculate fib(50) in finite time.

    You might also try to work out why recur wouldn't work here.

    Clue: It will not work for the standard factorial 'hello world', but if you massage the factorial function to use an accumulator, it will.

    ReplyDelete
  6. Brenton,

    It never occurred to me that I'd want to compare strings long enough to blow the stack, but in fact you only seem to need them to be 140 characters long:

    (let [n 140] (levenshtein-distance (apply str (repeat n \a)) (apply str (repeat n \b))))

    I wonder why? I would have thought it was good for thousands before the stack went. Presumably these 140 character strings want to create a stack 280 deep.

    There is something very wrong here. Morally, those function calls are only storing two pointers, three integers and a return address. I wonder what they are actually storing! (And why the hell the JVM doesn't just make a bigger stack if it needs one).

    Still, whine, whine, blame the tools. I should have checked that. I wonder if there's some sort of general memoizing-trampoline thingy that can be used for these sorts of problems.

    Your diff algorithm sounds interesting. I will investigate.

    What do you think the practical limit on speed is? I can just about imagine filling in a 1000x1000 grid by hand if I really, really needed to.

    ReplyDelete

Followers