tag:blogger.com,1999:blog-7056990295646173627.post30840472733820404..comments2020-01-23T16:40:31.968+00:00Comments on Learning Clojure: Levenshtein Distance (Edit Distance, String Difference)John Lawrence Aspdenhttp://www.blogger.com/profile/02587130870181071109noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-7056990295646173627.post-52926487993089656772010-11-23T01:02:00.136+00:002010-11-23T01:02:00.136+00:00Brenton,
It never occurred to me that I'd wan...Brenton,<br /><br />It never occurred to me that I'd want to compare strings long enough to blow the stack, but in fact you only seem to need them to be 140 characters long:<br /><br />(let [n 140] (levenshtein-distance (apply str (repeat n \a)) (apply str (repeat n \b))))<br /><br />I wonder why? I would have thought it was good for thousands before the stack went. Presumably these 140 character strings want to create a stack 280 deep.<br /><br />There is something very wrong here. Morally, those function calls are only storing two pointers, three integers and a return address. I wonder what they are actually storing! (And why the hell the JVM doesn't just make a bigger stack if it needs one).<br /><br />Still, whine, whine, blame the tools. I should have checked that. I wonder if there's some sort of general memoizing-trampoline thingy that can be used for these sorts of problems.<br /><br />Your diff algorithm sounds interesting. I will investigate.<br /><br />What do you think the practical limit on speed is? I can just about imagine filling in a 1000x1000 grid by hand if I really, really needed to.John Lawrence Aspdenhttps://www.blogger.com/profile/02587130870181071109noreply@blogger.comtag:blogger.com,1999:blog-7056990295646173627.post-78945838475066066572010-11-22T23:16:58.373+00:002010-11-22T23:16:58.373+00:00Duncan,
Well spotted.
It's a recursive func...Duncan, <br /><br />Well spotted.<br /><br />It's a recursive function call, but clojure would normally compile that to a direct jump rather than going through the variable. If I use #' then it forces the call to go through the var, and that means that memoize works as I intended. Try it with and without the #'.<br /><br />It might look a bit clearer if you try to write a fibonacci function that can calculate fib(50) in finite time. <br /><br />You might also try to work out why recur wouldn't work here. <br /><br />Clue: It will not work for the standard factorial 'hello world', but if you massage the factorial function to use an accumulator, it will.John Lawrence Aspdenhttps://www.blogger.com/profile/02587130870181071109noreply@blogger.comtag:blogger.com,1999:blog-7056990295646173627.post-28067481694158877462010-11-22T19:56:52.864+00:002010-11-22T19:56:52.864+00:00John,
It is a beautiful algorithm. I love when pr...John,<br /><br />It is a beautiful algorithm. I love when problems can be solved like this. Unfortunately it will stack overflow with large sequences.<br /><br />I am working on a diff library for Clojure which currently includes diff, patch and edit-distance functions. Levenshtein distance is not exactly the same as edit distance but the latter can be used for comparing the difference between sequences in much the same way.<br /><br />My focus has been on performance. For example: for a sequence of 1000 elements with 100 changes it can calculate the edit-distance in ~70 milliseconds (it actually generates a diff and then gets the edit distance from that). Incanter's algorithm takes 34 seconds and yours overflows.<br /><br />I thought you might be interested.<br /><br />http://github.com/brentonashworth/clj-diff<br /><br />I may add a levenshtein-distance function in the future as it can also be calculated from a diff.<br /><br />BrentonBrenton Ashworthhttps://www.blogger.com/profile/13919036386288728360noreply@blogger.comtag:blogger.com,1999:blog-7056990295646173627.post-13018250946241533542010-11-22T18:38:36.314+00:002010-11-22T18:38:36.314+00:00Is (inc (#'levenshtein-distance ...) the same ...Is (inc (#'levenshtein-distance ...) the same as (recur ...)?Duncanhttps://www.blogger.com/profile/13843763027842460609noreply@blogger.comtag:blogger.com,1999:blog-7056990295646173627.post-56164712369039501452010-11-22T16:26:56.251+00:002010-11-22T16:26:56.251+00:00Looks more traditional, certainly.
But I like the...Looks more traditional, certainly.<br /><br />But I like these 'memoizations to turn a tree into a table' type solutions.<br /><br />I can think off the top of my head of fibonnacci, pascal's triangle, change counting, and levenshtein.<br /><br />Anyone else have any nice ones?<br /><br />One where the table was an evil shape so that it couldn't be easily rephrased would be sweet.John Lawrence Aspdenhttps://www.blogger.com/profile/02587130870181071109noreply@blogger.comtag:blogger.com,1999:blog-7056990295646173627.post-66553450118592402342010-11-22T16:12:52.699+00:002010-11-22T16:12:52.699+00:00There's also an implementation of levenshtein-...There's also an implementation of levenshtein-distance in incanter.stats if you want to compare:<br /><br />https://github.com/liebke/incanter/blob/master/modules/incanter-core/src/incanter/stats.cljAnonymousnoreply@blogger.com