Wednesday, September 25, 2013

Efficiency and Progress: My God, Clojure is Slow!

A warning: this post is considered stupid and harmful by a large number of people.

From the comments:

Anonymous May 28, 2014 at 7:06 AM

I love that this post shows up so highly in Clojure searches. It's a somber reminder to not write about things I know nothing about (let alone with a proclamation like the author uses).

And my reply:

Refute me then, and I will make another post saying I was wrong. I wrote this after spending two days trying to get a Clojure program to complete while doing an algorithms course. I eventually rewrote it in C, and it ran in a few seconds on the same hardware.

Once upon a time I wrote posts claiming that Clojure was very fast if you wrote it cleverly. No one complained then, and they got a lot of publicity. Nowadays I've lost the knack. If anything, it seems that compromises made to speed up performance have resulted in un-optimizability, and I understand that the official advice is to use Java for the tight loops.

But I have no wish to damage the public reputation of Clojure, which I love and use all the time. What search makes this post show up on the front page?





;; Efficiency and Progress
;; Are ours once again
;; Now that we have the neut-ron bomb
;; It's nice and quick and clean and ge-ets things done...

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; When you program in Clojure, you get the raw speed of assembler.

;; Unfortunately, that is, assembler on a ZX81, running a Z80 processor at 4MHz in 1981.

;; If anything, that comparison is unfair to my old ZX81. Does anyone
;; remember '3D Invaders', a fast and exciting first person shooter /
;; flight simulator that ran in 1K of RAM *including memory for the
;; screen*?

;; Once upon a time, I had the knack of making clojure run at the same
;; speed as Java, which is not far off the same speed as C, which is
;; not far off the speed of the sort of hand-crafted machine code which
;; no-one in their right mind ever writes, in these degenerate latter
;; days which we must reluctantly learn to call the future.

;; But I seem to have lost the knack. Can anyone show me what I am doing wrong?

;; At any rate, it isn't too hard to get it to run at something like
;; the real speed of the machine, as long as you're prepared to write
;; code that is more like Java or C than Clojure.

;; So here are some thoughts about how to do this.

;; Which I offer up only as a basis for discussion, and not in any way
;; meaning to stir up controversy, or as flame-bait or ammunition for
;; trolls or anything of that sort.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Clojure is very slow:

(time (reduce + (map + (range 1000000) (range 1000000))))
"Elapsed time: 5316.638869 msecs"
;-> 999999000000

;; The greater part of its slowness seems to be to do with lazy sequences

(time (def seqa (doall (range 1000000))))
"Elapsed time: 3119.468963 msecs"
(time (def seqb (doall (range 1000000))))
"Elapsed time: 2839.593429 msecs"

(time (reduce + (map + seqa seqb)))
"Elapsed time: 3558.975552 msecs"
;-> 999999000000

;; It looks as though making a new sequence is the expensive bit
(time (doall (map + seqa seqb)))
"Elapsed time: 3612.553803 msecs"
;-> (0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 ...)

;; Just adding things up is way faster
(time (reduce + seqa))
"Elapsed time: 315.717033 msecs"
;-> 499999500000


;; I wondered if there was a way of avoiding lazy-seqs

(time (def veca (vec seqa)))
"Elapsed time: 470.512696 msecs"

(time (def vecb (vec seqb)))
"Elapsed time: 374.796054 msecs"

;; After all, 'use the right data structure for the problem' is pretty much lesson 1, and if vectors are not a good data structure
;; for this problem, then what is?

;; But it seems that despite the speed of making the vectors, it doesn't help much when we do our thing.
;; In fact it's a bit slower
(time (reduce + (mapv + veca vecb)))
"Elapsed time: 4329.070268 msecs"
;-> 999999000000

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; So let's say 3600ms to add together two arrays of 1000000 elements and sum the result.

;; In C on the same machine (my little netbook with its 1.66GHz Atom
;; and 512kb cache) this seems to take 16 ms, being 8ms for the map
;; and 8 ms for the reduce.  I'm assuming that that time is mostly
;; spent waiting on the main memory, but I may be wrong. Who knows how
;; these things are done?

(/ 3600 16) ;-> 225

;; So shall we call this a 225x slowdown for the natural expression in
;; the two languages of mapping and reducing?

(time (reduce + seqa))
"Elapsed time: 358.152249 msecs"
;-> 499999500000

;; If we just look at the reduction, then that's 
(/ 358 8.) ;-> 44.75


;; So around 50x
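
;; One way to claw most of that back (a sketch of my own, not
;; necessarily the best way: it abandons sequences entirely for a
;; loop over primitive longs with unchecked arithmetic):

```clojure
;; The same computation as (reduce + (map + (range n) (range n))),
;; i.e. the sum of i+i for i in [0, n), but with no sequences
;; and no boxed numbers.
(defn sum-doubled ^long [^long n]
  (loop [i 0 acc 0]
    (if (< i n)
      (recur (unchecked-inc i) (unchecked-add acc (unchecked-add i i)))
      acc)))

(time (sum-doubled 1000000))
;; -> 999999000000
```

;; Of course, at that point the code is more Java-in-parentheses than Clojure.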


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

31 comments:

  1. If you're longing for something closer to C speed *and* Clojure-like expressiveness, consider Common Lisp. Here are some timings on my laptop:

    ;; clojure
    (time (reduce + (map + (range 1000000) (range 1000000))))
    ;; "Elapsed time: 223.477 msecs"

    ;; common lisp (sbcl)
    (time (reduce #'+ (mapcar #'+ (range 1000000) (range 1000000))))
    ;; 0.035791 seconds

    ... which uses a made-up-on-the-spot non-lazy definition of range:

    (defun range (max &key (min 0) (step 1))
      (loop for n from min below max by step
            collect n))

    Really, one would use loop directly for this sort of thing:

    ;; common lisp, idiomatic
    (time (loop for n from 1 to 1000000 sum (+ n n)))
    ;; 0.003 seconds
    ;; 133.33% CPU
    ;; 10,000,804 processor cycles
    ;; 0 bytes consed

    Replies
    1. Interesting. I tried this using Armed Bear Common Lisp (abcl) vs sbcl vs clojure on my laptop, essentially to see to what extent this was a JVM issue (abcl is Common Lisp for the JVM). Results as follows:

      sbcl:
      * (time (reduce #'+ (mapcar #'+ (range 1000000) (range 1000000))))

      Evaluation took:
      0.172 seconds of real time
      0.172011 seconds of total run time (0.152009 user, 0.020002 system)
      [ Run times consist of 0.044 seconds GC time, and 0.129 seconds non-GC time. ]
      100.00% CPU
      273,285,104 processor cycles
      80,019,424 bytes consed

      abcl:
      CL-USER(4): (time (reduce #'+ (mapcar #'+ (range 1000000) (range 1000000))))
      4.515 seconds real time
      3000026 cons cells

      Clojure 1.5.1
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 919.203758 msecs"

      I honestly don't see poor performance from Clojure. OK, so it's five times slower than the mature and highly optimised Steel Bank Common Lisp, but it's five times *faster* than ABCL.

      And when you start comparing this to your ZX81, you're simply forgetting what the performance of those things was like. In 1986 I had a Xerox 1108 which ran InterLisp natively on custom optimised-for-lisp hardware. It could compute (factorial 1000) in 22 minutes, and we thought that was pretty fast. sbcl on my current laptop does that computation in one one hundredth of a second. A ZX81 couldn't do any of these things, of course, because it hadn't the address space to handle any of the computations we're talking about; but it processed four million eight bit instructions per second, or less than the equivalent of one million 32 bit instructions. My current laptop, with four cores running at 3GHz, is delivering twelve thousand million 64 bit instructions per second, or about twenty-five thousand times the performance of a ZX81. Your map/reduce example would have to be four orders of magnitude slower in Clojure than in C for that comparison to be valid. In fact, on your own evidence, it's less than one order of magnitude.

      That isn't good, of course; making idiomatic Clojure more efficient is something we do need to pay attention to. But it isn't dramatically bad, either, and in these days of increasingly multi-core processors Clojure's inherent ability to make use of multiple cores may balance any 'single core' inefficiencies.

  2. If you're worried about your clojure performance, you should get out a profiler (visualVM works well) and see what's slow. Adding type hints in appropriate places is often the solution, especially for numeric performance. But the first step is always understanding the problem.

  3. Did you consider running this with primitive arrays? It should be way faster; I'd be surprised if it were slower by more than a couple of times. Or a primitive vector obtained by vector-of.

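
A sketch of the primitive-array approach suggested above (my code and my assumption about what was meant, not the commenter's), using long-array and areduce:

```clojure
;; Two primitive long arrays; areduce compiles down to a primitive
;; loop over the indices, so there is no boxing and no sequence
;; allocation on the way through.
(let [^longs a (long-array (range 1000000))
      ^longs b (long-array (range 1000000))]
  (time (areduce a i acc 0
                 (+ acc (aget a i) (aget b i)))))
;; -> 999999000000
```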
  4. I would also strongly recommend using criterium.core/quick-bench instead of just time. It's a drop-in replacement. When you use time you are mostly measuring non-JITed code; what's the point of that?

  5. Clojure's numeric performance is usually pretty poor without a lot of effort. It's pretty hard to work with primitives in Clojure and Java's boxed numbers are a performance killer.

  6. I think that pretty much the only way to do this particular task quickly in Clojure is:
    (loop [i 0
           a 0]
      (if (= i 1000000) a (recur (inc i) (+ a i i))))

    [runs in about 6ms on my new-ish laptop]

    The problem is that Clojure has practically no iteration constructs other than loop, because it wants for/range/map/reduce to be the idiomatic way to do things. Unfortunately, as soon as you put a number in a sequence, it is boxed, so arithmetic performance is bad unless you eschew the idiomatic way and do it in a loop.

    Dmitry's suggestion to use a primitive vector won't work; you'll use way too much time building the vector.

  7. Here is what I was talking about:
    user> (time (reduce + (map + (range 1000000) (range 1000000))))
    "Elapsed time: 5568.260303 msecs"
    999999000000
    user> (time (reduce + (map + (range 1000000) (range 1000000))))
    "Elapsed time: 473.669475 msecs"
    999999000000

    user> (cr/quick-bench (reduce + (map + (range 1000000) (range 1000000))))
    ...
    Execution time mean : 463.815941 ms

    After disabling default lein JVM options:

    user> (cr/quick-bench (reduce + (map + (range 1000000) (range 1000000))))
    ...
    Execution time mean : 190.522061 ms

    So apparently Clojure is not so slow.

    Replies
    1. Dmitry, I'm not seeing anything like your speedups!

      I'm starting clojure directly to avoid any weirdness:

      rlwrap java -classpath ~/.m2/repository/org/clojure/clojure/1.5.1/clojure-1.5.1.jar clojure.main
      Clojure 1.5.1
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 4952.810757 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 4748.199679 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 4739.475807 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 4740.157042 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 4758.834977 msecs"
      999999000000

      I wonder what the difference is?

      I'll give criterium a try, thanks.

    2. How much memory do you have? It's possible that Java starts in non-server mode and disables some optimizations.

    3. Oooh! (1GB) 4740ms to 1910ms. That's quite a speedup, thanks! It starts off quicker, and then optimizes further.

      rlwrap java -server -classpath ~/.m2/repository/org/clojure/clojure/1.5.1/clojure-1.5.1.jar clojure.main
      Clojure 1.5.1
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 2829.660116 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 2674.908172 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 1920.93443 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 1910.829511 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 1909.622308 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 1917.896758 msecs"
      999999000000
      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 1911.71056 msecs"
      999999000000

  8. Out of interest did you try core.reducers?

    Replies
    1. Vesa,

      (time (reduce + (range 1000000)))
      "Elapsed time: 1052.487052 msecs"
      499999500000
      user=> (time (r/fold + (range 1000000)))
      "Elapsed time: 1050.743251 msecs"
      499999500000

      user=> (time (reduce + (map + (range 1000000) (range 1000000))))
      "Elapsed time: 4974.2109 msecs"
      999999000000
      user=> (time (r/fold + (r/map + (range 1000000) (range 1000000))))
      ArityException Wrong number of args (3) passed to: reducers$map clojure.lang.AFn.throwArity (AFn.java:437)

      Any ideas?

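
For reference, my guess at the problem above (an assumption based on clojure.core.reducers semantics, not something from the thread): r/map, unlike clojure.core/map, takes exactly one collection, hence the arity error. A working single-collection sketch:

```clojure
(require '[clojure.core.reducers :as r])

;; r/map accepts only (r/map f coll); to add each element to itself,
;; map a doubling function over one range instead of zipping two.
;; A vector (rather than a lazy seq) lets r/fold actually run in parallel.
(r/fold + (r/map #(+ % %) (vec (range 1000000))))
;; -> 999999000000
```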
  9. Make sure to turn off tiered compilation with Leiningen, or run equivalent C code through make or something equivalent. I'd also second using criterium or some other benchmarking tool; using just time to benchmark implies you're only ever going to do that calculation once, in which case speed hardly matters unless it takes more than a few minutes.

    For real-world calculations: I do large-dataset analysis, and I've gotten Clojure to run much faster than C code doing Pearson correlation on a large dataset (an 80k x 80k matrix). The reason is that I used reducers and type hints. It becomes nearly equivalent to C's speed for the fn that actually does the math, and then when the reducers kick in it runs about 10-12 hours faster than the C code (which has no ForkJoinPool, and since I didn't write it I couldn't try to make it run in parallel in the first place).

    Replies
    1. Joey, that sounds awesome. Can you show me how to do this: (reduce + (map + (range 1000000) (range 1000000))) at C speeds? What I'm actually doing at the moment is Floyd-Warshall, so operations on a 2D matrix involving random access. In C or Java the calculations just fly by, but in Clojure I can't work out how to do them at anything like the speed.

  10. https://github.com/Prismatic/hiphip seems to address these issues

  11. Haven't seen such unscientific bullshit in a long time.

    You don't want to feed the trolls, but you put up a yellow headline and throw out a bunch of half-cooked experiments, and the only thing they show is that you don't know a thing about Clojure or the JVM.

  12. I'm with guilespi here, I think you should have done more research. The questions you ask in the comments show that your Clojure skills are not that strong. Maybe you should change the title to a question like 'is clojure this slow?' or remove the exclamation mark.

  13. Why are you using reduce instead of apply? If you do so then it is pretty fast.

    user=> (time (apply + (map + (range 1000000) (range 1000000))))
    "Elapsed time: 949.341 msecs"
    999999000000

  14. I could have said: "My God, John Lawrence Aspden is slow!" but I didn't.

    Replies
    1. Well, I certainly wouldn't be my own choice for high performance numerics. I probably get to around a milliflop on a good day.

  15. Something seriously strange is going on with your results. My Core i5 Macbook Air (not the fastest machine) is more than an order of magnitude faster.

    Here are my results with Clojure 1.6 and JDK 1.7:

    (time (reduce + (map + (range 1000000) (range 1000000))))
    "Elapsed time: 299.542 msecs"

    (time (def seqa (doall (range 1000000))))
    "Elapsed time: 63.29 msecs"

    (time (def seqb (doall (range 1000000))))
    "Elapsed time: 137.355 msecs"

    (time (reduce + (map + seqa seqb)))
    "Elapsed time: 1630.153 msecs"

    (time (reduce + seqa))
    "Elapsed time: 54.074 msecs"

    (time (def veca (vec seqa)))
    "Elapsed time: 107.335 msecs"

    (time (def vecb (vec seqb)))
    "Elapsed time: 44.885 msecs"


    user=> (time (reduce + (mapv + veca vecb)))
    "Elapsed time: 617.836 msecs"

    user=> (time (reduce + seqa))
    "Elapsed time: 39.58 msecs"

    Come now, be serious. Are you running these tests on a Raspberry Pi?

    Replies
    1. No, on a netbook that is the computer I actually carry around with me, as I said in the post. The programs run an order of magnitude faster if I run them on the university servers.

      The problem is the slowdown vs C or Java, not the actual speed of the computer. You can't ignore two and a half orders of magnitude by saying 'but the language is so elegant', unless you're using it for toy problems.

  16. I love that this post shows up so highly in Clojure searches. It's a somber reminder to not write about things I know nothing about (let alone with a proclamation like the author uses).

    Replies
    1. Refute me then, and I will make another post saying I was wrong. I wrote this after spending two days trying to get a Clojure program to complete while doing an algorithms course. I eventually rewrote it in C, and it ran in a few seconds on the same hardware.

      Once upon a time I wrote posts claiming that Clojure was very fast if you wrote it cleverly. No one complained then, and they got a lot of publicity. Nowadays I've lost the knack. If anything, it seems that compromises made to speed up performance have resulted in un-optimizability, and I understand that the official advice is to use Java for the tight loops.

      But I have no wish to damage the public reputation of Clojure, which I love and use all the time. What search makes this post show up on the front page?

    2. I came here searching "clojure performance". Found some one-liners without JVM warm-up, profiling, or equivalent comparisons. The first comment made me look into the comments, though.

    3. > What search makes this post show up on the front page?

      "clojure speed" currently puts this post in 4th position for me.

  17. Very strange comparison; it seems like you're measuring JVM startup time...
    You measured the runtime speed of C, right? Why didn't you even try to measure the speed in the REPL:

    => (time (reduce + (map + (range 1000000) (range 1000000))))
    "Elapsed time: 350.128 msecs"
    999999000000

    which seems pretty fast to me, as I deal mostly with Ruby, which is unsurprisingly really slow for that:
    > Benchmark.realtime { (0..1000000).zip(0..1000000).map(&:sum).reduce(:+) } * 1000
    => 6160.161999999999

    (Benchmark#realtime returns seconds, hence the * 1000)

  18. "What search makes this post show up on the front page?"

    https://www.google.com/#q=efficiency+of+clojure

    1st hit as of September 8th, 2016.

  19. What's the C code to compare?

    I'm not surprised C is much faster either way; what would surprise me is if another dynamic language could do better.

    It also seems my Android phone runs it at a similar speed as your Netbook, so I assume your Netbook is pretty darn slow.

    On my Galaxy S6:
    (time (reduce + (map + (range 1000000) (range 1000000)))) ; 6691ms
    (time (reduce + (map #(+ ^long % ^long %) (range 1000000)))) ; 1951ms

    On my Laptop:
    (time (reduce + (map + (range 1000000) (range 1000000)))) ; 147ms
    (time (reduce + (map #(+ ^long % ^long %) (range 1000000)))) ; 35ms
    (time (r/reduce + (r/map #(+ ^long % ^long %) (range 1000000)))) ; 22ms
    (time (m/esum (m/add (m/array (range 1000000)) (m/array (range 1000000))))) ; 68ms using core.matrix vectorz backend
    (time (reduce + (for [x (range 1000000) :let [y (+ x x)]] y))) ; 23ms

    Replies
    1. And if you want utmost performance, just use a loop, of course. Which honestly is as easy as using a while loop in any imperative language.

      On my Galaxy S6:
      (time (loop [sum 0 i 0] (if (< i 1000000) (recur (+ sum (+ i i)) (inc i)) sum))) ; 260ms

      On my laptop:
      (time (loop [sum 0 i 0] (if (< i 1000000) (recur (+ sum (+ i i)) (inc i)) sum))) ; 2ms

      It does appear that the sequence is what makes the whole thing slower, which makes sense, since you are first building a data structure and then looping over it. If you use two (range) calls, you're doing it twice: building two data structures and looping over both. That's why I'm curious to see the C code; were you also building two C linked lists and then mapping over them and reducing? Or were you doing something closer to this loop?

