Deoptimizing for loops
Prologue
For as long as I can remember (~2 weeks ago), every time I would write a for loop, I would wonder whether `for(int i=0;i<len;i++)` is slower than `for(int i=0;i!=len+1;i++)`.
Granted, this is a “possible” super-micro-optimization that (in the grand scheme of things) probably doesn’t matter, but a lot of great things start out with curiosity (dead cats notwithstanding).
Problem (C++)
Okay, let’s look at some code:
for(i = 0; i < len; i++){}

for(i = 0; i != len + 1; i++){}

The first one loosely compiles to something like this (both compiled with `-O2`):
mov edx, 0 ;; i = 0
.L3:
;; <-- body of loop goes here, not important for this rant
add edx, 1 ;; increment i by 1
cmp edx, esi ;; compare i to len
jl .L3 ;; jump back to .L3 while i is less than len
The second one:
mov edx, 0 ;; i = 0
add esi, 1 ;; len += 1 (note: done once, before the loop)
.L3:
;; <-- body of loop goes here, not important for this rant
add edx, 1 ;; increment i by 1
cmp edx, esi ;; compare i to len+1
jne .L3 ;; jump back to .L3 while i != len+1
Okay, so we get different jump instructions, and obviously one of them has to be much more expensive than the other; otherwise, why are we even here!?
Conclusion
Esteemed reader, we apologize for the disappointment. But according to the x86 documentation, these conditional jumps (Jcc) are effectively identical in cost T_T.
Problem 2.0 (Java)
But fear not, we will not be discouraged!
Let’s take a look at the Java™©☕.
Same parameters:
for(i = 0; i < len; i++){}
for(i = 0; i != len + 1; i++){}
Java of course doesn’t directly compile to assembly, so let’s look at some bytecode instead:
1.
L1
ILOAD 1 ;; load i
ILOAD 0 ;; load len
IF_ICMPGE L2 ;; if i >= len, go to L2
IINC 1 1 ;; increment i by 1
GOTO L1 ;; jump back to beginning of loop
L2
RETURN
2.
L1
ILOAD 1 ;; load i
ILOAD 0 ;; load len
ICONST_1 ;; push the constant 1
IADD ;; len + 1==WTF!
IF_ICMPEQ L2 ;; if i == len + 1 (the WTF value), go to L2
IINC 1 1 ;; increment i by 1
GOTO L1 ;; jump back to beginning of loop
L2
RETURN
WTF!? moment
Before even benchmarking them, there's a big red flag in the second bytecode snippet: `1` is added to `len` in every iteration of the loop, before it is compared against `i`. Interesting…
This has some overhead of course, but knowing what we know about the JVM, the overhead could be negligible, or it may simply be optimized away in a later stage.
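Incidentally, if you like the `!=` style but not the extra add, the obvious fix is to hoist the bound yourself. A minimal sketch (plain Java, not part of any benchmark here; the helper name is made up):

```java
// hoist the bound once, so "len + 1" is not re-evaluated on every iteration
static int sumHoisted(int len) {
    int total = 0;
    final int bound = len + 1; // evaluated exactly once
    for (int i = 0; i != bound; i++) {
        total += i;
    }
    return total;
}
```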
Remember, this is simply the bytecode translation. Most of the magic happens later on in C1 & C2 (or GraalVM if you are so inclined).
We could probably end this rant right here and now, but we are far too ~~stupid~~ curious for that.
What about C++?
Seems strange that Java would make such a weird decision to not optimize this simple ‘constant-value’.
Can we reproduce this in C++ though? Let us recompile with `-O1` & `-O0`.

And indeed, `-O0` produces the same unoptimized version:
for(i = 0; i != len + 1; i++){}
.L7:
mov eax, DWORD PTR [rbp-4] ;; copy i to eax
add DWORD PTR [rbp-8], eax ;; total+=i
add DWORD PTR [rbp-4], 1 ;; i++
.L6:
mov eax, DWORD PTR [rbp-20] ;; load len into eax
add eax, 1 ;; add 1 to eax(len)==WTF!?
cmp DWORD PTR [rbp-4], eax ;; compare i to len+1
jne .L7 ;; jump back to L7 if i!=len+1
Benchmarking time
There's this Dutch saying, *meten is weten*, which colloquially translates to something like *she who measures, knows* 🧙♂️. And nothing is truer of code. You can make all the deductions and write the nicest algorithms you like; in the end, your program is either fast, or not.
Enough babble, let us set up a simple JMH benchmark (sorry, not gonna go over it line by line):
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark) // required so JMH injects the @Param value into x
public class Benchmark {

    public static void main(String[] args) throws RunnerException {
        final Options options = new OptionsBuilder()
                .include(Benchmark.class.getSimpleName())
                .build();
        new Runner(options).run();
    }

    @Param({"1000", "10000", "100000", "1000000", "10000000"})
    public int x;

    // the annotation is fully qualified because this class is itself called Benchmark
    @org.openjdk.jmh.annotations.Benchmark
    public void test1(final Blackhole blackholeSun) {
        int total = 0;
        for (int i = 0; i < x; i++) {
            total += i;
        }
        blackholeSun.consume(total);
    }

    @org.openjdk.jmh.annotations.Benchmark
    public void test2(final Blackhole blackholeSun) {
        int total = 0;
        for (int i = 0; i != x + 1; i++) {
            total += i;
        }
        blackholeSun.consume(total);
    }
}
And, what's the verdict? (JMH doesn't produce output in exactly this shape; the tables below are some scrappy copy-paste work.)
(x) | test1 | test2 |
---|---|---|
1000 | 248.127 | 248.770 |
10000 | 2628.718 | 2630.839 |
100000 | 25027.878 | 26170.237 |
1000000 | 260002.502 | 256436.492 |
10000000 | 2469421.318 | 2515042.555 |
Okay, so it looks like `test1` is just slightly faster. This could mean a couple of things:
- this could simply be within the margin of error (remember, benchmarks are not perfect)
- if not, then:
  - `test1` and `test2` probably have different machine code
  - or `test1` and `test2` could have different `C1` or `C2` code (or both)
So let's see if we can widen the gap between the test results. We won't bore our loyal readership with all our failed experiments, but this small change does the trick:
@Param({"1000", "10000", "100000", "1000000", "10000000"})
public long x; // <-- we change this to a long
(x) | test1L | test2L |
---|---|---|
1000 | 22940.000 | 23340.000 |
10000 | 189420.000 | 192720.000 |
100000 | 178100.000 | 183920.000 |
1000000 | 1205300.000 | 1496420.000 |
10000000 | 5990540.000 | 7850100.000 |
Boom! That’s a whopping ~30% difference! So no longer in the realm of margin of error.
There could be more going on here of course, but let’s just proceed with the slow version, and see what we get.
Assemble
Before we proceed with the slower test1L & test2L (long) versions, we want you to know that, similar to the bytecode we showed above, the bytecode for test2L also seems to do the `x + 1` part in every iteration of the loop.
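For reference, the long variants presumably look like the earlier benchmark with only the type of `x` changed; a sketch (inside the same Benchmark class as before, annotations fully qualified for the reason noted there):

```java
@Param({"1000", "10000", "100000", "1000000", "10000000"})
public long x; // now a long

@org.openjdk.jmh.annotations.Benchmark
public void test1L(final Blackhole blackholeSun) {
    int total = 0;
    for (int i = 0; i < x; i++) { // i is sign-extended to long for every comparison
        total += i;
    }
    blackholeSun.consume(total);
}

@org.openjdk.jmh.annotations.Benchmark
public void test2L(final Blackhole blackholeSun) {
    int total = 0;
    for (int i = 0; i != x + 1; i++) { // and the long addition x + 1 happens per comparison too
        total += i;
    }
    blackholeSun.consume(total);
}
```

That sign extension of `i` is exactly the `movsxd` you'll see in the C1 output below.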
C1
Okay, so what does the C1 output look like?

Run with `-XX:-UseCompressedOops -XX:PrintAssemblyOptions=intel -XX:CompileCommand=print,org/example/Benchmark.*`.
test1L (abridged version):
;this block initializes i=0, total=0, and saves r8 (to restore it later?)
0x000002adc77623b4: mov rsi,r8
0x000002adc77623b7: mov edi,0x0
0x000002adc77623bc: mov ebx,0x0
0x000002adc77623c1: jmp 0x000002adc7762415 ; note that we directly jump to the i<x part of the loop
;this block does the total+=i & the i++ parts
0x000002adc77623c6: xchg ax,ax ;NOP
0x000002adc77623c8: mov r8,rdi ;copy i to r8
0x000002adc77623cb: add r8d,ebx ;r8d = total + i (the new total)
0x000002adc77623ce: inc edi ;i++
;this block checks if i<x
0x000002adc7762415: mov r8,QWORD PTR [rdx+0x10] ;copy x to the r8 register
0x000002adc7762419: movsxd rax,edi ;sign-extend i into rax
0x000002adc776241c: cmp rax,r8 ;compare i to x
0x000002adc776244f: jl 0x000002adc77623c8 ;go back to start of loop while x>i
test2L (abridged version):
;same init block as test1L
0x0000024e00894834: mov rsi,r8
0x0000024e00894837: mov edi,0x0
0x0000024e0089483c: mov ebx,0x0
0x0000024e00894841: jmp 0x0000024e00894895
;same increment block as test1L
0x0000024e00894846: xchg ax,ax
0x0000024e00894848: mov r8,rdi
0x0000024e0089484b: add r8d,ebx
0x0000024e0089484e: inc edi
;ALMOST same condition block
0x0000024e00894895: mov r8,QWORD PTR [rdx+0x10] ;copy x to r8
0x0000024e00894899: movsxd rax,edi ;sign-extend i into rax
0x0000024e0089489c: movabs r10,0x1 ;put 1 in r10
0x0000024e008948a6: add r8,r10 ;add r10 to r8==>add 1 to x!!!!!!?
0x0000024e008948a9: cmp rax,r8 ;compare i to x
0x0000024e008948dc: jne 0x0000024e00894848 ;loop if not equal
So yeah, similar to the bytecode version, test2L does `x + 1` in every iteration of the loop.
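In plain Java terms, the compiled test2L loop behaves roughly like this (a paraphrase of the blocks above, not literal C1 output):

```java
// rough Java paraphrase of the C1-compiled test2L loop
static int test2LParaphrase(long x) {
    int i = 0, total = 0;
    while (true) {
        long bound = x + 1L;     // x is reloaded and 1 is added on every pass
        if ((long) i == bound) { // i is sign-extended (the movsxd above) for the comparison
            break;
        }
        total += i;
        i++;
    }
    return total;
}
```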
Can we make this even worse?
Problem 2.1 (Java)
So what happens if we extract `x + 1` to a function? Something like the sketch below.
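The post doesn't show the extracted method verbatim; a plausible guess, reusing the `func` name implied by the `test2L_func` results below (again inside the same Benchmark class):

```java
// hypothetical extracted bound; the exact body isn't shown in the post
private long func() {
    return x + 1;
}

@org.openjdk.jmh.annotations.Benchmark
public void test2L_func(final Blackhole blackholeSun) {
    int total = 0;
    for (int i = 0; i != func(); i++) {
        total += i;
    }
    blackholeSun.consume(total);
}
```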
Strangely enough, for the largest input this version is actually faster, not slower. 😅
(x) | test1L | test2L | test2L_func |
---|---|---|---|
1000 | 22940.000 | 23340.000 | 37940.000 |
10000 | 189420.000 | 192720.000 | 263360.000 |
100000 | 178100.000 | 183920.000 | 246280.000 |
1000000 | 1205300.000 | 1496420.000 | 1525420.000 |
10000000 | 5990540.000 | 7850100.000 | 5895140.000 |
As you can see it's slower in all instances except when x=10000000. Could this be an inlining issue?
Let’s disable inlining, and try again.
Disable inlining
Run with `-XX:CompileCommand=dontinline,org/example/Benchmark.*`.
(x) | test1L | test2L | test2L_func | test2L_func disable inlining |
---|---|---|---|---|
1000 | 22940.000 | 23340.000 | 37940.000 | 43100.000 |
10000 | 189420.000 | 192720.000 | 263360.000 | 291880.000 |
100000 | 178100.000 | 183920.000 | 246280.000 | 276600.000 |
1000000 | 1205300.000 | 1496420.000 | 1525420.000 | 1366640.000 |
10000000 | 5990540.000 | 7850100.000 | 5895140.000 | 5557340.000 |
Okay, so it wasn’t the inlining.
Let’s try upping the compile-threshold.
Increasing the compile-threshold
(x) | test1L | test2L | test2L_func | test2L_func disable inlining | test2L_func compile threshold |
---|---|---|---|---|---|
1000 | 22940.000 | 23340.000 | 37940.000 | 43100.000 | 48120.000 |
10000 | 189420.000 | 192720.000 | 263360.000 | 291880.000 | 298160.000 |
100000 | 178100.000 | 183920.000 | 246280.000 | 276600.000 | 272060.000 |
1000000 | 1205300.000 | 1496420.000 | 1525420.000 | 1366640.000 | 1769000.000 |
10000000 | 5990540.000 | 7850100.000 | 5895140.000 | 5557340.000 | 7762280.000 |
Okay, so that's what it was: the earlier speedup came from the later compilation tiers, and raising the compile threshold takes it away again (should have been obvious).
But can we make it even…worse?
Problem 2.2 (Java)
So far our `i != x + 1` (test2L) and `i != func()` (test2L_func) have been our slowest contestants. Both work in roughly the same way, in that they add some kind of *action* that is executed on every iteration.

So how could we make that even worse? A bigger *action*, I guess?
Let’s change our function to something like:
private double rand() {
return x + new Random().nextInt(1); // *evil grin*
}
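The matching benchmark isn't shown in the post either; presumably it's something like this sketch (the name comes from the results table, the exact condition is a guess; note that `rand()` returns a `double`, so the comparison becomes a floating-point one):

```java
// hypothetical benchmark driving the loop bound with rand() (which needs java.util.Random)
@org.openjdk.jmh.annotations.Benchmark
public void test2L_rand(final Blackhole blackholeSun) {
    int total = 0;
    for (int i = 0; i != rand(); i++) { // i is converted to double for every check
        total += i;
    }
    blackholeSun.consume(total);
}
```

Besides the call itself, every check now pays for an int-to-double conversion and a `new Random()` allocation.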
And the results are:
(x) | test1L | test2L | test2L_func | test2L_func disable inlining | test2L_func compile threshold | test2L_rand |
---|---|---|---|---|---|---|
1000 | 22940.000 | 23340.000 | 37940.000 | 43100.000 | 48120.000 | 123580.000 |
10000 | 189420.000 | 192720.000 | 263360.000 | 291880.000 | 298160.000 | 606320.000 |
100000 | 178100.000 | 183920.000 | 246280.000 | 276600.000 | 272060.000 | 4829220.000 |
1000000 | 1205300.000 | 1496420.000 | 1525420.000 | 1366640.000 | 1769000.000 | 39910100.000 |
10000000 | 5990540.000 | 7850100.000 | 5895140.000 | 5557340.000 | 7762280.000 | 398303480.000 |
That's a whopping ~50 times slower than test2L for the largest input!
While we’re fairly certain we could go even slower, we will present that task as homework for our esteemed reader(s), and proceed to dissect our latest creation.
C2
We will spare you the C1 details, but it does a call to rand() for every loop iteration, as expected.

But what about C2?
test2L_rand (with inlining disabled, to exclude the java.util.Random stuff from the disassembly):
;init block
0x000001ce29eaead2: mov DWORD PTR [rsp+0x30],ebx
0x000001ce29eaead6: mov DWORD PTR [rsp+0x34],r12d
0x000001ce29eaeadb: mov QWORD PTR [rsp+0x38],rdi
0x000001ce29eaeae0: jmp 0x000001ce29eaeb19
0x000001ce29eaeae2: nop DWORD PTR [rax+0x0]
0x000001ce29eaeae9: nop DWORD PTR [rax+0x0]
;loop block
0x000001ce29eaeaf0: mov r10,QWORD PTR [r15+0x120] ; load the thread-local safepoint polling address
0x000001ce29eaeaf7: mov r8d,DWORD PTR [rsp+0x34] ; r8d=total
0x000001ce29eaeafc: add r8d,DWORD PTR [rsp+0x30] ; total+=i
0x000001ce29eaeb01: mov r11d,DWORD PTR [rsp+0x30] ; r11d=i
0x000001ce29eaeb06: inc r11d ; i++
0x000001ce29eaeb09: test DWORD PTR [r10],eax ; safepoint poll
0x000001ce29eaeb0c: mov DWORD PTR [rsp+0x30],r11d ; put the incremented i value (r11d) back into i
0x000001ce29eaeb11: mov DWORD PTR [rsp+0x34],r8d ; put the total+=i value back into total
0x000001ce29eaeb16: mov r10,rbp
;condition block
0x000001ce29eaeb19: vcvtsi2sd xmm0,xmm0,QWORD PTR [r10+0x10]
0x000001ce29eaeb1f: vmovsd QWORD PTR [rsp+0x20],xmm0
0x000001ce29eaeb25: mov rbp,r10
0x000001ce29eaeb28: mov r10d,DWORD PTR [rsp+0x30]
0x000001ce29eaeb2d: vmovd xmm0,r10d
0x000001ce29eaeb32: vcvtdq2pd xmm0,xmm0
0x000001ce29eaeb36: vmovsd QWORD PTR [rsp+0x28],xmm0
0x000001ce29eaeb3c: mov rdx,rbp
0x000001ce29eaeb3f: call 0x000001ce29eab2c0 ;*invokevirtual rand {reexecute=0 rethrow=0 return_oop=0}
; {optimized virtual_call}
0x000001ce29eaeb44: vaddsd xmm0,xmm0,QWORD PTR [rsp+0x20]
0x000001ce29eaeb4a: vmovsd xmm1,QWORD PTR [rsp+0x28]
0x000001ce29eaeb50: vucomisd xmm1,xmm0 ; (fancy) compare i!=rand()
0x000001ce29eaeb54: jp 0x000001ce29eaeaf0 ; loop if unordered (a NaN would set the parity flag)
0x000001ce29eaeb56: jne 0x000001ce29eaeaf0 ; loop if not equal
As we can see, rand() is called every iteration of the loop 🤯. Aside from the overhead, this does seem a little absurd. Think about it: all of a sudden this loop has become non-deterministic!? (Well, it would be, if nextInt(1) could ever return anything other than 0.)
Bonus side-rant (C++ version)
Compiled with `-O4` (which GCC treats the same as `-O3`), and yeah, `rand()` is also called every iteration ¯\_(ツ)_/¯.
;; https://godbolt.org/z/odf6rWfdT
test2L_rand(int):
push r12
mov r12d, edi ; r12d = the argument (x)
push rbp
xor ebp, ebp ; total = 0
push rbx
xor ebx, ebx ; i = 0
jmp .L2
.L3:
add ebp, ebx ; total += i
add ebx, 1 ; i++
.L2:
call rand ; rand() called on every iteration
add eax, r12d ; rand() + x
cmp eax, ebx ; compare it to i
jg .L3 ; loop while rand() + x > i
mov eax, ebp ; return total
pop rbx
pop rbp
pop r12
ret
Is this normal though?
To be honest, dear reader, the author has spent so much time staring at assembly that little to no time was left to look up good documentation on this.

Suffice it to say, both the C++ standard and the Java spec mention, in a very subtle manner, that the termination/condition expression is evaluated before each iteration. So, normal I guess.
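If you'd rather convince yourself than read specs, put a side effect in the condition and count it. A tiny self-contained sketch (not from the benchmark code; names are made up):

```java
// counts how often the loop condition is evaluated
public class ConditionCount {
    static int evaluations = 0;

    static boolean keepGoing(int i, int len) {
        evaluations++; // side effect: count every evaluation
        return i != len + 1;
    }

    public static void main(String[] args) {
        int len = 5;
        for (int i = 0; keepGoing(i, len); i++) {
            // loop body
        }
        // one check per iteration plus the final failing one: len + 2 = 7
        System.out.println(evaluations);
    }
}
```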
Epilogue
We started this experiment with the assumption that different jmp instructions might be more expensive, but instead discovered that it's easier to thwart the compiler's plans for optimizing our ~~crappy~~ beautiful code than we thought.
The devil is in the details as they say. None of what I came across is secret knowledge, nor is it common knowledge for some reason.
Part of that reason is probably that, in most simple cases, the compiler is smart enough to recognize a loop-invariant expression (like the `x + 1` in `i != x + 1`, or `x + bla()` once `bla()` has been inlined and shown to be side-effect free) and hoist it out of the loop, instead of recalculating it every iteration. This is sometimes done in C1, but more likely in C2, which explains (some of) the performance gains there.
“Do not assume the numbers tell you what you want them to tell.”

JMH is an exquisite tool, but this sentence from the JMH output is worthy of a Nobel Peace Prize. The numbers don't lie, but we certainly do.
I started out by assuming that the difference in numbers was due to different cmp or jmp instructions that were costly. And sadly enough, I was hubristic enough to write the first draft of this rant around this…assumption (yes, I actually looked at the C1/C2 code of the Java version before looking in the x86 spec T_T).
Links
- https://stackoverflow.com/questions/18596300/performance-greater-smaller-than-vs-not-equal-to
- http://www.intel80386.com/386htm/SETcc.htm
- http://www.intel80386.com/386htm/Jcc.htm
- https://stackoverflow.com/questions/9617877/assembly-jg-jnle-jl-jnge-after-cmp
- https://marc.info/?l=linux-kernel&m=90222165330252
- https://stackoverflow.com/questions/12135518/is-faster-than
- https://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
- https://stackoverflow.com/questions/39556649/in-x86-whats-difference-between-test-eax-eax-and-cmp-eax-0
- https://stackoverflow.com/questions/33721204/test-whether-a-register-is-zero-with-cmp-reg-0-vs-or-reg-reg/33724806#33724806
- https://en.wikipedia.org/wiki/List_of_Java_bytecode_instructions
- https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8256508