Java explicit vectorization
- Part 1 - background (written by dummies, for curious cats)
- Part 2 - let’s talk about Java
- References
Part 1 - background (written by dummies, for curious cats)
What is ‘Vectorization’?
Without beating about the bush: does Vectorization, explicit or otherwise, have anything to do with the fantabulous java.util.Vector class? Well, not much.
Allow me to explain with this very long example.
Bullshit use-case
Let’s say you are writing an image editing tool similar to Photoshop. One of the functions you want to implement is adjusting the brightness. In this particular example:
- Your input is an array of 3 bytes of RGB values per pixel, meaning a 1920 x 1080 picture would have 3 x 1920 x 1080 = 6220800 bytes.
- Increasing brightness is basically adding 1 (or more) to each RGB value per pixel
So how would we do that?
for (int i = 0; i < 6220800; i++) {
rgb[i]++;
}
We could probably do something fancy like unrolling the loop (eliminating some of the loop checking) to speed this up:
for (int i = 0; i < 6220800; i += 2) {
rgb[i + 0]++;
rgb[i + 1]++;
}
The reason we have to do it this way is because of real-world physics. Your fancy shmancy CPU is basically a digital logic gate on steroids that can do A LOT of dumb simple things really REALLY fast.
Vivisection
To give you an idea of what “simple” means, if you want to add 2 numbers you have to:
- 1st simple operation (mov reg1, num1): load the first number into a register (a small memory inside the CPU)
- 2nd simple operation (add reg1, num2): add the second number to the register
- 3rd simple operation (mov num1, reg1): write the register back to the first number
That is 3 CPU operations for something that seems rather innocent. The good news is that your CPU (depending on speed, and the type of ops) can do millions of these operations per second.
Let’s say our CPU can do a million operations per second. Let us do a simplified dissection of the first example from above:
- 3 instructions: save current state
- 3 instructions: setting up the loop
- 3 instructions: checking/incrementing the loop
- 3 instructions: adding 1 to a value (mov, add, mov)
- 3 instructions: restore saved state
So about 6 instructions per loop iteration. Multiply that by the size of our loop, 6 x 6220800 = 37324800, which when divided by a million operations per second results in 37324800 / 1000000 ≈ 37 seconds to increase the brightness of a 1920 x 1080 picture (on a fundamental level, computers have a mechanical nature to them).
If you think this sounds ridiculous dear reader, know that it is the truth, albeit half the truth. Your computer truly does operate in this way, but that does not mean that this is the only way it can operate. If changing every pixel on a screen took 37 seconds, then neither video-games nor video playback would be possible on a computer.
Soothing light at the end of the tunnel
It might surprise/scare you to know that a lot of the time, software boils down to loads of repetitive operations like the ones we described above (disclaimer: this depends of course, but it’s our story, so it is true). The point being, our example is not an edge case. Meaning that a solution is needed, which in turn results in companies inventing ways to address the business opportunity problem.
Enter the SIMD
Again, etymology and history are available on the Wikipedia, so forgive me for skipping them. SIMD stands for Single Instruction Multiple Data. Let’s put a pin in that, and recap our problem.
The nail to SIMD’s Hammer
- In our example, it takes us 37324800 instructions to add 1 to every element of our array.
- The instructions are very repetitive
- The instruction sequences are also very repetitive
- Almost half the instructions are checking/incrementing the loop
- And finally, the array elements do not depend on each other
Okay, so common sense would dictate that our program would run faster in 1 of 2 cases:
- Fewer instructions
- Faster instructions - we won’t go into this, but some CPU instructions are faster than others. For this example, we will assume we are using the fastest instructions.
So fewer instructions is the key. We previewed such a technique in the 2nd example above -loop unrolling-, which would reduce our need to check the loop after every element update. So if we update 5 elements at a time, we eliminate 4 x 3 loop check instructions for every 5 updates, which results in a total of 18/5 x 6220800 = 22394880 instructions. That’s nice, so why not go further and do 100 or 1000 elements at a time? Well, the short answer is, we forgot to mention that not all CPU instructions are as fast as each other. Okaaaaay, but why is this important? Well, even though a loop iteration has:
- 3 instructions for checking the loop
- And 3 instructions for incrementing the element
the second 3 instructions are 50% slower, so they take 60% of the total time of an iteration (3 x 1 + 3 x 1.5 = 7.5 units of time, of which the element updates make up 4.5; again, this might not be accurate, but it makes for a better story). So even if you manage to remove all of the loop checking instructions, the program will never be twice as fast.
Also, your friendly neighbourhood compiler is _usually_ smart enough to know if and how much loop unrolling is necessary :)
Soothing light at the end of the tunnel
So instead, we want the copying, adding, and copying back parts to be faster. But how? We are already using the fastest instructions available.
Forgive the delay in reaching this point, but trust me when I say this makes much more sense with some context.
This is where SIMD comes in. SIMD allows us to execute special instructions that do a simple action for a sequence of data. How much of that last part sounds familiar?
Okay, so in our first example we started out by copying the data byte to the CPU register by invoking a mov instruction. Is there a SIMD version for this operation? Yessur! vmovdqu to the rescue.
So what does vmovdqu do? Well, to quote the official Intel manual[1]: "Move unaligned packed byte integer values from xmm2/m128 to xmm1 using writemask k1". To translate that into mere mortal terms: "Move a SEQUENCE of bytes from anywhere in memory to SPECIAL registers, and some other stuff we don't care about yet". Exactly what we are looking for then!
But what are these “special registers” they talk about?
Well, like we briefly mentioned at the top, a register is a memory block inside the CPU. What we didn’t mention is that registers are traditionally very small. Like 1 or 2 bytes small. Which of course doesn’t help when we want to copy a sequence of bytes into a register. That’s why these SIMD instructions came hand in hand with the special registers. A special register can usually hold about 8 times what a normal register can hold (this depends on the CPU). Great! So we can copy 8 values per operation!
But does this mean that this single instruction is 8 times faster than doing 8 ordinary instructions? The short answer is, alas, no. The good news is, it is about 60% faster!
So if all our program did was copy bytes to the CPU registers, by moving to the SIMD flavour, we gain 60%. Meaning 100 seconds becomes 40.
Furthermore, if this assumption holds for the rest of the operations of our simple program, then our runtime becomes 37 * 0.4 = 14.8 seconds.
What’s my name?
This phenomenon is called Vectorization. The term stems from Vector Processors (see wikipedia), which operate on Mathematical Vectors. And if you remember your linear algebra, a Vector is basically something like [1 2 3 0], which I’m sure your Java-laden eyes mistook for an Array. Fear not, it basically is an array, only on a much lower level.
Part 2 - let’s talk about Java
So before we get into what any of this has to do with Java, know that there are 2 (or maybe more?) different types of Vectorization:
- Auto Vectorization (AV)
- Explicit Vectorization (EV)
Of course we recognize EV from the title of this rant, so we will talk about it extensively. But to get to that, we need to talk about AV first.
Auto Vectorization (AV)
So the stuff we talked about in the previous section was more or less on the Assembly/Machine Code level. But naturally, most of us mere mortals program in higher level languages. So how does my for loop become Machine Code?
We all know this one, the compiler does it for us.
But…is that all a compiler does? Well, without going too much into naming semantics, almost all modern compilers optimize code.
Optimizations come in many shapes and forms, but one of the most basic ones is choosing the best opcode/instruction for our operation. This might sound like a non-optimization requirement of a compiler, but please remember that some CPUs have hundreds upon thousands of instructions, and any given operation usually has a couple of variations that are best used under different conditions.
For example, addition in Assembly is done by using the add reg1, reg2 instruction. Pretty straight-forward. However, in the case of adding 1 to a loop index for instance, a smart compiler will most likely use the inc reg (increment) instruction. While the speed gain is usually negligible, your code becomes smaller by saving the additional bytes[2].
– Interlude –
So again, forgive the long-winded introduction, but as usual, it’s easier to relate points to each other.
If we reduce the fact that SIMD instructions are simply (fancy) CPU instructions, and add that to the fact that an Optimizing Compiler chooses the best opcode/instruction for the job, then we might conclude that the Compiler might also choose a SIMD instruction, if the situation arises. And we would be correct.
And this process my dear comrades, is called: Auto Vectorization.
Does this mean my trustworthy compiler will Vectorize all my code? As a wise man once said…it depends.
To explain this, we will need to quote a different wise man, who once said that compilers are basically “stupid regex pattern matching engines” (sorry, can’t remember the source). Meaning, like everything your compiler does, it only does so under very specific circumstances. So while the magical compiler might rewrite the above loop to use SIMD, it doesn’t take a lot of deviation for the same wizardly compiler to skip it.
Non-SIMD example:
for (int i = 0; i < 6220800; i++) {
if (i > 0)
rgb[i]++;
}
Granted that you wouldn’t write code like this (on purpose), it is still baffling that something like this would keep us from our deserved Vectorization goodness ¯\_(ツ)_/¯
What abouts the Java?
Okay then, back to Java reality with the million dollar question; to be or not to be what about Java? Well, considering that Java compilers don’t actually compile to Assembly/Machine Code T_T, we are sorry to say that…the JVM architects/developers got you covered and already implemented this in the JVM portion for your pleasure (we’re not worthy).
So yeah, Java supports a substantial subset of Auto Vectorization[3]. But like everything in JavaLand that concerns real-life hardware, there is an added layer of indirection that needs to be taken into consideration.
If you remember, the Java code lifecycle towards the CPU is an arduous one:
- Interpreter
- C1
- C2
As such, the AV optimizations usually only kick in at the latter end of that pipeline, i.e. in C2 (onwards? downwards?).
Meaning (among other things), that very short lived pieces of code that don’t reach C2 would unfortunately not be considered AV candidates. This on top of the compiler level requirements for a piece of code (such as a simple loop) to be Vectorizable.
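If you want to see which tier your code reaches (and what C2 eventually emits), the JVM can be asked to show its work. A sketch, where YourBenchmark is a placeholder for your own class (the PrintAssembly part additionally requires the hsdis disassembler plugin):

java -XX:+PrintCompilation YourBenchmark
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly YourBenchmark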
Explicit Vectorization (EV)
Prelude
We’ve always been taught the magic words "the compiler is smart enough to...", which in turn would mean that if a compiler fails to do whatever it is we expect of it, then the ‘fault’ lies with layer 8.
The point being that one way or another, the compiler has to recognize certain rules before it can apply an optimization. As we will see, this is sometimes very difficult to do with our code.
So instead, a lot of compilers provide so-called Compiler Overrides. In some cases these advise the compiler to take a closer look at a piece of code with regards to the optimization named by the override; in others the compiler heuristics are completely ignored, and the optimization (or deopt in some cases) is simply performed.
Even though Java doesn’t have the traditional Compiler structure, it still maintains the spirit of being overridable in certain areas.
Enter the Dragon
Even though a lot of this applies to areas outside of Java, we will try to focus on Java only from this point on.
We have finally reached the meat of the rant. Explicit Vectorization (EV).
Nutshell: where AV is an optimization provided by the compiler, EV is the override the programmer provides to force the compiler to Vectorize/SIMD-ify the code.
Show me the money
Let’s give it a try then.
Prerequisites
- JDK 16+
- Add the command line argument --add-modules=jdk.incubator.vector (this won’t be necessary once the Vector API exits incubation status)
- x64 or ARM AArch64 CPU (as of JDK 16, only these architectures are properly supported)
- AVX support (if you have a CPU from the last decade[4], then just assume you have this)
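For example, compiling and running a single-file program (Main.java being a placeholder) would look like:

javac --add-modules jdk.incubator.vector Main.java
java --add-modules jdk.incubator.vector Main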
Hello, 世界
Let’s take our increasing brightness example from above.
Vanilla
static void increaseBrightness(int[] rgb) {
for (int i = 0; i < 6220800; i++) {
rgb[i]++;
}
}
With Nuts
Okay, so before going on, I would like to warn ye with an important takeaway from this here rant, which is “magic is not cheap”.
Species
That being said, the first thing you need to know is the vector.VectorSpecies of your operation.
The magnitude of concurrent operations you can execute is determined by 2 things:
- the register width of our CPU
- the width of the data type we are trying to manipulate
The way this works is as follows:
- if your CPU only supports AVX, that means the widest register it has (XMM#) is 128 bits wide.
- in our example rgb[] is an int array, meaning the data type is int, which has a width of 4 bytes or 32 bits.
So by dividing 128 / 32 = 4, we find we can do operations on 4 ints at a time.
final VectorSpecies<Integer> species = IntVector.SPECIES_128;
Sweet, but do I need to do all these tedious calculations every time I want to do me some vectors?
Luckily, no. The API contains a couple of convenience constants that do this for you.
final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
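As a quick sanity check, you can ask the species what it resolved to. A minimal sketch (the numbers depend on your CPU; an AVX2 machine should report 256 bits / 8 lanes):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesInfo {
    public static void main(String[] args) {
        final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
        // width in bits, and how many ints fit in one vector
        System.out.println(species.vectorBitSize() + " bits = " + species.length() + " int lanes");
    }
}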
Read input into SIMD register(s)
So next, we have to transform our plain old int[] into the Vector version. Or, in other words, move the data into the special SIMD registers.
The funny thing about this part though, is that it seems very counterintuitive from the performance point of view. I mean, we are literally copying the data into smaller chunks, instead of iterating over an array?? At the very least you use double the memory size.
The problem is, our naïveté assumes that an array stays where it is, but know this: SIMD or not, a CPU only deals with registers.
final int offset = 0;
IntVector vector = IntVector.fromArray(species, rgb, offset);
Here we create a Vector of the type and width we determined with the species before. Meaning, if species == 128 bits wide, and we are dealing with integers, then vector will contain the first 4 ints from the rgb array, at the specified offset.
Do operation(s) on SIMD register(s)
Now that the data is where we want it to be, we can unleash our CPU prowess upon it!!!
As you will no doubt discover when you press ctrl + space on a vector, there are a lot of fun things you can do with it. These of course are but a sane subset of SIMD, and instead of whining about it, be grateful (and send John Rose & Paul Sandoz a gift basket).
IntVector result = vector.add(1);
This step is self-explanatory.
Write SIMD register(s) back to input
result.intoArray(rgb, offset);
And finally, we write the updated values back to the array. The way we do it here isn’t really the ‘proper’ way, but it works nonetheless; by that I mean we are writing the values back into the input array, when it’s much better to write to an intermediate array (as we might explain later).
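A minimal sketch of that variant, where out is a hypothetical caller-provided array of the same length as rgb:

final IntVector result = vector.add(1);
result.intoArray(out, offset); // the input rgb stays untouched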
Hello hello
This is what our end result looks like:
final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
for (int offset = 0; offset < rgb.length; offset += species.length()) {
final IntVector vector = IntVector.fromArray(species, rgb, offset);
final IntVector result = vector.add(1);
result.intoArray(rgb, offset);
}
A casual observer might…observe, that we are still looping over our array. This is a necessary evil. Computers are magical, but not THAT magical. Maybe someday there will be a single instruction that operates on any length of data. Until that day though, loops rule!
And of course this loop might have a tail/remainder; rather than leave it entirely up to your imagination, a sketch follows.
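One possible way, using the API’s loopBound() helper (it rounds a length down to the nearest multiple of the species length), is a vector loop followed by a scalar tail:

static void increaseBrightness(int[] rgb) {
    final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
    // largest multiple of species.length() that fits in rgb.length
    final int bound = species.loopBound(rgb.length);
    int offset = 0;
    for (; offset < bound; offset += species.length()) {
        final IntVector vector = IntVector.fromArray(species, rgb, offset);
        vector.add(1).intoArray(rgb, offset);
    }
    // scalar tail: finish whatever is left the old-fashioned way
    for (; offset < rgb.length; offset++) {
        rgb[offset]++;
    }
}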
Benchmark Vanilla vs Nuts
Is the SIMD version faster or slower? Good question. To be honest, I expected it to be slightly slower because I assumed the Vector API had some slight overhead compared to the compiler provided AV.
Seems I was only half right though, as in it is indeed a little slower, but not for the reasons I thought.
So running both versions for 1000 iterations (after warmup) yields the following results:
vanilla | simd |
---|---|
1249416200ns | 1295295700ns |
That’s a difference of ~45ms in favor of the vanilla loop.
Looking at the assembly code for both, we note one major difference; the vanilla version unrolls 8 iterations of the loop, while the SIMD version only unrolls 4.
;vanilla || simd
0x0000020d44a0c9b0: vpaddd ymm1,ymm0,YMMWORD PTR [r12+rbx*4+0x10] || 0x0000020d44a0e020: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x10]
0x0000020d44a0c9b7: vmovdqu YMMWORD PTR [r12+rbx*4+0x10],ymm1 || 0x0000020d44a0e027: vmovdqu YMMWORD PTR [r12+r8*4+0x10],ymm1
0x0000020d44a0c9be: movsxd r8,ebx || 0x0000020d44a0e02e: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x30]
0x0000020d44a0c9c1: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x30] || 0x0000020d44a0e035: vmovdqu YMMWORD PTR [r12+r8*4+0x30],ymm1
0x0000020d44a0c9c8: vmovdqu YMMWORD PTR [r12+r8*4+0x30],ymm1 || 0x0000020d44a0e03c: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x50]
0x0000020d44a0c9cf: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x50] || 0x0000020d44a0e043: vmovdqu YMMWORD PTR [r12+r8*4+0x50],ymm1
0x0000020d44a0c9d6: vmovdqu YMMWORD PTR [r12+r8*4+0x50],ymm1 || 0x0000020d44a0e04a: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x70]
0x0000020d44a0c9dd: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x70] || 0x0000020d44a0e051: vmovdqu YMMWORD PTR [r12+r8*4+0x70],ymm1
0x0000020d44a0c9e4: vmovdqu YMMWORD PTR [r12+r8*4+0x70],ymm1 || 0x0000020d44a0e058: add r8d,0x20
0x0000020d44a0c9eb: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0x90] || 0x0000020d44a0e05c: cmp r8d,edi
0x0000020d44a0c9f5: vmovdqu YMMWORD PTR [r12+r8*4+0x90],ymm1 || 0x0000020d44a0e05f: jl 0x0000020d44a0e020
0x0000020d44a0c9ff: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0xb0] ||
0x0000020d44a0ca09: vmovdqu YMMWORD PTR [r12+r8*4+0xb0],ymm1 ||
0x0000020d44a0ca13: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0xd0] ||
0x0000020d44a0ca1d: vmovdqu YMMWORD PTR [r12+r8*4+0xd0],ymm1 ||
0x0000020d44a0ca27: vpaddd ymm1,ymm0,YMMWORD PTR [r12+r8*4+0xf0] ||
0x0000020d44a0ca31: vmovdqu YMMWORD PTR [r12+r8*4+0xf0],ymm1 ||
0x0000020d44a0ca3b: add ebx,0x40 ||
0x0000020d44a0ca3e: cmp ebx,ecx ||
0x0000020d44a0ca40: jl 0x0000020d44a0c9b0 ||
Interesting…or is it T_T
Let sleeping dogs lie
Warning: comparing EV results to AV is not necessarily useful in the real world, since AV can easily deviate with a simple modification, while the point of EV is to have consistent compiler output. So while this might be fun, and very educative…don’t do it.
So I should have ignored this ~45ms discrepancy, but I just couldn’t.
This one is a long story (that I might write up later), but the gist of what we found thus far is that the EV version translated into instructions double the size, so the C2 compiler ends up unrolling half as much.
Also…while the EV version is a tad slower, it was in fact Vectorized, so in a sense, EV works perfectly. Loop unrolling is a different kind of optimization, that might have failed for a completely different (and unrelated) reason.
The assembly snippet above is rather the result, not the cause. For now, just accept this inconsistency, and if you’re interested, check back for a (possible) follow-up rant explaining why. To be continued…
Monkey Wrench Drop - where AV fails and EV thrives
So why even show or mention the assembly/unrolling snafu you ask? Well, showing off is one part of it. But more importantly, to show off the power of EV.
Let’s tweak the examples above a little, by making the 1 constant a bit more dynamic. Something like:
static void increaseBrightness(int[] rgb) {
for (int i = 0; i < 6220800; i++) {
rgb[i] += getValue();
}
}
private static int val = -1;
private static int getValue() {
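// currentTimeMillis() is always positive, so val is initialized exactly once;
// the compiler, however, cannot prove that, and must assume val may keep changing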
if (System.currentTimeMillis() > 0 && val < 0) {
val = (int) (Math.random() * 100);
}
return val;
}
Purists might think this is cheating, but to prove this specific point, I would daresay it is an acceptable cheat.
So what are we doing here? Well, we are tricking the compiler into believing that getValue() does not always return the same value (even though it does). By doing so, the compiler declines to Auto Vectorize the loop :(
On the other hand, the EV variant is still vectorized, making the SIMD version in our benchmark (1000 iterations) about 1 second faster!
To offset the number of times getValue() is called, we hacked the SIMD version as follows:
final IntVector result = vector.add(getValue() + getValue() + getValue() + getValue() + getValue() + getValue() + getValue() + getValue());
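For completeness, a sketch of the full EV variant with that hack in place (the 8 calls assume SPECIES_PREFERRED resolves to 8 int lanes, as it does on my AVX2 machine; increaseBrightnessSimd is just an illustrative name):

static void increaseBrightnessSimd(int[] rgb) {
    final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
    final int bound = species.loopBound(rgb.length);
    for (int offset = 0; offset < bound; offset += species.length()) {
        final IntVector vector = IntVector.fromArray(species, rgb, offset);
        // one getValue() call per lane, to match the vanilla version's call count
        final IntVector result = vector.add(getValue() + getValue() + getValue() + getValue()
                + getValue() + getValue() + getValue() + getValue());
        result.intoArray(rgb, offset);
    }
}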
Real world example
As the esteemed reader (all 1 of you) might be accustomed to, most articles of this nature provide examples that are often a bit removed from what one might do in an actual application. So to continue our quest to show off, let’s do some actual production quality coding here.
My first instinct was to do a Javanese version of the “What to Do When Auto-Vectorization Fails?” article by Intel[5], but I instead chose to do something closer to home. Doom.
Doom 3 in particular. A bit of background: Doom 3, like many games, was very performance driven, and to achieve this, the smart folks at idSoft wrote different method implementations for different CPU/GPU architectures. This achieved performance where possible, and compatibility where needed.
Problem statement
AS A 'frustrated' programmer
I WANT TO calculate the min/max of a huge list of vertices
IN ORDER TO (magically) remove the 'duplicates' from the vertex list generated by the artist tools, which somehow manage to generate duplicate vertices with infinitesimal deviations. ¯\_(ツ)_/¯
Vanilla MinMax() version
So let’s take a look at the C version of how this method was implemented in Doom 3:
/*
============
idSIMD_Generic::MinMax
============
source: https://github.com/id-Software/DOOM-3/blob/master/neo/idlib/math/Simd_Generic.cpp#L523
*/
void VPCALL idSIMD_Generic::MinMax( idVec3 &min, idVec3 &max, const idVec3 *src, const int count ) {
min[0] = min[1] = min[2] = idMath::INFINITY; max[0] = max[1] = max[2] = -idMath::INFINITY;
#define OPER(X) const idVec3 &v = src[(X)]; if ( v[0] < min[0] ) { min[0] = v[0]; } if ( v[0] > max[0] ) { max[0] = v[0]; } if ( v[1] < min[1] ) { min[1] = v[1]; } if ( v[1] > max[1] ) { max[1] = v[1]; } if ( v[2] < min[2] ) { min[2] = v[2]; } if ( v[2] > max[2] ) { max[2] = v[2]; }
UNROLL1(OPER)
#undef OPER
}
Which (after expanding the macros) translates to something like this in Java:
public class idSIMD_Generic {
// NB we tweaked the method a little to accept float arrays as input/output instead of idVec3 classes
// public void MinMax(idVec3 min, idVec3 max, idVec3[] src, int count) {
public void MinMax(float[] min, float[] max, float[] src) {
min[0] = min[1] = min[2] = +idMath.INFINITY;// Float.POSITIVE_INFINITY;
max[0] = max[1] = max[2] = -idMath.INFINITY;// Float.NEGATIVE_INFINITY;
for (int _IX = 0; _IX < src.length; _IX += 3) {
final float x = src[_IX + 0];
final float y = src[_IX + 1];
final float z = src[_IX + 2];
if (x < min[0]) min[0] = x;
if (x > max[0]) max[0] = x;
if (y < min[1]) min[1] = y;
if (y > max[1]) max[1] = y;
if (z < min[2]) min[2] = z;
if (z > max[2]) max[2] = z;
}
}
}
This version of the method is at the mercy of whatever compiler + hardware optimizations are available, so it should work on all platforms (including your toaster). Looking closely at the rest of the Doom 3 repo, we find multiple files (idSIMD_SSE, idSIMD_3DNow, idSIMD_AltiVec...etc; wikipedia is your friend), some having different (probably faster) implementations.
SIMD MinMax() version
So let’s take a look at an EV variant of this method:
/*
============
idSIMD_SSE::MinMax
============
source:https://github.com/id-Software/DOOM-3/blob/master/neo/idlib/math/Simd_SSE.cpp#L3696
*/
void VPCALL idSIMD_SSE::MinMax( idVec3 &min, idVec3 &max, const idVec3 *src, const int count ) {
__asm {
movss xmm0, idMath::INFINITY
.
.
.
.
loop4:
movss xmm4, [esi+eax+0*12+8]
movhps xmm4, [esi+eax+0*12+0]
minps xmm0, xmm4
maxps xmm1, xmm4
.
.
.
add eax, 4*12
jl loop4
.
.
.
}
}
Looking at the parts of the code we show here (the method is too long to explain), we notice 2~4 things:
- movss/movhps: bulk moving instructions, which we already know we can do in Java.
- minps/maxps: which stands for Minimum/Maximum Packed Single-Precision Floating-Point; they compare 2 registers lane by lane, and keep the lowest (respectively highest) float of each pair.
At this point you should be crossing your fingers (I have my toes crossed) while you go check the Vector API javadocs.
Indeed, the API contains exactly what we are looking for[6].
So how would we write this in Java:
public class idSIMD_SSE {
// 256bits == 8 floats
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;
private static void MinMaxSimd(float[] min, float[] max, float[] src) {
// divisibility by 6 makes our lives easier
assert src.length % 6 == 0;
// init with min/max values
FloatVector minVec = FloatVector.broadcast(SPECIES, Float.POSITIVE_INFINITY);
FloatVector maxVec = FloatVector.broadcast(SPECIES, Float.NEGATIVE_INFINITY);
// increment every iteration by 6 floats
final int stride = 6;
// NB: with this bound, the last few floats (the loop tail) are never read; kept simple on purpose
for (int i = 0; i < src.length - stride; i += stride) {
// read 8 floats from the array into a vector (i.e. a SIMD register)
final FloatVector vector = FloatVector.fromArray(SPECIES, src, i);
// basically a Math.min(minVec, vector) between every entry in both
minVec = vector.min(minVec);
maxVec = vector.max(maxVec);
}
// each vector accumulates 2 full xyz triplets (lanes 0-2 and 3-5)
// so we choose the min/max of the 2 accordingly
final float[] mins = minVec.toArray();
final float[] maxs = maxVec.toArray();
// x
min[0] = Math.min(mins[0], mins[3]);
max[0] = Math.max(maxs[0], maxs[3]);
// y
min[1] = Math.min(mins[1], mins[4]);
max[1] = Math.max(maxs[1], maxs[4]);
// z
min[2] = Math.min(mins[2], mins[5]);
max[2] = Math.max(maxs[2], maxs[5]);
}
}
So I tried to make the code self-explanatory by adding as many comments as possible (as I always do).
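To make the calling convention concrete, here is a hypothetical driver (called from inside the class, since the method is private; the vertex data itself is whatever your artist tools produced):

// 20000 vertices => 60000 floats, laid out as xyzxyzxyz...
final float[] src = new float[60000];
// ... fill src with vertex data ...
final float[] min = new float[3];
final float[] max = new float[3];
MinMaxSimd(min, max, src);
System.out.printf("min=(%f, %f, %f) max=(%f, %f, %f)%n",
        min[0], min[1], min[2], max[0], max[1], max[2]);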
So let’s look at some benchmarks of 20000 vertices (60000 floats):
vanilla | simd |
---|---|
34000ns | 18500ns |
Wow, that’s almost twice as fast. Not bad, not bad at all. There’s probably some minor tweaking to be done here (or problems similar to the unrolling issue we talked about before), but I will spare you the need to look at more assembly code, and spare myself the need to pretend that I write my shopping lists in assembly.
Elephant in the room
A keen observer might have…observed, that we use a FloatVector.SPECIES_256, meaning we can read/write/operate on 8 floats simultaneously, and yet we increment our loop by 6 at a time. This of course is due to the fact that the floats are arranged in a sequence of xyzxyzxyz..., meaning reading 8 floats would read xyzxyzxy, the following iteration zxyzxyzx, and so on.
Hang on, so what if instead of comparing fewer floats, we go for more?
Not a bad idea (if I may say so myself), so how high do the SPECIES go?
Well, FloatVector.SPECIES_512 is the max. Which would mean xyzxyzxy_zxyzxyzx. Dang, we get the same problem.
Okay, so how many bits do we need for a full sequence read? Considering our sequence xyz is 3 floats long, and our lane FloatVector.SPECIES_256 is 8 floats wide, we probably need 3 x 8 = 24 floats: xyzxyzxy_zxyzxyzx_yzxyzxyz. Problem is, that results in 768 bits, something our API (and hardware) doesn’t support.
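If you want to verify that ceiling on your own machine, the API can tell you; SPECIES_MAX is the widest shape the running platform supports:

// prints e.g. 256 on an AVX2 machine, 512 with AVX-512
System.out.println(FloatVector.SPECIES_MAX.vectorBitSize());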
So instead, let’s tackle this from a programmer point of view, and worry about the rest later. We need 768 bits, but we can only read 256 bits at a time. Simple, we iterate! Or, in this case, we unroll our iteration.
Meaning our code becomes something like this:
public class idSIMD_SSE {
// 256bits == 8 floats
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;
private static void MinMaxSimd768BitEdition(float[] min, float[] max, float[] src) {
// divisibility by 24 makes our lives easier
assert src.length % 24 == 0;
// init with min values
min[0] = min[1] = min[2] = Float.POSITIVE_INFINITY;
FloatVector firstMinVec = FloatVector.broadcast(SPECIES, Float.POSITIVE_INFINITY);
FloatVector secondMinVec = FloatVector.broadcast(SPECIES, Float.POSITIVE_INFINITY);
FloatVector thirdMinVec = FloatVector.broadcast(SPECIES, Float.POSITIVE_INFINITY);
// init with max values
max[0] = max[1] = max[2] = Float.NEGATIVE_INFINITY;
FloatVector firstMaxVec = FloatVector.broadcast(SPECIES, Float.NEGATIVE_INFINITY);
FloatVector secondMaxVec = FloatVector.broadcast(SPECIES, Float.NEGATIVE_INFINITY);
FloatVector thirdMaxVec = FloatVector.broadcast(SPECIES, Float.NEGATIVE_INFINITY);
// increment by 24 floats * 32 bits = 768 bits
final int stride = 24;
// NB: <= so that the final 24-float chunk is processed as well
for (int i = 0; i <= src.length - stride; i += stride) {
// xyzxyzxy
final FloatVector firstVec = FloatVector.fromArray(SPECIES, src, i + 0);
firstMinVec = firstVec.min(firstMinVec);
firstMaxVec = firstVec.max(firstMaxVec);
// zxyzxyzx
final FloatVector secondVec = FloatVector.fromArray(SPECIES, src, i + 8);
secondMinVec = secondVec.min(secondMinVec);
secondMaxVec = secondVec.max(secondMaxVec);
// yzxyzxyz
final FloatVector thirdVec = FloatVector.fromArray(SPECIES, src, i + 16);
thirdMinVec = thirdVec.min(thirdMinVec);
thirdMaxVec = thirdVec.max(thirdMaxVec);
}
// we now have 8 values in total for each of x, y, z
// so min/max-ing them takes a little more effort
final float[] allMins = new float[24];
firstMinVec.intoArray(allMins, 0);
secondMinVec.intoArray(allMins, 8);
thirdMinVec.intoArray(allMins, 16);
final float[] allMaxs = new float[24];
firstMaxVec.intoArray(allMaxs, 0);
secondMaxVec.intoArray(allMaxs, 8);
thirdMaxVec.intoArray(allMaxs, 16);
for (int i = 0; i < 8; i++) {
// x
min[0] = Math.min(min[0], allMins[(i * 3) + 0]);
max[0] = Math.max(max[0], allMaxs[(i * 3) + 0]);
// y
min[1] = Math.min(min[1], allMins[(i * 3) + 1]);
max[1] = Math.max(max[1], allMaxs[(i * 3) + 1]);
// z
min[2] = Math.min(min[2], allMins[(i * 3) + 2]);
max[2] = Math.max(max[2], allMaxs[(i * 3) + 2]);
}
}
}
So, let’s do a little benchmarking, and see how we did:
vanilla | simd | simd-768bit |
---|---|---|
34000ns | 18500ns | 6100ns |
Holy guacamole, that’s almost 3 times as fast as the simd version and almost 6 times faster than the vanilla one!!!!1one
How is this possible you ask? This is only conjecture, but most CPUs have 8 SSE (Streaming SIMD Extension) registers, meaning our 3 loads/mins/maxs are most likely done concurrently in different registers, instead of sequentially in the same one. Could we go up to 6 registers then…you might ask?
We could, and we did, but we didn’t gain any performance. Meaning that this would be an interesting subject-matter for a follow-up rant :)
Epilogue
The Vector API is a very exciting addition to the Java realm. At first glance it looks complicated and difficult to use, but it is easy to get used to, and pleasantly documented.
A lot of libraries that might depend on JNI or other more native-centric techniques for performance would benefit drastically from the readability and maintainability that writing code with the Vector API will provide.
The bigger problem lies with this new hammer, with which many a programmer will seek to turn their problem into a befitting nail. My advice to you dear reader is, don’t. You will know when the time is right.
My final piece of advice (before I go eat), is to always measure, and to understand what your compiler/CPU wants to do. You take care of them, and they will take care of you.
References
[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf
[2] https://stackoverflow.com/a/36510865
[3] https://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf
[4] https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2
[5] https://software.intel.com/content/www/us/en/develop/articles/what-to-do-when-auto-vectorization-fails.html
[6] https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/Vector.html#min(jdk.incubator.vector.Vector)
[7] https://arstechnica.com/features/2000/03/simd/