
I implemented three naive sum methods (in Scala) which act on an Array[Double] (or double[]):
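These appear to be extension methods, so vec in the snippets below is the receiver array. A minimal sketch of the assumed surrounding definitions (the extension wrapper here is my reconstruction for context, not the original project's code):

    import jdk.incubator.vector.{DoubleVector, VectorOperators}

    extension (vec: Array[Double])
      // sum, sum2 and sum3 below are defined inside this extension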

    inline def sum2: Double =
      var sum: Double = 0.0
      var i: Int = 0
      val species = DoubleVector.SPECIES_256

      while i < species.loopBound(vec.length) do
        val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)
        sum = sum + vec2.reduceLanes(VectorOperators.ADD)
        i += species.length()
      end while
      while i < vec.length do
        sum += vec(i)
        i += 1
      end while
      sum
    end sum2

    inline def sum: Double =
      var sum = 0.0
      var i = 0
      while i < vec.length do
        sum = sum + vec(i)
        i = i + 1
      end while
      sum
    end sum


    inline def sum3: Double =
      var i: Int = 0
      val species = DoubleVector.SPECIES_256

      var acc = DoubleVector.zero(species)

      while i < species.loopBound(vec.length) do
        val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)
        acc = acc.add(vec2) // This line appears to break JMH. It's not clear why, as it passes a unit test and compiles.
        i += species.length()
      end while

      var temp = acc.reduceLanes(VectorOperators.ADD)
      // var temp = 0.0
      while i < vec.length do
        temp += vec(i)
        i += 1
      end while
      temp
    end sum3

Then I set up a benchmark to see if there was a difference.


import java.util.Random
import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations.*
import org.openjdk.jmh.infra.Blackhole

import scala.compiletime.uninitialized

@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(value = 1)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
abstract class BLASBenchmark:

  private final val rand: Random = new Random(0)

  protected def randomDouble(): Double =
    rand.nextDouble()

  protected def randomDoubleArray(n: Int): Array[Double] =
    val res = new Array[Double](n)
    for i <- 0 until n do res(i) = rand.nextDouble()
    res
  end randomDoubleArray

  protected def randomFloat(): Float =
    rand.nextFloat()

  protected def randomFloatArray(n: Int): Array[Float] =
    val res = new Array[Float](n)
    for i <- 0 until n do res(i) = rand.nextFloat()
    res
  end randomFloatArray
end BLASBenchmark
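Not shown in the original post: the Vector API is an incubator module, so the forked benchmark JVM needs --add-modules jdk.incubator.vector at runtime. A minimal sketch of one way to wire that in, using JMH's jvmArgsAppend fork parameter (in place of the plain @Fork(value = 1) above):

@Fork(value = 1, jvmArgsAppend = Array("--add-modules", "jdk.incubator.vector"))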

@State(Scope.Thread)
class SumBenchmark extends BLASBenchmark:

  @Param(Array("3", "10", "100", "10000", "100000"))
  var len: String = uninitialized

  var arr: Array[Double] = uninitialized

  @Setup(Level.Trial)
  def setup: Unit =
    arr = randomDoubleArray(len.toInt)
  end setup

  @Benchmark
  def sum_loop(bh: Blackhole): Unit =
    val r = arr.sum
    bh.consume(r)
  end sum_loop

  @Benchmark
  def sum_vec(bh: Blackhole): Unit =
    val r = arr.sum2
    bh.consume(r)
  end sum_vec

  /** This doesn't work. I don't understand why. */
  @Benchmark
  def sum_vec_alt(bh: Blackhole): Unit =
    val r = arr.sum3
    bh.consume(r)
  end sum_vec_alt

end SumBenchmark

The results were as follows (run locally):

Benchmark               (len)   Mode  Cnt          Score           Error  Units
SumBenchmark.sum_loop       3  thrpt    3  650060202.104 ± 176059544.674  ops/s
SumBenchmark.sum_loop      10  thrpt    3  422309397.044 ±  19968584.370  ops/s
SumBenchmark.sum_loop     100  thrpt    3   30901788.878 ±    354896.502  ops/s
SumBenchmark.sum_loop   10000  thrpt    3     116283.523 ±      4388.696  ops/s
SumBenchmark.sum_loop  100000  thrpt    3      11487.085 ±       586.552  ops/s
SumBenchmark.sum_vec        3  thrpt    3  607547613.267 ±  25376258.940  ops/s
SumBenchmark.sum_vec       10  thrpt    3   48259596.149 ±   1205351.697  ops/s
SumBenchmark.sum_vec      100  thrpt    3    4604684.325 ±    122538.788  ops/s
SumBenchmark.sum_vec    10000  thrpt    3      45689.056 ±      1872.688  ops/s
SumBenchmark.sum_vec   100000  thrpt    3       4729.203 ±        83.013  ops/s

The way I'm reading this is that my vectorised version is, in fact, significantly slower than the simple while loop.

My questions:

  • Am I measuring "the right thing"? I'm not a confident user of JMH.
  • Are my implementations using the Vector API naive, or obviously bleeding performance somewhere?
  • If none of the above, would this be expected? It seems to undermine the value proposition of the Vector API; I was hoping for a speedup!

3 Answers


I believe the algorithm I used was indeed naive.

    inline def sum: Double =
      val sp = Matrix.doubleSpecies
      val l = sp.length()
      // hoist the loop bound so the hot loop makes no calls into the species
      val bound = sp.loopBound(vec.length)
      var i: Int = 0

      var acc = DoubleVector.zero(sp)

      while i < bound do
        acc = acc.add(DoubleVector.fromArray(sp, vec, i))
        i += l
      end while
      var temp = acc.reduceLanes(VectorOperators.ADD)
      while i < vec.length do
        temp += vec(i)
        i += 1
      end while
      temp
    end sum

With the key changes being:

  • remove the length() and other Species calls from the hot loop
  • don't allocate an intermediate data structure; accumulate straight back into the accumulator
  • accumulate along lanes, then reduce once at the end
  • select the species as "species preferred" (see the sketch below)
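As mentioned in the last bullet, Matrix.doubleSpecies is the platform-preferred species. A minimal sketch of such a definition (simplified; the rest is standard Vector API):

    import jdk.incubator.vector.{DoubleVector, VectorSpecies}

    object Matrix:
      // SPECIES_PREFERRED picks the widest vector shape the hardware
      // supports natively (128-bit on Apple silicon, i.e. 2 doubles per vector).
      val doubleSpecies: VectorSpecies[java.lang.Double] =
        DoubleVector.SPECIES_PREFERRED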

This provides a more satisfactory result.

SumBenchmark.sum_loop                  3     N/A  thrpt    3  454987372.518 ±  4974554.176  ops/s
SumBenchmark.sum_loop                100     N/A  thrpt    3   22881036.167 ±   429259.728  ops/s
SumBenchmark.sum_loop             100000     N/A  thrpt    3      10733.807 ±       65.014  ops/s
SumBenchmark.sum_vec                   3     N/A  thrpt    3  454811646.671 ± 12166349.908  ops/s
SumBenchmark.sum_vec                 100     N/A  thrpt    3   40353921.106 ±   547651.178  ops/s
SumBenchmark.sum_vec              100000     N/A  thrpt    3      40141.621 ±      714.686  ops/s
SumBenchmark.sum_vec_alt               3     N/A  thrpt    3  505591602.643 ± 36682920.662  ops/s
SumBenchmark.sum_vec_alt             100     N/A  thrpt    3  110847968.173 ±  1363183.523  ops/s
SumBenchmark.sum_vec_alt          100000     N/A  thrpt    3      42821.901 ±      244.293  ops/s

For very large vectors, that's a speedup of 4x (or so).

  • accumulate along lanes, then reduce at the end point - This is probably the most important factor. A compiler could potentially optimize away temporaries and hoist loop-invariant stuff out of the loop, but unless Scala or Java has a -ffast-math equivalent option it can't reorder FP operations because FP add/mul aren't exactly associative due to rounding error. Commented yesterday
  • But actually I'd have expected either way to bottleneck on the latency of an FP addition, which is 4 cycles on Intel since Skylake, 3 cycles on Alder Lake and Zen-family. (uops.info) That's slow enough to hide the throughput cost of 2 shuffles + 2 adds to reduce a vector of 4 doubles to scalar, since out-of-order exec can do that independent work in parallel to have inputs ready for later iterations. Commented yesterday
  • Perhaps Scala / the JVM is unrolling your loop with multiple accumulators to hide FP latency? Normally you have to do that manually (Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)). What hardware (CPU model) are you testing on, and does it support 256-bit or 512-bit vectors natively? (e.g. x86 with AVX or AVX-512, or AArch64 with SVE?) So was forcing 256-bit making it do something slow if it only had 128-bit vectors? Commented yesterday
  • I'm a new (and self-declared naive) user of this API, so some of your commentary is beyond me. I'm on an Apple silicon Mac, and to my surprise, the preferred species ended up with a lane size of 2. When I ran the benchmarks in CI, the performance gains appeared to be more dramatic. In part, I concluded that the species selection is important, and that I should leave it to the platform to decide. Re "A compiler could potentially optimize away temporaries": although it may potentially do this, in my experiments I think it did not. Intermediate allocation was slower.
    – Simon
    Commented 19 hours ago
  • Apple M1 through M3 don't support SVE at all, only 128-bit vectors, i.e. 2x 64-bit doubles. And yeah, probably good to let the tools pick a vector width appropriate for the hardware. Apparently M4 supports Streaming SVE (scalable.uni-jena.de/opt/sme/micro.html), which I don't know much about; I think it's like normal SVE instructions but you (or the compiler) have to run a special instruction to enter streaming mode? Anyway, that gives you wider vectors, with the same machine code able to work with different hardware widths (useless for JIT, nice for ahead-of-time compilers). Commented 18 hours ago
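To make the multiple-accumulator idea from these comments concrete in Scala, here is a minimal sketch (not code from the thread; it assumes the same vec: Array[Double] extension receiver as the snippets above). Two accumulators create two independent floating-point dependency chains, which is the latency-hiding trick described above:

    inline def sumUnrolled2: Double =
      val sp = DoubleVector.SPECIES_PREFERRED
      val len = sp.length()
      val step = 2 * len
      // Largest index reachable with whole two-vector strides.
      val bound = vec.length - vec.length % step
      var acc0 = DoubleVector.zero(sp)
      var acc1 = DoubleVector.zero(sp)
      var i = 0
      while i < bound do
        acc0 = acc0.add(DoubleVector.fromArray(sp, vec, i))
        acc1 = acc1.add(DoubleVector.fromArray(sp, vec, i + len))
        i += step
      end while
      // Combine the two vector accumulators, then mop up the scalar tail.
      var temp = acc0.add(acc1).reduceLanes(VectorOperators.ADD)
      while i < vec.length do
        temp += vec(i)
        i += 1
      end while
      temp
    end sumUnrolled2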

I've done similar testing in Java (JDK 22), also running on a Mac (M3 Max). In my test, I have an array of 8192 doubles, and I want to calculate the sum. The fastest way I found to do it is to combine the Vector API with separate accumulators. I haven't benchmarked with JMH because I'm not familiar with it; I've just timed it running 100,000 times over the array. So the timings from my tests are for summing a total of 819,200,000 doubles for each algorithm tested.

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public void test() {
    // Set up array of doubles...
    int num = 8192;
    double[] arr = new double[num];
    for (int i = 0; i < num; i++) {
        arr[i] = Math.random();
    }

    // Calculate the sum using various methods
    final int iterations = 100_000;
    long t1, t2;
    double sum = 0;

    // Linear sum with single accumulator
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        for (int i = 0; i < num; i++) {
            sum += arr[i];
        }
    }
    t2 = System.currentTimeMillis();
    log.info("Linear (1 accumulator) took " + (t2 - t1) + " ms with result " + sum);

    // Linear sum with 4 accumulators
    t1 = System.currentTimeMillis();
    sum = 0;
    double sum1 = 0;
    double sum2 = 0;
    double sum3 = 0;
    double sum4 = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        for (int i = 0; i < num; i += 4) {
            sum1 += arr[i];
            sum2 += arr[i + 1];
            sum3 += arr[i + 2];
            sum4 += arr[i + 3];
        }
    }
    sum = sum1 + sum2 + sum3 + sum4;
    t2 = System.currentTimeMillis();
    log.info("Linear (4 accumulators) took " + (t2 - t1) + " ms with result " + sum);


    VectorSpecies<Double> dblSpecies = DoubleVector.SPECIES_PREFERRED;
    int speciesLength = dblSpecies.length();
    int loopBound = dblSpecies.loopBound(num);

    // Vector ReduceLanes
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        for (int i = 0; i < loopBound; i += speciesLength) {
            DoubleVector dv = DoubleVector.fromArray(dblSpecies, arr, i);
            sum += dv.reduceLanes(VectorOperators.ADD);
        }
    }
    t2 = System.currentTimeMillis();
    log.info("Vector (reduceLanes) took " + (t2 - t1) + " ms with result " + sum);

    // Vector Accumulator
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        DoubleVector acc = DoubleVector.zero(dblSpecies);
        for (int i = 0; i < loopBound; i += speciesLength) {
            DoubleVector dv = DoubleVector.fromArray(dblSpecies, arr, i);
            acc = acc.add(dv);
        }
        sum += acc.reduceLanes(VectorOperators.ADD);
    }
    t2 = System.currentTimeMillis();
    log.info("Vector (1 accumulator) took " + (t2 - t1) + " ms with result " + sum);


    // Hybrid - Vector accumulator with multiple linear accumulators
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        // Break the array into 4 regions...
        int partLen = num / 4;
        int s1 = 0;
        int s2 = partLen;
        int s3 = 2 * partLen;
        int s4 = 3 * partLen;
        int partLoopBound = dblSpecies.loopBound(partLen);
        DoubleVector acc1 = DoubleVector.zero(dblSpecies);
        DoubleVector acc2 = DoubleVector.zero(dblSpecies);
        DoubleVector acc3 = DoubleVector.zero(dblSpecies);
        DoubleVector acc4 = DoubleVector.zero(dblSpecies);
        for (int i = 0; i < partLoopBound; i += speciesLength) {
            DoubleVector dv1 = DoubleVector.fromArray(dblSpecies, arr, i + s1);
            acc1 = acc1.add(dv1);
            DoubleVector dv2 = DoubleVector.fromArray(dblSpecies, arr, i + s2);
            acc2 = acc2.add(dv2);
            DoubleVector dv3 = DoubleVector.fromArray(dblSpecies, arr, i + s3);
            acc3 = acc3.add(dv3);
            DoubleVector dv4 = DoubleVector.fromArray(dblSpecies, arr, i + s4);
            acc4 = acc4.add(dv4);
        }
        sum += acc1.reduceLanes(VectorOperators.ADD)
                + acc2.reduceLanes(VectorOperators.ADD)
                + acc3.reduceLanes(VectorOperators.ADD)
                + acc4.reduceLanes(VectorOperators.ADD);
    }
    t2 = System.currentTimeMillis();
    log.info("Vector (4 accumulators) took " + (t2 - t1) + " ms with result " + sum);
}

The first time this test method is called, I get results:

Linear (1 accumulator) took 573 ms with result 4.098820819868104E8
Linear (4 accumulators) took 150 ms with result 4.0988208200299764E8
Vector (reduceLanes) took 290 ms with result 4.098820819863893E8
Vector (1 accumulator) took 1691 ms with result 4.0988208200494045E8
Vector (4 accumulators) took 1174 ms with result 4.0988208200494045E8

By the 4th or 5th time I call it, I assume the JVM has optimised the bytecode and I get more sensible timings for the last couple of Vector API calls:

Linear (1 accumulator) took 605 ms with result 4.101329885212625E8
Linear (4 accumulators) took 142 ms with result 4.10132988510815E8
Vector (reduceLanes) took 290 ms with result 4.1013298850172895E8
Vector (1 accumulator) took 279 ms with result 4.1013298851254654E8
Vector (4 accumulators) took 71 ms with result 4.1013298851254654E8

My takeaways from this test are:

  • The JVM seems to optimise the code to make best use of SIMD instructions after the code has been invoked lots of times. This is particularly evident when using Vector API add operations. I'm curious to know why, given the Vector API is designed to optimise the bytecode from the initial compilation (e.g. it has @ForceInline all over the place).
  • Using a linear sum with multiple accumulators is faster than a simple Vector API implementation (at least on Apple silicon where the species length is limited to 128 bits, so it can only store 2 doubles).
  • For the Vector API, using an accumulator is slightly faster than reduceLanes (i.e. it takes ~96% of the time). This makes sense to me, because even with "only" 128 bits, my understanding is that it's adding 2 doubles simultaneously. Whereas reduceLanes with only 2 values is effectively adding 1 double to another in sequence. It's faster than a linear sum with a single accumulator (46%)... but it's still slower than doing a linear sum using 4 accumulators (196%)
  • The fastest way to sum a double array is to combine these techniques into a hybrid approach, where we use the Vector API to accumulate the values, but also use multiple accumulators (12% compared to linear with 1 accumulator, and 50% compared to linear with the same number (4) of accumulators).

I haven't been able to run this test on any other hardware, but my working assumption is that for CPUs with a larger "species size" (e.g. 256 or 512 bit vectors) the gain by using the Vector API would be greater than I found in my results. But even with 128 bit vectors, it's still worth using the Vector API - as long as you combine it with the multiple accumulators technique. This can give you nearly a 10x speedup compared to a naive linear sum.

A couple of other final notes:

  • The code above doesn't deal with trailing values where the array length isn't exactly divisible by the species size (or the number of accumulators). I expect that adding this should have a negligible impact on the speed. This is why I chose an array size of 8192, because it's divisible by both species length (2) and then again by the number of accumulators (4), so I'd still get accurate results.
  • Although my test code takes things like the species length out of the hot loop, I couldn't replicate your results in this area... putting it inside the hot loop makes little/no difference for me. I suspect that the JVM optimises this stuff anyway, or at least it does after it's run several times. I'm running each algorithm 100,000 times for each test, which may explain why I didn't see any noticeable difference.

In terms of throughput, the "best" algorithm here is summing 819,200,000 doubles in 71ms, which is ~11.5 billion values per second (compared to ~1.4 billion per second using a naive sum). Your mileage will probably vary significantly, depending on your CPU.

  • Well - very interesting, thanks for posting! Would you consider adding my final solution into your suite? I'd be very curious to see how it performs relative to the others; it's closest to your single accumulator, but with only a single loop and no intermediate allocations (I believe).
    – Simon
    Commented 14 hours ago
  • Sure, but it's basically the test I called Vector accumulator... Can't fit the code in a comment, see my next answer.
    – JavaJ
    Commented 13 hours ago
  • Good point about needing some warm-up iterations before the JIT compiler will make fully optimized machine code. Your second set of numbers looks like I'd expect from asm loops that look like their source code on Apple M1 (128-bit SIMD vectors with wide out-of-order exec to overlap reduce adds with scalar adds, and with the latency x throughput product of FP add instructions being at least 4. e.g. 4 cycle latency, 2/clock throughput would need at least 8 accumulators before you're keeping both FPUs busy every cycle, and those need to be 128-bit accumulators to use the full width of the FPUs.) Commented 7 hours ago
  • Related re: multiple accumulators (with x86-64 as an example, but should be similar for out-of-order exec on AArch64): Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) . On a modern Intel or Zen 5 with AVX-512, with a small array that fits in L1d cache, the speedup for an unrolled vector loop vs. a scalar serial loop should be 6x (3 cycle latency x 2 adds/clock) times 8x doubles per vector = 48x speedup, or a bit less since 512-bit vectors reduce clock speed some. Commented 6 hours ago

As requested, here's the Java equivalent of Simon's enhanced solution. This is basically the equivalent of the "Vector (1 accumulator)" algorithm in my previous post, but includes Simon's changes to not assign the DoubleVector to a variable, uses a while loop instead of a for loop, etc. to make it as close as possible to Simon's version. It still runs the code 100,000 times to allow comparison of timings with the other algorithms.

// Vector accumulator (original post equivalent)
t1 = System.currentTimeMillis();
sum = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
    DoubleVector acc = DoubleVector.zero(dblSpecies);
    int i = 0;
    while (i < loopBound) {
        acc = acc.add(DoubleVector.fromArray(dblSpecies, arr, i));
        i += speciesLength;
    }
    sum += acc.reduceLanes(VectorOperators.ADD);
}
t2 = System.currentTimeMillis();
log.info("Vector (original post) took " + (t2 - t1) + " ms with result " + sum);

The result for this was 274ms, so basically the same as the original "Vector (1 accumulator)" code (within a very small margin of error of ~1.4%).

I don't believe that "not assigning" the DoubleVector to a variable in the code makes any difference... Internally, the result of that instruction is assigned to something - either explicitly in your code, or just implicitly to a stack variable inside the JVM so that the result can be passed into the next function as a parameter. Remember, Java passes objects by reference, so there must be a pointer internally to the result of the DoubleVector.fromArray() method invocation in order to pass that into the acc.add() method. It's not like the actual data is being copied in either case - just the reference to that data.

As a final note, there will be very little (if any) heap object allocation in this code. It looks like there is, because of all the DoubleVector.fromArray() calls inside the hot loop... these return a DoubleVector, which is an object, so you'd think it would allocate on the heap for that. But in reality, the JVM is smart enough to allocate data which doesn't "escape" the enclosing method on the stack, which is super-fast (compared to heap allocation). So in case anyone was wondering, the GC overhead for creating all these objects should be nearly negligible, since it should all be on the stack, not the heap.

  • Awesome. This is such a cool answer for me. Thank you!
    – Simon
    Commented 12 hours ago
  • You're welcome, hope this helps you!
    – JavaJ
    Commented 12 hours ago
  • Java passes objects by reference, so there must be a pointer internally to the result of the DoubleVector.fromArray() method invocation in order to pass that into the acc.add() method - after the JIT is done optimizing this, one would hope there aren't any actual function calls in the machine code! Or any actual pointers to the DoubleVector. It should JIT to an asm loop similar to what a C++ compiler will do (godbolt.org/z/soxP7nrqc), loading a SIMD vector into a register (like ldr q1, [x0, x8]) and adding like fadd v0.2d, v0.2d, v1.2d. (@Simon) Commented 8 hours ago
  • The same escape analysis used by the JVM to allow stack allocation lets it actually optimize away any memory space for holding a variable if it can just keep it in a register. So yes in the sense of avoiding GC overhead, it makes sense to talk about stack vs. heap allocation, even if the object never actually gets stored in stack memory at all, just registers. Commented 8 hours ago
  • @PeterCordes Very interesting, thanks for the insights.
    – JavaJ
    Commented 6 hours ago
