Reading the Matmul Benchmark: CPU Performance, in Numbers

Audience: SKaiNET maintainers and contributors. This is the companion-reader to Official Engine Benchmarks; it explains what the published numbers mean, not how to call SKaiNET from application code. If you are using SKaiNET as a library and want to understand the CPU backend conceptually, start at explanation/perf/jvm-cpu.adoc.

Matrix multiplication is essentially the entire compute budget of a modern transformer. Token embeddings, attention projections (Q, K, V, output), MLP gates, final logits — all matmuls. A 7B-parameter Llama-class model spends well over 95% of its decode time inside a matmul kernel; layer norms, softmax, and bookkeeping account for the rest.

This page is a guided reading of the engine matmul benchmark (Official Engine Benchmarks). The benchmark itself is straightforward — call a kernel, time it, divide ops by seconds. The interpretation is where most engineers slip:

  • Why does the same matmul, on the same machine, run at 1 GFLOPS in one kernel and 23 GFLOPS in another?

  • Why does a quantized matmul that dequantizes inline (extra work) beat a dense FP32 matmul?

  • When you stare at a column of GFLOPS numbers, which ones are "good," which are "near peak," and which are leaving compute on the floor?

We will answer those by anchoring everything to a single benchmark run on the reference host (Intel i7-9750H, AVX2, no AVX-512, Ubuntu 24.04, JDK 21, 8 warmups + 5 measured runs):

Scenario Panama Scalar Speedup Regime

engine-kernel-matmul (1024³ FP32)

23.0 GFLOPS

1.13 GFLOPS

20.4×

compute-bound

engine-bf16-matmul (1024³)

1.42 GFLOPS

0.80 GFLOPS

1.77×

mixed

engine-q8-matmul (4096² matvec)

4.42 GOP/s

1.81 GOP/s

2.44×

bandwidth-bound

engine-q4-gemm (4096² matvec)

19.3 GOP/s

n/a

n/a

bandwidth-bound

The Panama FP32 row reflects the tile-microkernel dispatch landed in commit 87589af8. The pre-mnpack baseline was 14.06 GFLOPS (13.4× scalar); see Going further: mnpack tile dispatch below for what changed and why the gain is reproducible.

Hold those four rows in your head — every section below points back at one of them.

Glossary

GEMM

General Matrix Multiply. The BLAS Level 3 operation C = α · A · B + β · C, defined by the BLAS spec in the late 1970s. In ML we set α = 1, β = 0 (overwrite, not accumulate) — that is what "matmul" means in practice. Cost: 2 · m · n · k floating-point ops for an (m × k) · (k × n) = (m × n) multiply. Half of those are multiplies, half are adds.

Panama

The OpenJDK incubator project that added two JVM APIs: the Vector API (jdk.incubator.vector) for portable SIMD, and the Foreign Function and Memory API (java.lang.foreign) for native calls without JNI. "Panama" in this codebase almost always means the Vector API path. The Panama matmul kernel uses FloatVector.fma() which compiles to a hardware VFMADD231PS (x86) or FMLA (NEON) instruction.

SIMD

Single Instruction Multiple Data. One CPU instruction operates on a vector register holding 4 / 8 / 16 float lanes at once. AVX2 = 8 FP32 lanes, AVX-512 = 16, NEON = 4.

Kernel

The innermost compute loop, written once per (precision, instruction set) pair. A kernel does the math; it does not allocate, dispatch, or schedule. The benchmark measures the kernel — nothing else is in the timed region.

Roofline

A back-of-the-envelope model that says: achievable throughput = min(peak_compute, arithmetic_intensity × peak_memory_bandwidth). Arithmetic intensity is flops per byte loaded. We will use this once at the end to explain why the numbers above are not all the same shape.

The two layers you actually write

Before getting to the kernels, anchor the two pieces of public API the benchmark drives.

Data DSL — describing the inputs

import sk.ainet.context.DirectCpuExecutionContext
import sk.ainet.lang.tensor.dsl.tensor
import sk.ainet.lang.types.FP32

val ctx = DirectCpuExecutionContext.create()    // Phase.EVAL by default

val a = tensor<FP32, Float>(ctx, FP32::class) {
    tensor { shape(1024, 1024) { randn(mean = 0f, std = 0.1f) } }
}
val b = tensor<FP32, Float>(ctx, FP32::class) {
    tensor { shape(1024, 1024) { randn(mean = 0f, std = 0.1f) } }
}

The DSL parameters carry the contract: <FP32, Float> ties the dtype tag to the native value type. For loaded model weights (GGUF, SafeTensors) the DSL is bypassed — loaders produce the same Tensor<T, V> directly. The kernel does not care where the bytes came from.

ExecutionContext — describing the computation

val c = ctx.ops.matmul(a, b)
// c.shape == Shape(1024, 1024)

That one line is the entire user-facing surface. The ExecutionContext owns the dispatch table and the registered set of kernel providers. One context per inference session; reuse it across calls.

The benchmark exercises this surface end-to-end (in engine-fp32-gemm) as well as the kernel directly with no dispatch overhead (in engine-kernel-matmul) — the gap between them tells you how expensive dispatch is on this code path.

From ctx.ops.matmul to a kernel call

At JVM start: ServiceLoader scan

best available

User code
ctx.ops.matmul(a, b)

TensorOps
(dtype dispatch)

KernelRegistry
(provider lookup)

Fp32MatmulKernel
(the inner loop)

Output tensor c

ScalarProvider
priority 0, always

PanamaVectorProvider
priority 50, JDK 21+

NativeProvider
priority 100, planned

  1. Dtype dispatch. TensorOps picks the matching kernel slot — matmulFp32, matmulBf16, matmulQ4K, matmulQ8_0.

  2. Provider lookup. KernelRegistry.bestAvailable() returns the highest-priority provider that reports itself available.

  3. Kernel invocation. Raw arrays, offsets, strides, shape — no allocations, no dispatch inside the timed region.

The benchmark scenarios cover both the dispatched path (engine-fp32-gemm goes through ctx.ops.matmul) and the direct kernel path (engine-kernel-matmul lets you pick panama or scalar explicitly). See How SIMD Kernels Are Built for the full SPI walk-through.

Three regimes the benchmark probes

The four rows in the headline table represent three distinct performance regimes. Each tells you something different about what the CPU is spending its time on.

Regime 1 — Scalar matmul: latency-bound

engine-kernel-matmul scalar measures the textbook three-loop kernel. One float multiply + one float add per inner iteration:

for (i in 0 until m) {
    for (j in 0 until n) {
        var acc = 0f
        for (l in 0 until k) {
            acc += a[i * k + l] * b[l * n + j]   // 2 flops
        }
        out[i * n + j] = acc
    }
}

1 CPU cycle

m·n·k iterations

a[i,l]

MULSS

b[l,j]

ADDSS
+ acc

acc

out[i,j]

Measured: 1.05 GFLOPS on the i7-9750H at 1024 × 1024 × 1024. The CPU’s theoretical FP32 peak for one core is ~37 GFLOPS (AVX2 + FMA + 4.5 GHz turbo). The scalar kernel achieves ~3% of single-core peak.

The kernel isn’t doing "extra" work — it does the exact 2 · 10⁹ flops the math requires. The waste comes from what the CPU could have done in those same cycles:

  • AVX2 gives it 8 float lanes; the scalar code uses 1.

  • FMA gives it (multiply + add) per cycle; the scalar code uses multiply, then wait for add to retire.

  • The accumulator dependency means the CPU cannot pipeline across iterations.

That is the "latency-bound" tax: each iteration waits for the previous one to finish. This is the regime where you are most obviously leaving compute on the floor — and it is the regime an untuned kernel always lands in.

Regime 2 — Panama dense matmul: compute-bound

engine-kernel-matmul panama measures the same operation through the JDK Vector API. The inner step becomes "process species.length() lanes per FMA":

private val species: VectorSpecies<Float> = FloatVector.SPECIES_PREFERRED
val step = species.length()                  // 8 on AVX2, 16 on AVX-512, 4 on NEON

for (i in 0 until m) {
    for (j in 0 until n) {
        var acc = FloatVector.zero(species)
        var l = 0
        while (l < species.loopBound(k)) {
            val av = FloatVector.fromArray(species, a, i * k + l)
            val bv = FloatVector.fromArray(species, bPacked, j * k + l)
            acc = av.fma(bv, acc)            // fused multiply-add ×8 lanes
            l += step
        }
        out[i * n + j] = acc.reduceLanes(VectorOperators.ADD)
    }
}

1 CPU cycle (AVX2 = 8 lanes)

m·n·(k/8) iterations

FloatVector a[l..l+8]

VFMADD231PS
(8 lanes, 1 cycle)

FloatVector b[l..l+8]

FloatVector acc

reduceLanes(ADD)

out[i,j]

Naive SIMD measurement (the snippet above): 14.06 GFLOPS13.4× the scalar version. That matches the napkin math closely:

  • SIMD width: 8 lanes (AVX2) → up to 8× more flops per cycle

  • FMA: multiply + add in one instruction → up to 2× the issue rate

  • Predicted ceiling: ~16×; observed: ~13×

The remaining gap (13× of 16×) came from the inner loop processing one (i, j) output cell at a time: every A-row load and B-column load fed exactly one accumulator. With 8 lanes of FMA capacity per cycle and only one accumulator in flight, the issue queue stalls on the dependency chain through acc.

Going further: mnpack tile dispatch

Closing that gap is what the production PanamaVectorMatmulKernel does today. Inside the existing TILE_M × TILE_N outer block, a recursive mnpack dispatcher picks the largest RM × RN microkernel that fits the residual and recurses on the rest:

Microkernel Accumulators Per-step loads Per-step FMAs

gemm4x3

12

4 A + 3 B

12

gemm2x2

4

2 A + 2 B

4

gemm2x1

2

2 A + 1 B

2

gemm1x2

2

1 A + 2 B

2

gemm1x1

1

1 A + 1 B

1

Each microkernel keeps RM × RN FloatVector accumulators live in registers; every A-row load amortizes across RN columns and every B-column load across RM rows. The 4 × 3 microkernel is the largest that fits inside AVX2’s 16 vector registers (12 accumulators + 4 A vectors + 1 B vector at any moment). Smaller microkernels clean up residual rows and columns that don’t divide evenly into the bigger tile shape.

Measured after the change: 23.0 GFLOPS — 20.4× scalar, ~1.65× the naive-SIMD result. Two independent runs reproduce within sub-2% CoV. The gain comes from a single mechanism — FMA-pipeline fill: 12 FMAs from 7 vector loads instead of 1 FMA from 2 loads, so the issue queue can stay saturated instead of waiting on accumulator-chain dependencies.

23 GFLOPS on AVX2 single-core is ~62% of the chip’s ~37 GFLOPS theoretical FP32 peak, up from 38% before. The pattern is borrowed verbatim from tinyBLAS’s mnpack switch in sgemm.cpp (Justine Tunney / llamafile); see How SIMD Kernels Are Built for how SKaiNET’s kernel SPI hosts the microkernels.

This kernel is now compute-bound on a single core, near the single-core ceiling. The remaining headroom requires architectural moves: AVX-512 (16 lanes — 2× headroom; needs a wider microkernel set), multi-threading (6 cores — 6× headroom; needs ith/nth SPI), or a hand-tuned native kernel via FFM.

If you take only one comparison away from this page: the scalar → Panama jump (1× → 13×) is the biggest single performance win available to a pure-JVM compute backend, and the Panama → tiled Panama jump (13× → 20×) is the next biggest. Both cost ~200 lines of Kotlin and zero native code.

Regime 3 — Quantized matmul: bandwidth-bound

For LLM-class matmuls the bottleneck stops being multiply-adds and becomes how fast the weight tensor can be pulled from DRAM into L1. A 7B-parameter Llama in FP32 is 28 GiB of weights; even an L3 cache miss costs ~100 ns. The kernel that ships less weight bandwidth wins, even if it has to do extra arithmetic to dequantize.

Quantized formats trade compute for bandwidth:

Format Bits / elem Bytes / block Pack layout

FP32

32

4 / elem

dense

BF16

16

2 / elem

dense, top 16 bits of FP32

Q8_0

~8.5

34 / 32 elems

FP16 scale + 32 signed int8 codes

Q4_K

~4.5

144 / 256 elems

FP16 scale + FP16 min + 8 sub-blocks of 32 × 4-bit codes

Q4_K block — 256 elements / 144 bytes

d (FP16) 2B

dequant → 32 FP32 weights

dmin (FP16) 2B

8 sub-blocks of 32 × 4-bit codes + sub-scales
140B

FP32 input vector

FMA per dequantized weight

FP32 output element

The Panama Q4_K and Q8_0 kernels follow the same SIMD recipe as FP32, plus a dequant prelude per block: load packed bytes as a ByteVector, unpack with shift / bit-and, widen to FloatVector, multiply by the broadcast block scale, then FMA into the accumulator. See How Quantized SIMD Kernels Are Built for the inner-loop walk-through.

Measured on the i7-9750H:

  • engine-q4-gemm (Q4_K matvec at 4096²): ~19 GOP/s panama (CoV often above 3% on this laptop — the Q4_K path is sensitive to thermal throttling at sustained load)

  • engine-q8-matmul (Q8_0 matvec at 4096²): 4.42 GOP/s panama vs 1.81 scalar — 2.44× speedup

  • engine-bf16-matmul (1024³): 1.42 vs 0.80 GFLOPS — 1.77× speedup

The non-obvious result is that engine-q4-gemm (~19 GOP/s) is close to dense engine-fp32-gemm (~23 GFLOPS after the mnpack work) even though it does extra dequant work. The reason is precisely bandwidth: Q4_K weights are ~9× smaller than FP32, so the kernel spends less time waiting on memory and more time inside the FMA pipeline. The dequant arithmetic happens in registers while the next weight block is still in flight from L2.

This is the regime where "more FLOPS" is the wrong success metric. The right metric is "useful per-token tokens-per-second," and Q4_K wins that metric on bandwidth-bound hosts even though it loses on raw flops/sec.

Where the bytes go — three memory diagrams

The three regimes correspond to three distinct shapes of memory traffic.

FP32 dense (Panama) — compute-bound

4 B/elem

prefetch

FloatVector loads

reduceLanes

DRAM
A: 4MB, B: 4MB, C: 4MB

L3 (12 MiB)

L2 (256 KB)

L1d (32 KB)

YMM registers

VFMADD231PS
8 lanes/cycle

A 1024² FP32 matrix is 4 MiB — fits in L3 but spills L2. With B pre-packed and contiguous the prefetcher keeps up; the FMA pipeline is the bottleneck. Throughput is set by cycles, not by bytes per second.

Quantized (Panama) — bandwidth-bound

0.5–2 B/elem

weight stream

ByteVector load

DRAM
A: 4MB FP32 input
W: 256KB–1MB packed weights

L3

L2

L1

unpack + scale

FMA

FP32 input row (in L1)

FP32 output

The weight tensor shrinks 4–9× vs. FP32. Less memory traffic means more time in the FMA pipeline. The dequant prelude costs a few extra instructions per iteration, but those instructions overlap with the load stalls that used to dominate. Throughput is set by DRAM bandwidth, not by cycles.

Scalar — latency-bound

1 float

m·n·k times

L1d

XMM scalar

MULSS

ADDSS

out[i,j]

One lane per cycle, no FMA, no overlap. The CPU spends most of its cycles waiting for the next dependent ADDSS to retire. Throughput is set by the accumulator dependency chain, not by cycles or bytes.

The roofline — one diagram that explains all four rows

The roofline model puts everything on one chart:

SIMD + FMA

mnpack tile dispatch

AVX-512 + multi-core + native

SIMD on dequant + FMA

scalar FP32
~1 GFLOPS
(latency-bound)

naive Panama FP32
~14 GFLOPS
(1×1 inner loop)

tiled Panama FP32
~23 GFLOPS
(62% single-core peak)

practical CPU peak
~120 GFLOPS
(6 cores, AVX2)

scalar Q8_0
~1.8 GOP/s
(latency-bound)

Panama Q8_0
~4.4 GOP/s
(bandwidth + dequant)

Panama Q4_K
~19 GOP/s
(bandwidth-bound)

Four things to internalize:

  1. The scalar→Panama jump (1× → 13×) is the same speedup mechanism for every dtype. SIMD width and FMA buy you that no matter what bits you’re shipping. If a benchmark shows < 5× scalar→Panama, the kernel has a problem — usually loop-carried dependencies or un-amortized packing.

  2. Once Panama is in play, the next jump (13× → 20×) comes from reusing the SIMD loads across multiple output cells — the mnpack tile dispatch. The mechanism is FMA-pipeline fill, not more bandwidth. A naive 1×1-cell SIMD kernel leaves this win on the table even though every individual cycle looks vectorized.

  3. Once tiled Panama is in play, the next limit depends on the dtype. FP32 is compute-bound at ~62% of single-core peak: more cores, AVX-512, or native code are the remaining levers. Q4_K and Q8_0 are bandwidth-bound: faster DRAM helps; more cores helps until you saturate the memory channel.

  4. Quantization is not "free flops." It is bandwidth amplification. Q4_K kernels do more arithmetic per byte than FP32 kernels, so on a bandwidth-bound machine they look like a compute speedup. On a GPU with HBM3 they look like a compute slowdown because the dense pipeline was never bandwidth-bound there in the first place.

How to read a benchmark number

Looking at any number from the engine matmul benchmark, ask in order:

  1. Which regime is this scenario in? Use the table at the top of this page. Scalar → latency. Dense Panama → compute. Quantized Panama → bandwidth.

  2. What is the local peak for that regime on this host? Single-core FP32 peak ≈ clock_GHz × SIMD_lanes × 2 (FMA). Memory bandwidth peak ≈ DRAM rating (e.g. ~30 GB/s for DDR4-2667).

  3. What fraction of peak does this kernel hit? > 50% is excellent for the compute-bound regime; > 80% of memory bandwidth is excellent for the bandwidth-bound regime; anything over 10% of peak for the scalar regime means something is broken.

  4. Is the CoV under 3%? If not, the number is noise. BenchmarkRunner flags this automatically as "unstable": true in the JSON record. Unstable runs are not publishable.

The benchmark publishes the raw samples, the mean, the stddev, the CoV, and a list of available providers — all the inputs you need to do that interpretation. See Official Engine Benchmarks for how the JSON record is laid out.

Where the CPU backend helps (and where it doesn’t)

The CPU backend earns its keep when:

  • You need to run inference where you cannot ship a GPU runtime — Apple Silicon laptops, ARM SBCs, sandboxed JVMs, Android, CI workers. SKaiNET CPU is --add-modules and a regular JAR.

  • The model fits in DRAM and the workload is interactive per-token. A 1B–7B model at Q4_K runs comfortably on a modern laptop CPU, with decode latency dominated by memory bandwidth — the regime where the quantized kernels shine.

  • You want deterministic, reproducible numbers across operating systems. No driver versions, no CUDA toolkit. The same JAR runs identically on Linux x86, macOS Arm64, and inside a container.

It is not the right answer when:

  • You are training, not inferring. Backprop multiplies the FLOP budget by 3–4× and immediately exceeds what a single CPU socket can deliver.

  • The model is large (30B+) and the workload is batched. GPU bandwidth and memory capacity both scale beyond what a desktop CPU reaches.

  • You need sub-millisecond first-token latency at high batch sizes. That is a TensorRT / vLLM / SGLang problem, not a CPU problem.