Reading the Matmul Benchmark: CPU Performance, in Numbers
|
Audience: SKaiNET maintainers and contributors. This is the companion-reader to Official Engine Benchmarks; it explains what the published numbers mean, not how to call SKaiNET from application code. If you are using SKaiNET as a library and want to understand the CPU backend conceptually, start at explanation/perf/jvm-cpu.adoc. |
Matrix multiplication is essentially the entire compute budget of a modern transformer. Token embeddings, attention projections (Q, K, V, output), MLP gates, final logits — all matmuls. A 7B-parameter Llama-class model spends well over 95% of its decode time inside a matmul kernel; layer norms, softmax, and bookkeeping account for the rest.
This page is a guided reading of the engine matmul benchmark (Official Engine Benchmarks). The benchmark itself is straightforward — call a kernel, time it, divide ops by seconds. The interpretation is where most engineers slip:
-
Why does the same matmul, on the same machine, run at 1 GFLOPS in one kernel and 23 GFLOPS in another?
-
Why does a quantized matmul that dequantizes inline (extra work) beat a dense FP32 matmul?
-
When you stare at a column of GFLOPS numbers, which ones are "good," which are "near peak," and which are leaving compute on the floor?
We will answer those by anchoring everything to a single benchmark run on the reference host (Intel i7-9750H, AVX2, no AVX-512, Ubuntu 24.04, JDK 21, 8 warmups + 5 measured runs):
| Scenario | Panama | Scalar | Speedup | Regime |
|---|---|---|---|---|
|
23.0 GFLOPS |
1.13 GFLOPS |
20.4× |
compute-bound |
|
1.42 GFLOPS |
0.80 GFLOPS |
1.77× |
mixed |
|
4.42 GOP/s |
1.81 GOP/s |
2.44× |
bandwidth-bound |
|
19.3 GOP/s |
n/a |
n/a |
bandwidth-bound |
The Panama FP32 row reflects the tile-microkernel dispatch landed in
commit 87589af8. The pre-mnpack baseline was 14.06 GFLOPS (13.4×
scalar); see Going further: mnpack tile dispatch below for what changed and why the
gain is reproducible.
Hold those four rows in your head — every section below points back at one of them.
Glossary
- GEMM
-
General Matrix Multiply. The BLAS Level 3 operation
C = α · A · B + β · C, defined by the BLAS spec in the late 1970s. In ML we setα = 1, β = 0(overwrite, not accumulate) — that is what "matmul" means in practice. Cost:2 · m · n · kfloating-point ops for an(m × k) · (k × n) = (m × n)multiply. Half of those are multiplies, half are adds. - Panama
-
The OpenJDK incubator project that added two JVM APIs: the Vector API (
jdk.incubator.vector) for portable SIMD, and the Foreign Function and Memory API (java.lang.foreign) for native calls without JNI. "Panama" in this codebase almost always means the Vector API path. The Panama matmul kernel usesFloatVector.fma()which compiles to a hardwareVFMADD231PS(x86) orFMLA(NEON) instruction. - SIMD
-
Single Instruction Multiple Data. One CPU instruction operates on a vector register holding 4 / 8 / 16 float lanes at once. AVX2 = 8 FP32 lanes, AVX-512 = 16, NEON = 4.
- Kernel
-
The innermost compute loop, written once per (precision, instruction set) pair. A kernel does the math; it does not allocate, dispatch, or schedule. The benchmark measures the kernel — nothing else is in the timed region.
- Roofline
-
A back-of-the-envelope model that says: achievable throughput =
min(peak_compute, arithmetic_intensity × peak_memory_bandwidth). Arithmetic intensity is flops per byte loaded. We will use this once at the end to explain why the numbers above are not all the same shape.
The two layers you actually write
Before getting to the kernels, anchor the two pieces of public API the benchmark drives.
Data DSL — describing the inputs
import sk.ainet.context.DirectCpuExecutionContext
import sk.ainet.lang.tensor.dsl.tensor
import sk.ainet.lang.types.FP32
val ctx = DirectCpuExecutionContext.create() // Phase.EVAL by default
val a = tensor<FP32, Float>(ctx, FP32::class) {
tensor { shape(1024, 1024) { randn(mean = 0f, std = 0.1f) } }
}
val b = tensor<FP32, Float>(ctx, FP32::class) {
tensor { shape(1024, 1024) { randn(mean = 0f, std = 0.1f) } }
}
The DSL parameters carry the contract: <FP32, Float> ties the dtype
tag to the native value type. For loaded model weights (GGUF,
SafeTensors) the DSL is bypassed — loaders produce the same
Tensor<T, V> directly. The kernel does not care where the bytes
came from.
ExecutionContext — describing the computation
val c = ctx.ops.matmul(a, b)
// c.shape == Shape(1024, 1024)
That one line is the entire user-facing surface. The
ExecutionContext owns the dispatch table and the registered set of
kernel providers. One context per inference session; reuse it
across calls.
The benchmark exercises this surface end-to-end (in
engine-fp32-gemm) as well as the kernel directly with no dispatch
overhead (in engine-kernel-matmul) — the gap between them tells you
how expensive dispatch is on this code path.
From ctx.ops.matmul to a kernel call
-
Dtype dispatch.
TensorOpspicks the matching kernel slot —matmulFp32,matmulBf16,matmulQ4K,matmulQ8_0. -
Provider lookup.
KernelRegistry.bestAvailable()returns the highest-priority provider that reports itself available. -
Kernel invocation. Raw arrays, offsets, strides, shape — no allocations, no dispatch inside the timed region.
The benchmark scenarios cover both the dispatched path
(engine-fp32-gemm goes through ctx.ops.matmul) and the direct
kernel path (engine-kernel-matmul lets you pick panama or
scalar explicitly). See How SIMD Kernels Are Built
for the full SPI walk-through.
Three regimes the benchmark probes
The four rows in the headline table represent three distinct performance regimes. Each tells you something different about what the CPU is spending its time on.
Regime 1 — Scalar matmul: latency-bound
engine-kernel-matmul scalar measures the textbook three-loop
kernel. One float multiply + one float add per inner iteration:
for (i in 0 until m) {
for (j in 0 until n) {
var acc = 0f
for (l in 0 until k) {
acc += a[i * k + l] * b[l * n + j] // 2 flops
}
out[i * n + j] = acc
}
}
Measured: 1.05 GFLOPS on the i7-9750H at 1024 × 1024 × 1024. The CPU’s theoretical FP32 peak for one core is ~37 GFLOPS (AVX2 + FMA + 4.5 GHz turbo). The scalar kernel achieves ~3% of single-core peak.
The kernel isn’t doing "extra" work — it does the exact 2 · 10⁹
flops the math requires. The waste comes from what the CPU could
have done in those same cycles:
-
AVX2 gives it 8 float lanes; the scalar code uses 1.
-
FMA gives it
(multiply + add) per cycle; the scalar code usesmultiply, then wait for add to retire. -
The accumulator dependency means the CPU cannot pipeline across iterations.
That is the "latency-bound" tax: each iteration waits for the previous one to finish. This is the regime where you are most obviously leaving compute on the floor — and it is the regime an untuned kernel always lands in.
Regime 2 — Panama dense matmul: compute-bound
engine-kernel-matmul panama measures the same operation through
the JDK Vector API. The inner step becomes "process
species.length() lanes per FMA":
private val species: VectorSpecies<Float> = FloatVector.SPECIES_PREFERRED
val step = species.length() // 8 on AVX2, 16 on AVX-512, 4 on NEON
for (i in 0 until m) {
for (j in 0 until n) {
var acc = FloatVector.zero(species)
var l = 0
while (l < species.loopBound(k)) {
val av = FloatVector.fromArray(species, a, i * k + l)
val bv = FloatVector.fromArray(species, bPacked, j * k + l)
acc = av.fma(bv, acc) // fused multiply-add ×8 lanes
l += step
}
out[i * n + j] = acc.reduceLanes(VectorOperators.ADD)
}
}
Naive SIMD measurement (the snippet above): 14.06 GFLOPS — 13.4× the scalar version. That matches the napkin math closely:
-
SIMD width: 8 lanes (AVX2) → up to 8× more flops per cycle
-
FMA: multiply + add in one instruction → up to 2× the issue rate
-
Predicted ceiling: ~16×; observed: ~13×
The remaining gap (13× of 16×) came from the inner loop processing
one (i, j) output cell at a time: every A-row load and B-column
load fed exactly one accumulator. With 8 lanes of FMA capacity per
cycle and only one accumulator in flight, the issue queue stalls on
the dependency chain through acc.
Going further: mnpack tile dispatch
Closing that gap is what the production PanamaVectorMatmulKernel
does today. Inside the existing TILE_M × TILE_N outer block, a
recursive mnpack dispatcher picks the largest RM × RN
microkernel that fits the residual and recurses on the rest:
| Microkernel | Accumulators | Per-step loads | Per-step FMAs |
|---|---|---|---|
|
12 |
4 A + 3 B |
12 |
|
4 |
2 A + 2 B |
4 |
|
2 |
2 A + 1 B |
2 |
|
2 |
1 A + 2 B |
2 |
|
1 |
1 A + 1 B |
1 |
Each microkernel keeps RM × RN FloatVector accumulators live in
registers; every A-row load amortizes across RN columns and every
B-column load across RM rows. The 4 × 3 microkernel is the
largest that fits inside AVX2’s 16 vector registers (12 accumulators
+ 4 A vectors + 1 B vector at any moment). Smaller microkernels
clean up residual rows and columns that don’t divide evenly into
the bigger tile shape.
Measured after the change: 23.0 GFLOPS — 20.4× scalar, ~1.65× the naive-SIMD result. Two independent runs reproduce within sub-2% CoV. The gain comes from a single mechanism — FMA-pipeline fill: 12 FMAs from 7 vector loads instead of 1 FMA from 2 loads, so the issue queue can stay saturated instead of waiting on accumulator-chain dependencies.
23 GFLOPS on AVX2 single-core is ~62% of the chip’s ~37 GFLOPS
theoretical FP32 peak, up from 38% before. The pattern is borrowed
verbatim from tinyBLAS’s mnpack switch in sgemm.cpp (Justine
Tunney / llamafile); see How SIMD Kernels Are Built
for how SKaiNET’s kernel SPI hosts the microkernels.
This kernel is now compute-bound on a single core, near the
single-core ceiling. The remaining headroom requires architectural
moves: AVX-512 (16 lanes — 2× headroom; needs a wider microkernel
set), multi-threading (6 cores — 6× headroom; needs ith/nth
SPI), or a hand-tuned native kernel via FFM.
If you take only one comparison away from this page: the scalar → Panama jump (1× → 13×) is the biggest single performance win available to a pure-JVM compute backend, and the Panama → tiled Panama jump (13× → 20×) is the next biggest. Both cost ~200 lines of Kotlin and zero native code.
Regime 3 — Quantized matmul: bandwidth-bound
For LLM-class matmuls the bottleneck stops being multiply-adds and becomes how fast the weight tensor can be pulled from DRAM into L1. A 7B-parameter Llama in FP32 is 28 GiB of weights; even an L3 cache miss costs ~100 ns. The kernel that ships less weight bandwidth wins, even if it has to do extra arithmetic to dequantize.
Quantized formats trade compute for bandwidth:
| Format | Bits / elem | Bytes / block | Pack layout |
|---|---|---|---|
|
32 |
4 / elem |
dense |
|
16 |
2 / elem |
dense, top 16 bits of FP32 |
|
~8.5 |
34 / 32 elems |
FP16 scale + 32 signed int8 codes |
|
~4.5 |
144 / 256 elems |
FP16 scale + FP16 min + 8 sub-blocks of 32 × 4-bit codes |
The Panama Q4_K and Q8_0 kernels follow the same SIMD recipe as FP32,
plus a dequant prelude per block: load packed bytes as a
ByteVector, unpack with shift / bit-and, widen to FloatVector,
multiply by the broadcast block scale, then FMA into the accumulator.
See How Quantized SIMD Kernels Are Built for the
inner-loop walk-through.
Measured on the i7-9750H:
-
engine-q4-gemm(Q4_K matvec at 4096²): ~19 GOP/s panama (CoV often above 3% on this laptop — the Q4_K path is sensitive to thermal throttling at sustained load) -
engine-q8-matmul(Q8_0 matvec at 4096²): 4.42 GOP/s panama vs 1.81 scalar — 2.44× speedup -
engine-bf16-matmul(1024³): 1.42 vs 0.80 GFLOPS — 1.77× speedup
The non-obvious result is that engine-q4-gemm (~19 GOP/s) is
close to dense engine-fp32-gemm (~23 GFLOPS after the mnpack
work) even though it does extra dequant work. The reason is
precisely bandwidth: Q4_K weights are ~9× smaller than FP32, so the
kernel spends less time waiting on memory and more time inside the
FMA pipeline. The dequant arithmetic happens in registers while the
next weight block is still in flight from L2.
This is the regime where "more FLOPS" is the wrong success metric. The right metric is "useful per-token tokens-per-second," and Q4_K wins that metric on bandwidth-bound hosts even though it loses on raw flops/sec.
Where the bytes go — three memory diagrams
The three regimes correspond to three distinct shapes of memory traffic.
FP32 dense (Panama) — compute-bound
A 1024² FP32 matrix is 4 MiB — fits in L3 but spills L2. With B pre-packed and contiguous the prefetcher keeps up; the FMA pipeline is the bottleneck. Throughput is set by cycles, not by bytes per second.
Quantized (Panama) — bandwidth-bound
The weight tensor shrinks 4–9× vs. FP32. Less memory traffic means more time in the FMA pipeline. The dequant prelude costs a few extra instructions per iteration, but those instructions overlap with the load stalls that used to dominate. Throughput is set by DRAM bandwidth, not by cycles.
The roofline — one diagram that explains all four rows
The roofline model puts everything on one chart:
Four things to internalize:
-
The scalar→Panama jump (1× → 13×) is the same speedup mechanism for every dtype. SIMD width and FMA buy you that no matter what bits you’re shipping. If a benchmark shows < 5× scalar→Panama, the kernel has a problem — usually loop-carried dependencies or un-amortized packing.
-
Once Panama is in play, the next jump (13× → 20×) comes from reusing the SIMD loads across multiple output cells — the
mnpacktile dispatch. The mechanism is FMA-pipeline fill, not more bandwidth. A naive 1×1-cell SIMD kernel leaves this win on the table even though every individual cycle looks vectorized. -
Once tiled Panama is in play, the next limit depends on the dtype. FP32 is compute-bound at ~62% of single-core peak: more cores, AVX-512, or native code are the remaining levers. Q4_K and Q8_0 are bandwidth-bound: faster DRAM helps; more cores helps until you saturate the memory channel.
-
Quantization is not "free flops." It is bandwidth amplification. Q4_K kernels do more arithmetic per byte than FP32 kernels, so on a bandwidth-bound machine they look like a compute speedup. On a GPU with HBM3 they look like a compute slowdown because the dense pipeline was never bandwidth-bound there in the first place.
How to read a benchmark number
Looking at any number from the engine matmul benchmark, ask in order:
-
Which regime is this scenario in? Use the table at the top of this page. Scalar → latency. Dense Panama → compute. Quantized Panama → bandwidth.
-
What is the local peak for that regime on this host? Single-core FP32 peak ≈
clock_GHz × SIMD_lanes × 2 (FMA). Memory bandwidth peak ≈ DRAM rating (e.g. ~30 GB/s for DDR4-2667). -
What fraction of peak does this kernel hit? > 50% is excellent for the compute-bound regime; > 80% of memory bandwidth is excellent for the bandwidth-bound regime; anything over 10% of peak for the scalar regime means something is broken.
-
Is the CoV under 3%? If not, the number is noise.
BenchmarkRunnerflags this automatically as"unstable": truein the JSON record. Unstable runs are not publishable.
The benchmark publishes the raw samples, the mean, the stddev, the CoV, and a list of available providers — all the inputs you need to do that interpretation. See Official Engine Benchmarks for how the JSON record is laid out.
Where the CPU backend helps (and where it doesn’t)
The CPU backend earns its keep when:
-
You need to run inference where you cannot ship a GPU runtime — Apple Silicon laptops, ARM SBCs, sandboxed JVMs, Android, CI workers. SKaiNET CPU is
--add-modulesand a regular JAR. -
The model fits in DRAM and the workload is interactive per-token. A 1B–7B model at Q4_K runs comfortably on a modern laptop CPU, with decode latency dominated by memory bandwidth — the regime where the quantized kernels shine.
-
You want deterministic, reproducible numbers across operating systems. No driver versions, no CUDA toolkit. The same JAR runs identically on Linux x86, macOS Arm64, and inside a container.
It is not the right answer when:
-
You are training, not inferring. Backprop multiplies the FLOP budget by 3–4× and immediately exceeds what a single CPU socket can deliver.
-
The model is large (30B+) and the workload is batched. GPU bandwidth and memory capacity both scale beyond what a desktop CPU reaches.
-
You need sub-millisecond first-token latency at high batch sizes. That is a TensorRT / vLLM / SGLang problem, not a CPU problem.
Related
-
Official engine benchmarks — the methodology, manifest, and CI lanes behind the numbers above.
-
How SIMD kernels are built — the FP32 Panama kernel walk-through and the full kernel-provider SPI rationale.
-
How quantized SIMD kernels are built — inner loops for Q4_0 / Q4_K / Q6_K / Q8_0.
-
JVM CPU performance — JVM flags, vector module enablement, JIT considerations.