How Quantized SIMD Kernels Are Built
For the broader kernel SPI story (KernelProvider, KernelRegistry,
ServiceLoader auto-discovery), see
How SIMD Kernels Are Built. This page focuses on the
inner loops of the quantized matmul kernels (Q4_0, Q4_K, Q6_K,
Q8_0), which is where most of the wall-clock time of an LLM decode
goes once the FP32 path is fast.
The general pipeline
For a single output[o] = Σ_j input[j] · dequant(weight[o, j]) cell,
the SIMD recipe is:
1. Load codes. W packed integers from the weight buffer as a ByteVector, either via ByteVector.fromArray (heap ByteArray) or ByteVector.fromMemorySegment (mmap’d weights, FFM).
2. Unpack. For 4-bit codes: byteVec.and(0x0F) for low nibbles, byteVec.lanewise(LSHR, 4) for high nibbles. Sign-correct as needed (Q4_0 subtracts 8; Q4_K subtracts a per-sub-block min lazily; Q6_K combines ql + qh then subtracts 32).
3. Widen + convert. castShape(floatSpecies, 0) lane-widens the byte vector to a FloatVector in one shape conversion (under the hood: byte → int → float, but JIT’d as a single instruction sequence on most targets).
4. Apply scale. Multiply by the broadcast block scale (and sub-scale, for K-quants).
5. Load input. W floats from the FP32 input via FloatVector.fromArray.
6. FMA. acc = inputVec.fma(weightFloatVec, acc).
7. Repeat across the block. Reduce once per output cell at the end via acc.reduceLanes(ADD).
The same skeleton drives every quantized kernel; the differences are all in steps 1-4 (block layout, sign convention, scale recovery).
The four format pipelines
Q8_0: 32 elements / 34 bytes
Single FP16 scale + 32 signed int8 codes. The simplest case: codes
are already signed bytes, no nibble unpack. Pipeline:
val byteVec = ByteVector.fromArray(byteSpeciesForFloat, codes, codesOffset + idx)
val codeVec = byteVec.castShape(floatSpecies, 0) as FloatVector // sign-extends
accVec = inputVec.mul(codeVec).add(accVec)
// final: (accVec.reduceLanes(ADD) + scalarTail) * scale
Q8_0 is the "gold reference": the cleanest expression of the
pipeline. Q8_0 MemSeg (matmulF32Q8_0MemSeg) is the same loop with
ByteVector.fromMemorySegment.
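To make the loop shape concrete without the Vector API, here is a scalar Kotlin model of the Q8_0 block dot product. It is only a sketch: the name dotQ8_0Block is hypothetical, the lanes parameter stands in for the species length, and the block scale d is assumed to be already decoded from FP16.

```kotlin
// Scalar model of the Q8_0 inner loop. Hypothetical helper, not part of the kernel SPI.
// `lanes` stands in for floatSpecies.length(); `d` is the block scale, already decoded from FP16.
fun dotQ8_0Block(
    input: FloatArray, inputBase: Int,
    codes: ByteArray, codesBase: Int,
    d: Float, lanes: Int = 8,
): Float {
    val blockSize = 32
    val acc = FloatArray(lanes)                    // models accVec: one partial sum per lane
    val loopBound = blockSize - blockSize % lanes  // largest multiple of `lanes` that fits in the block
    var idx = 0
    while (idx < loopBound) {
        for (lane in 0 until lanes) {
            // Byte.toFloat() keeps the sign, like the byte -> float shape conversion.
            val code = codes[codesBase + idx + lane].toFloat()
            acc[lane] += input[inputBase + idx + lane] * code
        }
        idx += lanes
    }
    // Scalar tail for elements the chunked loop did not cover (empty when `lanes` divides 32).
    var scalarTail = 0f
    while (idx < blockSize) {
        scalarTail += input[inputBase + idx] * codes[codesBase + idx].toFloat()
        idx++
    }
    // Reduce once, then apply the scale once per block.
    return (acc.sum() + scalarTail) * d
}
```

The last line mirrors the "final" comment in the snippet above: the scale is applied once per block after the reduction, not once per element.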
Q4_K: 256 elements / 144 bytes / 8 sub-blocks
The format that matters most for current LLMs (Gemma 4 Q4_K_M, Llama, Qwen). Block layout (canonical ggml):
- bytes [0, 2): d (super-block scale, FP16 LE)
- bytes [2, 4): dMin (super-block min-scale, FP16 LE)
- bytes [4, 16): 12 bytes packed (scaleIdx, minIdx) for 8 sub-blocks via ggml’s get_scale_min_k4 mixing
- bytes [16, 144): 128 bytes of 4-bit codes, strided in 4 groups of 32: each byte’s lo nibble belongs to one sub-block, hi nibble to the next sub-block over the same intra-group index.
Per element: dequant = code · scale[s] − offset[s] where scale[s] =
d · scaleIdx[s] and offset[s] = dMin · minIdx[s].
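The 12-byte (scaleIdx, minIdx) field is the least obvious part of the layout. The sketch below is a Kotlin port of get_scale_min_k4 as it appears in the ggml reference source; the function name scaleMinK4 and the Pair return type are illustrative choices, and the project's own unpacker may be shaped differently.

```kotlin
// Kotlin port of ggml's get_scale_min_k4 (hypothetical helper name).
// `scales` holds the 12 packed bytes from block offset [4, 16); `j` is the sub-block index 0..7.
// Returns (scaleIdx, minIdx), each a 6-bit value in 0..63.
fun scaleMinK4(j: Int, scales: ByteArray, base: Int = 0): Pair<Int, Int> {
    fun u(i: Int) = scales[base + i].toInt() and 0xFF  // read a byte as unsigned
    return if (j < 4) {
        // Sub-blocks 0..3: low 6 bits of bytes 0..3 (scale) and bytes 4..7 (min).
        Pair(u(j) and 63, u(j + 4) and 63)
    } else {
        // Sub-blocks 4..7: low 4 bits come from bytes 8..11, high 2 bits from the
        // top two bits of the scale and min bytes that belong to sub-block j - 4.
        val sc = (u(j + 4) and 0x0F) or ((u(j - 4) ushr 6) shl 4)
        val mn = (u(j + 4) ushr 4) or ((u(j) ushr 6) shl 4)
        Pair(sc, mn)
    }
}
```

With the indices recovered, scale[s] = d · scaleIdx[s] and offset[s] = dMin · minIdx[s] follow directly from the formula above.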
The lazy-dmin trick
A naive implementation subtracts offset from every element. Better:
linearity lets us track two running sums per sub-block:
codeSum[s] = Σ_i input[i] · code[i] (scaled later by scale[s])
inputSum[s] = Σ_i input[i] (scaled later by offset[s])
and combine once per sub-block as acc += scale[s]·codeSum[s] −
offset[s]·inputSum[s]. ggml’s reference uses the same trick.
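The identity is easy to check in scalar code. The two functions below are a minimal sketch (hypothetical names, codes already unpacked to an IntArray for one sub-block): the first dequantizes every element, the second keeps the two running sums and applies scale and offset once.

```kotlin
// Naive form: dequantize each element, then multiply-accumulate.
fun dotSubBlockNaive(input: FloatArray, codes: IntArray, scale: Float, offset: Float): Float {
    var acc = 0f
    for (i in codes.indices) acc += input[i] * (codes[i] * scale - offset)
    return acc
}

// Lazy form: two running sums; scale and offset are applied once per sub-block.
fun dotSubBlockLazy(input: FloatArray, codes: IntArray, scale: Float, offset: Float): Float {
    var codeSum = 0f   // sum of input[i] * code[i]
    var inputSum = 0f  // sum of input[i]
    for (i in codes.indices) {
        codeSum += input[i] * codes[i]
        inputSum += input[i]
    }
    return scale * codeSum - offset * inputSum
}
```

The two agree up to floating-point rounding, since the lazy form only reorders the arithmetic. The lazy form removes one multiply and one subtract per element from the inner loop, at the cost of one extra accumulator.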
The fused lo+hi load
Because the canonical layout puts sub-block 2j lo nibbles and
sub-block 2j+1 hi nibbles in the same 32-byte slab, a single
ByteVector load feeds both sub-block accumulators per chunk:
val byteVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qsRegion + idx)
val loBytes = byteVec.and(0x0F.toByte())
val hiBytes = byteVec.lanewise(VectorOperators.LSHR, 4.toByte())
val codeVecLo = loBytes.castShape(floatSpecies, 0) as FloatVector
val codeVecHi = hiBytes.castShape(floatSpecies, 0) as FloatVector
val inVecLo = FloatVector.fromArray(floatSpecies, input, inputStartLo + idx)
val inVecHi = FloatVector.fromArray(floatSpecies, input, inputStartHi + idx)
codeAccLo = inVecLo.fma(codeVecLo, codeAccLo) // for sub-block 2j
inputAccLo = inVecLo.add(inputAccLo) // dmin-correction sum
codeAccHi = inVecHi.fma(codeVecHi, codeAccHi) // for sub-block 2j+1
inputAccHi = inVecHi.add(inputAccHi)
This halves the number of byte loads compared with the prior helper,
which ran once per nibble pass. The kernel lives at
skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/kernel/PanamaVectorQ4KMatmulKernel.kt
and is exposed via the Q4KMatmulKernel SPI sibling (kernel SPI:
KernelProvider.matmulQ4K(): Q4KMatmulKernel?).
The MemSeg variant (matmulF32Q4_KMemSeg) is the same algorithm with
ByteVector.fromMemorySegment. It has no SPI surface yet; it is just an
inline replacement.
Numbers
QuantizedMatmulBench, JDK 21.0.10, Apple Silicon:
| Shape (inputDim × outputDim) | Time | Throughput |
|---|---|---|
| 1024 × 1024 | 0.07 ms | ~30 GFLOPS |
| 4096 × 1024 | 0.15 ms | ~55 GFLOPS |
| 4096 × 4096 | 0.46 ms | ~73 GFLOPS |
Same throughput regime as the FP32 SIMD kernel: the fused dequant adds essentially zero cost on top of the FMA.
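The throughput figures are consistent with counting two floating-point operations (one multiply, one add) per weight. The helper below is a small sketch of that arithmetic; the formula is an assumption inferred from the numbers, since the benchmark's own accounting is not shown here.

```kotlin
// GFLOPS for one matrix-vector product, counting 2 FLOPs (multiply + add) per weight.
// Assumed formula; the benchmark harness may account for work differently.
fun gflops(inputDim: Int, outputDim: Int, millis: Double): Double {
    val flops = 2.0 * inputDim * outputDim
    return flops / (millis * 1e-3) / 1e9
}
```

For example, gflops(4096, 4096, 0.46) comes out just under 73, matching the last row when rounded.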
Q6_K: 256 elements / 210 bytes / 16 sub-blocks
Block layout:
- bytes [0, 128): ql, the low 4 bits of each 6-bit code (half-interleaved)
- bytes [128, 192): qh, the high 2 bits of each 6-bit code (4 codes per byte)
- bytes [192, 208): 16 signed int8 sub-block scales
- bytes [208, 210): FP16 d (super-block scale)
Per element: 6-bit code = ((ql_nibble) | ((qh_2bits) << 4)) − 32,
dequant = d · sc[sub_block] · code.
Q6_K’s qh is the wrinkle: each qh byte carries the high 2 bits of
four codes, packed at bit positions 0-1, 2-3, 4-5, 6-7. The SIMD
recipe in dequantQ6_KBlock (file
JvmQuantizedVectorKernels.kt):
val ql0Vec = ByteVector.fromArray(byteSpeciesForFloat, weight, qlBase + l)
val ql32Vec = ByteVector.fromArray(byteSpeciesForFloat, weight, qlBase + l + 32)
val qhVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qhBase + l)
val q1Bytes = ql0Vec.and(0x0F.toByte())
.or(qhVec.and(0x03.toByte()).lanewise(LSHL, 4.toByte()))
val q2Bytes = ql32Vec.and(0x0F.toByte())
.or(qhVec.lanewise(LSHR, 2.toByte()).and(0x03.toByte()).lanewise(LSHL, 4.toByte()))
val q3Bytes = ql0Vec.lanewise(LSHR, 4.toByte())
.or(qhVec.lanewise(LSHR, 4.toByte()).and(0x03.toByte()).lanewise(LSHL, 4.toByte()))
val q4Bytes = ql32Vec.lanewise(LSHR, 4.toByte())
.or(qhVec.lanewise(LSHR, 6.toByte()).lanewise(LSHL, 4.toByte()))
Then each q*Bytes is cast to FloatVector, biased by −32, scaled
by d · sc[sub_block], and stored to one of four 32-element regions
of the per-block scratch. The per-cell SIMD dot product (FloatVector.fma)
already existed; this change only replaced the scalar dequant loop.
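For checking the bit manipulation by hand, a scalar counterpart of the four-way unpack is useful. The sketch below handles one lane position; the function name unpackQ6K is hypothetical, and the three arguments correspond to the bytes the SIMD code loads at ql[l], ql[l + 32], and qh[l].

```kotlin
// Scalar counterpart of the four-way Q6_K unpack for one lane position.
// Returns the four signed codes (range -32..31) in the order q1, q2, q3, q4.
fun unpackQ6K(ql0: Byte, ql32: Byte, qh: Byte): IntArray {
    val a = ql0.toInt() and 0xFF   // ql[l]
    val b = ql32.toInt() and 0xFF  // ql[l + 32]
    val h = qh.toInt() and 0xFF    // qh[l]
    val q1 = ((a and 0x0F) or (((h ushr 0) and 3) shl 4)) - 32  // low nibble of ql[l],       qh bits 0-1
    val q2 = ((b and 0x0F) or (((h ushr 2) and 3) shl 4)) - 32  // low nibble of ql[l + 32],  qh bits 2-3
    val q3 = ((a ushr 4) or (((h ushr 4) and 3) shl 4)) - 32    // high nibble of ql[l],      qh bits 4-5
    val q4 = ((b ushr 4) or (((h ushr 6) and 3) shl 4)) - 32    // high nibble of ql[l + 32], qh bits 6-7
    return intArrayOf(q1, q2, q3, q4)
}
```

Each result then gets multiplied by d · sc[sub_block], exactly as in the per-element formula.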
Q4_0: 32 elements / 18 bytes
The simplest layout: a single FP16 scale + 16 packed nibble bytes.
Q4_0 uses the canonical ggml split layout: the low nibbles of bytes
0..15 decode elements 0..15, the high nibbles decode elements 16..31
((nibble - 8) * d). This is the layout real GGUF Q4_0 weights ship in
and what DequantOps.dequantQ4_0FromBytes produces.
As of the first-class promotion, Q4_0 is a full SPI format on par with
Q8_0 / Q4_K: a Q4_0MatmulKernel interface with scalar (commonMain),
Panama Vector, and native FFM implementations, selected via
KernelRegistry.bestAvailable(), plus a Q4_0Quantizer for producing
Q4_0 from dense FP32. The Panama kernel uses the partial-vec pattern
Q4_K used before its fully-fused rewrite: a scalar split-layout unpack
into a 32-element scratch buffer, then a SIMD FMA dot product:
// Stage 1: split-layout unpack into a 32-element scratch FloatArray.
for (j in 0 until 16) {
val b = weight[codesBase + j].toInt() and 0xFF
codeBuf[j] = ((b and 0x0F) - 8).toFloat() // elements 0..15
codeBuf[16 + j] = ((b ushr 4) - 8).toFloat() // elements 16..31
}
// Stage 2: SIMD FMA dot product.
var accVec = FloatVector.zero(floatSpecies)
while (idx < loopBound) {
val iv = FloatVector.fromArray(floatSpecies, input, inputBase + idx)
val cv = FloatVector.fromArray(floatSpecies, codeBuf, idx)
accVec = iv.fma(cv, accVec)
idx += step
}
return (accVec.reduceLanes(ADD) + scalarTail) * d
Q4_0 is rarely the hot path in modern weights (Q4_K_M / Q4_K_S dominate
Gemma 4, Llama, Qwen), so the scratch-then-SIMD shape is a deliberate
balance. A fully-fused ByteVector pipeline is a reasonable follow-up:
the split layout is friendlier than it looks, because lo/hi nibble masks
of a 16-byte ByteVector load yield elements 0..15 and 16..31 directly,
with no lane-interleave shuffle required.
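A scalar model shows why the split layout fuses cleanly. The sketch below is not the project's kernel: the name dotQ4_0BlockFused is hypothetical and d is assumed to be already decoded from FP16. It consumes both nibbles of each byte in the same iteration and needs no scratch buffer.

```kotlin
// Scalar model of a fused Q4_0 block dot product: no scratch buffer, and both nibbles
// of each byte are consumed in the same iteration. Hypothetical helper.
fun dotQ4_0BlockFused(
    input: FloatArray, inputBase: Int,
    weight: ByteArray, codesBase: Int,
    d: Float,
): Float {
    var accLo = 0f  // elements 0..15  (low nibbles)
    var accHi = 0f  // elements 16..31 (high nibbles)
    for (j in 0 until 16) {
        val b = weight[codesBase + j].toInt() and 0xFF
        accLo += input[inputBase + j] * ((b and 0x0F) - 8)
        accHi += input[inputBase + 16 + j] * ((b ushr 4) - 8)
    }
    // The scale is applied once per block, as in the two-stage kernel.
    return (accLo + accHi) * d
}
```

In a vectorized version the two accumulators would become two FloatVector accumulators fed from one byte load, the same shape as the fused Q4_K lo+hi loop.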
Per-format coverage matrix
| Format | SPI sibling? | MemSeg variant SIMD? | Inner loop strategy |
|---|---|---|---|
| Q8_0 | no | yes | Fully fused ( |
| Q4_K | yes ( | yes (inline, same algorithm) | Fully fused (single byte load → lo+hi nibble accumulators, lazy |
| Q6_K | no | n/a | SIMD dequant into scratch + SIMD dot (two-stage) |
| Q4_0 | yes ( | yes | Scalar split-layout unpack into scratch + SIMD dot (two-stage) |
Where to look in the code
| File | What it covers |
|---|---|
| | The Q4_K kernel SPI (commonMain). |
| | Fused-pipeline Q4_K implementation. |
| | All the per-format kernels: Q4_K MemSeg, Q6_K dequant, Q4_0, Q8_0 MemSeg variants. |
| | |
| | JMH harness for Q4_K Panama. |
| | Parity tests vs the reference partial-vec kernel. |
For the kernel SPI itself, see How SIMD Kernels Are Built.