How Quantized SIMD Kernels Are Built

For the broader kernel SPI story (KernelProvider, KernelRegistry, ServiceLoader auto-discovery), see How SIMD Kernels Are Built. This page focuses on the inner loops of the quantized matmul kernels (Q4_0, Q4_K, Q6_K, Q8_0), which is where most of the wall-clock time of an LLM decode goes once the FP32 path is fast.

The general pipeline

For a single output[o] = Σ_j input[j] · dequant(weight[o, j]) cell, the SIMD recipe is:

  1. Load codes. Load W packed integers (W being the lane count of the float species) from the weight buffer as a ByteVector, either via ByteVector.fromArray (heap ByteArray) or ByteVector.fromMemorySegment (mmap’d weights, FFM).

  2. Unpack. For 4-bit codes: byteVec.and(0x0F) for low nibbles, byteVec.lanewise(LSHR, 4) for high nibbles. Sign-correct as needed (Q4_0 subtracts 8; Q4_K subtracts a per-sub-block min lazily; Q6_K combines ql + qh then subtracts 32).

  3. Widen + convert. castShape(floatSpecies, 0) lane-widens the byte vector to a FloatVector in one shape conversion (under the hood: byte → int → float, but JIT’d as a single instruction sequence on most targets).

  4. Apply scale. Multiply by the broadcast block scale (and sub-scale, for K-quants).

  5. Load input. W floats from the FP32 input via FloatVector.fromArray.

  6. FMA. acc = inputVec.fma(weightFloatVec, acc).

  7. Repeat across the block. Reduce once per output cell at the end via acc.reduceLanes(ADD).

The same skeleton drives every quantized kernel; the differences are all in steps 1–4 (block layout, sign convention, scale recovery).
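Step 2 is the part that most often goes wrong when ported, so here is a minimal scalar sketch of the nibble unpack in plain Kotlin. The names unpackNibbles and q4_0Signed are hypothetical and not part of the codebase; the sketch only mirrors what the vector and / LSHR operations do per lane.

```kotlin
// Hypothetical helper, for illustration only: split one packed byte into
// its low and high 4-bit codes (each in 0..15).
fun unpackNibbles(packed: Byte): Pair<Int, Int> {
    val b = packed.toInt() and 0xFF   // Kotlin bytes are signed; mask to 0..255 first
    val lo = b and 0x0F               // same role as byteVec.and(0x0F)
    val hi = b ushr 4                 // same role as byteVec.lanewise(LSHR, 4)
    return lo to hi
}

// Q4_0-style sign correction: shift the unsigned code 0..15 to -8..7.
fun q4_0Signed(code: Int): Int = code - 8
```

Masking with 0xFF before shifting matters in scalar Kotlin because Byte.toInt() sign-extends; without it, a byte such as 0xA3 would yield a negative high nibble.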

The four format pipelines

Q8_0: 32 elements / 34 bytes

Single FP16 scale + 32 signed int8 codes. The simplest case: codes are already signed bytes, no nibble unpack. Pipeline:

val byteVec = ByteVector.fromArray(byteSpeciesForFloat, codes, codesOffset + idx)
val codeVec = byteVec.castShape(floatSpecies, 0) as FloatVector  // sign-extends
accVec = inputVec.mul(codeVec).add(accVec)
// final: (accVec.reduceLanes(ADD) + scalarTail) * scale

Q8_0 is the "gold reference": the cleanest expression of the pipeline. Q8_0 MemSeg (matmulF32Q8_0MemSeg) is the same loop with ByteVector.fromMemorySegment.
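For checking a SIMD kernel against something simple, a scalar reference for one block can look like the sketch below. It assumes the ggml-style block layout (a little-endian FP16 scale in bytes [0, 2) followed by 32 int8 codes); the function names halfToFloat and dotQ8_0Block are hypothetical, and the project's kernels may pass the scale and codes differently.

```kotlin
// Decode an IEEE 754 half-precision value from its 16 raw bits.
fun halfToFloat(bits: Int): Float {
    val sign = if ((bits and 0x8000) != 0) -1f else 1f
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> mant * Math.scalb(1f, -24)                        // zero or subnormal
        31 -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Math.scalb(1f + mant / 1024f, exp - 15)        // normal
    }
    return sign * magnitude
}

// Scalar dot product of one 34-byte Q8_0 block against 32 inputs.
// Assumed layout: bytes [0, 2) FP16 scale (little-endian), bytes [2, 34) int8 codes.
fun dotQ8_0Block(block: ByteArray, blockOffset: Int, input: FloatArray, inputOffset: Int): Float {
    val lo = block[blockOffset].toInt() and 0xFF
    val hi = block[blockOffset + 1].toInt() and 0xFF
    val scale = halfToFloat(lo or (hi shl 8))
    var acc = 0f
    for (i in 0 until 32) {
        // toInt() sign-extends, which is what Q8_0's signed codes need.
        acc += input[inputOffset + i] * block[blockOffset + 2 + i].toInt()
    }
    return acc * scale   // scale applied once, after the accumulation
}
```

As in the SIMD version, the scale is applied once per block after accumulating over the raw codes rather than once per element.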

Q4_K: 256 elements / 144 bytes / 8 sub-blocks

The format that matters most for current LLMs (Gemma 4 Q4_K_M, Llama, Qwen). Block layout (canonical ggml):

  • bytes [0, 2): d (super-block scale, FP16 LE)

  • bytes [2, 4): dMin (super-block min-scale, FP16 LE)

  • bytes [4, 16): 12 bytes packed (scaleIdx, minIdx) for 8 sub-blocks via ggml’s get_scale_min_k4 mixing

  • bytes [16, 144): 128 bytes of 4-bit codes, strided in 4 groups of 32: each byte’s lo nibble belongs to one sub-block, hi nibble to the next sub-block over the same intra-group index.

Per element: dequant = code · scale[s] − offset[s] where scale[s] = d · scaleIdx[s] and offset[s] = dMin · minIdx[s].
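The 12-byte (scaleIdx, minIdx) packing is the least obvious part of the layout. The sketch below is a Kotlin port of get_scale_min_k4 as it appears in ggml's reference C code; the name scaleMinK4 and its signature are hypothetical, and the project's own unpacking code may be organized differently.

```kotlin
// Port of ggml's get_scale_min_k4: recover the 6-bit (scaleIdx, minIdx) pair
// for sub-block j (0..7) from the 12 packed bytes starting at scalesOffset.
fun scaleMinK4(j: Int, block: ByteArray, scalesOffset: Int): Pair<Int, Int> {
    fun q(i: Int) = block[scalesOffset + i].toInt() and 0xFF
    return if (j < 4) {
        // Sub-blocks 0..3: the low 6 bits of bytes j and j + 4.
        (q(j) and 63) to (q(j + 4) and 63)
    } else {
        // Sub-blocks 4..7: low 4 bits come from byte j + 4, the high 2 bits
        // are borrowed from the top of the bytes used by sub-blocks 0..3.
        val scaleIdx = (q(j + 4) and 0x0F) or ((q(j - 4) ushr 6) shl 4)
        val minIdx = (q(j + 4) ushr 4) or ((q(j) ushr 6) shl 4)
        scaleIdx to minIdx
    }
}
```

With the block starting at offset 0, scalesOffset would be 4, and the per-sub-block factors follow as scale[s] = d · scaleIdx and offset[s] = dMin · minIdx.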

The lazy-dmin trick

A naive implementation subtracts offset from every element. Better: linearity lets us track two running sums per sub-block:

codeSum[s] = Σ_i input[i] · code[i]    (scaled later by scale[s])
inputSum[s] = Σ_i input[i]              (scaled later by offset[s])

and combine once per sub-block as acc += scale[s]·codeSum[s] − offset[s]·inputSum[s]. ggml’s reference uses the same trick.
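The identity behind the trick can be checked with a small scalar comparison. Both functions below are hypothetical illustrations for a single sub-block, not code from the kernel.

```kotlin
// Naive form: subtract the offset from every dequantized element.
fun dotNaive(input: FloatArray, codes: IntArray, scale: Float, offset: Float): Float {
    var acc = 0f
    for (i in input.indices) acc += input[i] * (codes[i] * scale - offset)
    return acc
}

// Lazy form: keep two running sums and apply scale/offset once at the end.
fun dotLazy(input: FloatArray, codes: IntArray, scale: Float, offset: Float): Float {
    var codeSum = 0f    // Σ input[i] · code[i]
    var inputSum = 0f   // Σ input[i]
    for (i in input.indices) {
        codeSum += input[i] * codes[i]
        inputSum += input[i]
    }
    return scale * codeSum - offset * inputSum
}
```

The two forms are algebraically equal; in floating point they can differ in the last bits because the operations are reordered, so parity tests against a naive reference generally need a tolerance.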

The fused lo+hi load

Because the canonical layout puts sub-block 2j lo nibbles and sub-block 2j+1 hi nibbles in the same 32-byte slab, a single ByteVector load feeds both sub-block accumulators per chunk:

val byteVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qsRegion + idx)
val loBytes = byteVec.and(0x0F.toByte())
val hiBytes = byteVec.lanewise(VectorOperators.LSHR, 4.toByte())
val codeVecLo = loBytes.castShape(floatSpecies, 0) as FloatVector
val codeVecHi = hiBytes.castShape(floatSpecies, 0) as FloatVector
val inVecLo = FloatVector.fromArray(floatSpecies, input, inputStartLo + idx)
val inVecHi = FloatVector.fromArray(floatSpecies, input, inputStartHi + idx)
codeAccLo = inVecLo.fma(codeVecLo, codeAccLo)   // for sub-block 2j
inputAccLo = inVecLo.add(inputAccLo)            // dmin-correction sum
codeAccHi = inVecHi.fma(codeVecHi, codeAccHi)   // for sub-block 2j+1
inputAccHi = inVecHi.add(inputAccHi)

This halves the number of byte loads vs the prior helper that ran once per nibble pass. Lives at skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/kernel/PanamaVectorQ4KMatmulKernel.kt, exposed via the Q4KMatmulKernel SPI sibling (kernel SPI: KernelProvider.matmulQ4K(): Q4KMatmulKernel?).

The MemSeg variant (matmulF32Q4_KMemSeg) is the same algorithm with ByteVector.fromMemorySegment; there is no SPI surface yet, just an inline replacement.
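A scalar equivalent of the fused lo+hi loop makes the data flow easier to follow. This is a sketch under the layout described above, with hypothetical names (SubBlockSums, accumulateSlab); it is not the Panama kernel itself.

```kotlin
// Running sums for one sub-block: Σ input·code and Σ input.
class SubBlockSums(var codeSum: Float = 0f, var inputSum: Float = 0f)

// Scalar equivalent of the fused loop: each of the 32 bytes in a slab is read
// once and feeds sub-block 2j (low nibble) and sub-block 2j+1 (high nibble).
fun accumulateSlab(
    weight: ByteArray, slabOffset: Int,
    input: FloatArray, inputStartLo: Int, inputStartHi: Int,
    lo: SubBlockSums, hi: SubBlockSums
) {
    for (l in 0 until 32) {
        val b = weight[slabOffset + l].toInt() and 0xFF   // single byte load
        val inLo = input[inputStartLo + l]
        val inHi = input[inputStartHi + l]
        lo.codeSum += inLo * (b and 0x0F)   // code for sub-block 2j
        lo.inputSum += inLo                 // dmin-correction sum
        hi.codeSum += inHi * (b ushr 4)     // code for sub-block 2j+1
        hi.inputSum += inHi
    }
}
```

After the slab is processed, each sub-block contributes scale[s]·codeSum − offset[s]·inputSum to the output cell, as in the lazy-dmin formula.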

Numbers

QuantizedMatmulBench, JDK 21.0.10, Apple Silicon:

Shape (inputDim × outputDim) | Time | Throughput
1024 × 1024 | 0.07 ms | ~30 GFLOPS
4096 × 1024 | 0.15 ms | ~55 GFLOPS
4096 × 4096 | 0.46 ms | ~73 GFLOPS

Same throughput regime as the FP32 SIMD kernel: the fused dequant adds essentially zero cost on top of the FMA.
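The throughput column is consistent with counting two floating-point operations per multiply-add. That counting convention is an assumption here (the benchmark's own formula is not shown on this page), and gflops is a hypothetical helper:

```kotlin
// Assumption: one multiply-add counts as 2 floating-point operations,
// so a matrix-vector product costs 2 * inputDim * outputDim FLOPs.
fun gflops(inputDim: Int, outputDim: Int, timeMs: Double): Double {
    val flops = 2.0 * inputDim * outputDim
    return flops / (timeMs * 1e-3) / 1e9
}
```

Under that assumption, gflops(4096, 4096, 0.46) comes to roughly 73 and gflops(1024, 1024, 0.07) to roughly 30, matching the reported figures.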

Q6_K: 256 elements / 210 bytes / 16 sub-blocks

Block layout:

  • bytes [0, 128): ql, the low 4 bits of each 6-bit code (half-interleaved)

  • bytes [128, 192): qh, the high 2 bits of each 6-bit code (4 codes per byte)

  • bytes [192, 208): 16 signed int8 sub-block scales

  • bytes [208, 210): FP16 d (super-block scale)

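Per element: 6-bit code = ((ql_nibble) | ((qh_2bits) << 4)) − 32, dequant = d · sc[sub_block] · code. The outer parentheses matter: the two bit fields are combined first, and only then is the bias of 32 subtracted.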

Q6_K’s qh is the wrinkle: each qh byte carries the high 2 bits of four codes, packed at bit positions 0–1, 2–3, 4–5, 6–7. The SIMD recipe in dequantQ6_KBlock (file JvmQuantizedVectorKernels.kt):

val ql0Vec = ByteVector.fromArray(byteSpeciesForFloat, weight, qlBase + l)
val ql32Vec = ByteVector.fromArray(byteSpeciesForFloat, weight, qlBase + l + 32)
val qhVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qhBase + l)

val q1Bytes = ql0Vec.and(0x0F.toByte())
    .or(qhVec.and(0x03.toByte()).lanewise(LSHL, 4.toByte()))
val q2Bytes = ql32Vec.and(0x0F.toByte())
    .or(qhVec.lanewise(LSHR, 2.toByte()).and(0x03.toByte()).lanewise(LSHL, 4.toByte()))
val q3Bytes = ql0Vec.lanewise(LSHR, 4.toByte())
    .or(qhVec.lanewise(LSHR, 4.toByte()).and(0x03.toByte()).lanewise(LSHL, 4.toByte()))
val q4Bytes = ql32Vec.lanewise(LSHR, 4.toByte())
    .or(qhVec.lanewise(LSHR, 6.toByte()).lanewise(LSHL, 4.toByte()))

Then each q*Bytes is cast to FloatVector, biased by −32, scaled by d · sc[sub_block], and stored to one of four 32-element regions of the per-block scratch. The per-cell SIMD dot product (FloatVector.fma) already existed; this change only replaced the scalar dequant loop.
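The bit manipulation is easier to verify one lane at a time. The scalar sketch below mirrors q1Bytes..q4Bytes for a single index l and applies the −32 bias; q6kCodes is a hypothetical name used only for illustration.

```kotlin
// Scalar counterpart of q1Bytes..q4Bytes for a single lane l: rebuild the four
// signed 6-bit codes that share ql[l], ql[l + 32] and qh[l].
fun q6kCodes(ql0: Byte, ql32: Byte, qh: Byte): IntArray {
    val a = ql0.toInt() and 0xFF
    val b = ql32.toInt() and 0xFF
    val h = qh.toInt() and 0xFF
    val q1 = ((a and 0x0F) or ((h and 0x03) shl 4)) - 32           // qh bits 0-1
    val q2 = ((b and 0x0F) or (((h ushr 2) and 0x03) shl 4)) - 32  // qh bits 2-3
    val q3 = ((a ushr 4) or (((h ushr 4) and 0x03) shl 4)) - 32    // qh bits 4-5
    val q4 = ((b ushr 4) or ((h ushr 6) shl 4)) - 32               // qh bits 6-7
    return intArrayOf(q1, q2, q3, q4)
}
```

For example, ql0 = 0xAB, ql32 = 0xCD and qh = 0xE4 give the codes -21, -3, 10 and 28, which land in the four 32-element scratch regions at index l.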

Q4_0: 32 elements / 18 bytes

The simplest layout: a single FP16 scale + 16 packed nibble bytes. Q4_0 uses the canonical ggml split layout: the low nibbles of bytes 0..15 decode elements 0..15, and the high nibbles decode elements 16..31, each as (nibble - 8) * d. This is the layout real GGUF Q4_0 weights ship in and what DequantOps.dequantQ4_0FromBytes produces.

As of the first-class promotion, Q4_0 is a full SPI format on par with Q8_0 / Q4_K: a Q4_0MatmulKernel interface with scalar (commonMain), Panama Vector, and native FFM implementations, selected via KernelRegistry.bestAvailable(), plus a Q4_0Quantizer for producing Q4_0 from dense FP32. The Panama kernel uses the partial-vec pattern Q4_K used before its fully-fused rewrite: a scalar split-layout unpack into a 32-element scratch buffer, then a SIMD FMA dot product:

// Stage 1: split-layout unpack into a 32-element scratch FloatArray.
for (j in 0 until 16) {
    val b = weight[codesBase + j].toInt() and 0xFF
    codeBuf[j] = ((b and 0x0F) - 8).toFloat()       // elements 0..15
    codeBuf[16 + j] = ((b ushr 4) - 8).toFloat()    // elements 16..31
}
// Stage 2: SIMD FMA dot product.
var accVec = FloatVector.zero(floatSpecies)
while (idx < loopBound) {
    val iv = FloatVector.fromArray(floatSpecies, input, inputBase + idx)
    val cv = FloatVector.fromArray(floatSpecies, codeBuf, idx)
    accVec = iv.fma(cv, accVec)
    idx += step
}
return (accVec.reduceLanes(ADD) + scalarTail) * d

Q4_0 is rarely the hot path in modern weights (Q4_K_M / Q4_K_S dominate Gemma 4, Llama, Qwen), so the scratch-then-SIMD shape is a deliberate balance. A fully-fused ByteVector pipeline is a reasonable follow-up: the split layout is friendlier than it looks, since lo/hi nibble masks of a 16-byte ByteVector load yield elements 0..15 and 16..31 directly, with no lane-interleave shuffle required.
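To illustrate why the split layout needs no shuffle, here is a scratch-free scalar stand-in for the fused follow-up: the low and high nibbles feed two independent accumulators directly. It is a sketch only, with a hypothetical name (dotQ4_0BlockFused), and is not the existing Panama kernel.

```kotlin
// Scratch-free scalar dot product for one Q4_0 block in the split layout:
// byte j's low nibble is element j, its high nibble is element 16 + j.
fun dotQ4_0BlockFused(
    weight: ByteArray, codesBase: Int,
    input: FloatArray, inputBase: Int,
    d: Float
): Float {
    var accLo = 0f   // elements 0..15
    var accHi = 0f   // elements 16..31
    for (j in 0 until 16) {
        val b = weight[codesBase + j].toInt() and 0xFF
        accLo += input[inputBase + j] * ((b and 0x0F) - 8)
        accHi += input[inputBase + 16 + j] * ((b ushr 4) - 8)
    }
    return (accLo + accHi) * d   // block scale applied once at the end
}
```

A vectorized version would follow the same shape: one 16-byte load, two masks, and two input loads at inputBase and inputBase + 16.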

Per-format coverage matrix

Format | SPI sibling? | MemSeg variant SIMD? | Inner loop strategy
Q8_0 | no | yes | Fully fused (ByteVector.castShape + scaled FMA)
Q4_K | yes (Q4KMatmulKernel) | yes (inline, same algorithm) | Fully fused (single byte load → lo+hi nibble accumulators, lazy dmin)
Q6_K | no | n/a | SIMD dequant into scratch + SIMD dot (two-stage)
Q4_0 | yes (Q4_0MatmulKernel) | yes | Scalar split-layout unpack into scratch + SIMD dot (two-stage)

Where to look in the code

File | What it covers
skainet-backends/skainet-backend-api/…/kernel/Q4KMatmulKernel.kt | The Q4_K kernel SPI (commonMain).
skainet-backends/skainet-backend-cpu/src/jvmMain/…/kernel/PanamaVectorQ4KMatmulKernel.kt | Fused-pipeline Q4_K implementation.
skainet-backends/skainet-backend-cpu/src/jvmMain/…/tensor/ops/JvmQuantizedVectorKernels.kt | All the per-format kernels: Q4_K MemSeg, Q6_K dequant, Q4_0, Q8_0 MemSeg variants.
skainet-backends/skainet-backend-cpu/src/jvmMain/…/tensor/ops/DefaultCpuOpsJvm.kt | chooseQuantizedMatmul: production dispatch by tensor data type.
skainet-backends/benchmarks/jvm-cpu-jmh/src/jmh/kotlin/sk/ainet/bench/QuantizedMatmulBench.kt | JMH harness for Q4_K Panama.
skainet-backends/skainet-backend-cpu/src/jvmTest/…/kernel/PanamaVectorQ4KMatmulKernelTest.kt | Parity tests vs the reference partial-vec kernel.

For the kernel SPI itself, see How SIMD Kernels Are Built.