QuantizedMatmul
Optimized matrix multiplication for quantized weight formats (Q8_0, Q4_K).
Direct quantized matmul avoids full dequantization to FP32, reducing memory bandwidth and improving cache efficiency. The computation fuses dequantization with the dot product:
Q8_0: output_i = sum_j(input_j * code_j) * scale
Q4_K: output_i = sum_j(input_j * code_j) * scale + sum_of_mins
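As a concrete illustration of the fused Q8_0 path, here is a minimal Python sketch. The 32-element block with one scale per block follows the Q8_0 layout; the function names `quantize_q8_0` and `dot_q8_0` are hypothetical, not this module's API:

```python
BLOCK = 32  # Q8_0 groups weights into blocks of 32 sharing one float scale

def quantize_q8_0(weights):
    """Quantize a float vector into Q8_0-style (scale, int8 codes) blocks."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 0.0
        codes = [int(round(v / scale)) if scale else 0 for v in chunk]
        blocks.append((scale, codes))
    return blocks

def dot_q8_0(inp, blocks):
    """Fused dequant + dot product: accumulate input * code within each
    block, then apply the block scale once. No FP32 weight tensor is built."""
    total = 0.0
    for b, (scale, codes) in enumerate(blocks):
        acc = sum(inp[b * BLOCK + j] * codes[j] for j in range(len(codes)))
        total += acc * scale
    return total
```

The per-block accumulation is where the bandwidth saving comes from: the kernel reads one byte per weight plus one scale per 32 weights, instead of four bytes per FP32 weight.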
These kernels match the results of full dequantization followed by FP32 matmul to within tight numerical tolerance (typically ≤1e-4 relative error).
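The tolerance claim can be checked directly: the fused path and the dequantize-then-matmul path consume the same quantized codes, so they differ only in floating-point rounding order. A minimal sketch (toy 4-element blocks for brevity; the block contents are made up, real Q8_0 blocks hold 32 codes):

```python
# Toy quantized blocks: (scale, int8 codes).
blocks = [(0.02, [100, -50, 25, -125]), (0.01, [7, 64, -96, 33])]
inp = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.125, -2.0]

def dot_fused(inp, blocks, block=4):
    """Quantized path: per-block accumulation, one scale applied per block."""
    return sum(scale * sum(inp[b * block + j] * c for j, c in enumerate(codes))
               for b, (scale, codes) in enumerate(blocks))

def dot_dequant(inp, blocks):
    """Reference path: fully dequantize to floats, then a plain dot product."""
    weights = [scale * c for scale, codes in blocks for c in codes]
    return sum(a * b for a, b in zip(inp, weights))

# rel_err is effectively zero here: both paths compute the same products,
# only the order in which scales are applied differs.
rel_err = abs(dot_fused(inp, blocks) - dot_dequant(inp, blocks)) \
    / abs(dot_dequant(inp, blocks))
```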
Functions
Check if a tensor's underlying data is Q4_K quantized.
Check if a tensor's underlying data is Q8_0 quantized.
Check if a tensor's underlying data is any quantized format we support.
Perform matmul with automatic dispatch on the weight type: uses the optimized quantized path when the weights are in a supported quantized format, otherwise falls back to standard FP32 matmul.
Matrix multiplication with Q4_K quantized weights.
Matrix multiplication with Q8_0 quantized weights.
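A sketch of how the automatic dispatch and the two quantized kernels might fit together. The tagged-tuple tensor representation and all helper names here are hypothetical (a real tensor would carry a dtype tag); the Q4_K kernel's min term follows the sign convention of the formula above:

```python
def is_q8_0(t):       # t is a hypothetical ("dtype", data) pair
    return t[0] == "q8_0"

def is_q4_k(t):
    return t[0] == "q4_k"

def is_quantized(t):
    return is_q8_0(t) or is_q4_k(t)

def matmul_f32(inp, rows):
    """Fallback: rows of FP32 weights, plain dot product per output row."""
    return [sum(a * b for a, b in zip(inp, row)) for row in rows]

def matmul_q8_0(inp, rows):
    """Per output row: list of (scale, codes) blocks, fused dequant + dot."""
    out = []
    for blocks in rows:
        acc, off = 0.0, 0
        for scale, codes in blocks:
            acc += scale * sum(inp[off + j] * c for j, c in enumerate(codes))
            off += len(codes)
        out.append(acc)
    return out

def matmul_q4_k(inp, rows):
    """Per output row: (scale, min, codes) blocks; Q4_K adds a min term
    computed from the plain sum of the inputs in each block."""
    out = []
    for blocks in rows:
        acc, off = 0.0, 0
        for scale, min_, codes in blocks:
            acc += scale * sum(inp[off + j] * c for j, c in enumerate(codes))
            acc += min_ * sum(inp[off:off + len(codes)])
            off += len(codes)
        out.append(acc)
    return out

def matmul(inp, weights):
    """Dispatch: quantized fast path when supported, FP32 fallback otherwise."""
    if is_q8_0(weights):
        return matmul_q8_0(inp, weights[1])
    if is_q4_k(weights):
        return matmul_q4_k(inp, weights[1])
    return matmul_f32(inp, weights[1])
```

The dispatch keeps call sites uniform: callers invoke `matmul` regardless of how the weights are stored, and the storage check happens once per call rather than once per element.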