QuantizedMatmul
Optimized matrix multiplication for quantized weight formats (Q8_0, Q4_K).
Direct quantized matmul avoids full dequantization to FP32, reducing memory bandwidth and improving cache efficiency. The computation fuses dequantization with the dot product:
Q8_0: output_i = sum_j(input_j * code_j) * scale
Q4_K: output_i = sum_j(input_j * code_j) * scale + sum_of_mins
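As a concrete illustration of the fused Q8_0 path, here is a minimal Python sketch. The 32-element block with one scale per block follows the Q8_0 layout; the function names `quantize_q8_0` and `dot_q8_0` are hypothetical, not this module's API:

```python
BLOCK = 32  # Q8_0 groups weights into blocks of 32 sharing one float scale

def quantize_q8_0(weights):
    """Quantize a float vector into Q8_0-style (scale, int8 codes) blocks."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 0.0
        codes = [int(round(v / scale)) if scale else 0 for v in chunk]
        blocks.append((scale, codes))
    return blocks

def dot_q8_0(inp, blocks):
    """Fused dequant + dot product: accumulate input * code within each
    block, then apply the block scale once. No FP32 weight tensor is built."""
    total = 0.0
    for b, (scale, codes) in enumerate(blocks):
        acc = sum(inp[b * BLOCK + j] * codes[j] for j in range(len(codes)))
        total += acc * scale
    return total
```

The per-block accumulation is where the bandwidth saving comes from: the kernel reads one byte per weight plus one scale per 32 weights, instead of four bytes per FP32 weight.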
These kernels match the results of full dequantization followed by FP32 matmul to within tight numerical tolerance (typically ≤1e-4 relative error).
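The tolerance claim can be checked directly: the fused path and the dequantize-then-matmul path consume the same quantized codes, so they differ only in floating-point rounding order. A minimal sketch (toy 4-element blocks for brevity; the block contents are made up, real Q8_0 blocks hold 32 codes):

```python
# Toy quantized blocks: (scale, int8 codes).
blocks = [(0.02, [100, -50, 25, -125]), (0.01, [7, 64, -96, 33])]
inp = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.125, -2.0]

def dot_fused(inp, blocks, block=4):
    """Quantized path: per-block accumulation, one scale applied per block."""
    return sum(scale * sum(inp[b * block + j] * c for j, c in enumerate(codes))
               for b, (scale, codes) in enumerate(blocks))

def dot_dequant(inp, blocks):
    """Reference path: fully dequantize to floats, then a plain dot product."""
    weights = [scale * c for scale, codes in blocks for c in codes]
    return sum(a * b for a, b in zip(inp, weights))

# rel_err is effectively zero here: both paths compute the same products,
# only the order in which scales are applied differs.
rel_err = abs(dot_fused(inp, blocks) - dot_dequant(inp, blocks)) \
    / abs(dot_dequant(inp, blocks))
```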
Functions
Check if a tensor's underlying data is Q4_K quantized.
Check if a tensor's underlying data is Q8_0 quantized.
Check if a tensor's underlying data is any quantized format we support.
Perform matmul with automatic dispatch on the weight type: uses the optimized quantized path when the weights are in a supported quantized format, otherwise falls back to standard FP32 matmul.
Matrix multiplication with Q4_K quantized weights.
Matrix multiplication with Q8_0 quantized weights.
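A sketch of how the automatic dispatch and the two quantized kernels might fit together. The tagged-tuple tensor representation and all helper names here are hypothetical (a real tensor would carry a dtype tag); the Q4_K kernel's min term follows the sign convention of the formula above:

```python
def is_q8_0(t):       # t is a hypothetical ("dtype", data) pair
    return t[0] == "q8_0"

def is_q4_k(t):
    return t[0] == "q4_k"

def is_quantized(t):
    return is_q8_0(t) or is_q4_k(t)

def matmul_f32(inp, rows):
    """Fallback: rows of FP32 weights, plain dot product per output row."""
    return [sum(a * b for a, b in zip(inp, row)) for row in rows]

def matmul_q8_0(inp, rows):
    """Per output row: list of (scale, codes) blocks, fused dequant + dot."""
    out = []
    for blocks in rows:
        acc, off = 0.0, 0
        for scale, codes in blocks:
            acc += scale * sum(inp[off + j] * c for j, c in enumerate(codes))
            off += len(codes)
        out.append(acc)
    return out

def matmul_q4_k(inp, rows):
    """Per output row: (scale, min, codes) blocks; Q4_K adds a min term
    computed from the plain sum of the inputs in each block."""
    out = []
    for blocks in rows:
        acc, off = 0.0, 0
        for scale, min_, codes in blocks:
            acc += scale * sum(inp[off + j] * c for j, c in enumerate(codes))
            acc += min_ * sum(inp[off:off + len(codes)])
            off += len(codes)
        out.append(acc)
    return out

def matmul(inp, weights):
    """Dispatch: quantized fast path when supported, FP32 fallback otherwise."""
    if is_q8_0(weights):
        return matmul_q8_0(inp, weights[1])
    if is_q4_k(weights):
        return matmul_q4_k(inp, weights[1])
    return matmul_f32(inp, weights[1])
```

The dispatch keeps call sites uniform: callers invoke `matmul` regardless of how the weights are stored, and the storage check happens once per call rather than once per element.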