# TurboQuantUsage
TurboQuant integration guide for skainet-transformers.
TurboQuant compresses the KV cache at runtime — no model retraining or weight re-quantization needed. Any model (LLaMA, Mistral, Gemma, Qwen, etc.) benefits immediately.
## What TurboQuant does
During autoregressive inference, the KV cache grows linearly with sequence length and dominates memory usage. TurboQuant compresses K/V projections on write and decompresses on read:
- **4-bit (balanced):** ~8x compression vs FP32
- **3-bit (experimental-max):** ~10x compression
- **safe-lowbit:** Q8_0 keys + 4-bit values (conservative)
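To make the ratios above concrete, here is a back-of-the-envelope sizing sketch for a hypothetical 32-layer, 32-head, 128-dim, 4096-token configuration. The `kvCacheBytes` helper is illustrative only (not part of the library) and ignores the small per-block scale overhead real quantization formats carry:

```kotlin
// Illustrative KV cache sizing: 2x accounts for keys AND values.
fun kvCacheBytes(numLayers: Int, numHeads: Int, headDim: Int, seqLen: Int, bitsPerElement: Double): Long {
    val elements = 2L * numLayers * numHeads * headDim * seqLen
    return (elements * bitsPerElement / 8).toLong()
}

fun main() {
    val fp32 = kvCacheBytes(32, 32, 128, 4096, 32.0)
    val tq4 = kvCacheBytes(32, 32, 128, 4096, 4.0)
    println("FP32 cache: ${fp32 / 1024 / 1024} MB")   // 4096 MB
    println("4-bit cache: ${tq4 / 1024 / 1024} MB")   // 512 MB
    println("Ratio: ${fp32 / tq4}x")                  // 8x (before scale overhead)
}
```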
## Quick start

### 1. One-line cache creation
```kotlin
// Replace your existing KV cache with TurboQuant:
val cache = KvCacheStore.turboQuant(
    preset = "balanced",
    numLayers = 32,
    numHeads = 32,
    headDim = 128,
    maxSeqLen = 4096
)
```

### 2. Use in attention layer
```kotlin
class MultiHeadAttention(
    val numHeads: Int,
    val headDim: Int,
    val cache: KvCacheStore
) {
    private val bridge = CompressedKvAttention(cache)

    fun forward(query: FloatArray, key: FloatArray, value: FloatArray, layer: Int): FloatArray {
        // Store K/V (compressed automatically)
        bridge.storeKeyValue(layer, key, value)

        // Read for attention (decompressed automatically)
        val cachedKeys = bridge.loadKeysForAttention(layer)
        val cachedValues = bridge.loadValuesForAttention(layer)

        // Pass to scaledDotProductAttention as usual
        return computeAttention(query, cachedKeys, cachedValues)
    }
}
```

### 3. Annotate layers (optional)
```kotlin
@KvCache(preset = "balanced")
class SelfAttention(...) { ... }

// Resolve at model init:
val cache = KvCacheAnnotationResolver.resolve(
    preset = "balanced",
    numLayers = config.numLayers,
    numHeads = config.numKVHeads,
    headDim = config.headDim,
    maxSeqLen = config.maxSeqLen
)
```

### 4. Monitor compression
```kotlin
val report = cache.memoryReport()
println("Compression: ${report.compressionRatio}x")
println("KV cache: ${report.totalPhysicalBytes / 1024 / 1024} MB")
println("Utilization: ${(report.utilizationRatio * 100).toInt()}%")
```

## Preset selection guide
| Preset | Key bits | Value bits | Compression | Quality | Use case |
|---|---|---|---|---|---|
| safe-lowbit | 8 (Q8_0) | 4 (TQ) | ~4-6x | Best | Production, quality-sensitive |
| balanced | 4 (TQ) | 4 (TQ) | ~8x | Good | General purpose, long context |
| experimental-max | 3 (TQ) | 3 (TQ) | ~10x | Fair | Memory-constrained, very long context |
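One way to apply the table programmatically is to pick the highest-quality preset that fits a memory budget. The `choosePreset` helper below is a sketch, not a library API; the preset strings match those accepted by `KvCacheStore.turboQuant`, and the thresholds use the conservative end of each compression range:

```kotlin
// Pick the best-quality preset whose compressed cache fits the budget.
// Thresholds assume ~4x for safe-lowbit and ~8x for balanced (worst case).
fun choosePreset(fp32CacheBytes: Long, budgetBytes: Long): String = when {
    budgetBytes >= fp32CacheBytes / 4 -> "safe-lowbit"      // best quality
    budgetBytes >= fp32CacheBytes / 8 -> "balanced"         // good quality
    else -> "experimental-max"                              // fair quality
}

fun main() {
    val fp32 = 4L * 1024 * 1024 * 1024  // a 4 GB FP32 KV cache
    println(choosePreset(fp32, 1024L * 1024 * 1024))  // safe-lowbit
    println(choosePreset(fp32, 600L * 1024 * 1024))   // balanced
    println(choosePreset(fp32, 300L * 1024 * 1024))   // experimental-max
}
```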
## Model compatibility
TurboQuant works with any model regardless of:
- Weight quantization format (GGUF Q4_K, Q8_0, FP16, etc.)
- Architecture (LLaMA, Mistral, Gemma, Qwen, BERT)
- Model size (1B to 70B+)
- Age (works with older models too)
The model weights are unchanged — only the KV cache storage is compressed.
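Architecture still affects how much there is to compress: grouped-query-attention models (e.g. Mistral-7B with 8 KV heads vs LLaMA-2-7B's 32) already shrink the cache, and that saving multiplies with TurboQuant's. A rough per-token sketch, with `kvBytesPerToken` as an illustrative helper that ignores quantization scale overhead:

```kotlin
// Per-token KV footprint: 2x for keys and values, sized by KV heads.
fun kvBytesPerToken(layers: Int, kvHeads: Int, headDim: Int, bits: Double): Long =
    (2L * layers * kvHeads * headDim * bits / 8).toLong()

fun main() {
    // LLaMA-2-7B: 32 layers, 32 KV heads. Mistral-7B: 32 layers, 8 KV heads (GQA).
    println(kvBytesPerToken(32, 32, 128, 32.0) / 1024)  // FP32 LLaMA-2-7B: 1024 KB/token
    println(kvBytesPerToken(32, 8, 128, 4.0) / 1024)    // 4-bit Mistral-7B: 32 KB/token
}
```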