TurboQuantUsage

TurboQuant integration guide for skainet-transformers.

TurboQuant compresses the KV cache at runtime — no model retraining or weight re-quantization needed. Any model (LLaMA, Mistral, Gemma, Qwen, etc.) benefits immediately.

What TurboQuant does

During autoregressive inference, the KV cache grows linearly with sequence length and dominates memory usage. TurboQuant compresses K/V projections on write and decompresses on read:

  • 4-bit (balanced): ~8x compression vs FP32

  • 3-bit (experimental-max): ~10x compression

  • safe-lowbit: Q8_0 keys + 4-bit values (conservative)
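These ratios follow directly from the cache's shape. Below is a back-of-envelope sizing sketch; kvCacheBytes is a local helper written for this guide, not a library API, and real codecs also store per-block scales, which is why the ratios above are approximate:

```kotlin
// Bytes needed for a KV cache: keys and values each hold
// numLayers * numHeads * headDim * seqLen elements.
// For GQA models, pass the KV head count, not the query head count.
fun kvCacheBytes(numLayers: Int, numHeads: Int, headDim: Int, seqLen: Int, bitsPerElem: Double): Long {
    val elements = 2L * numLayers * numHeads * headDim * seqLen
    return (elements * bitsPerElem / 8.0).toLong()
}

fun main() {
    // A LLaMA-7B-like shape: 32 layers, 32 heads, head dim 128, 4096 tokens.
    val fp32 = kvCacheBytes(32, 32, 128, 4096, 32.0)
    val q4 = kvCacheBytes(32, 32, 128, 4096, 4.0)
    println("FP32:  ${fp32 / (1024 * 1024)} MiB")  // 4096 MiB
    println("4-bit: ${q4 / (1024 * 1024)} MiB")    // 512 MiB
    println("ratio: ${fp32.toDouble() / q4}")      // 8.0 before per-block scale overhead
}
```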

Quick start

1. One-line cache creation

// Replace your existing KV cache with TurboQuant:
val cache = KvCacheStore.turboQuant(
    preset = "balanced",
    numLayers = 32,
    numHeads = 32,
    headDim = 128,
    maxSeqLen = 4096
)

2. Use in attention layer

class MultiHeadAttention(
    val numHeads: Int,
    val headDim: Int,
    val cache: KvCacheStore
) {
    private val bridge = CompressedKvAttention(cache)

    fun forward(query: FloatArray, key: FloatArray, value: FloatArray, layer: Int): FloatArray {
        // Store K/V (compressed automatically)
        bridge.storeKeyValue(layer, key, value)

        // Read for attention (decompressed automatically)
        val cachedKeys = bridge.loadKeysForAttention(layer)
        val cachedValues = bridge.loadValuesForAttention(layer)

        // Pass to scaledDotProductAttention as usual
        return computeAttention(query, cachedKeys, cachedValues)
    }
}

3. Annotate layers (optional)

@KvCache(preset = "balanced")
class SelfAttention(...) { ... }

// Resolve at model init:
val cache = KvCacheAnnotationResolver.resolve(
    preset = "balanced",
    numLayers = config.numLayers,
    numHeads = config.numKVHeads,
    headDim = config.headDim,
    maxSeqLen = config.maxSeqLen
)

4. Monitor compression

val report = cache.memoryReport()
println("Compression: ${report.compressionRatio}x")
println("KV cache: ${report.totalPhysicalBytes / 1024 / 1024} MB")
println("Utilization: ${(report.utilizationRatio * 100).toInt()}%")

Preset selection guide

Preset            Key bits   Value bits   Compression   Quality   Use case
safe-lowbit       8 (Q8_0)   4 (TQ)       ~4-6x         Best      Production, quality-sensitive
balanced          4 (TQ)     4 (TQ)       ~8x           Good      General purpose, long context
experimental-max  3 (TQ)     3 (TQ)       ~10x          Fair      Memory-constrained, very long context
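The rule of thumb in the table can be written down as a tiny chooser. The preset strings are the ones this guide defines; the helper itself is illustrative, not a library API:

```kotlin
// Pick a preset name from the two questions the table answers.
fun choosePreset(qualitySensitive: Boolean, memoryConstrained: Boolean): String = when {
    qualitySensitive -> "safe-lowbit"        // Q8_0 keys keep attention scores accurate
    memoryConstrained -> "experimental-max"  // 3-bit K/V, ~10x compression
    else -> "balanced"                       // 4-bit K/V, ~8x compression
}
```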

Model compatibility

TurboQuant works with any model regardless of:

  • Weight quantization format (GGUF Q4_K, Q8_0, FP16, etc.)

  • Architecture (LLaMA, Mistral, Gemma, Qwen, BERT)

  • Model size (1B to 70B+)

  • Age (works with older models too)

The model weights are unchanged — only the KV cache storage is compressed.

Functions

Example: Asymmetric K/V compression (8-bit keys, 4-bit values).
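The idea behind asymmetric compression — fine-grained keys, coarse values — can be illustrated with a minimal round-to-nearest quantizer. quantize/dequantize here are standalone sketches, not the library's codec: the real Q8_0/TQ formats work block-wise and pack two 4-bit codes per byte, while this sketch stores one code per byte for clarity.

```kotlin
import kotlin.math.abs
import kotlin.math.round

// Minimal symmetric per-tensor quantizer: absmax scaling, round to nearest.
// Returns the quantized codes plus the scale needed to reconstruct.
fun quantize(src: FloatArray, bits: Int): Pair<ByteArray, Float> {
    val levels = (1 shl (bits - 1)) - 1  // 127 for 8-bit, 7 for 4-bit
    val absMax = src.maxOf { abs(it) }.coerceAtLeast(1e-12f)
    val scale = absMax / levels
    val q = ByteArray(src.size) { i ->
        round(src[i] / scale).toInt().coerceIn(-levels, levels).toByte()
    }
    return q to scale
}

fun dequantize(q: ByteArray, scale: Float): FloatArray =
    FloatArray(q.size) { i -> q[i] * scale }

// Asymmetric split: spend more bits on keys (they drive attention scores),
// fewer on values. The key grid is finer, so its reconstruction error is smaller.
val keys = floatArrayOf(0.8f, -1.6f, 3.2f, 0.05f)
val values = floatArrayOf(0.4f, -0.9f, 1.1f, 0.2f)
val (qKeys, keyScale) = quantize(keys, 8)
val (qValues, valueScale) = quantize(values, 4)
```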

Example: Full generation loop with TurboQuant KV cache.
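The shape of such a loop is sketched below. decodeStep and sample are stand-ins for your model's forward pass (which stores K/V into the TurboQuant cache internally, as in the attention bridge above) and your sampling strategy; neither name comes from the library.

```kotlin
// Prefill the prompt, then decode one token at a time; every step's attention
// reads the compressed cache, so memory stays flat relative to FP32 caching.
fun generate(
    promptTokens: List<Int>,
    maxNewTokens: Int,
    decodeStep: (token: Int, position: Int) -> FloatArray,  // one forward step -> logits
    sample: (logits: FloatArray) -> Int
): List<Int> {
    val out = promptTokens.toMutableList()
    // Prefill: push the prompt through so the cache holds its K/V.
    var logits = FloatArray(0)
    for ((pos, tok) in out.withIndex()) logits = decodeStep(tok, pos)
    // Decode: sample, feed the new token back in at the next position.
    repeat(maxNewTokens) {
        val next = sample(logits)
        logits = decodeStep(next, out.size)
        out += next
    }
    return out
}
```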

Example: Create a balanced TurboQuant cache for a LLaMA-style model.