KvCache

annotation class KvCache(val preset: String = "none", val keyBits: Int = 4, val valueBits: Int = 4, val useQjl: Boolean = false, val maxSeqLen: Int = 0, val device: DeviceKind = DeviceKind.AUTO)

Configures TurboQuant KV-cache compression for an attention layer.

Applied to attention layer properties to declare KV-cache compression settings. The runtime uses these annotations to configure the KvCacheStore and CompressedKvAttention for each layer.

Example:

@KvCache(preset = "balanced")
val selfAttention: MultiHeadAttention

@KvCache(keyBits = 8, valueBits = 4)
val crossAttention: MultiHeadAttention

@KvCache(preset = "safe-lowbit", maxSeqLen = 4096)
val longContextAttention: MultiHeadAttention

Properties

val device: DeviceKind = DeviceKind.AUTO

Preferred device for KV cache storage.

val keyBits: Int = 4

Bits per element for key compression (2, 3, 4, or 8). Only used when preset is "none" (custom config).

val maxSeqLen: Int = 0

Maximum sequence length for the KV cache. 0 means use the model's default.
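
The zero sentinel described above can be resolved with a one-line helper. This is only a sketch: `effectiveMaxSeqLen` and the `modelDefault` parameter are hypothetical names, not part of the library, and the runtime may resolve the value differently.

```kotlin
// Hypothetical helper: 0 is the sentinel for "use the model's default".
fun effectiveMaxSeqLen(annotated: Int, modelDefault: Int): Int =
    if (annotated == 0) modelDefault else annotated

fun main() {
    println(effectiveMaxSeqLen(0, 2048))     // prints 2048
    println(effectiveMaxSeqLen(4096, 2048))  // prints 4096
}
```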

val preset: String = "none"

Named preset: "safe-lowbit", "balanced", "experimental-max", or "none". When set to a named preset, keyBits and valueBits are ignored. Default "none" means no TurboQuant compression (dense FP32 cache).
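
The constraints listed on this page (the four preset names, and bit widths of 2, 3, 4, or 8 for a custom config) could be checked up front, roughly as follows. This is a minimal sketch under assumptions: `KvCacheSettings` and `validate` are hypothetical stand-ins rather than library API, and whether the runtime rejects unknown presets or negative `maxSeqLen` values is not stated in the source.

```kotlin
// Hypothetical stand-in for the values carried by @KvCache; not part of the library.
data class KvCacheSettings(
    val preset: String = "none",
    val keyBits: Int = 4,
    val valueBits: Int = 4,
    val maxSeqLen: Int = 0,
)

// Allowed values as listed in the property descriptions on this page.
val KNOWN_PRESETS = setOf("safe-lowbit", "balanced", "experimental-max", "none")
val ALLOWED_BITS = setOf(2, 3, 4, 8)

// Returns human-readable problems; an empty list means the settings look valid.
fun validate(s: KvCacheSettings): List<String> {
    val problems = mutableListOf<String>()
    if (s.preset !in KNOWN_PRESETS) problems += "unknown preset \"${s.preset}\""
    // Bit widths are only consulted for a custom config, i.e. when preset is "none".
    if (s.preset == "none") {
        if (s.keyBits !in ALLOWED_BITS) problems += "keyBits must be 2, 3, 4, or 8 but was ${s.keyBits}"
        if (s.valueBits !in ALLOWED_BITS) problems += "valueBits must be 2, 3, 4, or 8 but was ${s.valueBits}"
    }
    // Assumption: a negative length is never meaningful (0 means "model default").
    if (s.maxSeqLen < 0) problems += "maxSeqLen must be >= 0 but was ${s.maxSeqLen}"
    return problems
}

fun main() {
    println(validate(KvCacheSettings(preset = "balanced")))               // no problems
    println(validate(KvCacheSettings(keyBits = 5)))                       // one problem: keyBits
    println(validate(KvCacheSettings(preset = "balanced", keyBits = 5)))  // no problems: preset wins
}
```

Skipping the bit-width checks when a named preset is set mirrors the rule that keyBits and valueBits are ignored in that case.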

val useQjl: Boolean = false

Whether to use QJL residual for improved inner-product accuracy. Only used when preset is "none" (custom config).

val valueBits: Int = 4

Bits per element for value compression (2, 3, 4, or 8). Only used when preset is "none" (custom config).
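
To get a feel for what the bit widths mean for memory, the raw payload of one layer's cache can be estimated from the tensor shape. The sketch below is back-of-the-envelope arithmetic, not the library's accounting: `kvCacheBytes` is a hypothetical helper, the shape is illustrative, and the estimate ignores quantization metadata (scales, zero points) and any extra storage used by the QJL residual.

```kotlin
// Raw payload bytes for one layer's key and value caches.
// Ignores quantization metadata and any QJL residual storage.
fun kvCacheBytes(seqLen: Long, numHeads: Long, headDim: Long, keyBits: Int, valueBits: Int): Long {
    val elementsPerTensor = seqLen * numHeads * headDim
    return elementsPerTensor * (keyBits + valueBits) / 8
}

fun main() {
    val mib = 1024 * 1024
    // Illustrative shape only; not taken from any particular model.
    val dense = kvCacheBytes(4096, 32, 128, keyBits = 32, valueBits = 32)  // dense FP32 cache
    val mixed = kvCacheBytes(4096, 32, 128, keyBits = 8, valueBits = 4)    // custom config
    println(dense / mib)  // prints 128
    println(mixed / mib)  // prints 24
}
```

For this shape, 8-bit keys with 4-bit values take 12 bits per key/value element pair instead of 64, so the payload shrinks by a factor of about 5.3.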