KvCache

@Target(allowedTargets = [AnnotationTarget.PROPERTY, AnnotationTarget.VALUE_PARAMETER, AnnotationTarget.FIELD])

annotation class KvCache(val preset: String = "none", val keyBits: Int = 4, val valueBits: Int = 4, val useQjl: Boolean = false, val maxSeqLen: Int = 0, val device: DeviceKind = DeviceKind.AUTO)

Configures TurboQuant KV-cache compression for an attention layer.

Apply this annotation to an attention layer property to declare its KV-cache compression settings. The runtime reads these annotations to configure the KvCacheStore and CompressedKvAttention for each layer.

Example:

@KvCache(preset = "balanced")
val selfAttention: MultiHeadAttention

@KvCache(keyBits = 8, valueBits = 4)
val crossAttention: MultiHeadAttention

@KvCache(preset = "safe-lowbit", maxSeqLen = 4096)
val longContextAttention: MultiHeadAttention
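
As a rough illustration of how annotated layers might be discovered, the sketch below declares a stand-in annotation and reads it back through JVM field reflection. Everything named here is hypothetical: `KvCacheDemo` mirrors only three of the parameters above, `DemoBlock` and `describeKvCache` exist only for this example, and the `@field:` use-site target is chosen so that plain Java reflection can see the annotation. The actual runtime may locate annotations in a different way.

```kotlin
// Hypothetical stand-in for KvCache so the example compiles on its own.
@Target(AnnotationTarget.PROPERTY, AnnotationTarget.VALUE_PARAMETER, AnnotationTarget.FIELD)
annotation class KvCacheDemo(
    val preset: String = "none",
    val keyBits: Int = 4,
    val valueBits: Int = 4,
)

// Hypothetical layer holder; Any stands in for MultiHeadAttention.
class DemoBlock {
    @field:KvCacheDemo(preset = "balanced")
    val selfAttention: Any = Any()

    @field:KvCacheDemo(keyBits = 8, valueBits = 4)
    val crossAttention: Any = Any()
}

// Collects a short description for every annotated backing field (JVM only).
fun describeKvCache(owner: Class<*>): Map<String, String> =
    owner.declaredFields
        .mapNotNull { field ->
            field.getAnnotation(KvCacheDemo::class.java)?.let { a ->
                field.name to "preset=${a.preset}, keyBits=${a.keyBits}, valueBits=${a.valueBits}"
            }
        }
        .toMap()

fun main() {
    describeKvCache(DemoBlock::class.java).forEach { (name, desc) -> println("$name: $desc") }
}
```

Without `@field:`, Kotlin attaches the annotation to the property itself, which is visible through `kotlin-reflect` rather than through `declaredFields`.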

Properties

device

val device: DeviceKind = DeviceKind.AUTO

Preferred device for KV cache storage.

keyBits

val keyBits: Int = 4

Bits per element for key compression (2, 3, 4, or 8). Only used when preset is "none" (custom config).

maxSeqLen

val maxSeqLen: Int = 0

Maximum sequence length for the KV cache. 0 means use the model's default.
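
The "0 means default" convention can be sketched as a small helper. The function name `effectiveMaxSeqLen` and the `modelDefault` parameter are hypothetical, and rejecting negative values is an assumption; the documentation above only defines the meaning of 0.

```kotlin
// Hypothetical helper: picks the cache length the runtime would allocate.
fun effectiveMaxSeqLen(annotated: Int, modelDefault: Int): Int {
    // Assumption: negative lengths are treated as a configuration error.
    require(annotated >= 0) { "maxSeqLen must be >= 0, got $annotated" }
    // 0 is the sentinel for "use the model's default".
    return if (annotated == 0) modelDefault else annotated
}

fun main() {
    println(effectiveMaxSeqLen(annotated = 0, modelDefault = 2048))
    println(effectiveMaxSeqLen(annotated = 4096, modelDefault = 2048))
}
```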

preset

val preset: String = "none"

Named preset: one of "safe-lowbit", "balanced", "experimental-max", or "none". When a preset other than "none" is set, keyBits, valueBits, and useQjl are ignored. The default, "none", means no TurboQuant compression (dense FP32 cache).
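
The precedence between a named preset and the custom fields can be sketched as below. `KvCacheMode`, `knownPresets`, and `resolveMode` are hypothetical names, and throwing on an unrecognized preset is an assumption. The sketch only records which fields take effect; it does not model the bit widths behind each preset, and it does not decide how the runtime treats "none" when the custom fields are left at their defaults.

```kotlin
// Hypothetical result type; not part of the documented API.
sealed interface KvCacheMode {
    data class NamedPreset(val name: String) : KvCacheMode
    data class Custom(val keyBits: Int, val valueBits: Int, val useQjl: Boolean) : KvCacheMode
}

// Preset names taken from the documentation above, excluding "none".
val knownPresets = setOf("safe-lowbit", "balanced", "experimental-max")

fun resolveMode(
    preset: String,
    keyBits: Int = 4,
    valueBits: Int = 4,
    useQjl: Boolean = false,
): KvCacheMode = when (preset) {
    // "none": the custom fields are the ones that apply.
    "none" -> KvCacheMode.Custom(keyBits, valueBits, useQjl)
    // Named preset: the custom fields are ignored.
    in knownPresets -> KvCacheMode.NamedPreset(preset)
    // Assumption: unknown names are rejected rather than silently ignored.
    else -> throw IllegalArgumentException("Unknown KV-cache preset: $preset")
}

fun main() {
    println(resolveMode("balanced", keyBits = 8))
    println(resolveMode("none", keyBits = 8, valueBits = 4))
}
```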

useQjl

val useQjl: Boolean = false

Whether to use QJL residual for improved inner-product accuracy. Only used when preset is "none" (custom config).

valueBits

val valueBits: Int = 4

Bits per element for value compression (2, 3, 4, or 8). Only used when preset is "none" (custom config).
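
A check for the supported bit widths of keyBits and valueBits might look like the following. `allowedBits` and `requireSupportedBits` are hypothetical names; whether the runtime validates these values, and how it reports a bad one, is not stated above.

```kotlin
// Bit widths listed in the keyBits and valueBits documentation.
val allowedBits = setOf(2, 3, 4, 8)

// Hypothetical validator: returns the value unchanged or throws IllegalArgumentException.
fun requireSupportedBits(name: String, bits: Int): Int {
    require(bits in allowedBits) { "$name must be one of $allowedBits, got $bits" }
    return bits
}

fun main() {
    println(requireSupportedBits("keyBits", 8))
    println(requireSupportedBits("valueBits", 4))
}
```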