CompressedKvAttention

class CompressedKvAttention(cache: KvCacheStore, dequantStrategy: CompressedKvAttention.DequantStrategy = DequantStrategy.FULL_TILE)

Bridge between KvCacheStore and the SDPA execution path.

This abstraction provides the integration point for compressed K/V in the attention runtime. Instead of modifying the core TensorOps interface (which maps to backend-specific fused kernels), this component sits between the model layer and SDPA:

  1. Write path: compresses K/V on token append via storeKeyValue.

  2. Read path: dequantizes only the required tiles via loadKeysForAttention and loadValuesForAttention.

  3. Extension point: backends can override DequantStrategy to fuse decompression with the attention math.

Usage in a transformer layer:

val bridge = CompressedKvAttention(kvCache)
bridge.storeKeyValue(layer, keyProjection, valueProjection)
val keys = bridge.loadKeysForAttention(layer)
val values = bridge.loadValuesForAttention(layer)
// pass keys, values to scaledDotProductAttention

Constructors

constructor(cache: KvCacheStore, dequantStrategy: CompressedKvAttention.DequantStrategy = DequantStrategy.FULL_TILE)

Types

DequantStrategy

Strategy for dequantizing compressed K/V during attention.
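For example, a caller can spell out the default behavior explicitly (kvCache here is assumed to be an existing KvCacheStore, as in the class-level usage example):

```kotlin
// Explicitly selecting the default full-tile dequantization strategy;
// equivalent to CompressedKvAttention(kvCache).
val bridge = CompressedKvAttention(kvCache, CompressedKvAttention.DequantStrategy.FULL_TILE)
```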

Properties


Whether the cache uses compressed (non-Dense) encoding for keys.


Whether the cache uses compressed (non-Dense) encoding for values.

Functions

fun loadKeysForAttention(layer: Int, startPos: Int = 0, endPos: Int = cache.currentSeqLen): FloatArray

Load cached keys for attention, dequantizing as needed.
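As a sketch of the read path, here is both the full-history default and a windowed load (window is a hypothetical local-attention width, not part of this API):

```kotlin
// Default bounds cover the whole cached sequence, [0, cache.currentSeqLen).
val allKeys = bridge.loadKeysForAttention(layer)

// Only the tiles overlapping [start, end) are dequantized, so a sliding
// window avoids touching older compressed tiles entirely.
val end = kvCache.currentSeqLen
val start = maxOf(0, end - window)
val recentKeys = bridge.loadKeysForAttention(layer, startPos = start, endPos = end)
```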

fun loadKeyStorageRaw(layer: Int, startPos: Int = 0, endPos: Int = cache.currentSeqLen): TensorStorage

Load raw TensorStorage for keys, preserving the cache's native encoding.

fun loadValuesForAttention(layer: Int, startPos: Int = 0, endPos: Int = cache.currentSeqLen): FloatArray

Load cached values for attention, dequantizing as needed.

fun loadValueStorageRaw(layer: Int, startPos: Int = 0, endPos: Int = cache.currentSeqLen): TensorStorage

Load raw TensorStorage for values, preserving the cache's native encoding.
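A sketch of how a backend might use the raw accessors together; fusedCompressedSdpa is an assumed backend kernel entry point, not part of this API:

```kotlin
// Keep the cache's native (possibly compressed) encoding end to end and
// let a fused kernel perform dequantization inside the attention math.
val keyStorage = bridge.loadKeyStorageRaw(layer)
val valueStorage = bridge.loadValueStorageRaw(layer)
val attnOut = fusedCompressedSdpa(query, keyStorage, valueStorage)
```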

fun storeKeyValue(layer: Int, key: FloatArray, value: FloatArray)

Store K/V projections for a new token, compressing as configured.
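A decode-step sketch; projectKey and projectValue stand in for the model's K/V projection matmuls and are not part of this API:

```kotlin
// One K/V pair is appended (and compressed, per the configured encoding)
// per layer for each newly generated token.
for (layer in 0 until numLayers) {
    bridge.storeKeyValue(layer, projectKey(layer, hidden), projectValue(layer, hidden))
}
```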