CompressedKvAttention
Bridge between KvCacheStore and the SDPA execution path.
This abstraction provides the integration point for compressed K/V in the attention runtime. Instead of modifying the core TensorOps interface (which maps to backend-specific fused kernels), this component sits between the model layer and SDPA:
- Write path: compresses K/V on token append via storeKeyValue.
- Read path: dequantizes only the required tiles via loadKeysForAttention and loadValuesForAttention.
- Extension point: backends can override DequantStrategy to fuse decompression with the attention math.
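For illustration, a backend override of DequantStrategy might look like the following. The interface shape, method name, and tile granularity are assumptions; this page does not show the actual signature. Tensor and TensorStorage are the types referenced elsewhere in this doc.

```kotlin
// Sketch only: DequantStrategy's real signature is not shown on this page.
interface DequantStrategy {
    // Expand the compressed tiles covering [tokenRange] into full precision.
    fun dequantizeTiles(raw: TensorStorage, tokenRange: IntRange): Tensor
}

// A backend with a fused dequant + matmul kernel can avoid materializing
// full-precision K/V by deferring expansion into the attention kernel itself.
class FusedDequantStrategy : DequantStrategy {
    override fun dequantizeTiles(raw: TensorStorage, tokenRange: IntRange): Tensor =
        TODO("hand compressed tiles to the backend's fused attention kernel")
}
```

The default strategy would expand every stored tile eagerly; a fused override trades that memory traffic for kernel complexity.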
Usage in a transformer layer:
val bridge = CompressedKvAttention(kvCache)
bridge.storeKeyValue(layer, keyProjection, valueProjection)
val keys = bridge.loadKeysForAttention(layer)
val values = bridge.loadValuesForAttention(layer)
// pass keys, values to scaledDotProductAttention
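The snippet above can be placed in the context of a per-token decode step. In the sketch below, kProj, vProj, qProj, and scaledDotProductAttention stand in for whatever the surrounding runtime provides; they are not part of this API.

```kotlin
// Hedged sketch of one autoregressive decode step driving the bridge.
// kProj/vProj/qProj and scaledDotProductAttention are placeholders.
fun decodeStep(bridge: CompressedKvAttention, layer: Int, hidden: Tensor): Tensor {
    // Write path: compress and append this token's K/V projections.
    bridge.storeKeyValue(layer, kProj(layer, hidden), vProj(layer, hidden))
    // Read path: dequantize only what this attention call needs.
    val keys = bridge.loadKeysForAttention(layer)
    val values = bridge.loadValuesForAttention(layer)
    return scaledDotProductAttention(qProj(layer, hidden), keys, values)
}
```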
Functions
loadKeysForAttention: Load cached keys for attention, dequantizing as needed.
Load raw TensorStorage for keys, preserving the cache's native encoding.
loadValuesForAttention: Load cached values for attention, dequantizing as needed.
Load raw TensorStorage for values, preserving the cache's native encoding.
storeKeyValue: Store K/V projections for a new token, compressing as configured.
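The "compressing as configured" step of the write path could, for example, be a per-tile absmax int8 scheme. The following self-contained sketch illustrates the idea only; the actual encoding used by KvCacheStore is not specified on this page.

```kotlin
import kotlin.math.abs
import kotlin.math.round

// Illustrative absmax int8 tile encoding (an assumption, not the real format).
data class QuantizedTile(val data: ByteArray, val scale: Float)

// Write path: map each float to int8 using the tile's absolute maximum.
fun quantize(tile: FloatArray): QuantizedTile {
    val absMax = tile.maxOf { abs(it) }.coerceAtLeast(1e-8f)
    val scale = absMax / 127f
    val data = ByteArray(tile.size) { i ->
        round(tile[i] / scale).toInt().coerceIn(-127, 127).toByte()
    }
    return QuantizedTile(data, scale)
}

// Read path: expand a tile back to floats when attention needs it.
fun dequantize(q: QuantizedTile): FloatArray =
    FloatArray(q.data.size) { i -> q.data[i] * q.scale }
```

A scheme like this keeps the cache at roughly a quarter of fp32 size per tile, at the cost of one scale per tile and a small rounding error on the read path.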