KvCacheStore

interface KvCacheStore(source)

Dedicated KV-cache storage abstraction for inference.

Unlike generic TensorStorage, a KV cache is append-friendly and role-aware: keys and values may use different encodings and bit budgets. The cache is addressed by (layer, head, position) and supports compressed block storage for quantized formats (Q4_K, Q8_0, TurboQuant, etc.).

Backends and attention kernels interact with the cache through this interface rather than managing raw tensors directly. This allows:

  • Compressed K/V writes on token append

  • Tile-level dequantization on read (only the needed range)

  • Asymmetric K/V policies (e.g., Q8_0 for keys, 4-bit for values)

  • Backend-specific fused dequant+attention paths

Inheritors

Types

Link copied to clipboard
object Companion

Properties

Link copied to clipboard
abstract val currentSeqLen: Int

Current number of tokens stored in the cache.

Link copied to clipboard
abstract val headDim: Int

Dimension per head.

Link copied to clipboard

Encoding used for key storage.

Link copied to clipboard
abstract val maxSeqLen: Int

Maximum sequence length this cache can hold.

Link copied to clipboard
abstract val numHeads: Int

Number of KV heads per layer.

Link copied to clipboard
abstract val numLayers: Int

Number of transformer layers in this cache.

Link copied to clipboard
abstract val placement: Placement

Placement intent for the cache buffers.

Link copied to clipboard

Encoding used for value storage.

Functions

Link copied to clipboard
abstract fun appendToken(layer: Int, key: FloatArray, value: FloatArray)

Append a single token's K/V projections for one layer.

Link copied to clipboard
abstract fun clear()

Reset the cache, clearing all stored tokens.

Link copied to clipboard
abstract fun evict(fromPos: Int)

Evict all cached tokens from position fromPos onward.

Link copied to clipboard

Memory report for the entire cache.

Link copied to clipboard
abstract fun readKeys(layer: Int, startPos: Int = 0, endPos: Int = currentSeqLen): FloatArray

Read cached keys for a layer, dequantized to float.

Link copied to clipboard
abstract fun readKeyStorage(layer: Int, startPos: Int = 0, endPos: Int = currentSeqLen): TensorStorage

Read raw (possibly compressed) key storage for a layer as TensorStorage.

Link copied to clipboard
abstract fun readValues(layer: Int, startPos: Int = 0, endPos: Int = currentSeqLen): FloatArray

Read cached values for a layer, dequantized to float.

Link copied to clipboard
abstract fun readValueStorage(layer: Int, startPos: Int = 0, endPos: Int = currentSeqLen): TensorStorage

Read raw (possibly compressed) value storage for a layer as TensorStorage.