Q4_KBlockTensorData
Implementation of Q4_KTensorData backed by a packed byte array.
Memory layout per block (144 bytes):
bytes 0..1: f16 d (little-endian)
bytes 2..3: f16 dMin (little-endian)
bytes 4..15: packed 12-bit scale/min indices (12 bytes)
bytes 16..143: 4-bit quantized codes (128 bytes, 2 codes per byte, i.e. 256 codes per block)
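The byte offsets above can be turned into a few accessors. The sketch below is hypothetical: the constant names and helpers (`readU16LE`, `halfToFloat`, `readD`, `readDMin`, `readCode`) are illustrative and not part of this class's API. It decodes the little-endian f16 fields by hand. It also ASSUMES that element 2k is stored in the low nibble and element 2k+1 in the high nibble of code byte k. The documentation above only says "2 codes per byte", and GGML's own Q4_K uses a different nibble order, so check the real implementation before relying on this.

```kotlin
import kotlin.math.pow

// Offsets within one 144-byte block, taken from the layout table above.
const val BLOCK_BYTES = 144
const val D_OFFSET = 0         // f16 d, little-endian
const val DMIN_OFFSET = 2      // f16 dMin, little-endian
const val SCALES_OFFSET = 4    // 12 bytes of packed scale/min indices
const val CODES_OFFSET = 16    // 128 bytes of 4-bit codes
const val ELEMENTS_PER_BLOCK = 256 // 128 bytes x 2 codes per byte

/** Reads an unsigned little-endian 16-bit value starting at [offset]. */
fun readU16LE(data: ByteArray, offset: Int): Int =
    (data[offset].toInt() and 0xFF) or ((data[offset + 1].toInt() and 0xFF) shl 8)

/** Converts IEEE 754 binary16 bits to a Float (zero, subnormals, inf and NaN included). */
fun halfToFloat(bits: Int): Float {
    val sign = if ((bits and 0x8000) != 0) -1.0f else 1.0f
    val exp = (bits shr 10) and 0x1F
    val mant = bits and 0x3FF
    return when (exp) {
        0 -> sign * mant * 2.0f.pow(-24)                                    // zero or subnormal
        0x1F -> if (mant == 0) sign * Float.POSITIVE_INFINITY else Float.NaN // inf or NaN
        else -> sign * (1.0f + mant / 1024.0f) * 2.0f.pow(exp - 15)          // normal
    }
}

/** Hypothetical accessors for the two f16 fields of block [block]. */
fun readD(data: ByteArray, block: Int): Float =
    halfToFloat(readU16LE(data, block * BLOCK_BYTES + D_OFFSET))

fun readDMin(data: ByteArray, block: Int): Float =
    halfToFloat(readU16LE(data, block * BLOCK_BYTES + DMIN_OFFSET))

/** Returns the 4-bit code of element [i] (0..255) in block [block].
 *  ASSUMPTION: even elements use the low nibble, odd elements the high nibble. */
fun readCode(data: ByteArray, block: Int, i: Int): Int {
    require(i in 0 until ELEMENTS_PER_BLOCK)
    val byte = data[block * BLOCK_BYTES + CODES_OFFSET + i / 2].toInt() and 0xFF
    return if (i % 2 == 0) byte and 0x0F else byte shr 4
}
```

Decoding f16 manually avoids depending on `java.lang.Float.float16ToFloat`, which only exists on newer JDKs.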
Scale packing: each block has 8 sub-blocks of 32 elements, and each sub-block uses 12 bits (6 for scaleIdx, 6 for minIdx). 8 sub-blocks × 12 bits = 96 bits = 12 bytes.
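The 96-bit arithmetic above can be illustrated with a pack/unpack pair. This is a minimal sketch under a hypothetical bit order: sub-block j occupies bits 12j..12j+11 of the 12-byte field (least-significant bit first), with scaleIdx in the low 6 bits and minIdx in the high 6 bits. The documentation does not state the bit order, and GGML's Q4_K interleaves these bits differently, so `packScaleMin` and `unpackScaleMin` are illustrative names only.

```kotlin
/** Hypothetical packing: sub-block j -> bits [12j, 12j + 12), LSB-first;
 *  low 6 bits = scaleIdx, high 6 bits = minIdx. */
fun packScaleMin(scaleIdx: IntArray, minIdx: IntArray): ByteArray {
    require(scaleIdx.size == 8 && minIdx.size == 8)
    val out = ByteArray(12)
    for (j in 0 until 8) {
        val field = (scaleIdx[j] and 0x3F) or ((minIdx[j] and 0x3F) shl 6)
        for (b in 0 until 12) {
            if (((field shr b) and 1) == 1) {
                val p = 12 * j + b // absolute bit position in the 96-bit field
                out[p / 8] = (out[p / 8].toInt() or (1 shl (p % 8))).toByte()
            }
        }
    }
    return out
}

/** Inverse of [packScaleMin]: returns (scaleIdx, minIdx) for sub-block [j]. */
fun unpackScaleMin(scales: ByteArray, j: Int): Pair<Int, Int> {
    require(scales.size == 12 && j in 0 until 8)
    val bitPos = 12 * j
    val byteIdx = bitPos / 8
    val shift = bitPos % 8 // always 0 or 4, so the field spans at most 2 bytes
    val word = (scales[byteIdx].toInt() and 0xFF) or
        ((scales[byteIdx + 1].toInt() and 0xFF) shl 8)
    val field = (word shr shift) and 0xFFF
    return Pair(field and 0x3F, field shr 6)
}
```

Because 12 bits is 1.5 bytes, even-numbered sub-blocks start on a byte boundary and odd-numbered ones start mid-byte, which is why `shift` alternates between 0 and 4.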
Parameters
initialShape
the logical shape of the tensor (in elements, not blocks)
packedData
the raw packed block data
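Since initialShape counts elements while packedData is a sequence of 144-byte blocks, the two are related by the 256 codes each block holds. The hypothetical helper below (`expectedPackedSize` is not part of this class) computes the byte count a caller would need. It ASSUMES the total element count is an exact multiple of 256. The documentation above does not say how a partial final block is handled.

```kotlin
const val Q4K_BLOCK_ELEMENTS = 256 // 128 code bytes x 2 codes per byte
const val Q4K_BLOCK_BYTES = 144    // 2 + 2 + 12 + 128

/** Hypothetical helper: bytes of packed data needed for a tensor with [shape].
 *  ASSUMPTION: the element count is a multiple of 256 (no partial blocks). */
fun expectedPackedSize(shape: IntArray): Int {
    val elements = shape.fold(1L) { acc, dim -> acc * dim }
    require(elements % Q4K_BLOCK_ELEMENTS == 0L) {
        "element count $elements is not a multiple of $Q4K_BLOCK_ELEMENTS"
    }
    return ((elements / Q4K_BLOCK_ELEMENTS) * Q4K_BLOCK_BYTES).toInt()
}
```

For example, a 4 × 256 tensor has 1024 elements, so it occupies 4 blocks, or 576 bytes.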