Package-level declarations

Types


Standard GGUF tensor naming for LLaMA-family models.

data class LlamaLayerWeights<T : DType>(val attnNorm: Tensor<T, Float>, val wq: Tensor<T, Float>, val wk: Tensor<T, Float>, val wv: Tensor<T, Float>, val wo: Tensor<T, Float>, val ffnNorm: Tensor<T, Float>, val ffnGate: Tensor<T, Float>, val ffnDown: Tensor<T, Float>, val ffnUp: Tensor<T, Float>)
data class LlamaModelMetadata(val architecture: String, val embeddingLength: Int, val contextLength: Int, val blockCount: Int, val headCount: Int, val kvHeadCount: Int, val feedForwardLength: Int, val ropeDimensionCount: Int?, val vocabSize: Int)
data class LlamaRuntimeWeights<T : DType>(val metadata: LlamaModelMetadata, val tokenEmbedding: Tensor<T, Float>, val ropeFreqReal: Tensor<T, Float>?, val ropeFreqImag: Tensor<T, Float>?, val layers: List<LlamaLayerWeights<T>>, val outputNorm: Tensor<T, Float>, val outputWeight: Tensor<T, Float>, val quantTypes: Map<String, GGMLQuantizationType> = emptyMap())
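As a sketch of how the loaded structure above might be traversed — `weights` is assumed to come from one of the `load*` functions in this package, and the printed format is illustrative, not part of the API:

```kotlin
// Illustrative only: walk a loaded LlamaRuntimeWeights and report its contents.
fun describe(weights: LlamaRuntimeWeights<FP32>) {
    val md = weights.metadata
    println("${md.architecture}: ${md.blockCount} blocks, dim=${md.embeddingLength}, heads=${md.headCount}/${md.kvHeadCount}")
    weights.layers.forEachIndexed { i, layer ->
        // Each layer carries attention weights (wq/wk/wv/wo), FFN weights
        // (ffnGate/ffnDown/ffnUp), and the two norms (attnNorm/ffnNorm).
        println("layer $i: wq=${layer.wq}, ffnGate=${layer.ffnGate}")
    }
    // quantTypes maps tensor names to their original GGUF quantization, when retained.
    weights.quantTypes.forEach { (name, qt) -> println("$name -> $qt") }
}
```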

Adapter that loads LLaMA weights from GGUF files and emits them in the canonical GGUF tensor naming scheme. Validation covers metadata presence and basic shape consistency for the tensors we materialize.


Converts loader-emitted tensors to a typed structure ready for runtime/module wiring. Enforces basic shape sanity against the metadata to fail early before graph construction.

data class LlamaWeights<T : DType, V>(val metadata: LlamaModelMetadata, val tensors: Map<String, Tensor<T, V>>, val quantTypes: Map<String, GGMLQuantizationType> = emptyMap())

Memory-mapped GGUF loader that provides zero-copy tensor access.


Factory for creating quantized tensor data from raw GGUF bytes.


JVM extensions for QuantizedTensorFactory that produce MemorySegment-backed quantized tensor data for SIMD-friendly access patterns.

Functions

suspend fun loadLlamaRuntimeWeights(ctx: ExecutionContext, sourceProvider: () -> Source, quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES, allowQuantized: Boolean = false): LlamaRuntimeWeights<FP32>

Backward-compatible overload defaulting to FP32.

suspend fun <T : DType> loadLlamaRuntimeWeights(ctx: ExecutionContext, sourceProvider: () -> Source, dtype: KClass<T>, quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES, allowQuantized: Boolean = false): LlamaRuntimeWeights<T>

Convenience loader: reads weights from a GGUF source and maps them into the runtime structure.
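A hedged usage sketch of the typed overload above. The `fileSource` helper and the way `ExecutionContext` is obtained are assumptions for illustration, not part of this API; the parameter names match the declared signature:

```kotlin
suspend fun loadExample(ctx: ExecutionContext) {
    // sourceProvider is a factory so the loader can (re)open the GGUF stream itself.
    val weights: LlamaRuntimeWeights<FP32> = loadLlamaRuntimeWeights(
        ctx = ctx,
        sourceProvider = { fileSource("model.gguf") }, // hypothetical helper returning a Source
        dtype = FP32::class,
        quantPolicy = QuantPolicy.RAW_BYTES, // keep quantized payloads as raw bytes
        allowQuantized = true,               // accept quantized tensors instead of failing
    )
    println(weights.metadata.vocabSize)
}
```

With the defaults (`QuantPolicy.RAW_BYTES`, `allowQuantized = false`), loading a quantized model would presumably fail fast; `loadLlamaRuntimeWeightsDequantized` below is the documented path for forcing FP32.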

suspend fun loadLlamaRuntimeWeightsDequantized(ctx: ExecutionContext, sourceProvider: () -> Source): LlamaRuntimeWeights<FP32>

Backward-compatible overload defaulting to FP32.

suspend fun <T : DType> loadLlamaRuntimeWeightsDequantized(ctx: ExecutionContext, sourceProvider: () -> Source, dtype: KClass<T>): LlamaRuntimeWeights<T>

Convenience helper that forces dequantization to FP32 (where supported) and fails if any unsupported quant types remain.

Backward-compatible overload defaulting to FP32.

Load LLaMA runtime weights using streaming API with dequantization. Suitable for large models >2GB.

suspend fun loadLlamaRuntimeWeightsStreaming(ctx: ExecutionContext, randomAccessProvider: () -> RandomAccessSource, quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES, allowQuantized: Boolean = false): LlamaRuntimeWeights<FP32>

Backward-compatible overload defaulting to FP32.

suspend fun <T : DType> loadLlamaRuntimeWeightsStreaming(ctx: ExecutionContext, randomAccessProvider: () -> RandomAccessSource, dtype: KClass<T>, quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES, allowQuantized: Boolean = false): LlamaRuntimeWeights<T>

Load LLaMA runtime weights using the streaming API. Parses metadata only (~1 MB of memory) and loads tensors on demand. Suitable for models of any size (100+ GB) that exceed Java array limits.
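A sketch of the streaming path for files too large to materialize up front. The `randomAccessFileSource` helper is a hypothetical stand-in for whatever produces a `RandomAccessSource` in the host application:

```kotlin
// Sketch: stream a large GGUF file; only metadata is parsed eagerly,
// tensor payloads are fetched on demand via the RandomAccessSource.
suspend fun loadLarge(ctx: ExecutionContext) {
    val weights = loadLlamaRuntimeWeightsStreaming(
        ctx = ctx,
        randomAccessProvider = { randomAccessFileSource("big-model.gguf") }, // hypothetical helper
        quantPolicy = QuantPolicy.RAW_BYTES,
        allowQuantized = true,
    )
    println(weights.metadata.contextLength)
}
```

The provider is a factory rather than a source instance, which lets the loader control when the underlying file handle is opened and, presumably, reopen it for independent tensor reads.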

fun Tensor<Int8, Byte>.toQ4_0MemSeg(logicalShape: Shape, arena: Arena): Q4MemorySegmentTensorData

Extension: convert raw Int8 tensor to Q4_0 MemorySegment-backed data.


Extension function to convert raw tensor to Q4_KTensorData.

fun Tensor<Int8, Byte>.toQ8_0MemSeg(logicalShape: Shape, arena: Arena): Q8MemorySegmentTensorData

Extension: convert raw Int8 tensor to Q8_0 MemorySegment-backed data.
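A sketch of the Q8_0 conversion above, assuming `Arena` is `java.lang.foreign.Arena` from the Java FFM API (consistent with the "JVM extensions ... MemorySegment-backed" description earlier). `rawTensor` and `shape` are assumed inputs holding the raw GGUF block bytes and the tensor's logical shape:

```kotlin
import java.lang.foreign.Arena

// Sketch: turn a raw Int8 tensor of Q8_0 block bytes into
// MemorySegment-backed data for SIMD-friendly access.
fun toSimdFriendly(rawTensor: Tensor<Int8, Byte>, shape: Shape): Q8MemorySegmentTensorData {
    // A confined arena scopes the off-heap segment to the creating thread;
    // the caller is responsible for keeping it alive while the data is in use.
    val arena = Arena.ofConfined()
    return rawTensor.toQ8_0MemSeg(logicalShape = shape, arena = arena)
}
```

Note the lifetime caveat: closing the arena invalidates every segment allocated from it, so in practice the arena would be owned by whatever owns the converted tensor data.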


Extension function to convert raw tensor to Q8_0TensorData.